AudioBox: Why We Moved to DigitalOcean
We received an amazing writeup today from Claudio Poli, maker of AudioBox, a media platform in the Cloud. He wanted to share his experience so we've decided to post it here on our blog.
For us, developers and system administrators, the 'Cloud' offers many possibilities in the form of different key services. Nowadays if you need a component for your platform, be it a critical one (database, hosting) or a commodity (external queues, free Heroku dynos), there is a cloud service that provides it, and the problem usually boils down to how much it costs.
Most small start-ups simply don't have much money to invest, yet there are always critical decisions about hardware and software services awaiting approval.
I've seen cloud services that charge you quite a big chunk of money per month for something you can easily build in-house, and great services at affordable prices. It's a wild market out there, so after doing our research it was time to test and experiment.
First critical decision: hosting. Where do we put this code? Do we need other services from the same provider? What about the costs? Does it scale easily? Will it be easy to switch or will I end up locked in to a vendor? Decisions, decisions...
AudioBox.fm, our startup, initially adopted Amazon EC2 as its home, because of the flexibility in automated horizontal scaling (with a proper, non-trivial configuration however) and the presence of other niceties, such as Security Groups, Load Balancers, S3 in the same zone and more.
Let me start by saying that our platform is fairly clean and lightweight. Every line of code has been benchmarked and thoroughly tested. Like many others today, the platform is composed of small applications, each with a specific role and often written in a different language from the main one.
Our core app, which provides the API and front-end, is written in Ruby; some key components are written in Node.js and C/Java. While I could spend some (many) words on all the problems you can run into with Ruby (memory and concurrency, to name a few), that is not my main goal. What I'm going to describe is our real-life, direct experience of hosting a website/platform on EC2, like many of you do.
After much effort, the day finally comes: AudioBox.fm is on TechCrunch and all the other major tech blogs. They are saying that the product is worth five minutes of their lives. Suddenly you have people visiting from the USA, China, Japan, Europe and South America. Judging from the analytics and the server logs, it seems the entire planet is looking at the web pages and trying to score an account, which proves to be difficult given the sheer amount of traffic.
Our auto scale policies go wild and spin up another instance. When the instance boots up it pulls the latest version of the code and proceeds to self-deploy. We get notified. It works; the nights we spent testing all this have finally paid off, and it works smoothly. We are happy. We feel safe that we can continue serving the traffic. But wait, another instance fires up. And another one, all the way up to our auto scale limit.
It's quite clear that something is not working as expected. Our auto scale policies were set to trigger alarms when average CPU usage stayed above 80% for 10 minutes. They fired in rapid succession.
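To make the alarm rule concrete, here is a minimal Python sketch of one way to read "average CPU above 80% for 10 minutes": one sample per minute, averaged over a sliding 10-sample window. The class name `CpuAlarm` and its parameters are hypothetical, and this is not how CloudWatch evaluates alarms internally; it only illustrates the idea.

```python
from collections import deque


class CpuAlarm:
    """Hypothetical alarm: fires when the windowed CPU average exceeds a threshold."""

    def __init__(self, threshold=80.0, window=10):
        self.threshold = threshold
        # Keep only the most recent `window` samples (assumed: one per minute).
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent):
        """Record one sample and return True if the alarm is firing."""
        self.samples.append(cpu_percent)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet to judge a full window
        average = sum(self.samples) / len(self.samples)
        return average > self.threshold


alarm = CpuAlarm()
# Ten minutes of sustained 95% CPU: only the tenth sample completes the window.
states = [alarm.observe(95.0) for _ in range(10)]
```

With a rule like this, a scale-down only happens once the windowed average drops again, so instances that never see their measured CPU fall will never be released.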
At this point I'll just mention that those auto scaled instances stayed up for days after the initial wave, but they ultimately kept the site alive, albeit slowly. None of them got terminated by the scale down policy. This was definitely not in the plans. You may ask what type of instances we were running. C1.medium, 32-bit. Underpowered, you say? Well, maybe. But remember it's a web site/service we are talking about; it shouldn't take so much CPU power to process even the most basic JSON calls.
Fast-forward to the next natural step: trying to find a problem within our apps and frameworks of choice. We ran the profilers. We switched the main applications to JRuby for better insight and tooling, trying to find memory leaks or where the most expensive calls were made. We signed up for New Relic, and so on, only to find ourselves back at the starting point: the apps were fine.
And then it became clear: we were succumbing to the infamous "CPU steal time". Does this term ring a bell? Welcome to the Cloud.
That is, since our first months on EC2 we had observed a dramatic drop in the performance allotted to us by the hypervisor running on EC2, and this consistently left our apps in a crippled state, with steal time going beyond 20%. There are different kinds of EC2 instances, but they are quite pricey, even if you decide to pay in advance and use reserved instances. Scale vertically? Been there, done that: the problem was not solved and we got a huge bill at the end of the month, because the larger instances are limited as well and have an odd CPU-to-memory ratio.
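For readers who want to check steal time on their own Linux guests, the sketch below computes it from two snapshots of the aggregate `cpu` line in `/proc/stat`, where the eighth counter is steal. The function names and the sample counter values are illustrative assumptions, not measurements from AudioBox.fm.

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    fields = line.split()
    if not fields or fields[0] != "cpu" or len(fields) < 9:
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    # guest and guest_nice are skipped: they are already counted in user/nice.
    return [int(value) for value in fields[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time stolen by the hypervisor between two snapshots."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (the first line of /proc/stat on Linux)."""
    with open(path) as stat_file:
        return stat_file.readline()


# Made-up snapshots for illustration only.
before = "cpu  100 0 50 800 10 0 5 35 0 0"
after = "cpu  160 0 80 1300 20 0 10 230 0 0"
stolen = steal_percent(before, after)
```

On a live machine you would call `read_cpu_line()` twice, a few seconds apart, and pass both results to `steal_percent`. The `st` column in `top` reports the same quantity.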
We did put the platform under load tests, but the results were sketchy: you may get different results in production and staging depending on where your server is located and which hypervisor it runs on. In any case the results were somewhat uncomfortable, but ultimately something we were able to live with, at least initially, thanks to horizontal scaling.
The problem is that I neither need nor want cluster instances or high-CPU (low-memory) instances to run a website. And you don't have the money for them either, right?
There must be something better. Then we discovered DigitalOcean, which provides inexpensive and powerful servers.
The switch from an EC2 mindset to a VPS can be quite brutal for a big architecture that is dynamic by nature, for several reasons:
- You don't have certain services that you took for granted when you made the initial choice.
- You need to rethink your auto scaling architecture. If you want to stay dynamic, your load balancers must be reconfigured automatically to stay up to date with however many instances you spin up. The DigitalOcean API comes in handy here. If your site is small, you can instead adjust your HAProxy configuration manually.
- You need a custom monitoring solution. You don't have CloudWatch, but you probably need something better anyway.
And more. Basically you're on your own, and for a typical small start-up team I realize it can be tough.
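As a rough illustration of keeping a load balancer in sync with a changing set of servers, the Python sketch below renders an HAProxy `backend` section from a list of server names and addresses. The function name, backend name, and addresses are hypothetical; in practice the list would come from your provider's API or your own inventory, and this is not AudioBox.fm's actual tooling.

```python
def render_backend(name, servers, port=80):
    """Render an HAProxy backend section for the given (name, address) pairs."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, address in servers:
        # 'check' enables HAProxy health checks for this server.
        lines.append(f"    server {server_name} {address}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical inventory; replace with the servers you currently have running.
current_servers = [("web1", "10.0.0.1"), ("web2", "10.0.0.2")]
config = render_backend("audiobox_app", current_servers)
```

Regenerating this section whenever a server is added or removed, then reloading HAProxy, covers the part that an EC2 load balancer used to handle for you.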
But hear this: nothing is impossible, and with some practice you can achieve 100x the throughput you get on EC2. If you are not a system administrator there are plenty of guides that can walk you through the basics of setting up a secure and fast DigitalOcean server.
Now, a single DigitalOcean 4-core/8GB machine can handle all the traffic we get, with overall CPU usage that never hits the 30% mark. For just 80 bucks per month it's a godsend.
We are about to launch the AudioBox.fm Android app, and I bet we'll have to fire up, at most, just one more server. I can finally focus on fleshing out our next new features for enterprise and partners.
Every innovative platform has a unique architecture under the hood, and sometimes the right decisions can be made only after a lot of trial and error, as in our case. But since it's common to have a small team working on a big project, often without a deep budget, you always have to carefully choose the best tools for the job. Some mistakes can end up costing you and your start-up a lot.
For AudioBox.fm, DigitalOcean helped dramatically cut costs, well before the real fun happened, while increasing the overall performance and responsiveness of the platform.