Den Golotyuk, Engineer
We are a young team of 15 people. We are building a tool to detect critical changes in business indicators' behavior. The key feature of our system is that all data are collected in the unsampled form and all metrics are checked for deviations in real time to automatically notify the client on important changes.
Technologically speaking — it's all about big numbers even for small projects. We process 20 billion events a day. Data should be collected from many sources and be constantly checked for anomalies. A single client event can lead to a dozen other events being stored and processed (e.g. different slices and intersections of a certain metric).
We'd like to share our experience on scaling our app to 700 million requests per day in just 6 months on top of DigitalOcean infrastructure.
Our project is still young but is growing very quickly. From the start we needed a special platform that would ensure that architecture wouldn't hold us back later. Everything had to be done quickly with lots of trials and modifications. Dedicated servers were out of the question as they are too slow and complicated.
We knew that we would face scalability issues very soon, so we decided to build a growth-focused system from the start. What it meant for us:
The architecture of dedicated servers does not differ much from cloud systems when it comes to scaling. They both include multiple nodes, redundancy, automatic failover, sharding, and balancing. But relying on high-performance hardware from the start of the project will surely become a problem later. Many system components will need to be radically changed. Utilizing small nodes for the same components right from the start helps to create a scalable architecture. It might be a more complicated way, but it's definitely much faster.
Our expenses could be reduced by 30% if we were to switch to powerful dedicated servers. But in this case, we would have to pay for hardware failures and hire a system administrator (yes, we do not have one right now!) to deal with all this stuff.
Amazon Web Services is a popular solution for startups. Not just for cloud nodes, but for the entire infrastructure. However, we had some specific requirements:
We still use some Amazon services (for example, Route 53), but 95% of our system works with DigitalOcean nodes.
So here comes DigitalOcean.
As the project is growing fast, we are facing several challenges that we'd like to share with DigitalOcean readers.
The core of our system is based on small 1GB x 1-core nodes. Yet we put a lot of thought into choosing an appropriate configuration from the get-go. For a while, we were creating all our nodes using the smallest 0.5GB x 1-core plan. However, the issue was that the bulk of resources were utilized by the operating system and almost nothing was left for such demanding services as MySQL.
Currently, we have settled on using the following configurations:
All DigitalOcean nodes have SWAP disabled. It's very convenient as it ensures that services are working with the highest performance. However, on the second month of growth (when we had a few dozen nodes) we experienced constant problems as some processes were killed by the system. Sometimes it was MySQL:
As we began enabling SWAP, we soon burned our fingers — some nodes started performing 10 times slower. Finally, we disabled SWAP for almost all nodes and decided to scale horizontally. Now the number of nodes is steadily increasing and SWAP is enabled only for a few of them.
SWAP is used only for servers with secondary functions, which do not create any visible application performance issues for users. For example, full-text search nodes can get into SWAP during text re-indexing process, which slows down operation, but is not critical to users.
Resize is a powerful and cool feature. But you should be careful with it. Every time you resize a node, the growth problem is just postponed. Changing server performance only solves today's problem. A scalable solution can only be created by adding new nodes and distributing the load between them. Resize is more of an exception, but sometimes can act as an emergency measure.
Communication between nodes remains our greatest challenge. Needless to say, Private Networking should be used for communication.
Some of our nodes are located in different data centers and communicate over the Internet. Traffic restrictions and security requirements dictate that we use SSL and GZIP data compression for efficiency.
Small nodes based architecture poses another problem when several nodes send requests to a single node. During growth the number of sending nodes is constantly increasing, so is the load on the receiving node.
Currently, we use this solution for frontends and databases. We use Consistent Hashing to determine data nodes, so we've got high load peaks on some DBs:
We solved this by creating intermediate nodes with a simple function of aggregating and proxying requests. This reduces the load on databases nodes:
And here comes the awesome DigitalOcean API. We use the following approach for all our nodes:
Each DigitalOcean node has an external IP address and is available via internet. To ensure communication security between different nodes and clients we use SSL only. Additionally, service access is limited at the hardware level; connections are accepted only from IP addresses in the white list.
At first, we were afraid that node performance would be very difficult to predict, as it is impossible to predict their shared physical environment load. We track Stolen CPU indicator:
And almost always it has acceptable values with rare peaks.
DigitalOcean proved to be really good for this issue. Only 1% of our nodes experienced rebooting from the very start.
Nevertheless, you should be prepared for node rebooting. In any case it's an absolute requirement if you're creating a stable system.
In spite of the opinion that cloud services are only suitable for small projects and startups, we decided to launch our service using DigitalOcean. At the moment we are satisfied with the selected approach, since it allows for quick responses to load growth, even within a two-fold increase over a few hours.
Drop us a line, we'll be happy to hear your comments and share our experience.
Contact our Customer Success team to get answers.