Rick O'Toole, Cofounder and CTO
Rockerbox is an advertising and analytics company specializing in understanding user behavior. Our product depends on understanding the current behavior of our advertisers' audiences and using this information to model and find similar users.
Efficiency is part of our DNA at Rockerbox. Our goal is to make the advertising and marketing efforts of our customers more efficient and effective.
Efficiency doesn't stop at the service we provide our customers. It is also embedded in the technology we build and the processes we design. We have designed a system with the capacity to process up to 120 TB of data on a peak day. Using other web services and cloud providers, we could easily throw away a lot of money just trying to hit response-time mandates set by our partners.
With DigitalOcean, we are able to operate with a cost structure that is roughly 20% of what other independent cloud providers were charging us when things like bandwidth overage charges were tacked on. DigitalOcean's simple pricing has made it easy for us to embrace scale without worrying about minimums, discounts, and penalties.
Data collection is at the core of our platform. To collect this data, we need a constant presence on the internet that is dependable, redundant, and scalable.
We also need to be able to process, store, and retrieve this information for use internally by our media traders and externally as part of our analytics platform. Based on these requirements, we need to strike a balance between three different variables:
To balance these three requirements while maintaining cost efficiency on DigitalOcean, we split Droplets based on whether they need to be network efficient or CPU efficient.
To collect data, we deploy a plethora of boxes with static IP addresses which act as our primary interface to ad exchanges, third-party data sources, and on-page pixels. With DigitalOcean, we are able to easily provision these static boxes without paying more for the right to reserve the IP address.
Different data sources have different network efficiency requirements. We have three different types of statically deployed network boxes:
To maximize processing power and long-term storage, we also run a cluster of boxes where dynamic, changing workloads can be processed. Our statically deployed network boxes pass the data they collect to our cluster, where we prepare, process, and model our data.
For this, we wanted the flexibility to write diverse applications that operate within a set amount of infrastructure while still being able to pull more resources for distributed jobs.
Specifically, we run a Mesos cluster with the Hadoop Distributed File System (HDFS) on top of DigitalOcean. This cluster consists of 50+ VMs (8-core Droplets) and is optimized for disk space, CPU, and memory. These boxes handle our data pipeline, model generation, databases, and end-user applications.
Mesos gives us the flexibility to run:
By using Mesos, we have a few advantages:
Below is an example of how data is received by our network-optimized boxes, passed to our data processing applications within the Mesos cluster, and ultimately written to a data store that end users have access to:
To support the platform we described above, we run approximately 200 Droplets on DigitalOcean.
Looking back at historical data to analyze and test models is an important capability of our infrastructure and tech stack. To enable cost-effective long-term storage, we synchronize a majority of our log-level data stored in HDFS to DigitalOcean's Object Storage product, Spaces. We use the DigitalOcean Spaces NYC3 region, which is located in the same city as the majority of our Droplets.
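One way such a sync could look is sketched below. This is only an illustration, not a description of our actual tooling: it assumes the copy is done with Hadoop's `distcp` over the S3A connector, that credentials are already configured in the cluster's Hadoop settings, and that the Space name and paths (`example-space`, `logs/dt=...`) are hypothetical placeholders.

```python
from typing import List

# Assumption: NYC3 Spaces is reached through its S3-compatible endpoint.
SPACES_ENDPOINT = "nyc3.digitaloceanspaces.com"


def distcp_to_spaces_command(
    hdfs_path: str,
    space: str,
    prefix: str,
    endpoint: str = SPACES_ENDPOINT,
) -> List[str]:
    """Build a `hadoop distcp` command that copies an HDFS path into a Space.

    The S3A credentials (fs.s3a.access.key / fs.s3a.secret.key) are assumed
    to be set in the cluster configuration, so they are not passed here.
    """
    return [
        "hadoop",
        "distcp",
        f"-Dfs.s3a.endpoint={endpoint}",  # point S3A at Spaces instead of AWS
        "-update",                        # only copy files missing or changed at the target
        hdfs_path,
        f"s3a://{space}/{prefix}",
    ]


# Hypothetical example: sync one day's partition of log-level data.
command = distcp_to_spaces_command(
    "hdfs:///logs/dt=2018-01-01",
    "example-space",
    "logs/dt=2018-01-01",
)
print(" ".join(command))
```

Returning the command as a list of arguments keeps it easy to hand to a scheduler or to `subprocess.run` without shell quoting issues.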
When we want to retrieve information stored on Spaces, we simply mount an external HDFS table using the standard Apache Hadoop connector for S3 that points to our Space and execute HQL queries against this data source. These types of queries require a strong match between the partitioning scheme we originally used when writing our logs and our access pattern. This is rare since most data is partitioned based on time, but it allows us access when debugging and also allows us to integrate directly with our "active" data sources. There is no cost to upload data to Spaces, and no cost for bandwidth between a Spaces region and Droplets in the same region, so this setup does not add bandwidth costs.
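As a rough sketch of what mounting such a table might involve, the snippet below generates the S3A settings, a Hive-style external table definition, and a query that filters on the partition column. Everything specific here is an assumption for illustration: the table name, columns, `dt` partition column, text file format, and Space name are hypothetical, and the statements are written for Hive because the queries are described as HQL.

```python
from typing import Dict, List, Tuple


def s3a_settings(access_key: str, secret_key: str) -> Dict[str, str]:
    """Hadoop S3A properties that point the connector at the NYC3 Spaces endpoint."""
    return {
        "fs.s3a.endpoint": "nyc3.digitaloceanspaces.com",
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }


def external_table_ddl(
    table: str,
    columns: List[Tuple[str, str]],
    partition_column: str,
    space: str,
    prefix: str,
) -> str:
    """Build a CREATE EXTERNAL TABLE statement whose data lives in a Space."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({partition_column} STRING)\n"
        "STORED AS TEXTFILE\n"  # assumption: logs are plain text files
        f"LOCATION 's3a://{space}/{prefix}'"
    )


def partition_query(table: str, partition_column: str, value: str) -> str:
    """Query one partition, so only objects under that prefix are read."""
    return f"SELECT COUNT(*) FROM {table} WHERE {partition_column} = '{value}'"


# Hypothetical table over time-partitioned logs (directories like logs/dt=2018-01-01/).
ddl = external_table_ddl(
    "archived_logs",
    [("user_id", "STRING"), ("url", "STRING"), ("event_time", "BIGINT")],
    "dt",
    "example-space",
    "logs/",
)
print(ddl)
print("MSCK REPAIR TABLE archived_logs")  # registers the existing dt=... partitions
print(partition_query("archived_logs", "dt", "2018-01-01"))
```

Filtering on the partition column is what makes the access pattern match the layout: a query on `dt` touches one day's objects, while a query on any other column has to scan every partition in the Space.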
At peak, we are processing over 200k requests per second. This translates to a data pipeline that is capable of processing over 3B requests per day. While other companies have huge fleets of servers to run this type of infrastructure, we're able to accomplish this with 200+ VMs. With DigitalOcean, we have been able to make scaling the technical infrastructure of our business cost-effective and efficient.
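To put these figures side by side, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above and treats 3B as a daily total and 200 as the VM count; the even split across VMs is a simplifying assumption, since in practice only the network boxes receive requests directly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(rate_per_second: int) -> int:
    """Requests handled in a day if a per-second rate were sustained all day."""
    return rate_per_second * SECONDS_PER_DAY


def average_rate(requests_in_day: int) -> float:
    """Average per-second rate implied by a daily request total."""
    return requests_in_day / SECONDS_PER_DAY


peak_rps = 200_000              # peak requests per second
daily_requests = 3_000_000_000  # requests per day
vm_count = 200                  # approximate number of VMs

avg_rps = average_rate(daily_requests)

print(f"Average rate for 3B/day: {avg_rps:,.0f} requests/s")
print(f"Peak-to-average ratio:   {peak_rps / avg_rps:.1f}x")
print(f"Peak rate held all day:  {requests_per_day(peak_rps):,} requests")
print(f"Peak requests/s per VM:  {peak_rps / vm_count:,.0f} (if spread evenly)")
```

Under these assumptions, 3B requests per day averages out to roughly 35k requests per second, so the 200k peak is several times the average load. That gap between peak and average is what the infrastructure has to absorb.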