Over the last few weeks, our object storage solution, Spaces, has experienced multiple incidents resulting in intermittent drops in availability, performance, and functionality. We recognize the impact these outages have on your work and business, and we want to provide more details about the causes and what to expect from DigitalOcean.
As adoption of Spaces has steadily increased, we’ve encountered challenges that only became visible once we reached this scale. The majority of the incidents fall into two main areas:
Customer Impact: Some users may have seen HTTP 503 errors (SC - Service Unavailable or PR - Slow Down) from the Spaces API or their application. Others might have experienced interrupted operations.
| Older version of Ceph being used in NYC3 | Reduced performance of RADOS gateways | Resolved |
| Slow or blocked requests | “Hot spots” in cluster creating lag | Identified |
| Increased latency accessing Spaces in NYC3 from Droplets hosted outside of NYC3 | Overloaded RADOS gateways | Monitoring |
Identified - root cause is known and our engineering team is working to resolve it
Monitoring - a fix has been implemented to resolve the issue
Resolved - the issue has been confirmed as completely resolved
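For the HTTP 503 responses mentioned under Customer Impact, a common client-side pattern is to retry with exponential backoff and jitter rather than resending immediately. The following is a minimal sketch of that general technique, not an official Spaces client: `call_with_backoff`, its parameters, and the delay values are hypothetical, and `send` stands in for whatever function issues your request and returns the HTTP status code.

```python
import random
import time
from typing import Callable

RETRYABLE_STATUS = 503  # Service Unavailable / Slow Down


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,          # hypothetical default
    base_delay: float = 0.5,        # seconds; hypothetical default
    max_delay: float = 8.0,         # seconds; hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-503 status or attempts run out.

    Returns the last status code seen.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status != RETRYABLE_STATUS:
            break
        # Full jitter: wait a random time up to the capped exponential delay,
        # so that many clients do not retry in lockstep.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
        status = send()
    return status
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting. Retrying is appropriate for idempotent requests; an interrupted large transfer may need to be resumed or restarted instead.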
An older version of Ceph being used in our NYC3 datacenter progressively reduced the performance of the Ceph RADOS gateways. As a result, the number of failed requests increased over time. To work around this problem, we began temporary, staggered reboots of the gateway nodes. Although these reboots were relatively quick and the expected customer impact was minimal, some users who were in the middle of a large or lengthy upload or download may have had that transfer interrupted.
Additionally, we saw several cases where storage clusters periodically had slow or blocked requests. Our engineering team investigated the issue and discovered that some workloads were causing “hot spots” in the cluster. The read/write pattern behind the hot spots caused some disks to lag behind in completing requests. Investigating each of the first several cases took several hours. During that time, the overall performance of the cluster was adversely impacted, and the slowdown cascaded into the front-end gateways, which ultimately reduced availability.
We also received reports from some users that traffic to Spaces in our NYC3 datacenter was much slower from external points than it was from Droplets also hosted in NYC3. Our investigation uncovered that our external RADOS gateways were overloaded, causing increased latency for all customers accessing a bucket from a Droplet hosted in a datacenter other than NYC3. We have added more RADOS gateway instances into our Spaces cluster to sustain the increased traffic. Other customers reported getting HTTP 503 - Slow Down errors, even though they were well below the request limits. This was a result of uneven request distribution among our proxy processes. We are currently working to provide more accurate rate limits to all Spaces users.
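To illustrate how uneven request distribution can produce Slow Down errors for a client that is below its overall limit, here is a small sketch. It rests on an assumption that is not stated above: that each proxy process enforces an equal share of the total limit on its own. The function name, the limit of 200, and the request counts are all hypothetical.

```python
from typing import Sequence


def rejected_requests(requests_per_process: Sequence[int], total_limit: int) -> int:
    """Count rejections when each process enforces an equal share of total_limit.

    Assumption for illustration: the limit is split evenly across processes
    and each process only sees its own traffic.
    """
    per_process_cap = total_limit // len(requests_per_process)
    return sum(max(0, n - per_process_cap) for n in requests_per_process)


# 120 requests in one window against a hypothetical limit of 200,
# spread across 4 proxy processes (cap of 50 each).
even = [30, 30, 30, 30]
skewed = [90, 10, 10, 10]

print(rejected_requests(even, 200))    # 0
print(rejected_requests(skewed, 200))  # 40
```

In both cases the client sends 120 requests, well under 200, but in the skewed case one process sees 90 of them and rejects everything above its share.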
Since we upgraded our NYC3 datacenter to a newer version of Ceph, the temporary reboots have ended and performance no longer degrades over time. Additionally, during the month of April, we experienced several incidents related to slow requests. Our investigation uncovered that our Ceph index OSDs were configured suboptimally, causing latencies to increase as more objects were added to a bucket. We have since changed the index OSD configuration and the issue has been completely resolved.
To detect and resolve hot-spot-related lag more quickly, we have added improved tooling and alerting systems. We are also continuing to investigate and invest in performance optimizations for Spaces clusters so we can better handle a variety of workloads.
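As a rough illustration of what hot-spot alerting can look like in principle, the sketch below flags disks whose request latency is far above the cluster median. It is a generic outlier check and not the tooling described above: the function name, the threshold factor, and the sample latencies are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def find_hot_disks(latency_ms: Dict[str, float], factor: float = 3.0) -> List[str]:
    """Return the names of disks whose latency exceeds factor times the median.

    latency_ms maps a disk name to its recent average request latency.
    The factor of 3.0 is an arbitrary illustrative threshold.
    """
    baseline = median(latency_ms.values())
    return sorted(name for name, ms in latency_ms.items() if ms > factor * baseline)


# Hypothetical sample: one disk lagging well behind the rest.
sample = {"osd.0": 4.0, "osd.1": 5.0, "osd.2": 48.0, "osd.3": 6.0}
print(find_hot_disks(sample))  # ['osd.2']
```

Comparing against the median rather than the mean keeps a single very slow disk from raising the baseline and hiding itself.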
To address the slowness in traffic - which could be due to increased inbound traffic or to cluster slowness resulting from slow requests - we have increased the number of gateways and made configuration changes to optimize handling of traffic to and from the Spaces clusters.
Customer Impact: Customers have encountered several problems: errors indicating “something went wrong,” slow load times, failed delete operations, and incorrect object counts.
| Slow Spaces file listing | Slow response on the Spaces page | Monitoring |
| Discrepancy in number of objects | UI not reflecting accurate data | Monitoring |
The backend issues described above also impacted UI functionality. As the number of objects in a bucket increased, the Spaces page on the Control Panel - especially the listing of objects - would slow down. In some cases the customer would receive a generic error indicating something went wrong.
Customers have also reported discrepancies between the number of objects displayed in a bucket and the actual number of objects stored. Our investigation found that failed or retried multipart uploads left behind zombie objects that had to be cleaned up on the backend. Additionally, some customers who attempted to delete a Space from the UI have reported that the UI showed the operation as successful even though the objects were not actually removed. We discovered a race condition between the UI and the backend when a delete command is issued. We are implementing better detection and recovery mechanisms to remedy the problem.
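The leftover parts from failed or retried multipart uploads can be pictured with a small sketch that picks out uploads which were started long ago and never completed. It is an illustration rather than the backend cleanup described above. It assumes upload records shaped like the entries an S3-style "list multipart uploads" call returns (`Key`, `UploadId`, `Initiated`); the function name and the one-day cutoff are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def stale_uploads(
    uploads: List[Dict],
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # hypothetical cutoff
) -> List[Tuple[str, str]]:
    """Return (Key, UploadId) pairs for uploads initiated more than max_age ago."""
    return [
        (u["Key"], u["UploadId"])
        for u in uploads
        if now - u["Initiated"] > max_age
    ]


# Hypothetical records for illustration.
now = datetime(2018, 5, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"Key": "video.mp4", "UploadId": "abc", "Initiated": now - timedelta(days=3)},
    {"Key": "backup.tar", "UploadId": "def", "Initiated": now - timedelta(hours=2)},
]
print(stale_uploads(records, now))  # [('video.mp4', 'abc')]
```

With an S3-compatible client such as boto3, and assuming the endpoint supports these calls, the records could come from `list_multipart_uploads(Bucket=...)` (under the `Uploads` key, which is absent when there are none), and each returned pair could be passed to `abort_multipart_upload(Bucket=..., Key=..., UploadId=...)`.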
We have implemented several optimizations to the Control Panel that, in conjunction with the backend, will load bucket information and run operations faster.
The above issues identified as “resolved” all have a confirmed, permanent fix in place. The issues related to Ceph have effective detection tooling and workarounds in place until we are able to administer a permanent fix. Remediating the issues causing the Spaces outages is a top priority, and our team is already taking steps to prevent, detect, and mitigate the underlying problems. We are also working with the Ceph community to prioritize issues that have been logged with their defect management system.
Over the next 60 days, we are planning to release new features that will help reduce latency for serving static assets to end users - in many cases, dramatically. If you’re interested in signing up for the private beta waitlist for these features, you can do so here.
We understand the severity of these incidents and apologize for the impact and frustration they have caused our customers. The effort to mitigate these issues will continue over the coming weeks and months, and we will keep providing updates here to inform our users of progress towards full resolution.