Last night a user reported that fog.io's integration with our API was not passing the scrub flag. This started a conversation on GitHub about users who were able to see prior data on a newly created virtual server when the scrub flag had not been passed. We wanted to address these concerns to make sure that we are being transparent and to make it clear that customer security remains paramount. At no time was customer data "leaked" between accounts. Prior data would be recoverable only if a user did not scrub their volume when destroying their server; in this instance the data would be recoverable and should be considered not sensitive.
This is an issue that we cleared up earlier in the year by scrubbing the drive. Taking everything into consideration, we made this the default behavior for all destroys. However, as utilization of our cloud went up, we saw that scrubbing was starting to degrade performance and caused many destroys to run for an extended period of time.
We then made the decision to turn this behavior into a separately controllable, user-initiated action called "scrubbing". This was exposed in the control panel as a simple check box on the destroy menu and in the API as scrub_data, a boolean parameter. Given some of the usage patterns we observed with customers during the on-boarding process, whereby many customers rapidly created and destroyed servers, we mistakenly assumed that not scrubbing should be the default behavior. As a result, we switched the default mode away from scrubbing to improve performance, given that customers would have complete control over this action themselves.
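For API users, the practical takeaway is to state the scrub choice on every destroy call instead of relying on whatever the server-side default happens to be. The following is a minimal sketch of that idea, not our client library: only the scrub_data parameter name comes from this post, while the base URL, the path layout, the build_destroy_url helper, and the "true"/"false" encoding of the boolean are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real endpoint is not given in this post.
API_BASE = "https://api.example.com"


def build_destroy_url(server_id: int, scrub: bool = True) -> str:
    """Build a destroy URL that always states scrub_data explicitly.

    The path layout and the "true"/"false" encoding are assumptions;
    only the scrub_data parameter name is taken from the post.
    """
    query = urlencode({"scrub_data": "true" if scrub else "false"})
    return f"{API_BASE}/servers/{server_id}/destroy?{query}"


# Scrubbing is the helper's default, so forgetting the argument is still safe.
url = build_destroy_url(12345)
print(url)
```

Because the flag is always present in the request, a change in the API's default behavior cannot silently change what the client asks for.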
The second mistake that we made was not notifying our customers who use the API. We should have sent an email to let each of them know of this change in default behavior; that way they could make any necessary code changes and have enough notice to roll out those changes before the new default API behavior went into production.
We were wrong on both counts. We failed to deliver that message explicitly via email, and we should have taken more factors into account when determining the default behavior for a feature, specifically the multitude of customer concerns other than performance.
Our first and immediate update is to ensure that a clean system is provided during creates, regardless of which method was used to initiate a destroy. Engineers are updating the code base right now to ensure that this will be the default behavior, and we will provide another notice when that code is live.
The scrub feature will remain, allowing customers to take an extra level of precaution if they choose to scrub the data after the delete.
As we've grown, we have also seen a need to greatly improve our communication with our customers regarding updates, changes, and features. If anyone has any concerns or questions, we would love to hear from you. Please feel free to email me directly at Moisey@digitalocean.com.
We wanted to provide an update on all of the changes that have been deployed to production, as well as provide more information concerning issues customers have brought up.
We've received feedback from customers that the original tone of the blog post, which remains unedited and can still be viewed, wasn't an admission of a mistake on our part. We want to clarify that this was absolutely a mistake on our part, and we have since deployed several fixes and policy changes, which are detailed below.
We have updated the destroy method to scrub on all destroys, both for web and API requests.
We should never have updated the default behavior to an insecure method. Going forward we will always ensure that the defaults remain sane, and that the customer's concerns and their security are the highest priority.
We employ two different types of filesystems for our KVM virtual servers. The first is QCOW, which operates as a sparse file and allocates blocks in real time. The remaining free space on a virtual server has metadata stored that tracks whether a particular block has been written to previously; if it hasn't been, a read always returns a 0 (nil) value. The issue occurred on the second type, our LVM virtual servers, where the filesystem layout is left up to us to put down. In this case we zero out the volume (which is what the scrub flag is for); however, we made an update after which scrubbing was no longer the default, and that caused this issue. We have now defaulted to destroys with scrub enabled, and also updated how we lay out our LVM volumes, by first putting down a dmzero sparse LVM volume and then layering the LVM FS on top of it. This essentially creates the same sparse file behavior as we have with our QCOW virtual servers; moreover, it ensures that any blocks that were not previously scrubbed cannot be accessed when a new virtual server is created.
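To make the sparse behavior concrete, here is a toy model of a volume that tracks which blocks its current owner has written and reports zeros for everything else. It is only a sketch of the idea described above, not how QCOW or our LVM layout is implemented; the SparseVolume class, the block size, and the sample data are all invented for illustration.

```python
BLOCK_SIZE = 4096  # illustrative block size, not taken from the post


class SparseVolume:
    """Toy model of a volume that hides blocks the current owner never wrote."""

    def __init__(self, backing: bytearray):
        if len(backing) % BLOCK_SIZE != 0:
            raise ValueError("backing store must be a whole number of blocks")
        self._backing = backing   # may still hold a previous owner's data
        self._written = set()     # metadata: indexes of blocks written so far

    def _offset(self, index: int) -> int:
        # Translate a block index into a byte offset, rejecting bad indexes.
        if not 0 <= index < len(self._backing) // BLOCK_SIZE:
            raise IndexError("block index out of range")
        return index * BLOCK_SIZE

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        start = self._offset(index)
        self._backing[start:start + BLOCK_SIZE] = data
        self._written.add(index)

    def read_block(self, index: int) -> bytes:
        start = self._offset(index)
        if index not in self._written:
            return bytes(BLOCK_SIZE)  # never written by this owner: zeros
        return bytes(self._backing[start:start + BLOCK_SIZE])


# Backing storage left unscrubbed by a previous (hypothetical) tenant.
stale = bytearray(b"S" * BLOCK_SIZE * 2)
volume = SparseVolume(stale)

first_read = volume.read_block(0)           # stale bytes are never exposed
volume.write_block(1, b"N" * BLOCK_SIZE)
second_read = volume.read_block(1)          # the owner's own data reads back
```

In this model the stale bytes still exist in the backing store, but no read path can reach them, which is the property the new layout is meant to provide even when a prior destroy did not scrub.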
This code was deployed approximately two hours after we posted this blog post, and it was another huge failure on our part to not immediately provide an update that explicitly stated which changes were made and when they were implemented. We mistakenly left the Resolution section of the original blog post as the only guidance, which was not nearly explicit enough, especially in our failure to mention when these fixes would be rolled out.
Now that we have deployed the necessary fixes, we are going to evaluate new filesystems for creating virtual servers to see if we can find the correct balance between performance and security. We also want to reiterate that as we evaluate alternatives, the balance will be 100% security and only then will we try to minimize any performance impact.
We will post via our blog and Twitter to announce any new features, products, or updates and ensure that we live up to our promise of sane defaults with security always at the forefront.
We want to thank all of our customers for their continued support, and we hope to make 2014 an even better year than 2013. Anyone who has lost trust or has any issues should feel free to send me a direct email - moisey AT DO.