Inside DO: Banishing Your Sysadmin Fears


Jay Gordon, TechOps Engineer at DigitalOcean, shares his theory of the sysadmin mindset.

My life as a sysadmin used to be filled with fear.

I was afraid I would do something terribly wrong with the systems I was running; something that would take a long time and a lot of money to fix.

For example, if I upgraded my Linux kernel, would my services fail to boot? Would it take intense troubleshooting to correct while our production system was out of action for hours? Getting a new server up meant spending hours with my host: waiting for credentials, waiting for restoration from a traditional backup service, and then finally working with engineering staff to bring the application back online.

Even worse, the longer I took to restore a failed system, the more blame and shame my coworkers directed at me. Those are just not cool things to face day after day, simply for trying to get work done. I was afraid, my team experienced internal strife, and my company lost money. Fear is not a motivator, and fear is not a way to run a business.

You can fail. It’s okay.

Fear of failure should not stop you from using Linux. The answer is not to stop failing. It’s to encourage failure.

“I must not fear. Fear is the mind-killer.”

—Frank Herbert

Encourage failure? What kind of crazy are you talking about?

When failure is cheap, it’s not a problem if your new code doesn’t deploy the way you thought.

It’s not a problem if your new custom-compiled version of MySQL doesn’t start the way you thought.

It’s not a problem if your WordPress upgrade failed.

Snapshots Are Your Friends

I want to share a development workflow that makes failure cheap in terms of both time and money.

Let’s say you’re about to test an upgrade to your application.

  1. Snapshot the server running your old, working version of the app
  2. Create a staging environment from the snapshot
  3. Test your new app on the staging server
  4. Deploy to your live production server
  5. Did it fail? Restore quickly to your working snapshot
  6. Continue troubleshooting on the staging server at your leisure

With this workflow, you can fail often and still return to operation quickly without having to rebuild everything. This workflow requires on-demand snapshots, which become possible with a responsive cloud host.
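Here is a minimal sketch of the six steps above in Python. Everything in it is an assumption for illustration: `Server`, `SnapshotStore`, and `upgrade_with_rollback` are hypothetical names, and the snapshot store is an in-memory stand-in, not any real cloud provider's API. On a real host you would swap those calls for your provider's snapshot and restore operations.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Server:
    """Stand-in for a cloud server; `state` models what is on its disk."""
    name: str
    state: Dict[str, str] = field(default_factory=dict)


class SnapshotStore:
    """Hypothetical in-memory snapshot service (not a real provider API)."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, Dict[str, str]] = {}

    def take(self, server: Server, label: str) -> str:
        # Capture a point-in-time copy of the server's state.
        self._snapshots[label] = copy.deepcopy(server.state)
        return label

    def create_server(self, label: str, name: str) -> Server:
        # Build a brand-new server from a stored snapshot.
        return Server(name=name, state=copy.deepcopy(self._snapshots[label]))

    def restore(self, server: Server, label: str) -> None:
        # Roll an existing server back to a stored snapshot.
        server.state = copy.deepcopy(self._snapshots[label])


def upgrade_with_rollback(
    production: Server,
    store: SnapshotStore,
    upgrade: Callable[[Server], None],
    healthy: Callable[[Server], bool],
) -> Tuple[bool, Server]:
    """Return (deployed, staging); staging is kept for troubleshooting."""
    # 1. Snapshot the server running the old, working version.
    label = store.take(production, f"{production.name}-pre-upgrade")
    # 2. Create a staging environment from the snapshot.
    staging = store.create_server(label, "staging")
    # 3. Test the new version on staging; stop here if it is broken.
    upgrade(staging)
    if not healthy(staging):
        return False, staging
    # 4. Deploy to the live production server.
    upgrade(production)
    # 5. Did it fail? Restore production from the working snapshot.
    if not healthy(production):
        store.restore(production, label)
        return False, staging
    # 6. Either way, the caller still has staging to keep working on.
    return True, staging
```

Because the function hands back the staging server even when the deploy is rolled back, you can keep poking at the broken upgrade there while production runs the old version.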

A quick failure/restoration cycle also helps you meet a low Recovery Point Objective (RPO), meaning the restoration point for your data is recent, and a low Recovery Time Objective (RTO), meaning service is restored quickly.
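To make RPO and RTO concrete, here is a small sketch that measures one incident against both targets. The function names, the timestamps, and the 30-minute objectives are all made up for illustration; plug in your own numbers.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    last_snapshot: datetime, failure: datetime, restored: datetime
) -> Tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for a single incident."""
    if not last_snapshot <= failure <= restored:
        raise ValueError("expected last_snapshot <= failure <= restored")
    # Work done after the last snapshot is lost on restore (compare to RPO).
    data_loss_window = failure - last_snapshot
    # Time the service was unavailable (compare to RTO).
    downtime = restored - failure
    return data_loss_window, downtime


def meets_objectives(
    data_loss_window: timedelta, downtime: timedelta,
    rpo: timedelta, rto: timedelta,
) -> bool:
    """True when the incident stayed within both objectives."""
    return data_loss_window <= rpo and downtime <= rto


# Illustrative timestamps only, not real incident data.
last_snapshot = datetime(2024, 1, 1, 14, 0)
failure = datetime(2024, 1, 1, 14, 20)
restored = datetime(2024, 1, 1, 14, 35)

data_loss_window, downtime = recovery_metrics(last_snapshot, failure, restored)
print(data_loss_window, downtime)  # 0:20:00 0:15:00
```

The more often you snapshot, the smaller the data-loss window gets; the faster your host restores, the smaller the downtime.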

A workflow that uses lots of snapshots becomes cost effective in the cloud, where typical server costs are minimal (pennies per hour of service). You can iterate through software troubleshooting without the costs of the hardware and people traditionally needed to set up new staging or restored server environments. With cloud hosting, the only sysadmin you need to get from zero to 88 MPH on your application is you.

Failure is okay, as long as you have a plan. Using tools to help you move, fail, and recover should be part of your planning. If you fail often but cheaply, you'll have the time, the money, and, most importantly, the confidence to fail, learn, and eventually prosper as a developer and sysadmin.

Here are a few resources to learn more about snapshot-level backups.