Question

What's your worst mistake / best lesson learned administering a server?

“Fail fast, learn fast” has become a bit of a mantra for entrepreneurs these days. While they’re not always as forgiving when their website is down, those of us on the technical side learn from our mistakes too.

We’ve all fat fingered a command or two, but what have you learned from it? Share your stories, not just of your worst sysadmin mistakes, but also of the best lessons you’ve learned from them.

I accidentally deleted the “.ssh” directory via a bash script and, as a result, lost access to the server :)


Accidentally dropped a production database instead of a development one. Site naturally came crashing down.

Since then I double/triple check which server I’m connecting to before doing anything destructive.
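
A minimal sketch of that habit in script form (the hostname check and the placeholder destructive step are illustrative, not the exact commands involved):

    #!/usr/bin/env bash
    # Minimal sketch: refuse to run anything destructive until the operator
    # has typed the exact hostname of the machine they are connected to.
    set -euo pipefail

    host="$(hostname)"
    echo "Connected to: ${host}"
    read -r -p "Type the hostname to confirm: " answer

    if [[ "${answer}" != "${host}" ]]; then
        echo "Hostname mismatch - aborting." >&2
        exit 1
    fi

    # ...the destructive command (e.g. dropping the dev database) goes here...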

I was rolling my own self-scaling cloud earlier this year via a nodejs script that spun up X droplets using the Digital Ocean API. In synchronous code it would have been a simple for loop, but this was async and parallel. Still, I had set X to a fairly modest number (5), so I didn’t think much could go awry. On the first test it became clear I wasn’t incrementing my loop correctly, and in the two seconds it took me to realize my mistake and hit Ctrl+C, I’d already created over 200 droplets on Digital Ocean! Luckily, I’d already written a script to mass-delete droplets, so I was able to correct my mistake within a few minutes. Since then I have always implemented “dry run” versions of any script to test the control-flow logic before wiring up the actual commands.
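
For what it’s worth, the dry-run pattern can be as simple as an environment flag that prints what would happen instead of doing it - a rough sketch, where create_droplet is a hypothetical stand-in for the real API call:

    #!/usr/bin/env bash
    # Dry-run sketch: run with DRY_RUN=1 (the default here) to exercise the
    # loop logic without creating anything.
    set -euo pipefail

    DRY_RUN="${DRY_RUN:-1}"
    COUNT=5

    create_droplet() {
        # hypothetical placeholder for the real DigitalOcean API call
        echo "creating droplet $1"
    }

    for ((i = 1; i <= COUNT; i++)); do
        if [[ "${DRY_RUN}" == "1" ]]; then
            echo "[dry run] would create droplet worker-${i}"
        else
            create_droplet "worker-${i}"
        fi
    done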

Typing rm -rf /* instead of rm -rf ./*… Taught me to be far more careful when using sudo and globs.
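
A couple of defensive habits that make that particular typo less likely to be fatal (the paths here are just examples):

    # 1. Make an empty or unset variable abort instead of expanding to "/":
    rm -rf -- "${target_dir:?target_dir is not set}"/*

    # 2. cd into the directory first and only delete if the cd succeeded:
    cd /var/tmp/scratch && rm -rf ./*

    # 3. Preview what the glob actually matches before deleting anything:
    ls -d ./*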

It was 2001, my first job.

I was asked to move one of the servers we had on premises to bigger, brand-new hardware. They could afford to schedule a window to keep it down, so we agreed on Saturday morning.

I was alone at the lab, so I decided to take a shortcut and enjoy the weekend with friends: I took the HD from the old server and one of the disks from the new one, connected both to my PC, and used dd to move the data from the original disk to the new one.

At this point you might know what happened: I ran dd with source and destination swapped.

I spent the weekend restoring all the backups, hoping nobody would notice. My boss walked in on Sunday morning to pick up some stuff he’d left at the office and was like “What are you doing here?”. I tried to make something up, but he smiled and said “I knew you screwed up!”. It turned out they could afford to keep that server down for days.

I don’t work in ops anymore, but since that day I always question my ideas before acting on them.
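
A small ritual that would have caught the swap: put the devices in variables, print them, and only then run dd. A sketch, where the /dev/sdX names are placeholders:

    SRC=/dev/sdX    # old server disk (placeholder)
    DST=/dev/sdY    # new, empty disk (placeholder)

    # Show size, model and serial for both devices before committing.
    lsblk -o NAME,SIZE,MODEL,SERIAL "${SRC}" "${DST}"
    read -r -p "Copy ${SRC} -> ${DST}? Type yes to continue: " ok
    [[ "${ok}" == "yes" ]] && dd if="${SRC}" of="${DST}" bs=4M status=progress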

Over 20 years ago, while we were at lunch, we got an urgent call that a production system was down. When we heard what had happened, we all laughed out loud, because documentation can always be improved.

There was a documented process saying that sometimes on Unix (this was before Linux) you need to kill the parent process of a command that is either frozen or using excessive resources. A junior staff member who was learning to become a sysadmin had followed the procedure and learned a valuable lesson.

We forgot to mention that killing init (that is, PID 1) is not a good idea, because experienced sysadmins automatically know that. An exception that needed to be documented.
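
The missing guard is a one-liner: look up the parent PID first and refuse to touch PID 1. A minimal sketch, where $1 is the PID of the stuck process:

    #!/usr/bin/env bash
    # Kill the parent of a stuck process, but never init (PID 1).
    set -euo pipefail

    ppid="$(ps -o ppid= -p "$1" | tr -d ' ')"

    if [[ -z "${ppid}" || "${ppid}" -le 1 ]]; then
        echo "Refusing: parent is init (or could not be determined)." >&2
        exit 1
    fi

    kill "${ppid}"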

Backups backups backups. It’s always about backups.

Many years ago I got a notification from my boss at the time that our financials ERP system - PeopleSoft - was down. Though we weren’t a 24/7 operation, sometimes the accounting director would come in and do a bit of work on the weekends, and it was a Sunday morning. I checked things out remotely and saw that the database would not mount and showed up in the list as “corrupted.” To date myself, it was a SQL Server 6.5 server. In those days - the late '90s - we had racks of actual physical servers, and this SQL Server was on one of them, running NT 4.0.

I drove the 45 minutes to the office and started working on the issue. The newest backup we had was two days old, from Friday morning - a conventional database dump from SQL Server that I had backed up to another server on the network. We had been using ARCServe for backups, but it was notoriously unreliable and inconsistent. In fact, backups rarely worked with the software even after months of troubleshooting. Draconian software licensing schemes have nothing on the ARCServe of that era. Installing and re-installing the software - which I had done probably close to a dozen times in a matter of months - required me to call CA and get a 20+ character alphanumeric activation key for the core software and each of the dozens of agents. Every re-install was a painful descent into the 5th concentric circle of call center hell.

Numerous reboots and basic recovery attempts got me nowhere fast with the SQL server, so I decided to copy the raw database files over to another system - I knew it was possible to recover from the binary files, and I thought it would be a good idea to have an original copy of the files in case I goofed something up. In those days of slow servers and 10/100 networks this was extremely painful for a somewhat large database. Impatient, I kept cancelling the copy process.

With the one good two-day-old database dump I figured I was golden - though I was still missing a full day of work - but fate turned out to be a cruel mistress that Sunday. I can’t recall exactly how I got to the point where I was attempting to resize the RAID on the server that held the one good database dump - it may have been because I didn’t have enough room for a copy of the raw financials database files - but I ended up toasting the RAID configuration. Unlike most RAID controllers then and now, this HP RAID controller did not store its configuration in several locations across multiple drives, which would have made configuration recovery easy. If you deleted it on this particular controller, that was it - you were done and had to start over.

I was faced with a long day and night of trying to recover the one decent backup I had - the next closest backup was 4 days old. It never happened. I’ll never forget that sinking feeling when the corporate office employees started coming in early Monday morning and I had been there all night. I had to brief the director of accounting, then my boss - the IT director - on what had happened. My boss told me to go home, get some sleep, and come in Tuesday. They ended up hiring an outside consultant - a PeopleSoft SQL Server expert - who ultimately arrived at the same conclusion, and we had to roll back to the 4-day-old backup. The company estimated it lost $25,000 in overtime wages re-entering accounting data and on the consultant.

The worst part happened about 4 weeks later, when we had the same basic issue again - a corruption of the financials database. I still hadn’t mastered the ARCServe problems and the database dumps were erratic. We “only” lost 2 days of work that time, but it was still pretty ugly. The PeopleSoft programmers hated me. I was able to monitor the Internet-bound traffic, and in those days AIM was not encrypted, so I got to watch the cruel intra-company chatter between IT employees about how much they despised me. Incredibly, the company never fired me - they basically felt I had never had proper training on SQL Server (which I really hadn’t), so I couldn’t be expected to know how to maintain, back up, and recover it properly - but by then the company’s resentment was palpable and I couldn’t take it.

I ended up taking a position at a small startup a few months after the incident.

Lesson: If you’re responsible for backups, make really, really, really sure you’re doing them properly and they’re easily recoverable.
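
In practice that means treating the restore, not the dump, as the deliverable. A rough sketch of the routine (paths, the database name, and the dump tool are placeholders - the story above was SQL Server, but the idea is the same for any DBMS):

    #!/usr/bin/env bash
    # Back up, verify the file is at least readable, and keep a copy off the host.
    set -euo pipefail

    STAMP="$(date +%F)"
    DUMP="/backup/finance-${STAMP}.sql.gz"

    pg_dump finance | gzip > "${DUMP}"     # substitute your DBMS's dump tool
    gunzip -t "${DUMP}"                    # fails loudly if the file is truncated or corrupt
    scp "${DUMP}" backup-host:/srv/backups/

    # Separately, restore the dump into a scratch database on another machine
    # on a schedule and run a sanity query - an untested backup is just a guess.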

I was about to clean up some things on the web server of a big German university in 2005. I typed rm -rf on the command line and was going to copy and paste the name of the directory I wanted to delete - when I got a phone call. I was asked for a PHP snippet, which I sent by email to the guy on the phone, copy-pasting the snippet into the message. After the phone call I saw the rm -rf sitting there and remembered that I had the directory name on the clipboard. I pasted and pressed Enter… but what I pasted wasn’t the directory name - it was the PHP snippet, which started with the comment /**
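
A tiny habit that would have saved this one: never leave a destructive command half-typed in a shell. Park it behind a comment or an echo until the moment you actually mean to run it (the path below is just an example):

    # rm -rf /path/to/some/dir     <- harmless while it starts with '#'
    echo rm -rf /path/to/some/dir  # prints the command instead of running it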

Apart from still being logged into a server instead of my home machine when running shutdown -h now and then leaving the house (people called me instantly to tell me their email wasn’t working - hooray for human monitoring, often faster than my own systems), which I’m sure all of us have done at one point, I do have one horror story from about a year ago:

I was updating the kernel on one of my servers and, due to some kind of weird situation, managed to destroy my GRUB configuration, so the machine didn’t come back up. Luckily it was late at night, and I could resurrect the machine within an hour or so and reboot cleanly without causing much service interruption. But the whole time this production machine was down, I sure was sweating and swearing at myself a lot! I am now much more conscious when working on anything kernel- or boot-related.
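
A few sanity checks before rebooting after touching the bootloader help here (Debian/Ubuntu-style GRUB 2 commands assumed; adjust for your distro):

    sudo grub-mkconfig -o /boot/grub/grub.cfg    # regenerate the GRUB config
    sudo grub-script-check /boot/grub/grub.cfg   # syntax-check it before trusting it
    ls -l /boot/vmlinuz-* /boot/initrd.img-*     # confirm the kernel and initrd actually exist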

Luckily, I haven’t had bad situations on the servers I manage, but on my personal PCs… Anyway, I find it relevant.

I will never forget:

  1. sudo apt-get remove wine*

Well… it was my beginning with Linux, so I didn’t pay attention to which packages it would remove.

Needless to say, it removed most of the system packages. Everything I had spent time configuring and installing on that PC, I had to do again.

I thought it would remove Wine and all Wine-related packages.

Anyway, after that I learned what globbing is - and not to use it on servers. xD
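
apt-get has a built-in safety net for exactly this: simulate mode, which prints the removal plan without touching anything:

    apt-get -s remove wine    # -s / --simulate only shows what would be removed
    # Review the "The following packages will be REMOVED:" list carefully;
    # apt-get can treat an argument like wine* as a regular expression,
    # which is how it ends up reaching far more packages than you expect.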

Worst mistake?

Working with bad hosting providers.

I can break it down to three main problems I have with “bad” hosting:

#1 Working with no shell access
a) I’ve lost files when moving them between two folders: how in the world does cpanel manage to mess up mv /* ../?

#2 Working with no control over resources
a) I’ve wondered for a couple of hours why a migrated website wasn’t working properly: the hard drive was full. Cpanel didn’t complain (actually, instead of scrolling down to the last performed copy, it displays the top row of the log by default, which is so useful compared to shell commands).

b) Having to call customer support to increase the PHP memory limit. Thanks, guys, for the default 64MB. You know, you could have displayed the memory limit somewhere in the UI - like, right under the bandwidth and hard drive usage?

c) You cannot optimize without basic access to configuration files. Some shared hosting providers get it right. Some just don’t. And you have to live with that.

#3 Working without automation
If you’re stuck with #1 or #2, tough luck - you can’t fully automate tasks. Don’t get me started on what could be automated with shell access: migrations, cloning of websites, automatic optimization, automatic backups, and deployment with a single click (a sketch of that last one is below).
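
For contrast, this is roughly what “deployment with a single click” looks like once you do have shell access (the user, host, and paths below are placeholders):

    # One-command deploy: sync the site, then run a post-deploy hook remotely.
    rsync -az --delete ./site/ deploy@example.com:/var/www/example/
    ssh deploy@example.com 'cd /var/www/example && ./post-deploy.sh'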

Any cost they might think they are saving by hiring a cheap hosting provider is offset by the extra hours they have to pay me to work with their broken tools.

Bottom line: Next time I’m looking for a job, I’ve got to do a lookup on who is hosting their company website. Chances are, they use the same hosting for their clients. I should probably automate that and turn it into a website… :)