What's your worst mistake / best lesson learned administering a server?

September 23, 2016

"Fail fast, learn fast" has become a bit of a mantra for entrepreneurs these days. While they're not always as forgiving when their website is down, those of us on the technical side learn from our mistakes too.

We've all fat fingered a command or two, but what have you learned from it? Share your stories, not just of your worst sysadmin mistakes, but also of the best lessons you've learned from them.

12 Answers

Accidentally dropped a production database instead of a development one. Site naturally came crashing down.

Since then I double/triple check which server I'm connecting to before doing anything destructive.
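
One cheap extra guard is to make the destructive step itself check where it is running instead of relying on memory. Just a sketch; the hostname and database name are made up:

#!/usr/bin/env bash
# Refuse to run the destructive command unless we are on the expected host.
set -euo pipefail

expected_host="db-staging"              # hypothetical hostname

if [ "$(hostname)" != "$expected_host" ]; then
    echo "Refusing: this is $(hostname), not $expected_host" >&2
    exit 1
fi

mysql -e "DROP DATABASE dev_app;"       # the scary part only runs on the right box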

I was rolling my own self-scaling cloud earlier this year via a nodejs script that spun up X number of droplets using the Digital Ocean API. In synchronous code it would have been a simple for loop, but this was async and parallel. Still, I had set X to a fairly modest number (5), so I didn't think much could go awry. On the first test it became clear I wasn't incrementing my loop correctly, and in the two seconds it took me to realize my mistake and hit Ctrl+C, I'd already created over 200 droplets on Digital Ocean! Luckily, I'd already written a script to mass delete droplets, so I was able to correct my mistake within a few minutes. Since then I have always implemented "dry run" versions of any script to test the control flow logic before wiring up the actual commands.
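
The dry-run idea is simple enough to show in a few lines of shell (the original script was Node.js, so this is just the pattern, with a hypothetical create-droplet.sh standing in for the real API call):

#!/usr/bin/env bash
# Dry-run pattern: with DRY_RUN=1 (the default) the loop only prints what it
# would do; set DRY_RUN=0 once the control flow looks right.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
COUNT=5

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY RUN: $*"
    else
        "$@"
    fi
}

for i in $(seq 1 "$COUNT"); do
    run ./create-droplet.sh "worker-$i"   # hypothetical helper script
done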

It was my first and worst mistake. A long time ago (2004, I guess) I was a sysadmin consultant, and I was explaining why we have to be careful running commands as the root user. I told the customer:

  • Remember, never run this command:
rm -rf /

The customer asked: "What does that do?" I said: "It deletes all the files." Unfortunately, at the server's prompt I then typed this and hit Enter:

rm -rf /var

Well, I had to reinstall the Linux server and redo all the configuration and permissions. Years later, Coreutils was modified to refuse to run "rm -rf /", but whenever possible, never use root for administration tasks.
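
For what it's worth, two habits that soften this class of mistake (the first relies on GNU rm's default --preserve-root behaviour mentioned above, the second on its -I flag; the paths and user are just examples):

rm -rf /                      # modern GNU rm refuses this unless you add --no-preserve-root
rm -rI /var/tmp/old-backups   # -I asks once before any recursive delete
sudo -u deploy rm -r cache/   # do routine cleanup as an unprivileged user where you can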

I had a mail server running and was implementing spam traps... but I misconfigured them and they were set to delete spam... they deleted every incoming mail for a day before I realized what I had done.

  • I can relate :)

    How did it make me a better sysadmin? Now I always delegate email to someone else; mail servers are too shitty to improvise if you're not an expert.

    • @Xowap Exactly this. Email is hard, and there are already services out there that you can tap that do a great job at it.

      When setting up a web site or application under your own domain, it is likely that you will also want a mail server to handle the domain's incoming and outgoing email. While it is possible to run your own mail server, it is often not the best option for a variety of reasons. This guide will cover many of the reasons that you may not want to run your own mail server, and offer a few alternatives.
    • I'm with you there. I refuse to run production mail servers. I know myself well enough not to.

    • Am I weird if I say I enjoy running my own mailserver? I've been doing it since around 2003 and steadily improved (and migrated, re-did etc.) the setup. It has been running smoothly for years and my customers never complain.

      • Oh, that's not weird, I do respect all kinds of fetishes.

        Trolling aside, it's not weird but requires a lot of time that I can't afford to spend at my scale.

Worst mistake?

Working with bad hosting providers.

I can break it down to three main problems I have with "bad" hosting:

1 Working with no shell access

a) I've lost files when moving them between two folders: how in the world does cPanel manage to mess up mv /* ../?

2 Working with no control over resources

a) I've wondered for a couple of hours why a migrated website wasn't working properly: the hard drive was full and cPanel didn't complain. (Actually, instead of scrolling down to the last performed copy, it displays the top row of the log by default, which is so useful compared to shell commands.)

b) Having to call customer support to increase the PHP memory limit.
Thanks guys for the default 64MB. You know, you could have displayed the memory limit somewhere in the UI, like, right under the bandwidth and hard drive usage?

c) You cannot optimize without basic access to configuration files
Some shared hosting providers get it right. Some just don't. And you have to live with that.

3 Working without automation

If you're stuck working with #1 or #2, tough luck, you can't fully automate tasks.
Don't get me started on what could be automated with shell access. I could do migrations, cloning of websites, automatic optimization, automatic backups and deployment with a single click.
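
To make that concrete: with plain SSH access, cloning a website to a new box is a handful of lines run on the destination server. Hosts, paths and the database name here are hypothetical:

#!/usr/bin/env bash
# Pull a site's files and database from the old host onto this one.
set -euo pipefail

SRC="deploy@old-host.example.com"

rsync -az --delete "$SRC:/var/www/site/" /var/www/site/   # sync the document root
ssh "$SRC" "mysqldump site_db" | mysql site_db            # copy the database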

Any cost they might think they are saving by hiring a cheap hosting provider is offset by the extra hours they have to pay me to work with their broken tools.

Bottom line: Next time I'm looking for a job, I've got to do a lookup on who is hosting their company website. Chances are, they use the same hosting for their clients.
I should probably automate that and turn it into a website... :)
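(The manual version of that lookup is only a couple of commands; example.com stands in for the company's domain:)

ip=$(dig +short example.com | tail -n1)              # resolve the site
whois "$ip" | grep -iE 'orgname|org-name|netname'    # see whose network it lives on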

  • I second "Working with bad hosting providers". It can be a real time-waster.

    Here is one you might have had to deal with: having to scan in your driver's license and send it to them (the hosting provider) just so they will enable SSH.
    Also, having their servers crash in the middle of working on a project that is due. Grrrr!!!

    • If that happened, I'd quit altogether. I'm not going to hand out any of my personal data to some people who think security is an "extra feature".

      If I had the money I'd just host with Digital Ocean, but mortals doodle on scratch paper, and not on a canvas...

      Well, I'm hooking up a home server with SSH to avoid shady development hosting; Dynamic DNS is handled through the DigitalOcean API (thanks guys! rough sketch below). I'm just waiting for the fibre optic connection to finish the setup... And when I have to deploy to production, I'll just use the DigitalOcean API with git to automagically deploy to multiple droplets.

      ... Maybe I'll use a single droplet hosting a gitlab instance to detect when the home servers crash - which would trigger a script to deploy droplets with the code from the last git push.

      And then I wake up...
      To a cpanel login.
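
      The dynamic DNS part, at least, really is just one API call. Rough sketch against the DigitalOcean v2 domain records API; the token, domain and record ID are placeholders, and api.ipify.org is just one way to discover the current public IP:

      TOKEN="your-api-token"
      DOMAIN="example.com"
      RECORD_ID="12345678"                  # numeric ID of the existing A record

      ip=$(curl -s https://api.ipify.org)   # current public IP

      # Point the A record at the new IP
      curl -s -X PUT \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d "{\"data\": \"$ip\"}" \
        "https://api.digitalocean.com/v2/domains/$DOMAIN/records/$RECORD_ID"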

Luckily, I haven't had bad situations on servers I manage, only on my personal PCs... Anyway, I find it relevant.

I will never forget:

  • sudo apt-get remove wine*

Well... it was my beginning with Linux, so I didn't pay attention to the packages it was going to remove.

Needless to say, it removed most of the system packages. Everything I had spent time configuring and installing on that PC, I had to do all over again.

I thought it would remove Wine and all the Wine-related packages.

Anyway, after that I learned what globbing is, and not to use it on servers. xD
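
If you do end up feeding apt-get a pattern, its simulate flag at least shows what would be removed before anything happens:

apt-get -s remove 'wine*'   # -s (simulate) only prints the packages it would remove
# Quoting the pattern also keeps the shell from expanding it against filenames
# in the current directory first, which is a separate way globbing can bite.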

Apart from running "shutdown -h now" while still logged into a server instead of my home machine and then leaving the house (people called me instantly, telling me their email wasn't working; hooray for human monitoring, often faster than my own systems), as I'm sure all of us have done at one point in time, I do have one horror story from about a year ago:

I was updating the kernel on one of my servers and, due to some kind of weird situation, managed to destroy my GRUB configuration, so the machine didn't come back up. Luckily it was late at night and I could resurrect it within an hour or so and reboot cleanly without causing much service interruption. But the whole time this production machine was down, I sure was sweating and swearing at myself a lot! I am now much more conscious when working on anything kernel/boot related.
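
Two small habits came out of those: a guard that makes "shutdown" over SSH confirm the hostname (Debian's molly-guard package does this properly; below is just the idea), and regenerating and eyeballing the GRUB config after any kernel change before rebooting (Debian/Ubuntu-style commands):

# wrapper idea: make a halt ask for the hostname first
confirm-shutdown() {
    read -rp "Really halt $(hostname)? Type the hostname to confirm: " answer
    [ "$answer" = "$(hostname)" ] && sudo shutdown -h now
}

# after a kernel update, before rebooting:
sudo update-grub                     # regenerate /boot/grub/grub.cfg
grep menuentry /boot/grub/grub.cfg   # check the new and the old kernels are both listed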

I was about to clean up some stuff on the web server of a big German university in 2005. I typed rm -rf on the command line and was going to copy & paste the name of the directory I wanted to delete, when I got a phone call. I was asked for a PHP snippet, which I sent by email to the guy on the phone; I copied and pasted the snippet into the email.
After the phone call I saw the rm -rf and remembered that I had the directory name on the clipboard. I pasted and pressed Enter... but what I pasted wasn't the directory name but the PHP snippet, which started with a comment /** ...

Typing rm -rf /* instead of rm -rf ./*... Taught me to be far more careful when using sudo and globs.

Backups backups backups. It's always about backups.

Many years ago I got a notification from my boss at the time that our financials ERP system - PeopleSoft - was down. Though we weren't a 24/7 operation, sometimes the accounting director would come in and do a bit of work on the weekends. It was a Sunday morning. I checked things out remotely and saw that the database would not mount and showed in the list as "corrupted." To date myself, it was a SQL Server 6.5 server. In those days - the late '90s - we had racks of actual physical servers, and this SQL Server was on one of them, running NT 4.0.

I drove the 45 minutes to the office and started working on the issue. The newest backup we had was two days old, from Friday morning - a conventional database dump from SQL Server that I backed up to another server on the network. We had been using ARCServe for backups, but it was notoriously unreliable and inconsistent. In fact backups rarely worked with the software even after months of troubleshooting. Draconian software licensing schemes have nothing on ARCServe of that era. Installing and re-installing the software - which I had done probably close to a dozen times in a matter of months - required me to call CA and get a 20+ character alphanumeric activation key for the core software and each of the dozens of agents. Every re-install was a painful descent into the 5th concentric circle of call center hell.

Numerous reboots and basic recovery attempts got me nowhere fast with the SQL server so I decided to attempt to copy the raw database files over to another system as I knew it was possible to recover from the binary files and I thought it would be a good idea to have an original of the files in case I goofed something up. In those days of slow servers and 10/100 networks this was extremely painful for a somewhat large database. Impatient, I kept cancelling the copy process.

With the one good two-day-old database dump I figured I was golden - though I was still missing a full day of work - but fate turned out to be a cruel mistress that Sunday. I can't recall how exactly I got to the point where I was attempting to resize the RAID on the server that had the one good database dump - it may have been because I didn't have enough room for a copy of the raw financials database files - but I ended up toasting the RAID configuration. Unlike most RAID controllers then and now, the configuration on this HP RAID controller was not stored in several locations across multiple drives, which would have made configuration recovery easy. If you deleted it on this particular controller, that was it: you were done and had to start over.

I was faced with a long day and night of trying to recover the one decent backup I had - the second closest backup was 4 days old. It never happened. I'll never forget that sinking feeling I felt when the corporate office employees started coming in early Monday morning as I had been there all night. I had to debrief the director of accounting then my boss - the IT director - of what had happened. My boss told me to go home and go to sleep and come in Tuesday. They ended up hiring an outside consultant - a Peoplesoft SQL server expert - who ultimately arrived at the same conclusion and we had to roll back to the 4 day-old backup. The company estimated it lost $25,000 in overtime wages re-entering accounting data and on the consultant.

The worst part happened about 4 weeks later when we again had the same basic issue - a corruption of the financials database. I still hadn't mastered the ARCServe problems and the database dumps were erratic. We "only" lost 2 days of work that time but it was still pretty ugly. The PeopleSoft programmers hated me. I was able to monitor the Internet-bound traffic, and in those days AIM was not encrypted, so I got to watch the cruel intra-company chatter between IT employees about how much they despised me. Incredibly, the company never fired me - they basically felt I had never had proper training on SQL Server (which I really hadn't), so I couldn't be expected to know how to maintain, back up, and recover it properly - but by then the company's resentment was palpable and I couldn't take it.

I ended up taking a position at a small startup a few months after the incident.

Lesson: If you're responsible for backups, make really, really, really sure you're doing them properly and they're easily recoverable.
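
What that looks like nowadays, in sketch form; the database name and paths are hypothetical, and the same idea applies to whatever dump tool you use:

#!/usr/bin/env bash
# Dump, then prove the dump can actually be restored into a scratch database.
set -euo pipefail

dump="/backups/app-$(date +%F).sql.gz"

mysqldump app_db | gzip > "$dump"

mysql -e "CREATE DATABASE IF NOT EXISTS app_db_restore_test;"
gunzip -c "$dump" | mysql app_db_restore_test
mysql -e "SELECT COUNT(*) FROM app_db_restore_test.users;"   # sanity-check a known table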

Over 20 years ago, while we were at lunch, we got an urgent call that a production system was down. When we asked what had happened, we all laughed out loud, because documentation can always be improved.

There was a documented process saying that sometimes on Unix (this was before Linux) you need to kill the parent process of a command that is either frozen or using excessive resources. A junior staff member who was learning to become a sysadmin had followed the procedure and learned a valuable lesson.

We forgot to mention that killing init (that is, PID 1) is not a good idea, because experienced sysadmins automatically know that. An exception that needed to be documented.
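
The missing sentence in the doc was basically a guard like this (a sketch, with a made-up PID):

pid=4242                                  # the frozen/runaway process
ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')

if [ "$ppid" -le 1 ]; then
    echo "Parent is init (PID 1) - not killing that." >&2
else
    kill "$ppid"
fi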

Deleting things without having backups. I cringe when I think of how often and how much data I have deleted accidentally, with no duplicates anywhere, desperately scrambling around trying to find folders that MIGHT have the content... or having to use Google Cache to retrieve web pages.

Nowadays, backups and more backups are the order of the day. I back up all websites, databases and files every night... and then to a different location once a week... and then to another location once a month (in folders named for each month, inside a parent folder called YEARLY).

Thank God for rsync.
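
Roughly what that routine looks like in practice; hosts and paths are made up, and the pieces run from cron:

#!/usr/bin/env bash
# Nightly rsync to a backup host, weekly copy to a second location, monthly
# copy into YEARLY/<year>/<month>. Remote directories are assumed to exist.
set -euo pipefail

SRC=(/var/www /var/backups/db)
NIGHTLY="backup@backup-host:/backups/nightly/"
WEEKLY="backup@offsite-host:/backups/weekly/"
MONTHLY="backup@offsite-host:/backups/YEARLY/$(date +%Y/%m)/"

rsync -az --delete "${SRC[@]}" "$NIGHTLY"      # every night

if [ "$(date +%u)" = "7" ]; then               # Sundays
    rsync -az "${SRC[@]}" "$WEEKLY"
fi

if [ "$(date +%d)" = "01" ]; then              # first of the month
    rsync -az "${SRC[@]}" "$MONTHLY"
fi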
