Backups backups backups. It’s always about backups.
Many years ago I got a notification from my boss at the time that our financials ERP system - PeopleSoft - was down. Though we weren’t a 24/7 operation, the accounting director would sometimes come in and do a bit of work on the weekends, and this was a Sunday morning. I checked things out remotely and saw that the database would not mount and showed in the list as “corrupted.” To date myself, it was SQL Server 6.5. In those days - the late ‘90s - we had racks of actual physical servers, and this SQL Server sat on one of them, running NT 4.0.
I drove the 45 minutes to the office and started working on the issue. The newest backup we had was two days old, from Friday morning - a conventional database dump from SQL Server that I had copied to another server on the network. We had been using ARCServe for backups, but it was notoriously unreliable and inconsistent. In fact, backups rarely worked with the software even after months of troubleshooting. Draconian software licensing schemes have nothing on the ARCServe of that era. Installing and re-installing the software - which I had done probably close to a dozen times in a matter of months - required me to call CA and get a 20+ character alphanumeric activation key for the core software and each of the dozens of agents. Every re-install was a painful descent into the fifth concentric circle of call center hell.
Numerous reboots and basic recovery attempts got me nowhere fast with the SQL server, so I decided to copy the raw database files over to another system: I knew it was possible to recover from the binary files, and I thought it would be a good idea to have an original copy of the files in case I goofed something up. In those days of slow servers and 10/100 networks, this was extremely painful for a somewhat large database. Impatient, I kept cancelling the copy process.
With the one good two-day-old database dump I figured I was golden - though I was still missing a full day of work - but fate turned out to be a cruel mistress that Sunday. I can’t recall exactly how I got to the point where I was attempting to resize the RAID array on the server that held the one good database dump - it may have been because I didn’t have enough room for a copy of the raw financials database files - but I ended up toasting the RAID configuration. Unlike most RAID controllers then and now, this particular HP controller did not store its configuration in several locations across multiple drives, which would have made recovering the configuration easy. If you deleted it on this controller, that was it: you were done and had to start over.
I was faced with a long day and night of trying to recover the one decent backup I had - the second-closest backup was four days old. It never happened. I’ll never forget the sinking feeling when the corporate office employees started coming in early Monday morning and I had been there all night. I had to brief the director of accounting, then my boss - the IT director - on what had happened. My boss told me to go home, get some sleep, and come in Tuesday. They ended up hiring an outside consultant - a PeopleSoft SQL Server expert - who ultimately arrived at the same conclusion, and we had to roll back to the four-day-old backup. The company estimated it lost $25,000 between the overtime wages for re-entering accounting data and the consultant’s fees.
The worst part came about four weeks later, when we had the same basic issue again - a corruption of the financials database. I still hadn’t mastered the ARCServe problems and the database dumps were erratic. We “only” lost two days of work that time, but it was still pretty ugly. The PeopleSoft programmers hated me. I was able to monitor the Internet-bound traffic, and in those days AIM was not encrypted, so I got to watch the cruel intra-company chatter between IT employees about how much they despised me. Incredibly, the company never fired me - they basically felt I had never had proper training on SQL Server (which I really hadn’t), so I couldn’t be expected to know how to maintain, back up, and recover it properly - but by then the company’s resentment was palpable and I couldn’t take it.
I ended up taking a position at a small startup a few months after the incident.
Lesson: If you’re responsible for backups, make really, really, really sure you’re doing them properly and they’re easily recoverable.
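A present-day version of that lesson, as a minimal sketch only: take the dump with checksums and verify the backup file before you trust it, and still do scheduled test restores. This assumes a modern SQL Server (not 6.5!), Python with pyodbc, and placeholder names - the “Finance” database, the backup share path, and the connection string are all hypothetical and not from the story above.

    # Minimal sketch, not the original setup: full backup with checksums,
    # then verify the backup file. Connection string, database name, and
    # UNC path are hypothetical placeholders.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=db-host;DATABASE=master;"
        "Trusted_Connection=yes;TrustServerCertificate=yes;"
    )
    DB = "Finance"                                        # hypothetical name
    BACKUP_FILE = r"\\backupshare\sql\Finance_full.bak"   # hypothetical path

    def run(cur, sql):
        # BACKUP/RESTORE emit informational result sets; drain them so the
        # operation actually finishes before we move on.
        cur.execute(sql)
        while cur.nextset():
            pass

    def backup_and_verify():
        # Backups can't run inside a transaction, hence autocommit.
        conn = pyodbc.connect(CONN_STR, autocommit=True)
        try:
            cur = conn.cursor()
            # CHECKSUM validates page checksums while writing the backup;
            # INIT overwrites any earlier backup set in the same file.
            run(cur, f"BACKUP DATABASE [{DB}] TO DISK = N'{BACKUP_FILE}' "
                     f"WITH CHECKSUM, INIT;")
            # Re-read the file and check its checksums without restoring it.
            run(cur, f"RESTORE VERIFYONLY FROM DISK = N'{BACKUP_FILE}' "
                     f"WITH CHECKSUM;")
        finally:
            conn.close()

    if __name__ == "__main__":
        backup_and_verify()
        print("Backup written and passed VERIFYONLY; schedule a real test restore too.")

RESTORE VERIFYONLY only proves the file is readable and its checksums match; an actual periodic test restore, ideally to a separate scratch server, is what proves the backup is recoverable.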
I accidentally deleted the “.ssh” directory via a bash script and as a result I lost access to the server :)