Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰
@@L2002 Because of this? They were open and honest about their screwups which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually and it's how you handle this that marks out the good ones from the bad ones. Also, a company who almost lost a production DB because of failed backups is unlikely to do it again ;-)
Something my first boss taught me (when I broke something big in production in my first few weeks) is that post-mortems are for identifying problems in a system and how to prevent them, not for assigning blame to individuals. This is huge. Making sure to identify why it was even possible for something like this to happen, and how to prevent it in the future, is a great way to handle a post mortem like this. Good on the GitLab team.
Good boss. Bad ones often like it when things are done fast and "efficiently". And when that establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day, a human error occurs. Typically, such a boss will then blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and technically you are.)
You only fire people for their character, not because of the inevitable fuckup. Also, you've basically sunk money into training this dude through that fuckup, so sacking him right after you paid to get him that experience is counterproductive.
I'm glad they didn't fire the engineer. It goes to show the difference in mindset at organizations that treat it as a learning experience (albeit an expensive one). Many corporations would have fired the engineer as soon as the issue was resolved, without hesitation. Thanks to the orgs that care about their team members and are more concerned with lessons learned.
Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.
The rule I apply for backups is that no one should connect to both a backup server and a primary at the same time; two people should be working together. The employee who was logged into both DBs should really have been two physically separated employees
Yes, but in the case that there is only one employee available and he has to connect to both, he should either have different color schemes for the different servers, OR do it all in one shell window and disconnect/reconnect to whichever server he has to edit. That way it is a lot harder to execute commands on the wrong server.
I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info before getting back to the test/dev server window.
@@refuzion1314 Different color schemes look good but don't work during an outage, when you are stressed, exhausted, or anything distracts you. Sounds nice, but the mental load during a crisis is too large to pay attention to that.
I can't tell you enough how easy it is to accidentally overwrite the wrong file. While I was working on something on a test machine with a USB stick plugged in to save the current progress, I saved the script, thought I had saved it to the local directory, and then copied the unmodified local script over the version I had just saved to the USB stick...
This is why problems like this are actually sometimes good. Of course extremely stressful, but they found sooo many issues and fixed them all. Amazing.
The person here is not an amateur; anyone can get brain farts, especially when working an unexpected overnight. You should try it sometime, you'll start seeing ducks and rabbits in the shell.
@@-na-nomad6247 I'm a veteran in the DevOps field. This comedy of errors could never have happened to me, since I follow a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their incompetence in understanding the commands they were running.
There is an awful lot that could be learned from this.
1) You should "soft delete", i.e. use mv to either rename the data (e.g. rename MyData to something like MyData_old or MyData_backup) or just mv it out of the way so you can restore it later if needed (see the sketch further down). Don't just rm -rf it from orbit
2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script and you just run the script, so that the pre-agreed actions are all that gets done. Do not go off piste, do not just SSH into prod boxes and start flinging arbitrary commands around
3) Change Control - as above
4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or - better still - get a buddy to log onto Server A from their end while you get on Server B from yours. Total separation
5) Do not ever just su into root. Use sudo, or some kind of carefully managed solution such as CyberArk to get the root creds when needed
Also for (2), never try to "improve" anything during the actual action. I once prepared a massive Oracle migration that I had timed to take about 3 hours. Preparation was three weeks. As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it's gonna save some time". Yeah, made the whole thing slow down to a crawl, so it ended up taking 6 hours. Boss was furious. So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.
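To make point 1 above concrete, here's a minimal sketch of the "soft delete" idea; the service name and paths are just examples, not GitLab's actual layout:

# stop whatever writes to the data first (service name is just an example)
sudo systemctl stop postgresql
# move the directory aside instead of rm -rf; instant to put back if it was the wrong box
sudo mv /var/lib/postgresql/data /var/lib/postgresql/data.old-$(date +%Y%m%d-%H%M)
# only rm -rf the .old copy days later, once you're sure nothing is missing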
Fucking up catastrophically with Bash commands is a canon event. It is religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive
One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client; I managed to have the project running at a 2nd test environment but that really taught me the importance of backups and telling the rest of staff about a problem ASAP.
Imagine for a moment, that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.
In my first job, I got a 2 am call when, in my first two weeks at the company, I accidentally left a process in prod shut down after maintenance, leading to intensive care patient data not making it into connected systems. Looking back, the entire company was set up super amateurishly, yet they operate in several hospitals in my country.
At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server but they determined it wasn't important enough. The company went belly-up a few years later because of a ransomware attack they couldn't recover from.
When I was still a junior developer at some startup company, I was working on a specific PHP online store. Every time we would upgrade the site, we would first do it on Staging, then copy it over to Production. The whole process was kinda annoying as there was no streamlined upgrade flow yet and no documentation anywhere - it was a relatively new project we took over. I had upgraded it before so I knew what to do, and I just did the thing I always did.

I was close to finishing it up and we had an office meeting coming up soon and lunch afterwards, so I wanted to be done with this before that - so I rushed a bit. And when I was copying files to Production, I overlooked something - I had also copied the staging config file (that contained database access info etc.) to the production location and overwrote the production config file. After the copying had finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I also tried refreshing the production site, just to see if it worked. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all, probably just localhost or something, RIGHT?? However, after re-refreshing it and confirming I had actually broken production, panic set in.

Instead of informing anyone, I quietly moved closer to my computer, completely quiet, and started looking at what was wrong - with 100% focus, I don't think I was ever as focused as I was then - I didn't have time to inform anyone, it would only cause unnecessary delays. I had to restore this site ASAP. I remember sweating... the meeting was starting and I remember colleagues asking me "if I am coming" - and I just blurted "ye ye, just checking some things..." completely "calmly" as I was PANICKING to fix the site as soon as possible. Luckily I found the source of the mistake within a minute and had to find a backup config file - and after recovering the config file, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.

No one actually noticed what I had done - and I just joined the meeting as if nothing had happened - even though I was sweating and breathing quickly to calm myself down, I hid it pretty well. This was a long time ago, and to this day I still remember that panic very well. Now I always make sure I have quick recovery options available at all times in case something goes wrong - and if possible I always automate the upgrade process to minimize human errors
Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "in the zone". It's only ever happened to me twice but I will NEVER forget them.
I did something similar, but with testing on what I thought was a dev server. Had some close calls, but this time I fcked up. Was super high, but was always high, so I doubt that was it. Quickly had to go and undo the changes, but was so shook I had to make a Chrome extension that puts up some graphics and ominous 40k Mechanicus music whenever I go onto a live domain. Haven't made the same mistake since.
You were only lucky because the project had no proper and comprehensive CI/CD pipeline with unit tests. A competent tech company would have fired you over this.
A helpful hack is to set production terminal to red and test terminal to blue or something like that. Just a small helper to avoid human f’ups if you need to run manual commands in sensitive systems.
I use colored bash prompts to differentiate machine roles - my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on red background. It's very hard to miss.
Your editing is phenomenal. What an insane series of events 😂 Glad gitlab was able to get back to running, seeing all that public documentation was refreshing to see since it shows they were being transparent about their continued mistakes and their recovery process.
I once accidentally ran a chmod -R 0777 /var because I'd missed a dot before the slash (in a web project with a /var folder), which (as I've now learned) may make a Unix system totally unresponsive. I can very well understand how it feels, the moment you realize what you have just done. That cost us a few hundred euros and kept 2 technicians busy for an afternoon on the weekend. Lessons learned, today we can laugh about it.
Ya, Unix/Linux will do what you tell it to do without any warnings. Pretty sure you sat there and wondered why the command was taking so long to finish before you realized your mistake. Right then and there it's the "Oh Shit" moment. 😀 Lucky for me, though, I use VMs so I can always revert to previous snapshots.
Why does it make it unresponsive? I accidentally chmod 0777 the entire "/" once and well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors. So it just cost me time. Still, I never figured out why opening up the permissions would lock everything up.
"rm -rf" is one of those commands I have huge respect for cause it reminds me of looking down the barrel of a gun (or any similar example of your choosing): Best case, you do it a) seldom, b) after a lot of strict and practiced checks, and c) if there's no alternative; unfortunately, the worst case is when you _think_ you're in that best case scenario.
I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky and hadn't used sudo in that terminal at the time, so I got caught on a sudo check before it ran anything absolutely hell-inducing. Just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple of commands after that was a "rm -rf ./", and it had already cd'd into root.
@@givenfool6169 Lmao it had never once occurred to me what havoc it could wreak if you accidentally source the bash history, since it had never occurred to me that that's even possible (because why the hell would you?!). But of course it is, what an eye opener!
@@henningerhenningstone691 Yeah, I was trying to source my updated .bashrc, but my tab completion is set up to cycle through anything that starts with whatever's been typed (it even ignores case), so I tabbed and hit enter. Big mistake. I guess this is why the default tab completion requires you to type out the rest of the file name if there are multiple potential completions.
For this reason, all our servers have color-coded prompts. Dev/Testing servers are green. Staging is yellow. Prod is bright red. When you enter a shell, you immediately see if you are on a server that is "safe" to mess around with, or not. The advantage to doing this in addition to naming your server something like "am03pddb", is that you don't have to consciously read anything. Doesn't matter if you accidentally SSH into the wrong server. If you meant to SSH into a "safe" server, then the bright red prompt will alert you that you are on prod. And if you meant to SSH into a prod server, then you better take the time to read which server it actually is.
I agree, except there are only so many colors, so if you're manually controlling a lot of different machines (something that could maybe be avoided depending on what the servers do) I believe it's important to use unique, memorable hostnames. The two servers in this story had hostnames 1 character apart and the same length, unless the names were all changed for the artwork
@@tacokoneko Yeah like imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily misleadingly "confirm" that you're on the right one when you're not.
@@makuru.42 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.
Nice to hear that they didn't fire him. He followed the correct procedure; some of the steps had unknown side effects, like the lag caused by the command, which could have been avoided with clear documentation. Also, when people are tired late at night, mistakes do happen, and anyone can be the victim of that.
When I was just starting at a company, I accidentally deleted all the ticket intervals from the database, causing all the tickets to close immediately and send massive spam to the admins. I was really terrified of the situation and didn't know what to do, and we didn't have any backup either. I apologized as much as I could and didn't make another mistake like this again for years; sometimes mistakes make you work harder and be more careful in life.
I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later, but it felt so bad; I can't imagine what deleting an entire database would feel like
I have had good hands-on experience at my company with SQL databases, but I'd check my query at least 10 times before executing it... we had clients' data going back more than 10 years saved in the database.
@matthias916 Hope you don't work there anymore. You need more experience with SQL and other IT technologies before you're actually allowed to touch it, so these highly preventable errors don't happen. You need to learn how databases work and have a backup/restore system. Not to mention you should be automating queries anyway; that's what pwsh and DevOps are for. Fewer human mistakes. So sad and very amateurish to delete databases without even backing up prior to making changes
I highly doubt it was instantly deleted; probably someone made the decision to delete it (it could just be an account spamming a bunch of mess onto repositories, and that isn't good either).
When in doubt, it's probably 4chan. That low-hanging fruit aside, it's not a good thing if someone can just do that with a bot account. Maybe granting employees special anti-report protection could help until they find a more permanent solution against those trolls
The #1 thing I learned WAY EARLY on in my IT career (three decades): Never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (and on a live db server, that alone will be enough of a mess to clean up.) As for backups, if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in. (The fact they're actively hiding ("lying") about this fiasco should be criminal.)
The voiceover: outstanding, the editing: premium, the humor: drier than the Sahara *inhales* just how I like it. I've never hit the sub so fast, keep em coming man!
I "deleted" on program from me with the cp command (I wanted to copy the config and the main file in a sub directory, but forgot to enter the directory after it, so it wrote the config to the main file) (I could get a older version of the file from the SD card, by manually read the content of that region and find one with it on it, as it doesn't override an save, but takes a new place)
On a home system? Absolutely. In a working environment? Doubtful. Maybe with a small company it would be acceptable, but creating an orphan database that may or may not contain sensitive information with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if that contains financial, medical, or government records.
@@Funnywargamesman you don't create it to keep it around forever, you create it as a failsafe for when you are doing potentially dangerous stuff, like deleting a whole database.
@@AndrewARitz I cannot tell you how many times "temporary" things become permanent on purpose, let alone the times people have said they are going to do something, like deleting a temp database they copied locally because their permissions didn't let them use it remotely, and then proceeded to forget to delete it. This will be especially true with the most sensitive databases, "because it's more important, so we should make a copy first, right?"

Security is everyone's job, and if you do (typically) irresponsible things like copying databases "as a failsafe," chances are you are going to form a habit that means you will do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence you need to remember 50% of people are dumber than you and some of them get REAL dumb. If you set policy to say that it would be allowed, then THEY will do it.

This is exactly why I said that home environments and really tiny companies could be different; there it could/would be fine. Chances are, if you don't know the names of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it, it's my opinion.
A few places I worked at as a Linux admin or engineer, the shell prompts (PS1) were color coded. Green was dev, yellow was QA, and red meant you're in prod. Worked like a charm.
Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.
@@blackbot7113 Yeah, it's a very wise thing to do imo. Currently I work at a bank, and I recommended we make the header in the UI of the colleague and customer portals a different color for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at and got a reply along the lines of "How about we just pay attention to the server and page we're on?" It's crazy, because it's such an easy change to implement and it almost entirely prevents anyone making such silly (yet catastrophic) mistakes. Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/deployed, so any of those settings get wiped out entirely. For it to be permanent, it needs to be added in the Dockerfile.
A long time ago we implemented a policy that absolutely nobody operates the production console alone. There always has to be someone else looking over your shoulder to point out oversights like the one in the video.
This is a good one, but I would add that no one should ever do anything on production by hand; it should all be handled by CI/CD pipelines and go through QA/peer review before any commit goes in
@@KR-tk8fe CTRL+C. On most Unix/Linux based CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (interrupt) to the foreground process (the active program), which usually causes the program to terminate, though it can be programmed to handle it differently. It's basically the "Oh Shit" or "This is taking too long" button.
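For anyone curious what "handle it differently" looks like in practice, here's a tiny bash sketch (the cleanup behaviour is just an example):

#!/usr/bin/env bash
# trap SIGINT so Ctrl+C runs cleanup instead of killing the script outright
cleanup() {
  echo "Caught SIGINT, cleaning up before exit"
  exit 130   # 128 + 2, the conventional exit code for SIGINT
}
trap cleanup INT
echo "Doing a long task, press Ctrl+C to interrupt..."
sleep 600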
Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.
An old best practice that so many people these days seem to forget or never have heard about is that every week, you try to pull a random file from your backup system, whatever that is. (Or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this - so many people think they set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
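A rough sketch of that weekly spot check, assuming a simple tar-based backup at a made-up path (adapt to whatever your backup system actually produces):

#!/usr/bin/env bash
set -euo pipefail
BACKUP=/backups/latest.tar.gz          # assumed location of the newest backup
WORKDIR=$(mktemp -d)
# pick one random regular file out of the archive
FILE=$(tar -tzf "$BACKUP" | grep -v '/$' | shuf -n 1)
# restore just that file and check it isn't empty
tar -xzf "$BACKUP" -C "$WORKDIR" "$FILE"
if [ -s "$WORKDIR/$FILE" ]; then
  echo "OK: restored $FILE"
else
  echo "ALERT: $FILE came back empty, backups may be broken" >&2
fi
rm -rf "$WORKDIR"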
What's far more impressive about this whole situation is how calm the engineers were in handling the situation. That to me is far more valuable than having engineers that are too gun-shy to make prod db changes at 12AM and panic when something goes wrong.
If you are working with multiple shells, VMs, remote sessions or the like - make sure they are color coded based on the machine you are running against! It can be as simple as picking a different color scheme in windows. But it is just too easy to mess up when all the visual difference is a single number, somewhere in the header.
Yep, I came here to say this. For any serious system I connect to, I use different params for my session; in my case I like old-fashioned xterm, something like: alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'" It's very useful to see the green, red, blue etc. colouring and be sure which system you're talking to.
Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema are safe in case someone accidentally deletes everything, and they told me everything is backed up daily. Kinda scared that I don't know how or where this is happening, other than that there's a job for it.
I checked my database backup script a couple days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately. Then went and fixed the issue and made sure it would notify if there was no backup in 6 hours.
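Something like this (the path, threshold and mail hook are all assumptions) can run from cron so a silent backup failure actually gets noticed:

#!/usr/bin/env bash
# alert if no dump newer than 6 hours (360 minutes) exists in /backups
if ! find /backups -name '*.dump' -mmin -360 | grep -q .; then
  echo "No database backup newer than 6 hours" | mail -s "Backup overdue" ops@example.com
fi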
Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious but if you don't test you will experience catastrophic consequences.
Exactly. Just like a dam, if there is a weak-point at the bottom, it all may come crumbling down. There needs to be a lot of redundancy when it comes to backups. Especially when it comes to a big server. An engineer accidentally removing a database should not have that catastrophic of consequences.
Yeah, the general rule is/should exist for having to be ready for stuff like that. If your fuckup is non-recoverable or a massive pain, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.
Even before I started working at one company, an IT specialist there deleted the directories of the new CC-supporting system. This was shortly after its implementation into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically stayed at work, recreating the environment almost from scratch. :D
This reminds me of all the times I have been in the wrong SSH session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is... you start to just ignore them after a while. It's also kinda dangerous when stuff that is manual and potentially damaging becomes fairly routine
All things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but it made many other companies re-examine their fault management. For example, my uni professor told us about this incident and we got to understand the importance of backups and testing
As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) is enough to save you from porking a prod database.
@@Skyline_NTR Afraid not, it was several pay grades above me both in job role and in coding knowledge, and I lost access to the company slack back in december so I can't really ask anyone...
@@ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze a month later so I had no hope of transferring...
Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts. The backup process familiarity, backups not going to S3, Postgres version mismatches, insufficient WALs space, alert email failures, diligence on abuse deletes... These were all things that could have been and should have been caught way before the actual incident.
OMG, we have all been there haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. And then I realized that I had done the exact opposite, and just deleted three complete days of coding - and would now have to recreate them from scratch 😒😭
One of my first jobs in IT was working as a big data admin and this video allows me to re-live the spicy moments of that job but with none of the responsibility attached
Respect for not firing the guy; it was obviously just a small mistake, and it wasn't his fault that the backups didn't work. It shouldn't be possible for 1 command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p
If they fired him they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure the old employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson but a lesson nonetheless.
That’s Unix systems for you. Their open nature makes them super useful for a lot of things but it’s also so easy to break them. Plus that old trick of telling new linux users that sudo rm -rf is a cool easter egg command wouldn’t be the same with more safeties and preventions.
Linux actually can, in certain circumstances, "undo" this kind of wild situation. Having ZFS as the file system will allow you to revert to a previous image of the filesystem. It's like versioning, but for the entire file system. Of course it takes up quite a bit of space, so it's not done that often; software installs are automated "imaging" points, for instance, but you can trigger one manually when you think you're about to do something you're unsure about. (Since the selection of save states is at GRUB, yes, an unbootable system is still recoverable if you still have GRUB.)
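For reference, the manual version of that safety net looks roughly like this (the pool/dataset names are made up):

# take a quick snapshot before doing anything risky
zfs snapshot tank/data@before-maintenance
# list snapshots to confirm it's there
zfs list -t snapshot
# if things go sideways, roll the dataset back to that point
zfs rollback tank/data@before-maintenance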
In my vocational school I had a subject simply called "Databases", and our teacher there once told us a story about how one of his co-workers lost his job. In essence he did everything right: created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script. When the main database failed a year or two later, they tried restoring from this backup only to find that there was none. It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.
I guess that could have been solved by testing the backups. Install the database software on a spare server or just your own workstation, and then restore the backup onto it
@@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.
@@hummel6364 yeah he kind of deserves to be fired...feel like it should be common sense the hdd could fail, no good excuse to not expect that. You should almost never hardcode stuff, not sure why they thought it was okay to hardcode the uuid of a drive that would one day fail.
@@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.
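One way to avoid that exact trap is to not reference the device in the script at all, and instead check the mount point before dumping; a hypothetical sketch (paths and DB name are examples):

#!/usr/bin/env bash
DEST=/mnt/backup    # mounted via /etc/fstab by UUID or LABEL, not hardcoded here
if ! mountpoint -q "$DEST"; then
  echo "Backup destination $DEST is not mounted, aborting" >&2
  exit 1
fi
pg_dump mydb > "$DEST/mydb-$(date +%F).sql" || { echo "pg_dump failed" >&2; exit 1; }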
Yesterday I was added to a support team because we are getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing from this story.
A couple of jobs ago, I had a colleague who managed to do worse than this. I think they were playing about with learning Terraform and managed to delete the entire account. Prod servers, databases, the dev/qa servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap
As a dev for a large company who has been on a number of late night calls, I literally gasped at this. But good on the team to work through the issue, and good on management to keep these guys around
This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! This is very insightful as to what things can take place in these types of environments, as well as what measures can prevent major fails like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂
Bro, let's also not forget the damage that had already been done: the server was down for like 18 hours, and thousands of PRs, comments, issues and projects were all deleted permanently. This should be a bigger deal
I thought it was just a meme back then when I saw this on Twitter. Is this for real?! OMG. Everything aside, big applause to GitLab for not blaming a single person when this happened; such a nice company to work for.
I didn't realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future. I am gonna share this with some of my coworkers
This video is awesome! The step by step analysis of what occurred during the outage coupled with the story telling format helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!
I barely understand anything here, but all I can say is massive thanks to the team who have worked hard, advancing our computer tech to the current state we have!
I deleted the main site from our backend in my first month as a full stack developer. Fortunately I figured out how to rebuild the Apache server and clone the repository, but I definitely worked well past my hours that day and the stress was crazy
I absolutely HATE how database backups and a lot of other common tasks just don't report any progress in a terminal when you run the command. It's agonizing because you don't know if the command is working or hung on something. For example, a MySQL database I work on takes a good hour or so to back up. During that time, we get NOTHING from the terminal output, so we have to monitor separately using IO and CPU tracking tools to make sure the SQL instance is still doing something.

As for rm -rf, I've made it a habit to either take a manual snapshot immediately before running it on any production data, or more often I'll just make a copy of the directory to a temp directory right before, so there is always a copy of the data before I remove it from where it normally resides. It's saved me from stupid decisions more than once... And always, ALWAYS verify backups exist where you expect them and the contents look complete before you make sweeping changes! We like to deploy a backup to a test server before major upgrades, just to make sure it restores as expected. It can take an extra day or so to do, but it's a good verification step that ensures any issues with our backups are caught during regular maintenance, instead of in the middle of a crisis.

And wrong window syndrome... yeah, it sucks. I've restarted or shut down the wrong servers more than once. I'm sure my mild dyslexia doesn't help in that regard...
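One workaround for the silent-dump problem is piping through pv; a sketch, assuming the pv tool is installed, credentials come from ~/.my.cnf, and "mydb" is a made-up schema name (the size estimate is only approximate, so treat the percentage as a hint):

# estimate the data size, then pipe the dump through pv for a rough progress bar
SIZE=$(mysql -N -e "SELECT SUM(data_length+index_length) FROM information_schema.tables WHERE table_schema='mydb'")
mysqldump mydb | pv -s "$SIZE" | gzip > mydb-$(date +%F).sql.gz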
Having a live screenshare with team members watching might seem a little wasteful. But for critical procedures like this, it is well worth the added cost.
The realization of what you're doing before it finishes is so cruel and happens so often. That's why, when you're doing a job, you always do it slowly but correctly
Very interesting and easy to understand for a layman. I'm sure most of us could also learn from the mistake even if we don't deal with databases or code. PS - If you meant to blur/delete the names at 9:50, you missed the "replying to" part.
Thanks Kevin, and thanks GitLab. We still love you, and your recovery effort seemed great (and kindly presented by Mr. Fang here!) and altogether as humane as possible. The recovery stream to calm people down was a great idea, and I bet it helped a lot of people not freak out. I would have been freaking out, and the fact that you guys didn't, but came through methodically, is very inspiring. I hope the lesson is: don't give AI filters baked into databases any Actually_Important responsibilities. I'm paused at "lesons laenrd" and now unpausing...
I’m pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I’ve ever followed and ended up motivating me to learn a ton about databases, cloud practices, devops, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for transparency
One strict rule I always follow when connecting to prod servers via SSH or a DB UI client (pgAdmin) is to always use different background colors: red for prod, green for staging, black for test and local. Plus double-checking every command. You can never be sure enough
I've always been paranoid when working in Prod. Always make it a point to have at least the Ops Lead on a screen-sharing session where I show what I'm doing while requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy ohh boy does it make me feel safer.
It may be slow but look at it this way. You're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.
I have something analogous to the 24 hour rule for shopping when I’m doing anything sketchy in a prod environment: before I hit RETURN and irreversibly commit to an operation, I leave my keyboard, stretch, go for a walk, grab coffee etc. it helps a lot with the tunnel vision. 10 minutes AFK can save you hours of pain later.
To prevent confusion in terminals with nearly identical host names, I recommend changing the PS1 variable so you can clearly see which shell you are in right now. Red color: [FUCKING PROD DB]:~/$ Green color: [FUCKING BACKUP DB]:~/$
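A minimal version of that for ~/.bashrc on each box (the colors and labels are just examples):

# prod box: bold white on a red background, hostname included
PS1='\[\e[1;97;41m\][PROD DB \h]\[\e[0m\] \w\$ '
# backup/staging box: black on green
PS1='\[\e[0;30;42m\][BACKUP DB \h]\[\e[0m\] \w\$ '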
I experienced something similar a couple of years ago; it's the kind of thing you think can only happen to others, but yeah... I had to delete some specific data from the production database, so I wrote the SQL queries and executed them against the testing environment. The dataset in those databases is completely different, and the queries passed without any issue. But when I ran them against production they were taking way too long, and then I realised. I almost had a panic attack. I reported the incident immediately and was mentally prepared to be fired. Fortunately we could retrieve most of the data from a backup and the lost data was not that big of an issue. I still work in the same company :p
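One habit that helps with exactly this (a sketch; the table, condition and connection details are made up): wrap the destructive statement in a transaction so you see the affected row count before anything is final.

psql -h prod-db -U app mydb <<'SQL'
BEGIN;
DELETE FROM tickets WHERE created_at < '2015-01-01';
-- psql prints "DELETE <n>" here; if <n> looks wrong, keep the ROLLBACK
ROLLBACK;   -- switch to COMMIT only after the count matches what you expect
SQL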
Wow, the amount of stuff I learned here is huge. Please make more reviews like these; I subscribed and turned on notifications, please don't disappoint me
The server was: Delete 5GB of user stuff in like 5 minutes? Rough; might cause some desync. You want to pg_basebackup? Sure, just give me a few minutes before I even start. rm -rf important stuff? Clear as day, I'll permanently chunk through > 300GB in that one second before you try to cancel.
And yes... this is exactly the reason why I didn't study programming/engineering in college, and instead opted for graphic design/communication. If I write or design something wrong and it gets published, well, at worst the publication stays published as a reminder of my mistake. In programming, all it takes is one finger mistake, misremembering something, or just a simple distraction, and you can absolutely wipe an entire company's network infrastructure out of existence.
Weird and almost unbelievable reason; not every software-related position gives you the ability to destroy the database or mess up critical things in production, and even if that's the case, there are tons of safety nets there to prevent it, whether human ones or software ones. If you can completely wipe everything with a one-finger mistake, it just means things were being done wrong from the start. Btw, if you extrapolate things like that and imagine it's that easy to mess things up, then also take the extreme example of making a pretty bad blunder about a popular brand in your designs/writing and having it printed in millions of copies and already distributed to the public, costing your company a bunch of money.
@@heroe1486 First of all, both mistakes can cost a company a bunch of money, but only one messes with years' worth of data. Second, you are being too hopeful, maybe because of where you live, but I live in a third world country; such a thing as a good network structure (that would prevent something like that from happening) is not that common here. And third, a visible error on something printed can be obviously seen and edited, but good luck checking lines upon lines of code to find an error. Those are just two very different jobs with very different risks. Maybe you can make your company lose more money with graphic design, but you can't, at any point, wipe their whole data, nuke their systems, and so on, because of a mistake while doing your job. It's like comparing being a carpenter and being a chef: both work with sharp objects, both can hurt themselves pretty badly while working. However, you can get your whole arm cut off way more easily as a carpenter because of a simple mistake, while to burn yourself really badly as a chef you need to be directly doing something really wrong, or be totally distracted, compared to just getting distracted for a moment while working with a big power saw as a carpenter.
Been there, done that, only in my case it was taking down the main network interface on a Solaris YP server used by an entire site of Solaris servers and workstations. The entire site ground to a halt in an instant. I didn't have access to the DC to get local access, either, so I had to make a red-faced confession to my boss for him to make the 2 mile drive to the secure DC.
Another trick is this: Have different color/background profiles in your terminal application for different servers, or at least types of servers. That way, you're more likely to notice that you're typing in the wrong place right now.
Of course I've made a few big fuckups myself in over 30 years as a systems administrator. But this really proves the point that developers should have absolutely _no_ access to production servers. This is what makes me sad about devops: originally I believe it meant having some seasoned sysadmins cooperate closely with the developers. But now it seems to have become "who needs grumpy sysadmins; they just block all progress, hate when stuff changes, and call for formal change requests, documented change procedures, and other boring stuff. Just let the developers run the entire show, they'll figure it out, and new releases and fixes will be applied much faster."

Developers and sysadmins have completely opposite objectives. Developers produce new code, i.e. changes; they want to deploy faster, more often, even automated. Sysadmins want stability and consistency, so things should not change. They see any change as dangerous, a potential disaster.

Oh, and if you haven't tested restoring your backup, you don't _have_ a backup.
Lot of people here preaching a whole lot of "this wouldn't have happened if..." nonsense, so nice to hear someone being honest about their own mistakes. The honest truth is that if something like this has never happened to you in your career, it probably means that you never reached the point of being trusted enough to be in a position to screw everything up. I will say (and, sadly, this is from experience) that no roller coaster on earth can ever simulate the feeling of sinking in your guts that comes when you realise that you've just done a very, very bad thing and that in the next 30 seconds a lot of alarms are about to sound...
@@TheDaern Absolutely right. There's also a lot of people suggesting various "aids" to prevent mishaps. (Myself included.) Such things may be useful. But there is a danger that one day your safe aliases aren't loaded for some reason, or whatever safety net you implemented. The problem is that the commands you use in a "safe" environment, where there is little risk of damage, are the same as in critical environments. Experienced sysadmins might spend a lot of time on systems where the wrong command can be fatal, and as a consequence tread lightly and carefully. Whereas a person who mainly works as a developer may have a more casual approach, thinking "trial-and-error" is a good strategy.

I've seen a very experienced developer log on to a production server (as root), type tcsh (cause he wanted command editing and history, which the - at that time - default ksh 88 didn't offer, at least without a little extra work), grep in a directory of log files, and direct the output to a file. As a result of how tcsh handles globbing, his command got to grep its own output, resulting in an infinite loop and a filled log partition.

Sysadmin work is - imo - worse than brain surgery, because when you do brain surgery, you are keenly aware that the life of the patient is at risk, and there is just no room for errors. For sysadmin work there is no clear difference between working in a forgiving environment and in an environment where errors may affect - even harm or kill - hundreds or thousands of people if your luck is really bad. The only reasonably safe approach imo is to always act as if every keystroke is potentially lethal. The worst aspect of this is that most of the time, a careless approach works fine. I don't know how many times I've watched a senior developer do stuff, typing commands fast on a production server, while I was clenching my butt to avoid soiling my trousers - yet nothing bad happened. But just a little typo could have meant disaster.
You've got this backwards. First, your claims of a developer being more casual are without any basis, seemingly all anecdotal. Personally, I've dealt with a high-strung and extremely serious platform developer who doesn't trust the sysadmins to manage physical volumes in production. I think personality has nothing to do with this. Sysadmins typically might avoid process because of all the red tape that normal change controls/requests cause. Developers in great companies typically have a well-defined (and very short) development cycle for writing, testing, and deploying code. Developers in large and bureaucratic companies will have well-defined processes but too much unnecessary red tape.

There weren't any developers consistently involved in this outage; had they been more involved it would've certainly been avoided. But the devs set up the backups once and never tested them again. That's the largest problem! Why did this happen? I have no idea. The engineers working this at night clearly were tier 2 support engineers with little knowledge of the overall environment! The lack of development in this area (environment configuration/IaC) is why this became an outage! This outage looks like it was entirely caused by the sysadmin job, but they're not the ones responsible. It's most likely the management team, whoever set up the teams for managing the production env.

Take it from a guy who worked in sysadmin! You CAN be a sysadmin + a developer, and probably have to be nowadays. The old title of system administrator brought on by the decades-old industry is dying. It requires more knowledge of the tech stack now and less memorization of Linux commands. Otherwise, if people fail to adapt, anyone will run into an outage just like this one. WHY would you ever take on the responsibility and risk of manually running commands when some clever design and planning can cover most, if not all, of these kinds of situations? The senior dev who SSHes into production and starts hammering away at the keyboard is NOT doing dev work nor sysadmin work; he's being careless.
@@antdok9573 No, I haven't got it backwards. And I don't think I said that you can't be both a developer and a sysadmin, or a devop for that matter. It seems you and I have vastly different experiences as sysadmins. Here is an example from 20 years ago, when the company I worked for was introducing the use of BEA WebLogic Integration. I was actually a developer at that time, but I had a long sysadmin background. When deploying the first pilot test webservice to the test server of the operations department (after having tested it to work fine on the test server in the development department - with almost watertight shutters between the two departments, and certainly NO developer login to any servers in ops), it turned out that it failed, because WLI assumed it could connect freely to the Internet. Not in this company, for sure! The development test servers weren't as strictly set up, so there was no problem for WLI to retrieve an XML DTD from its official URL there. The operations test server, however, was locked down completely and would not open connections to the outside. Now tell me, what would you do to solve this tiny problem?
@@lhpl I would fix the team structure. There are communication issues, regardless of people's occupations. There shouldn't be any distinction between dev & operation teams unless there's some archaic requirement for it (compliance), and even then, you can probably avoid rigid team structures like these. I don't need to know any of the technologies to see there's a leadership issue toward the top of the teams.
Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.
The legendary Onosecond.
@@1996Pinocchio I see that you see tom scott
The stress should really be minimal if you have a backup and restore procedure, it actually works, and you know how it works. Mistakes happen. The problem wasn't the delete command; it was the nonexistent backups and documentation.
@@youngstellarobjects Nah, still stressful. Most companies aren't making a backup on every write that happens to a DB, so whoever deletes a DB knows that they've just made an oopsie that will cause a lot of headache for multiple people. And probably cost a lot of money for the company as well.
As long as you have a backup there's no problem. I've done this before, but if there's no backup you prolly die of stress 😅😅😅
The fact they live streamed while trying to restore the data is a truly epic move.
Hope it was monetized
@@xpusostomos that's why they live stream and post anyway. Not to educate, but rather to make money
Sounds like they had the spare bandwidth ;P
Programmer vtuber when?
@@joseaca1010 already have one: Vedal
_Luckily team 1 took a snapshot 6 hours before..._
This happened to me. I copied a clients database to my development environment about 2 hours before they accidently wiped it.
They called our company explaining what happened and it got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.
You are a Legend :)
@mipmipmipmipmip Why is it bad security practice?
And now you routinely copy the client database every 24 hours?
@@amyx231 Actually, due to the nature of my current work, I have a script I run on demand approx every few days as needed that takes a snapshot. I usually get around to deleting everything that's more than a month old about twice a year or when my dev server starts btching about space.
@@jarrod752 perfect! You can probably set auto delete for one month out safely. But I applaud your caution.
I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.
Actually at the time of the video, what they addressed was the fact that deleting an account could cause problems with the server, it seems they didn't actually stop trolls from deleting an employee's account. I'd have thought employee accounts would be protected. The trolls didn't even get admin powers through privilege escalation, they just reported the target.
read the video description
@@Milenakos every company says they do a manual review, but none of them actually do
@@DevinDTV source??? (edit: i was mostly complaining about you just saying they are lying out of thin air)
@@Milenakos source????????????????????????????????????????????
This reminds me of Toy Story, and how like a month before release the entire animation was accidentally deleted, causing absolute panic and hell at Disney. Luckily, one employee had the whole thing on a hard drive that they were taking home to work on. Her initials are on one of the number plates of one of the cars in the film.
Always make a backup.
Edit: She was a project manager who had to work from home, and the numberplate was actually "Rm Rf" in reference to the notorious line of code that did it.
I don't remember if it was the day-saving employee's initials, or RM-RF that was on the license plate
It was Toy Story 2, and the easter egg was in Toy Story 4, where the license plate had "rm rf" in it
they fired that person recently
@@ScruffyNZ. Yes, I heard this too
@@ScruffyNZ. That person should be happy to not work for a woke company like Disney.
So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session?
Damn that makes me feel way better about my mistakes lol
i would highly suggest people use customized shells, i use oh my zsh, i customize my themes to show git info, hostname (sometimes) and a lot more, not because i wanna know which ssh session im in, but i like the design :)
@Syed Mohammad Sannan No someone has to have that.
The general problem is that there are no safety nets. I don't mean to suggest this is a good solution, because safe-rm is just jank. But using safe-rm would most likely have saved this situation. If you replace rm with a symlink to safe-rm you can configure a blacklist on production that doesn't allow for deleting the database or other critical data.
I find many things about safe-rm to be unsafe. It doesn't protect you if you cd into a directory and then do rm -rf *. A better program should simply evaluate the path it's trying to delete and disallow it if the blacklist covers it.
It also doesn't allow for custom messages through its blacklist. What you want is for a bad rm -rf to send a warning to the user. Otherwise there's no way of guaranteeing they don't just start avoiding the issues.
For example, most likely you're not going to leave your backup unprotected by the blacklist just to create differences between production and backup. So a developer in this situation would expect to run into issues deleting postgres db on either server. It doesn't tell the user anything really. If you instead configure messages you can call attention to the hostname.
The goal is just to induce further friction for dangerous actions. rm has always been so risky because it's so easy.
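To illustrate the kind of wrapper I mean, here's a rough sketch (the blacklist file path, the message, and everything else here are made up by me, not safe-rm's actual behaviour):
#!/usr/bin/env bash
# rm wrapper sketch: resolve each argument to an absolute path and refuse
# anything under a blacklisted prefix, printing a loud message that names the host.
BLACKLIST=/etc/rm-blacklist.conf          # hypothetical config: one protected path per line
[ -f "$BLACKLIST" ] || exec /bin/rm "$@"  # no blacklist on this host, behave like normal rm
for arg in "$@"; do
  case "$arg" in -*) continue ;; esac     # skip option flags like -r, -f
  abs=$(realpath -m -- "$arg") || continue
  while IFS= read -r protected; do
    [ -z "$protected" ] && continue
    case "$abs" in
      "$protected"|"$protected"/*)
        echo "REFUSING to delete $abs on $(hostname) - protected path" >&2
        exit 1 ;;
    esac
  done < "$BLACKLIST"
done
exec /bin/rm "$@"                         # nothing matched the blacklist, hand off to the real rm
Because the arguments are resolved to absolute paths first, this also covers the cd-into-the-directory-then-rm -rf * case.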
@@0xCAFEF00D I always check the hostname of the server and triple check the directory before using the rm -rf command. If in doubt I use the mv command to move it to a different directory as a backup. If everything works ok then I go in there and delete the old directory.
Same thing happened to Pixar's movie Toy Story they were working on. Some storage admin used rm -rf on a directory by mistake and practically wiped out the movie. Lucky someone had a copy of the data on a laptop that was offsite at the time. They were able to rebuild the movie from that data.
@@0xCAFEF00D no, NO single employee should have enough privilege to bring down anything business sensitive, except if you're the CTO maybe. These operations should all require a flag or check from someone else first. Just like how one person usually shouldn't be able to push any code by themselves. They need 1 or more checks before that.
@@shahriar0247 I try to make my prod env glow red like that so that even if I am tired I can see it
Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰
@@L2002 Because of this? They were open and honest about their screwups which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually and it's how you handle this that marks out the good ones from the bad ones.
Also, a company who almost lost a production DB because of failed backups is unlikely to do it again ;-)
@@L2002 Ah, yes, because Microsoft never has outages, data loss, or data leak incide- oh wait..
@@L2002 Care to elaborate? Cause everyone else here, including me, disagrees.
Don't feed the troll, clearly not someone who's ever worked with computers on a proper level
@@titan5064 exactly. Their handle is "L" -- they are a literal walking loss (loser).
Something my first boss taught me (when I broke something big in production in my first few weeks) is that post mortems are to identify problems in a system and how to prevent them, avoiding blame to individuals.
This is huge. Making sure to identify why it was even possible for something like this to happen and how to prevent it in the future is a great way to handle a post mortem like this. Good on the GitLab team.
Good boss. Bad ones often like when things are done fast and "efficient". And when this establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day, a human error occurs. Typically, such a boss will then blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and you technically are).
Yeah, things like this are the problem of the system, not fault of the operators
You only fire people for their character, not cus of the inevitable fuckup.
Also you basically sunk money into training this dude after that fuckup, so sacking him right after you inevitably paid to get him that experience, is counterproductive.
Also very cool that they did it completely in public even with livestreams. This will hopefully help other companies avoid mistakes like that.
Depends. Many times, it's used as on opportunity to kick out people they consider undesirable, even if they're great employees.
I'm glad they didn't fire the engineer. It goes to show the differences in mindsets from some organizations that care about it being a learning experience (albeit an expensive one). Many corporations would have fired the engineer as soon as the issue was resolved without hesitation. Thanks to those orgs who care about their team members and being more concerned with lessons learned.
Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.
Yeah, I guess it was the most stressful moment in his life after realizing what he'd done. I think he had a huge blackout
The rule I apply for backups is that no one should connect to both a backup server and a primary at the same time; two people should be working together. The employee that was logged on both DBs should really have been two physically separate employees
That is an excellent rule.
Yes, but in the case that there is only one employee available and he has to connect to both, he should either have different color schemes for the different servers OR do it all in one shell window and disconnect / reconnect to the server he has to edit. That way it is a lot harder to execute commands on the wrong server.
I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info before getting back to the test/dev server window.
@@refuzion1314 Different color schemes looks good but don't work during an outage, when you are stressed, exhausted, or anything distracts you. Sounds nice, but the mental load during crisis is too large to pay attention to that.
I can't tell you enough how easy it is to accidentally overwrite the wrong file. While I was working on something on a test machine with a usb stick plugged in to save the current progress, I saved the script, thought I had saved it in the local directory, and then copied the unmodified local script over the version I had just saved to the usb stick...
This is why problems like this are actually sometimes good. Of course extremely stressful, but they found sooo many issues and fixed them all. Amazing.
you are assuming they fixed them all
Until it breaks again.
@@federicocaputo9966 at least next time they'll have the experience to know what to do and what not to do
That's a really good positive approach right there!
reminds me of that one ProZD skit, where the villain fixes everything
@@federicocaputo9966 that is life
"You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.
that indeed is wise
Loll
A professional costs you in bold italic and underline. An amateur mostly costs you in fineprint
The person here is not an amateur, anyone can get brain farts especially when working an unexpected overnight, you should try it sometime, you'll start seeing ducks and rabbits in the shell.
@@-na-nomad6247 I'm a veteran in the Devops field. This comedy of mistakes could have never happened to me since I'm following a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their incompetence in understanding the commands they were running.
There is an awful lot that could be learned from this.
1) You should "soft delete", i.e. use mv to either rename the data, e.g. renaming MyData to something like MyData_old or MyData_backup, or just mv it out of the way so you can restore it later if needed (see the sketch after this list). Don't just rm -rf it from orbit
2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script and you just run the script, so that the pre-agreed actions are all that gets done. Do not go off piste, do not just SSH into prod boxes and start flinging arbitrary commands around
3) Change Control - as above
4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or - better still - get a buddy to log onto Server A from their end and you get on Server B from yours. Total separation
5) Do not ever just su into root. You use sudo, or some kind of carefully managed solution such as CyberArk to get the root creds when needed
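As a concrete illustration of point 1, a minimal sketch (the path here is made up):
ts=$(date +%Y%m%d_%H%M%S)
mv /var/opt/myapp/data "/var/opt/myapp/data_old_$ts"   # rename instead of rm -rf
# check that the service / replica is still healthy with the data out of the way, then,
# much later and only once you're sure nobody misses it:
# rm -rf "/var/opt/myapp/data_old_$ts"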
Also for (2), never try to "improve" anything during the actual action.
I once prepared a massive Oracle migration that I had timed to take about 3 hours. Preparation was three weeks.
As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it's gonna save some time". Yeah, made the whole thing slow down to a crawl, so it ended up taking 6 hours. Boss was furious.
So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.
100%, upvoted.
I religiously never delete anything
I have no idea what any of this means but I feel like this is bad
Fucking up catastrophically with Bash commands is a canon event. It is religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive
One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client; I managed to have the project running at a 2nd test environment but that really taught me the importance of backups and telling the rest of staff about a problem ASAP.
Imagine for a moment, that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.
In my first job, I got a 2 am call where, in the first two weeks of working in the company, I accidentally left a process in prod shut down after maintenance, leading to intensive care patient data not making it into connected systems.
Looking back, the entire company was set up super amateurish, yet they operate in several hospitals in my country.
Having exited my game without being sure I saved my progress before, this is very relatable.
the onosecond
like hitting a car in a parking lot 😵
At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server but they determined it wasn't important enough.
The company went belly-up a few years later because of a ransomware attack they couldn't recover from.
When I was still a junior developer at some startup company, I was working on a specific PHP online store. Every time we would upgrade the site, we would first do it on Staging, then copy it over to Production. The whole process was kinda annoying as there was no streamlined upgrade flow yet and no documentation anywhere - it was a relatively new project we took over. I have upgraded it before so I knew what to do, and I just did the thing I always did.
I was close to finishing it up and we had an office meeting coming up soon and lunch afterwards, so I wanted to be done with this before that - so I rushed a bit. And when I was copying files to Production, I overlooked something - I had also copied the staging config file (that contained database access info etc) to the production location and overwrote the production config file.
After the copying had finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I also tried refreshing the production site, just to see if it works. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all, probably just localhost or something, RIGHT?? However after re-refreshing it and confirming I had actually broken production, panic set in.
Instead of informing anyone, I quietly moved closer to my computer, completely quiet, and started looking at what is wrong - with 100% focus, I don't think I was ever as focused as I was then - I didn't have time to inform anyone, it would only cause unnecessary delays. I had to restore this site ASAP.
I remember sweating... the meeting was starting and I remember colleagues asking me "if I am coming" - and I just blurted "ye ye, just checking some things..." completely "calmly" as I was PANICKING to fix the site as soon as possible. Luckily I quickly found the source of the mistake within a minute and had to find a backup config file - and then after recovering the config file, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.
No one actually noticed what I had done - and I just joined the meeting as if nothing had happened - even though I was sweating and breathing quickly to calm myself down, I hid it pretty well.
And this was a long time ago - and still to this day, I still remember that panic very well. Now I always make sure I have quick recovery options available at all times in case something goes wrong - and if possible always automate the upgrade process to minimize human errors
Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "in the zone". It's only ever happened to me twice but I will NEVER forget them.
Good lessons, thank you
Mann, we all have our fair share of breaking production.
I did something similar, but with testing on what I thought was a dev server. Had some close calls, but this time I fcked up. Was super high, but was always high, so doubt that was it. Quickly had to go and undo changes, but was so shook I had to make a chrome ext that would put up some graphics and ominous 40k Mechanicus music whenever I go on a live domain. Haven't made the same mistake since.
You were only lucky because the project had no proper and comprehensive CI/CD pipeline with unit tests.
A competent tech company would have fired you over this.
A helpful hack is to set production terminal to red and test terminal to blue or something like that. Just a small helper to avoid human f’ups if you need to run manual commands in sensitive systems.
I second this. I also use colors to differentiate multiple environments
it's easy, just changing the prompt color... but it makes a huge difference
I use colored bash prompts to differentiate machine roles - my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on red background. It's very hard to miss.
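For anyone curious, a rough ~/.bashrc sketch of that kind of role-based colouring (the hostname patterns are just examples, adjust to your own naming scheme):
case "$(hostname)" in
  *prod*)       PS1='\[\e[1;33;41m\][\u@\h \W]\$\[\e[0m\] ' ;;   # yellow on red: production
  *stage*|*qa*) PS1='\[\e[1;34m\][\u@\h \W]\$\[\e[0m\] ' ;;      # blue: staging / testing
  *backup*)     PS1='\[\e[1;33m\][\u@\h \W]\$\[\e[0m\] ' ;;      # yellow (closest to orange in plain ANSI): backups
  *)            PS1='\[\e[1;32m\][\u@\h \W]\$\[\e[0m\] ' ;;      # green: everything else
esac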
I use oh-my-posh with different themes
Both database servers were actually used in production.
Ultimate workplace comeback: "At least I've never nuked the entire database"
Better to have someone who knows what to do, than someone who has never experienced it
they work remotely
Ultimate comeback to that comeback: "So far."
Your editing is phenomenal. What an insane series of events 😂 Glad gitlab was able to get back to running, seeing all that public documentation was refreshing to see since it shows they were being transparent about their continued mistakes and their recovery process.
I once accidentally ran a chmod -R 0777 /var because I'd missed a dot before the slash (in a web project with a /var folder), which (as I've now learned) may make a unix system totally unresponsive. I can very well understand how it feels, the moment you realize what you have just done. That cost us a few hundred euros and kept 2 technicians busy for an afternoon on the weekend. Lessons learned, today we can laugh about it.
Ya, Unix / Linux will do what you tell it to do without any warnings. Pretty sure you sat there and wondered why that command was taking so long to finish before you realized your mistake. Right then and there it's the "Oh Shit" moment. 😀 Lucky for me though I use VMs so I can always revert to previous snapshots.
the onosecond
@@Darkk6969 What if you ran it on the host?
@@parlor3115 he doesn't, Noah only runs things in virtualized environments, making snapshots every minute
Why does it make it unresponsive? I accidentally chmod 0777 the entire "/" once and well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors. So it just cost me time. Still, I never figured out why opening up the permissions would lock everything up.
"rm -rf" is one of those commands I have huge respect for cause it reminds me of looking down the barrel of a gun (or any similar example of your choosing): Best case, you do it a) seldom, b) after a lot of strict and practiced checks, and c) if there's no alternative; unfortunately, the worst case is when you _think_ you're in that best case scenario.
I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky and hadn't used sudo in that terminal at the time. I got caught on a sudo check before it ran anything absolutely hell inducing. Just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple commands after that was a "rm -rf ./" and it had already cd'd into root.
@@givenfool6169 Lmao it had never once occurred to me what havoc it could wreak if you accidentally source the bash history, since it had never occurred to me that that's even possible (because why the hell would you?!). But of course it is, what an eye opener!
@@henningerhenningstone691 Yeah, I was trying to source my updated .bashrc but my auto-tab is setup to cycle through anything that starts with whatevers been typed (even ignores case) so I tabbed and hit enter. Big mistake. I guess this is why the default auto-tab requires you to type out the rest of the file if there are multiple potential completions.
@@henningerhenningstone691 bro idk wtf you're talking about and it's scaring me
Do ll first, make sure you're wanting to delete that directory, then press up and change ll to rm
For this reason, all our servers have color-coded prompts. Dev/Testing servers are green. Staging is yellow. Prod is bright red. When you enter a shell, you immediately see if you are on a server that is "safe" to mess around with, or not.
The advantage to doing this in addition to naming your server something like "am03pddb", is that you don't have to consciously read anything. Doesn't matter if you accidentally SSH into the wrong server. If you meant to SSH into a "safe" server, then the bright red prompt will alert you that you are on prod. And if you meant to SSH into a prod server, then you better take the time to read which server it actually is.
i agree except there are only so many colors, so if manually controlling a lot of different machines (something that could maybe be avoided depending on what the servers do) i believe it's important to use unique memorable hostnames. the two servers in this story had hostnames 1 character apart and the same length, unless the names were all changed for the artwork
@@tacokoneko Yeah like imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily misleadingly "confirm" that you're on the right one when you're not.
Also, dont ever ever work on the live database, a lesson i have learned the hard way many times on my own.
@@makuru.42 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.
@@MunyuShizumi you make a backup first; yes, you need to maintain it, but not by making massive untested changes.
Nice to hear that they didn't fire him. He did the correct procedure, some of the steps were unknown like the lag caused by the command, which could have been avoided by having clear documentation about it. Also when people are tired late at night, mistakes do happen, which anyone can be the victim of.
When I was just starting in a company, I accidentally deleted all the ticket intervals from the database, causing all the tickets to close immediately and send some massive spam to the admins. I was really terrified of the situation and didn't know what to do; we didn't have any backup either. I apologized as much as I could and didn't make another mistake like this again in years. Sometimes mistakes make you work harder and be more careful in life.
I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later but it felt so bad; can't imagine what deleting an entire database would feel like
terrible, sending the queries make you shiver
I guess the panic was at the next level cause both DBs were deleted.
It feels like lighting a torch onto a sea of currency bank notes... that belongs to the company.
(and company is just about to release year end bonus)
I have had good hands-on experience at my company on SQL databases, but I'd check my query at least 10 times before executing it. We had clients' data from more than 10 years saved in the database.
@matthias916 hope you don't work there anymore. You need more experience with SQL and other IT technologies before you're actually allowed to touch it, so these highly preventable errors don't happen.
You need to learn how databases work, and have a backup/restore system. Not to mention you should be automating queries anyway; that's what pwsh and DevOps are for. Fewer human mistakes. So sad and very amateurish to delete databases without even backing up prior to making changes
The real problem here is that you can delete any user data by simply mass reporting him
I'm seeing a lot of serious problems here... I guess this is why I never heard of GitLab before.
I highly doubt it is instantly deleted; probably someone made the decision to delete it (it could just be an account spamming a bunch of mess onto repositories, and that isn't good either).
@@technicolourmyles they're literally the 2nd largest enterprise git solution provider in the world.
When in doubt, it's probably 4chan
That low-hanging fruit aside, it's not a good thing if someone can just do that with a bot acc. Maybe granting employees special anti-report protection could help until they find a more permanent solution against those trolls
@@PatalJunior 6:21 literally says they fucked up by not making it check the details before deletion
As an engineer for a large company you got me in the feels talking about asking for help or posting a pr and then seeing all the mistakes you made😊
In my previous position I worked closely with one guy and we used to joke about how we were using each other as a rubber duck :D.
The buzzword is SRE and postmortems are supposed to be blameless now...
This is why you first mark the PR as a draft and read over the changes one more time before marking it as ready.
@@stingrae789 Damn I didn't know this thing has a name! I legit have done this before while discussing weird math problems
The #1 thing I learned WAY EARLY on in my IT career (three decades): Never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (and on a live db server, that alone will be enough of a mess to clean up.) As for backups, if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in.
(The fact they're actively hiding ("lying") about this fiasco should be criminal.)
yea renaming is the key. first rename, then setup everything and then delete the renamed folder like a few months later.
The voiceover: outstanding, the editing: premium, the humor: drier than the Sahara *inhales* just how I like it.
I've never hit the sub so fast, keep em coming man!
imagine flag-spamming some employee and managing to bring down the entire site by proxy
What
Bot
@@hypenheimer beep boop
How to take down a site, the stealthy way
@@hypenheimer nah. it was probably a minecraft shorts bot account before he bought it though.
The best practice is to rename the directory or file to something else. Idk how the developers are so calm when using deletion commands
Well, when you live in a poor country, being underpaid by a fucking contractor company, with a overloaded team. shit hapnz
I "deleted" on program from me with the cp command (I wanted to copy the config and the main file in a sub directory, but forgot to enter the directory after it, so it wrote the config to the main file)
(I could get a older version of the file from the SD card, by manually read the content of that region and find one with it on it, as it doesn't override an save, but takes a new place)
On a home system? Absolutely. In a working environment? Doubtful. Maybe with a small company it would be acceptable, but creating an orphan database that may or may not contain sensitive information with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if that contains financial, medical, or government records.
@@Funnywargamesman you don't create it to keep it around forever, you create it as a failsafe for when you are doing potentially dangerous stuff, like deleting a whole database.
@@AndrewARitz I cannot tell you how many times "temporary" things become permanent on purpose, let alone the times people have said they are going to do something, like deleting a temp database they copied locally because their permissions didn't let them use it remotely, and then proceeded to forget to delete it. This will be especially true with the most sensitive databases, "because it's more important, so we should make a copy first, right?"
Security is everyone's job and if you do (typically) irresponsible things like copying databases, "as a failsafe," chances are you are going to form a habit that means you will do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence you need to remember 50% of people are dumber than you and some of them get REAL dumb. If you set policy to say that it would be allowed, then THEY will do it.
This is exactly why I said that home environments and really tiny companies could be different, there it could/would be fine. Chances are, if you don't know the names of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it, it's my opinion.
A few places I worked at as a linux admin or engineer, the shell prompts (PS1) were color coded. Green was dev, yellow was qa and red meant you're in prod. Worked like a charm.
Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.
@@blackbot7113 Yeah, it's a very wise thing to do imo. Currently, I work at a bank, and I recommended we have the header in the UI of the colleague and customer portal be different colors for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at and got a reply along the lines of "How about we just pay attention to the server and page were on?"
It's crazy because it's such an easy change to implement and almost entirely prevents anyone making such silly (yet catastrophic) mistakes.
Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/deployed, so any of those settings get wiped out entirely.
For it to be permanent, it needs to be added in the Dockerfile.
A long time ago we implemented a policy that absolutely nobody operates the production console alone.
There always has to be someone else looking over your shoulder to point out oversights like the one in the video.
This is a good one, but I would add that no one should ever do anything on production by hand; it should all be handled by CI/CD pipelines and go through QA / peer review before any commit goes in
@@gabrielbarrantes6946 I was talking about a time when GIT and devop were just wet dreams.😁
i love how clear you've made it for us to tell whether commands are being run on DB1 or DB2
if only it were that clear irl...
"Slams Ctrl+C harder than he ever had before"
As a relatively new linux user, I felt that one.
As a new Linux user use the "-i" flag for "interactive" when using rm and a couple other commands.
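For example, something like this in ~/.bashrc (keep in mind aliases only apply to your own interactive shell, not to scripts or other users, so it's a convenience rather than a real safety net):
alias rm='rm -i'   # prompt before every removal
alias cp='cp -i'   # prompt before overwriting
alias mv='mv -i'   # prompt before overwriting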
As a windows user, I was very confused
@@KR-tk8fe CTRL+C. On most Unix/Linux based CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (Interrupt) to the foreground process (active program), which usually causes the program to terminate, though it can be programmed to handle it differently. It's basically the "Oh Shit" or "this is taking too long" button.
@@LC-uh8if Isn't that the same in Windows terminals? 🤔
Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.
An old best practice that so many people these days seem to forget or never have heard about is that every week, you try to pull a random file from your backup system, whatever that is. (Or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this - so many people think they set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
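A rough sketch of that weekly spot check, with made-up paths (a mismatch can also just mean the file changed since the backup ran, so treat it as a prompt to investigate rather than proof of corruption):
BACKUP_ROOT=/mnt/backups/latest                     # wherever last night's backup is mounted
LIVE_ROOT=/srv/data
sample=$(find "$BACKUP_ROOT" -type f | shuf -n 1)   # pull one random backed-up file
live="$LIVE_ROOT/${sample#"$BACKUP_ROOT"/}"
if cmp -s "$sample" "$live"; then
  echo "spot check OK: $sample"
else
  echo "spot check FAILED or file changed since backup: $sample" >&2
fi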
What's far more impressive about this whole situation is how calm the engineers were in handling the situation. That to me is far more valuable than having engineers that are too gun-shy to make prod db changes at 12AM and panic when something goes wrong.
One thing I learnt from all this is to never casually run a delete command, and if you have to, paste a screenshot of the command in your group chat before running it
If you are working with multiple shells, VMs, remote sessions or the like - make sure they are color coded based on the machine you are running against!
It can be as simple as picking a different color scheme in windows. But it is just too easy to mess up when all the visual difference is a single number, somewhere in the header.
Yep, I came here to say this. For any serious system I connect to, I use different params for my session, in my case I like old fashioned xterm, something like: alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'"
It's very useful to see the green red, blue etc colouring and be sure which system you're talking to.
It's very nice that Linux shells actually support setting session colors
Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema is safe in case of someone accidentally deleting everything and they told me everything is backed up daily. Kinda scared that I don't know how or where this is happening except for a job.
I checked my database backup script a couple days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately. Then went and fixed the issue and made sure it would notify if there was no backup in 6 hours.
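In case it helps anyone, the freshness alert can be as dumb as something like this in cron (directory, file pattern and address are placeholders, and it assumes a working "mail" command):
latest=$(find /backups/db -name '*.dump' -mmin -360 | head -n 1)   # anything newer than 6 hours?
if [ -z "$latest" ]; then
  echo "no database backup newer than 6 hours in /backups/db" \
    | mail -s "backup freshness alert" ops@example.com
fi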
The next question is... "Have you tested the backups?"
If they can't say for sure WHEN they were tested... Be very afraid...
@@CMDRSweeper we load the prod backup into staging nightly
6 hour full backups, mirroring/replicas, multiple servers and daily volume backups..
"Trust me, bro" only works in Dev. Every other environment needs regular verification.
Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious but if you don't test you will experience catastrophic consequences.
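For a Postgres shop, a nightly restore test can be sketched roughly like this (database, table and file names are placeholders, not anyone's actual setup):
dropdb --if-exists restore_test
createdb restore_test
pg_restore --no-owner -d restore_test /backups/db/latest.dump
psql -d restore_test -c "SELECT count(*) FROM projects;"   # cheap sanity check that data actually came back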
Exactly. Just like a dam, if there is a weak-point at the bottom, it all may come crumbling down.
There needs to be a lot of redundancy when it comes to backups. Especially when it comes to a big server. An engineer accidentally removing a database should not have that catastrophic of consequences.
Yeah, the general rule is/should exist for having to be ready for stuff like that. If your fuckup is non-recoverable or a massive pain, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.
Even before I started working in one company, one IT specialist deleted the directories of the new CC-supporting system. This was shortly after its implementation into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically stayed at work, recreating the environment almost from scratch. :D
This reminds me of all the times I have been in the wrong ssh session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is... you start to just ignore them after a while. It's also kinda dangerous when stuff that is manual and potentially damaging becomes fairly routine.
all things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but it made many other companies reexamine their own fault management. For example, my uni professor told us about this incident and we could comprehend the importance of backups and testing
I think the biggest issue was losing 6 hours of commits and comments.
@@kookie-py Agreed, virtually all of them will have the commits locally as well. Just noting that the data loss is a bigger deal than mere downtime.
This is why programming in general is great, nobody dies if you fuck up. (Obvious exceptions, medical, aviation etc)
@Titanium "nobody dies" (except the people who would die)
@@_Titanium_ cough cough therac software lmao
As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) is enough to save you from porking a prod database.
This interests me. Got any resources/links to set that up (dangerous commands temporarily allowed by time-limited permissions via LDAP)
@@Skyline_NTR Afraid not, it was several pay grades above me both in job role and in coding knowledge, and I lost access to the company slack back in december so I can't really ask anyone...
@@WackoMcGoose Ahh were you laid off also??? I was lol
@@ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze a month later so I had no hope of transferring...
Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts. The backup process familiarity, backups not going to S3, Postgres version mismatches, insufficient WALs space, alert email failures, diligence on abuse deletes... These were all things that could have been and should have been caught way before the actual incident.
OMG, we have all been there haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. And then I realized that I had done the exact opposite, and just deleted three complete days of coding - and would now have to recreate them from scratch 😒😭
"He slammed Ctrl-C" - I can feel the cold gripping feeling you get when you realise you've just caused an accidental catastrophe...
One of my first jobs in IT was working as a big data admin and this video allows me to re-live the spicy moments of that job but with none of the responsibility attached
I can just imagine the relief that team felt when they find SOMETHING that they could use to restore files.
respect for not firing the guy, it was obviously just a small mistake, and it wasn't his fault that the backups didn't work. it shouldn't be possible for 1 command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p
If they fired him they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure the old employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson but a lesson nonetheless.
Yep and he'll train new employees making super sure to emphasize triple checking before deleting from prod.
That’s Unix systems for you. Their open nature makes them super useful for a lot of things but it’s also so easy to break them.
Plus that old trick of telling new linux users that sudo rm -rf is a cool easter egg command wouldn’t be the same with more safeties and preventions.
What if I want to delete everything? I don’t want a baby proofed OS. I want an OS that does what I want. Even if I want to burn it all
@@BitTheByte why buy a computer at that point lol
Linux actually can, in certain circumstances, "undo" this kind of wild situation. Having ZFS as the file system will allow you to revert to a previous image of the filesystem. It's like versioning, but for the entire file system. Of course it takes up quite a bit of space, so it's not done that often; software installs are automated "imaging" points, for instance. But you can trigger one manually when you think you're about to do something you're unsure about. (Since the selection of save states is at GRUB, yes, an unbootable system is still recoverable if you still have GRUB.)
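Roughly, assuming a dataset called tank/data (names are examples):
zfs snapshot tank/data@before-risky-change   # near-instant, copy-on-write
# ...do the scary thing...
# if it goes wrong, roll the dataset back to that snapshot:
zfs rollback tank/data@before-risky-change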
your content is really good, please keep up making these mini documentaries about tech failures!
In my vocational school I had a subject simply called "Databases" and our teacher there once told us a story about how one of his co-workers lost his job.
In essence he did everything right: created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script. When the main database failed a year or two later, they tried restoring from this backup, only to find that there was none.
It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.
I guess that could have been solved by testing the backups. Install the database software on a spare server or just your own workstation, and then restore the backup onto it
@@thewhitefalcon8539 well the backup ran properly for years, he just never thought that the UUID might change
@@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.
@@hummel6364 yeah he kind of deserves to be fired...feel like it should be common sense the hdd could fail, no good excuse to not expect that. You should almost never hardcode stuff, not sure why they thought it was okay to hardcode the uuid of a drive that would one day fail.
@@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.
Yesterday I was added to a support team because we are getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing from this story.
A couple of jobs ago, I had a colleague who managed to do worse than this.
I think they were playing about with learning Terraform and managed to delete the entire account. Prod servers, databases, the dev/qa servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap
😱😱😱😱😱😱😱😱😱😱😱😱😱😱
4:40 never assume you have enough backups, I've been taught this 100 times and I don't even have anything important to backup (for now)
As a dev for a large company who has been on a number of late night calls, I literally gasped at this. But good on the team to work through the issue, and good on management to keep these guys around
Great video! Well produced content about software engineering war/horror stories are exactly what I’ve been looking for, keep it up!
This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! This is very insightful as to what things can take place on these types of environments as well as what are some measures that can prevent major falis like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂
Bro, let's also not forget the damage had already been done: the server was down for like 18 hours, and thousands of PRs, comments, issues and projects were all deleted permanently. This should be a bigger deal
I'd love to see the USCSB do an animation on this incident lmao
I know the exact feeling of terror the moment you realize the command you just ran is about to cause havoc
I thought it was just a meme back then when I saw this on twitter. Is this for real?! OMG. Everything aside, big applause to GitLab for not blaming a single person when this happened; such a nice company to work for.
I didn’t realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future
I am gonna share this with some of my coworkers
This video is awesome! The step by step analysis of what occurred during the outage coupled with the story telling format helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!
I barely understand anything here, but all I can say is massive thanks to the team who have worked hard, advancing our computer tech to the current state we have!
I deleted the main site from our backend in my first month as a full stack developer. Fortunately i figured out how to rebuild the apache server and clone the repository but i definitely worked well past my hours that day and the stress was crazy
I absolutely HATE how database backups and a lot of other common tasks just don't report any progress in the terminal when you run the command. It's agonizing because you don't know if the command is working or hung on something. For example, a MySQL database I work on takes a good hour or so to back up. During that time, we get NOTHING from the terminal output, so we have to monitor separately using IO and CPU tracking tools to make sure the SQL instance is still doing something.
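One workaround some people use, assuming the pv utility is installed (database name and path here are placeholders), is to pipe the dump through pv so you at least get a live byte count and throughput:
mysqldump --single-transaction mydb | pv | gzip > /backups/mydb_$(date +%F).sql.gz
It won't tell you how far along the dump is, but at least you can see that bytes are still flowing.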
As for rm -rf, I've made it a habit to either take a manual snapshot immediately before running it on any production data, or more often I'll just make a copy of the directory right before to a temp directory, that way there is always a copy of the data before I remove it from where it normally resides. It's saved me from stupid decisions more than once...And always, ALWAYS verify backups exist where you expect them and the contents look complete before you make sweeping changes! We like to deploy a backup to a test server before major upgrades, just to make sure it restores as expected. It can take an extra day or so to do, but it's a good verification step that ensures any issues with our backups are caught during regular maintenance, instead of in the middle of a crisis.
And wrong window syndrome...yeah it sucks. I've restarted or shut down the wrong servers more than once. I'm sure my mild dyslexia doesn't help in that regard...
Having a live screenshare with team members watching might seem a little wasteful. But for critical procedures like this, it is well worth the added cost.
Most people don't see the importance of such extreme level of caution until it's too late. It's like handling a firearm.
the realization of what you're doing before it finishes is so cruel and happens so often; that's why when you're doing a job you always do it slowly but correctly
Wow! This was great and so interesting. I'm so glad I found this channel. I would love to hear more in depth analysis of software engineering fails
Very interesting and easy to understand for layman. I’m sure most of us could also learn from the mistake even if we don’t deal with databases or code
Ps - If you meant to blur/delete the names at 9:50 you missed the “replying to” part.
Thanks Kevin, and thanks GitHub. We still love you, and your recovery effort seemed great (and kindly presented by Mr. Fang here!) and altogether as humane as possible. The recovery stream to calm people down was a great idea, and I bet it helped a lot of people to not freak out. I would have been freaking out, and the fact that you guys didn't, but came through methodically, is very inspiring.
I hope the lesson is: don't give AI filters baked into databases any Actually_Important responsibilities. I'm paused at "lesons laenrd" and now unpausing...
GitLab not GitHub.
were you having a stroke?
I’m pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I’ve ever followed and ended up motivating me to learn a ton about databases, cloud practices, devops, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for transparency
One strict rule I always follow when connecting to prd servers via ssh or DB UI agent (pgadmin) is I always use different background colors,
Red for prod
Green for staging
Black for test and local
+ double checking every command
You can never be sure enough
This gives me good insight on why our tech team keeps breaking shit….
Two things to remember:
1. Always backup before you start a change even if you have an automated backup system.
2. Audit your recovery procedures (a quick sketch of both points follows below).
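A minimal Postgres-flavoured sketch of both points (names and paths are made up):
stamp=$(date +%F_%H%M)
pg_dump -Fc mydb > "/backups/pre_change_mydb_$stamp.dump"            # 1: ad-hoc backup right before the change
pg_restore --list "/backups/pre_change_mydb_$stamp.dump" > /dev/null \
  && echo "dump is at least readable"                                # 2: the laziest possible audit; a full test restore is better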
I do not know how I got here, I don't get most of the video, but I am absolutely lovin' it as it's bloody entertaining.
I've always been paranoid when working in Prod. Always make it a point to have at least the Ops Lead on a screen-sharing session where I show what I'm doing while requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy ohh boy does it make me feel safer.
It may be slow but look at it this way. You're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.
I have something analogous to the 24 hour rule for shopping when I’m doing anything sketchy in a prod environment: before I hit RETURN and irreversibly commit to an operation, I leave my keyboard, stretch, go for a walk, grab coffee etc. it helps a lot with the tunnel vision. 10 minutes AFK can save you hours of pain later.
I like how the terminal has the decoration of some linux-y windowmanager, but the message boxes are winXP xD
To prevent confusion in terminals with nearly identical host names I recommend changing the PS1 variable so you can clearly see which shell you are in
red color:
[FUCKING PROD DB]:~/$
Green color:
[FUCKING BACKUP DB]:~/$
and you wonder why you are unemployed.
I experienced something similar a couple years ago, it's the kind of thing that you think only can happen to others but yeah... I had to delete some specific data from the production database, I created the sql requests and executed them to the testing environment. The dataset between those databases is completely different, and the requests passed without any issue. But when I passed them to production they were taking way too long and then I realised. I almost had a panic attack. I reported the incident immediately and was mentally prepared to be fired. Fortunately we could retrieve most data from a backup and the lost ones were not that big of an issue. I still work in the same company :p
Honestly the worst part of this was all the backup failures
Wow, the amount of stuff I learned here is huge. Please make more reviews like these; I subscribed and turned on notifications, please don't disappoint me
absolute nightmare. loved every min of this
The server was:
Delete 5GB userstuff in like 5 minutes? Rough; might cause some desync.
You want to pg_basebackup? Sure, just give me a few minutes before I even start.
rm -rf important stuff? Clear as day, I'll permanently chunk through > 300GB in that one second before you try to cancel.
Team member 1 would be a valuable hire. You can be sure he will never make the same mistake again.
3:52 it was at this moment when the viewers collectively scream, transcending space-time and raising a cosmic choir of dread and regret.
And yes... this is exactly the reason why I didn't study programming / engineering in college, and instead opted for graphic design / communication.
if I write or design something wrong and it gets published, well, at worst the publication stays published as a reminder of my mistake; in programming, all it takes is one finger mistake, misremembering something, or just a simple distraction, and you can absolutely wipe an entire company's network infrastructure out of existence.
Weird and almost unbelievable reason, not every software related position gives you the possibility to destroy the database or mess up critical things in production, and even if that's the case there are tons of safety nets that are here to prevent that, being human ones or software ones, If you can completely wipe everything from a one finger mistake it just means that you were doing things wrong from the start.
Btw if you extrapolate things like that and imagine it's that easy to mess things up then also take the extreme example of making a pretty bad blunder about a popular brand in your design/writings and having that printed in millions of copies and already distributed to the public, costing your company a bunch of money.
@@heroe1486 first of all, both mistakes can cost a company a bunch of money, but one doesn't mess with years' worth of data.
Second, you are being too hopeful, maybe because of where you live, but I live in a third world country; such a thing as a good network structure (that would prevent something like that from happening) is not that common here.
And third, a visible error on something printed can be obviously seen and edited, but good luck checking lines upon lines of code to find an error.
Those are just two very different jobs with very different risks. Maybe you can make your company lose more money with graphic design, but you can't, at any point, wipe their whole data, nuke their systems, and so on, because of a mistake while doing your job.
It's like comparing being a carpenter and being a chef: both work with sharp objects, both can hurt themselves pretty badly while working. However, you can get your whole arm cut off way more easily as a carpenter, just from a simple mistake or a moment of distraction at a big power saw, whereas as a chef you really need to be doing something really wrong, or be totally distracted, to burn yourself that badly.
Can't wait for part 2
Could just be missing the sarcasm but if you're referring to the ending Google bard isn't exactly the best at being factually accurate...
@@kevinfaang maybe he's being ominous :o
The story telling/edit is unmatched. Hands down best docu/short movie on youtube😂!
Been there, done that, only in my case it was taking down the main network interface on a Solaris YP server used by an entire site of Solaris servers and workstations. The entire site ground to a halt in an instant. I didn't have access to the DC to get local access, either, so I had to make a red-faced confession to my boss for him to make the 2 mile drive to the secure DC.
This is so cool to know the inner workings of a team like this
Another trick is this: Have different color/background profiles in your terminal application for different servers, or at least types of servers. That way, you're more likely to notice that you're typing in the wrong place right now.
Also test an rm with an ll first
Of course I've made a few big fuckups myself in over 30 years as a systems administrator. But this really proves the point that developers should have absolutely _no_ access to production servers.
This is what makes me sad about devops: originally I believe it meant having some seasoned sysadmins cooperate closely with the developers. But now it seems to have become "who needs grumpy sysadmins; they just block all progress, hate when stuff changes, and call for formal change requests, documented change procedures, and other boring stuff. Just let the developers run the entire show, they'll figure it out, and new releases and fixes will be applied much faster."
Developers and sysadmins have completely opposite objectives. Developers produce new code, i.e. changes; they want to deploy faster, more often, even automated. Sysadmins want stability and consistency, so things should not change. They see any change as dangerous, a potential disaster.
Oh, and if you haven't tested restoring your backup, you don't _have_ a backup.
Lot of people here preaching a whole lot of "this wouldn't have happened if..." nonsense, so nice to hear someone being honest about their own mistakes. The honest truth is that if something like this has never happened to you in your career, it probably means that you never reached the point of being trusted enough to be in a position to screw everything up.
I will say (and, sadly, this is from experience) that no roller coaster on earth can ever simulate the feeling of sinking in your guts that comes when you realise that you've just done a very, very bad thing and that in the next 30 seconds a lot of alarms are about to sound...
@@TheDaern Absolutely right.
There are also a lot of people suggesting various "aids" to prevent mishaps. (Myself included.) Such things may be useful. But there is a danger that one day your safe aliases, or whatever safety net you implemented, simply aren't in place for some reason.
The problem is that the commands you use in a "safe" environment, where there is little risk of damage, are the same ones you use in critical environments. Experienced sysadmins spend a lot of time on systems where the wrong command can be fatal, and as a consequence they tread lightly and carefully. A person who mainly works as a developer may have a more casual approach, thinking "trial and error" is a good strategy. I've seen a very experienced developer log on to a production server (as root), start tcsh (because he wanted command editing and history, which the - at that time - default ksh88 didn't offer, at least not without a little extra work), run a grep across a directory of log files, and redirect the output to a file. As a result of how tcsh handles globbing, his command ended up grepping its own output, resulting in an infinite loop and a filled log partition.
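Roughly the kind of command that bites you here - a hypothetical reconstruction, not his exact command or paths:
cd /var/adm/app/logs        # made-up directory, just for illustration
grep ERROR * > errors.out
# If errors.out sits in that same directory and gets matched by the glob
# (because it already existed, or because of how the shell orders redirection
# and glob expansion), grep ends up reading lines it has just written, every
# one of which matches ERROR, so it writes them out again - an endless loop
# that keeps going until the partition fills up.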
Sysadmin work is - imo - worse than brain surgery, because when you do brain surgery, you are keenly aware that the life of the patient is at risk, and there is just no room for errors. For sysadmin work there is no clear difference between working in a forgiving environment, and in an environment where errors may affect - even harm or kill - hundreds or thousands of people if your luck is really bad.
The only reasonably safe approach imo is to always act as if every keystroke is potentially lethal. The worst aspect of this is that most of the time, a careless approach works fine. I don't know how many times I've watched a senior developer do stuff, typing commands fast on a production server, while I was clenching my butt to avoid soiling my trousers - yet nothing bad happened. But just a little typo could have meant disaster.
You've got this backwards. First, your claims about developers being more casual are without any basis, seemingly all anecdotal. Personally, I've dealt with a high-strung and extremely serious platform developer who doesn't trust the sysadmins to manage physical volumes in production. I think personality has nothing to do with this.
Sysadmins typically might avoid process because of all the red tape that normal change controls/requests cause. Developers in great companies typically have a well-defined (and very short) development cycle for writing, testing, and deploying code. Developers in large and bureaucratic companies will have well-defined processes but too much unnecessary red tape. There weren't any developers consistently involved in this outage; had they been more involved, it would almost certainly have been avoided. But the devs set up the backups once and never tested them again. That's the largest problem! Why did this happen? I have no idea. The engineers working this at night were clearly tier-2 support engineers with little knowledge of the overall environment!
The lack of development in this area (environment configuration / IaC) is why this became an outage! It looks like it was entirely caused on the sysadmin side, but they're not the ones responsible. Most likely it's the management team, whoever set up the teams that manage the production environment.
Take it from a guy who worked as a sysadmin! You CAN be a sysadmin and a developer, and probably have to be nowadays. The old title of system administrator, a holdover from a decades-old industry, is dying. The role now requires more knowledge of the tech stack and less memorization of Linux commands. If people fail to adapt, anyone can run into an outage just like this one.
WHY would you ever take on the responsibility and risk of manually running commands when some clever design and planning can cover most, if not all, of these kinds of situations? The senior dev who SSHed into production and started hammering away at the keyboard is NOT doing dev work or sysadmin work; he's just being careless.
@@antdok9573 No, I haven't got it backwards. And I don't think I said that you can't be both a developer and a sysadmin, or a devops engineer for that matter. It seems you and I have vastly different experiences as sysadmins. Here is an example from 20 years ago, when the company I worked for was introducing the use of BEA WebLogic Integration. I was actually a developer at that time, but I had a long sysadmin background. When deploying the first pilot test webservice to the test server of the operations department (after having tested it to work fine on the test server in the development department - with almost watertight shutters between the two departments, and certainly NO developer login to any servers in ops) it turned out that it failed, because WLI assumed it could connect freely to the Internet. Not in this company, for sure! The development test servers weren't as strictly set up, so there WLI had no problem retrieving an XML DTD from its official URL. The operations test server, however, was locked down completely and would not open connections to the outside. Now tell me, what would you do to solve this tiny problem?
@@lhpl I would fix the team structure. There are communication issues, regardless of people's occupations. There shouldn't be any distinction between dev & operation teams unless there's some archaic requirement for it (compliance), and even then, you can probably avoid rigid team structures like these.
I don't need to know any of the technologies to see there's a leadership issue toward the top of the teams.
Had this video on in the background. The Teams notification sound really caught me off guard...
I must say, I'd heard many stories about this, but that was a very nice summary of the nitty-gritty details, thank you. (:
Don't work when exhausted, tag out for fresh eyes, and keep separate physical laptops for primary and secondary systems.