I was watching this in realtime cause I had a gitlab account. They fixed it on-stream. People wanted him to be fired, but the lead helped him and REFUSED to punish him, saying 'we all make mistakes'. They fixed it, did a post mortem, and moved on. :)
Yeah, firing someone for a human mistake that led to a massive outage, because your N layers of safety and backups didn't work, is just using the employee as a scapegoat. Everyone makes mistakes. It's important to have safeguards, and it's important that companies have a dedicated team/person to manage risks and build safeguards based on them
@@Slashx92 That was the message of their post mortem. A person *shouldn't* be able to destroy prod in a single line. They've since added processes and safeguards. Made me love GitLab even more.
@@Slashx92 right, pitchfork mobs are so useless, crying for a pound of flesh with no real understanding or offer of solutions; they'd never be the same people campaigning for real solutions like giving 'rm -rf' seriously burdensome safeguards
Love it!
such great coworkers
i do devops and this video stressed me out the entire time
What is devops? Can you explain it to me? 😊
rename first. Then wait 4 hours. Then delete.
@@fagnersales532It's kind of a code heavy evolution of system administration - keeping complex and distributed computer systems running, performant, and available, using tools like configuration management, infrastructure as a service tooling, monitoring software, and other automation.
@@fagnersales532 "To make error is human. To propagate error to all server in automatic way is #devops.” --- one great spiritual leader
I don't have a job and I'm not going to watch this video at all tbh
For production servers we actually alias "rm", "mv" and all other installed tools that delete/rename files so that they ask for confirmation and print the user, host and affected files.
What are these tools?
@@Skorps1811 rm "remove" and mv "move", are commands for deleting and moving files
@@Skorps1811 "rm", "mv" and all installed other tools that delete/rename files
thanks for the nugget wisdom!
@Skorps1811 We ask people every now and then what they use when maintaining a server. Usually this only includes some of the CLI tools preinstalled on e.g. ubuntu-server, stuff like rm, mv, rsync, dd and so on. We try to use IaC, so working on the server is kind of rare anyways. So we ask people to maintain using only commands on the list (that we aliased, to add that extra sanity check), and if they want to use another tool they can just expand the list and add the alias. Made more sense than going out and aliasing literally everything...
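A minimal sketch of what such a guard could look like, as a bash function rather than a plain alias so it can inspect its arguments (the wording and layout are illustrative, not their actual setup):

```bash
# ~/.bashrc on the server: shadow rm with a wrapper that shows context
# (user, host, arguments) and demands confirmation before running it
rm() {
    printf 'rm requested by %s on %s\n' "$(whoami)" "$(hostname)"
    printf 'arguments: %s\n' "$*"
    read -r -p "Type 'yes' to continue: " answer
    if [ "$answer" = "yes" ]; then
        command rm "$@"   # 'command' bypasses the wrapper and calls the real rm
    else
        echo "aborted"
    fi
}
```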
Legend has it: On that day, a site-reliability engineer was born
lmao
Pretty much 😂
backups are automated, it's just that DMARC was needed here, which also failed
the amount of anxiety i felt while watching the original video the first time was insane
i felt so bad for the guy, imagining myself in that position 🤣
Yeah me too. I work in a mixed development and production environment and we regularly rm -rf a lot of databases and you better believe that even after more than two years on the job I'm still sweating and double checking each time.
I never had such a fuck up happen to me, but I'm so secondarily traumatized from all the fuck ups I've witnessed and stories I've heard over my career that I've become as paranoid as if it had happened to me.
They say he's a dev, right? This is what happens when you shrink a whole team of a DBA, dev ops and QA into one dude. It doesn't work LOL
🤣 and i can't stop laughing at primeagen going through the same situation... thinking of what's about to come next....
@@ikilledthemoon yes, that's my doubt too... don't they have a separate operations team 🤨... in my company, my project alone has 40+ triage members....
yeah same🤣🤣
All of the in-progress Toy Story 2 work got deleted by a rogue rm -rf. The backups failed. The only reason that movie came out was because someone was working from home and had the stuff synced to a remote server
Fun fact: that was one of Pixar's producers Galyn Susman. She was laid off by Pixar in 2023 for seemingly no reason....
"software engineers hate him... find out this one simple trick a dev used to fix all bugs permanently"
no code solution to help us all
there was (is) a particular service at my last company which was so bloated and difficult to deal with, we used to joke around that the only way to "fix" it would be to delete it from existence 😂😂
@@ThePrimeTimeagen "serverless"
@@ThePrimeTimeagen literally.. 'no code'
Ok, I'll come clean on my rm -rf ~, but this is really weird so get ready.
I work on windows with vim, and I wanted to create a file. I did :e ~/folder/filename
The thing is for some reason, (maybe because I used the wrong / instead of \...) a folder ~ was created in my current working directory (project I was working on)
I opened a terminal (powershell) and typed "ls" to see that yes, a stupid folder "~" was in my project, then I typed "rm -rf ~"...
Then hell broke loose...
I hit CTRL-C maybe harder than GitLab engineer 1, because it was taking some time, and I didn't expect that...
I realized that ~ even in windows powershell, is your current user folder. (since when!?)
Basically, I had lost all my dot files, and surprisingly, none of the other files.
My boss helped me restore my files from a backup of that day, and I was back up and running in 45 min.
20 years of experience in IT, windows and Linux, and that happened...
i _pretty much_ have the same story except! i accidentally created the ~ file in vim (d~ are so far apart from each other i have no idea how i did this)
anywho, i ls'd and boom, there it was... so i rm -rf ~
after about 1 second, i realized the command was taking WAY too long...
rm -rf'd my week
These types of battle-wound stories should be compiled for future generations to learn from.
@@ThePrimeTimeagen The lesson I learned is that I never rm -rf first thing; I always mv first, so if the hoped-for results don't manifest I can at least undo my attempt
To all the dot files taking the bullet, thank you for your service o7
First of all ls don’t work on powershell I call bs on the story 😂 JK
I interviewed with Gitlab back in 2018, after this event, and remember one of the interviewers telling me that they were GTFO Azure and moving the GCP because of issues they had in the past with them
GCP is insanely worse. I've been using it for a year now and it makes me want to tear off my skin
@@mattymerr701 I've used the 3 major cloud providers (because I mostly do contract work) and TBH... all 3 suck so bad. Azure is the worst offender because they have a thing for blocking your account without prior notice (I am talking accounts that spend 25 to 33k a month on server infra). Google's documentation sucks and AWS is just like spaghetti. But I'd rather have yucky docs and spaghetti services than getting my account blocked for days without prior notice.
I feel for the guy. I've been working in the field for a year now, and I had a mishap, a happy little accident if I may say so, where I accidentally deleted more than I was supposed to from a table in the prod db. So I spent at least 8 hours, after my shift, learning how to use backups to restore the table etc. It was a learning experience as well, but now I'm so paranoid that for every statement I have to run on the production db, I check thrice.
keep config and scripts as code, save them to files, and just run those
This was such a "multiple brains required" activity that I hope that if such a thing would happen to me, that despite the ungodly hour I would be smart enough to get in someone else to double check everything we do. If only to spread the impact of blame. To have to type that "I might have wiped all production data" message... poor soul.
@@MeriaDuck it never hurts to also take a snapshot or backup before the operation you have to perform, and probably also put the DB in down mode so no one is using it when you do your maintenance work. Also try blue-green stuff: copy the DB, update the copy, test the copy, then switch to the copy. Save yourself some trouble
There's usually a way to quickly export it which you should know if you're administrating it in production, although you should have snapshots anyways in ideal circumstances. Even a CTAS may not be a bad idea to back it up if it's not large.
But yeah the worst tasks in IT are handling db deletions in prod while under the gun.
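For the "quickly export it" part, a minimal sketch assuming PostgreSQL (which is what GitLab runs); the table and database names here are made up for the example:

```bash
# dump just the table you're about to touch, before touching it
pg_dump --table=projects --format=custom gitlabhq_production > projects_pre_maintenance.dump

# or the CTAS-style copy mentioned above, kept inside the database itself
psql gitlabhq_production -c 'CREATE TABLE projects_backup AS TABLE projects;'
```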
ALWAYS opt to move/rename and not remove/delete. Deleting is one of the most dangerous things you can do.
Definitely this.
Especially with how fast hardware is and how cheap storage is nowadays….
WOW, thank you!
I understand that and mostly do that too, for smaller dirs.
But knowing how they usually provision DB mountpoints, I bet that mountpoint doesn't even have enough space to accommodate the current broken db and take in backups from db1.
Moving it to other mounts takes time and they might not even have enough space
@@monk3y206 Even if it doesn't have 2x capacity like that, you can watch it to make sure the restore starts correctly, and then remove it once you're confident all is well, while still having the ability to back out and undo if things go poorly. I actually had a near-identical problem the other day with a replicated mongo cluster; fortunately I knew this trick and didn't end up wiping the db.
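The rename-don't-delete idea from this thread, sketched out. A rename within the same filesystem is effectively instant and needs no extra space, which is what sidesteps the mountpoint-capacity worry (paths are illustrative):

```bash
# park the suspect data directory instead of deleting it
mv /var/opt/pgdata /var/opt/pgdata.broken.$(date +%F)

# restore or resync into a fresh directory, verify everything works...

# ...and only then reclaim the space
rm -rf /var/opt/pgdata.broken.*
```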
Honestly huge respect for GitLab!
I think they've handled this problem pretty responsibly, which you sadly can't always expect from a big corporation.
huge respect? having 2 of 3 backup mechanisms not even working, allowing a user to delete employee data, and performing manual untested operations on prod?
@@paulsernine5302yeah, sounds pretty irresponsible
@@paulsernine5302 Yes, huge respect for finding the problem, detailing it so thoroughly, and coming up with a future plan that addresses this. No one gets everything correct right out of the gate. They were under the impression that multiple stages of this process were working when they were not. They had the correct processes in place, they just forgot to make sure they were all functioning correctly. I doubt they will make this mistake again.
@@paulsernine5302 It's just a company offering CI/CD pipeline automation as a service. Who would expect them to handle their own product's delivery pipeline properly?! 😜
It's not the fact that the engineer deleted the database in prod. It's the fact that the company's backup procedures cost them a full 24 hours.
I find the outcome is not that bad taking into account this chain of disasters!
shockingly tepid
16:41 Etsy has/had a yearly award for the best outage. It's the "three-armed sweater award" because the 500 page is someone knitting an extra arm into a sweater. It was highly coveted, with an actual physical trophy (I think one year it was an actual 3-armed sweater)
When I review my own PR I'm taking off my 'pride of authorship' goggles and putting on my 'well ackshually' code review hat everyone else gets.
And people think it's a joke when editors let you change the entire theme based on the db connection. It's only funny until it's tragic.
If I'm running a "drop database" or "rm -rf" on production, I absolutely get a couple of other people to look at what I'm doing before I press Return.
And a favourite "joke" of mine is to approve a potentially dangerous command someone else is about to run, waiting for their hand to start flying to return and saying "oh WAIT" when it's too late to stop. I'm a bad person.
@@jeremykothe2847 oh hey satan, we meet again
@@jeremykothe2847 Nice positive work environment you got there, bud.
This is why I make my Administrator Command Prompt background bright red (sometimes blue, but at least obviously not the default).
This one basically happened to me once. One day before a presentation I was cleaning up my dead GitHub projects, and this was before they added the copy-the-project-name confirmation for deleting. I yeeted the project I was supposed to present on, and the sheer stress alone took 2 years of age out of me.
Luckily a partner still had that project on their local machine, because lord forbid, GitHub didn't allow restoration of that specific repository, because F me in particular I guess.
I remember, at my previous job, I felt the same way when they gave me credentials for the prod database and it had all the grant and admin access. I didn’t use it once, instead asked them to create a new readonly user and give it to developers.
That's the best way. The fewer permissions you have in prod, the fewer responsibilities you can be forced or tempted into. Nothing easier to prove your absence in a fuckup than to demonstrate you didn't have the permissions to fuck up.
I accidentally nuked a production MongoDB collection (the most important in the database).
We restored our backup.
Our CTO told me it's okay and told me about the GitLab incident.
I feel this. :(
I had a less intense rm -rf experience.
I was pretty green back then learning Python with Ubuntu and had multiple versions of Python installed. It got pretty annoying so one day I decided to do some clean up. Unbeknownst to me, I happened to delete the Python version that Ubuntu was using and lost the GUI interface + a bunch of functionality and only had access to CLI. Panic ensues and frantic googling began. Fun times.
What is a GUI good for, please?
I mean, since you can do anything at the CLI ...? 😊
(Says the one who learned to program when even "glass ttys" were a luxury not everyone had access to and if you don't know what that term means just use your imagination.)
Trial of the True Coder: Save your precious GUI with the power of the CLI.
Gpt advised me to rm rf last week lol
@@neildutoit5177 OK, if she did recommend this wrt. the server she lives on, I'd say she's suicidal 🤔.
Wrt. the computer you used to talk to her it could simply mean _,"nah, how can I get rid of that bastard" ..._ 👹
@@mittelwelle_531_khz you're just jurassic, man. I don't know what to tell you.
This is why you run darkstorms and drill your team's DRPs with all team members. I can remember that during one of our dark storms we found out a good portion of production hadn't even been backed up, and these servers hosted about a quarter billion worth of contracts. If not for that, we would have been absolutely decimated during an actual event.
Practice practice practice, and have empathy for those who make mistakes, I guarantee that was all of us at some point, and even possibly some of us who become too complacent in the future 😅
I need a compilation of this kind of anecdotes as "Production Friday Horror Stories"
FWIW, I've been working on computers for about 40 years, and could describe dozens of situations like this which have happened to me. For a few of those I was actually hyperventilating while coming up with the fix for some minor mistake which had major consequences.
The other one I’ve seen is someone setting up replication but replicating in the wrong direction 😱
Remember folks we’re all vulnerable to these mistakes, it just takes a certain set of circumstances/tiredness/distractions etc
Seen this a few times and always makes me laugh. Nothing better to kick off your weekend and get the heart racing than demolishing your entire website in the space of 5 seconds.
Yea, I actually re-watched it two days ago including the cloudflare and capital one
good timing to have a couple of days off to send out applications.
Did Prime ever realize that "they never accidentally deleted a production database again" was sarcasm?
At work we had the manufacturing personnel enter data into a database using a form. Sometimes they had to open a transaction, do their data entry, check that the result was as expected, and then commit the transaction. If they forgot the commit, the transaction could be open for a long time, and if there was a hiccup in the database connection, the transaction and their changes would be lost. To remind them to close the transaction I had the background of the form, which was normally just grey, turn a bright colored stripe so you could tell at a glance that a transaction was open. Maybe the terminals could also be color coded
This sort of stuff is why I have color-coded the prompts on all of my servers differently in my login scripts, so I can tell them apart from each other and from my local machine.
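A sketch of the same idea for a bash login script; the hostname patterns and colors are just examples:

```bash
# ~/.bashrc: make the prompt itself say which box this is
case "$(hostname)" in
    *prod*)  PS1='\[\e[41;97m\][PROD \h]\[\e[0m\] \u:\w\$ ' ;;  # red background
    *stage*) PS1='\[\e[43;30m\][STAGE \h]\[\e[0m\] \u:\w\$ ' ;; # yellow background
    *)       PS1='\[\e[42;30m\][\h]\[\e[0m\] \u:\w\$ ' ;;       # green background
esac
```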
If I’m ever on a prod machine, I verbally say what action I’m about to take, then I type it out, then I read it again, then I say it again while reading it. Quadruple confirmation before executing any command. Everyone looks at me like I’m weird when I do it, but they *don’t understand* how fast and how far shit can go downhill.
My guy, every one on my team does it. The people who don't do quadruple checking are the weird ones.
At 13:10 it's because in a PR you're seeing the whole thing holistically rather than in bits and pieces. Not the current state but all of the changes all at once.
And yeah, I totally review my own PRs as well.
Gitlab is probably in the "best" position to lose their data; it won't matter as much as for almost everyone else. Everyone has the full blockchain of commits on their disk for at least the repos they were working on, and if nobody in the world has a copy, maybe it wasn't that important after all..
I also find most things wrong about my code when looking at the PR file diff view. One reason is that I then see the aggregated changes and the other that it's getting official now and it better be good...
exactly. different mode
Write drunk, edit sober. Or maybe it’s the other way around?
@@efkastner Don't drink and write...
"Do you have a change control process?" Yes, I'm the controller of change.
I feel like double or triple checking can always fail. But a safeguard wired into the system itself is just better. For example, on all our clients' production accounts we show a different banner if it's production. You cannot confuse it. Also, prevent rm -rf with something like an alias that wraps rm -rf in prod, so an alert, a prompt, or ANYTHING is shown telling you that you are in prod, and you have to confirm your command by WRITING "confirm"
This just sounds like every other company that trusts too much in its humans, which are the riskiest component of a system
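One way to wire that "type confirm" guard, assuming prod machines export some marker like ENVIRONMENT=production (the variable name and the danger wrapper are made up for the sketch):

```bash
# gate any destructive command behind an explicit prod confirmation
danger() {
    if [ "$ENVIRONMENT" = "production" ]; then
        echo ">>> PRODUCTION ($(hostname)) - about to run: $*"
        read -r -p 'Type "confirm" to proceed: ' reply
        if [ "$reply" != "confirm" ]; then
            echo "aborted"
            return 1
        fi
    fi
    "$@"
}

# usage: danger rm -rf /var/opt/pgdata.old
```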
7th time rewatching this. just realised some guy tried to use gitlab repo as “some sort of cdn” (2:02).
absolute genius madlad!
The conversation probably went like:
- "I accidentally deleted db1"
- "call HR"
This literally is the perfect storm of worst case scenarios on a Friday afternoon😂😂 I’m borderline having PTSD conniptions from the flashbacks every time you mention it, I agree it’s pretty much the worst
Someone else just said PTSD in the live chat 😂😂 It’s 100% a physical response
That’s why we have command center tier calls for incidents of this kind with bunch of eyes on the issue to prevent one man issues created in stress mode
I work in DevOps/Platform engineering. This is the reason I always do "rm -ri" for single files and folders on prod. You can do "rm -rI" (capital i) for more files at once if you want to take the risk... But never blindly remove stuff. I get flak sometimes when I force a person who has requested deletion of a prod file to be on the same call as me and verbosely confirm the deletion. But I will not have it any other way. I want control, confirmation and a clear conscience after doing those kinds of operations in prod. Different story for nonprod and preprod, but in prod you should always doubt yourself. There should be fallbacks and backups. Safety. "Better to be safe than really, really sorry".
I am half way through the video, and I just wanted to say once I received a call around 10PM from a colleague from work saying he messed around with the production db and accidentally ruined it, so he had to drop it and was calling me to get it back since I created the backup procedures...
this... is the most beautiful message ever sent
"Uhm... You what?"
@@oblivion_2852 that was my response as well, and his was "ah, it's not a big deal, we don't get much traffic during the night anyways", since we were working on a GPS tracking solution for deliveries that happened mostly during the day. I guess my sleep and well-being were also not that important at that time, since I managed to pull an all-nighter to get everything working and ready for things to continue on as if nothing ever happened by the morning. If you're wondering what happened to me and my colleague: since all of this was happening early in my career, it basically went unnoticed.
I work cybersecurity in a CERT, and this would be an absolute nightmare. Ensure your backups are regularly tested.
The number of times I've looked at my diffs and gone "wait a damn minute that's wrong"
its the secret of good engineering
2:00 "We removed a user [using gitlab as a CDN], causing high load" roflmao
The worst thing I ever did like this was to remove a small, static db in our Test env, which was easily restored.
I had used the GUI at 3am when my eyes were very unclear, instead of using a shell and typing the command out. It was another case of 'wrong environment.' I still think about it, absolutely mortified at what could have been. 😱
The worst thing you’ve ever done…yet.
12:55 yep, I always do this. Git GUIs like git kraken and sublime merge have saved me many a time when I look at a piece of code completely out of place..
16:41 team member one exposed a whole list of colossal failures in GitLab which, in the end, strengthened their infrastructure. I'd say it's a pretty quick win on his part.
I truly feel for those guys ... they are brave, committed, very professional .... Heroes no less !!! ... - - Anyone - - I repeat - - Anyone - - could have done the same .... ..... (don't ask me how I know that ... 😔 )
Did any of you guys notice... at 0:30 when he said hit subscribe, the subscribe button actually animates 🙂 idk, I'm only noticing it now..
I worked with this type of stuff on a smaller scale. I never made a mistake like this, but I made mistakes, had no experience, didn't know what I was doing, and was constantly struggling with feeling inadequate. After this, I now know I'm not the only one being inadequate.
If there's one thing I've learned from being a junior developer to becoming an Engineering Manager, it's that imposter syndrome is consistent at every level of engineering. Just do what you know how to do and never be afraid to escalate and ask for help. People want you to ask for help.
On my first internship, I was given a warmup project to process some video files from S3, and my boss suggested I try mounting the bucket locally. After successfully mounting it, I made a script to automate the process and wanted to test it, but it would fail because the mount directory was already mounted. So I went ahead and rm -rf'd it... and after 10 seconds I started to wonder why it took so long, and then I realized that's not how unmounting works.
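For anyone in the same spot: the mount point is just a directory, and deleting through it deletes the objects in the bucket; detaching is what you want. A sketch, assuming an s3fs-style FUSE mount at a made-up path:

```bash
# wrong: removes the bucket contents *through* the mount
# rm -rf /mnt/videos

# right: detach the mount, leaving the directory (and the bucket) alone
fusermount -u /mnt/videos   # for FUSE mounts such as s3fs
# or, as root:
umount /mnt/videos
```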
I had a new junior run `sudo mv / ` instead of `cp -r ./ `. To this day I have no idea how they switched cp with mv, how they forgot to include the subdir, and why for whatever reason sudo was invoked, but our filesystem ran full lol. I was actually surprised nothing was deleted.
@@Blaisem mv NEVER deletes anything unless the data was moved successfully first
@@usernameak ah makes sense, thanks!
I feel like 90% of all applications out there are exactly like this, just that they haven't yet had a bad enough incident to realize how incomplete or broken their backup/recovery system is.
Well as a devops myself, this was one hell of a ride.
"One time I deleted my home directory"
Oh well, it was 1h before a live demo for me.
it happens
Fun little story: We wanted their enterprise solution on our on-prem k8s cluster, but they told us GitLab on k8s was not prod ready and instead offered us a VM solution, which my happily-married-to-k8s manager did not like, and we ended up not using their solution
As stressful as it may be, it is still arguably tamer than rm -rf a person's life in prod, as is the case of THERAC-25.
The bad thing is, under pressure, you forget to double check.. Your head just executes a queue of commands in there
I'm ashamed to admit that while trying to generate white noise in my speakers, I've run the command "sudo dd if=/dev/random of=/dev/sdc1" instead of "sudo dd if=/dev/random of=/dev/dsp1" three times. I was young, though, but it certainly taught me to respect the root user... I just wish it could have done so the first time and not the third time...
That is a message I dread to ever have to send...
That I just nuked production royally to the point you need backups
I was in "multiple terminals" mode yesterday, sorting out SSL certificate renewals for servers i run. The tension from that alone was bad enough (ours is a small operation, miniscule) but watching this video gave me the eebie-jeebies - deleting database files by hand? Jeeeeeeez
This guy should be putting this on his CV.
Achievements -> Deleted GitLab prod db and survived.
"I have plot armor, so I'm a main character"
every now and then i come back to this video because it's such an amazing part of history
At my first startup they were running a database in a container (6 years later I think this was a terrible idea, but some people still do it). I was having some performance issues, so I killed the container. Usually a safe thing to do, since they're meant to be ephemeral. Then I realized the db was gone. Neither I nor my boss knew that the other dev had made a backup. My boss was not forgiving. I cried a lot that night.
Run a database _server_ in a container, sure. But keep your database _files_ in a persistent volume.
@@criptych Wait, I did not understand that they were not keeping the database files in a persistent volume. I was thinking: what is wrong with running a database server in a container?... I wouldn't do it that way even on my home server. How do people do stuff like this?
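A minimal sketch of the persistent-volume version, using the official Postgres image as an example (the names and password are placeholders):

```bash
# the data directory lives in a named volume, so killing or recreating
# the container doesn't take the data with it
docker volume create pgdata
docker run -d --name db \
  -e POSTGRES_PASSWORD=example \
  -v pgdata:/var/lib/postgresql/data \
  postgres:16
```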
"Guys, I may have just accidently deleted db 1" 😂😂
The 'different brain mode' thing is why Digital Artists like myself Flip the Canvas. A single button press and it gives you a mental reset, all the mistakes that you can't see because you've been staring at the work for 8 hours, suddenly pop out at you.
12:10 If you are going to rm -rf on a production server... just don't. In fact, just don't SSH onto production at all. If you think you need to, that's probably indicative of other (much larger, systemic) problems that need sorting out.
13:02 yes and it pains me because I should diff before committing. But I always just do it in the PR.
I think it would be appropriate that regardless of what you are running in a db1 terminal, it should always have a case-sensitive Linus-style "Yes, I want to run this on our production servers." prompt
You can do this with a lot of commands by adding the -I flag per alias. For rm, for example, it makes it so that if there's a lot of files to delete it'll ask for confirmation.
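That's GNU rm's built-in speed bump; a one-line sketch of the alias idea:

```bash
# -I prompts once before removing more than three files or before removing
# recursively; -i would prompt for every single file instead
alias rm='rm -I'
```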
The transfer speed being limited to 60 Mb/s is literally the "f*** Microsoft, I'm sorry, but we have to run these updates now" meme.
1:16 Oh God, the nightmares... "PAGER DUTY ALERT! PAGER DUTY ALERT!"
As someone who "RAWDOGS" production databases daily as my job, my main strategy for not creating this problem is vertically splitting my terminals on my monitor. Left is always the "readonly" information terminal and the right one is the "already broken, get it fixed" one.
On the right one I work fast and efficient. On the left one I work slow, methodical and safe.
12:35 i agree with all the tips around double checking stuff, especially the PR verification one; legit i find bugs as i see my code from a different perspective
god damn... i thought looking at your own PRs/MRs would be a no-brainer... but so many people seem to not do it (it makes me cry and die a bit inside)
I've done this as tech support for a shared webhost, but it was just that single person's Wordpresses that were brand new. I still felt terrible. Zero backups between them and our hosting due to them being such a new customer.
In the scenario of a CLI tool not responding, I'd sooner chunk strace up in another session on the pid to ensure it's doing something. It's that easy baby!
I had to pause this video a couple of times to take a breath and steel myself, like when you know a jumpscare is coming up in a movie but you don't know where or when.
I can instantly relate to that feeling of dread that must have been going through that guy's body when he realized.
The pain of knowing you could have just waited it out is insurmountable 😂 Can't help going into panic mode.
team-member-1 is now the most valuable team member. He now has knowledge, experience, but most importantly, paranoia to run commands in terminal
I worked for one of the worlds biggest websites about 15 years ago. We experienced an outage covered by many major outlets. The root cause was a dev had written a stored proc to delete rows where the parameter to the sp was also the column name. So “delete from x where id = id “ just nuked the table.
We of course had DBAs who dutifully reviewed this and ran it. When the dev had tested it, the SQL*Plus prompt told him he deleted 1 row - because his dev db had only 1 row. The DBA was just incompetent, apparently.
Funny thing. The dev in question had just handed in his notice but was being kept on as a contractor for way more money.
And the sites covering the outage all reported it as a large scale hack.
I managed to escape the building before being dragged back to help.
My uncle was a SOC analyst. I was 15 at the time, and one Good Friday he got back and was about to take a nap, but suddenly his company was being breached. After the call he sat down, grabbed a bottle of alcohol, sighed and left. He came back by 3 the next morning
all the servers I care about have an emoji in their shell prompt so that I always know where I'm at. Ever since I started doing this I had exactly 0 fuckups related to mixing up ssh sessions. Highly recommended.
Great idea.
As far as the actual horrible issue is concerned, team-member-1 was only the person that hit the final key. There were much bigger issues in play
Like a murder mystery series; would love to see a series like this done American Greed-style
The bit about finding bugs/seeing code differently when viewing it as a pull request doesn't surprise me at all. Common advice to someone editing a paper/article/thing-they-wrote is to view it in a different format. Print it out, save it as a PDF, change the font if it's all you can do. Anything to make it look and feel different from when you wrote it. Helps you see it differently, and suppresses the tendency to read what you think you wrote instead of what you actually wrote. This seems like a similar principle, just for code.
Fun fact: GitLab could've likely recovered their DB data much faster by dumping the filesystem journal that was still in the Linux kernel at that point in time, and undeleting the files, as the journal preserves the old file metadata up until the computer is rebooted.
that 60mb/s was funny as hell :D "Can someone get a dialup instead to make it faster??"
13:05 I never do PRs on my own projects.
But before I commit I usually do "git add -p" and am likely to look at every line of code.
Sometimes I also look at the diff in Gitk.
Often I find some lines which I don't want to be committed, like debug logs, unimportant changes like variable renames, or uncommenting of parts of the code, or I find obvious mistakes like having copied some function (like insert_before to insert_after) but not replaced all the internal calls (like push_before -> push_after or get(i - 1) -> get(i + 1))
12:50 Same thing, I always create a merge request in a draft state and review all the changes there. If needed I do some cleanup etc and then I finalize the MR. I never seem to find everything that is wrong with my code when just reviewing it in the IDE or git gui.
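A sketch of that pre-commit / pre-MR self-review loop with plain git commands (branch names are placeholders):

```bash
# review each hunk interactively before staging it
git add -p

# one last pass over exactly what the commit will contain
git diff --staged

# or look at the branch the way a reviewer will see the merge request
git diff main...my-feature-branch
```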
This gave me some proper anxiety and unpleasant flashbacks, thx a lot.
My biggest F up so far (still in college so there's still time) was when I was trying to figure out a bug with a gitlab test for my HW assignment saying my .h files weren't formatted correctly. I assumed I had changed them somehow (it was stipulated we shouldn't change them since they were provided files). So I figured I could delete them and replace them with the originals. I did an rm *.h and deleted my entire project, including the .c files. I had to load a backup from gitlab that was 2 hours old, and I had just spent those two hours almost completely rewriting it and had forgotten to back up. This happened at 10:30 pm on the day it was due... My soul left my body... It ended up being OK because the next day it was announced we had a couple of grace days, and I ended up getting 100% on the assignment. But that was one of the worst nights I've ever had during my school life. I'm glad it happened though, because it's been burned into my psyche and it will 100% be in my mind every single time I use an rm command.
When I first started git I failed my push because I didn't understand it, and somehow I ended up refetching the origin, overwriting all my changes locally lol.
@@Blaisem holy hell, my condolences lmao
You should approach production issues the same way aircraft pilots do: buy time. If you have any hysterics because "nobody can use our product", don't do things you don't know can fix it; do things that mitigate the issue. Maybe you need to fail over to a backup or something, but panicked rollbacks are how Knight Capital went under
the "forgot to enable DMARC authentication" and the misspelling of "lesons laerned" got me so good bruh
9:15 you're laughing. The whole dev team is panicking and you're laughing
I like thePrimagen’s career advice. None of it is exactly tailored to a chronically homeless wash out, but still *great* advice
The combination of stress and laughter was amazing!
I kinda feel like this is where having a managed DB from a good provider would help having all the “industry best practices” in data storage and backup checked.
this actually happened to me, that I mistakenly deleted most of the staging db. Test environment was in the next tab. I have not done something similar after. What helped most was naming terminals and having a +1 person on the call when doing something like this. Having another person present is most helpful
The fact that the second Prime heard they were switching between terminals he got worried...this dude engineers.
_"I'll be right back."_
**the video ends**
_"umm... dad?😭"_
No DB, no problem :D
Worst one I had at work was a SysAdmin deleting all the VPNs of the clients :D
I was consulting at a gig in Connecticut when they were backing up a server for an upgrade. They reused the restore floppy and adjusted the script. One minor error. They forgot to comment out the format disk command. They restored from backup.
Bad execution, good planning.
Your production console should scream "Production"
Using the colors, like red, works really well
pro tip:
have different background and font colours for every ssh session and root terminal tabs.
It helps a lot.
Especially if the one where you must not screw up has a red background.
that troll who reported an employee won the jackpot :)