Dev Deletes Entire Production Database, Chaos Ensues

  • Published: 20 Nov 2024

Comments • 2.7K

  • @VestigialHead
    @VestigialHead A year ago +22972

    Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.

    • @1996Pinocchio
      @1996Pinocchio A year ago +2259

      The legendary Onosecond.

    • @NS-sd3mn
      @NS-sd3mn A year ago +400

      @@1996Pinocchio I see that you watch Tom Scott

    • @youngstellarobjects
      @youngstellarobjects A year ago +961

      The stress should really be minimal if you have a backup-and-restore procedure that actually works and that you know how to use. Mistakes happen. The problem wasn't the delete command; it was the nonexistent backups and documentation.

    • @LeoVital
      @LeoVital A year ago +683

      @@youngstellarobjects Nah, still stressful. Most companies aren't making a backup on every write that happens to a DB, so whoever deletes a DB knows that they've just made an oopsie that will cause a lot of headache for multiple people. And probably cost a lot of money for the company as well.

    • @pqsk
      @pqsk A year ago +56

      As long as you have a backup there's no problem. I've done this before, but if there's no backup you prolly die of stress 😅😅😅

  • @Chris_Cross
    @Chris_Cross A year ago +4996

    The fact that they live-streamed while trying to restore the data is a truly epic move.

    • @xpusostomos
      @xpusostomos A year ago +105

      Hope it was monetized

    • @godjhaka7376
      @godjhaka7376 11 months ago +80

      @@xpusostomos that's why they live stream and post anyway. Not to educate, but to make money.

    • @Elesario
      @Elesario 10 months ago +35

      Sounds like they had the spare bandwidth ;P

    • @joseaca1010
      @joseaca1010 9 months ago +15

      Programmer vtuber when?

    • @kv4648
      @kv4648 8 months ago

      @@joseaca1010 already have one: Vedal

  • @jarrod752
    @jarrod752 A year ago +4778

    _Luckily team 1 took a snapshot 6 hours before..._
    This happened to me. I copied a client's database to my development environment about 2 hours before they accidentally wiped it.
    They called our company explaining what happened, and it got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.

    • @abelkibebe577
      @abelkibebe577 A year ago +95

      You are a Legend :)

    • @ilyasziani5504
      @ilyasziani5504 A year ago +11

      @mipmipmipmipmip Why is it bad security practice?

    • @amyx231
      @amyx231 A year ago +44

      And now you routinely copy the client database every 24 hours?

    • @jarrod752
      @jarrod752 A year ago +185

      @@amyx231 Actually, due to the nature of my current work, I have a script I run on demand, roughly every few days as needed, that takes a snapshot. I usually get around to deleting everything that's more than a month old about twice a year, or when my dev server starts btching about space.

    • @amyx231
      @amyx231 A year ago +20

      @@jarrod752 perfect! You can probably set auto delete for one month out safely. But I applaud your caution.
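The on-demand snapshot routine described in this thread can be sketched in a few lines of shell. Everything here is an assumption for illustration: the function name, the snapshot directory layout, the 30-day retention amyx231 suggests, and the default of pg_dump (any dump tool slots in the same way).

```shell
#!/usr/bin/env bash
# Sketch of an on-demand snapshot-and-prune routine (all names illustrative).
set -eu

snapshot_db() {
    # $1: directory for snapshots, $2: database name,
    # $3: dump command (defaults to pg_dump; parameterized so it's testable)
    local snap_dir=$1 db_name=$2 dump_cmd=${3:-pg_dump}
    mkdir -p "$snap_dir"
    local stamp
    stamp=$(date +%Y%m%d-%H%M%S)
    # Take a point-in-time logical copy, compressed to save dev-server space.
    $dump_cmd "$db_name" | gzip > "$snap_dir/${db_name}-${stamp}.sql.gz"
    # Prune snapshots older than 30 days so the disk doesn't fill up.
    find "$snap_dir" -name "${db_name}-*.sql.gz" -mtime +30 -delete
}
```

Running the same function from cron would also close the "I only delete old ones twice a year" gap, since the prune runs on every snapshot.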

  • @Webmage101
    @Webmage101 A year ago +4645

    I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.

    • @alex_zetsu
      @alex_zetsu A year ago +370

      Actually at the time of the video, what they addressed was the fact that deleting an account could cause problems with the server, it seems they didn't actually stop trolls from deleting an employee's account. I'd have thought employee accounts would be protected. The trolls didn't even get admin powers through privilege escalation, they just reported the target.

    • @Milenakos
      @Milenakos A year ago +12

      read the video description

    • @DevinDTV
      @DevinDTV A year ago +153

      @@Milenakos every company says they do a manual review, but none of them actually do

    • @Milenakos
      @Milenakos A year ago +3

      @@DevinDTV Source??? (Edit: I was mostly complaining about you claiming, out of thin air, that they are lying.)

    • @foreigngodx6
      @foreigngodx6 A year ago +54

      @@Milenakos source????????????????????????????????????????????

  • @Dragonfire-486
    @Dragonfire-486 A year ago +3657

    This reminds me of Toy Story, and how, like a month before release, the entire animation was accidentally deleted, causing absolute panic and hell at Disney. Luckily, one employee had the whole thing on a hard drive that she was taking home to work on. Her initials are on one of the number plates of one of the cars in the film.
    Always make a backup.
    Edit: She was a project manager who had to work from home, and the number plate was actually "Rm Rf", a reference to the notorious command that did it.

    • @mrsharpie7899
      @mrsharpie7899 A year ago +146

      I don't remember if it was the day-saving employee's initials, or RM-RF that was on the license plate

    • @alimanski7941
      @alimanski7941 A year ago +348

      It was Toy Story 2, and the easter egg was in Toy Story 4, where the license plate had "rm rf" in it

    • @ScruffyNZ.
      @ScruffyNZ. A year ago +164

      they fired that person recently

    • @atulyadav3197
      @atulyadav3197 A year ago +32

      @@ScruffyNZ. Yes, I heard this too

    • @GoatzombieBubba
      @GoatzombieBubba A year ago +197

      @@ScruffyNZ. That person should be happy to not work for a woke company like Disney.

  • @SIMULATAN
    @SIMULATAN A year ago +16383

    So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session?
    Damn that makes me feel way better about my mistakes lol

    • @shahriar0247
      @shahriar0247 A year ago +623

      I would highly suggest using a customized shell. I use oh-my-zsh and customize my theme to show git info, the hostname (sometimes) and a lot more; not because I want to know which SSH session I'm in, but because I like the design :)

    • @0xCAFEF00D
      @0xCAFEF00D A year ago +323

      @Syed Mohammad Sannan No, someone has to have that.
      The general problem is that there are no safety nets. I don't mean to suggest this is a good solution, because safe-rm is just jank, but using safe-rm would most likely have saved this situation. If you replace rm with a symlink to safe-rm, you can configure a blacklist on production that doesn't allow deleting the database or other critical data.
      I find many things about safe-rm unsafe. It doesn't protect you if you cd into a directory and then do rm -rf *. A better program would evaluate the path it's trying to delete and disallow it if the blacklist covers it.
      It also doesn't allow custom messages through its blacklist. What you want is for a bad rm -rf to send a warning to the user. Otherwise there's no way of guaranteeing they don't just start working around the issue.
      For example, you're most likely not going to leave your backup unprotected by the blacklist just to create differences between production and backup. So a developer in this situation would expect to run into issues deleting the postgres db on either server; that tells the user nothing, really. If you instead configure messages, you can call attention to the hostname.
      The goal is just to add friction to dangerous actions. rm has always been so risky because it's so easy.

    • @Darkk6969
      @Darkk6969 A year ago +94

      @@0xCAFEF00D I always check the hostname of the server and triple-check the directory before using the rm -rf command. If in doubt I use the mv command to move it to a different directory as a backup. If everything works OK, then I go in there and delete the old directory.
      The same thing happened with the Toy Story movie Pixar was working on. A storage admin used rm -rf on a directory by mistake and practically wiped out the movie. Luckily someone had a copy of the data on a laptop that was offsite at the time, and they were able to rebuild the movie from that data.

    • @BuyHighSellLo
      @BuyHighSellLo A year ago +73

      @@0xCAFEF00D No, NO single employee should have enough privilege to bring down anything business-critical, except maybe the CTO. These operations should all require a flag or check from someone else first, just like how one person usually shouldn't be able to push code by themselves; they need one or more checks before that.

    • @desoroxxx
      @desoroxxx A year ago +21

      @@shahriar0247 I try to make my prod env glow red so that even if I am tired I can see it.
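The blacklist-plus-warning behaviour 0xCAFEF00D asks for can be sketched as a small shell wrapper. This is not safe-rm's actual interface; the function name, the blacklist contents, and the message format are invented for illustration. It resolves the absolute path first, so the "cd into a directory and rm -rf *" hole mentioned above is also covered, and it names the host in the warning so a tired operator notices which box they are on.

```shell
#!/usr/bin/env bash
# Hypothetical rm guard (not safe-rm's real interface): refuse blacklisted
# absolute paths and print the hostname in the refusal message.
SAFER_RM_BLACKLIST="${SAFER_RM_BLACKLIST:-/etc /var /var/opt/gitlab/postgresql/data}"

safer_rm() {
    local target abs protected
    for target in "$@"; do
        case "$target" in -*) continue ;; esac  # skip option flags like -rf
        # Resolve to an absolute path so relative deletes are checked too.
        abs=$(readlink -f -- "$target") || continue
        for protected in $SAFER_RM_BLACKLIST; do
            if [ "$abs" = "$protected" ]; then
                printf 'REFUSED on %s: %s is blacklisted\n' "$(hostname)" "$abs" >&2
                return 1
            fi
        done
    done
    command rm "$@"
}
```

Aliasing rm to a wrapper like this on production boxes is exactly the "induce friction" idea: the refusal happens before anything is deleted.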

  • @rosscads
    @rosscads A year ago +7141

    Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰

    • @TheDaern
      @TheDaern A year ago +755

      @@L2002 Because of this? They were open and honest about their screwups, which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be, and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually, and it's how you handle it that marks out the good ones from the bad ones.
      Also, a company that almost lost a production DB because of failed backups is unlikely to do it again ;-)

    • @MunyuShizumi
      @MunyuShizumi A year ago +345

      @@L2002 Ah, yes, because Microsoft never has outages, data loss, or data leak incide- oh wait..

    • @sinnlos229
      @sinnlos229 A year ago +74

      @@L2002 Care to elaborate? Because everyone else here, including me, disagrees.

    • @titan5064
      @titan5064 A year ago +109

      Don't feed the troll; clearly not someone who's ever worked with computers at a proper level.

    • @realpillboxer
      @realpillboxer A year ago

      @@titan5064 exactly. Their handle is "L" -- they are a literal walking loss (loser).

  • @maxcohn3228
    @maxcohn3228 A year ago +4770

    Something my first boss taught me (when I broke something big in production in my first few weeks) is that post mortems are for identifying problems in a system and how to prevent them, not for assigning blame to individuals.
    This is huge. Making sure to identify why it was even possible for something like this to happen, and how to prevent it in the future, is a great way to handle a post mortem like this. Good on the GitLab team.

    • @lhpl
      @lhpl A year ago +254

      Good boss. Bad ones often like it when things are done fast and "efficiently". When that establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day, a human error occurs. Typically, such a boss will blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and technically you are).

    • @FireWyvern870
      @FireWyvern870 A year ago +35

      Yeah, things like this are a problem of the system, not the fault of the operators.

    • @honkhonk8009
      @honkhonk8009 A year ago +72

      You only fire people for their character, not for the inevitable fuckup.
      Also, you've basically sunk money into training this dude through that fuckup, so sacking him right after you paid to get him that experience is counterproductive.

    • @gownerjones
      @gownerjones A year ago +37

      Also very cool that they did it completely in public even with livestreams. This will hopefully help other companies avoid mistakes like that.

    • @FlabbyTabby
      @FlabbyTabby A year ago +10

      Depends. Many times it's used as an opportunity to kick out people they consider undesirable, even if they're great employees.

  • @robbybankston4238
    @robbybankston4238 7 months ago +78

    I'm glad they didn't fire the engineer. It goes to show the difference in mindset at organizations that treat it as a learning experience (albeit an expensive one). Many corporations would have fired the engineer as soon as the issue was resolved, without hesitation. Thanks to the orgs that care about their team members and are more concerned with the lessons learned.

  • @EssensOrAccidens
    @EssensOrAccidens A year ago +440

    Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.

    • @ic6406
      @ic6406 10 months ago +7

      Yeah, I guess it was the most stressful moment of his life, realizing what he'd done. I think he had a huge blackout.

  • @ludoviclagouardette7020
    @ludoviclagouardette7020 A year ago +4099

    The rule I apply for backups is that no one should be connected to both a backup server and a primary at the same time; two people should be working together. The employee that was logged into both DBs should really have been two physically separate employees.

    • @act.13.41
      @act.13.41 A year ago +209

      That is an excellent rule.

    • @refuzion1314
      @refuzion1314 A year ago +158

      Yes, but in the case that only one employee is available and he has to connect to both, he should either have different color schemes for the different servers, OR do it all in one shell window and disconnect/reconnect to whichever server he has to edit. That way it is a lot harder to execute commands on the wrong server.

    • @thoriumbr
      @thoriumbr A year ago +38

      I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info, before going back to the test/dev server window.

    • @thoriumbr
      @thoriumbr A year ago +124

      @@refuzion1314 Different color schemes look good but don't work during an outage, when you are stressed, exhausted, or distracted. Sounds nice, but the mental load during a crisis is too high to pay attention to that.

    • @onemprod
      @onemprod A year ago +20

      I can't tell you how easy it is to accidentally overwrite the wrong file. While I was working on a test machine with a USB stick plugged in to save my progress, I saved the script, thought I had saved it in the local directory, and copied the unmodified script over the version I had just saved to the USB stick...

  • @Nick77ab2
    @Nick77ab2 A year ago +4027

    This is why problems like this are actually sometimes good. Extremely stressful, of course, but they found sooo many issues and fixed them all. Amazing.

    • @federicocaputo9966
      @federicocaputo9966 A year ago +96

      You are assuming they fixed them all.
      Until it breaks again.

    • @JeyC_
      @JeyC_ A year ago +149

      @@federicocaputo9966 At least next time they'll have the experience to know what to do and what not to do.

    • @brett2258
      @brett2258 A year ago +23

      That's a really good positive approach right there!

    • @djweavergamesmaster
      @djweavergamesmaster A year ago +17

      reminds me of that one ProZD skit, where the villain fixes everything

    • @mikabakker1
      @mikabakker1 A year ago +3

      @@federicocaputo9966 that is life

  • @Misanthrope84
    @Misanthrope84 A year ago +24150

    "You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.

    • @urbexingTss
      @urbexingTss A year ago +421

      that indeed is wise

    • @shahriar0247
      @shahriar0247 A year ago +37

      Loll

    • @blue5659
      @blue5659 A year ago +628

      A professional costs you in bold, italic and underline. An amateur mostly costs you in fine print.

    • @-na-nomad6247
      @-na-nomad6247 A year ago +1150

      The person here is not an amateur; anyone can have brain farts, especially when working an unexpected overnight. You should try it sometime; you'll start seeing ducks and rabbits in the shell.

    • @Misanthrope84
      @Misanthrope84 A year ago +242

      @@-na-nomad6247 I'm a veteran in the DevOps field. This comedy of mistakes could never have happened to me, since I follow a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their incompetence in understanding the commands they were running.

  • @mxbx307
    @mxbx307 A year ago +658

    There is an awful lot that could be learned from this.
    1) "Soft delete": use mv to rename the data, e.g. MyData to something like MyData_old or MyData_backup, or just mv it out of the way so you can restore it later if needed. Don't just rm -rf it from orbit.
    2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script, and you just run the script, so that the pre-agreed actions are all that gets done. Do not go off piste; do not just SSH into prod boxes and start flinging arbitrary commands around.
    3) Change control, as above.
    4) If you have Server A and Server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or, better still, get a buddy to log onto Server A from their end while you get on Server B from yours. Total separation.
    5) Do not ever just su to root. Use sudo, or some kind of carefully managed solution such as CyberArk, to get the root creds when needed.
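Point 1 can be sketched in a few lines. The function name and the timestamped suffix are invented for illustration; the `_old` naming convention is from the comment above.

```shell
#!/usr/bin/env bash
# "Soft delete": rename out of the way instead of rm -rf (names illustrative).
soft_delete() {
    local target=$1
    [ -e "$target" ] || { echo "no such path: $target" >&2; return 1; }
    # Timestamp the suffix so repeated soft-deletes of the same path don't collide.
    local backup="${target}_old_$(date +%Y%m%d%H%M%S)"
    mv -- "$target" "$backup"
    echo "$backup"  # report where the data went so it can be restored later
}
```

Restoring is a single mv back; actually destroying the `_old_*` copies becomes a separate, unhurried decision instead of part of the risky operation.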

    • @magicmulder
      @magicmulder A year ago +49

      Also for (2), never try to "improve" anything during the actual action.
      I once prepared a massive Oracle migration that I had timed to take about 3 hours. Preparation was three weeks.
      As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it's gonna save some time". Yeah, made the whole thing slow down to a crawl, so it ended up taking 6 hours. Boss was furious.
      So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.

    • @lashlarue7924
      @lashlarue7924 A year ago +1

      100%, upvoted.

    • @xpusostomos
      @xpusostomos A year ago +3

      I religiously never delete anything

    • @thedemolitionsexpertsledge5552
      @thedemolitionsexpertsledge5552 A year ago +2

      I have no idea what any of this means but I feel like this is bad

    • @alvinbontuyan8083
      @alvinbontuyan8083 9 months ago

      Fucking up catastrophically with Bash commands is a canon event. It's religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive.

  • @Dairunt1
    @Dairunt1 A year ago +182

    One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client. I managed to get the project running on a second test environment, but that really taught me the importance of backups and of telling the rest of the staff about a problem ASAP.

  • @gosnooky
    @gosnooky A year ago +2720

    Imagine for a moment, that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.

    • @omniphage9391
      @omniphage9391 A year ago +171

      At my first job, I got a 2 a.m. call because, in my first two weeks at the company, I had accidentally left a process in prod shut down after maintenance, leading to intensive-care patient data not making it into connected systems.
      Looking back, the entire company was set up super amateurishly, yet they operate in several hospitals in my country.

    • @PixelSlayer247
      @PixelSlayer247 A year ago +65

      Having exited my game without being sure I saved my progress before, this is very relatable.

    • @thephlophers
      @thephlophers A year ago +40

      the onosecond

    • @stacilynn604
      @stacilynn604 A year ago +10

      like hitting a car in a parking lot 😵

    • @ashesagainst7236
      @ashesagainst7236 A year ago +48

      At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server but they determined it wasn't important enough.
      The company went belly-up a few years later because of a ransomware attack they couldn't recover from.

  • @CryShana
    @CryShana A year ago +512

    When I was still a junior developer at a startup, I was working on a PHP online store. Every time we upgraded the site, we would first do it on Staging, then copy it over to Production. The whole process was kind of annoying, as there was no streamlined upgrade flow yet and no documentation anywhere; it was a relatively new project we had taken over. I had upgraded it before, so I knew what to do, and I just did the thing I always did.
    I was close to finishing, and we had an office meeting coming up soon with lunch afterwards, so I wanted to be done before that, and I rushed a bit. And when I was copying files to Production, I overlooked something: I had also copied the staging config file (which contained database access info etc.) to the production location, overwriting the production config file.
    After the copying had finished, thinking I was finally done, I relaxed and prepared myself for the meeting. As I was closing everything, I refreshed the production site, just to see if it worked. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all; probably just localhost or something, RIGHT?? After re-refreshing and confirming I had actually broken production, panic set in.
    Instead of informing anyone, I quietly moved closer to my computer, completely silent, and started looking at what was wrong, with 100% focus; I don't think I was ever as focused as I was then. I didn't have time to inform anyone; it would only have caused unnecessary delays. I had to restore this site ASAP.
    I remember sweating... The meeting was starting, and I remember colleagues asking if I was coming, and I just blurted "yeah yeah, just checking some things..." completely "calmly" while I was PANICKING to fix the site as soon as possible. Luckily I found the source of the mistake within a minute, then had to find a backup config file, and after recovering it, everything was fixed. Followed by a huge sigh of relief. The site must have been down for only around 2 minutes.
    No one actually noticed what I had done, and I just joined the meeting as if nothing had happened; even though I was sweating and breathing quickly to calm myself down, I hid it pretty well.
    This was a long time ago, and to this day I still remember that panic very well. Now I always make sure I have quick recovery options available at all times in case something goes wrong, and if possible I automate the upgrade process to minimize human error.
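The config overwrite in this story is mechanically preventable: copy staging to production with the environment-specific files excluded. A minimal sketch, assuming rsync is available and assuming the config lives in files named config.php and .env; the real project's layout is unknown.

```shell
#!/usr/bin/env bash
# Sketch: push staging files to production without ever touching env-specific config.
deploy() {
    local staging=$1 production=$2
    # config.php and .env are per-environment (DB credentials etc.):
    # exclude them so a staging copy can never overwrite production's.
    rsync -a --exclude='config.php' --exclude='.env' "$staging/" "$production/"
}
```

With the exclusion baked into the one deploy command, rushing before a meeting can no longer clobber the production credentials.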

    • @valdimer11
      @valdimer11 8 months ago +34

      Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "into the zone". It's only ever happened to me twice, but I will NEVER forget them.

    • @yt-sh
      @yt-sh 5 months ago +1

      Good lessons, thank you

    • @vjndr32
      @vjndr32 5 months ago +4

      Mann, we all have our fair share of breaking production.

    • @obanjespirit2895
      @obanjespirit2895 5 months ago +3

      I did something similar, but testing on what I thought was a dev server. I'd had some close calls before, but this time I fucked up. I was super high, but I was always high, so I doubt that was it. I quickly had to go and undo the changes, and I was so shook I made a Chrome extension that puts up some graphics and ominous 40k Mechanicus music whenever I go onto a live domain. Haven't made the same mistake since.

    • @red_amoguss
      @red_amoguss 3 months ago +1

      You were only lucky because the project had no proper, comprehensive CI/CD pipeline with unit tests.
      A competent tech company would have fired you over this.

  • @randomgeocacher
    @randomgeocacher A year ago +2010

    A helpful hack is to set the production terminal to red and the test terminal to blue, or something like that. Just a small helper to avoid human f’ups if you need to run manual commands on sensitive systems.

    • @tacokoneko
      @tacokoneko A year ago +67

      I second this; I also use colors to differentiate multiple environments.

    • @vaisakh_km
      @vaisakh_km A year ago +18

      It was as easy as changing the prompt color... but it makes a huge difference.

    • @Wampa842
      @Wampa842 A year ago +64

      I use colored bash prompts to differentiate machine roles - my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on red background. It's very hard to miss.

    • @darrionwhitfield46
      @darrionwhitfield46 A year ago +6

      I use oh-my-posh with different themes

    • @iUUkk
      @iUUkk A year ago +6

      Both database servers were actually used in production.

  • @Tmccreight25Gaming
    @Tmccreight25Gaming 7 months ago +68

    Ultimate workplace comeback: "At least I've never nuked the entire database"

    • @usellstech-ip2sg
      @usellstech-ip2sg 7 months ago +3

      Better to have someone who knows what to do, than someone who has never experienced it

    • @reyynerp
      @reyynerp 5 months ago +2

      they work remotely

    • @ZT1ST
      @ZT1ST 3 months ago +1

      Ultimate comeback to that comeback: "So far."

  • @ErikPelyukhno
    @ErikPelyukhno A year ago +31

    Your editing is phenomenal. What an insane series of events 😂 Glad GitLab was able to get back up and running. All that public documentation was refreshing to see, since it shows they were transparent about their continued mistakes and their recovery process.

  • @helmchen1239
    @helmchen1239 A year ago +1543

    I once accidentally ran chmod -R 0777 /var because I'd missed a dot before the slash (in a web project with a /var folder), which (as I've now learned) can make a Unix system totally unresponsive. I can very well understand how it feels the moment you realize what you have just done. It cost us a few hundred euros and kept two technicians busy for a weekend afternoon. Lessons learned; today we can laugh about it.
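The missing-dot failure mode (/var instead of ./var) can be caught by resolving the target and refusing anything outside the project tree before running a recursive chmod. A sketch; PROJECT_ROOT and the function name are assumptions, not an existing tool.

```shell
#!/usr/bin/env bash
# Guarded recursive chmod: refuse targets outside $PROJECT_ROOT, so a typo
# like "/var" instead of "./var" fails loudly instead of touching the system.
safe_chmod_r() {
    local mode=$1 target=$2 abs
    abs=$(readlink -f -- "$target") || return 1
    case "$abs" in
        "$PROJECT_ROOT"/*) ;;  # inside the project tree: allowed
        *) echo "refusing chmod: $abs is outside $PROJECT_ROOT" >&2; return 1 ;;
    esac
    chmod -R "$mode" "$abs"
}
```

The same resolve-then-check pattern works for rm -rf and any other recursive command; the point is that the path is shown and validated before anything irreversible runs.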

    • @Darkk6969
      @Darkk6969 A year ago +157

      Yeah, Unix/Linux will do what you tell it to do without any warnings. Pretty sure you sat there wondering why the command was taking so long to finish before you realized your mistake. Right then and there it's the "oh shit" moment. 😀 Luckily for me, I use VMs, so I can always revert to a previous snapshot.

    • @desoroxxx
      @desoroxxx A year ago +168

      the onosecond

    • @parlor3115
      @parlor3115 A year ago +6

      @@Darkk6969 What if you ran it on the host?

    • @FurriousFox
      @FurriousFox A year ago +51

      @@parlor3115 he doesn't, Noah only runs things in virtualized environments, making snapshots every minute

    • @aarondewindt
      @aarondewindt A year ago +5

      Why does it make the system unresponsive? I accidentally chmod 0777'd the entire "/" once and, well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors, so it only cost me time. Still, I never figured out why opening up the permissions would lock everything up.

  • @LordHonkInc
    @LordHonkInc A year ago +729

    "rm -rf" is one of those commands I have huge respect for, because it reminds me of looking down the barrel of a gun (or any similar example of your choosing): best case, you do it (a) seldom, (b) after a lot of strict and practiced checks, and (c) only if there's no alternative; unfortunately, the worst case is when you _think_ you're in that best-case scenario.

    • @givenfool6169
      @givenfool6169 A year ago +47

      I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky: I hadn't used sudo in that terminal at the time, so I got caught on a sudo check before it ran anything absolutely hell-inducing; just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple of commands after that was an "rm -rf ./", and it had already cd'd into root.

    • @henningerhenningstone691
      @henningerhenningstone691 A year ago +41

      @@givenfool6169 Lmao, it had never once occurred to me what havoc it could wreak if you accidentally source the bash history, since it had never occurred to me that that's even possible (because why the hell would you?!). But of course it is; what an eye opener!

    • @givenfool6169
      @givenfool6169 A year ago +17

      @@henningerhenningstone691 Yeah, I was trying to source my updated .bashrc, but my tab completion is set up to cycle through anything that starts with whatever's been typed (it even ignores case), so I tabbed and hit enter. Big mistake. I guess this is why the default completion requires you to type out the rest of the file name if there are multiple potential completions.

    • @Shadowserpant00
      @Shadowserpant00 A year ago +6

      @@henningerhenningstone691 bro idk wtf you're talking about and it's scaring me

    • @oliverford5367
      @oliverford5367 A year ago +1

      Do ll first, make sure you actually want to delete that directory, then press up and change ll to rm.

  • @MechMK1
    @MechMK1 A year ago +646

    For this reason, all our servers have color-coded prompts. Dev/testing servers are green, staging is yellow, prod is bright red. When you enter a shell, you immediately see whether you are on a server that is "safe" to mess around with or not.
    The advantage of doing this in addition to naming your server something like "am03pddb" is that you don't have to consciously read anything. It doesn't matter if you accidentally SSH into the wrong server: if you meant to SSH into a "safe" server, the bright red prompt will alert you that you are on prod, and if you meant to SSH into a prod server, then you'd better take the time to read which server it actually is.
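The scheme above can be wired up by deriving the PS1 color from the hostname. The patterns below ("pd" for prod, matching the "am03pddb" example, and "st" for staging) are assumptions; real setups typically ship something like this from /etc/profile.d on every box.

```shell
#!/usr/bin/env bash
# Color the prompt by server role, derived from the hostname (patterns assumed).
prompt_color_for() {
    case "$1" in
        *pd*) printf '%s' '\[\e[41;33m\]' ;;  # prod: yellow on red, hard to miss
        *st*) printf '%s' '\[\e[33m\]'    ;;  # staging: yellow
        *)    printf '%s' '\[\e[32m\]'    ;;  # dev/test: green
    esac
}

# Safe servers get a calm green prompt; prod screams before you can type anything.
PS1="$(prompt_color_for "$(hostname)")\u@\h:\w\$ \[\e[0m\]"
```

Because the color is computed from the hostname rather than configured per machine, a freshly provisioned prod box cannot accidentally come up with the "safe" colors.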

    • @tacokoneko
      @tacokoneko A year ago +13

      I agree, except there are only so many colors, so if you're manually controlling a lot of different machines (something that could maybe be avoided, depending on what the servers do) I believe it's important to use unique, memorable hostnames. The two servers in this story had hostnames one character apart and the same length, unless the names were changed for the artwork.

    • @seedmole
      @seedmole A year ago +9

      @@tacokoneko Yeah, imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily "confirm" that you're on the right one when you're not.

    • @makuru.42
      @makuru.42 A year ago +5

      Also, don't ever, ever work on the live database; a lesson I have learned the hard way many times.

    • @MunyuShizumi
      @MunyuShizumi A year ago +14

      @@makuru.42 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.

    • @makuru.42
      @makuru.42 A year ago +1

      @@MunyuShizumi You make a backup first. Yes, you need to maintain it, but not by making massive untested changes.

  • @TomSM5
    @TomSM5 1 year ago +10

    Nice to hear that they didn't fire him. He followed the correct procedure; some of the side effects, like the lag caused by the command, were unknown and could have been avoided with clear documentation. Also, when people are tired late at night, mistakes do happen, and anyone can be the victim of that.

  • @minsiam
    @minsiam 1 year ago +11

    When I was just starting at a company, I accidentally deleted all the ticket intervals from the database, causing all the tickets to close immediately and massively spam the admins. I was really terrified and didn't know what to do, and we didn't have any backup either. I apologized as much as I could and didn't make another mistake like this for years; sometimes mistakes make you work harder and be more careful in life.

  • @matthias916
    @matthias916 1 year ago +920

    I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later, but it felt so bad; I can't imagine what deleting an entire database would feel like.

    • @marco56702
      @marco56702 1 year ago +48

      terrible, sending the queries makes you shiver

    • @varunkhadse5869
      @varunkhadse5869 1 year ago +5

      I guess the panic was at the next level since both DBs were deleted.

    • @Rncko
      @Rncko 1 year ago +5

      It feels like taking a torch to a sea of banknotes... that belong to the company.
      (and the company is just about to release the year-end bonus)

    • @Atulnavadiya
      @Atulnavadiya 1 year ago +5

      I've had good hands-on experience with the SQL database at my company, but I'd check my query at least 10 times before executing it. We had more than 10 years of client data saved in the database.

    • @godjhaka7376
      @godjhaka7376 11 months ago +1

      @matthias916 hope you don't work there anymore. You need more experience with SQL and other IT technologies before you're allowed to touch it, so these highly preventable errors don't happen.
      You need to learn how databases work and how a backup/restore system works. Not to mention you should be automating queries anyway; that's what pwsh and DevOps are for. Fewer human mistakes. So sad and very amateurish to delete from databases without even backing up prior to making changes.

  • @xmorse
    @xmorse Год назад +274

    The real problem here is that you can delete any user data by simply mass reporting him

    • @technicolourmyles
      @technicolourmyles Год назад +46

      I'm seeing a lot of serious problems here... I guess this is why I never heard of GitLab before.

    • @PatalJunior
      @PatalJunior 1 year ago +6

      I highly doubt it's instantly deleted; probably someone made the decision to delete it (it could just have been an account spamming a bunch of mess onto repositories, and that isn't good either).

    • @FighteroftheNightman
      @FighteroftheNightman Год назад +49

      ​@@technicolourmylesthey're literally the 2nd largest enterprise git solution provider in the world.

    • @nonamepasserbya6658
      @nonamepasserbya6658 1 year ago +1

      When in doubt, it's probably 4chan.
      That low-hanging fruit aside, it's not a good thing if someone can just do that with a bot account. Maybe granting employees special anti-report protection could help until they find a more permanent solution against those trolls.

    • @Webmage101
      @Webmage101 Год назад +4

      ​@@PatalJunior6:21 literally says they fucked up by not making it check the details before deletion

  • @build-things
    @build-things 1 year ago +822

    As an engineer for a large company, you got me in the feels talking about asking for help or posting a PR and then seeing all the mistakes you made 😊

    • @stingrae789
      @stingrae789 Год назад +19

      In my previous position I worked closely with one guy and we used to joke about how we were using each other as a rubber duck :D.

    • @Gyvie-marie
      @Gyvie-marie Год назад +4

      The buzzword is SRE and postmortems are supposed to be blameless now...

    • @jillfizzard1018
      @jillfizzard1018 Год назад +2

      This is why you first mark the PR as a draft and read over the changes one more time before marking it as ready.

    • @mortache
      @mortache Год назад +2

      @@stingrae789 Damn I didn't know this thing has a name! I legit have done this before while discussing weird math problems

  • @jfbeam
    @jfbeam Год назад +125

    The #1 thing I learned WAY EARLY on in my IT career (three decades): Never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (and on a live db server, that alone will be enough of a mess to clean up.) As for backups, if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in.
    (The fact they're actively hiding ("lying") about this fiasco should be criminal.)

    • @kurenaigames5357
      @kurenaigames5357 1 year ago +17

      yea, renaming is the key. First rename, then set everything up, and then delete the renamed folder a few months later.
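
    The rename-first habit from this thread can be sketched in a few lines. The "pg-data" directory name below is a stand-in for the demo, not GitLab's actual data path.

```shell
#!/usr/bin/env bash
# Park-don't-delete: rename the directory instead of rm -rf'ing it.
# "pg-data" is an illustrative stand-in path.
set -euo pipefail

datadir="pg-data"
mkdir -p "$datadir"                         # demo setup only

trash="${datadir}.trash.$(date +%Y%m%d%H%M%S)"
mv -- "$datadir" "$trash"                   # instant and fully reversible
echo "parked at $trash; delete only after the replica is confirmed healthy"
# Undo at any time:  mv -- "$trash" "$datadir"
```

    A rename on the same filesystem is a metadata operation, so it is near-instant regardless of data size, and the wrong-server mistake becomes a one-command fix instead of a restore.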

  • @ToastyWalrus7
    @ToastyWalrus7 4 months ago +3

    The voiceover: outstanding. The editing: premium. The humor: drier than the Sahara *inhales* just how I like it.
    I've never hit the sub button so fast, keep 'em coming man!

  • @Ganerrr
    @Ganerrr 1 year ago +1143

    imagine a flagging system messing with some employee and managing to bring down the entire site by proxy

    • @batorerdyniev9805
      @batorerdyniev9805 Год назад +2

      What

    • @hypenheimer
      @hypenheimer Год назад +3

      Bot

    • @Ganerrr
      @Ganerrr Год назад +57

      @@hypenheimer beep boop

    • @Jacob-ABCXYZ
      @Jacob-ABCXYZ Год назад +3

      How to take down a site, the stealthy way

    • @kulled
      @kulled Год назад +6

      @@hypenheimer nah. it was probably a minecraft shorts bot account before he bought it though.

  • @TheDrTrouble
    @TheDrTrouble Год назад +561

    The best practice is to rename the directory or file to something else. Idk how the developers are so calm when using deletion commands

    • @setasan
      @setasan 1 year ago

      Well, when you live in a poor country, being underpaid by a fucking contractor company, with an overloaded team... shit happens

    • @schwingedeshaehers
      @schwingedeshaehers 1 year ago +7

      I "deleted" one of my programs with the cp command (I wanted to copy the config and the main file into a subdirectory, but forgot to put the directory after them, so it wrote the config over the main file).
      (I could get an older version of the file back from the SD card by manually reading the raw contents of that region and finding a copy with it, since it doesn't overwrite a save in place but writes it to a new location.)

    • @Funnywargamesman
      @Funnywargamesman Год назад +45

      On a home system? Absolutely. In a working environment? Doubtful. Maybe with a small company it would be acceptable, but creating an orphan database that may or may not contain sensitive information with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if that contains financial, medical, or government records.

    • @AndrewARitz
      @AndrewARitz Год назад +80

      @@Funnywargamesman you don't create it to keep it around forever, you create it as a failsafe for when you are doing potentially dangerous stuff, like deleting a whole database.

    • @Funnywargamesman
      @Funnywargamesman Год назад

      @@AndrewARitz I cannot tell you how many times "temporary" things become permanent on purpose, let alone the times people have said they are going to do something, like deleting a temp database they copied locally because their permissions didn't let them use it remotely, and then proceeded to forget to delete it. This will be especially true with the most sensitive databases, "because it's more important, so we should make a copy first, right?"
      Security is everyone's job and if you do (typically) irresponsible things like copying databases, "as a failsafe," chances are you are going to form a habit that means you will do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence you need to remember 50% of people are dumber than you and some of them get REAL dumb. If you set policy to say that it would be allowed, then THEY will do it.
      This is exactly why I said that home environments and really tiny companies could be different, there it could/would be fine. Chances are, if you don't know the names of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it, it's my opinion.

  • @jhyland87
    @jhyland87 1 year ago +68

    A few places I worked at as a Linux admin or engineer, the shell prompts (PS1) were color-coded. Green was dev, yellow was QA and red meant you're in prod. Worked like a charm.

    • @blackbot7113
      @blackbot7113 Год назад +4

      Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.

    • @jhyland87
      @jhyland87 1 year ago

      @@blackbot7113 Yeah, it's a very wise thing to do imo. Currently, I work at a bank, and I recommended we make the header in the UI of the colleague and customer portals different colors for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at and got a reply along the lines of "How about we just pay attention to the server and page we're on?"
      It's crazy because it's such an easy change to implement and almost entirely prevents anyone making such silly (yet catastrophic) mistakes.
      Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/redeployed, so any of those settings get wiped out entirely.
      For it to be permanent, it needs to be added in the Dockerfile.

  • @jamesrosemary2932
    @jamesrosemary2932 Год назад +11

    A long time ago we implemented a policy that absolutely nobody operates the production console alone.
    There always has to be someone else looking over your shoulder to point out oversights like the one in the video.

    • @gabrielbarrantes6946
      @gabrielbarrantes6946 1 month ago

      This is a good one, but I would add that no one should ever do anything on production by hand; it should all be handled by CI/CD pipelines and go through QA/peer review before any commit goes in.

    • @jamesrosemary2932
      @jamesrosemary2932 1 month ago

      @@gabrielbarrantes6946 I was talking about a time when Git and DevOps were just wet dreams.😁

  • @user-nj1qc7uc9c
    @user-nj1qc7uc9c 4 месяца назад +2

    i love how clear you've made it for us to tell whether commands are being run on DB1 or DB2
    if only it were that clear irl...

  • @HazySkies
    @HazySkies Год назад +408

    "Slams Ctrl+C harder than he ever had before"
    As a relatively new linux user, I felt that one.

    • @ss-to7ii
      @ss-to7ii Год назад +2

      As a new Linux user use the "-i" flag for "interactive" when using rm and a couple other commands.
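
    The flag tip above, sketched with GNU rm. Its capital -I variant prompts once before removing more than three files or recursing; the demo below answers "n" to the prompt, so nothing is deleted. The "demo" directory is made up for illustration.

```shell
#!/usr/bin/env bash
# GNU rm's -I flag prompts a single time before deleting more than
# three files or anything recursive -- quieter than -i, still a tripwire.
mkdir -p demo
touch demo/a demo/b demo/c demo/d

# Answering "n" to the one prompt aborts the whole removal.
echo n | rm -I demo/a demo/b demo/c demo/d 2> /dev/null

ls demo                      # all four files survived
# Many admins make it the default:  alias rm='rm -I'
```
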

    • @KR-tk8fe
      @KR-tk8fe 11 месяцев назад +1

      As a windows user, I was very confused

    • @LC-uh8if
      @LC-uh8if 10 months ago +6

      @@KR-tk8fe Ctrl+C. On most Unix/Linux based CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (interrupt) to the foreground process (active program), which usually causes the program to terminate, though it can be programmed to handle it differently. It's basically the Oh Shit or This Is Taking Too Long button.

    • @MrCmon113
      @MrCmon113 8 месяцев назад +4

      ​@@LC-uh8ifIsn't that the same in Windows terminals? 🤔

  • @daigennki
    @daigennki Год назад +123

    Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.

  • @karmatraining
    @karmatraining Год назад +65

    An old best practice that so many people these days seem to forget or never have heard about is that every week, you try to pull a random file from your backup system, whatever that is. (Or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this - so many people think they set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
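
    The pull-a-random-file drill above can be sketched with tar standing in for whatever the real backup system is. All paths below are made up for the demo; the point is the shape of the check, not the tooling.

```shell
#!/usr/bin/env bash
# Weekly spot-check sketch: restore one random file from the latest
# archive and diff it against the live copy. Paths are illustrative.
set -euo pipefail

mkdir -p live
echo "hello" > live/a.txt
echo "world" > live/b.txt
tar -czf backup.tar.gz live                  # stand-in for the backup job

victim=$(tar -tzf backup.tar.gz | grep -v '/$' | shuf -n 1)
scratch=$(mktemp -d)
tar -xzf backup.tar.gz -C "$scratch" "$victim"

diff "$scratch/$victim" "$victim"            # any drift fails loudly here
echo "OK $victim" > spotcheck.status
```

    Run from cron, a failure here tells you the backup is unrestorable weeks before you actually need it, which is exactly the gap GitLab hit.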

  • @swaggy3987
    @swaggy3987 Год назад +4

    What's far more impressive about this whole situation is how calm the engineers were in handling the situation. That to me is far more valuable than having engineers that are too gun-shy to make prod db changes at 12AM and panic when something goes wrong.

  • @ChandravijayAgrawal
    @ChandravijayAgrawal 10 months ago +3

    One thing I learned from all this: never run a delete command lightly, and if you do, paste a screenshot of the command in your group chat before running it.

  • @christopherg2347
    @christopherg2347 Год назад +123

    If you are working with multiple shells, VMs, remote sessions or the like - make sure they are color coded based on the machine you are running against!
    It can be as simple as picking a different color scheme in windows. But it is just too easy to mess up when all the visual difference is a single number, somewhere in the header.

    • @neekfenwick
      @neekfenwick 10 months ago +1

      Yep, I came here to say this. For any serious system I connect to, I use different params for my session; in my case I like old-fashioned xterm, something like: alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'"
      It's very useful to see the green, red, blue etc. colouring and be sure which system you're talking to.

    • @Kalmaro4152
      @Kalmaro4152 8 месяцев назад +5

      It's very nice that Linux shells actually support setting session colors

  • @DomskiPlays
    @DomskiPlays 1 year ago +286

    Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema are safe in case someone accidentally deletes everything, and they told me everything is backed up daily. Kinda scared that I don't know how or where this is happening, except that it's a job.

    • @indyalx
      @indyalx Год назад +60

      I checked my database backup script a couple days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately. Then went and fixed the issue and made sure it would notify if there was no backup in 6 hours.
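
    A sketch of the "notify if there was no backup in 6 hours" check described above. The directory name, file pattern, and alert channel are placeholders; the demo plants a deliberately stale dump so the alert path fires.

```shell
#!/usr/bin/env bash
# Freshness-alert sketch: complain when no dump is newer than 6 hours.
# Directory, pattern, and mail address are hypothetical placeholders.
set -u
mkdir -p backups
touch -d '10 hours ago' backups/nightly.dump   # simulate a stuck backup job

# -mmin -360 = modified within the last 360 minutes (6 hours)
if [ -z "$(find backups -name '*.dump' -mmin -360 -print -quit)" ]; then
    echo "ALERT: newest backup is older than 6h" | tee alert.log
    # a real version might pipe to: mail -s 'backups stale' oncall@example.com
else
    echo "backups fresh" > alert.log
fi
```

    Checking the age of the output, rather than the exit code of the backup job, also catches the case where the job silently stopped being scheduled at all.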

    • @CMDRSweeper
      @CMDRSweeper Год назад +47

      The next question is... "Have you tested the backups?"
      If they can't say for sure WHEN they were tested... Be very afraid...

    • @indyalx
      @indyalx Год назад +8

      @@CMDRSweeper we load the prod backup into staging nightly

    • @forbiddenera
      @forbiddenera Год назад +2

      6 hour full backups, mirroring/replicas, multiple servers and daily volume backups..

    • @robertbeisert3315
      @robertbeisert3315 Год назад +4

      "Trust me, bro" only works in Dev. Every other environment needs regular verification.

  • @matthewstott3493
    @matthewstott3493 Год назад +132

    Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious but if you don't test you will experience catastrophic consequences.

    • @-TheBugLord
      @-TheBugLord Год назад +3

      Exactly. Just like a dam, if there is a weak-point at the bottom, it all may come crumbling down.
      There needs to be a lot of redundancy when it comes to backups. Especially when it comes to a big server. An engineer accidentally removing a database should not have that catastrophic of consequences.

    • @esa4573
      @esa4573 1 year ago +2

      Yeah, there is (or should be) a general rule about being ready for stuff like that. If your fuckup is non-recoverable or a massive pain to fix, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.

  • @Dobaspl
    @Dobaspl 10 months ago +2

    Even before I started working at one company, an IT specialist there deleted the directories of the new CC-supporting system, shortly after it went into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically lived at work, recreating the environment almost from scratch. :D

  • @derpnerpwerp
    @derpnerpwerp 1 year ago +8

    This reminds me of all the times I have been in the wrong SSH session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is you start to just ignore them after a while. It's also kinda dangerous when manual, potentially damaging work becomes fairly routine.

  • @pyqio
    @pyqio 1 year ago +130

    All things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but it made many other companies reexamine their fault management. For example, my uni professor told us about this incident, and it helped us grasp the importance of backups and testing.

    • @gblargg
      @gblargg Год назад +16

      I think the biggest issue was losing 6 hours of commits and comments.

    • @gblargg
      @gblargg Год назад +24

      @@kookie-py Agreed, virtually all of them will have the commits locally as well. Just noting that the data loss is a bigger deal than mere downtime.

    • @_Titanium_
      @_Titanium_ Год назад +6

      This is why programming in general is great, nobody dies if you fuck up. (Obvious exceptions, medical, aviation etc)

    • @ItsJohnBomb
      @ItsJohnBomb Год назад +4

      ​@Titanium "nobody dies" (except the people who would die)

    • @lolatmyage
      @lolatmyage Год назад

      @@_Titanium_ cough cough therac software lmao

  • @WackoMcGoose
    @WackoMcGoose Год назад +344

    As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) is enough to save you from porking a prod database.

    • @Skyline_NTR
      @Skyline_NTR Год назад +4

      This interests me. Got any resources/links to set that up (dangerous commands temporarily allowed by time-limited permissions via LDAP)

    • @WackoMcGoose
      @WackoMcGoose Год назад +15

      @@Skyline_NTR Afraid not, it was several pay grades above me both in job role and in coding knowledge, and I lost access to the company slack back in december so I can't really ask anyone...

    • @ProgrammingP123
      @ProgrammingP123 Год назад +1

      @@WackoMcGoose Ahh were you laid off also??? I was lol

    • @WackoMcGoose
      @WackoMcGoose Год назад +4

      @@ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze a month later so I had no hope of transferring...

  • @markh3684
    @markh3684 Год назад +25

    Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts. The backup process familiarity, backups not going to S3, Postgres version mismatches, insufficient WALs space, alert email failures, diligence on abuse deletes... These were all things that could have been and should have been caught way before the actual incident.

  • @jim2lane
    @jim2lane Год назад +4

    OMG, we have all been there haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. And then I realized that I had done the exact opposite, and just deleted three complete days of coding - and would now have to recreate them from scratch 😒😭

  • @74HC138
    @74HC138 Год назад +1

    "He slammed Ctrl-C" - I can feel the cold gripping feeling you get when you realise you've just caused an accidental catastrophe...

  • @CarrotCastle
    @CarrotCastle Год назад +23

    One of my first jobs in IT was working as a big data admin and this video allows me to re-live the spicy moments of that job but with none of the responsibility attached

  • @TonytheCapeGuy
    @TonytheCapeGuy Год назад +43

    I can just imagine the relief that team felt when they find SOMETHING that they could use to restore files.

  • @wojtekpolska1013
    @wojtekpolska1013 1 year ago +254

    respect for not firing the guy. It was obviously just a small mistake, and it wasn't his fault that the backups didn't work; it shouldn't be possible for one command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p

    • @yerpderp6800
      @yerpderp6800 Год назад +127

      If they fired him they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure the old employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson but a lesson nonetheless.

    • @tuxie93
      @tuxie93 Год назад +67

      Yep and he'll train new employees making super sure to emphasize triple checking before deleting from prod.

    • @D00000T
      @D00000T Год назад +5

      That’s Unix systems for you. Their open nature makes them super useful for a lot of things but it’s also so easy to break them.
      Plus that old trick of telling new linux users that sudo rm -rf is a cool easter egg command wouldn’t be the same with more safeties and preventions.

    • @BitTheByte
      @BitTheByte Год назад +3

      What if I want to delete everything? I don’t want a baby proofed OS. I want an OS that does what I want. Even if I want to burn it all

    • @wojtekpolska1013
      @wojtekpolska1013 Год назад +8

      @@BitTheByte why buy a computer at that point lol

  • @tatsuuuuuu
    @tatsuuuuuu 1 year ago +3

    Linux can actually, in certain circumstances, "undo" this kind of disaster. With ZFS as the file system you can revert to a previous snapshot of the filesystem; it's like versioning, but for the entire file system. Of course it takes up a fair bit of space, so it's not done constantly; software installs are automatic snapshot points, for instance, but you can trigger one manually when you think you're about to do something you're unsure about. (And since the selection of save states is at GRUB, yes, an unbootable system is still recoverable if you still have GRUB.)

  • @sortebill
    @sortebill Год назад +3

    your content is really good, please keep up making these mini documentaries about tech failures!

  • @hummel6364
    @hummel6364 1 year ago +35

    In my vocational school I had a subject simply called "Databases", and our teacher there once told us a story about how one of his co-workers lost his job.
    In essence he did everything right: created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script. When the main database failed a year or two later, they tried restoring from this backup only to find that there was none.
    It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.

    • @thewhitefalcon8539
      @thewhitefalcon8539 Год назад +4

      I guess that could have been solved by testing the backups. Install the database software on a spare server or just your own workstation, and then restore the backup onto it

    • @hummel6364
      @hummel6364 Год назад +7

      @@thewhitefalcon8539 well the backup ran properly for years, he just never thought that the UUID might change

    • @thewhitefalcon8539
      @thewhitefalcon8539 Год назад +1

      @@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.

    • @yerpderp6800
      @yerpderp6800 Год назад +7

      ​@@hummel6364 yeah he kind of deserves to be fired...feel like it should be common sense the hdd could fail, no good excuse to not expect that. You should almost never hardcode stuff, not sure why they thought it was okay to hardcode the uuid of a drive that would one day fail.

    • @hummel6364
      @hummel6364 Год назад +1

      @@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.

  • @streetchronicles5693
    @streetchronicles5693 1 year ago +65

    Yesterday I was added to a support team because we're getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing at this story.

  • @daryl9915
    @daryl9915 1 year ago +35

    A couple of jobs ago, I had a colleague who managed to do worse than this.
    I think they were playing about with learning Terraform and managed to delete the entire account: prod servers, databases, the dev/QA servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny, trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap.

    • @lashlarue7924
      @lashlarue7924 Год назад

      😱😱😱😱😱😱😱😱😱😱😱😱😱😱

  • @ComfyCherry
    @ComfyCherry Год назад +4

    4:40 never assume you have enough backups, I've been taught this 100 times and I don't even have anything important to backup (for now)

  • @edc2186
    @edc2186 Год назад +79

    As a dev for a large company who has been on a number of late night calls, I literally gasped at this. But good on the team to work through the issue, and good on management to keep these guys around

  • @SteveAcomb
    @SteveAcomb Год назад +108

    Great video! Well produced content about software engineering war/horror stories are exactly what I’ve been looking for, keep it up!

  • @bennythetiger6052
    @bennythetiger6052 1 year ago +347

    This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! This is very insightful as to what can happen in these kinds of environments, as well as what measures can prevent major fails like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂

    • @blazi_0
      @blazi_0 1 year ago +7

      Bro, let's also not forget the damage that had already been done: the server was down for like 18 hours, and thousands of PRs, comments, issues and projects were deleted permanently. This should be a bigger deal.

    • @mrsharpie7899
      @mrsharpie7899 Год назад

      I'd love to see the USCSB do an animation on this incident lmao

  • @CryptbloomEnjoyer
    @CryptbloomEnjoyer 1 year ago +1

    I know the exact feeling of terror the moment you realize the command you just ran is about to cause havoc

  • @tw5222
    @tw5222 1 year ago +4

    I thought it was just a meme back when I saw this on Twitter. Is this for real?! OMG. Everything aside, big applause to GitLab for not blaming a single person when this happened; such a nice company to work for.

  • @hchris96
    @hchris96 Год назад +18

    I didn’t realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future
    I am gonna share this with some of my coworkers

  • @ChosenOne-wz6km
    @ChosenOne-wz6km Год назад +7

    This video is awesome! The step by step analysis of what occurred during the outage coupled with the story telling format helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!

  • @bmo3778
    @bmo3778 Год назад +16

    I barely understand anything here, but all I can say is massive thanks to the team who have worked hard, advancing our computer tech to the current state we have!

  • @cc3
    @cc3 1 year ago +2

    I deleted the main site from our backend in my first month as a full stack developer. Fortunately I figured out how to rebuild the Apache server and clone the repository, but I definitely worked well past my hours that day and the stress was crazy.

  • @NickCharles
    @NickCharles Год назад +2

    I absolutely HATE how database backups and a lot of other common tasks just don't report any progress in a terminal when you run the command. It's agonizing because you don't know if the command is working or hung on something. For example, a mySQL database I work on takes a good hour or so to backup. During that time, we get NOTHING from the terminal output, so we have to monitor separately using IO and CPU tracking tools to make sure the SQL instance is still doing something.
    As for rm -rf, I've made it a habit to either take a manual snapshot immediately before running it on any production data, or more often I'll just make a copy of the directory right before to a temp directory, that way there is always a copy of the data before I remove it from where it normally resides. It's saved me from stupid decisions more than once...And always, ALWAYS verify backups exist where you expect them and the contents look complete before you make sweeping changes! We like to deploy a backup to a test server before major upgrades, just to make sure it restores as expected. It can take an extra day or so to do, but it's a good verification step that ensures any issues with our backups are caught during regular maintenance, instead of in the middle of a crisis.
    And wrong window syndrome...yeah it sucks. I've restarted or shut down the wrong servers more than once. I'm sure my mild dyslexia doesn't help in that regard...

  • @justdoityourself7134
    @justdoityourself7134 Год назад +68

    Having a live screenshare with team members watching might seem a little wasteful. But for critical procedures like this, it is well worth the added cost.

    • @Navak_
      @Navak_ Год назад

      Most people don't see the importance of such extreme level of caution until it's too late. It's like handling a firearm.

  • @theultimatetrashman887
    @theultimatetrashman887 1 year ago +38

    the realization of what you've done before the command even finishes is so cruel, and it happens so often. That's why when you're doing a job you always do it slowly but correctly.

  • @jeromesimms
    @jeromesimms Год назад +24

    Wow! This was great and so interesting. I'm so glad I found this channel. I would love to hear more in depth analysis of software engineering fails

  • @james123428
    @james123428 10 months ago +1

    Very interesting and easy to understand for layman. I’m sure most of us could also learn from the mistake even if we don’t deal with databases or code
    PS - If you meant to blur/delete the names at 9:50, you missed the “replying to” part.

  • @ChessHistorian
    @ChessHistorian 1 year ago +4

    Thanks Kevin, and thanks GitHub. We still love you, and your recovery effort seemed great (and kindly presented by Mr. Fang here!) and altogether as humane as possible. The recovery stream to calm people down was a great idea, and I bet it helped a lot of people to not freak out. I would have been freaking out, and the fact that you guys didn't, but came through methodically, is very inspiring.
    I hope the lesson is: don't give AI filters baked into databases any Actually_Important responsibilities. I'm paused at "lesons laenrd" and now unpausing...

    • @mdude3
      @mdude3 1 year ago

      GitLab not GitHub.

    • @defetya
      @defetya 10 months ago +3

      were you having a stroke?

  • @CharlesChacon
    @CharlesChacon 1 year ago +19

    I’m pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I’ve ever followed and ended up motivating me to learn a ton about databases, cloud practices, devops, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for transparency

  • @blank001
    @blank001 1 year ago +6

    One strict rule I always follow when connecting to prod servers via SSH or a DB UI agent (pgAdmin) is that I always use different background colors:
    Red for prod
    Green for staging
    Black for test and local
    + double checking every command
    You can never be sure enough

  • @johnthomas2970
    @johnthomas2970 1 year ago +68

    This gives me good insight on why our tech team keeps breaking shit….

  • @spacemanmat
    @spacemanmat 1 year ago +8

    Two things to remember:
    1. Always backup before you start a change even if you have an automated backup system.
    2. Audit your recovery procedures.
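
A hedged shell sketch of rule 1 (all file names here are illustrative stand-ins; a real run would point at actual pg_dump or mysqldump output):

```shell
#!/bin/sh
# Sketch: never start a risky change until you have *seen* a fresh,
# non-empty backup. Paths and file names are illustrative.
set -eu

mkdir -p backups
echo "-- PostgreSQL database dump" > backups/db1-prechange.sql  # stand-in for real dump output

dump="backups/db1-prechange.sql"

# A zero-byte file is exactly what a silently failing cron job produces:
[ -s "$dump" ] || { echo "backup missing or empty, change aborted" >&2; exit 1; }

# Cheap sanity check that the file is actually a dump, not an error page:
head -n1 "$dump" | grep -q "database dump" || { echo "not a dump, change aborted" >&2; exit 1; }

echo "pre-change backup verified: $dump"
```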

  • @thetophattedanon
    @thetophattedanon 1 year ago +1

    I do not know how I got here and I don't get most of the video, but I am absolutely lovin' it, as it's bloody entertaining.

  • @Lawsuited
    @Lawsuited 1 year ago +8

    I've always been paranoid when working in Prod. Always make it a point to have at least the Ops Lead on a screen-sharing session where I show what I'm doing while requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy ohh boy does it make me feel safer.

    • @isaiahsmith6016
      @isaiahsmith6016 1 year ago

      It may be slow but look at it this way. You're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.

  • @knightoflambda
    @knightoflambda 1 year ago +4

    I have something analogous to the 24-hour rule for shopping when I'm doing anything sketchy in a prod environment: before I hit RETURN and irreversibly commit to an operation, I leave my keyboard, stretch, go for a walk, grab coffee, etc. It helps a lot with the tunnel vision. 10 minutes AFK can save you hours of pain later.
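
In the same spirit, the pause can be made mechanical. A hypothetical wrapper (not from the video) that refuses to proceed until you type the current host's name back, which also catches wrong-window mistakes:

```shell
#!/bin/sh
# Hypothetical confirmation gate: forces a deliberate pause and an exact
# host-name match before anything irreversible runs.
confirm_host() {
    host="$(hostname)"
    printf 'You are on %s. Type the host name to continue: ' "$host"
    read -r typed
    if [ "$typed" != "$host" ]; then
        echo "host mismatch, aborting" >&2
        return 1
    fi
}

# Usage (illustrative):
#   confirm_host && rm -rf /var/opt/databases/old-cluster
```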

  • @GeorgeTsiros
    @GeorgeTsiros 1 year ago +7

    I like how the terminal has the decoration of some linux-y windowmanager, but the message boxes are winXP xD

  • @_DAN11L_
    @_DAN11L_ 1 year ago +2

    To prevent confusion in terminals with nearly identical host names, I recommend changing the PS1 variable so you can clearly see which shell you are in:
    red color:
    [FUCKING PROD DB]:~/$
    Green color:
    [FUCKING BACKUP DB]:~/$
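
For anyone who wants to wire the idea above into their shell for real, a sketch with actual escape codes; the host-name patterns are illustrative, and the helper function exists mainly so the choice of prompt is easy to test:

```shell
#!/bin/sh
# Pick a PS1 by host name: loud red background for prod, green for
# backup/staging, plain otherwise. Patterns are illustrative.
prompt_for() {
    case "$1" in
        *prod*)             printf '%s' '\[\e[1;41m\][PROD \h]\[\e[0m\]:\w\$ ' ;;   # red bg
        *backup*|*staging*) printf '%s' '\[\e[1;42m\][BACKUP \h]\[\e[0m\]:\w\$ ' ;; # green bg
        *)                  printf '%s' '[\h]:\w\$ ' ;;
    esac
}

# e.g. in ~/.bashrc:
PS1="$(prompt_for "$(hostname)")"
```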

  • @loupassakischristos9758
    @loupassakischristos9758 1 year ago +2

    I experienced something similar a couple of years ago; it's the kind of thing you think can only happen to others, but yeah... I had to delete some specific data from the production database, so I created the SQL requests and executed them against the testing environment. The dataset between those databases is completely different, and the requests passed without any issue. But when I ran them against production they were taking way too long, and then I realised. I almost had a panic attack. I reported the incident immediately and was mentally prepared to be fired. Fortunately we could retrieve most data from a backup and the lost ones were not that big of an issue. I still work in the same company :p

  • @jumbo_mumbo1441
    @jumbo_mumbo1441 1 year ago +14

    Honestly the worst part of this was all the backup failures

  • @eboubaker3722
    @eboubaker3722 1 year ago +25

    Wow, the amount of stuff I learned here is huge. Please make more reviews like these; I subscribed and turned on notifications, please don't disappoint me

  • @UltravioletMind
    @UltravioletMind 1 year ago +11

    absolute nightmare. loved every min of this

  • @andy02q
    @andy02q 1 year ago +2

    The server was:
    Delete 5GB userstuff in like 5 minutes? Rough; might cause some desync.
    You want to pg_basebackup? Sure, just give me a few minutes before I even start.
    rm -rf important stuff? Clear as day, I'll permanently chunk through > 300GB in that one second before you try to cancel.

  • @purrfekt
    @purrfekt 1 year ago +1

    Team member 1 would be a valuable hire. You can be sure he will never make the same mistake again.

  • @MrB10N1CLE
    @MrB10N1CLE 1 year ago +4

    3:52 it was at this moment when the viewers collectively scream, transcending space-time and raising a cosmic choir of dread and regret.

  • @jonix24mejor
    @jonix24mejor 1 year ago +8

    And yes... this is exactly the reason why I didn't study programming / engineering in college, and instead opted for graphic design / communication.
    If I write or design something wrong and it gets published, well, at worst the publication stays published as a reminder of my mistake. In programming, all it takes is one finger slip, one misremembered detail, or a simple distraction, and you can absolutely wipe an entire company's network infrastructure out of existence.

    • @heroe1486
      @heroe1486 3 months ago

      Weird and almost unbelievable reason. Not every software-related position gives you the possibility to destroy the database or mess up critical things in production, and even when it does, there are tons of safety nets, human and software, to prevent that. If you can completely wipe everything from a one-finger mistake, it just means you were doing things wrong from the start.
      Btw, if you extrapolate like that and imagine it's that easy to mess things up, then also take the extreme example of making a pretty bad blunder about a popular brand in your designs/writing and having it printed in millions of copies already distributed to the public, costing your company a bunch of money.

    • @jonix24mejor
      @jonix24mejor 3 months ago

      @@heroe1486 first of all, both mistakes can cost a company a bunch of money, but only one messes with years' worth of data.
      Second, you are being too hopeful, maybe because of where you live. I live in a third-world country, and such a thing as a good network structure (one that would prevent something like that from happening) is not that common here.
      And third, a visible error on something printed can be obviously seen and edited, but good luck checking lines upon lines of code to find an error.
      Those are just two very different jobs with very different risks. Maybe you can make your company lose more money with graphic design, but you can't, at any point, wipe their whole data, nuke their systems, and so on, because of a mistake while doing your job.
      It's like comparing being a carpenter and being a chef: both work with sharp objects, and both can hurt themselves pretty badly while working. But you can get your whole arm cut off way more easily as a carpenter because of a simple mistake, whereas to burn yourself badly as a chef you really need to be doing something directly wrong or be totally distracted, rather than just losing focus for a moment at a big power saw.

  • @hououinkyouma2426
    @hououinkyouma2426 1 year ago +24

    Can't wait for part 2

    • @kevinfaang
      @kevinfaang 1 year ago +14

      Could just be missing the sarcasm, but if you're referring to the ending, Google Bard isn't exactly the best at being factually accurate...

    • @Xanhast
      @Xanhast 1 year ago +1

      @@kevinfaang maybe he's being ominous :o

  • @shashankh7768
    @shashankh7768 1 year ago

    The storytelling/editing is unmatched. Hands down the best docu/short movie on YouTube 😂!

  • @glennog
    @glennog 1 year ago +1

    Been there, done that, only in my case it was taking down the main network interface on a Solaris YP server used by an entire site of Solaris servers and workstations. The entire site ground to a halt in an instant. I didn't have access to the DC to get local access, either, so I had to make a red-faced confession to my boss for him to make the 2 mile drive to the secure DC.

  • @Socsob
    @Socsob 1 year ago +8

    This is so cool to know the inner workings of a team like this

  • @christianknuchel
    @christianknuchel 1 year ago +5

    Another trick is this: Have different color/background profiles in your terminal application for different servers, or at least types of servers. That way, you're more likely to notice that you're typing in the wrong place right now.

  • @lhpl
    @lhpl 1 year ago +15

    Of course I've made a few big fuckups myself in over 30 years as a systems administrator. But this really proves the point that developers should have absolutely _no_ access to production servers.
    This is what makes me sad about devops: originally I believe it meant having some seasoned sysadmins cooperate closely with the developers. But now it seems to have become "who needs grumpy sysadmins; they just block all progress, hate when stuff changes, and call for formal change requests, documented change procedures, and other boring stuff. Just let the developers run the entire show, they'll figure it out, and new releases and fixes will be applied much faster."
    Developers and sysadmins have completely opposite objectives. Developers produce new code, i.e. changes; they want to deploy faster, more often, even automatically. Sysadmins want stability and consistency, so things should not change. They see any change as dangerous, a potential disaster.
    Oh, and if you haven't tested restoring your backup, you don't _have_ a backup.
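
To make that last line concrete: even short of a full restore rehearsal, a machine check that the archive decompresses end to end and is non-empty would have caught the silent failures in this incident. A minimal sketch with a made-up file name:

```shell
#!/bin/sh
# Minimal "is this backup even readable?" check. A real pipeline would
# follow this with an actual restore into a scratch instance.
set -eu

echo "-- fake dump contents --" | gzip > db1.sql.gz  # stand-in for a real backup

# 1. The archive must be readable end to end.
gzip -t db1.sql.gz || { echo "backup archive corrupt" >&2; exit 1; }

# 2. It must decompress to something non-empty; a zero-byte "backup"
#    from a silently failing cron job dies here instead of during a crisis.
bytes="$(gzip -dc db1.sql.gz | wc -c | tr -d ' ')"
[ "$bytes" -gt 0 ] || { echo "backup decompresses to nothing" >&2; exit 1; }

echo "backup readable ($bytes bytes uncompressed)"
```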

    • @TheDaern
      @TheDaern 1 year ago +3

      A lot of people here are preaching a whole lot of "this wouldn't have happened if..." nonsense, so it's nice to hear someone being honest about their own mistakes. The honest truth is that if something like this has never happened to you in your career, it probably means that you never reached the point of being trusted enough to be in a position to screw everything up.
      I will say (and, sadly, this is from experience) that no roller coaster on earth can ever simulate the feeling of sinking in your guts that comes when you realise that you've just done a very, very bad thing and that in the next 30 seconds a lot of alarms are about to sound...

    • @lhpl
      @lhpl 1 year ago +2

      @@TheDaern Absolutely right.
      There's also a lot of people suggesting various "aids" to prevent mishaps. (Myself included.) Such things may be useful, but there is a danger that one day your safe aliases, or whatever safety net you implemented, aren't loaded for some reason.
      Problem is that the commands you use in a "safe" environment, where there is little risk of damage, are the same as in critical environments. Experienced sysadmins might spend a lot of time on systems where the wrong command can be fatal, and as a consequence tread lightly and carefully. Whereas a person who mainly works as a developer may have a more casual approach, thinking "trial-and-error" is a good strategy. I've seen a very experienced developer log on to a production server (as root), type tcsh (because he wanted command editing and history, which the - at that time - default ksh88 didn't offer, at least not without a little extra work), grep a directory of log files, and direct the output to a file. As a result of how tcsh handles globbing, his command got to grep its own output, resulting in an infinite loop and a filled log partition.
      Sysadmin work is - imo - worse than brain surgery, because when you do brain surgery, you are keenly aware that the life of the patient is at risk, and there is just no room for errors. For sysadmin work there is no clear difference between working in a forgiving environment, and in an environment where errors may affect - even harm or kill - hundreds or thousands of people if your luck is really bad.
      The only reasonably safe approach imo is to always act as if every keystroke is potentially lethal. The worst aspect of this is that most of the time, a careless approach works fine. I don't know how many times I've watched a senior developer do stuff, typing commands fast on a production server, while I was clenching my butt to avoid soiling my trousers - yet nothing bad happened. But just a little typo could have meant disaster.

    • @antdok9573
      @antdok9573 1 year ago

      You've got this backwards. First, your claims of a developer being more casual are without any basis, seemingly all anecdotal. Personally, I've dealt with a high-strung and extremely serious platform developer who didn't trust the sysadmins to manage physical volumes in production. I think personality has nothing to do with this.
      Sysadmins typically avoid process because of all the red tape that normal change controls/requests cause. Developers in great companies typically have a well-defined (and very short) development cycle for writing, testing, and deploying code. Developers in large and bureaucratic companies will have well-defined processes but too much unnecessary red tape. There weren't any developers consistently involved in this outage; had they been more involved, it would've certainly been avoided. But the devs set up the backups once and never tested them again. That's the largest problem! Why did this happen? I have no idea. The engineers working this at night clearly were tier 2 support engineers with little knowledge of the overall environment!
      The lack of development in this area (environment configuration/IAC) is why this became an outage! This outage looks like it was entirely caused by the sysadmin job, but they're not the ones responsible. It's the management team most likely, whoever set up the teams for managing production env.
      Take it from a guy who worked in sysadm! You CAN be sysadm + a developer, and probably have to be nowadays. The old title of system administrator brought on by the decades old industry is dying. It requires more knowledge in the tech stack now and less about memorization of Linux commands. Else, if people fail to adapt, anyone will run into an outage just like this one.
      WHY would you ever take the responsibility and risk of manually running commands when some clever design and planning can cover most, if not all, of these kinds of situations. The senior dev who SSHed into production and starts hammering away at the keyboard is NOT doing dev work nor sys admin work, he's now being careless.

    • @lhpl
      @lhpl 1 year ago

      @@antdok9573 No, I haven't got it backwards. And I don't think I said that you can't be both a developer and a sysadm, or a devop for that matter. It seems you and I have vastly different experiences as sysadmins. Here is an example from 20 years ago, when the company I worked for was introducing the use of BEA WebLogic Integration. I was actually a developer at that time, but I had a long sysadm background. When deploying the first pilot test webservice to the test server of the operations department (after having tested it to work fine on the test server in the development department - with almost watertight shutters between the two departments, and certainly NO developer login to any servers in op), it turned out that it failed, because WLI assumed it could connect freely to the Internet. Not in this company, for sure! The development test servers weren't as strictly set up, so there was no problem for WLI to retrieve an XML DTD from its official URL. The operations' test server, however, was locked down completely and would not open connections to the outside. Now tell me, what would you do to solve this tiny problem?

    • @antdok9573
      @antdok9573 1 year ago

      @@lhpl I would fix the team structure. There are communication issues, regardless of people's occupations. There shouldn't be any distinction between dev & operation teams unless there's some archaic requirement for it (compliance), and even then, you can probably avoid rigid team structures like these.
      I don't need to know any of the technologies to see there's a leadership issue toward the top of the teams.

  • @spaghettimkay5795
    @spaghettimkay5795 13 days ago

    Had this video on in the background. The Teams notification sound really caught me off guard...

  • @AndreGreeff
    @AndreGreeff 1 year ago

    I must say, I heard many stories about this.. but that was a very nice summary of the nitty-gritty details, thank you. (:

  • @ronin2963
    @ronin2963 1 year ago +4

    Don't work when exhausted, tag out for fresh eyes, and keep separate physical laptops for primary and secondary systems