I mean backing up a petabyte of stuff with a cloud provider is so fucking expensive, and you'd pay huge amounts for bandwidth (even at 200MB/s it would take forever), so it's not really a realistic option
Right? I've worked in infrastructure for years, the consumer end videos are awesome and insightful, but the server/infrastructure videos frustrate me so much sometimes...
That's what I like too. There are only so many gaming hardware reviews I can stand to watch; they're all much of a muchness to me. I really enjoy the infrastructure and unusual project videos the most
@@MajesticBlueFalcon you would be surprised how many companies mess up. What if the IT dept didn't do their job properly and skipped over certain things in order to save time?
@@AegisHyperon not true. I do sysadmin for small and midsize businesses, and you wouldn't believe the kinds of things I've had to take over. Usually it's either some guy who does something else at the company and thinks he knows stuff but doesn't, or the work of some usually very mediocre external company.
@@robertt9342 that was my first thought when he talked about reusing the old vault. It would be a new vault built from the old vault. So... New Old Vault [NOV for short]
As a full-time sysadmin I always wondered how you guys sustained your data without a real backup plan. As it turns out, you didn't. Really sorry to hear that, guys! That's exactly why people like me get hired: companies think they can do it on their own until they lose critical data to misconfigs and missing maintenance. It hurts to learn it the hard way. I really recommend you create offline backups to tape storage for all your archived content. And respect for admitting you got it wrong so others can learn! Keep on making such great content!
I'm not a sysadmin, just a network guy that dabbles in sysadmin stuff and yeah, it blew my mind to hear what happened here. If they open a spot to hire an IT guy I think I'm gonna apply :D
I'm not a sysadmin either, but I'm also surprised they didn't catch the offline drives earlier. Even without the regular data scrubs, basic monitoring should have caught that. As for tape backups, I agree but also advise caution that tape backups can fail too, so they need to be planned properly. I've done tape backups myself but that was a long time ago.
Yep. I've been doing this for 25 years now. I worked for a couple of companies with big RAID systems but no backup. It's a struggle to get the people responsible to buy sufficient backup systems. In one case, only one week after installing the backup solution and taking the first full backup, the main RAID system failed and died. Without that backup the company would have gone out of business completely. I have seen this happen to companies before.
You would need one hell of a tape array to back up that kind of data, not to mention it would take forever! I don't see tape as a practical offline backup solution for this quantity of data at a company of LTT's size. It is better to have a duplicate server in a DC with clean power and resilient backups and replicate the data; that would act as a backup and be a suitable DR solution. Without backups, I do wonder if they even have a BCDR plan in place?
LOL I can see the title: "Ever try to back up a few sextibytes? Or even just a few exabytes? No? Well, funny thing happened..." Or "This is awkward... newcubed16 ...." Please tell me they have fiber to the new vault and aren't trying to do this over a normal connection.
@@gorkskoal9315 They do have a UPS for their server room, but for a few months they didn't because their UPS caught fire. Also it sounds like they never configured the servers to _safely_ shutdown when the UPS was running low, instead the UPS ran out of power and the servers got plug pulled.
@@applepie9806 I've been in IT for 10 years, worked as an infrastructure engineer for a hospital, technical lead for an MSP supporting SMEs, and finally solutions architect for a £250m company. I'm also a freelance consultant. I love LTT's vids; I normally just have them on in the background whilst I'm working. They do make some big mistakes, but it's all part of the drama :L
Linus: "the way they name HDMI generations is so confusing." Also Linus: "we move the data from the old vault to the new new vault and then name the old vault new new new vault with a bit of an upgrade"
Linus, I have over a decade of experience managing multi-petabyte ZFS with five-nines uptime at large ISPs. I think you may have the wrong cause of the data loss, and it may not (MAY NOT) be as lost as you think. Please reach out to me
It also helps that he's worked there from when they were getting serious about their data storage, so he knows the reasoning behind why the things are set up the way they are
@@riks.1773 as Linus explained, it's routine checks that they should have been doing monthly, for years. AND they didn't set up any email alerts, so they never got notified of the failures!
Tech Tips' data loss is due to one thing - quantum variability. :D The data was in a state of flux until someone audited, at which point it was forced to exist or not exist. Some were observed to be the latter.
Just hearing "never hired a full time IT person" makes me go "uh oh... I don't like where this is going..." a good sysadmin who can help protect systems is a valuable part of any modern company
Classic case of responsibility creep. As Linus and others have become responsible for more stuff as the company has grown their ability to handle routine IT maintenance duties has dropped off, and because it's happened slowly over time it's never quite shown up on anyone's radars as a matter of concern.
Yeah, because just so you know, only other sysadmins value sysadmins. It's an extremely simple job, so the rest of us think we can do it, and we sure can until our real job prevents us. If only we could teach monkeys a couple of bash commands and have them be sysadmins for a couple dozen bananas.
That's why the sys admin job is dying out and most mid sized companies pay less to move it to cloud based systems that are more reliable (for now, until the price gets hiked)
The second most important thing to consider about backups, behind actually having them in the first place, is TESTING THEM! If you don't test your backups then you don't have backups.
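Seconding this: the cheapest restore test is pulling one file back and byte-comparing it against the original. A minimal sketch in shell; `your-backup-tool` and the paths are placeholders for whatever backup software you actually run.

```shell
#!/bin/sh
# Minimal restore test: pull one file back from backup and byte-compare it.
# "your-backup-tool" is a placeholder; substitute your real restore command.

# Returns success only when the restored copy is byte-identical to the source.
verify_restore() {
    cmp -s "$1" "$2"
}

restore_test() {
    src="$1"
    restored="/tmp/restore-test/$(basename "$src")"
    mkdir -p /tmp/restore-test
    your-backup-tool restore "$src" --to "$restored"   # placeholder command
    if verify_restore "$src" "$restored"; then
        echo "restore test OK: $src"
    else
        echo "RESTORE TEST FAILED: $src" >&2
        return 1
    fi
}
```

Run something like this on a rotating sample of files; a backup that has never passed this check is just a hope, not a backup.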
Only if you have shit backup software. Last year I did a restore of our main HPC file system after an upgrade, and everything came back. The only "testing" necessary is the occasional restore when users have done daft stuff and deleted files by accident. Then again, I have a "proper" backup system in IBM Spectrum Protect (née TSM). If you use toy backup systems (aka everything else, in my view), then yeah, test them regularly.
@@jacquesb5248 Nope, if you have to "check" that your backups are running then you are doing it wrong. This should be integrated into your monitoring system so you get told that your backup *DIDN'T* run. Checking manually is prone to someone forgetting, or being on holiday, or a thousand other reasons. Also, getting told daily that your backup ran becomes an issue too: it gets treated as background noise, and you get bored checking the same report day in, day out. Basically, being notified that something is as expected is the wrong way to do anything. You need to be notified when something is *NOT* as expected; in this case, that the backup didn't run to completion without errors.
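The "alert on absence, not presence" idea above is basically a dead man's switch. A rough sketch, assuming the backup job touches a stamp file on successful completion; the stamp path and the 26-hour threshold are made-up examples:

```shell
#!/bin/sh
# Dead man's switch for backups: page when the backup did NOT complete,
# instead of mailing "success" reports that become background noise.
# Stamp path and threshold are illustrative, not from any real setup.

STAMP="${STAMP:-/var/run/backup.stamp}"
MAX_AGE=$((26 * 3600))   # daily job plus a 2-hour grace window

# Succeeds (exit 0) when the last completion is older than max_age seconds.
is_backup_stale() {
    now="$1"; last="$2"; max_age="$3"
    [ $((now - last)) -gt "$max_age" ]
}

check() {
    now=$(date +%s)
    last=$(stat -c %Y "$STAMP" 2>/dev/null || echo 0)   # missing stamp = stale
    if is_backup_stale "$now" "$last" "$MAX_AGE"; then
        # Wire this into your real alerting (mail, Slack, PagerDuty, ...).
        echo "ALERT: backup has not completed within $((MAX_AGE / 3600))h"
        return 1
    fi
}
```

Run `check` from cron or your monitoring agent; silence then means "all good", and a page means the backup is missing.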
As soon as they switched from Storage Spaces I kind of saw this coming. I've got a 912TB S2D cluster that serves as storage for about 200 or so virtual machines, and it's been rock solid; performance with the NVMe cache has been great too. One of the things I saw on Spiceworks was a warning about over-engineering infrastructure.
@@mkastelovic 250-300MBps. And they have worm tapes. And by using a dual drive tape robot, it makes backups completely automated. Restores too. Backing up to individual LTO drives having to load tape after tape is too much labor. Backups will never get done.
@@jspafford Well, if you have the library, the backup is done automatically. Plus, in their case we are speaking about incremental backups, where most of the old videos don't change at all ;), so the backup will be done overnight.
I agree with all of you, except for one thing. Linus has expressed that he has quite a lot of data that he says isn't that important, meaning that buying a tape robot would be quite an expensive investment. Maybe not even worth the trouble.
Well if you hadn't made this video, I never would have known to check if automatic scrubbing was enabled on my storebought NAS. It wasn't. I don't believe it's ever suffered a power failure, being connected to a UPS and configured for automatic shutdown when the UPS drops below 50% battery since day one, so no automatic scrub on resume either. It's now set to automatically scrub once a month, so thanks!
Other pro tip: if building such large-scale storage, make sure your disks are from different manufacturing batches. Imagine the nightmare of having disks with consecutive serials wearing out and failing at almost the same time.
Or they could just buy a professional backup solution and get proper training operating it plus a maintenance contract. You know the way every real enterprise would do it :D
@@lostintechnology1851 It's a different situation, he said this is non essential archival footage, the creation of these servers created content, the failure of it created content, and yeah, backing that stuff up would cost a lot of money... so risk/reward. The best option isn't necessarily always the right option.
@@entelin Didn't watch the video yet, but it sounds like something that RAID 5 would solve instantly, and it would cost them barely any storage with that many hard drives
I’m responsible for our SANs at work and there’s something else that wasn’t touched on in this video - make sure you configure email reporting from your storage nodes! The sooner you’re notified about issues, the sooner corrective action can be taken. Additionally, if possible, keep hardware spares at each site where the hardware is, so if a drive has failed (or even if it’s in a predictive failure state), you can swap a new drive in ASAP. Same goes for other hardware, such as controller cache batteries; these too can fail, and can do so silently, allowing the node to continue working, but with degraded performance. TL;DR - Keep an eye on your infrastructure and monitor it!
I worked for an MSP where they had fired the previous person in charge of backups. I was on the infrastructure team. We found that 65% of our customer backups were no good and something like 85-90% of offsite replication was failing. It took 8 months before we could return all the backups to normal and reduce the backup-checking workflow to less than a few hours per week. Of those 8 months, I spent the first 4 working almost every hour of my workday getting the backups straightened out. Suffice to say, having an ops team of competent people who are organized, themselves redundant, and able to check each other’s work without judgement is absolutely paramount for a team in charge of critical systems. I personally love working on backups because it’s a silent way to ensure continuity while working with amazing technologies.
As someone in IT I know the pain. One thing I go by is that you don't have a valid backup unless you have tested it. I've had times (granted, back in the day, like 8-9 years ago) where the software would say it's a good backup, but god forbid you actually need to restore, as it would just fail. Very fun times, they are
@@Phoen1x883 well, if you can do something on company time to help the company bottom line... you should? There are things like loyalty and goodwill, even in corporates
"I'm the highest ranking person in the company, the highest ranking person in the IT team, and the person who decided not to hire a dedicated IT staff. There is no way to determine who's accountable here" - Linus 2022
@Connor Williams but by that logic he will only fix what they failed at, not anything else that may arise. If they don't have a full-time or part-time IT person, then the same or similar issues are doomed to happen again
@@KP3droflxp they need an IT specialist to schedule and perform regular preventative maintenance. Otherwise, their team will just fix things when they break like this video.
As someone who works in the IT field for a small company I will be following this very closely. Anything that you guys do like this I absolutely love and try to implement it if it is appropriate for my company.
In the takeaways at the end of the video, there was no mention of monitoring. If zfs zed was configured to email somebody/service desk on events like drive failure, this disaster could have been averted by replacing failing drives one at a time as they failed instead of accidentally finding the house of cards your enterprise is built on. Monitoring for failure should have been the most prominent takeaway.
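For anyone wanting to do this on ZFS on Linux: zed's email notifications live in `/etc/zfs/zed.d/zed.rc`. A minimal sketch; the address is a placeholder, and you'd restart `zfs-zed` after editing:

```shell
# /etc/zfs/zed.d/zed.rc -- minimal sketch of zed email alerting.
# The address below is a placeholder; use your real ops mailbox.

ZED_EMAIL_ADDR="storage-alerts@example.com"   # recipient for fault/degraded events
ZED_EMAIL_PROG="mail"                         # any mail(1)-compatible mailer works
ZED_NOTIFY_INTERVAL_SECS=3600                 # rate-limit repeated notifications
ZED_NOTIFY_VERBOSE=1                          # also notify on completed scrubs/resilvers
```

With this in place, a faulted drive generates an email within moments instead of sitting silently degraded for months.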
Thought it was weird too. An email as soon as one drive fails could reduce response time. The number of drives they are handling meant the chances of 2 or more failing at the same time is pretty high. What about reserve drives to automatically repair when one degrades? Not foolproof but a good start. For bit rot, more frequent scrubbing?
Not just configure it, also test that configuration. I've worked at a place where the storage system was set up to send an email in case of pending doom. Problem was, it wasn't configured correctly, so the emails never reached their recipient. How did they find out about the impending doom? Well, the system also gave off an audible alert and flashed an LED, which were only noticed when I was given a tour of the server room.
I think we all have a lot of cases of 'didn't follow our own advice' in the storage/DR world. Unless it affects your bottom line, backups and DR tend to be lower on the priority list. And lower on the priority list usually means either "not configured at all" or at minimum "never been tested before" :(
As a fellow repair-and-recovery guy in the SS world, we sell hundreds of drives globally to guys in that very situation. TRUST ME! new new vault...... "vault 3" ..... 😇
I always follow my advice for backing up data because there is a simple rule... if you back up your data, you won't need the backup, if you don't back up your data you WILL need the backup.
Backups and a UPS should be essentials at this point. I lost drives on my old comp because I had the "bright" idea to use it in the middle of a bad wind storm with only a surge protector.
You know, it's nice to see someone handle things like an adult: admit mistakes, acknowledge that some failures aren't a simple "that person screwed up", and use it to constructively fix problems
@@beermarket9971 Why? It is everyone's own choice how important their data is. If they can live with the loss of some old footage, I see no problem with their actions...
@@alias_not_needed There are plenty of reasons why this is childish in my POV. For one, you should value what belongs to you and protect it from predictable breakdown; otherwise you come across as a spoiled child. Second, as a CEO you have a duty to protect and save your employees' work; while accidents do happen, when they are caused by a lack of prevention, the people in charge (or the CEO) come across as childish. Finally, when a CEO cannot hold someone accountable for data loss (or work loss), it's ultimately his fault and he should just own it, but, maybe I missed it, it didn't quite come out like that. I don't want this to come out negative; I like LTT, it looks like an amazing place to work, and I admire Linus. But this is frustrating to watch...
I am very interested to see how the recovery process goes. As someone who has only ever done disaster recovery in the realm of terabytes... yikes. Good luck friends.
Linus is the only person whose adverts I enjoy. Angry Joe started putting tonnes of effort into his, but they are so forced!! I think Linus actually gets a laugh-kick out of saying LTTSTORE where it's crowbarred into something, lol. I know I do, but not as much kick as the coffee in this LTTSTORE FLASK WILL HAVE. Linus dude, over all the years I have watched you, I don't think I ever credited you properly. Well done man, this thing you've all created is really cool :)
@@Avendesora Except, of course, it's totally wrong to use one of those words for the other. Unless your server room is so large that you must use a Segway to get to the sponsor.
First off, I am sorry for LTT about the data loss. Secondly, I am glad it wasn't "active" or current data but rather old YouTube videos, and those can be recovered (but only the uploaded videos, not any extra material/footage you had stored). Good luck on the project!
Big mistake to immediately replace the drives that weren't even dead and had only shown some failures. By removing them, LTT removed all the (still good) parity data on them. They probably should've run a scrub first, and then removed the possibly malfunctioning drives.
The Vault is hardly the beating heart of the company that Whonnock was, and it sounds like this unfolded over the course of days and weeks as Jake found the issues; they are still working on rebuilding the data. The Vault is just archive data. Whonnock is the in-progress projects, and I think at that time Linus said there was no backup.
10:30 and most importantly: monitor your environment! SNMP, Syslogs and even specialized monitoring agents are an easy way to monitor your environment.
At that density and the infrequency of the older data being updated you really should consider acquiring a tape library. A couple iSCSI targets and a 250 slot LTO library would keep you until you more than double your current use. But considering the increasing file sizes of the raw files you're ingesting I would recommend going for a 3-3.5X scaling.
Tape is slow. I think the whole point of their setup is for fast access to footage new or old for editing purposes. If they were just hanging on to it for keep sake then Tape is an option but I think they keep it so they can retrieve previous footage on-demand to splice into the current video being edited.
@@grrkaa8450 Tape is slow, but much cheaper per TB. Typically you would have a hybrid system where users interacting with the data would hit high speed disk storage of some sort, and that disk storage would be running software that would migrate copies of files, or just less accessed files to tape. It's effectively the best of both worlds, users have the speed and accessibility of high speed storage, but the high speed pool is much smaller, and most of the archival data is on less expensive tape drives. The only time you hit a slow down is when a user has to access the stuff on tape which would be normally pulled when the user accessed a stub file representing the file on the disk pool.
I've been through a number of mergers and acquisitions over the past 10+ years. On every single one the IT dept/employees who do IT tasks for the other entity have been running without viable backups, server monitoring, out of band management, or alerting. Most also lacked UPS units (or working UPS units), and one was even running RAID0 on a production server and couldn't figure out why it kept failing on them. It's a scary world out there.
Very practical advice: store your "old" archival data (like photos) on a hard drive that is not connected to power or a server. Use other cloud storage all you want, but keep one disconnected, low-tech option.
When discussing redundancy, one is none, two is one. That's how I discuss backup options with my clients. If it's mission critical you need a layered backup system.
As an operations engineer, the amount of red flags that the process you followed here brought up was terrifying. Please write processes for this sort of stuff and test them - it's all fun and games till you lose something essential because of a stupid decision from 5 years ago
Not to mention they used Seagate drives. They are just completely unreliable. I wouldn't trust them in any circumstance. I've had hundreds of Seagate drives fail, but only a handful of WD/Hitachi. It isn't surprising, as Seagate purchased the worst hard drive company that ever existed, Maxtor. And they didn't learn their lesson; they got even more Seagate drives.
@@williameldridge9382 Got a different experience. I'm still rocking Seagate and WD drives while all of my Hitachi drives from the same era as all my other drives died. But not sure right now though.
@@williameldridge9382 Hard drive manufacturers have all had bad batches, it's just the nature of the beast now. I have had failures from all brands in usage. You should see hard drives as a consumable (especially as a storage array), run SMART and replace when health is detected as bad. The bigger issue is people not doing backups, that's a failure on you and your users to not enforce that.
You just have to wait until something goes wrong and boom, new server content! Maybe we pay Seagate to send bad drives so we get new content sooner? Sounds like a good, reasonable idea.
I'm just wondering why LTT didn't go for tape storage for their servers, since, as Linus said himself, it was for archival purposes and more of a fun project to test out the tech they got. They even got a tape drive some time ago afaik. It doesn't make sense to keep the drives spinning for years if they are not actively used or maintained.
Because they probably want to have quick access to it, I think... To cut something out of old video and things like that? As far as I know, tape storage does not give you that luxury.
@@Stasiek_Zabojca You could store lower-bitrate stuff on fast storage for browsing and only get the tape out when you need access to the original files.
Tape storage is cost competitive on the level of multiple petabytes, not single petabytes. So it's nothing that any significant minority of viewers will ever see in person, let alone be part of decision making process to buy, install or configure.
Because he'd rather have "dope hardware" instead of using tape. If they need access to it that's fine, every week or month or time frame you do a fresh backup to tape and keep your servers running for access and have tapes as backup. He failed to implement backup in depth which is basically industry standard. Archive is not backup. Redundant and separated storage of data is backup.
Love your show! Just wanted to chime in here coming from an IT background supporting large companies in datacenters as well as being a content creator. Trying to maintain an accessible RAID of ever growing content only gets more difficult and expensive over time. You will eventually need a full time employee to manage your content if you go this route and at some point you will need to migrate your entire content to a new RAID when 1 petabyte isn't enough anymore and that's not going to be fun. The alternative cheaper and simpler solution is to archive your content to tape which will have a much higher chance of surviving the years to come as it's not on spinning platters that run 24/7. Yes, getting access to a piece of content you want to grab on short notice will be more annoying but you can always keep a smaller RAID with your completed videos and archive your raw content via tape as it's the RAW video content that really eats up the TB which is why you might want to consider archiving your raw video.
just build a giant data center that uses robots to automatically fetch & read tapes, so it's at least automated, even if it still takes half an hour. Building a data center is probably also great content for the channel ^^ /s
This response from a pro is why you would leave a job like this to pros. As this pro demonstrated, #1 is identifying and understanding the requirements. Do you really need all your old content available online, or is offline good enough? Then build a solution to fit the needs.
In my first months as a sysadmin I learned a lesson: always keep a secondary backup that isn't on-premise. Power can go out, and you'll have a few bad sectors on your drives. But if there's a fire and your server goes with it, all of a sudden giving a few bucks to Jeff Bezos doesn't sound that bad of a deal after all.
Yeah, not paying for cloud storage basically confirms that they wouldn't cry to sleep if they lost the whole lot. Which is a reasonable decision since it's not mission critical data.
Heck, anyone with a 5-or-more-user Office 365 tenant can get unlimited OneDrive backup. Yes, it's slow to back up, and yes, it's full of details like 25TB SharePoint sites that you have to subdivide, but it IS unlimited, very cheap, and an offsite backup.
I think they've covered this in the past, and the problem is that they just have so much at this point that the upload will take forever. But that doesn't mean you're not right. If anything, they should do it _now_, because every day they wait just adds more they have to upload. I'm sure there's something out there that will just upload everything in the background until it catches up. Also, IDK if it would only be "a few bucks" for the amount they need. IDK what that kind of enterprise-level storage costs, but it's probably not cheap. And I'll bet that even on "unlimited" cloud storage plans there's a catch written in the fine print, some way of restricting the storage in practice, like throttling upload bandwidth past a certain amount of data to such a slow rate that they could never upload faster than they create new data...
I don't really think this was an issue of not having a "tech person", or "not having time" to set up. It was simply an oversight. Setting up scrubs and SMART alerts doesn't take long, and you certainly don't need a full time person sitting around waiting for trouble notices from monitoring applications.
Hey Linus, 2 best-practice recommendations I didn't hear you state, but which would be very important. 1) Per every 24 disks (avg) you should have 1 hot spare. This drive should be in the same HA zone as the 24 disks, so if a failure occurs or a scrub detects errors, it can automatically start the rebuild to the spare; this gives you time to get replacement equipment without having to worry about your data while you're purchasing. 2) If you cannot do full backups to a cloud or dedicated location, the next best thing is to ensure your data lives across 2 different technology solutions. As this is entirely archival and you're not worried about location protection, your 2nd system could be an always-on VTFS (virtual tape FS), or the data could just be streamed to tape backups with one guy pulling the tapes about once a month. Tape is rather inexpensive and has a great shelf life. I've been doing IT storage and data protection engineering for the last 10 years, and customers in your position, gathering increasingly large data sets without dedicated staff, are sadly all too common.
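To illustrate point 1, ZFS supports pool-level hot spares directly; a sketch with example pool and device names (substitute your own):

```shell
# Attach two hot spares to an existing pool ("tank" and the device IDs
# are examples only).
zpool add tank spare \
    /dev/disk/by-id/ata-EXAMPLE_SERIAL_1 \
    /dev/disk/by-id/ata-EXAMPLE_SERIAL_2

# Allow automatic replacement of a faulted device; with zed running,
# a spare is pulled in and the resilver starts without human action.
zpool set autoreplace=on tank
```

Once a spare has been resilvered in, you replace the dead disk at leisure and the spare returns to the pool.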
They're using Seagate drives. So for every drive they should have a hotspare ;) No seriously. I'd go 12D 1HS at least. The way they manhandle their servers i'd say VTFS is prone to a hilarious video with a lot of grey confetti :D
I don't think any of this will produce any downtime for his company. The petabyte worth of data he may lose is, as he said, just a "nice to have". It's not the actual production server where they store current projects and videos. His employees may not even know about this data loss.
Thank you for this. I've never scrubbed my ZFS pool because I didn't know what that meant. I now have it set up to do it monthly and am running one as we speak. 5 hour estimate for completion
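For anyone else setting this up by hand rather than through a NAS UI, a monthly scrub is one cron line; the pool name "tank" and the schedule are examples:

```shell
# /etc/cron.d/zfs-scrub -- scrub the pool at 03:00 on the 1st of each month.
# Pool name and schedule are examples; many distros ship a similar job already.
0 3 1 * * root /sbin/zpool scrub tank

# Quick health check to run (or alert on) any time:
# "zpool status -x" prints "all pools are healthy" when there is nothing to report.
zpool status -x
```

Pair the scrub with zed email alerts so a checksum error found mid-scrub actually reaches a human.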
Let's set this straight: there are more backup options than local spinning disks and cloud storage. The cheapest would be an LTO tape library. An LTO-8 tape (12TB of uncompressed storage) is about 50-100€, which is only a fraction of the cost of spinning disks. They are also archival grade and can be labelled and stored on a shelf somewhere. As their backup files don't really change, you could just put a few projects on one tape and chuck it in the warehouse.
While I agree with the archive not being on spinning disks, long term storage of tapes is an issue in itself. It requires regular maintenance, climate controlled warehousing and copying every few years. I work in broadcasting and I have only seen deep archives done correctly maybe once in my career. I have quoted archival systems several times and the face customers make when they see the numbers and are then informed it does not include any recurring and on-going operational cost is always funny (not really).
Sure sounds like a good time to make a video about how tape drive systems aren't as obsolete as many might think and maybe even get yourself a super cool tape robot! You could also dig into data reconstruction/recovery software to see what you can pry out of the drives you've pulled and maybe try out the old "HDD in a freezer" trick. There you go. Two new video ideas (that I'd love to see presented by Jake and Anthony respectively) to hopefully recoup some of the costs of this oversight.
(oversimplified) Summary: The power dropped out a bunch of times and LTT dropped the ball on configuring the servers so the servers dropped a bunch of errors before dropping physical drives out of the servers resulting in the servers permanently dropping some data... I see a familiar pattern here.
The perfect opportunity for testing out tape backup! I had 4 HDDs fail at the same time in my RAID 6 storage server with total data loss. I recovered all my data from my tapes! It was only 80TB of data, but when it comes to price for large backups, tape is king!
Thanks for doing this video, I'm sure this made a LOT of people go back and check whether their home servers, or servers they support to make sure they are not vulnerable.
Considering a lot of this is for cold storage, it would be neat to see you implement a tape drive for this use case. Tapes would store a ton of data pretty cheaply, and safely. It's also something many people don't even know is still in use.
@@florabee9283 Also electrical problems. For small concerns like myself, I've tried tape, but disk is just easier, cheaper, and tracks my needs better. For Linus, tape is definitely well worth a look.
I've made the case here many times that storing all footage as raw data is overkill. Not really surprised this strategy failed, but very sorry to hear it. It's not like you can simply back up a few petabytes to another machine. So yeah, tapes. Maybe even Amazon Glacier? At the very least I would have made a second backup tier storing compressed data. Another option would be to store finished renders in max quality on Blu-rays. That's still a lot better than a permanent loss. And a lot cheaper.
Yes. We store over 70 PB of archive data on tapes. They have their own failure modes, but overall it's a good solution. We once had a tape robot arm go out of alignment, and it knocked a lot of tapes out of the storage.
It's the clear solution. For a normal user an LTO drive is expensive AF, but in his case it'd be cheap compared to a server... and the tape cartridges are very cheap for what they can store... he could even have duplicate cartridges of all his data. Instead he insists on buying bigger and more expensive hardware, which is more complicated to maintain and has many more points of failure.
Oh my gosh, they had no automatic scrubs and no automatic e-mail notification when a drive fails? That's absolutely necessary maintenance basics for ZFS... I wish LTT luck on restoring their data!
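For anyone wondering what "automatic scrubs and email notification" actually look like on a Linux ZFS box, here's a minimal sketch. The pool name and email address are placeholders; the ZED variables live in `/etc/zfs/zed.d/zed.rc` on most distros.

```shell
# /etc/cron.d/zfs-scrub -- hypothetical cron entry: scrub the pool monthly.
# (Debian/Ubuntu ship something similar in /etc/cron.d/zfsutils-linux.)
# m h dom mon dow user command
0 2 1 * * root /usr/sbin/zpool scrub tank

# /etc/zfs/zed.d/zed.rc -- ZED emails you when a vdev faults
# or a scrub finds errors. Address is a placeholder.
ZED_EMAIL_ADDR="admin@example.com"
ZED_EMAIL_PROG="mail"
ZED_NOTIFY_VERBOSE=1
```

With just these two pieces in place, a drive dropping out generates an email instead of silently degrading the pool for months.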
I wish them luck too, but all the information they tell about how it's important to backup the drives and have multiple backups they don't even follow. Is it just we'll fix it later or the cost to do it isn't a justifiable reason?
Also funny as they had multiple videos with sponsors like Pulseway where they brag about having everything monitored (so I guess they don’t use it… or didn’t configure that either…)
Y'all spend A LOT of money on data redundancy; how about allocating "a reasonable amount of money" to redundant power backup strategies? Generators, solar panels, an enterprise UPS with some SLA battery banks, or a nice LiPo/LiFe array. Buy yourself some time, with a big enough buffer for power outages. Do an energy audit of what absolutely must never lose power, and consider your options. Automating your alternative power sources, or even offloading your grid expenses with alt energy, would pay off in MANY ways. You have a roof on that building: load it up with some panels. It would make a supreme video series as well!
LTT, please look into implementing an LTO-8 tape library as a proper backup to your network pool! Tapes are so much cheaper than drives, and are the preferred archive format for long-term storage. The tape robot and archiving software would do all the hard work of keeping track of the data.
I was about to suggest the same. Newer (1-2 years old) / more frequently accessed video goes to the petabyte server, but classic stuff goes to tape. They even talked about it 3 years ago. ruclips.net/video/alxqpbSZorA/видео.html
The one-time tapes would actually make sense. The only downside is you would have to pay for an application to read/write said tapes (e.g. Commvault). But that isn't all that terrible.
Tape is definitely a good way to go, especially with a tape library. As for applications to write to the tapes, there are some powerful open-source ones such as Bacula but it might take someone a bit of time to get it up and running.
@@VampyWorm Yep, but a BackupExec copy with 1 agent would be in the hundreds or very low thousands USD / yr. Tech support included :))) 46+2 LTO8 tapes would absolutely rule LTT. Just done a 4 drive, 2 autoloader, 2 libraries implementation, it took 3 people around one week to fully set up copy & backup jobs, I'm very impressed with the results!!
Please don't use 15 wide vdevs. Groups of 6 wide in raid-z2 is a good choice for spinning rust (4 data + 2 parity). As a zfs user for 10+ years, I cannot imagine running multiple 15 wide vdevs.
Really wide VDEVs are only OK when using SSDs or low capacity HDDs. The rebuild time on a 12 drive VDEV of 12TB drives is insane, and the stress the other disks are under during that period can easily cause one to fail. 6-8 drives on a RAIDZ2 seems to be the sweet spot for large drives, maybe 9 drive RAIDZ3 if you're _really_ paranoid. EDIT: I'm also saying this as someone who's running 8TB drives in 9 drive RAIDZ2 VDEVs. I have plenty of slots for more drives, so I'm sticking with 8TB drives for the time being.
More than 6 raidz2 using 20tb disks sounds a little edgy. I would require disks rated 1 error over 10^17 bits for that. 15 is objectively scary with raidz2. 10 with adequate replication or backups would already be edgy. With raidz3 maybe 15 is not crazy but you might want to upgrade the pool at some point with 40tb drives or more, if they ever come out. Which would be totally nuts.
11-wide Z3 vdevs would be the most I'd be comfortable with, regardless of SSDs / rated error rate. But once you're at 11-wide Z3, why not go (2x) 6-wide Z2? One extra drive, one extra parity, more striping (more performance), and more flexibility in adding / removing / replacing devices. It's all a balance between redundancy / space efficiency and flexibility. To me: 6-drive Z2s, and just multiply as needed. Let's think about the worst case. For a 6-drive Z2, you lose 2 drives: you have a 4-drive "RAID 0" to deal with until redundancy is restored. Not great, not terrible. Email alerts, etc. But a 15-wide Z2 with no email alerting? 2 drives die and you get a 13-wide "RAID 0". Good luck.
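To make the comparison in this thread concrete, here's roughly what the two layouts look like for 12 disks (device names are placeholders; real deployments should use `/dev/disk/by-id/` paths):

```shell
# One wide raidz2 vdev: more usable space, but one huge resilver domain
# and the whole pool's redundancy rides on 2 parity disks.
zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl

# Two 6-wide raidz2 vdevs: two extra disks' worth of parity overhead,
# but smaller resilver domains and striping across vdevs for performance.
zpool create tank \
  raidz2 sda sdb sdc sdd sde sdf \
  raidz2 sdg sdh sdi sdj sdk sdl
```

The second layout costs capacity but means a resilver only hammers 5 other disks instead of 11.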
These videos about your big fuckups are by far the most informational and educational videos on your channel... I have a little checklist of shit not to do when I set up a storage system, wouldn't have heard about these pitfalls anywhere else.
Bro, hire a dedicated sys admin. You have too many employees that rely on your server infrastructure to yolo everything yourself. You mention that you, Anthony, and Jake work on it, but they also are writers. You have enough data and infrastructure to warrant a dedicated and experienced sys admin at this point
I wouldn't want that job. They'll go behind his/her/their back at any opportunity anyway, because "it's faster that way" or "reasons". The way LTT grew, the IT-guy job is a surefire way to get PTSD now ;) No way they can establish any structure now.
I would love to know why tape backups aren't considered. It seems to be one of the more economical options and is great for archival. Also, as a photographer who works with tens of terabytes I would love to learn more about tape backup.
@@Lexan_YT Backup is not main storage. With Dell Powervault TL and IBM Spectrum we are achieving 1-2Gb/s write and read speeds. So restore of that data isn't that insanely slow.
All of this was very patiently and thoroughly explained, except for one thing: what happened to that LTO-8 drive you were planning to put into service years ago?
@@jayred8289 I mean, how does such a big company with so many resources not have a 3-2-1 backup, even if it's some raw data? It's not like they're short on cash, are they?
@@lolish1234 Because it's not ridiculously important data, Linus even says in the video that half the reason they bother keeping it around is because they can make interesting videos on it. I wouldn't be surprised if the eventual goal was a 3-2-1 backup system but they wanted to cover setting up each stage in videos which kept slipping cause LMG is pretty busy until we get to today. A lesson into why businesses with large data needs should be hiring their own IT guy.
@@TheDemocrab Setting up a cable testing lab and acquiring more space is more important than building a 3-2-1 backup system? Yes, "in hindsight" everything is easy to judge, but assuming Linus sets his priorities straight, he literally has more issues with monitor cables than with his raw video archive.
That is exactly the reason why I stopped building my own storage servers and got my first Synology like 10 years ago! Obviously I have far less storage demand (I've got 4TB of triply backed-up data and 25TB of nice-to-have original videos and RAW photos backed up once). All secured via parity, auto-scrubbing, snapshot deduplication etc. I've never run into any issue, and I've distributed more than 20 DiskStations among my family and close friends, to people with far less IT know-how than me... and I'm a different kind of scientist with ok-ish hobby IT knowledge. There is no way on earth I can build something half as reliable and convenient as purchasing a Synology or maybe a QNAP and putting another one up as backup at my parents' place!
With how large these drives are, I would really recommend going with Raid-Z3. I'm not saying larger drives fail more often, but rather resilvering a vDev with large drives takes INSANELY long. And resilvering hammers the remaining drives. Raid-Z1 and Z2 were great with like 2-8TB drives. 20TB? Not so much.
This. IDK about the specifics of the different RAID configs, but I do think that it makes a lot more sense to have more smaller drives so that if and when something fails, it has less of a chance to wipe out _everything._
@@Kevin-jb2pv More smaller drives can cost you a lot more, though. You need twice as many servers and twice as much space (and also more power and cooling, though that's not much of a concern here). But if a drive has twice the capacity at nearly the same speed, it'd probably be appropriate to think of it as two drives in terms of redundancy needs (2 out of 15x10TB is fine, but 2 out of 15x20TB is like 2 out of 30x10TB drives, which is risky).
Linus, I used to tell my loved ones "there are two kinds of people in the world: those who have a backup and those who wish they had one." I used to work in storage rack support and I've seen the worst of the worst, including a 24-hour-straight marathon to restore a super critical one. But I've also seen a storage rack with all the capacitors blown due to a lightning strike that fried a little unprotected datacenter. So... are you hiring a full-time IT person now? :P
Actually, there are three kinds of people in the world: those who have a backup, those who wish they had one, and those who check that it is actually possible to restore data from the backup. I mean that lots of companies think they have backups, but they've never actually tried to restore data from them, and it's possible that their "backups" are not recoverable. Just try to restore data from your backup and you might be unpleasantly surprised.
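The restore-test idea above is cheap to automate: periodically pull a sample file back out of the backup and compare checksums end to end. A toy sketch (paths and the copy-based "backup" are stand-ins for whatever backup tool you actually use):

```shell
# Hypothetical demo paths -- substitute your real backup tooling.
SRC=/tmp/demo_data
BACKUP=/tmp/demo_backup
RESTORE=/tmp/demo_restore
rm -rf "$SRC" "$BACKUP" "$RESTORE"
mkdir -p "$SRC" "$BACKUP" "$RESTORE"

# Simulate a backup, then a test restore of one file.
echo "project footage v1" > "$SRC/clip.txt"
cp "$SRC/clip.txt" "$BACKUP/clip.txt"        # stand-in for the backup job
cp "$BACKUP/clip.txt" "$RESTORE/clip.txt"    # stand-in for the restore job

# The verification step that most shops skip: checksums must match.
a=$(sha256sum "$SRC/clip.txt" | cut -d' ' -f1)
b=$(sha256sum "$RESTORE/clip.txt" | cut -d' ' -f1)
if [ "$a" = "$b" ]; then echo "restore verified"; else echo "RESTORE FAILED"; fi
```

Run something like this on a rotating sample of real data and alert on any mismatch; a backup you've never restored from is just a hope.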
@L. Kärkkäinen You're right, However, this could be mitigated through tapes. They are actually ideal for this kind of data, as video files are sequential data files. Tape is also archival class, meaning they should not suffer from bit rot over time when stored properly. if they need old footage from years ago, they can grab the tape from archive, and it should seek and fetch the data off relatively quickly. Tape also solves the offline problem, as they should only be loaded when writing new data, or if you intend to retrieve it.
Why is raid not considered backup? I was considering using a 2 drive raid synology nas for my desktop files, and possibly copying that data to a cloud provider like wasabi as well. Is this not a good solution for “safely” storing my crap?
Yeah, tape is one of the best solutions for offline data storage. It is "old" tech, but it does the job. For personal use I have cloud storage for archives, but for larger businesses a tape library is a nice touch. The only problem is the software; it can be high-priced.
The whole way through.. could not stop myself saying "shoulda got a small tape library and backed up to that" - it's very cost effective esp by comparison to the options presented at around 9:20. modern LTO tapes store tens of TB per tape and with LTTs connections, swinging a library, a couple of drives and a full suite of tapes should be no more expensive than a few months of cloud storage while not hurting the power bill - even our small library here at home consumes at most 350W total for the controller shell, expansion shell and all drives + gantry.
He also missed the fact that things like Deep Archive at AWS are answers to this, at around $1/TB-month. Yes, you pay to retrieve it, but in reality you are rarely ever going to; it is a vault of last resort. So it is doable for $1k-$2k a month. Time to redo the cost-benefit analysis with more correct values vs the on-prem tape vault.
@@Herlehy He already answered that question though. None of this is mission critical, and YouTube is literally providing cloud backup for all the videos, and they're paying his company to do it!
Our studies concluded that a tape restore fails roughly 1/4 of the time per tape involved, so a recovery spanning more than two tapes is already close to a coin flip. Tape is delicate and requires very careful storage even to work 3/4 of the time, and each additional tape adds another chance of failure. Petabytes of data on tape would take literally years to back up, years to recover, and have a virtually 0% chance of full recovery. You'd hope that delta backups would make it more efficient, but sadly they only complicate matters further.
I would love to see a follow-up on this with how much data was saved and how much was lost. Which videos now exist only on YouTube, and how much can they still refer back to?
@@ticler They can rot just as badly as the 'toy' storage does. All it takes is for them not to get attention. And where would the many hours of fun content about it go?
Having such large RAID groups (15 drives) with large drives and no hot spares or replacement routines seems rather dangerous as well. If you already have two dead drives in a vdev, it's not that unlikely that you will lose a third during the resilver. Anyway, LTT's IT infrastructure has always been a bit of a dumpster fire, but maybe they do it intentionally because it results in a lot of great content 😅 I wonder if they have thought about connecting an 84-drive SAS expansion to their SSD tier and just having old data migrate to spinning drives (I think Seagate has a rebranded Dell box, if they have a partnership with Seagate).
LMG seems like a company where everybody does everything and that can work to a degree if you have just a couple of employees but it's a disaster when you have a bigger business to run.
@@ericwhite265 very true. just about every commercial nas software has some notification system for when a drive goes down. you shouldn't have to audit the system to find that there are several that have failed.
I can’t even imagine building such massive storage servers and then never running a scrub or even manually checking the disks, wow. I have a relatively tiny home server with like 80 TB of storage and I run monthly scrubs, manually verify disks constantly, and make regular cold storage backups.
I've been working IT for the past 5 years, and never scrubbing our drives or verifying disks is unthinkable. LTT need to hire an actual IT guy, not just tech enthusiasts.
@@deViant14 lol did you watch the video? none of it was critical also, they would've made a video saying "we should've checked our fire extinguishers", to which OP then would've replied that he always checks the fire extinguisher in his "small $3M villa"...
Linus: "We're not sure who's accountable here, so I'm considering hiring someone to be accountable because the situation is currently untenable without an appropriate system of blame in place."
Very impressed with the honesty on this channel. I know plenty of IT folks who would never admit losing data. I run large ZFS storage arrays at my work. When my primary ZFS array is due for replacement (after moving data and workloads to a new array), I then create a Zpool on the old array configured for max capacity and sequential I/O. I then snap and replicate (zfs send/receive) the data on the primary array nightly to the old array. I don't need a ton of performance or redundancy on the old array as it only receives the changed blocks on each replication and is only used for Oh Sh!t moments. I also HIGHLY recommend you add mirrored "Special" devices to your Zpools. Special devices (man zpool) are used for storing metadata (use SSD/NVME) and removing those I/O's from your slower main Zpool drives. You will be amazed at the performance increase, I promise.
you have to be careful with those special devices though, if you happen to have them configured in a non-redundant way and they go away, you drop the entire pool.
@@shanemshort great advice from you both. I totally agree. But they forgot to scrub a 2PB array and its backup, letting them rot for years. I mean... guys, come on.
man 7 zpoolconcepts should be where 'special' is hiding. POSIX compatibility, a COW filesystem, and magnetic media with its large seek times make for a less than ideal combination if performance matters.
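The snap-and-replicate scheme described a few comments up (nightly `zfs send`/`receive` of changed blocks to a retired array) boils down to something like this. Pool and dataset names here ("tank", "oldtank", "archive") are made up:

```shell
# Nightly incremental replication from the primary pool to the old array.
TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)

zfs snapshot tank/archive@"$TODAY"

# First run would be a full send:
#   zfs send tank/archive@"$TODAY" | zfs receive oldtank/archive
# Subsequent runs send only the blocks changed since the last snapshot:
zfs send -i tank/archive@"$YESTERDAY" tank/archive@"$TODAY" \
  | zfs receive -F oldtank/archive
```

A real job would also prune old snapshots and check the exit status so a failed send raises an alert instead of silently skipping a night.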
One thing that was not included in the root cause analysis, is single ZFS vDev per pool. In the case of archival data, using a single vDev per pool and multiple pools per server, makes more sense. This would have potentially allowed more data recovery than multiple vDevs in a single pool. Write once data, (basically what archival is), means you can also fill the pool up higher. Perhaps even 95%. That said, of course not making the vDev too wide should also be mentioned. Meaning if you are going with single vDev per pool, don't use a vDev of 16 or more disks. The wider the vDev, the longer the RAID-Z2 re-build time since ZFS may have to read more disks per data block / stripe. Good luck.
Forgot to mention that with larger disks (like 20TB), using fewer disks per vdev is also suggested. Week-long rebuilds, even with RAID-Z2's two-disk parity, are still pretty risky. So 10 to 12 disks maximum in a vdev, with a single vdev per pool, is probably optimal on a storage-versus-cost basis. Lastly, leaving a free disk slot in each server for replace-in-place is also a good idea. This lets you replace a failing, but not yet failed, disk with a higher degree of safety than simply pulling it: ZFS can read data from the failing disk as well as the rest of the vdev to re-create it onto the replacement. Thus, if there are other unknown errors, there's less chance of data loss. ZFS is one of the few RAID schemes that allows this functionality (though it's probably more common today than when ZFS first came out). Of course, this does not help in the case of a completely failed disk, nor in some cases where a failing disk is in really bad shape.
yeah, and triple-parity / mirroring for such large drives if you're going to be running on a home-brew system that you're not 100% confident that you'll be notified of errors. This is why enterprise storage and enterprise backup platforms exist.
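The replace-in-place idea mentioned above is a single command in ZFS. Device names here are hypothetical:

```shell
# sdf is failing but still readable; sdq sits in the spare slot.
# ZFS resilvers sdq from sdf where possible and from parity where not,
# so the vdev never loses redundancy during the swap.
zpool replace tank sdf sdq

# Watch the resilver; physically remove sdf only once it completes.
zpool status -v tank
```

Contrast this with pulling the failing disk first, which immediately degrades the vdev and forces the rebuild to lean entirely on parity.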
If they can't hire a full-time IT team to manage this, it would make the most sense to contract a 3rd party. It's not "mission critical", it's not "top secret", it's just old, already-uploaded videos. It would make sense to hire an expert to set up and maintain this for $1k a month, which is about 1/5th or 1/6th of the cost of one full-time person.
Linus, when you did the cloud pricing calc you missed something: Backblaze wasn't showing you the archival-level rates (most likely on purpose). For example, with Azure it's $0.001 per GB per month as long as you're OK with the delay in accessing archival-level files. So more like $1k per month for 1PB, which ain't bad.
@@ryanjones8977 Deep Glacier is good as a last resort because it's so cheap but the retrieval cost is quite significant. So personally I use it as a backup of a backup.
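For what it's worth, the archive-tier arithmetic in this thread checks out. A quick sanity check for 2 PB, using the rates quoted above (~$1/TB-month for AWS Deep Archive, $0.001/GB-month for Azure Archive; retrieval fees excluded):

```shell
# Back-of-envelope monthly storage cost for 2 PB at archive-tier rates.
PETABYTES=2
TB=$((PETABYTES * 1000))            # decimal units, as cloud billing uses
GB=$((TB * 1000))

aws_deep_archive_usd=$((TB * 1))    # ~$1 per TB-month
azure_archive_musd=$((GB * 1))      # $0.001/GB-month, tallied in tenths of a cent
azure_archive_usd=$((azure_archive_musd / 1000))

echo "AWS Deep Archive: ~\$${aws_deep_archive_usd}/month"
echo "Azure Archive:    ~\$${azure_archive_usd}/month"
```

Both land around $2k/month for 2 PB, which is why the comparison against an on-prem tape library comes down mostly to retrieval frequency and egress fees.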
Thanks for pointing out needing to manually schedule a parity check! I've been using Unraid and I assumed that it would have scheduled _something_ by default. Nope. Parity hasn't been checked since I set it up in October.
This. And either constant notification of the error condition (an email every hour, etc.) and/or escalation to someone else if it isn't resolved within a particular time frame. Oh, and have hot spares.
@@blowfly71 just have alerts mate. You don't want constant "everything is OK" messages, because you will start ignoring those real quick and miss the one that says it's no longer OK.
@@jfolz he wasn't talking about constant notifications, but ongoing reminders if an error has occurred but wasn't fixed yet. That way you can't miss the single notification of a failed drive.
@@blowfly71 got it. Though sending constant messages does have a benefit: it's a canary for your monitoring ;) It's probably better to have monitoring that monitors the monitoring though.
I do like this about LMG. I've been called in to help with several incidents of a similar nature and the level of stress as people see their livelihoods on the line can be pretty extreme. The fact that LMG can just make lemonade out of it is quite refreshing (pun not intended).
This is the issue with massive storage on single disks like 20TB: it takes forever to rebuild, and you're more likely to have another failure during the rebuild cycle. Also, you should always have some hot spares so it rebuilds automatically once a failure is detected instead of you doing it manually.
declustered parity helps with this issue significantly by getting every disk in the array involved with rebuilding the lost data instead of a single parity disk or two.
The irony of a cloud storage provider sponsoring this segment is not lost on Linus. I like that.
The most amazing part is a backup provider also sponsored the first video on losing their data. Incredible timing.
@@Electrex8 "Hi we would like to sponsor your next data loss video, can you put us on your waiting list?"
I mean, backing up a petabyte of stuff on a cloud provider is so fucking expensive, and you need to pay huge amounts for bandwidth (even at 200MBps it will take forever), so it's not really a realistic option
@@legominimovieproductions If they built the drive and sent it to the host already loaded/backed up/ready to go, I wonder what the service would run.
Generously negotiated for future f***ups no doubt!!
LTT never ceases to amaze me on how professional and unprofessional they actually are at the same time.
You just described every corporation and Government in the world.
Do as i say not as i do.
Definitely. But minus the professional part.
Right? I've worked in infrastructure for years, the consumer end videos are awesome and insightful, but the server/infrastructure videos frustrate me so much sometimes...
Yet 100% entertaining which is the only metric by which to value an entertainment business ;)
I don't know why, but "server issues" episodes are my favourite LTT videos. Content like this just doesn't exist anywhere else.
That's what I like too, there's only so many gaming hardware reviews I can stand to watch, they're all much of a muchness to me, but I really enjoy the infrastructure and unusual project videos the most
IMO It feels more real, and a lot like old LTT did, just overall more entertaining to watch than the usual formula
Me too. One of the first LTT videos i watched was the one years ago where Linus, Anthony and Jake doing stuff in the server room on the weekend
Right?! The Whonnock server died video is one of my favorites to watch! I don't know why, but I just like watching it for some reason 😂
Check out Craft Computing if you like home lab server videos. Techno Tim as well for homelab hosting tutorials.
As a data center engineer your storage content is my favorite content. I'm terribly sorry for your issues here.
Postmortem reports like this are hugely valuable, but companies don’t usually share them. This is a great service to the community.
Because companies don't let their storage get to this situation
@@AegisHyperon exactly. Companies from the get go have an official IT dept. or outsources it to a competent MSP.
@@MajesticBlueFalcon you would be surprised how many companies mess up. What about if the IT dept didn't do their job properly and skipped over certain things in order to save time?
@@AegisHyperon not true. I do sysadmin for small and midsize businesses, and you wouldn't believe the kinds of things I've had to take over. Usually it's either some guy who does something else at the company and thinks he knows stuff but doesn't, or the work of some usually very mediocre external company.
@@AegisHyperon oh yes they do.
Linus: "Right, *now* we won't ever lose data again!"
Data storage: "How many time do we have to teach you this lesson, old man?"
@Nimki rafa 8 What the fuck?
More like power outages😂
@@limemason Spam bots, they reply to every comment automatically. Just report and move on.
*times
@@limemason Only for fans over 18 years old, where's the confusion coming from?
Linus: Hates how USB and HDMI are being named.
Also Linus: New new new vault
If you can't beat them, join them.
Well it’s pretty clear. It’s not like it’s named new old vault.
@@robertt9342 Don't give him ideas.
@@robertt9342 That was my first thought when he talked about reusing the old vault. It would be a new vault built from the old vault. So... New Old Vault [NOV for short]
At least new vault and new new vault aren't being renamed vault 2.0 and vault 2.0 + new
Your backups must be tested
So you know they work as expected
Offline is best
So you can rest
When lightning strikes unexpected
Limerick?
And make sure it's surge protected
As a full-time sysadmin, I always wondered how you guys sustained your data without a real backup plan. As it turns out, you didn't. Really sorry to hear that, guys!
That's exactly why people like me get hired. Companies think they can do it on their own until they lose critical data to misconfigs and missing maintenance. Hurts to learn it the hard way.
I really recommend you guys create offline backups to tape storage for all your archived content.
And respect for admitting you got it wrong so others can learn!
Keep on making such great content!
I'm not a sysadmin, just a network guy that dabbles in sysadmin stuff and yeah, it blew my mind to hear what happened here. If they open a spot to hire an IT guy I think I'm gonna apply :D
Get this dude hired, quick!
I'm not a sysadmin either, but I'm also surprised they didn't catch the offline drives earlier. Even without the regular data scrubs, basic monitoring should have caught that. As for tape backups, I agree but also advise caution that tape backups can fail too, so they need to be planned properly. I've done tape backups myself but that was a long time ago.
Yep. I've been doing this for 25 years now. I worked for a couple of companies with big RAID systems but no backup. It's a struggle to get the responsible persons to buy sufficient backup systems. In one case, only one week after installing the backup solution and taking the first full backup, the main RAID system failed and died. Without that backup the company would have gone out of business completely. I have seen this happen to companies before.
You would need one hell of a tape array to back up that kind of data, not to mention it would take forever! I don't see tape as a practical offline backup solution for this quantity of data for a company of LTT's size. It is better to have a duplicate server in a DC with clean power and resilient backups and replicate the data; that would act as a backup and be a suitable DR solution.
Without backups I do wonder if they have a BCDR plan in place also?
If Linus manages his data the way he manages hardware... it's no surprise the data dropped
indeed...
Lmfao good one
lol
I wonder who actually falls for these bots…
@Cherian Philip Just report them as child abuse
"a lot of power outages" + "transferring that much data might take months" sounds like a recipe for another video in this series.
Yeah, on how bad a power grid can be and how important a UPS becomes in such situations.
I'll hazard a guess that they keep blowing a fuse, and don't have a generator for the building or a UPS for the servers.
LOL, I can see the title: "Ever try to back up a few zettabytes? Or even just a few exabytes? No? Well, funny thing happened..."
Or "This is awkward...newcubed16 ...."
Please tell me they have fiber to the new vault and aren't trying to do this over a normal connection.
@@gorkskoal9315 They do have a UPS for their server room, but for a few months they didn't because their UPS caught fire. Also it sounds like they never configured the servers to _safely_ shutdown when the UPS was running low, instead the UPS ran out of power and the servers got plug pulled.
Natural gas backup generators aren't very expensive relative to petabytes of hard drives, they should probably invest in one.
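On the safe-shutdown point above: with Network UPS Tools (NUT), having the servers shut themselves down before the UPS runs dry is a few lines of config. The UPS name, user, and password here are placeholders:

```shell
# /etc/nut/upsmon.conf -- sketch for a NUT-supported UPS called "rackups"
# served by a local NUT driver.
# MONITOR <ups>@<host> <powervalue> <user> <password> <type>
MONITOR rackups@localhost 1 upsmon secretpass master

# How many MONITORed supplies must be online before shutdown is needed.
MINSUPPLIES 1

# Command upsmon runs when the UPS reports "on battery, low charge":
SHUTDOWNCMD "/sbin/shutdown -h +0"
```

With this in place, a long outage triggers a clean shutdown instead of the plug-pull that drops drives out of a ZFS pool mid-write.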
"We never hired a full-time IT person" was stated and I immediately had the urge to bust out the popcorn and look at IT pros in the comment section.
To be fair many of the LMG staff do qualify as IT pros in skill, if not in formal credentials.
@@onceuponaban no. Just no.
The constant fuckups show that they are not
The funniest thing is the next two comments under this are from the IT pros.
@@applepie9806 I've been in IT for 10 years, worked as an infrastructure engineer for a hospital, technical lead for an MSP supporting SMEs, and finally solutions architect for a £250M company. I'm also a freelance consultant. I love LTT's vids, I normally just have them on in the background while I'm working, but they do make some big mistakes. It's all part of the drama :L
I am a bit of an IT professional myself.
Linus: "the way they name HDMI generations are so confusing"
also Linus: "we move the data from the old vault to new new vault and then name the old vault new new new vault with a bit of upgrade"
Lmao
I hope they pin this
Ayyyyy
Its the new 14nm +++++++++++
USB: this guy seems to know his stuff. let's just learn how to name things from him.
Linus, I have over a decade of experience managing multi-petabyte ZFS with five-nines uptime at large ISPs. I think you may have the wrong cause of the data loss, and it may not (MAY NOT) be as lost as you think.
Please reach out to me
Upvoting this to get it seen.
Tweet at him
No you dont
I would recommend that you reach out to them as well.
Linus does read a lot of comments, but YouTube isn't a good way to get a response.
Email their business email address.
I feel Jake holds a lot more of LTT together with his expertise than we think. Underrated!
The fact that he takes the time to actually look and uncover this is enough for him to be praised as employee of the month
It also helps that he's worked there from when they were getting serious about their data storage, so he knows the reasoning behind why the things are set up the way they are
@@riks.1773 usually you set up monitoring with alerting to check the health state of your storage.
@@kstenders Yes, but I never assumed they configured that... because of the other simple things I've seen get overlooked
@@riks.1773 as Linus explained, it's routine checks that they should have been doing monthly, for years. AND they didn't set up any email alerts, so they never got notified of the failures!
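For anyone wanting the basic monitoring these replies describe, here is a minimal sketch that flags anything `zpool status` reports as not ONLINE. The pool name and sample output are invented; in production you'd feed in real `zpool status` output via subprocess, and ZFS's own zed daemon can email on events directly.

```python
# Sample `zpool status`-style output (invented for illustration).
SAMPLE = """\
  pool: vault
 state: DEGRADED
config:
        NAME        STATE     READ WRITE CKSUM
        vault       DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED     12     0     3
            sdc     OFFLINE      0     0     0
"""

BAD_STATES = {"DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def unhealthy(status_text: str) -> list[str]:
    """Return names (pool, vdev, or disk) whose STATE column is not ONLINE."""
    bad = []
    for line in status_text.splitlines():
        parts = line.split()
        # config rows have 5 columns: NAME STATE READ WRITE CKSUM
        if len(parts) >= 5 and parts[1] in BAD_STATES:
            bad.append(parts[0])
    return bad
```

Run this from cron and send mail only when the list is non-empty, and the offline drives would have been noticed the day they dropped.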
Tech Tips' data loss is due to one thing - quantum variability. :D
The data was in a state of flux until someone audited, at which point it was forced to exist or not exist. Some were observed to be the latter.
Tf are you on about?
@@HilbertXVI if you don't like quantum jokes then I'm half-certain there is a dimension on which you didn't comment.
Schrodinger's hard drive?
😂
To be or not to be 😹😹😹
Just hearing "never hired a full time IT person" makes me go "uh oh... I don't like where this is going..." a good sysadmin who can help protect systems is a valuable part of any modern company
The world's biggest IT YouTube channel, and there's no IT guy
Classic case of responsibility creep. As Linus and others have become responsible for more stuff as the company has grown, their ability to handle routine IT maintenance duties has dropped off, and because it's happened slowly over time it's never quite shown up on anyone's radar as a matter of concern.
Yeah, because just so you know, only other sysadmins value sysadmins. It's an extremely simple job, so the rest of us think we can do it, and we sure can until our real job prevents us. If only we could teach monkeys a couple of bash commands and have them be sysadmins for a couple dozen bananas.
As a sysadmin/sysengineer: unfortunately these guys, although knowledgeable, aren't professionals, and "works" doesn't always mean "works properly" :/
That's why the sysadmin job is dying out and most mid-sized companies pay less to move to cloud-based systems that are more reliable (for now, until the price gets hiked)
The second most important thing to consider about backups, behind actually having them in the first place, is TESTING THEM!
If you don't test your backups then you don't have backups.
Only if you have shit backup software. Last year I did a restore of our main HPC file system after an upgrade, and everything came back. The only "testing" necessary is the occasional restore when users have done daft stuff and deleted files by accident. Then again, I have a "proper" backup system in IBM Spectrum Protect (née TSM). If you use toy backup systems (aka everything else, in my view), then yeah, test them regularly.
This is not a backup system, it's live storage.
But you are right.
Agreed
yeah actually checking that the backups are running
@@jacquesb5248 Nope. If you have to "check" that your backups are running, then you are doing it wrong. This should be integrated into your monitoring system so you get told that your backup *DIDN'T* run. Checking manually is prone to someone forgetting, or being on holiday, or insert a thousand other reasons. Getting told daily that your backup ran also becomes an issue: it's seen as background noise, and you get bored checking the same report day in, day out. Basically, being notified that something is as expected is the wrong way to do anything. You need to be notified when something is *NOT* as expected; in this case, that the backup didn't run to completion without errors.
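A minimal sketch of that alert-on-absence idea (the 26-hour window and the marker-file convention are assumptions, not anyone's real setup): the backup job touches a success marker when it completes, and monitoring fires only when the marker is too old.

```python
# Daily backup plus a 2-hour grace window (assumed policy, tune to taste).
MAX_AGE_S = 26 * 3600

def backup_is_stale(last_success_ts: float, now: float) -> bool:
    """True when the newest recorded success is older than the allowed window.

    In practice last_success_ts would be os.path.getmtime() of a marker file
    the backup job touches on successful completion, and now would be
    time.time(); both are parameters here so the logic is easy to test.
    """
    return (now - last_success_ts) > MAX_AGE_S
```

The monitoring system then pages someone only when `backup_is_stale(...)` is true, so silence means healthy and any alert is actionable.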
Alternate title: The LMG group MIGHT hire an actual IT person
Linus Media Group group
@@callowaymotorcompany yes
Linus LMG group might hire an IT person for new new vault
@@Achilleaa With Seagate again for another Vault.
@@callowaymotorcompany LMGG group?
As soon as they switched from Storage Spaces I kind of saw this coming. I've got a 912 TB S2D cluster that serves as storage for about 200 or so virtual machines; it's been rock solid, and performance with the NVMe cache has been solid too. One of the things I saw on Spiceworks was a warning about over-engineering infrastructure.
Tape (LTO-9) is still an affordable option for backups. Especially for data that doesn't change. Yeah, it's old tech but it still works.
Yep, completely agree with you: a tape library with LTO-9 tapes would be much safer. And it isn't as slow as people think. :)
@@mkastelovic 250-300 MB/s. And there are WORM tapes. And by using a dual-drive tape robot, backups become completely automated. Restores too. Backing up to individual LTO drives, having to load tape after tape, is too much labor; backups would never get done.
Modern tape storage has INSANE capacity. We are talking 32 petabytes per rack. ETERNUS DX600 S5 is one such system.
@@jspafford Well, if you have the library, the backup is done automatically. Plus, in their case we're speaking about incremental backups, where most of the old videos don't change at all ;), so the backup can be done during the night.
I agree with all of you, except for one thing: Linus has said that quite a lot of this data isn't that important, meaning that buying a tape robot would be quite an expensive investment. Maybe not even worth the trouble.
Well if you hadn't made this video, I never would have known to check if automatic scrubbing was enabled on my storebought NAS. It wasn't. I don't believe it's ever suffered a power failure, being connected to a UPS and configured for automatic shutdown when the UPS drops below 50% battery since day one, so no automatic scrub on resume either. It's now set to automatically scrub once a month, so thanks!
Same here on a Synology box, thanks to your comment I checked and noticed it wasn't enabled either. I also activated a monthly schedule :)
Other pro tip: if building storage at this scale, make sure your disks are from different manufacturing batches. Imagine the nightmare of having disks with consecutive serials wearing out and failing at almost the same time.
Or they could just buy a professional backup solution and get proper training operating it plus a maintenance contract. You know the way every real enterprise would do it :D
@@lostintechnology1851 It's a different situation; he said this is non-essential archival footage. The creation of these servers created content, the failure of them created content, and yeah, backing that stuff up would cost a lot of money... so, risk/reward. The best option isn't necessarily always the right option.
That is... a pretty smart idea.
@@entelin Didn't watch the video yet, but it sounds like something RAID 5 would solve instantly, and it would cost them barely any storage with that many hard drives
Or just never use Seagate.
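A toy version of the batch check suggested a few comments up: sort the drive serials and flag runs of consecutive trailing numbers, which likely came off the same manufacturing batch. The serial format here is invented for illustration.

```python
import re

def batch_runs(serials: list[str], min_run: int = 3) -> list[list[str]]:
    """Group serials whose trailing numbers are consecutive (same prefix)."""
    def split(s: str):
        # ("ZA", 1001) for "ZA1001"; serials without a numeric tail get -1.
        m = re.search(r"(\d+)$", s)
        return (s[: m.start()], int(m.group(1))) if m else (s, -1)

    runs: list[list[str]] = []
    current: list[str] = []
    for s in sorted(serials, key=split):
        if current:
            p_prev, n_prev = split(current[-1])
            p_cur, n_cur = split(s)
            if p_cur == p_prev and n_cur == n_prev + 1:
                current.append(s)
                continue
            if len(current) >= min_run:
                runs.append(current)
        current = [s]
    if len(current) >= min_run:
        runs.append(current)
    return runs
```

Run it over the serials of a new shipment before racking the drives; any long run is a hint to spread those disks across different vdevs or swap some out.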
I’m responsible for our SANs at work, and there’s something else that wasn’t touched on in this video: make sure you configure email reporting from your storage nodes! The sooner you’re notified about issues, the sooner corrective action can be taken. Additionally, if possible, keep hardware spares at each site, so if a drive has failed (or is even in a predictive-failure state) you can swap a new drive in ASAP. The same goes for other hardware, such as controller cache batteries; these too can fail, and can do so silently, allowing the node to continue working but with degraded performance.
TL;DR - Keep an eye on your infrastructure and monitor it!
This. If they had been notified from the first drive, this most likely would've been prevented
Please adopt LTO tape backups into your workflow! It's indispensable as a deep storage solution, especially within my field of work (film industry).
I worked for an MSP where they had fired the previous person in charge of backups. I was on the infrastructure team. We found that 65% of our customer backups were no good, and something like 85-90% of offsite replication was failing. It took 8 months before we could return all the backups to normal and reduce the backup-check workload to less than a few hours per week. I spent the first 4 of those months working almost every hour of my workday to get the backups straightened out.
Suffice to say, having an ops team with competent people who are organized and themselves redundant and able to check each other’s work without judgement is absolutely paramount for a team in charge of critical systems.
I personally love working on backups because it’s a silent way to ensure continuity while working with amazing technologies.
As someone in IT I know the pain. One thing I go by is that you don't have a valid backup unless you have tested it. I've had times (granted, back in the day, like 8-9 years ago) where the software would say it's a good backup, but god forbid you actually need to restore, as it would just fail. Very fun times, they are.
Good worker! Providing billable service with no ongoing expenses like "maintenance" or "checking the backups".
-most MSP management, probably
@@Phoen1x883 well, if you can do something on company time to help the company's bottom line... you should?
There are things like loyalty and goodwill, even in corporates
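A small sketch of the restore-testing idea running through this thread: restore into a scratch directory, then compare checksums against the source tree. The directory layout and function names are placeholders, not anyone's real tooling.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks (handles big video files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ after a test restore."""
    bad = []
    for src in sorted(source_dir.rglob("*")):
        if src.is_file():
            rel = src.relative_to(source_dir)
            dst = restored_dir / rel
            if not dst.is_file() or sha256_of(src) != sha256_of(dst):
                bad.append(str(rel))
    return bad
```

An empty list from `verify_restore` after a scheduled test restore is the closest thing to actually knowing a backup is good.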
"I'm the highest ranking person in the company, the highest ranking person in the IT team, and the person who decided not to hire a dedicated IT staff. There is no way to determine who's accountable here" - Linus 2022
Bet he still might not hire one since he 'learned his lesson'. Oh well live and learn!
@Connor Williams but by that logic he will fix what they failed at, not anything else that may arise. If they don't have a full-time or part-time IT person, the same or similar issues are doomed to happen again
@@Dimmers that's my point, hopefully we see a video posted asking for applications soon so this doesn't happen again!
@Connor Williams it would be quite dumb for them to hire an IT specialist because a good portion of their content is working on their own IT systems.
@@KP3droflxp they need an IT specialist to schedule and perform regular preventative maintenance. Otherwise, their team will just fix things when they break like this video.
As someone who works in the IT field for a small company I will be following this very closely. Anything that you guys do like this I absolutely love and try to implement it if it is appropriate for my company.
Most enterprise NAS already do all this for you, for example Synology.
In the takeaways at the end of the video, there was no mention of monitoring. If zfs zed were configured to email somebody/a service desk on events like drive failure, this disaster could have been averted by replacing failing drives one at a time as they failed, instead of accidentally discovering the house of cards your enterprise is built on. Monitoring for failure should have been the most prominent takeaway.
My takeaway is that no system is safe from hard drive failure, and the owner of a system this big should hire someone dedicated to taking care of it.
Thought it was weird too. An email as soon as one drive fails could reduce response time. With the number of drives they're handling, the chance of 2 or more failing at the same time is pretty high.
What about spare drives that automatically rebuild the array when one degrades? Not foolproof, but a good start. For bit rot, more frequent scrubbing?
Even a post-power-outage check, or a weekly job for an intern... oh well. Once is a mistake, twice is a problem, thrice = low-value asset.
My thoughts too. File it under "mistakes were made"; it's the big locker, you can't miss it 😁
Not just configure it, also test that configuration. I've worked at a place where the storage system was set up to send an email in case of pending doom. Problem was, it wasn't configured correctly, so the emails never reached their recipient.
How did they find out about the impending doom? Well, the system also gave off a sound alert and flashed an LED, which were only noticed when I was given a tour of the server room.
I think we all have a lot of cases of 'didn't follow our own advice' in the storage/DR world. Unless it affects your bottom line, backups and DR tend to be lower on the priority list.
And lower on the priority list usually means either "not configured at all" or at minimum "never been tested before" :(
Always test your backups and fail-safes. There's no use in having a backup if it doesn't work at all.
Don't just make backups, TEST your backups
@Nimki rafa 8 shut up bot
As a fellow repair and recovery guy in the SS world, we sell hundreds of drives globally to guys in that very situation. TRUST ME!
new new vault...... " vault 3 " ..... 😇
I always follow my advice for backing up data because there is a simple rule... if you back up your data, you won't need the backup, if you don't back up your data you WILL need the backup.
Backups and UPS should be essentials at this point, lost drives on my old comp cause I had the "bright" idea to use it in the middle of a bad wind storm with only a surge protector.
You know. It's nice to see someone that handles things like an adult, admit mistakes, acknowledge that some failures aren't simple "that person screwed up", and use it to constructively fix problems
@@beermarket9971 If he were handling this as an adult, he would have hired full-time IT a long time ago. This is childish.
@@alias_not_needed Why? It's everyone's own choice how important their data is. If they can live with the loss of some old footage, I see no problem with their actions...
@@alias_not_needed There are plenty of reasons why this is childish in my POV:
For one, you should value what belongs to you and protect it from predictable breakdown, otherwise you come across as a spoiled child.
Second, as a CEO you have a duty to protect and save your employees' work. While accidents do happen, when they are caused by a lack of prevention, the people in charge (or the CEO) come across as childish.
Finally, when a CEO cannot hold someone accountable for data loss (or work loss), it's ultimately his fault and he should just own it. Maybe I missed it, but it didn't quite come out like that.
I don't want this to come across as negative; I like LTT, it looks like an amazing place to work, and I admire Linus. But this is frustrating to watch...
Mate, I really appreciate the honesty of this video. Eating humble pie in order to educate your viewers shows real dedication to your mission.
I am very interested to see how the recovery process goes. As someone who has only ever done disaster recovery in the realm of terabytes... yikes. Good luck friends.
Damn, these bots
#YouTubeKilledTrustedFlagging
these bots out here calling youngboy "extravagant"
What I don't understand is why auto-mod isn't capturing them (whenever I post a link, 90% of the time my post gets auto-modded and disappears)
@@marcogenovesi8570 both are problems. One is not higher than the other.
HR meeting with Linus: “All our data has been lost, i’m gonna fire someone…
But not before i fire up our segway to our sponsor…”
🤣🤣🤣🤣
*Segue
Linus is the only person whose adverts I enjoy. Angry Joe started putting tonnes of effort into his, but they are so forced!! I think Linus actually gets a kick out of saying LTTSTORE where it's crowbarred into something, lol. I know I do, but not as much of a kick as the coffee in this LTTSTORE FLASK WILL HAVE.
Linus dude, over all the years I have watched you I don't think I ever credited you properly. Well done man, this thing you've all created is really cool :)
@@Avendesora Except, of course, it's totally wrong to use one of those words for the other. Unless your server room is so large that you must use a Segway to get to the sponsor.
Gotta make up for that loss of money somehow.
“This caused the array to offline itself to prevent further degradation”
…Been there, array. Been there.
Suicidal array much
@@Harperion mnooo
@@Harperion if i faulted millions of times, i'd probably suicide it too. 😂
Bean there
First off, I am sorry for LTT about the data loss. Secondly, I am glad it wasn't "active" or current data but rather old YouTube videos, and those can be recovered (though only the uploaded videos, not any extra material/footage you had stored). Good luck on the project!
Big mistake to immediately replace the drives that weren't even dead and had only shown some failures. By removing them, LTT threw away all the (still good) parity data on them. They probably should've run a scrub first, and then removed the possibly malfunctioning drives.
Wouldn't that take a long time tho?
Yeah I was thinking the same
I'm pretty sure those wouldn't survive a scrub either.
@@AyoKeito we wouldn't know that for sure, but we do know they didn't survive the replacements
If they are offlined, their parity data is already dirty.
Ahhh, the reason I originally subbed to LTT, insane server builds and configs.
Insanely bad and mismanaged server builds
@@theairaccumulator7144 the point still stands
And lots of dropping expensive hardware
Well this "presentation" format certainly has a different energy to it than Whonnock died.
The vault is hardly the beating heart of the company that whonnock was, and sounds like this unfolded over the course of days and weeks as Jake found the issues and they are still working on rebuilding the data.
The vault is just archive data. Whonnock is the in progress projects, and I think at that time Linus said there was no backup.
Linus being so calm while talking about one of his/their biggest oopsies is so cool 😄
10:30 and most importantly: monitor your environment! SNMP, syslogs, and even specialized monitoring agents are an easy way to do it.
PRTG has entered the chat
The irony is that they advertise these products in their segues but don't implement them, it seems.
@@towel2473 I was going to say, did they not have Pulseway deployed :)
Rather messages from SMART and HBA utilities.
At that density, and with how infrequently the older data is updated, you really should consider acquiring a tape library. A couple of iSCSI targets and a 250-slot LTO library would keep you going until you more than double your current use. But considering the increasing file sizes of the raw footage you're ingesting, I would recommend going for 3-3.5x scaling.
A 250 slot library for what? 3 PB of direct access tape storage?
Tape is slow. I think the whole point of their setup is fast access to footage, new or old, for editing purposes. If they were just hanging on to it for keepsake then tape is an option, but I think they keep it so they can retrieve previous footage on demand to splice into the current video being edited.
@@grrkaa8450 Tape is slow, but much cheaper per TB.
Typically you would have a hybrid system where users interacting with the data would hit high speed disk storage of some sort, and that disk storage would be running software that would migrate copies of files, or just less accessed files to tape.
It's effectively the best of both worlds, users have the speed and accessibility of high speed storage, but the high speed pool is much smaller, and most of the archival data is on less expensive tape drives. The only time you hit a slow down is when a user has to access the stuff on tape which would be normally pulled when the user accessed a stub file representing the file on the disk pool.
@@killer2600 so do both? use tape as a economical backup option.
@@grrkaa8450 Why keep the data in hot storage at all? Archive to tape (not backup) toss it in a fire safe.
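Rough arithmetic behind the tape sizing discussed in this thread, taking roughly 1 PB of archive, LTO-9's 18 TB native (uncompressed) capacity, and the ~300 MB/s streaming figure quoted above. Real jobs add load/seek overhead, so treat these as lower bounds.

```python
import math

DATA_TB = 1000      # ~1 PB archive (assumed round figure)
TAPE_TB = 18        # LTO-9 native capacity
MB_PER_S = 300      # streaming rate quoted in this thread

tapes = math.ceil(DATA_TB / TAPE_TB)            # tapes needed for one full copy
hours = DATA_TB * 1_000_000 / MB_PER_S / 3600   # pure streaming time on one drive
```

That works out to a few dozen tapes but on the order of a month of continuous streaming on a single drive, which is why the dual-drive robot and incremental-only runs mentioned above matter.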
I've been through a number of mergers and acquisitions over the past 10+ years. On every single one the IT dept/employees who do IT tasks for the other entity have been running without viable backups, server monitoring, out of band management, or alerting. Most also lacked UPS units (or working UPS units), and one was even running RAID0 on a production server and couldn't figure out why it kept failing on them. It's a scary world out there.
Very practical advice: store your "old" archival data (like photos) on a hard drive that is not connected to power or a server. Use other cloud storage all you want, but keep one disconnected, low-tech option.
Suddenly I feel smart for keeping my backup drives in an antistatic bag unplugged
A data *recovery* policy abides with this: "The only 'known-good backup' is one that you *have* successfully restored." 😀
There's even the question of whether the old Premiere projects are still loadable in current software versions...
Former Data Protection Product Manager here for some 30-40k servers at my old job: Yes. :)
This is the way
When discussing redundancy, one is none, two is one. That's how I discuss backup options with my clients. If it's mission critical you need a layered backup system.
As an operations engineer, the number of red flags raised by the process you followed here was terrifying. Please write processes for this sort of stuff and test them; it's all fun and games until you lose something essential because of a stupid decision from 5 years ago
Not to mention they used Seagate drives, which are just completely unreliable. I wouldn't trust them in any circumstance. I've replaced hundreds of Seagate drives due to failure, but only a handful of WD/Hitachi ones. It isn't surprising, as Seagate purchased the worst hard drive company that ever existed, Maxtor. And they didn't learn their lesson, they got even more Seagate drives.
Why anyone trusts this guy for basically anything is beyond me. Lol.
@@williameldridge9382 I've had a different experience. I'm still rocking Seagate and WD drives, while all of my Hitachi drives from the same era died. Not sure right now though.
seriously, its hard to watch
@@williameldridge9382 Hard drive manufacturers have all had bad batches; it's just the nature of the beast now. I have had failures from all brands. You should see hard drives as consumables (especially in a storage array): run SMART and replace a drive when its health is detected as bad. The bigger issue is people not doing backups; that's a failure on you and your users if you don't enforce it.
Please be server room related. I’ve been craving some of that content recently
Me too man!! Also ^s/o to the milfs in the 20 mile radius comments ahah
yeah we all wana see his server ;)
I started following Linus by server content haha
You just have to wait until something goes wrong and boom new server content!
Maybe we pay seagate to send Bad drives, so we get new content sooner?
Sounds like a good, reasonable idea.
Wait until Seagate fails again in Vault 3, then we get Vault 4, and so on until Vault 76, and that marks the end of Seagate.
Mad props for coming out and saying you guys screwed up. All of us can learn from this and hopefully not lose any data of our own.
I'm just wondering why LTT didn't go for tape storage for their servers, since, as Linus said himself, it was for archival purposes and more of a fun project to test out the tech they got. They even got a tape drive some time ago afaik. It doesn't make sense to keep the drives spinning for years if they are not actively used or maintained.
Basically this, it was the first thing I thought of. If the archive data never changes, tapes are a crazy cheap way of backing up old videos.
Because they probably want quick access to it, I think... to cut something out of an old video and things like that? As far as I know, tape storage doesn't give you that luxury.
@@Stasiek_Zabojca You could store lower-bitrate stuff on fast storage for browsing and only get the tape out when you need access to the original files.
Tape storage is cost competitive on the level of multiple petabytes, not single petabytes.
So it's nothing that any significant minority of viewers will ever see in person, let alone be part of decision making process to buy, install or configure.
Because he'd rather have "dope hardware" instead of using tape. If they need access to it that's fine, every week or month or time frame you do a fresh backup to tape and keep your servers running for access and have tapes as backup. He failed to implement backup in depth which is basically industry standard.
Archive is not backup. Redundant and separated storage of data is backup.
Love your show! Just wanted to chime in here coming from an IT background supporting large companies in datacenters as well as being a content creator. Trying to maintain an accessible RAID of ever growing content only gets more difficult and expensive over time. You will eventually need a full time employee to manage your content if you go this route and at some point you will need to migrate your entire content to a new RAID when 1 petabyte isn't enough anymore and that's not going to be fun.
The alternative cheaper and simpler solution is to archive your content to tape which will have a much higher chance of surviving the years to come as it's not on spinning platters that run 24/7. Yes, getting access to a piece of content you want to grab on short notice will be more annoying but you can always keep a smaller RAID with your completed videos and archive your raw content via tape as it's the RAW video content that really eats up the TB which is why you might want to consider archiving your raw video.
just build a giant data center that uses robots to automatically fetch & read tapes, so it's at least automated, even if it still takes half an hour. Building a data center is probably also great content for the channel ^^
/s
@@pixelmaster98 yea until crash override and acid burn have a hacking battle with your tape robots
Ok no one asked
This response from a pro is why you'd leave a job like this to pros. As this pro demonstrated, step #1 is identifying and understanding the requirements. Do you really need all your old content available online, or is offline good enough? Then build the solution to fit the needs.
This could also be done with AWS Glacier
In my first months as a sysadmin I learned a lesson: always keep a secondary backup that isn't on-premises. Power can go out, and you'll get a few bad sectors on your drives. But if there's a fire and your server goes with it, all of a sudden giving a few bucks to Jeff Bezos doesn't sound like that bad of a deal after all.
Yeah, not paying for cloud storage basically confirms that they wouldn't cry to sleep if they lost the whole lot. Which is a reasonable decision since it's not mission critical data.
They had a remote server in a previous VLOG a year or two back. I wonder what's up with that?
Heck, anyone with a 5-or-more-user Office 365 tenant can get "unlimited" OneDrive backup. Yes, it's slow to back up, and yes, it's full of details like 25 TB SharePoint sites that you have to subdivide, but it IS unlimited for very cheap, and it's an offsite backup.
@@tpmeredith No such thing as "unlimited", it just means "we haven't written down a hard limit". 720TB would definitely be knocking on that door!
I think they've covered this in the past, and the problem is that they just have so much at this point that the upload will take forever. But that doesn't mean you're not right. If anything, they should do it _now_ because every day they wait is going to just be more they have to upload. I'm sure there's something out there that will just start uploading everything in the background until it catches up.
Also, IDK if it would be only "a few bucks" for the amount they need. IDK what that kind of enterprise-level storage costs, but it's probably not cheap, and I'll bet that even on "unlimited" cloud storage plans there's a catch in the fine print: some way of restricting the storage in practice, like throttling upload bandwidth past a certain amount of data to such a slow rate that they could never upload faster than they create new data...
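Quick arithmetic behind the "upload takes forever" objection in this thread, using the 720 TB figure mentioned above; the sustained uplink speeds are assumptions.

```python
def upload_days(data_tb: float, mbit_per_s: float) -> float:
    """Days to push data_tb terabytes over a sustained mbit_per_s uplink."""
    bits = data_tb * 1e12 * 8           # decimal terabytes to bits
    return bits / (mbit_per_s * 1e6) / 86400

# 720 TB at a full, sustained gigabit:
days_1g = upload_days(720, 1000)   # a bit over two months
```

And that's assuming the provider never throttles, nothing new gets shot in the meantime, and the link isn't needed for anything else, which is why seeding by mailing drives (or tapes) exists.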
I don't really think this was an issue of not having a "tech person" or "not having time" to set things up. It was simply an oversight. Setting up scrubs and SMART alerts doesn't take long, and you certainly don't need a full-time person sitting around waiting for trouble notices from monitoring applications.
Hey Linus, 2 best-practice recommendations I didn't hear you state, but which would be very important. 1) For every 24 disks (on average) you should have 1 hot spare. This drive should be in the same HA zone as the 24 disks, so if a failure occurs or a scrub detects errors, the rebuild to the spare can start automatically; this buys you time to get replacement equipment without having to worry about your data while you're purchasing. 2) If you cannot do full backups, such as to a cloud or a dedicated location, the next best thing is to ensure your data lives on 2 different technology solutions. As this is entirely archival and you're not worried about location protection, your second system could be an always-on VTFS (virtual tape FS), or the data could just be streamed to tape backups with 1 guy pulling the tapes about once a month. Tape is rather inexpensive and has a great shelf life. I've been doing IT storage and data protection engineering for the last 10 years, and customers in your position, without dedicated staff but gathering increasingly large data sets, are sadly all too common.
They're using Seagate drives, so for every drive they should have a hot spare ;) No, seriously, I'd go 12D + 1HS at least. The way they manhandle their servers, I'd say VTFS is prone to a hilarious video with a lot of grey confetti :D
Offsite rotating backups?
@@brucepayan2845 that would be ideal, but in the video they said they couldn’t afford offsite backups.
When your downtime and data loss is measured in lost $, hiring full time systems engineer becomes a very attractive value proposition.
I don't think any of this will produce downtime for his company. The petabyte of data he may lose is, as he said, just a "nice to have". It's not the actual production server where they store current projects and videos. His employees may not even know about this data loss.
Thank you for this. I've never scrubbed my ZFS pool because I didn't know what that meant. I now have it set up to do it monthly and am running one as we speak. 5 hour estimate for completion
Update: no errors on all drives of my 8 drive Z2 array. Awesome! Took about 8 hours and they're 2 tb drives.
It's amazing that a company in the tech field can take such a YOLO approach to backups and still be credible to some.
Because it's frankly unimportant for their company. Get that stick out of your ass, bud.
"We lost a sh*tload of video data, lets make an educational video about it" - Most Linus thing ever
Let's set this straight: there are more backup options than local spinning disks and cloud storage. The cheapest would be an LTO tape library. An LTO-8 tape (12 TB of uncompressed storage) is about 50-100€, which is only a fraction of the cost of spinning disks. They're also archival grade and can be labelled and stored on a shelf somewhere. As their backup files don't really change, you could just put a few projects on one tape and chuck it in the warehouse.
Yeah he's done a video about tape storage before
This is not the logic channel
You must be old :-), like me.
Yep, the system my team and I designed included LTO with 2 robotic libraries. Archival data doesn't belong on a hard drive.
While I agree that the archive shouldn't be on spinning disks, long-term storage of tapes is an issue in itself. It requires regular maintenance, climate-controlled warehousing, and copying every few years. I work in broadcasting and I have only seen deep archives done correctly maybe once in my career. I have quoted archival systems several times, and the face customers make when they see the numbers, and are then informed they don't include any recurring, ongoing operational costs, is always funny (not really).
Up to 80tb myself and needing more soon....! This hoarding raw footage is a nightmare 🤣
hey i watch your cities skylines videos. hope your day is going well. much love
Hello everybody and welcome back to the next episode of fix my NAS.
Consider cloud, or tape based backups that you mail to a trusted friend or put it in a safety box at a bank.
@@Briceronie hi, thanks 😊
@Malaclypse The Elder yes, that'll be me soon lol 😆
Sure sounds like a good time to make a video about how tape drive systems aren't as obsolete as many might think and maybe even get yourself a super cool tape robot!
You could also dig into data reconstruction/recovery software to see what you can pry out of the drives you've pulled and maybe try out the old "HDD in a freezer" trick.
There you go. Two new video ideas (that I'd love to see presented by Jake and Anthony respectively) to hopefully recoup some of the costs of this oversight.
I used to think tape drives were old... I recently saw a tape drive that holds terabytes or something... guess I was wrong
@@mrmotofy LTO-9 is 18TB/tape.
(oversimplified) Summary: The power dropped out a bunch of times and LTT dropped the ball on configuring the servers so the servers dropped a bunch of errors before dropping physical drives out of the servers resulting in the servers permanently dropping some data... I see a familiar pattern here.
HOLD IT!
I'm not sure what you are getting at.
At least they get to drop a new video about it.
Would've been nice if (Mass)Drop sponsored then as well
Linus drop tips
how are we supposed to trust linus' tech tips if they keep dropping the ball :(
But at least they show us!
All techs: "Follow this advice!"
Those same techs: "YOLO"
Ay gotta know the rules before you break them
To be fair it's more "follow this advice if X", and then the same techs don't really have X. He basically said that at 9:27
Ok but we all know linus has said before "do as i say, not as i do"
"Do as I say, not as I do".
the perfect opportunity for testing out tape backup! I had 4 HDDs fail at the same time in my RAID 6 storage server with total data loss. I recovered all my data from my tapes! It was only 80TB of data, but when it comes to price for large backups, tape is king!
Tape is so underrated by so many people. It's such a great choice for storing a shitload of data for long periods of time.
"only 80TB" bruh
@@heavyq problem is the drives cost a shitload...
Thanks for doing this video. I'm sure this made a LOT of people go back and check their home servers, or the servers they support, to make sure they are not vulnerable.
Considering a lot of this is for cold storage, it would be neat to see you implement a tape drive for this use case. They would store a ton of data, pretty cheaply, and safely. Also something many people don't even know is still in use
He does have a tape drive, but he'd probably need an autoloader and a bit of backup automation 🤔
yep, tape is very viable, especially if it's only accessed a few times after the final video has been uploaded to YouTube like Linus said.
Tape backup burns less coal too, and it's immune to blackouts.
@@florabee9283 Also electrical problems. For small concerns like myself, I've tried tape but disk is just easier and cheaper and tracks needs better. For Linus, tape is definitely well worth a look.
I've made the case here many times that storing all footage as raw data is overkill. Not really surprised this strategy failed; very sorry to hear it. It's not like you can simply back up a few petabytes to another machine. So yeah, tapes. Maybe even Amazon Glacier? At least I would have made a second backup tier to store compressed data. Another option would be to store finished renders in max quality on Blu-rays. That's still a lot better in case of a permanent loss. And a lot cheaper.
There's another type of storage for enterprise who needs a lot of storage. LTO is a lot better when it's too much data like you have
Yes.
We store over 70 PB of archive data on tapes.
They have their own failure modes though, but overall it's a good solution.
We once had a tape robot arm out of alignment, and it knocked a lot of tapes out of the storage.
Yes, if it's for archive, why keep it on actively running drives...
It's the clear solution. For a normal user an LTO drive is expensive AF, but in his case it'd be cheap compared to a server... and the tape cartridges are very cheap for what they can store... he could even keep duplicate cartridges of all his data. Instead he insists on buying bigger and more expensive hardware which is more complicated to maintain and has many more points of failure.
Which is funny, because I thought LTT had tape backups after the... third(?) server crash. Linus did a full review of the LTO-8 thunderbolt dock.
Excuse my ignorance, what is a LTO ? (I'm too lazy to search on Wikipedia lol)
Oh my gosh, they had no automatic scrubs and no automatic e-mail notification when a drive fails? That's absolutely necessary maintenance basics for ZFS...
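On the "no notification when a drive fails" point: even without fancy tooling, a cron job can parse `zpool status` output and mail out anything that isn't ONLINE. A minimal sketch; the sample output below is illustrative, not captured from a real system:

```python
# Minimal ZFS health-check sketch: scan `zpool status`-style text and
# flag any pool or device whose state column is not ONLINE.

def unhealthy_devices(zpool_status: str) -> list[str]:
    """Return pool/vdev/device names whose state is not ONLINE."""
    bad_states = {"DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}
    bad = []
    for line in zpool_status.splitlines():
        parts = line.split()
        # device table rows look like: NAME STATE READ WRITE CKSUM
        if len(parts) >= 5 and parts[1] in bad_states:
            bad.append(parts[0])
    return bad

sample = """\
  pool: tank
 state: DEGRADED
config:
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED      3     0     0
"""
print(unhealthy_devices(sample))  # a cron job could e-mail this list
```

In practice ZED (the ZFS event daemon) can do this natively; the point is that even a ten-line script beats finding out years later.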
I wish LTT luck on restoring their data!
I wish them luck too, but all the advice they give about how important it is to back up your drives and keep multiple backups, they don't even follow. Is it just "we'll fix it later", or is the cost not a justifiable reason?
Also funny as they had multiple videos with sponsors like Pulseway where they brag about having everything monitored (so I guess they don’t use it… or didn’t configure that either…)
Y'all spend A LOT of money on data redundancy; how about allocating "a reasonable amount of money" to redundant power backup strategies? Generators, solar panels, an enterprise UPS with some SLA battery banks, or a nice LiPo/LiFe array. Buy yourself some time, with a big enough buffer for power outages. Do an energy audit of what absolutely must never lose power, and consider your options. Automating your alternative power sources, or even offloading your grid expenses with alt energy, would pay off in MANY ways. You have a roof on that building: load it up with some panels. It would make a supreme video series as well!
LTT, please look into implementing an LTO-8 tape library as a proper backup to your network pool! Tapes are so much cheaper than drives, and are the preferred format for long-term archiving. The tape robot and archiving software would do all the hard work of keeping track of the data.
I was about to suggest the same. Newer (1-2 years)/more frequently accessed video goes to the petabyte but classic ones go to tape.
They only talked about it 3 years ago. ruclips.net/video/alxqpbSZorA/видео.html
the one-time tapes would actually make sense; the only downside is you'd have to pay for an application to read/write said tapes (e.g. Commvault). But that isn't all that bad.
Tape is definitely a good way to go, especially with a tape library. As for applications to write to the tapes, there are some powerful open-source ones such as Bacula but it might take someone a bit of time to get it up and running.
you LOOK like an LTT employee.
@@VampyWorm Yep, but a BackupExec copy with 1 agent would be in the hundreds or very low thousands USD / yr. Tech support included :))) 46+2 LTO8 tapes would absolutely rule LTT. Just done a 4 drive, 2 autoloader, 2 libraries implementation, it took 3 people around one week to fully set up copy & backup jobs, I'm very impressed with the results!!
Please don't use 15 wide vdevs. Groups of 6 wide in raid-z2 is a good choice for spinning rust (4 data + 2 parity). As a zfs user for 10+ years, I cannot imagine running multiple 15 wide vdevs.
Really wide VDEVs are only OK when using SSDs or low capacity HDDs. The rebuild time on a 12 drive VDEV of 12TB drives is insane, and the stress the other disks are under during that period can easily cause one to fail. 6-8 drives on a RAIDZ2 seems to be the sweet spot for large drives, maybe 9 drive RAIDZ3 if you're _really_ paranoid.
EDIT: I'm also saying this as someone who's running 8TB drives in 9 drive RAIDZ2 VDEVs. I have plenty of slots for more drives, so I'm sticking with 8TB drives for the time being.
Let alone multiple 15 wide vdevs in raidz2! Even worse. Then 4 of them in one pool? Of course that data was a time bomb.
More than 6 raidz2 using 20tb disks sounds a little edgy. I would require disks rated 1 error over 10^17 bits for that.
15 is objectively scary with raidz2. 10 with adequate replication or backups would already be edgy.
With raidz3 maybe 15 is not crazy but you might want to upgrade the pool at some point with 40tb drives or more, if they ever come out. Which would be totally nuts.
11-wide z3 vdevs would be the most I'd be comfortable with regardless of SSDs / rated error rate. But once at 11-wide z3, why not go (2x) 6-wide z2? One extra drive, one extra parity, more striping (more performance), more flexibility in adding / removing / replacing devices. It's all a balance between redundancy, space efficiency and flexibility. To me: 6-drive z2's, and just multiply as needed. Let's think about the worst case. For a 6-drive z2, you lose 2 drives and have a 4-drive "RAID 0" to deal with until redundancy is restored. Not great, not terrible. Email alerts, etc. But a 15-wide z2? No email alerting? 2 drives die and you get a 13-wide "RAID 0". Good luck.
@@ryan0io exactly. 100% right. Especially with a linus budget lmao.
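The numbers this thread keeps trading reduce to two lines of arithmetic: how much raw space RAID-Z2 returns at a given width, and how wide an unprotected stripe is left after two failures. A quick sketch:

```python
# Back-of-envelope for the vdev-width debate: space efficiency vs. the
# width of the zero-redundancy stripe you hold after two RAID-Z2 failures.

def raidz2_stats(width: int) -> tuple[float, int]:
    data = width - 2                 # RAID-Z2 spends two drives on parity
    efficiency = data / width        # fraction of raw space that is usable
    exposed_width = width - 2        # stripe left with no redundancy at all
    return efficiency, exposed_width

for width in (6, 10, 15):
    eff, exposed = raidz2_stats(width)
    print(f"{width}-wide z2: {eff:.0%} usable, {exposed}-wide stripe after 2 failures")
```

The wide vdev buys roughly 20 points of space efficiency over 6-wide, and pays for it with a much larger blast radius during the (much longer) resilver.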
I... I strongly believe he needs to hire a full-time IT person to manage and do preventive maintenance on those data servers
On WAN show he said he thinks about it, also because of the Lab
These videos about your big fuckups are by far the most informational and educational videos on your channel... I have a little checklist of shit not to do when I set up a storage system, wouldn't have heard about these pitfalls anywhere else.
Bro, hire a dedicated sys admin. You have too many employees that rely on your server infrastructure to yolo everything yourself. You mention that you, Anthony, and Jake work on it, but they also are writers. You have enough data and infrastructure to warrant a dedicated and experienced sys admin at this point
I wouldn't want that job. They'll go behind his/her/their back at any opportunity anyway, because "it's faster that way" or "reasons". The way LTT grew, the IT-guy job is a surefire way to get PTSD now ;) No way they can establish any structure now.
@@peterpain6625 they know enough to be dangerous
@@outofahat9363 They know a lot in some areas and go full Dunning-Kruger in others ;)
I would love to know why tape backups aren't considered. It seems to be one of the more economical options and is great for archival. Also, as a photographer who works with tens of terabytes I would love to learn more about tape backup.
As an actual IT professional: learn about it from literally anywhere other than YouTube.
It would probably be insanely slow if they ever wanted to use the videos to edit from
@@Lexan_YT theoretically they'd use it to reinstall on new drives and use the tape backups as backups and not main drives
@@Lexan_YT Backup is not main storage. With Dell Powervault TL and IBM Spectrum we are achieving 1-2Gb/s write and read speeds. So restore of that data isn't that insanely slow.
Wow, I just checked, LTO-8 standard goes up to 12 TB per cartridge! It's very interesting!
All of this was very patiently and thoroughly explained, except for one thing: what happened to that LTO-8 drive you were planning to put into service years ago?
I thought that they should have tape back up to
@@jayred8289 I mean, how does such a big company with so many resources not have a 3-2-1 backup, even if it's just raw data? It's not like they're short on cash, are they?
It was probably a review unit sent to LTT just to make a video and not something they were actively going to implement.
@@lolish1234 Because it's not ridiculously important data, Linus even says in the video that half the reason they bother keeping it around is because they can make interesting videos on it. I wouldn't be surprised if the eventual goal was a 3-2-1 backup system but they wanted to cover setting up each stage in videos which kept slipping cause LMG is pretty busy until we get to today. A lesson into why businesses with large data needs should be hiring their own IT guy.
@@TheDemocrab Setting up a cable testing lab and acquiring more space is more important than building a 3-2-1 backup system?
Yes, "in hindsight" everything is easy to judge, but assuming Linus sets his priorities straight, he literally has more issues with monitor cables than with his raw video archive.
that is exactly the reason why I stopped building my own storage servers and got my first Synology like 10 years ago!
Obviously I have far less storage demand (I've got 4TB of triple-backed-up data and 25TB of nice-to-have original videos and RAW photos backed up once). All secured via parity, auto-scrubbing, snapshot deduplication etc. I've never run into any issue, and I've distributed more than 20 DiskStations among family and close friends, to people with far less IT know-how than me... and I'm a different kind of scientist with ok-ish hobby IT knowledge.
There is no way on earth I could build something half as reliable and convenient as buying a Synology or maybe QNAP and putting another one up as a backup at my parents' place!
With how large these drives are, I would really recommend going with Raid-Z3. I'm not saying larger drives fail more often, but rather resilvering a vDev with large drives takes INSANELY long. And resilvering hammers the remaining drives. Raid-Z1 and Z2 were great with like 2-8TB drives. 20TB? Not so much.
Finally somebody making sense.
This. IDK about the specifics of the different RAID configs, but I do think that it makes a lot more sense to have more smaller drives so that if and when something fails, it has less of a chance to wipe out _everything._
Definitely. Same reasoning why RAID6 isn't considered "good enough" in large drive arrays any more, either.
**cries in RAID rebuilds** Seriously though how is ZFS not a widely adopted standard of storage?
@@Kevin-jb2pv More smaller drives can cost you a lot more though. You need twice as many servers, twice as much space (and also more power and cooling, though that's not much of a concern here). But if a drive has twice the capacity at nearly the same speed, it's probably appropriate to think of it as two drives in terms of redundancy needs (2 out of 15x10TB is fine; 2 out of 15x20TB is like 2 out of 30x10TB, which is risky)
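A rough lower bound on the resilver times this thread worries about: even if the replacement drive could be fed at full sequential speed the whole time (assumed 200 MB/s here; real rebuilds are slower because the pool is still serving I/O and reads are scattered), a 20 TB drive takes over a day:

```python
# Best-case resilver duration for a single replacement drive, assuming an
# uninterrupted sequential write rate. Real-world rebuilds run much longer.

def resilver_hours(capacity_tb: float, write_mb_s: float = 200.0) -> float:
    bytes_total = capacity_tb * 1e12            # decimal TB, as drives are sold
    return bytes_total / (write_mb_s * 1e6) / 3600

for tb in (4, 8, 20):
    print(f"{tb} TB drive: >= {resilver_hours(tb):.1f} h to resilver (best case)")
```

That best-case day-plus window is time the surviving drives spend under heavy load, which is the core of the RAID-Z3 argument above.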
Linus, I used to tell my loved ones "there are two kinds of people in the world: those who have a backup and those who wish they had". I used to work in storage rack support and I've seen the worst of the worst, including a 24-hour-straight marathon to restore a super critical one. But I've also seen a storage rack with all the capacitors blown due to a lightning strike that fried a little unprotected datacenter.
so... are you hiring an IT fulltime person now? :P
lol this guy is like, "Where do I send my resume?"
@Telleva You deleted their data, and blamed them for not backing it up..?
Actually there are three kinds of people in the world: those who have a backup, those who wish they had, and those who check that it is actually possible to restore data from the backup. I mean that lots of companies think they have backups, but they've never actually tried to restore from them, and it's possible their "backups" are not recoverable. Just try to restore data from your backup and you might be unpleasantly surprised.
@Telleva I have my stuff saved on icloud and Google photos
When backing up, always remember 3-2-1: 3 copies, 2 local, 1 remote.
Another important thing not to forget, Raid is not a backup.
The most basic rule of sysadmins
This was just on my mind as well. 3-2-1 - I do it at home as well.
@L. Kärkkäinen You're right, However, this could be mitigated through tapes. They are actually ideal for this kind of data, as video files are sequential data files. Tape is also archival class, meaning they should not suffer from bit rot over time when stored properly.
if they need old footage from years ago, they can grab the tape from archive, and it should seek and fetch the data off relatively quickly.
Tape also solves the offline problem, as they should only be loaded when writing new data, or if you intend to retrieve it.
Why is raid not considered backup? I was considering using a 2 drive raid synology nas for my desktop files, and possibly copying that data to a cloud provider like wasabi as well. Is this not a good solution for “safely” storing my crap?
Yeah, tape is one of the best solution to offline data storage. It is "old" tech, but it does the job. For personal use i have cloud for archives, but for larger businesses a tape library is a nice touch. Only problem is the software, it can be high priced.
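The 3-2-1 rule quoted at the top of this thread is easy to express as a checklist. A toy sketch; the rule is usually stated as 3 copies, on 2 different media, with 1 copy off-site, and the inventory below is hypothetical:

```python
# Toy 3-2-1 checker: >= 3 copies, >= 2 media types, >= 1 off-site copy.

def satisfies_3_2_1(copies: list[dict]) -> bool:
    media = {c["medium"] for c in copies}
    offsite = any(c["offsite"] for c in copies)
    return len(copies) >= 3 and len(media) >= 2 and offsite

inventory = [
    {"medium": "zfs-pool", "offsite": False},  # primary server
    {"medium": "tape",     "offsite": False},  # local tape set
    {"medium": "tape",     "offsite": True},   # tape copy in another building
]
print(satisfies_3_2_1(inventory))
```

As the longer reply above notes, passing this checklist is necessary but not sufficient: you still have to test that the copies actually restore.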
Thank you for sharing, Linus. This is a sobering heads-up video for all of us who seek future dealings with our own DIY servers. Peace.👍
The whole way through, I could not stop myself saying "shoulda got a small tape library and backed up to that". It's very cost-effective, especially compared to the options presented around 9:20. Modern LTO tapes store tens of TB per tape, and with LTT's connections, swinging a library, a couple of drives and a full suite of tapes should cost no more than a few months of cloud storage, while not hurting the power bill: even our small library here at home consumes at most 350W total for the controller shell, expansion shell and all drives + gantry.
He also missed on the fact that things like Deep Archive at AWS are answers to this for around $1/TB-Month. Yes, you pay to retrieve it, but in reality, you are rarely ever going to. It is a vault of last resort.
So it is doable for $1k-$2k a month. Time to do the cost-benefit analysis with more correct values vs the on prem tape vault.
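Putting numbers on the Deep Archive suggestion above (the ~$1/TB-month figure is the comment's ballpark for AWS's archival tier; retrieval, request and egress fees are deliberately ignored here, and they're the real catch):

```python
# Rough at-rest bill for a petabyte in an archival cloud tier.
# Price is the approximate figure quoted in the comment, not a current quote.

PB_IN_TB = 1000                 # decimal TB per PB
price_per_tb_month = 1.00       # USD, archival tier ballpark

monthly = PB_IN_TB * price_per_tb_month
print(f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year for 1 PB at rest")
```

That is the "vault of last resort" economics: cheap to hold, expensive and slow to pull back out.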
Yea offline tape drives seem like the answer to this issue, can even have boxes of tape offsite
And recovery would be measured in years. For something this large, tape isn't practical.
@@Herlehy He already answered that question though. None of this is mission critical, and YouTube is literally providing cloud backup for all the videos, and paying his company to do it!
Our studies concluded that a tape backup fails about nT/4 of the time, where nT is the number of tapes in the backup. If you recover a tape backup that involved more than 2 tapes, you're already at a coin flip.
Tape is delicate and requires very careful storage to even work 3/4 of the time.
Each additional tape adds another chance of failure.
Petabytes of data on tape would take literally years to back up, years to recover and have a virtually 0% chance of recovery.
You'd hope that Delta backups would make it more efficient, but they only complicate matters further, sadly.
I would love to see a follow-up on this with how much data was saved and how much was lost. Which videos now exist only on YouTube, and how much can they still refer back to?
Moral of this story: hire an IT specialist already, Linus.
Linus: "I am the IT specialist."
^^^^^^^^^^^^^^^^
They can very well afford midrange EMC or NetApp storage that would be more stable and perhaps as performant as these toy setups.
@@ticler
They can rot just as badly as the "toy" storage does; all it takes is nobody paying attention. And where would the many hours of fun content about it go?
@@ticler THANK YOU *HUG*
Anthony is more than a writer and IT person. He is the true face of LMG, and my hero.
Having such large RAID groups (15 drives) without any hot spares or replacement routines, on large drives, seems rather dangerous as well. If you already have two dead drives in a vdev, it's not that unlikely that you will lose a third during the resilver.
Anyway, LTT's IT infrastructure has always been a bit of a dumpster fire, but maybe they do it intentionally because it results in a lot of great content 😅
I wonder if they have thought about connecting an 84 drive SAS expansion to their ssd tier and just have old data migrate to spinning drives (I think seagate has a rebranded dell box if they have a partnership with seagate).
Isn't a 15-disk RAID-Z2 vdev a terrible idea regardless of whether you have hot spares?
"lose"
LMG seems like a company where everybody does everything and that can work to a degree if you have just a couple of employees but it's a disaster when you have a bigger business to run.
@UnjustifiedRecs I don't understand how you easily lose track of a server that should be sending out notifications to someone that drives have died.
@@ericwhite265 very true. just about every commercial nas software has some notification system for when a drive goes down. you shouldn't have to audit the system to find that there are several that have failed.
"With great power comes great responsibility"
Watch it Linus, we all know what happens to characters who say those cursed words.
With great comments comes great botsibilities.
Didn't know linus had a kid named Peter Parker
Prime,
At least 3 pro-establishment bots were trained to oppose yours and my viewpoint.
🤣
Anyone else seeing these annoying bots everywhere
i will refrain from saying anything negative because i appreciate your honesty.
I can’t even imagine building such massive storage servers and then never running a scrub or even manually checking the disks, wow. I have a relatively tiny home server with like 80 TB of storage and I run monthly scrubs, manually verify disks constantly, and make regular cold storage backups.
I've been working IT for the past 5 years, and never scrubbing our drives or verifying disks is unthinkable. LTT need to hire an actual IT guy, not just tech enthusiasts.
"Relatively tiny" my buttocks... Also, kind of a dick move bro. I think they realise they made a mistake.
@@drizzle8309 yeah a mistake like never checking your fire extinguishers still work...ever...on a 100 story building.
@@deViant14 lol did you watch the video? none of it was critical
also, they would've made a video saying "we should've checked our fire extinguishers", to which OP then would've replied that he always checks the fire extinguisher in his "small $3M villa"...
@@deViant14 well, in the situation you're talking about, lives are lost. So that comparison is a reach at best.
Linus: "We're not sure who's accountable here, so I'm considering hiring someone to be accountable because the situation is currently untenable without an appropriate system of blame in place."
Sounds like my company. Lol
.... yes
Caring comes from being the person to blame
Said every CEO ever.
look up "Diffusion of responsibility" to understand what he truly meant.
Ahhh you've discovered the world of IT.
Very impressed with the honesty on this channel. I know plenty of IT folks who would never admit losing data. I run large ZFS storage arrays at my work. When my primary ZFS array is due for replacement (after moving data and workloads to a new array), I then create a Zpool on the old array configured for max capacity and sequential I/O. I then snap and replicate (zfs send/receive) the data on the primary array nightly to the old array. I don't need a ton of performance or redundancy on the old array as it only receives the changed blocks on each replication and is only used for Oh Sh!t moments. I also HIGHLY recommend you add mirrored "Special" devices to your Zpools. Special devices (man zpool) are used for storing metadata (use SSD/NVME) and removing those I/O's from your slower main Zpool drives. You will be amazed at the performance increase, I promise.
you have to be careful with those special devices though, if you happen to have them configured in a non-redundant way and they go away, you drop the entire pool.
@@shanemshort great advice from you both. I totally agree. But they forgot to scrub a 2PB array and its backup, letting them rot for years. I mean... guys, come on.
@@alessandrozigliani2615 There was a backup? It could be scrubbed?
man 7 zpoolconcepts should be where 'special' is hiding. POSIX compatibility, a COW filesystem, and magnetic media with its large seek times make for a less than ideal combination if performance matters.
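The nightly snap-and-replicate routine described at the top of this thread boils down to two commands per dataset. The sketch below only builds the command strings (the host, pool and snapshot names are made up); a real job would run them from cron and check exit codes:

```python
# Sketch of a nightly ZFS snapshot-and-replicate job: take a snapshot,
# then do an incremental send of just the blocks changed since the last one.

def replication_cmds(dataset: str, prev_snap: str, new_snap: str, target: str) -> list[str]:
    return [
        f"zfs snapshot {dataset}@{new_snap}",
        # incremental send: only blocks changed since prev_snap cross the wire
        f"zfs send -i {dataset}@{prev_snap} {dataset}@{new_snap} | "
        f"ssh backuphost zfs receive -F {target}",
    ]

for cmd in replication_cmds("tank/projects", "2022-01-03", "2022-01-04", "oldarray/projects"):
    print(cmd)
```

Because each night only ships changed blocks, the old, slow array described above is perfectly adequate as the receive side.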
Honestly the first YouTuber I regularly watched. And still do
One thing that was not included in the root cause analysis, is single ZFS vDev per pool. In the case of archival data, using a single vDev per pool and multiple pools per server, makes more sense. This would have potentially allowed more data recovery than multiple vDevs in a single pool. Write once data, (basically what archival is), means you can also fill the pool up higher. Perhaps even 95%.
That said, of course not making the vDev too wide should also be mentioned. Meaning if you are going with single vDev per pool, don't use a vDev of 16 or more disks. The wider the vDev, the longer the RAID-Z2 re-build time since ZFS may have to read more disks per data block / stripe.
Good luck.
Forgot to mention that with larger disks, (like 20TB), using fewer disks per vDev is also suggested. Having week long rebuilds, even with RAID-Z2's 2 disk parity is still pretty risky. So 10 to 12 disks maximum in a vDev, with single vDev per pool is probably optimal on a storage verses cost basis.
Last, leaving a free disk slot in each server for replace-in-place is also a good idea. This lets you replace a failing, but not yet failed, disk with a higher degree of safety than simply pulling it: ZFS can read data from the failing disk as well as the rest of the vDev while re-creating it onto the replacement disk. Thus, if there are other unknown errors, there's less chance of data loss. ZFS is one of the few RAID schemes that allows this functionality (though it's probably more common today than when ZFS first came out). Of course, this does not help with a completely failed disk, nor in some cases where the failing disk is in bad shape.
yeah, and triple-parity / mirroring for such large drives if you're going to be running on a home-brew system that you're not 100% confident that you'll be notified of errors. This is why enterprise storage and enterprise backup platforms exist.
If they can't hire a full-time IT team to manage this, it would make the most sense to contract a 3rd party. It's not "mission critical", it's not "top secret", it's just old, already-uploaded videos. It would make sense to hire an expert to set up and maintain this for 1k a month, which is about 1/5th or 1/6th the cost of one full-time person.
@BassRacerx but hiring someone wouldn’t get you content which is one of the main motivations of projects like this.
@@PBMS123 Good point.
Linus, when you did the cloud pricing calc you missed something: Backblaze wasn't showing you the archival-level rates (most likely on purpose). For example, with Azure it's $0.001 per GB per month as long as you're OK with the delay of accessing archival-level files. So more like $1k per month for 1PB, which ain't bad.
AWS S3 also has archival tiers that are competitive with tape, though retrieval fees will bite you in the ass if you don't plan around them.
Kind of what I was thinking that they should probably be using tape cold storage
I use S3 Glacier Deep Archive for my backups. Ends up being just a few pennies a month.
@@ryanjones8977 Deep Glacier is good as a last resort because it's so cheap but the retrieval cost is quite significant. So personally I use it as a backup of a backup.
@@SuperSmashDolls screw AWS/Amazon. they're horrible.
"The rule of two: One is none. Two is one. If it's important, you need a backup." - C. G. P. Grey
Thanks for pointing out needing to manually schedule a parity check!
I've been using Unraid and I assumed that it would have scheduled _something_ by default. Nope. Parity hasn't been checked since I set it up in October.
Protip: Also setup proper monitoring of system and harddrives so that you react immediately when even ONE drive fails.
This. And either constant notification of error condition (email every hour etc) and/or escalation to someone else if not resolved in a particular time frame. Oh and have hotspares
@@blowfly71 just have alerts mate. You don't want constant "everything is OK" messages, because you will start ignoring those real quick and miss the one that says it's no longer OK.
@@jfolz he wasn't talking about constant notifications, but ongoing reminders if an error has occurred but wasn't fixed yet. That way you can't miss the single notification of a failed drive.
@@jfolz thats what I meant. You have alerts that require action, escalate if not resolved...
@@blowfly71 got it. Though sending constant messages does have a benefit: it's a canary for your monitoring ;)
It's probably better to have monitoring that monitors the monitoring though.
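The escalate-if-unresolved policy discussed above is simple to model: keep re-notifying on a timer, and hand the alert up the chain after enough ignored reminders. A toy sketch with placeholder contact names:

```python
# Toy alert-escalation model: an unacknowledged alert re-fires every
# remind_hours, and escalates to the next contact after escalate_after
# ignored reminders. Contact names are placeholders.

def notifications(hours_unresolved: int, remind_hours: int = 1,
                  escalate_after: int = 8,
                  contacts: tuple[str, ...] = ("admin", "manager")) -> list[str]:
    sent = []
    for h in range(0, hours_unresolved, remind_hours):
        level = min(h // (remind_hours * escalate_after), len(contacts) - 1)
        sent.append(f"t+{h}h -> {contacts[level]}")
    return sent

for msg in notifications(10):
    print(msg)
```

A dead drive that re-pages someone every hour, and pages their manager after a shift, cannot quietly sit failed for years.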
Team: Our data is gone
Linus: So we got our content for today’s video
and quite a couple more
as long as they do not lose the data for that video too...
I do like this about LMG. I've been called in to help with several incidents of a similar nature and the level of stress as people see their livelihoods on the line can be pretty extreme. The fact that LMG can just make lemonade out of it is quite refreshing (pun not intended).
This is the issue with massive single disks like 20TB: it takes forever to rebuild, and you're more likely to have another failure during the rebuild cycle. Also, you should always have some hot spares so it rebuilds automatically once a failure is detected instead of you doing it manually.
declustered parity helps with this issue significantly by getting every disk in the array involved with rebuilding the lost data instead of a single parity disk or two.
You and your bunch give us so much of yourselves. Thank you for putting so much time and precision into all your work.