It's nice to hear a technical discussion like this where you aren't having to stumble over client confidentiality every 5 minutes. It was really cool.
If Linus can drop GPUs, then why does it matter if he drops tables too?
Bwahahahaha
I see what you did there...
Ole Bobby Tables
maybe you should sanitize inputs
This is a match made in heaven, Allan and Wendell on the same video.
Is that a JoJo reference?
46:21 Waiting for that talk link! What truly amazing guys.
found this ruclips.net/video/v8sl8gj9UnA/видео.html
I want to say thank you Wendell for being the expert that you are and sharing your adventures on YouTube for all of us to learn from and enjoy.
Having just gotten into ZFS with TrueNAS, if this video teaches me anything it's that I have made the right choice. If it can get "90% recovery out of the box" with that terribly degraded pool... I think I am safe with my 5-drive NAS :D
Thank you for this great video.
Especially considering your pool should never get this degraded. The main cause of degradation in Linus' case was using an old version of ZFS with no regular scrub set up. TrueNAS sets this up by default, apparently.
They had RAIDZ2, so two drives' worth of redundancy rather than the single drive you have, so be aware of that.
@Сусанна Серге́евна I'm not talking about Linus, rather @diesieben07's case.
RAIDZ2/6 vs RAID10 has a key difference when it comes to drive loss and rebuild. A rebuild puts massive strain on the drives, so it is possible to lose another drive during the rebuild. If in RAID10 that loss is in the good mirror, then you have lost your data, whereas in RAIDZ2/6 you still have your data and hopefully the rebuild will complete and you can rebuild another failed drive.
As for Linus, he hasn't realised that there is a big difference between where he started, as a single person/small company, and the medium/large business he runs now. This may have taught him that the things you do as a small outfit are different from those in a large one. And being a tech channel does NOT exempt you from having standards driven by proper system management. He has realised this on PC builds for his growing staff: having a collection of random PCs is a nightmare to support, and things can go wrong and stop production.
5 years since you two did this last, time flies.
As an old techsnap fan, I'm loving this. So now we see the end of the line. When youtubers need tech support they call Linus, who calls Wendell, who presses the red button to release Allan.
First mistake was not having a good, tested backup strategy. Second mistake is calling Linus at all.
if you call linus for tech support you are gonna have a really bad time lol...
@@kaptenkrok8123 Can't disagree. A lot of people forget that Linus and other tech channels are media creators, sometimes actors who have no idea what they are talking about. In the case of Steve from GN, what, 1-2 years ago he "didn't know how to build a PC inside a case", which means he's a complete fake.
@@fredEVOIX What?! Where can I see that?
If you liked TechSNAP, have you seen 2.5admins?
This was an incredible talk with people who obviously know A LOT. Would love more content like this
This was so enjoyable to listen to. The complexity of storage, and everything else, has multiplied so much over the decades. Nice there are people that still love working to improve it all from top to bottom.
I don't know what any of this means, but there is something soothing about listening to two professionals discuss something they know quite well.
yup
It was very nice to see Allan with Wendell. I got Techsnap nostalgia. That show was soo good and informative (precisely because of Allan).
Look at 2.5 Admins, Allan's more recent podcast project. I just binged those over the last month, they are wonderful.
You’re an inspiration Wendell. You gave me the strength to come out of my shell!
Thanks for everything. Love you brother!
I'm glad you got the main head honcho on to speak and talk about how he improved the recovery tooling. I know what it's like to do a recovery in ZFS, and it IS kind of fun just to experience the internals with ZDB, and it's so flexible, but also hard to do everything manually. I've been using it since 2018; ZFS is the way to go.
Oh man, when I heard "years of no maintenance/scrubbing/etc" I was just kind of hands on head in shock LOL
Really great video, thanks for the in depth coverage. Glad the broken system could be then used as a research device to make ZFS even better!
Wow, it's been a while since I last saw Allan at LFNW like... 8 years ago. Nice to see where he is now.
Check out "2.5 Admins" or "BSD Now", podcasts featuring Allan Jude.
the repair-send and no-copy dataset moves are AMAZING
I would love more in-the-weeds content on this, sitting, as I am, with a failed pool. I've just left it turned off until I have enough replacement drives, reading up on the topic in the meantime. I've gleaned some useful things from this discussion, but it would be good to see something more in-depth.
Diffing the different data blocks to detect which bits are potentially at fault and brute-forcing them automatically if requested would be an awesome addition.
27:45 This is what I've been saying for years, stagger your mirrored SSDs (and use different brands) so they don't wear out at the same time. You can also rotate them every X months if you have the budget
This has been done with HDDs before that. We used different brands and different manufacturing dates for the HDDs in our NAS systems 15-20 years ago... because of different wear and tear.
"Don't be like Linus" Truer words....
I agree, Torvalds is a bit of a git
Great to see Allan talking about ZFS again
Linus Tech Tips, oh the irony there. I found it amazing watching their clip how they didn't even follow a best-practice guide on ZFS. They used FreeNAS/TrueNAS even, which has a healthy-sized community in itself. The irony is just remarkable.
Anyway, great piece and great to hear about the new advances in the pipeline. Love hearing about this stuff from Allan. I have an all-NVMe project over 100GbE, so I am eagerly waiting for ZFS to be tuned to take advantage of such a platform. The "repair send" and the new Block Reference Tree features are great ideas that I hope can be implemented upstream ASAP.
When system administration is an afterthought for someone doing other - openly profitable - work this is what happens. I used to do all the system administration at my old job but was often rented out to customers. Then no one would do it and months later when I got back from the engagement I'd have to clean up a ton of problems.
@Сусанна Серге́евна I am not so sure, because what we learnt is that they didn't even set up email alerts. This is all more of an indication of a certain degree of arrogance... because we cannot truly believe that they are woefully inexperienced, "do as I say, not as I do" individuals running a technology channel that delves into enterprise gear when they clearly haven't got the credentials to properly do so?
@Сусанна Серге́евна Couldn't agree more. They're simply entertainers. It is quite bizarre that a small proportion of the enterprise sector is seemingly aligning with them. I never thought I'd consider Supermicro or Toshiba/Kioxia in an "enthusiast" domain (not to lessen it, of course!), but if that's how they see themselves then perhaps we should realign our impression of them too.
I always need more long technical videos by Wendell!
@23:14 Amazing advice! If you have a RAIDZ and a drive fails, don't pull the bad drive until the rebuild is complete. What a good point that a drive might be occasionally erroring and should definitely be replaced prior to total failure, but it's probably still good enough to support the resilvering operation.
The most I ever made on a single task was a drive recovery for a client that was not allowed to let a service contract. Said service contract would have ensured the backups were good. Client managed to corrupt the backup resulting in a recovery when the drive failed rather than a replace/restore.
This conversation inspired a general recovery question regarding traditional "classic" Audio CDs. These lack the extra layer of error correction that data CDs have. Using software like AccurateRip you can check whether your rip matches checksums from a database of other users who ripped the same Audio CD. Sometimes scratches on the CD prevent a track from being accurately ripped, and the damaged sectors of a track get logged. Is it possible to use modern, very fast hardware to fill these "holes" in tracks with all possible data combinations until the entire track matches the verified checksum from the AccurateRip database?
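To illustrate the idea only (this is not AccurateRip's actual algorithm; its checksum is a custom CRC, so plain CRC32 is used here as a stand-in, and the function is hypothetical), a tiny brute-force sketch, plus the reason it only works for very small gaps:

```python
import itertools
import zlib

def brute_force_gap(track: bytes, gap_offset: int, gap_len: int, target_crc: int):
    """Try every byte combination for a small damaged region until the whole
    track matches a known checksum. Purely illustrative: real AccurateRip
    checksums use their own CRC variant, and a full 2352-byte CD sector has
    far too many combinations to brute-force."""
    prefix, suffix = track[:gap_offset], track[gap_offset + gap_len:]
    for candidate in itertools.product(range(256), repeat=gap_len):
        attempt = prefix + bytes(candidate) + suffix
        if zlib.crc32(attempt) == target_crc:
            return attempt          # found a fill that matches the checksum
    return None                     # no combination matched

# Even at billions of attempts per second, 256**n grows far too fast:
# 4 unknown bytes is ~4 billion tries, 16 bytes is ~3.4e38, so this only
# helps when the damaged region is tiny or heavily constrained.
```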
Very interesting conversation. Peeking into the ZFS internals was enjoyable.
Hey, it's Allan! Love the 2.5 Admins and BSD Now podcasts!
I see backplane issues like you're describing at 4:00 _quite often_, even in my personal ZFS setups, and it's quite frustrating watching the CKSUM (for example) counters of _all_ drives in a pool rise together with the exact same counts, 100% knowing that it's a backplane issue, but ZFS is happy to eject them instead of being a bit more careful with that kind of problem. Ideally... backplane failures shouldn't be happening anyway, and it's going to look the same to ZFS's checking, but damn it can be annoying to deal with.
What backplane are you using? Direct attach or expander? Manufacturer?
Just asking because I'm currently shopping for an enclosure with a backplane; it would be nice to know what to avoid.
Wow, this was actually an awesome talk. So much good advice and insights into ZFS!
I love that you guys are professional friends or partnered with LTT in the way that you are... I love being able to see the deep dives on your interactions, their over-the-top reactions to the original problem, and the advancement of the tech itself...
It's inspiring.
You mentioned a talk Allan was going to give near the end of the video, and that it may have already happened by the time the video went up. Has it happened yet?
found this ruclips.net/video/v8sl8gj9UnA/видео.html
Hmmm, didn't know ZFS wasn't optimized for NVMe drives; thought it would be by now. Tbh, if I could get away with using tape drives, I would lol. Thanks for having Allan on, I've learned quite a bit.
I can't think of any Filesystem optimized for NVMe drives
This is part of the reason I like using mirrors in home NAS situations. All the data is on all the drives, and as long as your backups are in order, you're good. And if you occasionally upgrade the mirrors with larger drives, you can add space fairly well, assuming the old drives in the vdev are still healthy. You can even keep a separate pool with triple mirrors for things that are smaller but really important, like family photos or financial documents.

I get it for big chassis in large-scale production units: things like RAIDZ allow you to get more space for your money, and the computational complexity during rebuild shouldn't matter that much. But I prefer the simplicity of mirrors, along with the incredibly fast resilvers.

Recently I had some checksum errors on a drive in my NAS, and it turns out it was likely just an issue with a cable getting disconnected because of some construction. But, because I was using mirrors, I just added a new drive to the mirror vdev (an upgraded 6TB EXOS to replace the two 4TB Barracudas in the vdev), had it resilver, took out the "bad" drive, then added another drive to replace the "good" drive. Once both 6TB drives were resilvered, I was able to take out the "good" 4TB drive, and poof, now I have two extra terabytes of space in my pool (rough sketch of that attach/detach sequence below). Then, assuming both old drives are fine, I can either use them as a new vdev, or if I have other drives of the same size in the pool, I can attach them to the pool as hot spares.
It's part of why I plan to build a JBOD enclosure to attach to my NAS. I'm getting pretty low on drive space using standard ATX cases lol (though I'm able to fit like 10 3.5" drives? Which is really great for an older, budget case).
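For anyone curious, a rough sketch of that attach/resilver/detach sequence as zpool commands. The pool and device names are made up, and in real life you wait for `zpool status` to show each resilver finished before the next step:

```python
import subprocess

DRY_RUN = True                    # flip to False to actually run the commands
POOL = "tank"                     # hypothetical pool and device names
OLD_A, OLD_B = "sda", "sdb"       # the two 4TB drives in the mirror vdev
NEW_A, NEW_B = "sdc", "sdd"       # the replacement 6TB drives

def zpool(*args):
    """Echo a zpool command, and only execute it when DRY_RUN is off."""
    cmd = ["zpool", *args]
    print("$", " ".join(cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)

# 1. Attach the first 6TB drive as an extra mirror side and let it resilver.
zpool("attach", POOL, OLD_A, NEW_A)
# ... wait until `zpool status tank` reports the resilver finished ...

# 2. Drop the suspect 4TB drive, attach the second 6TB drive, resilver again.
zpool("detach", POOL, OLD_A)
zpool("attach", POOL, NEW_A, NEW_B)

# 3. Detach the last 4TB drive; with `zpool set autoexpand=on tank` the
#    mirror vdev grows to the new drives' capacity.
zpool("detach", POOL, OLD_B)
```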
I remember Allan Jude from the tech snap days. Very smart guy
The uberblock creating errors in physically nearby sectors sounds like rowhammer.
Edit: Could you write backup/snapshot copies of the SSD metadata to the mechanical disks?
Wendell. Best audio ever. Please give us a hint how you did it, and PLEASE consider paying the same level of attention to the audio on Level One. Thanks.
Wendell, I had to come back and watch this video again :)
Wow really great video! Especially interesting were the part about file cloning on copy and restore coming in the future. 😁
Great video!
Love how this led to developing great tooling; really interesting to hear this talk!
Sounds like Linus could do with outsourcing his sysadmin to a reputable company like one in Kentucky. For a tech channel, they are hilariously negligent with the administration of their own systems.😅 It seems like every few months there is an IT disaster that was easily avoidable with some regular systems administration. [Edit] as Wendell said himself @50:08
I'm surprised ddrescue wasn't mentioned anywhere. Was it never a part of the recovery process?
Dubstep Allan has returned!
This is the first I've heard that ZFS stores additional copies of metadata. I was under the impression that it did not, and I have been using btrfs on my small system largely because btrfs does store extra copies. So to ask the next question: can I have a multi-drive ZFS setup where the data has no redundancy but the metadata is still protected if one drive were to fail? With low-value data I don't mind if some files become inaccessible, but I at least want to know exactly what I lost.
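In case it helps anyone with the same question: the knobs involved appear to be the per-dataset `redundant_metadata` and `copies` properties (ZFS keeps extra ditto copies of metadata by default and tries to spread them across vdevs). A rough way to check them from a script; the pool/dataset names here are made up:

```python
import subprocess

def zfs_get(dataset, props):
    """Query ZFS dataset properties via the zfs CLI (assumes OpenZFS installed)."""
    out = subprocess.run(
        ["zfs", "get", "-H", "-o", "property,value", ",".join(props), dataset],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("\t") for line in out.strip().splitlines())

# Hypothetical striped pool where only metadata gets extra ditto copies;
# the data itself has no redundancy.
props = zfs_get("tank/bulk", ["copies", "redundant_metadata", "checksum"])
print(props)   # e.g. {'copies': '1', 'redundant_metadata': 'all', ...}

# To also duplicate the data of one valuable dataset on the same pool:
# subprocess.run(["zfs", "set", "copies=2", "tank/photos"], check=True)
```

Whether that alone tells you exactly which files were lost after a whole-drive failure is a separate question, but at least the pool and dataset metadata has more than one copy.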
Excellent vid, guys... Thanks!
Where's the link to Allan's talk though?
Wendell, pls put windows on Do Not Disturb when you're recording your system audio!
Did the faulty LTT backplane have Broadcom components in it?
lol, Linus contributing to FOSS & Linux in ways he never expected XD
Block reference tree was one of the killer features of SimpliVity from day 1 :)
This is the first video that mentioned scrubbing. Recovery is very different from normal operation.
Excellent video. I hope LMG reimbursed you for the work done.
2 Legends !!!!!!!😃😃
Haven’t seen Allan in a long time.
I still do not understand why Linus made a video about LTO backup but does not use it?!
Those outlook notification sounds threw me off there!
This is fantastic listening
Amazing info even for people just running stuff at home.
45:00 omg finally!! 🤩 Been using reflinks with btrfs and xfs for years! ^^
Love my boy Allan!
Damn, I can only click the like button once! Thanks for this, very informative.
Is there any narrative that details how ZDB was used for the forensics?
44:10 I am surprised that's not a part of ZFS already (b-tree); btrfs does that, as I can restore a snapshot to a folder path and it just mounts the snapshot.
I would have thought the vdev is the lowest level and ZFS datasets are like btrfs subvolumes; just surprised that ZFS duplicates the data like that when you restore a snapshot to another dataset.
Where is the link to the talk?
I have been using ZFS for 4.5 years now and I'm very happy with it. The video was partly over my head, I have to study somewhat more :) I use ZFS on:
1. My Ryzen 3 2200G; 16GB desktop running a minimal install of Ubuntu 22.04 LTS on OpenZFS 2.1.4. A minimal install, because I moved my "work" to VMs.
It has 3 datapools mapped to an nvme-SSD, 2 HDDs (~9 power-on years, 1TB+500GB) and a 128GB sata-SSD used as L2ARC/ZIL (SSD cache for the HDD datapools). The datapools are:
- 512GB on my nvme-SSD for the most frequent used VMs;
- 2x 500GB striped partitions on the 2 HDDs for my data and for more VMs, but with copies=2 for my data (pictures, videos, music etc.);
- 500GB at the end of the 1TB HDD for archives (ancient 16-bit software and outdated VMs, like e.g. Windows 3.11 for Workgroups or Ubuntu 4.10).
This weekend my datapools were scrubbed again without any issues. Once, my "data" dataset (copies=2) was corrected during a scrub. Note that for the last 3.5 years the HDDs have been more or less retired: they are used for less than 4 hours/week and they are supported by the high hit rates of the ZFS caches. I have a hit rate of ~98% for my L1ARC (memory cache maxed at 4GB) and currently the L2ARC hit rate stands at 49%.
2. My 1st backup is a 2011 HP EliteBook 8460p laptop with an i5-2520M, 8GB and an almost new 2TB HDD. It also runs Ubuntu 22.04 LTS on OpenZFS. The nice thing is that it runs exactly the same VMs as my desktop, ideal during holidays in Europe and family visits.
3. My 2nd backup is the remains of a ~20-year-old HP d530 SFF with 4 leftover HDDs totalling 1.21TB, but with less than 3 power-on years. It has a Pentium 4 HT (1C2T; 3.0GHz) and 1.5GB DDR. It has 2 datapools: zroot on 2 striped HDDs (IDE 3.5" 250+320GB) and dpool on 2 striped HDDs (SATA-1 2.5" 320+320GB). The case is an original Compaq Evo tower with a Win 98SE activation sticker :) The system runs FreeBSD 13.1, also on OpenZFS 2.1.4. The only disadvantage is that the backup runs at 200Mbps out of the 1Gbps due to a 95% load on one P4 CPU thread. I did run the scrub last week.
I see two effects:
- I run both backups at the same time, but the backup to FreeBSD/Pentium 4 is faster than the backup to Ubuntu/i5, while all 3 involved PCs are connected to the same 1 Gbps switch. The P4 runs at a constant transfer speed between 20-25 MB/s, while the i5 starts out at say 80MB/s, but then it gets slower and irregular, with long periods of say 0 to 4 MB/s interleaved with say 20 to 40MB/s. It gives me the feeling of fragmentation in the receive buffer. In the past, when the L1ARC was used for buffering, I did not have that issue. I did file an Ubuntu bug report, but no reaction.
- Boot time of the Xubuntu VM is ~6 seconds from the SP nvme-SSD (3400/2300MB/s); the reboot times are ~5 seconds, I assume from the L1ARC. To get faster reboot times from L1ARC memory, I assume I have to buy a Ryzen 5 5600G. Maybe with the 2200G I could try to cache only the metadata for the nvme-SSD; the difference should not be very significant. After the NVMe queuing improvements announced here, it might even be faster, assuming that the decompression of each record is assigned to another CPU thread as well?
I have tried to set the primary cache of the nvme-SSD to metadata and the (re)boot time for Xubuntu is now ~7 seconds, but the use of L1ARC memory drops from 4GB to 0.25GB :) Also for Windows 11 I did not lose more than say 10% in boot time, so I keep the metadata setting for the moment. I noticed that the load on the nvme-SSD increases, but I still have a 74% hit rate for the L1ARC cache.
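For anyone wanting to try the same tuning, a rough sketch. The dataset name is just an example, and the hit-rate check reads the ARC counters that OpenZFS exposes under /proc on Linux:

```python
import subprocess

# Cache only metadata (not file data) in ARC for one dataset; the dataset
# name "rpool/vms" is purely illustrative.
subprocess.run(["zfs", "set", "primarycache=metadata", "rpool/vms"], check=True)

# Rough ARC hit-rate check on Linux.
stats = {}
with open("/proc/spl/kstat/zfs/arcstats") as f:
    for line in f.readlines()[2:]:            # first two lines are headers
        name, _type, value = line.split()
        stats[name] = int(value)

hits, misses = stats["hits"], stats["misses"]
print(f"ARC hit rate: {100 * hits / (hits + misses):.1f}%")
```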
Xdiff for metadata, or maybe with a colour code so you can quickly see?
Had a ZFS "invalid exchange" error during zpool import on 3 USB drives, maybe due to no export during shutdown? Well, that setup served me well for a few years, with 2 disks replaced previously.
You mentioned not to use dd to clone the failing drives? Is there a better tool for cloning an unreliable drive sector by sector?
The question everyone wants answered is "Did Jake do it"
I'm British and even I got confused when Allan said Zed FS etc lol. Is he an expat? Or maybe Canadian?
Edit: oh, he is Canadian. I guess he accepted the English language without "making it their own" ;)
I thought he had said Zetta FS, from the original Zettabyte FS naming.
I would like to add a special device to my pool, but at the moment figuring out the exact size of drives to use seems complex and a bit voodoo
What would be your recommendation for SAS cloning or duplication? A standalone 1:1 product, or another SAS chassis/server and recovery/partition software?
49:30 RIP Optane, I loved you whilst you were around 😢
Wendell = gloriously insane!
Hey, anyone got a link to the Linus video Wendell is talking about?
No scrubs in years just boggles my mind. How!? Whyyy?!
Intel 530s had a write amplification bug as well, fixed long after it mattered.
ZFS really changed my way of looking at multi-volume storage stuff
I started my journey when I used the fake RAID of my motherboard's UEFI to set up a 5-disk RAID5... and have played the restore game about 3 times now - lucky me, without data loss.
As I got rather annoyed by the fact that I might lose data in a RAID5, I started to look at other options and first got hooked by BtrFS - which is no good for RAID5/6 - and so I learned about ZFS.
Since M$ pulled the support for Win7 and I wanted to get rid of the fake RAID5 anyway, I switched over to Arch Linux, bought a big HBA to connect many drives to it, set up a RAIDZ2 pool and do regular scrubs.
Sure - I had a look into Windows Storage Spaces and ReFS - but it comes with the issue that this is all closed source and proprietary and only expensive enterprise 3rd-party stuff is available - that's a no-go for me in a recovery situation: I want an easy and simple way to access my data - and with ZFS-on-Windows coming along quite a way, ZFS seems to be the one option for a multi-platform filesystem - cause exFAT just doesn't cut it for many reasons, mostly because it's a FAT descendant.
Yeah, with SSDs on servers, I'd probably have a backup setup, so that I have two copies of the vdev contents, one on the SSD to be used, and one on HDD as a backup, copied during downtime. I get having fast media on home servers for things like 4K video, but generally I wouldn't really want to store anything there and only there. In fact, it'd be better to have some sort of caching setup where a file is copied to the SSD vdev before playback, if possible.
Is the SMART data for "proper" SSDs, like ones from Samsung or Micron, really reliable regarding their remaining NAND life?
For example, get a Samsung 980 PRO with 2 TB, only use 1 TB of its capacity and write 1 PB to it. This should result in a significantly higher remaining life expectancy compared to using the entire 2 TB and writing 1 PB to it. I'm a bit worried that SSD manufacturers might game their SMART values with the total TBW the specific SSDs are rated for. All SSD defects I have personally witnessed since 2011 were sudden deaths with no prior SMART indication whatsoever :(
The most recent one was a Micron 7450.
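If you want to sanity-check what a drive reports yourself, something like this pulls the raw NVMe health counters via smartctl's JSON output (smartmontools 7.0+). The exact key names depend on the smartctl version, so treat them as assumptions and compare against `smartctl -a -j <device>` output on your own system:

```python
import json
import subprocess

def nvme_wear(device="/dev/nvme0"):
    """Read NVMe health counters via smartctl's JSON output; key names are
    what recent smartctl versions appear to emit and may vary."""
    raw = subprocess.run(["smartctl", "-a", "-j", device],
                         capture_output=True, text=True).stdout
    health = json.loads(raw).get("nvme_smart_health_information_log", {})
    return {
        "percentage_used": health.get("percentage_used"),       # vendor's own wear estimate
        "available_spare": health.get("available_spare"),       # % of spare blocks left
        "data_units_written": health.get("data_units_written"), # 1 unit = 512,000 bytes
        "media_errors": health.get("media_errors"),
    }

print(nvme_wear())
```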
"All SSD defects I have personally witnessed since 2011 were sudden deaths with no prior SMART indication whatsoever"
That's because 95%+ of SSD failures are the controller dying while the NAND is still fine, at least for consumer/prosumer SSDs.
@@_--_--_ Well, the Micron 7450 isn't exactly highest-end, but I would have classified it at least as "proper enterprise". Regarding prosumer stuff, I hoped that at least vertically integrated manufacturers like Crucial (Micron), Intel and Samsung would have proper SMART monitoring - note: I had the least amount of failures with Samsung SSDs. But I dislike that their enterprise SSDs don't offer any warranty for end customers; this "forces" one to Micron after Intel messed up their NAND SSDs with unfixable bugs before discontinuing SSDs for good :(
@@abavariannormiepleb9470
Well, it depends on the controller used, I guess; personally I never bothered with Micron, so I can't say anything about their drives.
I just mentioned consumer/prosumer because lots of drives from manufacturers like WD, Kingston or SanDisk use controllers that are very notorious for dying frequently, and I have never seen those specific controllers used in enterprise SSDs.
As far as I know, Samsung's U.2 PM9A3 uses the same controller as the 980 Pro. I have to agree Samsung controllers are significantly more reliable, but I have heard of dead 980 Pros and PM9A3s, also from sudden controller failure; usually if this new Samsung PCIe 4.0 controller dies, it at least tends to die within its first couple dozen operating hours.
Kioxia also uses proprietary controllers like Samsung; I don't know how good their SMART reporting is, but I have never heard of any of them failing from a sudden dead controller.
Good video! Really neat stuff.
Only a "hundred hours" to "melt" Wendell's brain??? OMG! That level waaaay outpaces me!
TrueNAS Scale on native hardware (Ryzen 3600, Radeon RX550, 48GB RAM) or virtualized TrueNAS on Proxmox?
The whole time i watched this on my phone, I kept thinking I had dropped some crumbs on the screen.
Where is the talk referenced? You didn't link it, and it's been 2 months since this was posted.
OK, simple question: why did they not restore from backups? If they don't have backups, why not?
Compare that system to Apple's Time Machine. When Apple's "program" gets a read error, it truncates the "backup" and informs no one that data was lost. Then as time goes on, the Apple program deletes older non-truncated files to make room for new corrupted files. Years later, I am still finding files corrupted by Apple.
I have not seen you in a long time, Allan, since the BSD show
ZFS send with all safeties turned off... is that "full send"?
This was really great! Thanks!
For the future of ZFS: Please introduce a solution that also works with system sleep (S3) so your ZFS storage doesn’t have to be on 24/7 or cold-booted all the time. This would help ZFS’s proliferation into more “normal people” homes.
is Allan's talk already live?
Ah man this is awesome stuff!!!
Good video, but you never actually put the link to Allan's talk in the description like you said you would in the video.
Allan and Wendell?! 🥰😘😂😊
I fear the cold truth here is that it's always possible to be more negligent than your storage is smart!
Woot!
Are we going to shame the backplane manufacturer? This seems like some pretty crummy firmware that needs to be called out. Also probably in part because I'm guessing they used SATA drives on a SAS expander (this never ends well).
Liked Allan on Jupiter back in the day
Awesome!
This was really interesting even though most of it went above my “pay grade” 😅