Get my fundamentals of database engineering udemy course database.husseinnasser.com
This guy right here is the OG backend engineer. The topics he chooses for his videos are the ones a backend engineer should be familiar with. Though I find his style of teaching a bit hard to grasp, I have massive respect for the topics he chooses. Unlike others, who only talk about CRUD operations, this dude is totally different 🙏🙌
Sounds like a job for Pied Piper Inc. :D
just do it middle-out
Sounds like a job for Gilfoyle
sounds like a job for dric codes, the plagiarism of SV applied to a "real" business.
Knowing that magic pocket is for cold storage, I don't think it is that surprising that the SSDs were more trouble than they were worth.
Based on the topic alone it sounds exceptional, but then you realize it's just "parallel throughput of a massive HDD pool > single SSD" and honestly it's not that interesting anymore. Sure, it's something to keep in mind when planning storage solutions. There's always going to be a bottleneck somewhere anyway: network, disk array, or indeed cache. How much scalability do you want, and what is your budget? That should already help you establish which bottleneck you'll have, so plan everything else with that in mind.
Sounds like Dropbox is just being cheap
I realized that when I put 12 x 24TB HDDs on my workstation machine.
Oh, the combined throughput of the entire array is greater than my Samsung 970 SSD for sequential writes, but not my PCIe M.2 thingy (I bet it cheats, as it has 1 GB of RAM itself).
0.5 ms of latency doesn't matter if you are going to spend tens of seconds transferring your big 4 TB "block"; it only means it finishes 0.5 ms later.
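The arithmetic behind these two comments is worth making concrete. A rough back-of-envelope sketch, where the per-drive speeds are illustrative assumptions rather than figures from the video:

```python
# Assumed numbers: ~250 MB/s sequential per 7200 rpm HDD,
# ~2500 MB/s for a single NVMe SSD.
hdd_mb_s = 250
ssd_mb_s = 2500
n_hdds = 12

array_mb_s = n_hdds * hdd_mb_s        # aggregate streaming rate if striped
print(array_mb_s)                     # 3000: the pool out-streams one SSD

# A 4 TB transfer at that rate takes over 20 minutes, so an extra
# 0.5 ms of latency changes the completion time by well under a millionth.
transfer_s = 4_000_000 / array_mb_s   # 4 TB expressed in MB
print(round(transfer_s))              # 1333 seconds
```

This is exactly the "single apple vs. crate of oranges" point: aggregate sequential bandwidth favors the array, while latency only matters for small operations.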
Yeah, it's pretty obvious. There are various types of SSDs. In this case you want the "crappy" version without DRAM. Terrible for a boot drive but excellent as a cache for a mechanical drive array.
That's the conclusion I reached less than seven minutes into the video. Talk about comparing a single apple to a crate of oranges.
As a specialist in storage latency: they just needed more SSDs. A single SSD will never win against hundreds of spinning disks. It sounds like they were completely wearing out their SSDs with writes, and that's what caused the latency and the hardware failures.
Truly mind blowing - So counterintuitive at first. Thank you - I'm a big fan !
“The word block is the most overloaded word in software engineering” 19:40
Oh I love this! It really is pretty overloaded - what are some other overloaded terms?
“Model”
“User”
“Auth” (this one is a bit of a cheat since it’s a shortening of at least 2 other terms)
spot on
"Component" has to be THE most overloaded.
interface
so nice to see that HDD technology is still alive.
God bless IBM
@@deleater I actually had an IBM-Hitachi drive at some point. It was quite amazing.
@@boredape1257 lucky you :)
Your topic isn't quite relevant to me (though I do own servers), but you convey your points in a really nice, friendly, simple way for the layman. Nice one 👍
Sounds like the perfect use case for Optane. That is, if Optane weren't dead.
At the time Dropbox started using SSDs, Optane was far from dead. From the sound of it they don't even need a lot of it per server (tens of GB would work for them). Much lower latency, and no wear leveling to speak of either (their use case is similar to MySQL binary logs with rotation, so it would naturally wear-level anyway).
Despite that, it sounds like they took a similar approach to Ceph, which makes each disk an OSD and then duplicates writes for redundancy (ensuring no two replicas land on the same OSD, server, rack, etc.) as many times as needed, so the writes naturally spread out over the disks and remain independent of each other. The write returns / is marked complete when a certain number of replicas are written to disk.
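A minimal sketch of that quorum-acknowledgement idea (all names here are illustrative stand-ins, not Ceph's actual code path):

```python
from concurrent.futures import ThreadPoolExecutor

def persist(replica, data):
    # Stand-in for a durable write to one OSD / failure domain.
    replica.append(data)
    return True

def replicated_write(replicas, data, quorum):
    # Fan the write out to every replica; report success as soon as
    # `quorum` of them confirm the data is on disk.
    acked = 0
    with ThreadPoolExecutor() as pool:
        for ok in pool.map(lambda r: persist(r, data), replicas):
            acked += ok
            if acked >= quorum:
                return True
    return False

osds = [[], [], []]   # three fake disks on different servers/racks
print(replicated_write(osds, b"chunk", quorum=2))   # True
```

The caller treats the write as durable once the quorum is met, even though the remaining replicas may still be in flight.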
Obviously the SSDs would fail much sooner than the HDDs, because they are a write-back cache absorbing lots of little changes. That means many writes to the SSD, while the HDDs mostly just accumulate files in Dropbox's case.
Were they using 3D XPoint (Optane) drives? Optane drives or persistent memory modules seem like a good option for such a system.
I would be super curious whether an Optane SSD would perform better in this situation. It's already been proven that Optane drives, even first-gen ones, can outperform modern NAND drives when it comes to random access and multiple concurrent operations of that type. This is why I am sad Intel is retiring it; it has some fabulous use cases and is fundamentally different from NAND.
Wasn't there a substitute for it that was essentially battery backed RAM ?
@@monad_tcp It's a thing, but it's really a step backwards. Battery-backed DRAM existed before Optane. It works, but it's cumbersome and takes much more energy than Optane.
Optane is unique because it doesn't use NAND but is still non-volatile. It has much better wear leveling and total lifetime than NAND, and as I mentioned it handles random and concurrent access better than NAND. NAND of course smokes Optane in most other I/O operations, but Optane has its niches where it excels (e.g., it's a great OS boot drive or cache drive). I hate that it's going away already.
What an interesting blog. Your analysis, Hussein, was very valuable.
ASMR drives are the quietest in the data centre!
7:17 ... and write it asynchronously to ASMR 😂
This vs. a BTRFS/ZFS implementation on Unraid: ZFS RAID with 8–10 SSDs with zoned namespaces (IOPS for days).
The new hard drives use HAMR, heat-assisted magnetic recording, which uses a laser to heat the platter so the ferrous material can be flipped.
I suspect this won’t mean that we can keep our node_modules in Dropbox.
For some reason your videos stopped getting recommended to me, and I completely forgot I used to watch your channel.
I'm currently looking at bcachefs development for per-file RAID and caching disks, so this was an interesting watch. I have CMR hard drives, so I don't know how much this matters.
This seems like something they should have known before they installed it, no? Sounds like an oversight by the designer.
Thanks! I want to get your GIS book, but I feel like a Udemy course would be easier to digest and better to learn from, lol.
23:44 The next best thing: memory as cache. But what if it fails? Send the request to 3 or 4 servers and use some cache-locking algorithm for the writes to the HDDs; let it duplicate a bit, then clean up later by marking the space as free.
zfs sync=disabled will do the job.
You should try reading Salesforce engineering blogs. They are pretty good too!!
I'm not that smart, and even I can understand. Thanks for your great explanation. 👍
Hey, thank you for this video, very interesting! :)
I was thinking that they just buy S3 storage and resell it :D At least I know that they started like that, but it makes sense to move on to their own infrastructure.
First! A big follower of the content.
tl;dw (the article can be read in far less than 30 minutes): they had a single SSD "cache" (more of a buffer) in front of an array of disks, so the speed of that SSD was a bottleneck. There were other reasons too (a bunch of SSDs failed at the same time a while back) to eliminate the SSD rather than, say, put an array of buffer SSDs in front.
They were foolish to use SSDs like that. I build high-capacity storage for the environments I manage, and you never use a single SSD this way. Think about it: if an SSD is faster than an HDD, an array of SSDs will be faster than an array of HDDs. When designing a storage system, I always build the cache as an array of SSDs. To be more accurate, I actually have three tiers, with RAM being one of them. When you do this at large scale you need a fabric and a fast network, 100 Gig and up, and you start isolating storage types... and now I get to the part of the video where you bring up multiple SSDs, lol. Nice job.
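The "array of SSDs as cache" point can be sketched very simply: spread incoming buffers round-robin over several cache devices so no single one becomes the write bottleneck (purely illustrative, not anyone's real tiering code):

```python
from itertools import cycle

ssds = [[] for _ in range(4)]   # four cache devices instead of one
target = cycle(ssds)

for buf in range(8):            # eight incoming write buffers
    next(target).append(buf)    # round-robin across the array

print([len(s) for s in ssds])   # [2, 2, 2, 2] — load spread evenly
```

With N devices, each one absorbs roughly 1/N of the write traffic, which also spreads the endurance wear that killed Dropbox's single-SSD setup.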
Can you link to the OG blog?!
Nice !
I'm currently using my SSDs as a temporary stop-gap measure while I migrate my systems and consolidate them down to a single server.
The proposal on the table right now is eight HGST 6 TB SATA 6 Gbps 7200 rpm drives in a raidz2 ZFS pool that will host the VMs' OS disk images/files, separated from the larger bulk storage holding the rest of my data.
I've found that when I tried to pile everything into the main bulk storage, the performance of the VMs suffered (sometimes semi-catastrophically) because they were waiting for the bulk storage's I/O to catch up.
And with the incoming server/data migration, I'm going to be writing a fair bit of data to the bulk storage, and I want the VMs to stay alive and responsive during that time.
So that's the plan for the moment.
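For reference, the raw usable space of a pool like the one described works out roughly as follows (assuming raidz2 gives up two drives' worth of capacity to parity, and ignoring ZFS metadata/slop overhead):

```python
# The commenter's proposed layout: 8 drives, 6 TB each, raidz2 (2 parity).
n_drives, drive_tb, parity_drives = 8, 6, 2

usable_tb = (n_drives - parity_drives) * drive_tb
print(usable_tb)   # 36 TB raw, before filesystem overhead
```

Real-world usable capacity will land a bit below this once ZFS reservations and metadata are accounted for.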
Here for the ASMR.
The problem was that they had 1 SSD. They need 100 SSDs.
Working in a small business, I don't always understand the obsession with speed over other factors... what's up with that!?
I'm not sure it's speed over other factors; they mention the replicas. The service wouldn't be very useful if the data could not be retrieved, or was corrupted in some way.
Speed underlies a lot of other factors, though: the faster the data can be written, the more a single server can handle. A lot of these cloud services operate on an oversubscription model that exploits the under-utilization you'd see in servers deployed at a smaller (read: pretty much everyone else's) business. When you are paying per-server / per-rack costs, you want to squeeze out as much performance as possible. They are already at the point where redundant servers are in place (whereas in a small business you have to make a whole plan / business case just to get your second server to act as a backup).
@@davidmcken That makes sense, thanks! Any good book recommendations for a front-end, light-backend guy to learn more about all this, or about backend engineering in general?
Hussein, stream on Twitch and discuss problems and blogs! People will go crazy!
No value in this video for me. Reading the article takes a few minutes, and I knew more from it than after 3 minutes of listening to this video.
Yeah, that's what I just did, then came back here. He's just pumping in fluff to justify all the ad breaks.
When they said the SSDs reached their write endurance, they meant the SSDs failed because of the wear and tear that happens when data is written to them over and over.
SSDs have a maximum write-endurance capacity; e.g., if a 2 TB SSD is rated at 200 TB written, then after writing a total of 200 TB of data the SSD would die.
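Using the commenter's own numbers, the endurance math looks like this (the 5-year window is an assumption added to show drive-writes-per-day, not part of the comment):

```python
capacity_tb = 2
rated_tbw = 200   # total terabytes written before rated wear-out

full_drive_writes = rated_tbw / capacity_tb    # 100 complete overwrites
dwpd = full_drive_writes / (5 * 365)           # drive writes per day over 5 years
print(full_drive_writes)   # 100.0
print(round(dwpd, 3))      # 0.055
```

A busy write-back cache in front of a large HDD pool can easily exceed a fraction of a drive-write per day, which is why the cache SSDs wore out long before the disks behind them.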
Don't spread misinformation! Unless a malfunction unfortunately occurs outside the designed constraints, in almost all cases the data would never die past the target write endurance, because that's not how it is designed to work. You will still be able to read all of the data; you just won't be able to write to it anymore.
@@deleater OK, thanks for the correction. Do you have a link on that handy? I would be relieved of the anxiety and put more of my stuff on SSDs. Right now I keep a bunch of HDDs and only a couple of NVMes.
OK, so here is what I gathered from explanations on the internet. Past the write endurance, the drive won't suddenly become read-only. It will start getting write errors, and the disk will start marking the errored areas as bad, thereby reducing the available space. Past a certain degradation, write/delete operations themselves become unreliable, making the drive risky to edit. So by that definition some of the data would still be present, but it can get botched by partial writes/deletes, effectively making the disk an unreliable source of data.
To your point, one should keep a disk nearing its endurance for read purposes and not overwrite important data. Any such carefully managed data should stay available for the longer term. Still, it's best to move the critical data to a new drive and use the old disk as a spare.
@dhruv By the above summary, Dropbox had no reason to trust a disk nearing its write endurance, since their primary use case is writes.
Do it ALL in ASMR next time :D
Nothing new:
1. We create our own cache to increase performance.
2. We remove our own cache to increase performance.
A simple, quality UX tends to have overengineered parts in the backend.
Have they even thought about using RAM as a cache yet?
Problem is, RAM is not durable, so you might lose data. Maybe once ULTRARAM is a thing?
Durability: it was mentioned that once data was written to the SSD, it was signalled upstream that the write was complete, and therefore the web server could discard the data, safe in the assumption that it could be retrieved later on. MySQL calls this the doublewrite buffer (not sure what Postgres calls it).
Comparing to an RDBMS, Dropbox seems to be dealing with an issue similar to the concept of pages, and for the SMR setup, dirty pages needing to be written all at once.
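A toy version of that write-back pattern: acknowledge upstream as soon as the fast tier holds the data, and drain it to the slow tier in the background (all names are illustrative, not Dropbox's actual design):

```python
import queue
import threading

class WriteBackBuffer:
    def __init__(self):
        self.fast = []                 # stand-in for the SSD journal
        self.slow = []                 # stand-in for the HDD array
        self.pending = queue.Queue()
        threading.Thread(target=self._flush, daemon=True).start()

    def write(self, data):
        self.fast.append(data)         # durable on the fast tier
        self.pending.put(data)
        return "ack"                   # upstream may now discard its copy

    def _flush(self):
        while True:                    # background drain to the slow tier
            self.slow.append(self.pending.get())
            self.pending.task_done()

buf = WriteBackBuffer()
print(buf.write(b"page"))              # "ack" returned before the HDD write lands
buf.pending.join()                     # wait for the flush (demo only)
print(buf.slow == [b"page"])           # True
```

The catch the thread above discusses: if the fast tier dies between the ack and the flush, the data is gone, which is why the fast tier itself has to be durable (or replicated).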
I wonder why they used an SSD in the first place; they could have used custom high-capacity, high-speed server RAM, or the HBM used in GPUs, as cache storage, with custom software to work things out if they wanted to do better. They need a better architecture designer.