honestly might be the most complete and thorough explanation of sharding.
Thanks so much for your kind words!
You've simplified your explanation the way Google engineers do when they give lectures. I'm sorry if that sounds strange, but I've realized that the people who can simplify complex things really know what they are doing. Awesome, man. Cheers.
Thank you so much for the kind words!
Nice words
I am burning through all your videos. You are making me a better SAAS Test Engineer! Keep up this great work!
Network guy trying to get an understanding in a different field. That's an outstanding walk-through and very much appreciated. Thank you for your work and quality presentation.
Glad it was helpful!
This is seriously such a great video man. I spent the entire Sunday understanding Sharding. Not that I didn't get started with the concept, however, this video just made everything clear at the end of the day. Thank You.
Watched countless videos and barely understood the concept. Your video on the other hand explained everything along with pros and cons super simply. Thanks a ton.
Dude, this was outstanding! Super helpful and covered everything I needed to know!
Best lesson about database scalability I found, so easy to understand.
Hands down! the best explanation I've seen on database sharding, excellent!
You're so welcome. Glad you enjoyed.
Dude, you make some really awesome content. Please, please keep making videos! I love the clarity of your speech, voice, and presentation. I can follow along in your videos a lot better than on most other channels. Earned my subscription and likes! Keep killing it, homie!
Thank you so much for your kind words and welcome to the channel!
Bro I'll watch anything you make. If you made a video teaching me how to watch paint dry I'd take notes. Keep up the damn good work my mans.
this video entails very good explanation and this also entails complex understanding.
Thank you!
Daniel, no words.. looking at your playlists content and videos …amazing. Great great effort to help people. Kudos to you 👏👏👌👌👌
You're very welcome!
Great video, especially your description about the non-uniformity problem.
Thanks Rotary Dialer! Yea the non-uniformity issue is one I've been personally bitten by in the past. Glad you enjoyed the video!
Finally found some decent content over this topic. I already had an idea on this topic just wanted to revise it. Thanks a lot for making the insightful videos.
With most ~20-minute videos, I get tired and close them after 5 minutes. I can't believe your video is so good that I totally lost track of time and finished watching all of it.
Thank you so much Jingyi! It's these kinds of comments that keep me motivated to make more content :)
Stay safe
Daniel
Awesome explanation of sharding, one of the best videos out there. Thanks brother!
Are there any database tools that make this easier? Couldn't someone write some software to create a wrapper around a sharded DBMS that could handle the routing and re-sharding with a given hashing key?
Been studying system design for interviews. All the videos hand-wave sharding: "We would shard the db across different regions." I had a rough idea of what it is, that we split the db into smaller pieces, but nothing concrete.
Now it makes perfect sense with this amazing video.
Hey dude, you're a star! Very clear and to the point! I can't thank you enough.
Very clear. One of the best tutorials I have ever seen.
Thanks for the straight forward easy to grasp concept of sharding. Give this to someone else and we would have gotten a bunch of technical wordy mumbo-jumbo.
Best video ever made on sharding
Amazing explanation, loved it. Thank you, it will help for the future interviews I have.
Glad I could help!
best explanation of sharding i've heard!
Thank you very much!
Watched some of your random videos on sys design, and now im hooked. Great content!
Thanks so much J! Glad you enjoyed!
@@BeABetterDev yes
Great video! Such a clear explanation of how database sharding works.
Had a hard time grasping on what database sharding actually meant but your video really helped me understand it, thanks! :)
You're very welcome!
Great content man!! It helped me a lot!! Keep up with the good work!
Thank you!
Great job on this one, I came here to know more about sharding, but I learned lots of useful information before you even dived into the topic ;)
Glad it was helpful!
Which is better to start with for database basics:
- An Introduction to Database Systems, C. J. Date
- Database Internals
If there are any better or recommended books or materials, please mention them.
Great explanation.
Hi Saif,
This is a tough question to answer. I would step back for a moment and ask: why are you trying to learn about databases? I think the answer will guide how and what to tackle first.
For example, if you're just planning on using DBs, Database Internals may be a bit overkill (but good to know overall). Could you tell me more about why you're learning DBs? Then maybe I can guide you more.
Thanks,
Daniel
@@BeABetterDev To be aware of the basics in general at first, like the physical and logical concepts,
and then backend specifics.
I'm very grateful for your concern.
Hi Saif,
I briefly looked at the two resources you mentioned, and I think the better choice is to read Database Internals. I feel that it is much more modern and covers some of the important aspects of database challenges today, such as distributed systems and availability. The other book is quite dated, and although I'm sure it would be beneficial, I think things have changed so rapidly recently that I'm concerned the content will be a bit stale.
One thing to note is to not get too bogged down with the details. To be a great developer with database understanding you don't always need to understand the low level details. Knowing how things work at a high level with the ability to dive deep when you need to is much more valuable.
Hope this insight helps and I wish you best of luck on your studies.
Daniel
@@BeABetterDev thank you
New here. Loved your talk! Your presentation and teaching is elegant and simple.
Really appreciate it, thank you!
You are so welcome!
This was awesome. Thanks!
you are so good at explaining concepts
Thanks for the videos. Great explanation.
Very clear, and simple explanation.
Glad it was helpful!
Great explanation, Daniel. Thank you
You're very welcome Anton!
Good stuff man. I love the clarity you bring to a subject. Subscribed.
Prepping for Amazon TPM interview and this is so helpful!
Thanks Tamara and good luck on your interview! Make sure you focus on those leadership principles !
Great explanations! Thanks, Keep it coming!
Thanks Sharon!
Very good explanation, thank you.
One point is not clear: do we really gain availability / fault tolerance if there is an intermediate layer that routes the requests? To me it looks the same, doesn't it?
great video, I understand what idempotency operations entails, thank you
Thank you so much for the post.
Good work.
Keep it up.
You're very welcome Raju!
superb explanation of DB scaling & sharding & W/R databases for a non DB person ;)
Thanks!
Thank you so much for your generosity!
Thank you very much for your great RUclips help. I am new to Excel and Chatbot. How can I migrate the Excel database, export it from Microsoft Azure WebApp, and import it into AWS Chatbot? I keep getting errors about a missing QID and others on the AWS Chatbot console. Please show me the fastest way to convert the Excel file and make it compatible with AWS Chatbot.
This is great and super clear. Thank you!
You're very welcome
Very well explained. Thank you
You're very welcome!
Great job. Very well explained!!!
Thanks so much Jackson! Glad you enjoyed :)
incredible explanation, thank you!
clear and concise. subscribed
Thanks Libert and welcome!
woww...!! great videos, great presentation, great explanation. thank you, keep sharing..
Hmm..how about PITR? For analytics you could have replica with multi-master approach to each shard, right?
Really good work man... such a detailed video...
Thanks Sofia! Glad you enjoyed :)
Well explained. Thank you!!
You're very welcome Santosh! Glad you enjoyed.
Very well formed content .. thanks 🙏
You're very welcome!
Question: Let's say you shard based on hashing a Guid AccountId. How do you handle queries that do not pass the AccountId? Would you have to rewrite all your ORM code to pass the AccountId? Is this something you have to do from the very beginning designing your app or would you be able to integrate this into legacy code?
Hi Kellen,
Good question. In this case, you may need to do some re-structuring of how you access your data. The partition key needs to be something that is identifiable and known as part of every query in order to know which shard to look at. This may not be realistic for applications that are already in production and don't follow this invariant, so you may have some additional challenges in terms of making this work.
Hope this helps, Daniel
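To make the point about the partition key concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the shards are plain in-memory dicts, and `shard_for`, `put`, `get_by_account_id`, and `find_by_email` are illustrative names, not part of any ORM or database. It shows that a lookup carrying the AccountId touches one shard, while a query without it has to ask every shard and merge the results (often called scatter-gather).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3
# Hypothetical stand-ins for three database shards.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(account_id: str) -> int:
    """Stable hash of the partition key, reduced to a shard index."""
    return zlib.crc32(account_id.encode("utf-8")) % NUM_SHARDS

def put(account_id: str, row: dict) -> None:
    shards[shard_for(account_id)][account_id] = row

def get_by_account_id(account_id: str):
    # Partition key is known: exactly one shard is consulted.
    return shards[shard_for(account_id)].get(account_id)

def find_by_email(email: str) -> list:
    # Partition key is NOT known: scatter the query to every shard,
    # then gather and merge the partial results.
    def scan(shard: dict) -> list:
        return [aid for aid, row in shard.items() if row["email"] == email]

    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(scan, shards)
    return sorted(aid for part in partials for aid in part)

put("acct-1", {"email": "ann@example.com"})
put("acct-2", {"email": "bob@example.com"})
put("acct-3", {"email": "ann@example.com"})

print(get_by_account_id("acct-2"))
print(find_by_email("ann@example.com"))
```

The scatter-gather path works without rewriting every query, but its cost grows with the number of shards, which is why legacy code that rarely passes the key tends to be the hard part of a sharding migration.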
@@BeABetterDev thanks
Thank you so much for the great explanation
What I always miss in these videos is, doesn’t introducing a routing layer just kick the can down the road? Now you have all traffic going to a singular routing node, which is not scalable and can fail. What happens when you need to scale the routing node?
is the sharding process explained in this vid the same as in redis clusters?
doesn't the routing layer introduce single point of failure as well though?
Well explained!!
Thank you!
Another complex thing is ID generation (here, the customer ID): when we shard, we have to make sure duplicate IDs are not generated. Can we have a video on ID generation in distributed computing?
Your videos are awesome! Thanks
Thanks ray!
Super clear. Thank you!
I have a question:
What if a shard returns incomplete information? That is, if a customer queries the DB and the shard returns incomplete info, then what's the use? Why isn't a backup a good option?
Excellent presentation, very good explanation 👍👍
Can we scale up and scale down the storage of database as per daily requirement using sharding?
@BeABetterDev What if I were to opt for synchronous replication for my read replicas? Wouldn't that provide me with a high level of consistency (strong consistency) between the master node and the replica nodes? Besides, AWS RDS provides async replication for read replicas, does that mean it is eventual consistent? If so, if I am building an application that needs to opt in strong consistency, shouldn't I use AWS RDS read replicas then? What would be an alternative option to that?
What if one of the shard nodes is down? For HA, we still need a replica for each shard node.
Great video! But how do we handle foreign keys in sharding?
Vids are awesome, really enjoy them. Interesting that you didn't touch on the lack of thought to database design, indexing and maintenance etc as a way to improve performance. Interested to know why? Especially given the cost of scaling in serverless environments.
I was considering partitioning to improve query performance in a large database I was working on. The only issue is that it has a foreign key implementation, which means we cannot use partitioning on it unless it's uniform. So if sharding is a type of partitioning, then I'm guessing even this method won't work. Anybody got any tips?
It's been 1 year since you asked; sadly, I haven't gotten an answer for you. However, I am hopeful that you solved your problem and might be willing to share your experience. 🤲
Perfect explanation. Thank you
Great explanation. Thank you
You are welcome!
Really useful content! Keep it up!
Thanks so much Simone!
Which is the better architecture: microservices, or a single database with sharding added later when it scales?
Nice tutorial. I wonder, in a real-world scenario, does the routing layer sit in the application code or is it implemented on the database side?
Love longer videos ❤
Great explanation!
Glad it was helpful!
How does one maintain redundancy in the router that maps from user ID's to a shard ? It seems to me like this creates another single point of failure.
Hi Paneer,
Good question. Two options: the first is a master table that contains all the mappings. You are correct that this creates a single point of failure, but it is useful from a management perspective if you ever need to re-shuffle your data distribution onto different shards. You can migrate the data and then change your pointer in the mapping table once complete.
The other option is using a hash function where the output points to the correct shard. This can be computed on the routing layer and there will be no single point of failure. The problem with this approach is that it gets more difficult to manage reshuffling/migrations.
Hope this helps
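The two options in this reply can be sketched side by side. This is only an illustration under stated assumptions: `TableRouter`, `HashRouter`, the shard names, and the dict standing in for the master table are all hypothetical, and CRC32 is used simply because it is a stable hash available in the standard library.

```python
import zlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names

class TableRouter:
    """Option 1: an explicit mapping table is the authority."""

    def __init__(self, table: dict):
        # In a real system this would be a shared store; here it is a dict.
        self.table = table

    def route(self, customer_id: int) -> str:
        return self.table[customer_id]

class HashRouter:
    """Option 2: the shard is computed, so routers hold no shared state."""

    def __init__(self, shards):
        self.shards = list(shards)

    def route(self, customer_id: int) -> str:
        digest = zlib.crc32(str(customer_id).encode("utf-8"))
        return self.shards[digest % len(self.shards)]

mapping_table = {101: "shard-a", 102: "shard-b"}
table_router = TableRouter(mapping_table)

# Re-shuffling with the table: migrate the rows first, then flip one pointer.
mapping_table[101] = "shard-c"
print(table_router.route(101))  # shard-c

# Any number of hash routers can run independently and still agree.
router_1 = HashRouter(SHARDS)
router_2 = HashRouter(SHARDS)
print(router_1.route(101) == router_2.route(101))  # True
```

The trade-off described above shows up directly: the table makes moving one customer a single assignment but must itself be kept available, while the hash routers need nothing shared but offer no way to move a single customer.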
@@BeABetterDev With option 1, we could just do a simple file-based load into memory that maps from user ID to shard. If we ever wanted to add another shard, we would upload a new file with the updated mappings and then signal our router service to refresh. The problem with that? When we roll the update out to multiple shard routers, we have to worry about one router being inconsistent with another during the update. Option 2 does it algorithmically, which sounds great, but when you add a new shard, your hash function goes from, say, X shards to X+1 shards. When we roll this update out to multiple shard routers, we must again ensure there is no inconsistency caused by routers running different hash functions during the upgrade.
So with either case, we run into the same set of inconsistency issues when changing the number of shards and doing what sharding is meant to help with: horizontal scaling.
Thanks for the nice video!
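A small sketch of the X to X+1 problem described in this comment, assuming the simplest possible scheme (integer IDs and `id % num_shards`). The function names are hypothetical. It counts how many of 1000 IDs land on a different shard when going from 4 to 5 shards, and shows two routers disagreeing mid-rollout.

```python
def shard_index(customer_id: int, num_shards: int) -> int:
    """Plain modulus routing."""
    return customer_id % num_shards

def moved_keys(ids, old_n: int, new_n: int) -> list:
    """IDs whose shard changes when the shard count changes."""
    return [i for i in ids if shard_index(i, old_n) != shard_index(i, new_n)]

ids = range(1000)
moved = moved_keys(ids, old_n=4, new_n=5)
print(len(moved) / len(ids))  # 0.8

# Mid-rollout: one router still uses 4 shards, another already uses 5.
def old_router(customer_id: int) -> int:
    return shard_index(customer_id, 4)

def new_router(customer_id: int) -> int:
    return shard_index(customer_id, 5)

print(old_router(7), new_router(7))  # 3 2
```

With plain modulus, most of the data has to move and routers on different versions send the same customer to different shards. Consistent hashing is one commonly used way to reduce how many keys move, though the rollout still has to be coordinated across routers.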
God bless you, sir ✌️
Great vid! I have a question. In massive distributed systems (more read intensive than writes) where hits to your database are really expensive and they are using some form of a caching layer which stores the most frequently accessed data - does the problem of routing go away? Because in this case any writes to the database would mean that you’re invalidating the cache, and reads are done from the caching layer, so even though you may have horizontally partitioned dbs below, they don’t really have to worry about how to route the incoming request for data? I hope my query makes sense.
Same question dude
Even if you have a caching layer serving most of your reads, your cache never stores everything in your DB. For this reason you will still need to solve for reading from the db whenever you have "cache misses," which means the need to retrieve the data from the correct shard still exists (and will require mapping/routing).
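As a rough illustration of this reply, here is a cache-aside read path over sharded storage. It is a sketch with hypothetical names (`cache`, `shards`, `shard_reads`, `read`, `write`) and in-memory dicts in place of a real cache and databases. The point is that the routing function is still called on every cache miss and on every write.

```python
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # hypothetical shard stores
cache = {}                                    # hypothetical cache layer
shard_reads = []                              # which shard each miss went to

def shard_for(customer_id: int) -> int:
    return customer_id % NUM_SHARDS

def write(customer_id: int, row: dict) -> None:
    # Writes always need routing, and they invalidate the cached copy.
    shards[shard_for(customer_id)][customer_id] = row
    cache.pop(customer_id, None)

def read(customer_id: int):
    if customer_id in cache:
        return cache[customer_id]      # cache hit: no shard involved
    idx = shard_for(customer_id)       # cache miss: routing is required
    shard_reads.append(idx)
    row = shards[idx].get(customer_id)
    if row is not None:
        cache[customer_id] = row
    return row

write(12, {"name": "Ada"})
read(12)   # miss, goes to the shard
read(12)   # hit, served from the cache
print(shard_reads)  # [0]
```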
Cool video) What app do you use for drawing?
Adobe Photoshop and a Veikk drawing tablet!
questions:
1. Database is a slightly misleading term.. when we say database don't we really mean the software (RDBMS / NoSQL) that logically organises the data stored in storage SSDs?
2. If yes, are we not just splitting the responsibility of the software? I.e., the data is still in the SSD library, right? Just the database management software is loaded on different servers, and each DBMS server is given responsibility for only some of the queries.
Great video! Please make a video on opsmanager installation in a production environment.
Amazing video!! Understood almost everything, and I'm not an IT guy. The only thing I did not get is the difference between partition mapping and routing :(
A more interesting concept though is how you generate these unique id's that are used in the sharding / partitioning and ensure uniqueness
Awesome! Thanks a lot!
You're very welcome
Thank you, that was really helpful. Great video.
You're very welcome Tran!
So is sharding for relational databases only? What if database has more than 1 table?
Superb explanation 😍
How come I didn't find your channel before?
Great video. Thank you. I just have a question about routing for the determining the shards. Is it always necessary? I was thinking that you could just do modulus on the id to get the shard number instead (eg: customer_id: 12, num_of_shards = 4 so the shard would be 12 % 4 = 0). That way you don't have a single point of failure on the router. What are the downsides to this approach vs router end-point ?
Hi there, this is a great point, thanks for sharing. The problem with using modulus is that it can get difficult to change the assignment of data to shards if you need to re-shuffle your data. With a single table acting as the authority, this can be done trivially.
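Using the numbers from the question (customer 12, 4 shards), the difference can be sketched as follows. The names `modulus_shard` and `shard_table` are hypothetical, and the table is just a dict standing in for the authoritative mapping.

```python
def modulus_shard(customer_id: int, num_shards: int) -> int:
    return customer_id % num_shards

# The example from the question: 12 % 4 = 0.
print(modulus_shard(12, 4))  # 0

# With modulus, the only lever is the shard count, and changing it
# also relocates customers you did not intend to move.
print(modulus_shard(12, 5))                        # 2
print(modulus_shard(13, 4), modulus_shard(13, 5))  # 1 3

# With a table as the authority, one customer can be reassigned alone.
shard_table = {12: 0, 13: 1}
shard_table[12] = 3  # after migrating customer 12's rows to shard 3
print(shard_table)   # {12: 3, 13: 1}
```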
I know you have had other dynamodb videos here but would it be possible to have a more in depth video dealing with sharding in dynamodb and also utilizing this with python/boto3 vs the cli? I know it's not really the same type of sharding per se but this video reminded me that I am interested in seeing that kind of thing
Hey HeavensMeat! Your suggestion is a great idea for a new video, thank you! I'll work on incorporating this into my todo list. Cheers!
I love this breakdown, but it does somewhat leave me wondering when sharding would be a good vs. a bad idea. The cons seem pretty hefty in comparison to the pros.
It would have been nice to run through a few specific different use cases and when one strategy would be better than another.
Nice video. Can you please tell me the software you are using for making this video?
Hi there, thanks for the kind words. I am using photoshop with a drawing tablet. You can learn more about my approach here: ruclips.net/video/6Fk9xDpJhvk/видео.html
great explanation thank u so much
You are welcome!
How about filesystem sharding?