It's worth clarifying that with proper sharding, indexing, and powerful enough machines (which a company with a billion users can afford), a single DB lookup can be done in under a millisecond. So it's not expensive. The actual reason a better solution like an in-memory cache is necessary is that the number of simultaneous lookups (i.e., the number of users trying to sign up) is huge in such a scenario, which makes even a sub-millisecond lookup per query infeasible.
Databases have in memory caches.
I was thinking the same: a binary search takes only about 50 steps (log2 of a quadrillion) on a database of a quadrillion rows, so what is the need for all this? But simultaneous lookups are a valid reason to do so...
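The "about 50 steps" figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it counts worst-case probes of a textbook binary search over sorted keys, not what any particular database engine does internally.

```python
def worst_case_comparisons(n: int) -> int:
    """Worst-case probes for binary search over n sorted keys.

    This equals floor(log2(n)) + 1, which is the bit length of n.
    """
    return n.bit_length() if n > 0 else 0


for label, n in [("billion", 10**9), ("quadrillion", 10**15)]:
    print(label, worst_case_comparisons(n))
# prints:
# billion 30
# quadrillion 50
```

So the per-lookup cost is tiny; as the comment says, the pressure comes from how many of these lookups arrive at once.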
What about network delay @@Cassp0nk
What will be your shard key? We are trying to find if any user exists with the email address.
@@ayeameen we shard a DB, not a table. So sharding here would make sense only if the DB has a single table or only user-related tables. Anyway, I think if we keep it simple and just grow the data, we are better off using partitioning. It creates virtual sub-tables of a table, each with its own B-Tree for indexing. We could also partition by the first letter, maybe. Even with this model we would only have to search among tens of millions of rows instead of a billion.
Compared with the many OOP-concept videos on RUclips tech channels, I found this one very useful, as she explains a real-time use case. Thank you.
Thank you so much 🙂 🙏
Please upload more videos of this type, where you teach how tech giants optimize their APIs. One of the best videos on YouTube, please keep it up.
Sure, pls share it in your circle too and support this channel. 🙏
This was brilliant. As beginners, people do not think of advanced techniques, which is fine, but after a while in the industry you do need to look into the advanced techniques used by the top players.
Agreed! Let me know what you think about my latest video on HLL usage in Redis.
- Unique indexing or hashing: Standard and most effective for quick lookups.
- Sharding: Ideal for distributed systems and extremely large datasets.
- Bloom filters: Fast in-memory probabilistic checks to avoid unnecessary database lookups.
- In-memory caching: Extremely fast for frequently queried user data.
- Partitioning: Optimizes database lookups by reducing the size of search spaces.
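Of the techniques in this list, the Bloom filter is probably the least familiar, so here is a minimal sketch of one in Python. The class name, the default sizes, and the choice of salted SHA-256 as the hash family are assumptions made for illustration; a production filter would size the bit array from the expected item count and target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A "False" answer can be trusted and skips the database entirely; a "True" answer still needs a real lookup to confirm.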
The video itself is great but the comments are gold. Learnt so much.
Glad to hear it! Please check our other videos too 🙏
This is the first time I've seen one of your videos. I'm saying this on behalf of all viewers: you are an amazing teacher with brilliant communication and visual presentation skills...
Subscribed🎉
Wow, thank you! Means a lot to me! Please share it in your circle 🙏
Agree.
I will surely recommend your channel to my team at the office as well. Thanks a lot for this type of video.
My first thought is to optimize your query to cut down your dataset. I think that might be after/within step 3. Great video, it’s awesome to see what bigger companies do.
Thanks RUclips for recommending this.
Even before watching this video I can give my thoughts on it and what I use on a daily basis.
1. Hash partitioning: generate a hash of your data (say, in the range 1 to 100000), then generate a secondary hash from that (say, 1 to 1000). Create a composite index on both. Compute the hashes before querying so the lookup only touches the matching slice. Your data set is reduced.
2. Sharding is a good way to augment the above strategy, for example by using the hash of the hash as the shard key.
3. On top of that, a persistent cache like Redis can be very useful.
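The two-level hashing in points 1 and 2 could look roughly like the sketch below. The bucket counts come from the comment; the function names, the use of SHA-256, and the lowercase/strip normalization of the email are assumptions added for illustration.

```python
import hashlib

PRIMARY_BUCKETS = 100_000   # "1 to 100000" from the comment
SECONDARY_BUCKETS = 1_000   # "1 to 1000" from the comment


def _stable_hash(text: str) -> int:
    # Python's built-in hash() is randomized per process, so use a digest.
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def bucket_keys(email: str) -> tuple[int, int]:
    """Return (primary_hash, secondary_hash) for the composite index.

    The secondary value is a hash of the primary one, so it can double
    as the shard key (the "hash of hash" idea).
    """
    normalized = email.strip().lower()  # assumption: emails compare case-insensitively
    primary = _stable_hash(normalized) % PRIMARY_BUCKETS + 1
    secondary = _stable_hash(str(primary)) % SECONDARY_BUCKETS + 1
    return primary, secondary
```

At query time the application would compute `bucket_keys(email)` first, route to the shard chosen by the secondary value, and then filter on both hash columns plus the email itself.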
Starting my day with good learning through this video and comments.
Greatly explained. Just subscribed.
Awesome, thank you!
Thank you Mam, you can very well name your channel as "Tech Goldmine"!
🙂 🙏
Very simple and in plain, understandable way!!! Excellent explanation. Please keep more videos coming. Subscribed!!!
Thanks, will do! Please share it in your circle 🙏
Superb exploration!
Keep posting such tutorials.
Sure 👍 🙏
Wow. Please upload more of such system design videos
Definitely. Please share it in your circle 🙏
Thank you for providing such a clear explanation with examples of production services. Great content, keep up the amazing work, Ma'am
Much appreciated!
Even if you use an increment/decrement (counting) array instead of a bit array, it won't solve the false-positive problem. The result still relies heavily on the hash functions, and that is what we should focus on more.
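A small sketch may make this point concrete. A counting filter replaces each bit with a counter so that items can be removed, but two different items can still land on the same slots. The class below is hypothetical and simplified; the one-slot filter at the end is a deliberately extreme case chosen so that the collision is certain.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting filter: supports removal, still has false positives."""

    def __init__(self, num_slots: int, num_hashes: int = 3):
        self.counts = [0] * num_slots
        self.num_hashes = num_hashes

    def _slots(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.counts)

    def add(self, item: str) -> None:
        for slot in self._slots(item):
            self.counts[slot] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added earlier.
        for slot in self._slots(item):
            self.counts[slot] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[slot] > 0 for slot in self._slots(item))


# Removal works: after removing the only item, every counter is back to zero.
f = CountingBloomFilter(num_slots=1024)
f.add("alice")
f.remove("alice")

# False positives remain: with a single slot, every item collides.
tiny = CountingBloomFilter(num_slots=1)
tiny.add("taken-name")
```

Here `tiny.might_contain("never-added")` returns True even though that name was never inserted, which is the false positive the comment refers to. More slots and better-spread hashes lower the rate but do not remove it.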
- Use a hash-based index to map user information (e.g., an in-game username) to hash values. Hash tables have an average time complexity of O(1) for lookups.
- Add a cache like Redis if the project's budget allows.
Thanks for the video. It surely expands the knowledge of engineering with all the conversation going on in the comments.
Understood the concept. Keep sharing your precious knowledge with us.
Thank you, I will🙏
This bloom filter stuff is ingenious.
Learned a nice concept and strategy today. Thank you.
Excellent technical presentation. Very good
Thanks! Please check my other videos too!
Straight to the point , instant sub :)
Thanks! Do check out our other videos 🙏
@@TechCareerBytes Ya
Never thought this video would be this informative!!!!
Glad it was helpful! Please share it in your circle and support this channel 🙏
What an insightful video, thank you for sharing such an amazing knowledge. Subscribed!
Awesome, thank you! Please share in your circle 🙏
Thanks RUclips algo, very well explained video, subbed
Thanks for the sub! 🙏
The video was really helpful mam. Thank you for the video.
Glad it was helpful! 🙏
Excellent explanation thanks a lot..
Glad you liked it. Please check our other videos too 🙏
instantly subscribed 🙏
.
system design and concepts for optimal performance
Thanks 🙏
Have you ever looked for a word in the dictionary?
A dictionary has hundreds of thousands of words, but you still take less than 30 seconds to find your word, let's say "luck":
you go to the section for L,
then you go to the section for U,
then you go to the section for C,
and then you go to the section for K,
and then you find your word.
This works only on a sorted database, and the database has to be kept sorted as data is added, deleted, and modified so that lookup time stays low.
you made all that Killer BIT process, simple man...Thank yoU!!
Isn't this database partitioning in a nutshell?
Thnxxx
Yes, that is why databases use an index: to speed up data lookup.
What you explained is the process used to find data in an index-organized table.
For other tables, if the table is indexed and the DB thinks the index is going to be useful, the same lookup process is used on the index, and that yields the location where the data is stored.
@@anothermouth7077 not partition. It's indexing
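The dictionary analogy in this thread maps onto a sorted structure searched by bisection. The sketch below uses Python's standard `bisect` module on a plain sorted list; it is a stand-in for the idea only, since real database indexes are typically B-Trees on disk. The word list and function name are made up for illustration.

```python
import bisect


def contains(sorted_words: list[str], word: str) -> bool:
    """Binary-search a sorted list, like flipping to the right dictionary page."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


words = sorted(["apple", "lamp", "lucid", "luck", "lunch"])

# New entries are placed at their sorted position on insert,
# so no periodic full re-sort is needed.
bisect.insort(words, "lucky")
```

An index behaves the same way: it is maintained on every insert, update, and delete, which is why lookups stay fast without a separate sorting job.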
Wow, Ma'am. Not many people will watch your content because it is very niche, but please, please, please do not stop sharing the knowledge you hold. I am amazed by your knowledge, which I know is the result of many years of hard work in the industry. Thank you very much for making this content and sharing your knowledge for free.
Thanks 🙏
Good one. It could have been a youtube shot. Good luck
You have a new subscriber, ma'am. 🎉
Thanks for a great explanation.
This is very informative. Thank you. Hope to see more videos like this.
Insightful video tutorial with very good real-world examples. Thank you, Ma'am. Keep sharing your knowledge and experiences.
Thanks for liking. Please share it in your circle 🙏
Very good and straight to the point video. Also how to implement these methods in other languages
AI has two meanings.
Artificial Indian and An Instructor.
very knowledgeable video, thanks mam.
Thanks for liking 🙏
Great work ma'am!!
Thanks a lot 😊 🙏
Finally some insights.
Excellent, Ma'am, you have explained real-world scenarios. I am expecting more such videos. It would be even better if you created a Spring Boot application and implemented the scenarios you have explained. It would be so helpful. Thank you..🙏
I will try my best.
The Bloom filter is quite interesting. However, if we combine those three, won't we also get the downsides of each? Imagine we use a Bloom filter for its low memory footprint and Redis for other validations: does Redis still need to store these records? And how can we query the database for other validations with faster responses?
Great question! Combining Bloom filters, Redis, and database queries balances trade-offs. Bloom filters reduce unnecessary queries, while Redis stores frequently accessed data for faster lookups. Redis doesn't need to store all records, just recent or frequently used ones. For database queries, we rely on sharding and indexing to maintain speed, with Redis acting as a buffer to reduce load.
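The layering described in this reply can be sketched as a three-step check: Bloom filter first, then cache, then database. Everything below is an in-memory stand-in written for illustration: a dict plays the role of Redis, a set plays the role of the sharded and indexed user table, and the class names and the `db_queries` counter are hypothetical.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter backed by a Python int used as a bitmask."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class SignupChecker:
    def __init__(self):
        self.bloom = TinyBloom()
        self.cache = {}       # stand-in for Redis: holds only emails that were looked up
        self.db = set()       # stand-in for the sharded, indexed user table
        self.db_queries = 0   # counts how often the "database" is actually hit

    def register(self, email: str) -> None:
        self.db.add(email)
        self.bloom.add(email)
        self.cache.pop(email, None)  # drop any stale "not taken" answer

    def is_taken(self, email: str) -> bool:
        if not self.bloom.might_contain(email):
            return False               # definitely new: no cache or DB access
        if email in self.cache:
            return self.cache[email]   # recently checked: answered from cache
        self.db_queries += 1           # possible match: confirm in the database
        taken = email in self.db
        self.cache[email] = taken
        return taken
```

Note that the cache holds only entries that were actually looked up, which matches the reply: Redis does not need every record, and the Bloom filter keeps most brand-new emails from reaching either layer.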
Explained so well!
This was so insightful, thank you so much.
Glad it was helpful! Please share it in your circle 🙏
Thanks 4 the info, You've now got a new sub😁
Thanks for the sub! Please check my other videos too 🙏
wow , I really loved it just watched it out of curiosity and learned a lot
Happy to hear that! Please share it in your circle 🙏
Great info 🙌🙌
I would say handling multiple users logging in at the same time is the only concern. Most of the time Redis will take care of this. And for a smaller application, the application's in-memory database would be sufficient.
Also, many people try to solve the problem at the application level while their own database does not have indexes. A well-designed table is far more efficient than creating hash functions.
I don't understand how caching will help with this particular problem. If I want to check whether an email is already in use, hardly anyone else will check the same email in the near term (before cache expiry).
Good point! Caching helps mainly with frequently checked or popular usernames/emails. For unique queries, cache hits are rare, but it still reduces load on the database for cases where multiple users may check the same email (e.g., typos or common names). Caching shines more in scenarios with repeated access patterns, but other techniques like Bloom filters handle the uniqueness aspect efficiently.
I'd assume there would be a lot of queries checking the most common emails, like
Max@mail
John@mail etc.
Of course, it's not effective for very specific email IDs, and that's why I agree with the solution that combines multiple approaches.
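The disagreement in this thread comes down to the hit rate, which is easy to see with a toy run. The sketch below uses `functools.lru_cache` as a stand-in for a real cache such as Redis, and a set as a stand-in for the database; the addresses and the request sequence are invented for illustration.

```python
from functools import lru_cache

registered = {"john@mail.example", "max@mail.example"}  # stand-in for the user table


@lru_cache(maxsize=1024)
def email_exists(email: str) -> bool:
    # In a real system this would be the database query.
    return email in registered


requests = [
    "john@mail.example",   # miss
    "max@mail.example",    # miss
    "john@mail.example",   # hit: popular name checked again
    "zq81x@mail.example",  # miss: one-off address, cache cannot help
    "john@mail.example",   # hit
]
for email in requests:
    email_exists(email)

info = email_exists.cache_info()
print(info.hits, info.misses)  # prints: 2 3
```

Popular names get hits, one-off addresses never do. A real cache would also need expiry or invalidation, because a cached "not taken" answer goes stale the moment someone registers that address.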
Thank you for the nice and detailed explanation. I have a question: what is the maximum length of a value passed to the hash function? In this case, what would the maximum email length be? Is there a rule of thumb for that?
SHA-256 always generates a fixed 256-bit (32-byte) hash, regardless of input length, so it doesn’t limit the maximum length of an email. Email length limits are typically defined by standards or application constraints - usually up to 254 characters as per the RFC guidelines.
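The fixed-size property is easy to confirm with Python's standard `hashlib`. The two addresses below are made up; the long one is padded to exactly 254 characters to match the limit mentioned above.

```python
import hashlib


def email_digest(email: str) -> bytes:
    """SHA-256 digest of an email address (always 32 bytes)."""
    return hashlib.sha256(email.encode("utf-8")).digest()


short_email = "a@b.co"
long_email = "x" * 242 + "@example.com"  # 242 + 12 = 254 characters

print(len(long_email), len(email_digest(short_email)), len(email_digest(long_email)))
# prints: 254 32 32
```

So the hash function itself puts no practical cap on email length; any limit comes from the email standards or from the application's own validation.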
Nice tutorial, learnt something new today, thank you so much Mam...
Glad to hear that. Please share it in your circle too! 🙏
This was impressive
Wow, I learned quite some new things from this! Thanks
Glad it was helpful! Please share it in your circle 🙏
Just awesome 💯💯
What a way to explain💐📈
Thanks 🙏
First time watching your videos and it was very informative. Thank you for your efforts and clear explanation.
Glad it was helpful! 🙏
Great, Please upload more videos on these concepts !!!!
Thank you, I will. Please share it in your circle and support this channel 🙏
Thank you 🙏 please upload video in 4K resolution if possible
Ok next time
Wow, Unique video, great content, Nice explanation.. Thank you so much madam, Please make this kind of unique videos, Subscribed.... ♥
Thanks and welcome 🙏
Thanks and welcome
Sharding the database also helps in querying speed & performance!
It does. It also adds complexity to querying. Database sharding definitely helps with scaling, as it distributes the load across multiple servers. However, even with sharding, cache and Bloom filters add an extra layer of speed by reducing direct database queries, which is crucial for minimizing latency at a massive scale.
What an explanation, subscribed. So basically we need Bloom filters only when we have data in the lakhs or crores, not in the hundreds or thousands, where caching can be used efficiently, right?
Also, is a Bloom filter used only for signup, or is it beneficial in other scenarios too?
Yes, for the data size you mentioned, caching with database sharding and indexing would be good enough. Better to check with your architect.
I have mentioned a few other scenarios where Google, Facebook, and HBase use Bloom filters. Please check.
@@TechCareerBytes ok, thanks, will check that.
Great video. I would suggest improving the quality of the screenshots and investing in a good-quality microphone.
Working on it
Love your videos mam. Can you please make a video on sorted sets data structure?
Great. Useful.
Nice video 😊
amazing video thanks for sharing
Glad you liked it! 🙏
Helpful 🙏🏼
Keep it up ma'am 👏
Thank you, I will. Please share it in your circle.
@@TechCareerBytes sure ma'am
Awesome explanation and informative video, but I have a question: what if we add a constraint on the email column itself? How would the DB behave then? Will it check all the entries, the same as querying all the records manually? Thank you.
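One way to explore this question is to try it on a small database. The sketch below uses SQLite through Python's standard `sqlite3` module, assuming the constraint meant here is a UNIQUE constraint; the table and helper names are made up. SQLite backs a UNIQUE column with an automatic index, and relational databases in general enforce uniqueness through an index rather than by scanning every row, though the details vary by engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))


def insert_is_rejected(email: str) -> bool:
    """Try to insert; report whether the UNIQUE constraint blocked it."""
    try:
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return False
    except sqlite3.IntegrityError:
        return True


# Ask SQLite how it would look up an email: the plan names the automatic
# index created for the UNIQUE constraint instead of a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM users WHERE email = ?",
    ("alice@example.com",),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

A duplicate insert is rejected with an `IntegrityError`, and the query plan mentions `sqlite_autoindex_users_1`, so the check is an index search rather than a pass over all records.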
When to use consistent hashing ? Please explain with real use cases. Thanks Ruba
Sure. You can check my videos on data partition and data replication. They cover consistent hashing.
Such concept in this short video, really appreciate it. ❤
Glad you liked it!. Please share it in your circle 🙏
Please post more video like this
Super Awesome video, make more like it.
I will try my best. Thanks. Please share it in your circle 🙏
Incredible !!!! spot on
Thank you! Pls share it in your circle 🙏
Great lesson! keep it up ! Thanks! :)
Thanks! Please share it in your circle and support this channel 🙏
The code snippets are not properly visible; the fonts are blurry.
Please check the description for the link to the code.
Super Madam.
One of the great videos.
Glad you think so! Pls share it in your circle 🙏
great video mam....Thanks
Thanks! Please share it in your circle 🙏
I appreciate the amazing knowledge you share, but please buy a good-quality mic; your audio should be cleaner.
Noted. Thanks.
Thanks Rupa mam
❤ Nice. Please make a video on schema migration and database migration.
Really helpful
Most startups (99%) don’t have a billion customers. Those that do have already implemented a one-time custom solution to this problem. I don’t understand the reason for such interview questions. Just do a DB query on an index.
Great Video
Thanks! Please share it in your circle and support this channel 🙏
What about sharding the database?
Database sharding definitely helps with scaling, as it distributes the load across multiple servers. However, even with sharding, cache and Bloom filters add an extra layer of speed by reducing direct database queries, which is crucial for minimizing latency at a massive scale.
What about indexing? Would it not be helpful, and the best approach?
Indexing and database sharding will definitely help. But, at large scale we also need a cache and bloom filter to speed up the process.
Please tell about sharding and other concepts
Please check this video - ruclips.net/video/EoHh1NMeUJM/видео.html
Extraordinary mam.
Thanks a lot 🙏 please share it in your circle.
If the data is not that big, say about 10-15M, can we just apply a cache plus an index on the table? Please guide us.
For 10-15 million users, an index and cache would likely work well for performance, especially with a well-tuned database. Please check with your architect to understand the future data growth and scale expected.
Caching in a caching DB is the same as the direct DB query approach.
Thank you so much for the video, ma'am. Can you please provide the link to the code? It's not clear.
Yes, sure. Please check the description box for the link. Don't forget to share the video in your circle 🙏
Thank you
To find a user in milliseconds, we need a combination of geo-location-based routing, caching, and database sharding.
Yes, if the objective is only to check whether the user exists, a Bloom filter may be the way to go.
Nice content Mam.
Thanks a lot. Please share it in your circle and support this channel. 🙏
The video is great, but I wish the quality of the images provided as examples were better.
Noted. Will work on it
Can you make a video on my question? I have 10 billion mobile numbers and want to delete 1,000 of them. What is an efficient way to do that in Java?
Let me try.
Caching won't help, I guess, because every time a new user comes and enters their email ID, there is a high chance that the data is not in the cache.