The only thing about this channel that makes me feel awful is that i didnt discovered it earlier. The topics discussed here are not really common, but extremely important in my opinion. I havent found anything that gets near this channel, some topics are so high level that I havent heard of them at all. Makes me feel like I know nothing, which is the best feeling ever. There is no much to learn! Thank you so much!
I'm a junior Software Engineer and working my way through to specialize with Backend. I always learn new concepts from your videos. This one in particular hits home for me because I've worked on couple of projects were I had to make the design choice of the database. I found it quite difficult to make the right choices because I always end up building a search endpoint with full text search and other search parameters. To put it into context, the databases I had problems with were filled with recipes. I had multiple tables with 100k+ entries filled only with IDs (the problem you mentioned). After two projects I think I switched to the latest idea you mentioned with the list/json column and that one worked the best for me. Because not only it avoids searching through a big table but also saves me an extra query to another table. This is kind of irrelevant the this video but when implementing full text search I think it's better to go with postgresql rather than mysql since it supports gin and gist indexes and fuzzy searching that can help build a nice, affordable and quick solution to meduim sized databases. Keep doing the nice work.
Man your content is GOLD. I come from a front-end stack and was underestimating the work with databases. But your content has helped in understanding the pitfalls of backend engineering.
You help not only software engineers with your channel, but also data analysts like me that works a lot with data engineering. Thanks! Greetings from Brazil.
generally, it's helpful to think in terms of read and write paths of your data. On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists or not to reduce searching the B-Tree or other persistent data structures
nilanjan sarkar nice idea Nilanjan, Completely forgot about bloom filters probabilistic data structure.. For anyone interested check out this video I discussed it before Bloom Filters Explained by Example ruclips.net/video/gBygn3cVP80/видео.html
" On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists" That would mean scanning the bloom for every partition and the results you get will contain false-positives meaning you will still scan indexesof many partitions that wil not contain any matches. It's better than doing an index scan on all partitions, but I get the impression that we're not talking about the same thing when we say "partitioning'. The point of partitioning is that you divide rows up by the value of fields so you can predict the partitions-to-search by looking at the query. If I partition a phonebook by the first letter of the last name then querying for "Johnes" will only use the 'J' partition and the btree of that table wil be 1/26th of the original table (roughly) and so scan much faster. I don't have to pre-filter to find out that I need partition 'J', the database knows that from the query.
Good video. The last trick in the video is called denormalization. Also as soon as you introduce sharding you need also add replication because the probability of failures increases.
I had to deal with tables with a few billion records per month, and MySql merge engine (merge tables) let us slice and dice them any way we needed. You can even have more specific indexes in the real tables themselves, as long as each merge index exists in all tables. The downside of Merge tables is that it multiplies the open file handles on the system, which can be tricky for a machine doing public networking, but with latest kernels, you can get kind of crazy. High memory makes a huge difference, of course.
Maybe I didn't understand the second last section about eliminating the need to update both ends of a connection, but your solution will crumble when person A who is following person B closes their account since the information on who follows person B is only in the person's B's records. So when person A exits the medium, we won't know who to update.
2 extra ideas: 1. table archiving: most large tables are caused by timeseries records -- just archive the old records in separate tables and keep the live table small 2. use modern databases that are more scalable than traditional single-host databases: cockroachdb, spanner, aurora, tidb, fauna, etc.
I'm a big fan of archiving, just make sure that you don't shift the problem from the "live" table to the archive table, if you keep archiving to the same table for a year then that table will become impossibly huge and essentially unusable and you end updeleting it. A simple cronjob can create a new monthly table or something. Partitioning is in effect a type of archiving but with a transparent api so you can just run a query and the database will choose the partitions by itself. As for more scalable database engines; that would depend on wether those can still deliver the functionality you need. For example; any app dealing with money cannot live without transactions and referental integrity because one mistake can cost millions. Also; is't not exactly true that traditional databases are single-host. Replication is easy to setup to get more nodes involved and sharding is realtively easy to built into an app or a proxy.
Are we archiving entire db or just a table? How do we join between archived and non archived dbs? Because if there are joined users that do not exist anymore archive will not work. I guess we need to snapshot the db or at least more than one table that is problematic
I really like listening to you while doing other stuff, like driving eating or walking. Always being productive and learning new things "sidely" by your videos. Habibi ❤
Sometimes you cannot redesign the table or escape getting at billion rows tables. For example, the journal entry for real-time payment system gets very large in very few months. Moreover, you cannot resort to NoSQL databases since these data are ACID in nature in which eventual consistency is not applicable. I guess you will end up using all the aforementioned techniques like: Indexing (Obvious One), Partitioning and Sharding. One problem with Partitioning though is that most ORMs are not Partition aware. On the other hand, you have to be careful of new distributed and horizontally scalable databases like Cockroach DB, YugaByte,etc,. Because there is always a catch.
I did have a field day implementing partitioning on top of JPA and Postgres a few years back. Works like a charm, but requires a lot of low level fiddling to get it to work
if there is a huge table and we need to run a migration on it (structural changes), how partitioning can help to minimising the outage in production environment as table lock will not allow other changes to be applied on that time.?
In case of a twitter like app, I will go certainly go for a graph database like Neo4j and use another database to store indexes. Each user is a node and each followingX relation is an edge, It will limit the join operations
Bruce Wayne fantastic thoughts, Graph databases are designed for this use cases so they will be optimized for this.. that will be another way to avoid that large table .. I don’t know how graph databases work though so not sure of their scale and how do they store things
@@hnasr If I'm not wrong , they store the addresses in one file and the data in other ones , so if you find a node using indexes , you will find next to it the physicial address of each edge related to it , it's just a O(1) read operation, and so navigate in the graph by just simple read operations instead of complex joins. And this is an article introducing sharding for neo4j neo4j.com/whats-new-in-neo4j/scalability/ Thanks for your videos , I'm learning a lot of new things each day
Does the twitter example lend itself to Graph? The twitter database is basically three tables with single-field relations.The joins in the queries don't go deeper than one level. Graph is excellent for large...well graphs,where nodes are linked to nodes who are linked to nodes etc etc , but twitter doesn't d that.(unless you perhaps want to query all the users who follow all users who follow all users who follow allusers who follow Obama,but that's easily solved using recursion in SQLand really doesn't require graph.
Great video Hussein. Just wanted to share my thoughts as I have first hand experience with a table container over a half billion rows. I don’t believe, a billion row table is that big of a deal for a modern RDBMS if the schema is well designed and proper indexes are in place (to support the common query patterns). I have a database with over half billion rows which works flawless on a medium sized hardware. (Azure Sql database) Indexes go a long way, then partition and then shard as you mentioned. Also before making things complicated by adding shards etc, you can always ask yourself “is there data in the table that I can offload/archive/delete to some other table?” The answer in most cases would be yes. This should save your from a lot of additional complexity.
Rizwan, might i ask what database technology you are using? I have been working on a similar table (well, multiple tables with around half a billion rows) and i have faced the performance issues. I have tried it on postgres as well as mysql. Indexes definitely make things better, however, in case of searches, even they are not enough. iterating the table sometimes took me days. I didn't design the structure, but have to maintain it. The only way left was sharding for me. If you hhave any other tricks and tips, please share!
@@user-bk5xo1gj7k Hello, I am using Azure SQL Database but I would think others (MySql, Postgresql etc) perform very similar. For instance, here is the table I had that has: 213,461,562 Rows (almost a quarter million) I ran a query on it by filtering on a column that is indexed (datetime) and selected top 100 rows and here are the results: SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. (100 rows affected) SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. Completion time: 2023-01-24T08:34:59.6096640-06:00 As you can see, it was barely anything for the server. A few things to look at: 1) Is your table indexed for the query you are running against it? 2) How much data is part of the query result? This is why my original comment stated that you barely need sharding (partition should come before sharding) until maybe you get into multi billions and space on a single server becomes a problem). Sharding is no easy topic when it comes to properly implementing it. Here is something that may be of help in your optimization scenario: use-the-index-luke.com/ Best of luck and let me know if you have any questions.
swapnil kale discussed this some what in my database engine course Database Engines Crash Course (MyISAM, Aria, InnoDB, XtraDB, LevelDB & RocksDB) ruclips.net/video/K9Qd3UMHUQ4/видео.html
Thanks for the video man! Loved it! In the redesigning the DB part of your video, taking the Profile example, where you have a column or a key which stores the list of followers, my doubt is will that not exceed the maximum limit of the row or a document and make it extremely heavy while retrieving one follower ? I think it's an open-ended discussion and totally depends on the use case.
I liked all your ideas. Although I was waiting if you could talk about data archival and purging. Something I tried was on archival of data to another database and purging it from the original database which may not be accessed frequently.
what to do when I want to overwrite 100 millions of rows into new table, in minutes? df.write.mode("overwrite").saveAsTable("FINAL"), if you could please help with this?
What is your thoughts on searching a varchar(1000) column in a billion rows table. Note , only one column and keyword search. Like %xyz% . I know like query will be supper slow. So what will be the alternates. How to solve
Help!! i am user of database so indexing is not granted to me and all the rows except id are of datatype text so i can't make partition with those. partitioning with Id is not an option because data stored randomly. i am junior developer so sharding is not option again. my requirement is to search user based on parameter like Country or job title or company or all of them. data is huge almost 1B+. so please give idea what i can do ?
I don't think you can do much if, any backend dev/DBA can help you with your queries? you need to be able to have the proper indexes on the table to get better performance
Regarding sharding: Isn't it better to put the database instances into a cluster and let the master take care of sharding so that the client will not need to know which db instance to query - it will only need to know the master? Im not a DB guru but i always thought that this is how it works???
Correct, Some database vendors provide such feature yes but not every DB has this feature and for the longest time the engineer has to built this themself. I believe Vitess solves this for mySQL. Cassandra also has consistent hashing which distribute the data across shards..
@@hnasr PostgreSQL can do it with Foreign-Data-Wrappers; essentially you create a partition for a table and stick it on a different node through FWD. www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/ If you route all traffic for all shards through the root node that means you can still only go as fast as the root node, you're not leveraging the full capacity of each shard. Look at geaographical sharding where each shard holds the data that is used for a region, completely bypassing the 'root' node.
Thank you for the content it's helpful, I think the problem with your design is that when you try to discover all the profiles you follow, you will end up in trouble.
HI, Thanks for the video.This gives me an insights of how to design a very large database. But can we do this partitioning and sharding on a relational database like MYSQL for example? if we do so I believe we can't leverage the full benefits of MYSQL IMO ,a data base with lots of partitioned table how JOINS will work then?
Great topic. Very helpful 👍🏾. Truthfully, I always go with the last option 😅 because most of my projects will never have a billion rows. Now I have others options too. Thanks.
Wow great video. What about using both Spark+Hadoop in conjunction with the index-partitioning-sharding approach. Would such a strategy even makes sense?
Kinda don’t like the idea of having JSON tbh because you can’t enforce foreign keys and you, in this example, couldn’t get a list of your followers. That said, love your content. Thanks for the videos!
I'm currently having this problem right now with my current company. another way could be just archiving old data if they are not often accessed and creating a mechanism to load up the archive data whenever needed. for example transaction tables
Will your last solution work for the user with millions followers ? I can not wrap my head around. That's a huge amount of data for a single document/column
Very Nice content 🥇 I would just ask: isn't sharding just a type of partitioning? Sharding = multi-host-partitioning? Keep up the good work. Very interesting 👏👏
I'd like to see an example of how to handle sql joins when using shards. I'm guessing since you can't do joins using sharding, you'll have to pull everything in on the application side and glue everything together.
Thanks for the great video. I'm learning a lot from this channel. Could a junction table be used in this case as well? For example: id(primary key), followed_id(foreign key), follower_id(foreign key) ...; this would mean a single added row each time the 'follow' button is pressed and, if needed, an inner join to see the relevant details of the followers
no i don't think that'd work. see, the issue is the number of rows in the end. junction table would be larger than the 2 tables that table is a junction between. the junction table could grow to multiple billions of rows and we would end up with massive table. (hope i understood your question correctly)
@@user-bk5xo1gj7k yes, that's right. Thanks! I'm wondering now whether there are any other approaches to this problem, apart from those already mentioned. Maybe as far as the DB schema is concerned.
Hi Hussein, Love your videos. Can you please create a video on creating index on production tables without having a lock on the table.Maybe you can talk about Online DDL, percona , memory consumption and write latency impact.
I am planning to make steak of daily watching your video(s). It's really amazing content 😁✌️ Great work. Let's see how much consistent I be in this activity 🎉 #🔥 (1)
I would like to ask you just one thing, how did you gather so much knowledge? As a software engineer myself, I can assure you this isn't something you get from work experience. And I'm guessing it was books? If so please do share your recommended reading list
You had great ideas until the 10 minute mark. What happens a follower of 3 or n people leaves or changes their name. How will you find the follower unless you scan every single user? Stick to a real relational design and this will never be an issue.
your content is great but @10:20 u created list of users(mongodb/json) and that can grow infinite but also there is a limit of mongodb that 1 document cant exceed 16mb size.
Rahul Goyal ohhh I didn’t know Mongo has that limit.. I wouldn’t say infinite though, Tyler swift had 50 million followers which is still manageable but still you are right .. someone mentioned graph database which can also be useful Thanks for sharing thoughts
We could maintain a separate table, where each record being (user,list of follower Ids) which is queried only when required. You could break the total number of followers into multiple records. Something like paging from OS. Maybe userid+record_number as your primary key?
@@Iamhomkar Technicaly yes, but if the database cannot hold the amount of data you need it to hold then you should move to a different database that can.
Can't we use something like a Time Series DB, Elastic or Postgress or Cassandra, they are famous for being able handle heavy loads of data. Btw how the Bitcoin DB works ? There are millions of transactions
The only thing about this channel that makes me feel awful is that i didnt discovered it earlier. The topics discussed here are not really common, but extremely important in my opinion. I havent found anything that gets near this channel, some topics are so high level that I havent heard of them at all. Makes me feel like I know nothing, which is the best feeling ever. There is no much to learn! Thank you so much!
I totally agree with you. This guy makes my time spent worthful.
Exactly! RUclips hasn't reported me any similars yet😂
This idea of transitioning the delay to the writer AND then using queues for writes is true architecting foresight. I love it.
I'm a junior Software Engineer and working my way through to specialize with Backend. I always learn new concepts from your videos. This one in particular hits home for me because I've worked on couple of projects were I had to make the design choice of the database. I found it quite difficult to make the right choices because I always end up building a search endpoint with full text search and other search parameters.
To put it into context, the databases I had problems with were filled with recipes. I had multiple tables with 100k+ entries filled only with IDs (the problem you mentioned). After two projects I think I switched to the latest idea you mentioned with the list/json column and that one worked the best for me. Because not only it avoids searching through a big table but also saves me an extra query to another table.
This is kind of irrelevant the this video but when implementing full text search I think it's better to go with postgresql rather than mysql since it supports gin and gist indexes and fuzzy searching that can help build a nice, affordable and quick solution to meduim sized databases.
Keep doing the nice work.
Thanks for sharing your thoughts Ashraf appreciate it. this is useful
Isn't that best text search db is Elasticsearch??
It's kind of related to data structure instead of one to one ur having one to many relation per row
Man your content is GOLD. I come from a front-end stack and was underestimating the work with databases. But your content has helped in understanding the pitfalls of backend engineering.
You help not only software engineers with your channel, but also data analysts like me that works a lot with data engineering. Thanks! Greetings from Brazil.
Love to all subs from beautiful Brazil 🇧🇷
Yet another great video! Truly educational, even for someone who have been in the game for over 15 years. 👏
generally, it's helpful to think in terms of read and write paths of your data. On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists or not to reduce searching the B-Tree or other persistent data structures
nilanjan sarkar nice idea Nilanjan, Completely forgot about bloom filters probabilistic data structure..
For anyone interested check out this video I discussed it before Bloom Filters Explained by Example
ruclips.net/video/gBygn3cVP80/видео.html
" On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists"
That would mean scanning the bloom for every partition and the results you get will contain false-positives meaning you will still scan indexesof many partitions that wil not contain any matches. It's better than doing an index scan on all partitions, but I get the impression that we're not talking about the same thing when we say "partitioning'.
The point of partitioning is that you divide rows up by the value of fields so you can predict the partitions-to-search by looking at the query. If I partition a phonebook by the first letter of the last name then querying for "Johnes" will only use the 'J' partition and the btree of that table wil be 1/26th of the original table (roughly) and so scan much faster. I don't have to pre-filter to find out that I need partition 'J', the database knows that from the query.
@@vinny142 Yes. This is more from an LSM tree perspective where an entry in a partition can reside both in memory and on disk via multiple versions
Your content is top bro.
I’m front end developer acquired backend skills, all because of good content from this channel. Amazing content and energy, thank you 😊
Couldn't have explained it simpler! Big ups dude!!
Good video. The last trick in the video is called denormalization. Also as soon as you introduce sharding you need also add replication because the probability of failures increases.
I had to deal with tables with a few billion records per month, and MySql merge engine (merge tables) let us slice and dice them any way we needed. You can even have more specific indexes in the real tables themselves, as long as each merge index exists in all tables.
The downside of Merge tables is that it multiplies the open file handles on the system, which can be tricky for a machine doing public networking, but with latest kernels, you can get kind of crazy.
High memory makes a huge difference, of course.
Maybe I didn't understand the second last section about eliminating the need to update both ends of a connection, but your solution will crumble when person A who is following person B closes their account since the information on who follows person B is only in the person's B's records. So when person A exits the medium, we won't know who to update.
2 extra ideas:
1. table archiving: most large tables are caused by timeseries records -- just archive the old records in separate tables and keep the live table small
2. use modern databases that are more scalable than traditional single-host databases: cockroachdb, spanner, aurora, tidb, fauna, etc.
Great idea if its time series and you can afford to archive do it..
I'm a big fan of archiving, just make sure that you don't shift the problem from the "live" table to the archive table, if you keep archiving to the same table for a year then that table will become impossibly huge and essentially unusable and you end updeleting it. A simple cronjob can create a new monthly table or something.
Partitioning is in effect a type of archiving but with a transparent api so you can just run a query and the database will choose the partitions by itself.
As for more scalable database engines; that would depend on wether those can still deliver the functionality you need. For example; any app dealing with money cannot live without transactions and referental integrity because one mistake can cost millions.
Also; is't not exactly true that traditional databases are single-host. Replication is easy to setup to get more nodes involved and sharding is realtively easy to built into an app or a proxy.
Are we archiving entire db or just a table? How do we join between archived and non archived dbs? Because if there are joined users that do not exist anymore archive will not work. I guess we need to snapshot the db or at least more than one table that is problematic
I love you man. You are so crystal clear. you are legend.
I've got many solution on database side from your video. Thanks for your support.
I really like listening to you while doing other stuff, like driving eating or walking. Always being productive and learning new things "sidely" by your videos. Habibi ❤
In json design, the followers list column can grow very large to millions and can cause speed issues there
Woah the json method literally blew my mind.
Fabulous content on your channel. Subscribed. Thanks !! Keep up the good work :)
I believe the last concept is actually called denormalization. Another option could be considering NoSQL.
By the way, you are awesome
Sometimes you cannot redesign the table or escape getting at billion rows tables. For example, the journal entry for real-time payment system gets very large in very few months. Moreover, you cannot resort to NoSQL databases since these data are ACID in nature in which eventual consistency is not applicable. I guess you will end up using all the aforementioned techniques like: Indexing (Obvious One), Partitioning and Sharding. One problem with Partitioning though is that most ORMs are not Partition aware. On the other hand, you have to be careful of new distributed and horizontally scalable databases like Cockroach DB, YugaByte,etc,. Because there is always a catch.
I did have a field day implementing partitioning on top of JPA and Postgres a few years back. Works like a charm, but requires a lot of low level fiddling to get it to work
if there is a huge table and we need to run a migration on it (structural changes), how partitioning can help to minimising the outage in production environment as table lock will not allow other changes to be applied on that time.?
Hi Hussein, thank you for your awesome videos!!! They're really intresting!
Simillo De Simo ❤️❤️ thanks 😊
In case of a twitter like app, I will go certainly go for a graph database like Neo4j and use another database to store indexes. Each user is a node and each followingX relation is an edge, It will limit the join operations
Bruce Wayne fantastic thoughts, Graph databases are designed for this use cases so they will be optimized for this.. that will be another way to avoid that large table .. I don’t know how graph databases work though so not sure of their scale and how do they store things
@@hnasr If I'm not wrong , they store the addresses in one file and the data in other ones , so if you find a node using indexes , you will find next to it the physicial address of each edge related to it , it's just a O(1) read operation, and so navigate in the graph by just simple read operations instead of complex joins.
And this is an article introducing sharding for neo4j
neo4j.com/whats-new-in-neo4j/scalability/
Thanks for your videos , I'm learning a lot of new things each day
Does the twitter example lend itself to Graph? The twitter database is basically three tables with single-field relations.The joins in the queries don't go deeper than one level. Graph is excellent for large...well graphs,where nodes are linked to nodes who are linked to nodes etc etc , but twitter doesn't d that.(unless you perhaps want to query all the users who follow all users who follow all users who follow allusers who follow Obama,but that's easily solved using recursion in SQLand really doesn't require graph.
DGraph has nicer query syntax than Neo4j imo
Thanks for the video . I did a first time RUclips 'applaud' feature with a small token of 100 Rupees . Hope you recieved it . Keep going 🙏🙏🙏
Dinakaran S thank you so much Dinakaran for the support!
if we partition data how do we run sort order select query efficiently since it will hit multiple partition
Truly a hidden gem
GPU based query is the way to go 💥
how is that done?
Awesome as always, hands on examples of sharding, partitioning and indexing will double awesome bro.
Glad it helped Thank you so much!
top notch content, i just encountered similar problem at my work place. thanks.
great work there Hussein. could you please shed some light on database migration strategies. Thanks.
Apart from great contents you inspires me to love and play with technology....you are awsm brother..
Thank you for your amazing content, I am really happy that I found your channel
Thanks Menna 🙏😊
Great video Hussein. Just wanted to share my thoughts as I have first hand experience with a table container over a half billion rows.
I don’t believe, a billion row table is that big of a deal for a modern RDBMS if the schema is well designed and proper indexes are in place (to support the common query patterns).
I have a database with over half billion rows which works flawless on a medium sized hardware. (Azure Sql database)
Indexes go a long way, then partition and then shard as you mentioned.
Also before making things complicated by adding shards etc, you can always ask yourself “is there data in the table that I can offload/archive/delete to some other table?” The answer in most cases would be yes. This should save your from a lot of additional complexity.
Rizwan, might i ask what database technology you are using?
I have been working on a similar table (well, multiple tables with around half a billion rows) and i have faced the performance issues. I have tried it on postgres as well as mysql. Indexes definitely make things better, however, in case of searches, even they are not enough. iterating the table sometimes took me days. I didn't design the structure, but have to maintain it. The only way left was sharding for me. If you hhave any other tricks and tips, please share!
@@user-bk5xo1gj7k Hello, I am using Azure SQL Database but I would think others (MySql, Postgresql etc) perform very similar.
For instance, here is the table I had that has:
213,461,562 Rows (almost a quarter million)
I ran a query on it by filtering on a column that is indexed (datetime) and selected top 100 rows and here are the results:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(100 rows affected)
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Completion time: 2023-01-24T08:34:59.6096640-06:00
As you can see, it was barely anything for the server.
A few things to look at:
1) Is your table indexed for the query you are running against it?
2) How much data is part of the query result?
This is why my original comment stated that you barely need sharding (partition should come before sharding) until maybe you get into multi billions and space on a single server becomes a problem). Sharding is no easy topic when it comes to properly implementing it.
Here is something that may be of help in your optimization scenario:
use-the-index-luke.com/
Best of luck and let me know if you have any questions.
@@TheRizwanSharp thanks a bunch for reply!
Can you please talk about different databases and their pros and cons. Lsm, btree, column oriented, rdbms etc
swapnil kale discussed this some what in my database engine course
Database Engines Crash Course (MyISAM, Aria, InnoDB, XtraDB, LevelDB & RocksDB)
ruclips.net/video/K9Qd3UMHUQ4/видео.html
Great content 👍👌
Is there a method of grouping the data by vector similarity and when querying, the subset that is queried is chosen by a vectorized term in the query?
can NoSQL store Billions of datas... ??? or
use SQL tables stuff... is good ?
or is their NoSQL + SQL DB like the row json
How can we efficiently search in the JSON data? can someone share some tips on this?
Great video,question onf twitter design: what happens when someone unfllows you?
Awesome content, thanks!!!
What if someone writes a bot that just do follow and unfollows? Will that overload the database?
It will of course, that is why you shouldn’t make an API that will easily allow anyone to do so. Or charge them so much to call that api
Thanks for the video man! Loved it!
In the redesigning the DB part of your video, taking the Profile example, where you have a column or a key which stores the list of followers, my doubt is will that not exceed the maximum limit of the row or a document and make it extremely heavy while retrieving one follower ? I think it's an open-ended discussion and totally depends on the use case.
I liked all your ideas. Although I was waiting if you could talk about data archival and purging. Something I tried was on archival of data to another database and purging it from the original database which may not be accessed frequently.
what to do when I want to overwrite 100 millions of rows into new table, in minutes?
df.write.mode("overwrite").saveAsTable("FINAL"), if you could please help with this?
Best Practices Working with Billion-row Tables in Databases
:
Don't Work with Billion-row Table :)
But....is there not a problem with consistency in case of sharding?
what about a hashmap type algo where you can be guided to where you need to go.
What is your thoughts on searching a varchar(1000) column in a billion rows table. Note , only one column and keyword search. Like %xyz% . I know like query will be supper slow. So what will be the alternates. How to solve
Help!! i am user of database so indexing is not granted to me and all the rows except id are of datatype text so i can't make partition with those. partitioning with Id is not an option because data stored randomly. i am junior developer so sharding is not option again. my requirement is to search user based on parameter like Country or job title or company or all of them. data is huge almost 1B+. so please give idea what i can do ?
I don't think you can do much if, any backend dev/DBA can help you with your queries? you need to be able to have the proper indexes on the table to get better performance
Will this also help in reducing the query time for JOIN? Can someone explain how join will work?
Regarding sharding: Isn't it better to put the database instances into a cluster and let the master take care of sharding so that the client will not need to know which db instance to query - it will only need to know the master? Im not a DB guru but i always thought that this is how it works???
Correct, Some database vendors provide such feature yes but not every DB has this feature and for the longest time the engineer has to built this themself. I believe Vitess solves this for mySQL. Cassandra also has consistent hashing which distribute the data across shards..
@@hnasr PostgreSQL can do it with Foreign-Data-Wrappers; essentially you create a partition for a table and stick it on a different node through FWD.
www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/
If you route all traffic for all shards through the root node that means you can still only go as fast as the root node, you're not leveraging the full capacity of each shard. Look at geaographical sharding where each shard holds the data that is used for a region, completely bypassing the 'root' node.
Shard - partition - index - parallel processing
your content is dope bro
Shashi Bhushan your dope thanks Shashi ❤️
@@hnasr Can you make a similar one for API design?
Thank you for the content it's helpful,
I think the problem with your design is that when you try to discover all the profiles you follow, you will end up in trouble.
HI, Thanks for the video.This gives me an insights of how to design a very large database. But can we do this partitioning and sharding on a relational database like MYSQL for example? if we do so I believe we can't leverage the full benefits of MYSQL IMO ,a data base with lots of partitioned table how JOINS will work then?
MySql has something even cooler - Merge Tables. Doesn't support putting tables on multiple drives, but there are a few ways to deal with that.
Great topic. Very helpful 👍🏾. Truthfully, I always go with the last option 😅 because most of my projects will never have a billion rows. Now I have others options too. Thanks.
Glad it was helpful Eddy! thanks
Wow great video. What about using both Spark+Hadoop in conjunction with the index-partitioning-sharding approach. Would such a strategy even makes sense?
Nathan Benton Works but so much infrastructure to support so the question becomes : is it worth it?
Kinda don’t like the idea of having JSON tbh because you can’t enforce foreign keys and you, in this example, couldn’t get a list of your followers.
That said, love your content. Thanks for the videos!
I'm currently having this problem right now with my current company. another way could be just archiving old data if they are not often accessed and creating a mechanism to load up the archive data whenever needed. for example transaction tables
Correct archiving is another remedy for this Tobi agree.
Will your last solution work for the user with millions followers ? I can not wrap my head around. That's a huge amount of data for a single document/column
Very Nice content 🥇
I would just ask: isn't sharding just a type of partitioning?
Sharding = multi-host-partitioning?
Keep up the good work.
Very interesting 👏👏
I'd like to see an example of how to handle sql joins when using shards. I'm guessing since you can't do joins using sharding, you'll have to pull everything in on the application side and glue everything together.
correct that is why I try to avoid sharding when possible
@@hnasr I guess that's a nightmare for reporting too.
Hussein has done a fenomenal job of impersonating Nigel Thornberry in the thumbnail picture.
Thanks for the great video. I'm learning a lot from this channel. Could a junction table be used in this case as well? For example: id(primary key), followed_id(foreign key), follower_id(foreign key) ...; this would mean a single added row each time the 'follow' button is pressed and, if needed, an inner join to see the relevant details of the followers
no i don't think that'd work. see, the issue is the number of rows in the end. junction table would be larger than the 2 tables that table is a junction between. the junction table could grow to multiple billions of rows and we would end up with massive table.
(hope i understood your question correctly)
@@user-bk5xo1gj7k yes, that's right. Thanks! I'm wondering now whether there are any other approaches to this problem, apart from those already mentioned. Maybe as far as the DB schema is concerned.
Hi Hussein, Love your videos. Can you please create a video on creating index on production tables without having a lock on the table.Maybe you can talk about Online DDL, percona , memory consumption and write latency impact.
I actually discussed this on my indexing video. You use create index concurrently on postgres ruclips.net/video/-qNSXK7s7_w/видео.html
wait saving data like that OK?
I had the same mindset but I thought It was wrong because no one did that around me.
thank you, amazing talk!
One thing you could add id vertical partition. Otherwise, top content.
You are amazing man!
Effective content
Nice video!
What about writing to a big postgres table. let's say, we have to do 1000 writes per second on a particular table.
Sounds like describing Cassandra and MongoDB =D. Cheers
I am planning to make steak of daily watching your video(s). It's really amazing content 😁✌️ Great work. Let's see how much consistent I be in this activity 🎉
#🔥 (1)
I keep hearing sharting
Hussain, how does a day in your life as an Engineer span out?. How do you manage to get time for your youtube videos?.
This is a typical job interview question
I would like to ask you just one thing, how did you gather so much knowledge? As a software engineer myself, I can assure you this isn't something you get from work experience. And I'm guessing it was books? If so please do share your recommended reading list
Json column will make the database very slow. Bad idea
What if there are millions of followers?? Would that create an issue for that column??
Some people like me don't design the table in advance rather they look for a solution once there are 2 billion rows already.
You had great ideas until the 10 minute mark. What happens a follower of 3 or n people leaves or changes their name. How will you find the follower unless you scan every single user? Stick to a real relational design and this will never be an issue.
Can you give any reference to what design are you talking about ?
What is the real relational version of it ?
your content is great but @10:20 u created list of users(mongodb/json) and that can grow infinite but also there is a limit of mongodb that 1 document cant exceed 16mb size.
Rahul Goyal ohhh I didn’t know Mongo has that limit.. I wouldn’t say infinite though, Tyler swift had 50 million followers which is still manageable but still you are right .. someone mentioned graph database which can also be useful
Thanks for sharing thoughts
We could maintain a separate table, where each record being (user,list of follower Ids) which is queried only when required. You could break the total number of followers into multiple records. Something like paging from OS. Maybe userid+record_number as your primary key?
Yes we can..@omkar
@@Iamhomkar Technicaly yes, but if the database cannot hold the amount of data you need it to hold then you should move to a different database that can.
With nosql no problems, just an JSON array of IDs
hmmmm I think there must be a limit so we need to think scalability
Can't we use something like a Time Series DB, Elastic or Postgress or Cassandra, they are famous for being able handle heavy loads of data. Btw how the Bitcoin DB works ? There are millions of transactions
But why are billion-row tables slow? Why having nice indexes is enough for a couple million rows, but it’s not enough for a > 1 billion rows?
Because the index grows as well and it eventually becomes slow to traverse it.
putting followers in one column violates 1NF i.e., atomic values.
assuming there is a field type called "list" than it shouldn't
❤❤
you missed: one more such video.
awsome
No one talks about scaling publicly like you do, and you do it for free.
So I can compress the whole video content into a single quote: "Move to no sql database" xD
𝘨𝘳𝘦𝘢𝘵 𝘪𝘥𝘦𝘢 𝘵𝘰 𝘴𝘦𝘵 𝘢 𝘵𝘦𝘹𝘵 𝘤𝘰𝘭𝘶𝘮𝘯 𝘸𝘪𝘵𝘩 𝘢 .𝘑𝘚𝘖𝘕 .
𝕋𝕐𝕍𝕄 𝟜𝕥𝕙𝕚𝕤 𝕜𝕚𝕔𝕜𝕒𝕤𝕤 𝕔𝕠𝕟𝕥𝕖𝕟𝕥 𝕊𝕚𝕣.
Try to make smaller videos
Can you share your views on fauna db ?
I've got many solution on database side from your video. Thanks for your support.