Best Practices Working with Billion-row Tables in Databases

Поделиться
HTML-код
  • Опубликовано: 19 июн 2024
  • In this video, I discuss 3 methods to work with tables of billion rows. This discussion is inspired by a comment on my RUclips video on Twitter System Design right here • Twitter System Design ...
    Chapters
    Intro 0:00
    1. Brute Force Distributed Processing 2:30
    2. Working with a Subset of table 3:35
    2.1 Indexing 3:55
    2.2 Partitioning 5:30
    2.3 Sharding 7:30
    3. Avoid it all together (reshuffle the whole design) 9:10
    Summary 11:30
    🎙️Listen to the Backend Engineering Podcast
    husseinnasser.com/podcast
    🏭 Backend Engineering Videos
    • Backend Engineering (B...
    💾 Database Engineering Videos
    • Database Engineering
    🏰 Load Balancing and Proxies Videos
    • Proxies
    🏛️ Software Archtiecture Videos
    • Software Architecture
    📩 Messaging Systems
    • Message Queues & PubSu...
    Become a Member
    / @hnasr
    Support me on PayPal
    bit.ly/33ENps4
    Stay Awesome,
    Hussein
  • НаукаНаука

Комментарии • 145

  • @elultimopujilense
    @elultimopujilense 3 года назад +196

    The only thing about this channel that makes me feel awful is that i didnt discovered it earlier. The topics discussed here are not really common, but extremely important in my opinion. I havent found anything that gets near this channel, some topics are so high level that I havent heard of them at all. Makes me feel like I know nothing, which is the best feeling ever. There is no much to learn! Thank you so much!

    • @oscarbarajas3610
      @oscarbarajas3610 3 года назад +6

      I totally agree with you. This guy makes my time spent worthful.

    • @saurabhsharma7123
      @saurabhsharma7123 3 года назад +6

      Exactly! RUclips hasn't reported me any similars yet😂

  • @zacharythatcher7328
    @zacharythatcher7328 2 года назад +17

    This idea of transitioning the delay to the writer AND then using queues for writes is true architecting foresight. I love it.

  • @achraf3117
    @achraf3117 3 года назад +41

    I'm a junior Software Engineer and working my way through to specialize with Backend. I always learn new concepts from your videos. This one in particular hits home for me because I've worked on couple of projects were I had to make the design choice of the database. I found it quite difficult to make the right choices because I always end up building a search endpoint with full text search and other search parameters.
    To put it into context, the databases I had problems with were filled with recipes. I had multiple tables with 100k+ entries filled only with IDs (the problem you mentioned). After two projects I think I switched to the latest idea you mentioned with the list/json column and that one worked the best for me. Because not only it avoids searching through a big table but also saves me an extra query to another table.
    This is kind of irrelevant the this video but when implementing full text search I think it's better to go with postgresql rather than mysql since it supports gin and gist indexes and fuzzy searching that can help build a nice, affordable and quick solution to meduim sized databases.
    Keep doing the nice work.

    • @hnasr
      @hnasr  3 года назад +5

      Thanks for sharing your thoughts Ashraf appreciate it. this is useful

    • @0161rt
      @0161rt 3 года назад

      Isn't that best text search db is Elasticsearch??

    • @tikz.-3738
      @tikz.-3738 3 года назад

      It's kind of related to data structure instead of one to one ur having one to many relation per row

  • @andreivilla9009
    @andreivilla9009 3 года назад +4

    Man your content is GOLD. I come from a front-end stack and was underestimating the work with databases. But your content has helped in understanding the pitfalls of backend engineering.

  • @t529615
    @t529615 3 года назад +8

    Yet another great video! Truly educational, even for someone who have been in the game for over 15 years. 👏

  • @PiyushChauhan2011
    @PiyushChauhan2011 3 года назад

    I’m front end developer acquired backend skills, all because of good content from this channel. Amazing content and energy, thank you 😊

  • @Bilo_7
    @Bilo_7 25 дней назад

    I really like listening to you while doing other stuff, like driving eating or walking. Always being productive and learning new things "sidely" by your videos. Habibi ❤

  • @OmarBravotube
    @OmarBravotube 3 года назад +1

    Couldn't have explained it simpler! Big ups dude!!

  • @joaopedromoreiradeoliveira4273
    @joaopedromoreiradeoliveira4273 3 года назад +1

    You help not only software engineers with your channel, but also data analysts like me that works a lot with data engineering. Thanks! Greetings from Brazil.

    • @hnasr
      @hnasr  3 года назад

      Love to all subs from beautiful Brazil 🇧🇷

  • @4am4i
    @4am4i 3 года назад +1

    Good video. The last trick in the video is called denormalization. Also as soon as you introduce sharding you need also add replication because the probability of failures increases.

  • @rajendiranvenkat5992
    @rajendiranvenkat5992 3 года назад

    I've got many solution on database side from your video. Thanks for your support.

  • @sachinkalkur
    @sachinkalkur 3 года назад

    top notch content, i just encountered similar problem at my work place. thanks.

  • @elmehdisaniss2731
    @elmehdisaniss2731 3 года назад +24

    Your content is top bro.

  • @p4s153
    @p4s153 3 года назад

    Apart from great contents you inspires me to love and play with technology....you are awsm brother..

  • @simonedeluca104
    @simonedeluca104 3 года назад +1

    Hi Hussein, thank you for your awesome videos!!! They're really intresting!

    • @hnasr
      @hnasr  3 года назад

      Simillo De Simo ❤️❤️ thanks 😊

  • @nilanjansarkar100
    @nilanjansarkar100 3 года назад +27

    generally, it's helpful to think in terms of read and write paths of your data. On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists or not to reduce searching the B-Tree or other persistent data structures

    • @hnasr
      @hnasr  3 года назад +9

      nilanjan sarkar nice idea Nilanjan, Completely forgot about bloom filters probabilistic data structure..
      For anyone interested check out this video I discussed it before Bloom Filters Explained by Example
      ruclips.net/video/gBygn3cVP80/видео.html

    • @vinny142
      @vinny142 3 года назад +7

      " On the read side, on top of partitioning you can add bloom filters to quicky test where a value exists"
      That would mean scanning the bloom for every partition and the results you get will contain false-positives meaning you will still scan indexesof many partitions that wil not contain any matches. It's better than doing an index scan on all partitions, but I get the impression that we're not talking about the same thing when we say "partitioning'.
      The point of partitioning is that you divide rows up by the value of fields so you can predict the partitions-to-search by looking at the query. If I partition a phonebook by the first letter of the last name then querying for "Johnes" will only use the 'J' partition and the btree of that table wil be 1/26th of the original table (roughly) and so scan much faster. I don't have to pre-filter to find out that I need partition 'J', the database knows that from the query.

    • @nilanjansarkar100
      @nilanjansarkar100 3 года назад

      @@vinny142 Yes. This is more from an LSM tree perspective where an entry in a partition can reside both in memory and on disk via multiple versions

  • @videosforthegoodlife2253
    @videosforthegoodlife2253 Год назад +2

    I had to deal with tables with a few billion records per month, and MySql merge engine (merge tables) let us slice and dice them any way we needed. You can even have more specific indexes in the real tables themselves, as long as each merge index exists in all tables.
    The downside of Merge tables is that it multiplies the open file handles on the system, which can be tricky for a machine doing public networking, but with latest kernels, you can get kind of crazy.
    High memory makes a huge difference, of course.

  • @insearchof9090
    @insearchof9090 21 день назад

    I love you man. You are so crystal clear. you are legend.

  • @lucasbastos8019
    @lucasbastos8019 3 года назад +1

    Awesome content, thanks!!!

  • @mennahamedo8204
    @mennahamedo8204 3 года назад

    Thank you for your amazing content, I am really happy that I found your channel

    • @hnasr
      @hnasr  3 года назад

      Thanks Menna 🙏😊

  • @sany2k8
    @sany2k8 3 года назад

    Awesome as always, hands on examples of sharding, partitioning and indexing will double awesome bro.

    • @hnasr
      @hnasr  3 года назад +1

      Glad it helped Thank you so much!

  • @rashmidewangan8627
    @rashmidewangan8627 3 года назад +1

    Fabulous content on your channel. Subscribed. Thanks !! Keep up the good work :)

  • @natem889
    @natem889 3 года назад

    I liked all your ideas. Although I was waiting if you could talk about data archival and purging. Something I tried was on archival of data to another database and purging it from the original database which may not be accessed frequently.

  • @alichoobdar9030
    @alichoobdar9030 Год назад +2

    I believe the last concept is actually called denormalization. Another option could be considering NoSQL.
    By the way, you are awesome

  • @Sarwaan001
    @Sarwaan001 3 года назад +1

    Woah the json method literally blew my mind.

  • @EddyCaffrey
    @EddyCaffrey 3 года назад

    Great topic. Very helpful 👍🏾. Truthfully, I always go with the last option 😅 because most of my projects will never have a billion rows. Now I have others options too. Thanks.

    • @hnasr
      @hnasr  3 года назад

      Glad it was helpful Eddy! thanks

  • @arianseyedi
    @arianseyedi 3 года назад +2

    Maybe I didn't understand the second last section about eliminating the need to update both ends of a connection, but your solution will crumble when person A who is following person B closes their account since the information on who follows person B is only in the person's B's records. So when person A exits the medium, we won't know who to update.

  • @serhiihorun6298
    @serhiihorun6298 3 года назад

    You are amazing man!

  • @zhaoxueli315
    @zhaoxueli315 3 года назад

    Truly a hidden gem

  • @JoaoPinela
    @JoaoPinela 3 года назад

    Very Nice content 🥇
    I would just ask: isn't sharding just a type of partitioning?
    Sharding = multi-host-partitioning?
    Keep up the good work.
    Very interesting 👏👏

  • @aX4UTL1O9
    @aX4UTL1O9 3 года назад +7

    2 extra ideas:
    1. table archiving: most large tables are caused by timeseries records -- just archive the old records in separate tables and keep the live table small
    2. use modern databases that are more scalable than traditional single-host databases: cockroachdb, spanner, aurora, tidb, fauna, etc.

    • @hnasr
      @hnasr  3 года назад

      Great idea if its time series and you can afford to archive do it..

    • @vinny142
      @vinny142 3 года назад +1

      I'm a big fan of archiving, just make sure that you don't shift the problem from the "live" table to the archive table, if you keep archiving to the same table for a year then that table will become impossibly huge and essentially unusable and you end updeleting it. A simple cronjob can create a new monthly table or something.
      Partitioning is in effect a type of archiving but with a transparent api so you can just run a query and the database will choose the partitions by itself.
      As for more scalable database engines; that would depend on wether those can still deliver the functionality you need. For example; any app dealing with money cannot live without transactions and referental integrity because one mistake can cost millions.
      Also; is't not exactly true that traditional databases are single-host. Replication is easy to setup to get more nodes involved and sharding is realtively easy to built into an app or a proxy.

    • @filippetrovic2596
      @filippetrovic2596 3 месяца назад

      Are we archiving entire db or just a table? How do we join between archived and non archived dbs? Because if there are joined users that do not exist anymore archive will not work. I guess we need to snapshot the db or at least more than one table that is problematic

  • @dinakaranonline
    @dinakaranonline 3 года назад +3

    Thanks for the video . I did a first time RUclips 'applaud' feature with a small token of 100 Rupees . Hope you recieved it . Keep going 🙏🙏🙏

    • @hnasr
      @hnasr  3 года назад

      Dinakaran S thank you so much Dinakaran for the support!

  • @GurpreetSingh-pl8uz
    @GurpreetSingh-pl8uz 3 года назад +2

    In json design, the followers list column can grow very large to millions and can cause speed issues there

  • @treyheaney9483
    @treyheaney9483 3 года назад

    Is there a method of grouping the data by vector similarity and when querying, the subset that is queried is chosen by a vectorized term in the query?

  • @salehdanbous6783
    @salehdanbous6783 Год назад +4

    Sometimes you cannot redesign the table or escape getting at billion rows tables. For example, the journal entry for real-time payment system gets very large in very few months. Moreover, you cannot resort to NoSQL databases since these data are ACID in nature in which eventual consistency is not applicable. I guess you will end up using all the aforementioned techniques like: Indexing (Obvious One), Partitioning and Sharding. One problem with Partitioning though is that most ORMs are not Partition aware. On the other hand, you have to be careful of new distributed and horizontally scalable databases like Cockroach DB, YugaByte,etc,. Because there is always a catch.

    • @vilkazz
      @vilkazz 10 месяцев назад

      I did have a field day implementing partitioning on top of JPA and Postgres a few years back. Works like a charm, but requires a lot of low level fiddling to get it to work

  • @positivity1016
    @positivity1016 3 года назад +1

    Great content 👍👌

  • @ghassanaljaafri9368
    @ghassanaljaafri9368 7 месяцев назад

    Thank you for the content it's helpful,
    I think the problem with your design is that when you try to discover all the profiles you follow, you will end up in trouble.

  • @nathanbenton2051
    @nathanbenton2051 3 года назад

    Wow great video. What about using both Spark+Hadoop in conjunction with the index-partitioning-sharding approach. Would such a strategy even makes sense?

    • @hnasr
      @hnasr  3 года назад

      Nathan Benton Works but so much infrastructure to support so the question becomes : is it worth it?

  • @sureshchaudhari4465
    @sureshchaudhari4465 6 месяцев назад

    if we partition data how do we run sort order select query efficiently since it will hit multiple partition

  • @woltron4o
    @woltron4o 3 года назад

    Nice video!

  • @shubhamchopra5518
    @shubhamchopra5518 3 года назад

    Hi Hussein, Love your videos. Can you please create a video on creating index on production tables without having a lock on the table.Maybe you can talk about Online DDL, percona , memory consumption and write latency impact.

    • @hnasr
      @hnasr  3 года назад

      I actually discussed this on my indexing video. You use create index concurrently on postgres ruclips.net/video/-qNSXK7s7_w/видео.html

  • @andreigatej6704
    @andreigatej6704 3 года назад

    Thanks for the great video. I'm learning a lot from this channel. Could a junction table be used in this case as well? For example: id(primary key), followed_id(foreign key), follower_id(foreign key) ...; this would mean a single added row each time the 'follow' button is pressed and, if needed, an inner join to see the relevant details of the followers

    • @user-bk5xo1gj7k
      @user-bk5xo1gj7k Год назад +1

      no i don't think that'd work. see, the issue is the number of rows in the end. junction table would be larger than the 2 tables that table is a junction between. the junction table could grow to multiple billions of rows and we would end up with massive table.
      (hope i understood your question correctly)

    • @andreigatej6704
      @andreigatej6704 Год назад

      @@user-bk5xo1gj7k yes, that's right. Thanks! I'm wondering now whether there are any other approaches to this problem, apart from those already mentioned. Maybe as far as the DB schema is concerned.

  • @krramesh5
    @krramesh5 3 года назад

    Will this also help in reducing the query time for JOIN? Can someone explain how join will work?

  • @brucewayne2480
    @brucewayne2480 3 года назад +3

    In case of a twitter like app, I will go certainly go for a graph database like Neo4j and use another database to store indexes. Each user is a node and each followingX relation is an edge, It will limit the join operations

    • @hnasr
      @hnasr  3 года назад +2

      Bruce Wayne fantastic thoughts, Graph databases are designed for this use cases so they will be optimized for this.. that will be another way to avoid that large table .. I don’t know how graph databases work though so not sure of their scale and how do they store things

    • @brucewayne2480
      @brucewayne2480 3 года назад

      @@hnasr If I'm not wrong , they store the addresses in one file and the data in other ones , so if you find a node using indexes , you will find next to it the physicial address of each edge related to it , it's just a O(1) read operation, and so navigate in the graph by just simple read operations instead of complex joins.
      And this is an article introducing sharding for neo4j
      neo4j.com/whats-new-in-neo4j/scalability/
      Thanks for your videos , I'm learning a lot of new things each day

    • @vinny142
      @vinny142 3 года назад +2

      Does the twitter example lend itself to Graph? The twitter database is basically three tables with single-field relations.The joins in the queries don't go deeper than one level. Graph is excellent for large...well graphs,where nodes are linked to nodes who are linked to nodes etc etc , but twitter doesn't d that.(unless you perhaps want to query all the users who follow all users who follow all users who follow allusers who follow Obama,but that's easily solved using recursion in SQLand really doesn't require graph.

    • @MartinsTalbergs
      @MartinsTalbergs 3 года назад

      DGraph has nicer query syntax than Neo4j imo

  • @TheRizwanSharp
    @TheRizwanSharp Год назад

    Great video Hussein. Just wanted to share my thoughts as I have first hand experience with a table container over a half billion rows.
    I don’t believe, a billion row table is that big of a deal for a modern RDBMS if the schema is well designed and proper indexes are in place (to support the common query patterns).
    I have a database with over half billion rows which works flawless on a medium sized hardware. (Azure Sql database)
    Indexes go a long way, then partition and then shard as you mentioned.
    Also before making things complicated by adding shards etc, you can always ask yourself “is there data in the table that I can offload/archive/delete to some other table?” The answer in most cases would be yes. This should save your from a lot of additional complexity.

    • @user-bk5xo1gj7k
      @user-bk5xo1gj7k Год назад +1

      Rizwan, might i ask what database technology you are using?
      I have been working on a similar table (well, multiple tables with around half a billion rows) and i have faced the performance issues. I have tried it on postgres as well as mysql. Indexes definitely make things better, however, in case of searches, even they are not enough. iterating the table sometimes took me days. I didn't design the structure, but have to maintain it. The only way left was sharding for me. If you hhave any other tricks and tips, please share!

    • @TheRizwanSharp
      @TheRizwanSharp Год назад +1

      @@user-bk5xo1gj7k Hello, I am using Azure SQL Database but I would think others (MySql, Postgresql etc) perform very similar.
      For instance, here is the table I had that has:
      213,461,562 Rows (almost a quarter million)
      I ran a query on it by filtering on a column that is indexed (datetime) and selected top 100 rows and here are the results:
      SQL Server parse and compile time:
      CPU time = 0 ms, elapsed time = 0 ms.
      (100 rows affected)
      SQL Server Execution Times:
      CPU time = 0 ms, elapsed time = 0 ms.
      Completion time: 2023-01-24T08:34:59.6096640-06:00
      As you can see, it was barely anything for the server.
      A few things to look at:
      1) Is your table indexed for the query you are running against it?
      2) How much data is part of the query result?
      This is why my original comment stated that you barely need sharding (partition should come before sharding) until maybe you get into multi billions and space on a single server becomes a problem). Sharding is no easy topic when it comes to properly implementing it.
      Here is something that may be of help in your optimization scenario:
      use-the-index-luke.com/
      Best of luck and let me know if you have any questions.

    • @user-bk5xo1gj7k
      @user-bk5xo1gj7k Год назад +1

      @@TheRizwanSharp thanks a bunch for reply!

  • @auroshisray9140
    @auroshisray9140 3 года назад

    Effective content

  • @mmu200
    @mmu200 3 года назад

    What is your thoughts on searching a varchar(1000) column in a billion rows table. Note , only one column and keyword search. Like %xyz% . I know like query will be supper slow. So what will be the alternates. How to solve

  • @sojuthomas7727
    @sojuthomas7727 3 года назад

    HI, Thanks for the video.This gives me an insights of how to design a very large database. But can we do this partitioning and sharding on a relational database like MYSQL for example? if we do so I believe we can't leverage the full benefits of MYSQL IMO ,a data base with lots of partitioned table how JOINS will work then?

    • @videosforthegoodlife2253
      @videosforthegoodlife2253 Год назад

      MySql has something even cooler - Merge Tables. Doesn't support putting tables on multiple drives, but there are a few ways to deal with that.

  • @_SoundByte_
    @_SoundByte_ 3 года назад +1

    Can you please talk about different databases and their pros and cons. Lsm, btree, column oriented, rdbms etc

    • @hnasr
      @hnasr  3 года назад

      swapnil kale discussed this some what in my database engine course
      Database Engines Crash Course (MyISAM, Aria, InnoDB, XtraDB, LevelDB & RocksDB)
      ruclips.net/video/K9Qd3UMHUQ4/видео.html

  • @flygonfiasco9751
    @flygonfiasco9751 3 года назад +1

    Kinda don’t like the idea of having JSON tbh because you can’t enforce foreign keys and you, in this example, couldn’t get a list of your followers.
    That said, love your content. Thanks for the videos!

  • @sarveshsawant7232
    @sarveshsawant7232 7 месяцев назад

    Great video,question onf twitter design: what happens when someone unfllows you?

  • @mayankpandey8271
    @mayankpandey8271 3 года назад

    One thing you could add id vertical partition. Otherwise, top content.

  • @gmggx8944
    @gmggx8944 3 года назад +1

    can NoSQL store Billions of datas... ??? or
    use SQL tables stuff... is good ?
    or is their NoSQL + SQL DB like the row json

  • @IdreesDargahwala47
    @IdreesDargahwala47 3 года назад +3

    Thanks for the video man! Loved it!
    In the redesigning the DB part of your video, taking the Profile example, where you have a column or a key which stores the list of followers, my doubt is will that not exceed the maximum limit of the row or a document and make it extremely heavy while retrieving one follower ? I think it's an open-ended discussion and totally depends on the use case.

  • @ryanisthewind
    @ryanisthewind 7 месяцев назад

    wait saving data like that OK?
    I had the same mindset but I thought It was wrong because no one did that around me.
    thank you, amazing talk!

  • @tobiowolawi5044
    @tobiowolawi5044 3 года назад

    I'm currently having this problem right now with my current company. another way could be just archiving old data if they are not often accessed and creating a mechanism to load up the archive data whenever needed. for example transaction tables

    • @hnasr
      @hnasr  3 года назад +1

      Correct archiving is another remedy for this Tobi agree.

  • @toxiclife
    @toxiclife Год назад

    what to do when I want to overwrite 100 millions of rows into new table, in minutes?
    df.write.mode("overwrite").saveAsTable("FINAL"), if you could please help with this?

  • @shei69
    @shei69 3 года назад

    Can you share your views on fauna db ?

  • @AchwaqKhalid
    @AchwaqKhalid 3 года назад +4

    GPU based query is the way to go 💥

    • @minscj
      @minscj 3 года назад

      how is that done?

  • @shashibhushan3289
    @shashibhushan3289 3 года назад +1

    your content is dope bro

    • @hnasr
      @hnasr  3 года назад

      Shashi Bhushan your dope thanks Shashi ❤️

    • @happyandinformedlife1212
      @happyandinformedlife1212 3 года назад

      @@hnasr Can you make a similar one for API design?

  • @decipheringthematrix56
    @decipheringthematrix56 5 месяцев назад

    Sounds like describing Cassandra and MongoDB =D. Cheers

  • @maikalkumar4382
    @maikalkumar4382 3 года назад

    Shard - partition - index - parallel processing

  • @javaprogramming7235
    @javaprogramming7235 4 месяца назад

    I am planning to make steak of daily watching your video(s). It's really amazing content 😁✌️ Great work. Let's see how much consistent I be in this activity 🎉
    #🔥 (1)

  • @CrzyGazara
    @CrzyGazara 3 года назад +10

    Best Practices Working with Billion-row Tables in Databases
    :
    Don't Work with Billion-row Table :)

  • @TechPrinc
    @TechPrinc 3 года назад

    Help!! i am user of database so indexing is not granted to me and all the rows except id are of datatype text so i can't make partition with those. partitioning with Id is not an option because data stored randomly. i am junior developer so sharding is not option again. my requirement is to search user based on parameter like Country or job title or company or all of them. data is huge almost 1B+. so please give idea what i can do ?

    • @hnasr
      @hnasr  3 года назад

      I don't think you can do much if, any backend dev/DBA can help you with your queries? you need to be able to have the proper indexes on the table to get better performance

  • @rashadarbab2769
    @rashadarbab2769 3 года назад

    what about a hashmap type algo where you can be guided to where you need to go.

  • @santoshbhatnagar2155
    @santoshbhatnagar2155 10 месяцев назад

    How can we efficiently search in the JSON data? can someone share some tips on this?

  • @muhammadkhan-qd-c5996
    @muhammadkhan-qd-c5996 3 года назад

    awsome

  • @tmanley1985
    @tmanley1985 3 года назад

    I'd like to see an example of how to handle sql joins when using shards. I'm guessing since you can't do joins using sharding, you'll have to pull everything in on the application side and glue everything together.

    • @hnasr
      @hnasr  3 года назад +1

      correct that is why I try to avoid sharding when possible

    • @tmanley1985
      @tmanley1985 3 года назад

      @@hnasr I guess that's a nightmare for reporting too.

  • @112BALAGE112
    @112BALAGE112 3 года назад

    Hussein has done a fenomenal job of impersonating Nigel Thornberry in the thumbnail picture.

  • @pavankumardittakavi5325
    @pavankumardittakavi5325 3 года назад

    Hussain, how does a day in your life as an Engineer span out?. How do you manage to get time for your youtube videos?.

  • @enlgn7050
    @enlgn7050 3 года назад

    But....is there not a problem with consistency in case of sharding?

  • @user-nw3rz4yp4u
    @user-nw3rz4yp4u 11 месяцев назад

    Will your last solution work for the user with millions followers ? I can not wrap my head around. That's a huge amount of data for a single document/column

  • @Flankymanga
    @Flankymanga 3 года назад

    Regarding sharding: Isn't it better to put the database instances into a cluster and let the master take care of sharding so that the client will not need to know which db instance to query - it will only need to know the master? Im not a DB guru but i always thought that this is how it works???

    • @hnasr
      @hnasr  3 года назад

      Correct, Some database vendors provide such feature yes but not every DB has this feature and for the longest time the engineer has to built this themself. I believe Vitess solves this for mySQL. Cassandra also has consistent hashing which distribute the data across shards..

    • @vinny142
      @vinny142 3 года назад

      @@hnasr PostgreSQL can do it with Foreign-Data-Wrappers; essentially you create a partition for a table and stick it on a different node through FWD.
      www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/
      If you route all traffic for all shards through the root node that means you can still only go as fast as the root node, you're not leveraging the full capacity of each shard. Look at geaographical sharding where each shard holds the data that is used for a region, completely bypassing the 'root' node.

  • @echoky3189
    @echoky3189 3 года назад

    What if someone writes a bot that just do follow and unfollows? Will that overload the database?

    • @hnasr
      @hnasr  3 года назад

      It will of course, that is why you shouldn’t make an API that will easily allow anyone to do so. Or charge them so much to call that api

  • @tanzimchowdhury320
    @tanzimchowdhury320 Год назад

    I would like to ask you just one thing, how did you gather so much knowledge? As a software engineer myself, I can assure you this isn't something you get from work experience. And I'm guessing it was books? If so please do share your recommended reading list

  • @rabbyhossain6150
    @rabbyhossain6150 Год назад

    What about writing to a big postgres table. let's say, we have to do 1000 writes per second on a particular table.

  • @nagrajankola1323
    @nagrajankola1323 8 месяцев назад

    ❤❤

  • @claushellsing
    @claushellsing 3 года назад

    Can't we use something like a Time Series DB, Elastic or Postgress or Cassandra, they are famous for being able handle heavy loads of data. Btw how the Bitcoin DB works ? There are millions of transactions

  • @m.m.4589
    @m.m.4589 3 года назад

    With nosql no problems, just an JSON array of IDs

    • @hnasr
      @hnasr  3 года назад

      hmmmm I think there must be a limit so we need to think scalability

  • @GistTech
    @GistTech 3 года назад

    This is a typical job interview question

  • @rahulgoyal3780
    @rahulgoyal3780 3 года назад

    your content is great but @10:20 u created list of users(mongodb/json) and that can grow infinite but also there is a limit of mongodb that 1 document cant exceed 16mb size.

    • @hnasr
      @hnasr  3 года назад

      Rahul Goyal ohhh I didn’t know Mongo has that limit.. I wouldn’t say infinite though, Tyler swift had 50 million followers which is still manageable but still you are right .. someone mentioned graph database which can also be useful
      Thanks for sharing thoughts

    • @Iamhomkar
      @Iamhomkar 3 года назад +1

      We could maintain a separate table, where each record being (user,list of follower Ids) which is queried only when required. You could break the total number of followers into multiple records. Something like paging from OS. Maybe userid+record_number as your primary key?

    • @rahulgoyal3780
      @rahulgoyal3780 3 года назад

      Yes we can..@omkar

    • @vinny142
      @vinny142 3 года назад

      @@Iamhomkar Technicaly yes, but if the database cannot hold the amount of data you need it to hold then you should move to a different database that can.

  • @rahulmasurkar6971
    @rahulmasurkar6971 3 года назад

    What if there are millions of followers?? Would that create an issue for that column??

  • @nixietubes
    @nixietubes Год назад +1

    I keep hearing sharting

  • @sohahm
    @sohahm 3 года назад

    Some people like me don't design the table in advance rather they look for a solution once there are 2 billion rows already.

  • @a.yashwanth
    @a.yashwanth 3 года назад +1

    putting followers in one column violates 1NF i.e., atomic values.

    • @hnasr
      @hnasr  3 года назад +1

      assuming there is a field type called "list" than it shouldn't

  • @hurleyd9828
    @hurleyd9828 3 года назад

    No one talks about scaling publicly like you do, and you do it for free.

  • @mikinyaa
    @mikinyaa 3 года назад

  • @kamillatosinski3054
    @kamillatosinski3054 3 года назад

    But why are billion-row tables slow? Why having nice indexes is enough for a couple million rows, but it’s not enough for a > 1 billion rows?

    • @hnasr
      @hnasr  3 года назад +1

      Because the index grows as well and it eventually becomes slow to traverse it.

  • @insearchof9090
    @insearchof9090 21 день назад

    you missed: one more such video.

  • @alooooooola
    @alooooooola 3 года назад

    So I can compress the whole video content into a single quote: "Move to no sql database" xD

  • @benmoskovits5816
    @benmoskovits5816 11 месяцев назад +2

    You had great ideas until the 10 minute mark. What happens a follower of 3 or n people leaves or changes their name. How will you find the follower unless you scan every single user? Stick to a real relational design and this will never be an issue.

    • @avinashthakur80
      @avinashthakur80 3 месяца назад

      Can you give any reference to what design are you talking about ?
      What is the real relational version of it ?

  • @auseralastname8466
    @auseralastname8466 3 года назад

    𝘨𝘳𝘦𝘢𝘵 𝘪𝘥𝘦𝘢 𝘵𝘰 𝘴𝘦𝘵 𝘢 𝘵𝘦𝘹𝘵 𝘤𝘰𝘭𝘶𝘮𝘯 𝘸𝘪𝘵𝘩 𝘢 .𝘑𝘚𝘖𝘕 .
    𝕋𝕐𝕍𝕄 𝟜𝕥𝕙𝕚𝕤 𝕜𝕚𝕔𝕜𝕒𝕤𝕤 𝕔𝕠𝕟𝕥𝕖𝕟𝕥 𝕊𝕚𝕣.

  • @0xfaizan
    @0xfaizan 3 года назад

    Try to make smaller videos

  • @rajendiranvenkat5992
    @rajendiranvenkat5992 3 года назад

    I've got many solution on database side from your video. Thanks for your support.