System Design Interview - Top K Problem (Heavy Hitters)

  • Published: 28 Sep 2024
  • Please check out my other video courses here: www.systemdesi...
    Topics mentioned in the video:
    - Stream and batch processing data pipelines.
    - Count-min sketch data structure.
    - MapReduce paradigm.
    - Various applications of the top k problem solution (Google/Twitter/YouTube trends, popular products, volatile stocks, DDoS attack prevention).
    Merge N sorted lists problem: leetcode.com/p...
    Inspired by the following interview questions:
    Amazon (www.careercup....)
    Facebook (www.careercup....)
    Google (www.careercup....)
    LinkedIn (www.careercup....)
    Twitter (www.careercup....)
    Yahoo (www.careercup....)

Comments • 648

  • @souravmojumder5563
    @souravmojumder5563 5 years ago +83

    This is one of the best system design videos I have come across in a long time... keep up the good work!

  • @balajipattabhiraman
    @balajipattabhiraman 2 years ago +9

    As luck would have it, I had a similar question in a make-or-break round at Google, and I nailed it since I had watched this several times before the interview. Got an L6 role offer at Google. Thanks for making my dream come true.

  • @NikhilSharad
    @NikhilSharad 3 years ago +18

    You're amazing, by far the most detailed and deeply analysed solution I've seen on any design channel. Please never stop making videos.

  • @DickWu1111
    @DickWu1111 3 years ago +453

    Jesus christ this guy's material is amazing... and each video is so compact. He basically never wastes a single word....

    • @antonfeng1434
      @antonfeng1434 2 years ago +20

      I have to pause or rewind constantly, and watch every video twice to digest it.

    • @amitdubey9201
      @amitdubey9201 2 years ago

      @@antonfeng1434 me too

    • @xordux7
      @xordux7 2 years ago

      @@antonfeng1434 Same here

    • @jasonl412
      @jasonl412 5 months ago +1

      @@xordux7 Same here

  • @itdepends5906
    @itdepends5906 2 years ago +2

    THIS GUY is SO COOL. Who else feels that when he's speaking, explaining difficult concepts in the most concise way possible - and also touching on what we really need to hear about?!

  • @pulkitb4Mv
    @pulkitb4Mv 3 years ago +6

    I love Mikhail's content; the video is so interactive that it looks like he is talking to you and knows what is going on inside your head :)

  • @stefanlyew8821
    @stefanlyew8821 5 years ago +26

    one of the best technical discussions I have seen

  • @radsimu
    @radsimu 2 years ago +4

    I think it is admirable that you explained all the inner workings. In a real interview you can probably skip the single host solution with the heap, that's good for an explanation on youtube. What I think is more valuable is to also propose some actual technologies for the various components to make it clear that you are not proposing building this from scratch. I'm surprised that Kafka Streams was not mentioned. Also for the long path, it is worth discussing the option to store the raw or pre-aggregated requests in an OLAP db like Redshift. The olap can do the top k efficiently for you with a simple sql query (all the map reduce magic will be handled under the hood), can act as main storage, and will also make you flexible to other analytics queries. Integrates directly with various dashboarding products and one rarely wants to do just top k.

  • @niosocket1
    @niosocket1 4 years ago +13

    So funny: I found this channel yesterday and watched this video, and I was asked pretty much the same question at my interview at LinkedIn today. Thanks a lot.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +2

      Funny, indeed )) This world is so small ))
      Thanks for sharing!

    • @niosocket1
      @niosocket1 4 years ago +5

      Actually got an offer from Amazon, LinkedIn, Roku and probably Google as well. A lot of it because of this channel. Can’t recommend it enough! Thanks again!

    • @HieuNguyen-ty7vw
      @HieuNguyen-ty7vw 4 years ago

      I was asked this same question at my interview last Friday and found your video today :( Didn't nail it though; hope I can do better next time. Thank you Mikhail, hope you can spend time creating more videos like this.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Wow, Sergey. You rock!
      And thank you for the praise.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +2

      Time will come, Hugh. Just keep pushing!

  • @deathbombs
    @deathbombs 3 years ago +5

    19:05 slow path
    22:00 faster than MapReduce but more accurate than count-min
    22:43 fast path
    25:38 Data partitioner is basically Kafka that reads messages (logs, processed logs with counts, etc.) and stores them in topics

  • @SystemDesignInterview
    @SystemDesignInterview  5 years ago +6

    For some reason YouTube hides valid comments. I can only see such comments in the inbox, but there is no way for me to reply. Let me re-post such comments on behalf of the people who submitted them.
    From @sachinjp
    Very detailed and in-depth technical video on system design. Thanks for putting so much effort into this.

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago +1

      Thank you for the feedback, @sachinjp. Hopefully this message will find you.

  • @andrepinto7895
    @andrepinto7895 2 years ago +10

    It is not enough to send the count-min sketch matrix to storage only; you also need to send a list of all the event types that were processed, otherwise you have no way of moving from the matrix data back to the actual values (before hashing). The only advantage over the map solution is that you don't need to keep all of it in memory at once; you can stream it as you go, from disk for example.
    Calculating the min for each key is O(number of hash functions, H), and you need to do that for all event types, so O(E*H). Then you use the priority queue to get the top K, O(E*log(K)), so the total time complexity is O(E*(H + log(K))).
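
    A minimal Python sketch of that flow, assuming we already have the merged sketch rows, their hash functions, and a separate list of all processed event types (all names here are illustrative):

      import heapq

      def top_k_from_sketch(event_types, sketch_rows, hash_fns, k):
          # sketch_rows[i][j] is the counter that hash_fns[i] maps events onto.
          heap = []  # min-heap of (estimated_count, event_type), capped at k
          for event in event_types:                        # O(E) events
              estimate = min(row[h(event) % len(row)]      # O(H) hash functions
                             for row, h in zip(sketch_rows, hash_fns))
              heapq.heappush(heap, (estimate, event))
              if len(heap) > k:
                  heapq.heappop(heap)                      # O(log k)
          return sorted(heap, reverse=True)                # overall O(E * (H + log k))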

    • @xiaopeiyi
      @xiaopeiyi 1 year ago +1

      Well, you are right. But I think the video is more about a general design for a single event type. Then we can start from there based on the functional requirements.

  • @HieuNguyen-ty7vw
    @HieuNguyen-ty7vw 4 years ago +7

    The best system design answer I have seen on YouTube. Thank you!

  • @coolgoose8555
    @coolgoose8555 4 years ago +1

    Ohhhh, why did I not find this channel before... The way you approach the problem and take it forward makes it so easy, even though the realm of system design concepts is huge... We need more videos like this... This is the design pattern of system design... Good job!!!!

  • @arvind30
    @arvind30 4 years ago +10

    One of the best system design channels I've come across! Great job! I particularly liked how you were able to describe a fundamental pattern that can be applied in multiple scenarios

  • @anderswb
    @anderswb 2 years ago +2

    These are by far the best videos on system design for interviews. Thanks a lot for taking the time to make and publish these!

  • @abysswatcher4907
    @abysswatcher4907 2 years ago +3

    For people wondering why the heap complexity is O(n log k) for single-host top k: we do a simple optimization and pop the least frequent item once the heap size exceeds k, so we have n operations, each taking O(log k).
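
    A rough illustration of that optimization (heapq-based; the function name is illustrative):

      import heapq

      def top_k(counted_events, k):
          # counted_events: n pairs of (event, count).
          heap = []                          # min-heap holding at most k entries
          for event, count in counted_events:
              heapq.heappush(heap, (count, event))
              if len(heap) > k:
                  heapq.heappop(heap)        # evict the least frequent: O(log k)
          return sorted(heap, reverse=True)  # n log k overall, plus k log k to sort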

  • @gameonline6769
    @gameonline6769 3 years ago +1

    Thanks Mikhail. I can bet... this is the best channel on YouTube. Just binge-watch all the videos from this channel and you will learn so much.

  • @mohitnagpal8025
    @mohitnagpal8025 4 years ago +1

    I have seen a lot of system design videos but this content's quality is way above the rest. Really appreciate the effort. Please keep posting new topics. Or you can pick the top k heavy-hitter system design problem requests from the comments :)

  • @prateek_jesingh
    @prateek_jesingh 7 months ago

    This is one of the best system design videos on this topic I have come across. Thanks & keep up the great work, Mikhail!

  • @theghostwhowalk
    @theghostwhowalk 4 years ago +2

    Great video. I request you to cover a couple of popular system design questions when you get a chance: (1) recommendation of a celebrity on Instagram, or song recommendation; (2) a real-time coding competition displaying the top 10 winners.

  • @biaozhang1643
    @biaozhang1643 2 years ago +1

    Wow! This is the best system design review video I've ever seen.

  • @manasdalai3934
    @manasdalai3934 4 years ago +3

    This is some of the best system design content I have come across. Thanks a lot.

  • @HeckscherHH
    @HeckscherHH 1 year ago +2

    I think your great coverage of the topic shows how well you really know and understand it, compared to other guys who just share what they read last night. Thank you

  • @boombasach
    @boombasach 1 year ago

    Among all the materials I have seen on YouTube, this is really the top one. Keep up the good work and thanks for sharing

  • @OKJazzBro
    @OKJazzBro 2 years ago +3

    This is an excellent video, but I am left with these questions:
    1. Count-min sketch does not really keep track of video IDs in its cells. Each cell in the table could include collisions from several different videos. So once we have our final aggregated count-min sketch table, we pick the top k frequencies, but we can't tell which video ID each cell corresponds to. So how would it work? I haven't come up with an answer for this.
    2. What type of database would be used to store the top k lists?
    I would just use a simple MySQL database, since the number of rows would not be very large if we have to retain top k lists for a short window of time (say 1 week) and k is not too big. We can always add new instances of the db for each week of data if we need to preserve data for older weeks. We would have to create an index on the time range column to search efficiently.

    • @MrIshan07
      @MrIshan07 3 days ago

      For the 2nd, we can use a Redis sorted set

  • @ahanjura
    @ahanjura 5 years ago +4

    The system design video to beat. PERIOD!!!

  • @prashub8707
    @prashub8707 2 years ago +1

    So far I am loving it. Keeps me glued to your channel. Fantastic job, I must say

  • @dc5
    @dc5 3 years ago +1

    Awesome videos Mikhail... thanks a lot for sharing! That last part showing other problems with similar solutions was the cherry on top.

  • @drakezen
    @drakezen 5 years ago +1

    Bonus on mentioning using Spark and Kafka as I was thinking that during the video. Great stuff as usual!

  • @DinkarGahoi
    @DinkarGahoi 4 years ago +1

    This is by far the best content I have found on the System Design. I am addicted to this content.
    Keep up the good work, waiting for more videos .. :)

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Glad you enjoy it, Dinkar! Sure, more videos to come. I feel very busy these days. But I try to use whatever time is left to work on more content.

  • @ashwin4911
    @ashwin4911 1 year ago +1

    Hi Mikhail, I was going over this video again. I am not clear on how the count-min sketch saves memory. Even if it has a predefined width and height, we still need to know all the videos, like A, B, C, D..., so we can calculate the different hash values for them before doing a min operation to find the count. So that means we need to persist this list of videos somewhere for the period of time we are accumulating the counts.

  • @terigopula
    @terigopula 3 years ago +2

    Your content is PURE GOLD. Hats off! :)

  • @andrewsouthpaw
    @andrewsouthpaw 3 years ago +2

    9:08 🤯 It's so cool seeing various algorithm problems actually applied to real-world designs!

  • @Preppy_Gaming_forlife
    @Preppy_Gaming_forlife 4 years ago +2

    Thank you so much Mikhail for adding top-quality system design videos. I find the content very useful, not only for preparing for system design interviews but also for applying it in my daily work.
    I have a question regarding the slow path: what if certain message keys become hot? In other words, how should we rebalance the partitions if most of the messages go to the same partition?
    As far as I know, Kafka does not support increasing partitions in a topic dynamically. Here, it seems to me that we should use a different approach than the distributed cache design to solve hot partitions.
    Thanks

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +2

      Hi erol serbest,
      Good question. I talk about hot partitions a little bit in this video: ruclips.net/video/bUHFg8CZFws/видео.html (step by step interview guide). One of the ideas mentioned there is to include the event time, for example in minutes, in the partition key. All events within the current minute interval are forwarded to some partition. The next minute, all events go to a different partition. Within one minute interval a single partition gets a lot of data, but over several minutes the data is spread more evenly among partitions.
      In general, the hot partition problem is a tough one. And there is no ideal solution. People try to choose partition keys and strategies properly to achieve a more or less even distribution. Typically, systems rely heavily on monitoring to identify hot partitions in time and do something about them. For example, split partitions if this is a consistently high load. Or use a fallback mechanism to handle excessive traffic if it is temporary.
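
      A tiny sketch of that first idea (time-bucketed partition keys); the function names and hashing choice below are illustrative, not from the video:

        import hashlib, time

        def partition_key(video_id: str) -> str:
            # Append the current minute so a hot video's events rotate across
            # partitions minute over minute instead of pinning a single one.
            minute_bucket = int(time.time() // 60)
            return f"{video_id}:{minute_bucket}"

        def partition_for(key: str, num_partitions: int) -> int:
            digest = hashlib.md5(key.encode("utf-8")).hexdigest()
            return int(digest, 16) % num_partitions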

  • @js4773
    @js4773 2 years ago +1

    Do you have any plans to upload a new series? This is by far the best system design preparation material!

  • @warnercooler4488
    @warnercooler4488 2 years ago

    The amount of info you have covered here is amazing! Thank you so much!

  • @kaqqao
    @kaqqao 1 year ago +1

    Wait. How does count-min sketch actually help me get the overall heavy hitters? It gives me the count for each ID, but I still have to keep a billion IDs somewhere and loop through them and ask the count for each to determine top k? That doesn't sound right. If I have to keep all the IDs anyway, I might as well keep the true counts, it wouldn't be more expensive. Wouldn't Lossy Count be a better algorithm to use for this purpose?

  • @harishshankar682
    @harishshankar682 4 years ago +2

    Can't thank you enough for your efforts in sharing such high-quality content with us!

  • @TheDevYah
    @TheDevYah 3 years ago +1

    these videos are top quality! Do you have a link for donation? What I learned from your videos is way more valuable than many of the books / paid courses that I went through

  • @saurabhchoudhary9260
    @saurabhchoudhary9260 5 years ago +7

    Awesome and detailed explanation. Hats off

  • @laracroft9079
    @laracroft9079 2 years ago

    A DDoS attack is more like a rate limiter problem than a top k problem, because you care about whether a single IP reaches some frequency limit, rather than what the most frequently seen IP is.

  • @shw4083
    @shw4083 2 years ago

    This is really the best tutorial, and I hope there will be an article with this content!

  • @bishnuagrawal828
    @bishnuagrawal828 2 years ago +1

    - What if k is dynamic and the range is between 1 and n?
    - What if granularity is based on minutes only, and how would merging work with dynamic range constraints?

  • @SystemDesignInterview
    @SystemDesignInterview  5 years ago +3

    For some reason YouTube hides valid comments. I can only see such comments in the inbox, but there is no way for me to reply. Let me re-post such comments on behalf of the people who submitted them.
    From @Howell Pan
    So essentially we need to maintain a series of top K lists (heap, map) using count-min, each representing one minute (or a few minutes or an hour, depending on the granularity required). Then we can aggregate/merge these result sets in order to get the top K for the last minute, last hour, last day, last week, month, etc. Is this the idea? I think there is an edge case where some key could have a count just below the top K for many time slots in a row, but added together it could make the top K. So we can't just maintain the top K for each list; we need to maintain the top (K+N) to handle this case, right?

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago +1

      Yes, Howell, you got the idea correctly. And you are right, there is this edge case you mentioned. You actually explained what I asked in the video. To provide an example of why we need the whole data set to calculate top K list precisely (well done!). When we merge top K lists we may lose data. Keeping top (K + N) can help to reduce the total error, but will not help to eliminate the problem completely. That is why we rely on MapReduce jobs.

    • @Tony-cy2yr
      @Tony-cy2yr 4 years ago

      @@SystemDesignInterview Will this N be the number of collisions? Will keeping the top (K + N) show more accurate results at the cost of including false positives? (Say we request the top 2 among A=3, B=2, C=1, D=1, and C & D have a collision causing the combined count-min sketch table to have C=2, and we return the top (2 + 1).)

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Keeping (K + N) elements makes the system more accurate. As you pointed out in your example, the same element (C in your case) may appear on several different machines. C may not be among top K on each of the machines. But it may be among top (K + N), and eventually appear among top K on the Storage machine.
      What this N should be? I do not know. Meaning that I have not seen any papers that describe how to properly choose the value of N, to guarantee the desired error.

  • @SahilSharma-ug8xk
    @SahilSharma-ug8xk 3 years ago +2

    He explains so well

  • @renon3359
    @renon3359 3 years ago

    This content is mind-boggling, hands down the best system design videos on YouTube.
    Just one question: why couldn't the gateway service put messages into different Kafka partitions, and the rest of the things go on from there? We were going to partition at a later stage anyway, so why not do it from the source itself? Can someone answer this please?

    • @MEhagar
      @MEhagar 3 years ago

      Since the API gateway service will be handling API requests from potentially many different clients, it needs to be as simple and efficient as possible so as to not increase request latency. Partitioning the data would take processing time.

  • @AshishGupta-jx1ys
    @AshishGupta-jx1ys 4 years ago +2

    Amazing Video and in detail great explanation. Thanks a lot for creating this in-depth video. Please keep creating more awesome stuff.

  • @zhouchong90
    @zhouchong90 4 years ago +1

    For count-min sketch, if you use multiple nodes to process and then merge, you can only get the global frequency of an item, but not the top K. Because after you merge the counts, all you have is a table of hashed integers, and you don't know what original data produced each hash.
    For this one, I think you can only do a master-slave model, use a single node to do the counting, and produce the top k using a heap.

    • @zhouchong90
      @zhouchong90 4 years ago

      www.cs.rice.edu/~as143/Papers/topkapi.pdf This is an interesting paper that solves merging the count min sketch matrix from multiple processor nodes.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Hi Chong Zhou. Thank you for the question and the reference to the paper. A good read!
      Please take a look at other comments under this video, we address some of your concerns there. Let me provide a summary:
      - Fast Processor machines keep both a count-min sketch and a heap. Count-min sketch stores the frequency, while heap stores the local top-K (original data).
      - Each Fast Processor machine sends data to a single Storage machine. What data do we send? You are right, we need to send the original data. In other words, this is the data stored in the heap. We may or may not send the count-min sketch as well.
      - To avoid a single point of failure at the single Storage machine, we replicate data across several Storage machines.
      Please let me know if you have more questions. Will be glad to clarify further.

    • @alirezaamedeo
      @alirezaamedeo 8 months ago

      Hi @@SystemDesignInterview. I am wondering, why bother with CMS when you want to maintain a heap on a Fast Processor?
      I think it makes sense that we maintain a list of IDs on each node and send them along with the CMS tables to the merger node.
      But I cannot get why a heap would be needed.
      I appreciate your thoughts here.

  • @natarajaneelakanta353
    @natarajaneelakanta353 5 years ago +3

    This is Terrific stuff, keep these coming.

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago

      Thank you for sharing your feedback, Nataraja.

    • @natarajaneelakanta353
      @natarajaneelakanta353 5 years ago

      @@SystemDesignInterview at 13:53, how did min val for A become 4 ?

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago

      My mistake. It should be 3, of course. Thanks for pointing out.

    • @natarajaneelakanta353
      @natarajaneelakanta353 5 years ago

      @@SystemDesignInterview I am sorry, I didn't mean to point out a mistake, I was just inquisitive. You have done a tremendous job (I don't have a better word) explaining these so beautifully. I keep looking into this every week!

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago

      Thank you, Nataraja, for the kind words.

  • @qiushili9932
    @qiushili9932 4 years ago +2

    Thank you for this amazing video. I have some questions:
    1. Count-min sketch is to save memory, but we still have n log k time to get top k, right?
    2. If count-min sketch is only used for 1 min count, why wouldn't we directly use a hash table to count? After all the size of data set won't grow infinitely.
    3. How to merge two top k lists of one hour to obtain top k for two hours?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Hi Qiushi. Thanks for the questions.
      1. Correct. It is n log k (for the heap) + k log k (for sorting the final list). N is typically much larger than k, so n log k is the dominant term.
      2. For small to medium scale, a hash table solution may work just fine. But keep in mind that if we try to create a service that needs to find top K lists for many different scenarios, there may be many such hash tables and it will not scale well. For example, top K lists for most liked/disliked videos, most watched (based on time) videos, most commented, videos with the highest number of exceptions during opening, etc. Similar statistics may be calculated at the channel level, per country/region and so on. Long story short, there may be many different top K lists we may need to calculate with our service.
      3. We need to sum up values for the same identifiers. In other words, we sum up views for the same videos from both lists. And take the top K of the merged list (either by sorting or using a heap).
      Let me know if you still have questions. Will be glad to answer.
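
      A small illustration of point 3 (the helper name is made up):

        import heapq
        from collections import Counter

        def merge_top_k(list_a, list_b, k):
            # Sum views for identifiers appearing in either list, then keep the k largest.
            combined = Counter(dict(list_a)) + Counter(dict(list_b))
            return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

        # merge_top_k([("A", 10), ("B", 7)], [("B", 5), ("C", 9)], k=2)
        # -> [("B", 12), ("A", 10)]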

    • @bhanumanagadeep
      @bhanumanagadeep 4 years ago

      @@SystemDesignInterview Thanks for explaining. How does count-min sketch work when there are different scenarios like you mentioned... most liked/disliked videos? Do we need to build multiple sketches? Do we need a designated hash for each of these categories? Either way, they need more memory, just like a hash table.
      Also, one of the advantages of count-min sketch vs a hash table is that count-min is sublinear in space while a hash table is linear.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Correct. We need a separate sketch to count each different event type: video views, likes, dislikes, submission of a comment, etc.

  • @datbui5863
    @datbui5863 3 years ago +1

    Please, make more videos! Absolutely amazing explanation!!!!!!!!!!!

  • @vANvTO
    @vANvTO 4 years ago +4

    Thank you so much for these videos! I mistakenly bought a course on educative and the content covered there pales in comparison to yours. I've watched your videos over and over and I learned something new each different time. However, I'd like some clarification on API Gateway. I assume this component is made up of many nodes to prevent single point failures, and I'd assume there'd be a Zookeeper of some sort to route to different nodes within this API Gateway. Is this correct? If so, why not just have a normal load balancer instead of an API Gateway? Thanks!

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +5

      Hi vANvTO. Thank you very much for the feedback! Much appreciated.
      You asked good questions. Let me try to answer them briefly.
      - You are right, API Gateway is a cluster consisting of many nodes (machines, servers, whatever definition you prefer).
      - Zookeeper or similar services are not necessarily needed for routing. If this is easier, think of API gateway as a reverse proxy that routes requests to one or more backend services. There are different ways how service discovery can be implemented. Starting from some hardcoded configuration, DNS, dynamic configuration, and all the way to having a separate configuration service.
      - Many things need to happen before routing the request to a backend service. E.g. SSL termination, authentication, throttling, logging, request/response decompression/compression, etc. This is what API Gateway is doing. After that we can send this request to some load balancer that sits in front of a backend service. This load balancer will route the request to one of the backend service's nodes.
      I plan to make a separate video about API Gateway. How and where to use them and how to design one.

  • @XChen-zj1ek
    @XChen-zj1ek 4 years ago

    This is the best video I can find for this topic on YouTube so far! Question about data retrieval. If a user wants a more flexible time range search, say top k from 1:30 to 2:30, but in this design, I only see buckets for 1:00 and 2:00. How will you solve this?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Hi X. Chen. Thank you for the feedback!
      If we are ok to have approximate results, we can easily merge lists for smaller range time intervals to get the result for a wider range interval. E.g. we can merge 1-min or 5-min or 10-min lists to get any 1-hour list (for example from 1:10 to 2:10).
      To get an exact top-k list this logic does not apply. And if user needs results not only for every 00 hour interval (1:00, 2:00, 3:00, etc.), but for 1:30, 2:30: 3:30 ... intervals, we need to calculate results for all such intervals separately. E.g. have several MapReduce jobs. One calculates hh:00 hour intervals, while another one calculates hh:30 hour intervals.

  • @kumarc4853
    @kumarc4853 3 years ago

    Thank you very much Sir, excellent demonstration of coherent design thinking. I feel more equipped than ever to solve system design problems.

  • @yuanjunzhou2373
    @yuanjunzhou2373 4 years ago +1

    When you say that Spark might not solve the problem in the big data volume case, could you elaborate a bit more? I guess a cluster of Spark workers can do it anyway? Or is it not worth it? Thanks!

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Never said that (about Spark cannot solve the problem) :) Sorry for the confusion.
      I believe, the phrase "Depending on the data volume we can solve the top K problem precisely and having a single data processing path. And Kafka + Spark can do the trick." can indeed be understood ambiguously.
      Spark can help to solve this problem. And on a very big scale.

    • @yuanjunzhou2373
      @yuanjunzhou2373 4 years ago +1

      @@SystemDesignInterview Thanks man. Also thank you for the insightful answers to previous questions. I found the answer to exactly what I wanted to ask, for example why we need two sets of distributed message queues here. :)

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      You are welcome, Yuanjun!

  • @chaosu2755
    @chaosu2755 2 years ago

    We cannot remove an entry from the heap, as we may update the same key two times with different values. A top K element might be removed if its value is decreased.

  • @GuitarViolinCovers
    @GuitarViolinCovers 4 years ago +1

    Incredibly clear explanation! This might be obvious but how would you adjust this design if we wanted to return the top K list per user?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Hi Esam. Thank you for the feedback!
      Regarding your question, please note that complexity of the original (top K videos based on views) problem mostly comes from the fact that we may have millions of videos and billions of views. And we need to find a top K list out of this big data set. But if we need to get a top K list of videos per each user, we may utilize a more straightforward and simpler approach, as number of videos per user is relatively small (thousands?). We may simply sort all user's videos and get the top K. Though, we still need to solve a problem of counting video views (please take a look at this video for more details ruclips.net/video/bUHFg8CZFws/видео.html).
      If we have users with millions of videos and we need to quickly find a top K list for such users, solution mentioned in the video fully applies. We just need to create a count-min sketch for such users. And adjust MapReduce jobs to use user account while counting the top K list.

    • @GuitarViolinCovers
      @GuitarViolinCovers 4 years ago

      @@SystemDesignInterview Modifying the MapReduce jobs to use users makes a lot of sense. This helps, thanks!

  • @AmrEldfrawy94
    @AmrEldfrawy94 4 years ago +1

    Really detailed, very rational, and the simulation was awesome. Please keep posting!!
    Can we use hash maps but flush their content (after converting to a heap) to the storage every few seconds, instead of using CMS?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Hi Amr! Good question!
      For small scale it is totally fine to use hash maps. When scale grows, hash maps may become too big (use a lot of memory). To prevent this we can partition data, so that only subset of all the data comes to a Fast Processor service host. But it complicates the architecture. The beauty of CMS is that it consumes a limited (defined) memory and there is no need to partition the data. The drawback of CMS, it calculates numbers approximately. Tradeoffs, tradeoffs...

  • @GelinLuo
    @GelinLuo 4 years ago +4

    Hi, I love your videos! Awesome work! One question regarding count-min sketch: as it only uses hash codes to keep track of counts, once we have the final aggregated top K result we still need to know what each hash code actually refers to. So do we need a database table to store the item name (ID) along with the multiple hash code values for each hash function?

    • @yangchen392
      @yangchen392 1 year ago

      Yes, you need that. Actually, you only need to keep a heap of size K in memory to store the top K keys, so memory usage is very small. Every time you receive a key to count, you increase and get its current approximate frequency from the count-min sketch matrix. Then you update the heap based on that frequency. More specifically, if the key is already in the heap, you increase its frequency in the heap and adjust the heap; otherwise, you compare its frequency with the heap's top element and see whether you can insert the key into the heap, popping out the old element if the heap's size is greater than K.

  • @nagarjuna119
    @nagarjuna119 4 years ago +1

    Definitely one of the best. Thank you. Really like your approach. Do you plan to do more?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +5

      Hi Nagarjuna. Thank you for the feedback!
      I will return back with more regular postings of new videos. There are so many topics we still need to cover. And I try to use all my spare time to prepare new content.

  • @VinayHPTP
    @VinayHPTP 1 year ago +1

    Thanks a lot for detailed explanation. much appreciated! ❤

  • @jagrick
    @jagrick 4 years ago +1

    I think the count-min sketch representation in the simulation is a bit misleading - you only increment the value in the cell that corresponds to the hash of the actual key. Just like a Bloom filter, you lose information about the actual key. The image shows the actual video ID being stored, which doesn't happen.
    Also, you'll later need the key, the video ID in this case, to get the approximate count. For that you'll need to store the recent unique video IDs somewhere and fetch them when computing the top k from the count-min sketch.
    Great video as always, except for this tiny error.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      I see your point, J. Indeed, there is a room for improvement here. It would have been better to show count-min sketch in conjunction with heap and update both of them during simulation.
      There are also some more details in the comments. They help to clarify the data flow and the algorithm a bit further.

    • @scabbage
      @scabbage 4 years ago +1

      If you need to store ALL the unique IDs of the recently watched videos, doesn't that defeat the purpose of CMS? That will have essentially the same type of footprint as a hash map, won't it?

    • @jagrick
      @jagrick 4 years ago

      @@scabbage The only advantage I can think of is that by storing just the unique IDs, and not the actual counts, you'll avoid the complexity associated with having to increment the hash value at scale. However, we need to increment CMS cells anyway. Hmm, this one is a tricky problem.

  • @william0377
    @william0377 2 years ago

    I hope I can have a colleague like you, super pro. Thank you for the excellent video.

  • @xiaolanli7985
    @xiaolanli7985 2 years ago +2

    This guy is amazing!!!

  • @yodali7999
    @yodali7999 3 years ago +1

    super helpful and pretty on point. appreciate the video.

  • @lexdimio
    @lexdimio 5 years ago +1

    Thanks for the great video, I've learned a ton. It would be great if you could make a video on how to correctly do the data size and latency estimations.

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago +2

      Thank you, Alex, for the feedback. I will definitely make a video on various kinds of estimates required for the system design:
      - resources (e.g. CPU, RAM, storage)
      - bandwidth (network, disk)
      - capacity (number of hosts/machines)
      - SLA (latency, availability, durability)
      - etc.

    • @lexdimio
      @lexdimio 5 years ago +1

      Fantastic, thanks!

  • @yezhouchen9710
    @yezhouchen9710 1 year ago

    Amazing topic. But I don't quite understand what data aggregation means after the streaming process?

  • @ElvinYX
    @ElvinYX 3 years ago +1

    The getTopK(startTime, endTime) API may not make sense, as merging 5 one-minute top k lists does not give you 1 five-minute top k list; this is not an approximation but could be very wrong.
    I don't think there is a good solution, but in reality maybe getToday and getThisWeek make more sense.

  • @didizi836
    @didizi836 4 years ago +2

    Too bad we can only give one like for this video

  • @mallika2259
    @mallika2259 3 years ago

    @16:48 Serialize data before sending over the network, to save on network IO if request rate is very high ... Why does this help? Took me a few mins: Serialized data is stored on disk first, this functions like a queuing mechanism, so if there is a spike in the rate of requests and the buffer needs to flush more frequently, this data doesn't have to be sent out over the network at an increased frequency right when the system is busy. Serialization means the rate of sending info to the back end can be decoupled from the rate requests are coming in at the front. Things can build up on disk for a bit, but the expectation is that the back end sending will catch up, assuming the front end spike is not sustained and the request rate drops after a bit.
    (If this isn't what is being implied, please correct me)

  • @RR8401674
    @RR8401674 5 years ago +4

    Awesome work.

  • @dbenbasa
    @dbenbasa 4 years ago

    1) Which database would you consider for the Storage, and how would you model the data for fast retrieval?
    2) Accuracy - it depends on the granularity of the data our MR job will get as input. IOW, the longer the periods we aggregate along the way (i.e. at the Partition Processor), the less accurate our results will be. Another tradeoff to discuss regarding the minimum level of granularity.
    3) Can we also relate this to the case where the user asks for the top K list from 1pm to 2:25pm? I assume we will merge one 1-hour and 25 1-min lists, would you agree?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Hi David,
      1. We store a top K list per time interval, e.g. 1-min, 5-min, 1-hour. There are many options for the format, e.g. JSON, binary formats, or any custom serialization.
      As for the persistent store, we have the whole spectrum: databases (SQL, NoSQL), object stores (e.g. S3). It really depends on many factors, like the number of reads, latency expectations, etc. The important thing to remember is that we need to merge lists when they are retrieved. So, there should be a service in front of the storage that does this.
      2. Not sure I fully understand the use case you describe. Can you please elaborate?
      3. Agree. It can be 1 hour list + 1 minute lists. Or 1 hour list + 5 minute lists. And so on. Depends on the granularity of time periods.
      To speed up data retrieval, we can have a background process that merges smaller lists into bigger lists. E.g. in order not to merge 25 1-minute lists during data retrieval, we can merge smaller lists into bigger ones, e.g. 5-minute or 10-minute, so that we have fewer lists to merge when asked.

  • @Snowflake_Warrior
    @Snowflake_Warrior 1 year ago

    @system Design Interview How do we solve the hot key issue? For example, if A has most of the traffic, then the message queue and partition processor for A will be overloaded. Also, how do we choose the topic and partition strategy in case there are 100k videos?

  • @barryliii3234
    @barryliii3234 5 months ago

    Some confusion around merging top K lists over a time interval. Are we losing a lot of data here? For example, k = 2, for first 1 min, top 2 are A = 100, B = 99, for the second min, top 2 are C = 100, D = 99, merging these two lists gives us A and B as the top 2. But there might be a E which had 98 views in the first min and 98 views in the second min. E should be the most viewed video in this case. In general, how does merging top K lists work?

  • @bishnuagrawal828
    @bishnuagrawal828 2 years ago +2

    content is really good.

  • @jamest7700
    @jamest7700 2 years ago

    Hi Mikhail, your videos are amazing, especially this one! Thank you for posting this. When will the follow-up video on distributed counters be released (it seems like an iteration of this problem but with unbounded K)?

  • @alirezaamedeo
    @alirezaamedeo 8 months ago

    11:15 think of it like this:
    Let's say K=1 i.e. we want the Top 1 item. If we go with 1-minute approach, we'll end up with 60 items after 1 hour. So, it's not possible to accurately calculate Top 1 item over the course of 60 minutes with that approach.

  • @Clostridiumtetani
    @Clostridiumtetani 2 years ago +1

    11:21 Assume 5 events, A, B, C, D and E and k=2. In one minute, top-k is A with 5 views and B with 4 views. C has 3 views and the rest has 0. The next minute top-k now consists of D with 5 and E with 4. Again C has 3 and the rest has 0 views. If we create a combined two minute top-K we would have A and D with 5 views each in that period, even though C got a total of 6, meaning that C should be the highest rated element. Instead, it is not even included in the top-k. This problem can be generalized to an hour.
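
    The same example, checked with a few lines of Python (values copied from the comment above):

      def top_k(counts, k=2):
          return dict(sorted(counts.items(), key=lambda kv: -kv[1])[:k])

      minute1 = {"A": 5, "B": 4, "C": 3}
      minute2 = {"D": 5, "E": 4, "C": 3}

      merged_from_top2 = {}                 # merging only the per-minute top-2 lists
      for partial in (top_k(minute1), top_k(minute2)):
          for key, views in partial.items():
              merged_from_top2[key] = merged_from_top2.get(key, 0) + views

      merged_full = {}                      # merging the full per-minute counts
      for full in (minute1, minute2):
          for key, views in full.items():
              merged_full[key] = merged_full.get(key, 0) + views

      print(top_k(merged_from_top2))  # {'A': 5, 'D': 5}: C is lost
      print(top_k(merged_full))       # {'C': 6, 'A': 5}: C is actually the top item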

  • @snehamurganoor6867
    @snehamurganoor6867 4 years ago +1

    Thank you! Great video. I have a question: why do we need two MapReduce jobs, one for the Frequency Count MapReduce and another for the Top K lists? Further, why are Top K lists split into local Top K lists and then a global Top K list?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +5

      Let's say we want to find the Top K most viewed videos for a day. To do this, we need to count how many times each video was viewed for that day, right? For simplicity, let's say that every time video is viewed, we capture this information in a log file. As a result, video view events are scattered all over many different log files. For example in 24 1-hour log files. The goal of the Frequency Count MapReduce job is to go over all the log files, gather all view events for every video in one place and calculate the total count for every video.
      Top K MapReduce job takes results of the Frequency Count MapReduce job and calculates the Top K list. The way it does it is by splitting the set of all the videos into smaller subsets. Then it finds Top K videos in every subset (local Top K list). And then it finds the final Top K list (global Top K list).
      It may remind you how winners are identified in a tournament. First, we find winners in smaller groups. And then winners from every group play among themselves for the prize.
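
      A very rough, single-machine Python sketch of those two jobs (a real MapReduce run spreads the map and reduce phases across many workers; all names here are illustrative):

        import heapq
        from collections import defaultdict

        def frequency_count_job(log_lines):
            # Map: log line -> (video_id, 1); shuffle + reduce: sum per video_id.
            counts = defaultdict(int)
            for line in log_lines:
                video_id = line.split(",")[0]   # assume each line starts with the video id
                counts[video_id] += 1
            return counts                       # video_id -> total views for the period

        def top_k_job(counts, k, num_subsets=4):
            # Map: split videos into subsets and keep a local top K per subset.
            local_top_k = defaultdict(list)
            for video_id, views in counts.items():
                subset = hash(video_id) % num_subsets
                heapq.heappush(local_top_k[subset], (views, video_id))
                if len(local_top_k[subset]) > k:
                    heapq.heappop(local_top_k[subset])
            # Reduce: merge the local lists into the global top K.
            merged = [pair for subset in local_top_k.values() for pair in subset]
            return heapq.nlargest(k, merged)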

  • @fazalali7162
    @fazalali7162 3 years ago

    Another amazing video. I simply loved it. Will the same fast/slow pipeline approach also work for building view/like/comment counts for FB/Instagram-scale systems?

  • @shubhamshah-sf1po
    @shubhamshah-sf1po 3 months ago

    Merging top k lists from different partitions won't be entirely accurate, right? Let's say one video falls into the k+1 position in two partitions; then we lose that video, right? Or have we made sure that partitioning is such that each partition will get the same video? But then we might have the problem of hot partitions.

  • @Legendary-Akshit
    @Legendary-Akshit 3 years ago

    Also from 25:06 to 25:33, what is the count min sketch supposed to store ? Is it a key-value pair as shown (A=2, B=1) or counts for keys ? If it is the former (key-value pairs in the count min sketch), then the memory footprint in the fast processor host/aggregate host would certainly increase.

  • @ameetc3306
    @ameetc3306 3 years ago

    Sorry, I don't understand the formula for width and height. According to the paper, shouldn't it be w = 2/e, where e is the error rather than the accuracy? And for the height, shouldn't the denominator be ln(1/2) rather than ln(2)? Anyway, beautiful video, very nicely done, thank you.

  • @aiman_yt
    @aiman_yt 2 years ago +1

    If only I saw this before my interview 😭😭

  • @secondsandthings
    @secondsandthings 5 years ago +2

    Great video. I'm not sure if I missed this part - but how would you scale this for large datasets? I.e. Top Google keyword searches - could have billions of different keywords a day?

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago +4

      It is scalable for large data sets. But I can see where your doubts may come from. Let me clarify some points.
      Let's take Google search example.
      First important concept to mention (and I did not mention this in the video) is that our top K service will be geographically distributed. It means we have several copies of the service deployed in different parts of the globe. And data is then aggregated for a specific region (country). This reduces total cardinality of the problem. We no longer deal with all searches across the globe, but for example all searches in the US only.
      Second, we have several stages where we reduce data. We aggregate on the API Gateway side, aggregate further on the Processor side (both Fast and Slow), aggregate on the Storage side. We never count all the data in one place. Each stage (service) helps to decrease burden on the every next service in the data processing pipeline.
      Third, and you will see this in practice a lot: the larger the scale, the less important accuracy becomes. Meaning that probabilistic data structures and algorithms (e.g. count-min sketch) may be heavily utilized. And those probabilistic algorithms are very scalable.
      Please let me know if still in doubt. I may try to pull up some numbers.

    • @secondsandthings
      @secondsandthings 5 years ago +2

      Thanks for the explanation!

    • @anjanaramia3251
      @anjanaramia3251 4 years ago

      @@SystemDesignInterview So in summary, we still need to store the keys...count-min sketch helps achieve savings by not having to maintain counts for keys individually...when one has to find the top k elements, one has to iterate thru every single key and use count-min sketch to find the top k elements...is this understanding accurate?

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago +1

      Hi Anjana,
      We need to store the keys, but only K of them (or a bit more). Not all.
      When a key arrives, we do the following:
      - Add it to the count-min sketch.
      - Get key count from the count-min sketch.
      - Check if the current key is in the heap. If it is present in the heap, we update its count value there. If it is not present in the heap, we check if the heap is already full. If not full, we add this key to the heap. If the heap is full, we check the minimal heap element and compare its value with the current key count value. At this point we may remove the minimal element and add the current key (if the current key count > the minimal element value).
      This way we only keep a predefined number of keys. This guarantees that we never exceed the memory, as both the count-min sketch and the heap have a limited size.
      Hope this helps to clarify it further. Please let me know if you still have questions.
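
      A condensed Python sketch of those steps (the class name, table sizes and hashing are illustrative, not production code):

        import heapq

        class TopKTracker:
            def __init__(self, k, width=2000, depth=5):
                self.k, self.width = k, width
                self.table = [[0] * width for _ in range(depth)]
                self.seeds = list(range(depth))
                self.heap = []        # min-heap of (count, key), at most k entries
                self.counts = {}      # key -> latest count stored in the heap

            def _cells(self, key):
                return [(row, hash((seed, key)) % self.width)
                        for row, seed in enumerate(self.seeds)]

            def add(self, key):
                # 1. Add the key to the count-min sketch.
                for row, col in self._cells(key):
                    self.table[row][col] += 1
                # 2. Read its approximate count back (minimum across rows).
                count = min(self.table[row][col] for row, col in self._cells(key))
                # 3. Update the heap.
                if key in self.counts:
                    self.counts[key] = count
                    self.heap = [(c, kk) for kk, c in self.counts.items()]
                    heapq.heapify(self.heap)      # O(k) rebuild, fine for a sketch
                elif len(self.heap) < self.k:
                    self.counts[key] = count
                    heapq.heappush(self.heap, (count, key))
                elif count > self.heap[0][0]:
                    _, evicted = heapq.heappop(self.heap)
                    del self.counts[evicted]
                    self.counts[key] = count
                    heapq.heappush(self.heap, (count, key))

            def top_k(self):
                return sorted(self.heap, reverse=True)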

  • @atulkumartata
    @atulkumartata 1 year ago

    Simple explanation for why we need the entire dataset of the day in order to compute the top K of the day, and why it is impossible to stitch together the answers of the top K of every hour to get the answer for the entire day:
    A - Think of a video that wasn't lucky enough to make it into the Top K videos of any hour of the day; it was always the top (K+1)th element in all the hours.
    B - also assume that the Top K videos of each hour were one hit wonders, i.e. they were the toppers of the hour but never did well again.
    In such a case, the video from A would be in the top K videos of the day, even though it was never present in the hourly top K videos.
    So, it shows that in order to compute the top K videos of the day, you need the entire day's dataset.

  • @stackunderflow5951
    @stackunderflow5951 2 years ago

    If the partition processor does not need to send top-k lists to the storage service, can we remove the partition processor and the 2nd stream? We may have an HDFS dumper that reads from the first stream and writes to HDFS.

  • @TheGhumanz
    @TheGhumanz 2 years ago

    You explained it very well, thank you!

  • @nazz2711
    @nazz2711 2 years ago

    Hello author, thank you for your videos and work, it was very interesting to watch them. Especially this particular video, and I'll explain why.
    I don't have experience building such services at Google/Facebook scale, but when I saw the top K problem, my first idea was to use some time-series DBs such as Timescale, InfluxDB, etc. to store all events like views or likes. These DBs are usually extremely performant for writes (millions of events/sec is not a problem at all in my experience) and they also have clear scalability. With them you could easily select events for 1h, 1d, 1m, etc. These selects are of course not so fast over millions of records, and I compare this solution to your count-min sketch algorithm. For better read performance you certainly need to use something like MapReduce.
    What do you think about my approach?

  • @TheHinduRakshak
    @TheHinduRakshak 4 years ago +1

    awesome content! learnt a lot, many thanks !

  • @yuganderkrishansingh3733
    @yuganderkrishansingh3733 2 years ago

    Answer to the question of why we need the data for a specific duration to correctly figure out the top k heavy hitters for that duration: if we only have top k heavy hitter data for 1-minute windows, the data that was dropped may contain items that have higher frequency later within the hour or the day, and thus we can't extrapolate 1-minute top k data to 1 hour or 1 day... the chances of incorrectness are higher.

  • @deshengli
    @deshengli 4 years ago

    Great video again! My question is: is "e" the error rate instead of the accuracy we want? I don't understand why the height and width are not determined by the number of unique events. Let's say YouTube has billions of unique videos; there will be tons of collisions in the CM sketch.

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Hi Desheng. You are right, width and height do not depend on the number of events. Which looks like nonsense, as we cannot put billions of unique events into a small table (count-min sketch) and expect numbers to be accurate, right? The number of collisions depends on the size of this table. The smaller the table, the more collisions we will have.
      This is where "e" comes into play. Let's say we want e=1% (0.01). Width in this case will be 200. The real question here is 1% of what? And the answer is 1% of the total number of items seen by the sketch. And if we have 1 million videos and every video was viewed only once, the value for the top 1 video (actually any video) must be one, right? But it will be 1 + 1% * 1,000,000 = 10,000 (approximately), which is really bad. To fix that, we want "e" to be much smaller, e.g. 0.001%, which will immediately lead to a bigger count-min sketch table.
      So, it is important to understand that the error can be quite big. And to make it smaller we need a table of a bigger size. If data is skewed (we have clear outliers in the data set), this matters less. If data is not skewed (closer to an even distribution), count-min sketch values should be treated with care.
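
      To make the trade-off concrete, here is a small calculation assuming the common Cormode-Muthukrishnan parameterization (the video's exact constants may differ, e.g. width = 2/e):

        import math

        def cms_dimensions(epsilon, delta):
            # Guarantee: estimate <= true_count + epsilon * N with probability 1 - delta,
            # where N is the total number of events seen by the sketch.
            width = math.ceil(math.e / epsilon)   # ~2.718 / epsilon columns
            depth = math.ceil(math.log(1 / delta))
            return width, depth

        print(cms_dimensions(0.01, 0.01))    # (272, 5): over-count up to ~10,000 when N = 1,000,000
        print(cms_dimensions(0.0001, 0.01))  # (27183, 5): over-count up to ~100 when N = 1,000,000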

  • @An-fb6ue
    @An-fb6ue 2 years ago

    18:50 For the fast path, does the top k refer to the top k starting from 0 within the current second, or up to now?

  • @cursedswordsman
    @cursedswordsman 2 years ago

    There is no hope a candidate will arrive at this solution without seeing this video first. What's the point in asking such a question?

  • @SystemDesignInterview
    @SystemDesignInterview  5 years ago

    For some reason YouTube hides valid comments. I can only see such comments in the inbox, but there is no way for me to reply. Let me re-post such comments on behalf of the people who submitted them.
    From @Howell Pan
    If we send all the events to a Kafka event server, and on the subscriber end have a cluster of data processing services that aggregate and save data/counts into a NoSQL db (essentially one billion key-value pairs), and at the same time maintain/update the top K in the cache, using the same logic (already in the cache -> update value, add if we have not yet reached K total, and replace an existing entry if a new key has a higher count), the read service will read from the cache directly. Wouldn't this be both fast and accurate?

    • @SystemDesignInterview
      @SystemDesignInterview  5 years ago

      These all are very good considerations. But there are some challenges we will need to address along the way.
      First such challenge is "hot" partitions. Kafka consumers will experience uneven load distribution. People search some keywords much more often than others. Or some video got viral. "Hot partition" problem is much more often in practice than it might seem. And quite complex. That is one of the reasons why we introduced an additional layer - Partitioner.
      Another reason, data partitioning helps each processor to aggregate only a portion of the whole data set, as we cannot aggregate data for the whole data set on a single processor instance (may be only for a very short period of time). So, processing service you mentioned will need to take care of this. We need to wisely partition the data either before putting it into Kafka or right after consuming the data.
      Let's say we have solved the problems mentioned above and the database contains frequency counts for all the items. How do we calculate the top K list accurately based on this data? The procedure you mentioned, where we keep the top K list in a cache and replace items, only works when all frequency counts are calculated. Why? You actually explained it already in your other comment: some item may never be part of the top K list for any single minute, but the sum over a longer period (e.g. 5 minutes) may allow this item to appear in the final top K list. So, if we want to count the top K list for a day, we will first need to wait till the end of that day, accumulate all frequency counts in the database, iterate over the items in the database and calculate the top K list. And I am not really sure how a cache can help us speed things up.
      But I agree that cache can help if accuracy is not important and we can calculate a top K list approximately. Procedure you mentioned will help with this.
      Please let me know if I missed anything.

  • @taishancorp7720
    @taishancorp7720 3 years ago

    This design is good, but what if you just don't know about count-min sketch? I don't think you can expect many interviewees or interviewers (even good ones) to know it. For the interview, can we stick to the min-heap approach?

  • @maheswarreddyu
    @maheswarreddyu 4 years ago

    Very good explanation. Thank you. On the fast path, would the distributed messaging system not be a single point of failure?

  • @anilkatakam2738
    @anilkatakam2738 2 years ago +1

    Thank You!

  • @yueliang171
    @yueliang171 4 years ago

    Another great video!! I need to re-watch it to digest all the details. Some initial questions here... I assume the time window here is fixed, right? Is it possible to achieve a sliding window (I bet not...)? Another question is where in the system the time frame is controlled, i.e. starting/ending a process for a result from 1:00 to 2:00? I am guessing the distributed messaging system after the API gateway can do this? Thanks again!

    • @SystemDesignInterview
      @SystemDesignInterview  4 years ago

      Hi Yue Liang,
      You are correct, we use fixed (tumbling) window. We aggregate data for small time intervals, e.g. 1 minute. And this allows us to get result for larger time intervals, e.g. 1 hour, by merging 1-minute intervals. So, we can answer a question like "give me the top K most viewed videos for every 1 hour interval". And a question like "give me the top K most viewed videos for the last 10 minutes". The first question is a "fixed window type" of question, when we need to calculate statistics for every fixed-sized interval. While the second question is more like a "sliding window type" of question, where we calculate statistics for every possible 10-minutes interval (with 1-minute step). This is not exactly a "classic sliding window", but logically close. Please remember that this approach works to calculate approximate results only. We cannot merge intervals to get accurate results.
      This concept may be hard to grasp. Please let me know if more clarification is needed.
      Regarding your second question, we need to define time frame for which data is aggregated (e.g. 1 hour or 1 minute interval) when we aggregate data in both Fast or Partition Processor services. As well as MapReduce jobs.
      Please let me know if you have more questions.