Hyperloglog: Facebook's algorithm to count distinct elements

Поделиться
HTML-код
  • Опубликовано: 16 ноя 2024

Комментарии • 118

  • @antonkolupayev726
    @antonkolupayev726 2 года назад +1

    Thank you for speeding moments when you don't talk(drawing), and thanks for your fast and understandable speech so I don't fall asleep, and Iam hooked all the time. Bravo!

  • @shravanm7131
    @shravanm7131 5 лет назад +29

    So the Crux of this algo is smart bit manipulation , Very happy to see the usefulness of CP problems in real life ! 😄

    • @gkcs
      @gkcs  5 лет назад +1

      Yup!

  • @Charles-m7j
    @Charles-m7j 5 месяцев назад

    Great walkthrough. I’ve seen a few others and they didn’t quite click. Watched this and it makes perfect sense. Thanks!

    • @gkcs
      @gkcs  5 месяцев назад +2

      Cheers!

  • @eldoprano
    @eldoprano 3 года назад +2

    wtf, I spent like 3 weeks reading papers about this structure, and I finally was able to understand it thanks to you.. Thanks!

    • @gkcs
      @gkcs  3 года назад

      :D

  • @YOTUBE8848
    @YOTUBE8848 5 лет назад +24

    9:00 *"10 by 4 is 2.25" lol. Talk about professorial performance pressure! JK... Great job Gaurav*

  • @DarKCremeTai
    @DarKCremeTai 5 лет назад +2

    Read a lot about this, but understood very well by watching this. Thanks for covering it in such good depth!

  • @Dimich1993
    @Dimich1993 3 года назад

    You're like the rockstar of algorithms.

  • @tiagoavila7818
    @tiagoavila7818 Год назад

    Thanks a lot for the explanation, it was awesome!

    • @gkcs
      @gkcs  Год назад

      You are welcome!

  • @ChristopherJereza
    @ChristopherJereza 5 лет назад +2

    this thumbnail looks really good haha keep it up man! Thanks for these videos

    • @gkcs
      @gkcs  5 лет назад

      Thanks! 😁

  • @inak4760
    @inak4760 3 года назад

    Thank you, Gaurav Sen!

  • @golangNinja29
    @golangNinja29 5 лет назад +1

    Brotherrrrrr because of u I got placed , thank you so much for your system designing , seriously fan of yours 😍😍😍😍😍

    • @gkcs
      @gkcs  5 лет назад +2

      Congratulations 😁

  • @darrensapalo
    @darrensapalo 4 года назад

    What a fantastic discussion, at a fantastic pace. Thank you!

  • @sethmarcus1784
    @sethmarcus1784 4 года назад

    You explained how this algorithm works better than my Prof!

  • @oldoctopus393
    @oldoctopus393 День назад

    you nailed it, thanks!

  • @bratanvonehre3559
    @bratanvonehre3559 3 года назад

    Brilliant video, thanks!

  • @HimanshuSingh_92
    @HimanshuSingh_92 5 лет назад +31

    At 8:10, how you calculated 10, 01, 01 and 00 for those 4 userid's?

    • @gkcs
      @gkcs  4 года назад +4

      I am assuming a hash function which takes those numbers and outputs 10, 01, 01 and 10. So 00000 and 10011 give the same hash value of 01.

    • @iotax5
      @iotax5 3 года назад +5

      yes this part was poorly explained rest of video was good though

    • @asakhala
      @asakhala 3 года назад

      @@gkcs but is that based on some logic or you just made them up?

    • @asakhala
      @asakhala 3 года назад

      @@gkcs and how did you know those 2 numbers will give you 01

    • @zixiangwang5514
      @zixiangwang5514 3 года назад +1

      ​@@asakhala I think it's just a assumption, and the logic does not matter

  • @loveOstriches
    @loveOstriches 2 года назад

    Thank you so much! It was really easy to understand and really helpful!

  • @subirchowdhury8052
    @subirchowdhury8052 4 года назад

    Your videos are amazing, I keep learning a lot everytime.

  • @priyanshudas568
    @priyanshudas568 5 лет назад

    Awesome concise explanation! Do more of these!

  • @bhanumanagadeep
    @bhanumanagadeep 4 года назад

    Great explanation... Thanks Gaurav.

  • @vidhansharma1787
    @vidhansharma1787 5 лет назад +2

    I didn't understood much about the topic......but i understood the bit manipulation part......😃

  • @amulyam0000
    @amulyam0000 5 лет назад +1

    I have used this Algorithm in my UDF. But the way you explained it man.. wow.. concepts are clear now.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!

  • @PBrrtrn
    @PBrrtrn 5 лет назад

    Amazing explanation, really down to earth and well driven. Thank you very much.

  • @OliveBZH
    @OliveBZH 5 лет назад

    Impressive demonstration ! Thks a lot

  • @manikandanchinnathambi728
    @manikandanchinnathambi728 5 лет назад +1

    Hey Gaurav.. I just feel like you are covering advanced topics.. I personally expect you to post some videos for a beginner/intermediate coder like me.

  • @minhnguyennhat5112
    @minhnguyennhat5112 5 лет назад +23

    10/4 = 2.25 lol what a shitty channel...jk Gaurav, keep up the good work buddy, your videos are extremely helpful!!

    • @gkcs
      @gkcs  5 лет назад

      Hahaha thanks!

  • @siddhantrao6017
    @siddhantrao6017 3 года назад +2

    Can someone explain to me why the time complexity is O(1)? If I have 100 bitstrings, I still have to check all 100 to see which one has the most zeroes, making the time complexity O(n). Or am I missing something?

  • @chinmaychandak
    @chinmaychandak 5 лет назад +2

    Hi Gaurav, at 4:31 the hint say about bloom filter, but when you explained it, it seemed to me more like Flajolet Martin algorithm with some little bit of Bloom Filter, does that seem right?

  • @janetsmith2000
    @janetsmith2000 Год назад

    Thanks @Graurav. Is the bucket approach requires multiple servers? Or they all sit in the same server?

  • @JohnWick-ep7qr
    @JohnWick-ep7qr 4 года назад +1

    Dear Gaurav. Hyperloglog uses leading zeroes.Not trailing zeroes.

    • @gkcs
      @gkcs  4 года назад

      For a random number, either one is good.

  • @albertteplitsky7629
    @albertteplitsky7629 5 лет назад +4

    Hi Gaurav, this was an awesome video and I am watching the rest of the system design classes. I have having trouble with why the answer is loglog(7) at 7:28. The logic I used to justify is that the values from 0-64 have 65 numbers, which can only be expressed with 7 bits. Is this the correct logic?
    Thanks again and looking forward to more content.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!
      That's correct, the log of log of N is at most 7. We hence need at most 7 bits to represent this number.

  • @venkteshv377
    @venkteshv377 5 лет назад

    I think they additionally use skip lists ( sorted sets in redis ) for live updates of count .

    • @gkcs
      @gkcs  5 лет назад +1

      Interesting stuff.

  • @myjiosim2208
    @myjiosim2208 5 лет назад +5

    @9:27 shouldn't it be 2.5?

    • @gkcs
      @gkcs  5 лет назад +2

      Yes. Messed up there 😅

  • @nalamda3682
    @nalamda3682 2 месяца назад

    How do you remove elements from hyperloglog?

  • @abhinavsharma2086
    @abhinavsharma2086 5 лет назад +2

    What do you mean by "hash the values based on the next two bits from the right side"? Can you please clarify sir?

  • @siddharthgaglani4445
    @siddharthgaglani4445 5 лет назад

    Hii Gaurav. I've been watching your videos since a month and before that I had never started programming. Your videos are a little tough for me to understand quickly..can you provide any links or resources for programming ? Also what did you refer for programming?

    • @gkcs
      @gkcs  5 лет назад +1

      I learnt programming in college, so I had 4 years and teachers to learn.
      I'll try and find some links to start programming, and will post them in a future video 🙂

    • @samundrakc2533
      @samundrakc2533 5 лет назад

      go with this teachyourselfcs.com

  • @bhanumanagadeep
    @bhanumanagadeep 4 года назад +1

    How does this algorithm be expanded when we want multiple metrics (ex: Num of distinct active users in each region (US,EU etc)). I could think of a solution but not sure if this probabilistic algorithm can be expanded in this way. Instead of storing max zeroes in each server, we may store hash map of max zeroes observed in each region. Then consolidate for overall. Does this work? Also, what happens when a user goes offline? How is the substraction from total done?

  • @AliFeroz-ps7cz
    @AliFeroz-ps7cz 5 месяцев назад

    why isnt harmonic mean used here ?

  • @kennethcarvalho3684
    @kennethcarvalho3684 2 года назад

    So white was your fav color during this period

    • @gkcs
      @gkcs  2 года назад

      Lol, I have few colors :p

  • @deb232
    @deb232 5 лет назад

    Maybe a noob question, the 2nd solution which has a hashmap where we store users, wouldn't we need to remove users if connection isn't active anymore? In general, whats the strategy around dead connections?

    • @gkcs
      @gkcs  5 лет назад

      We could mark the connections as closed, or delete them on the database. I personally prefer marking them closed.

  • @nuti09to
    @nuti09to 4 года назад

    We are eventually using a hash table(buckets per node) of size m, for storing the max count. Then how does this imply that we have reduced the space complexity to O(1) ? Thank you.

  • @abhishekyakhmi
    @abhishekyakhmi 5 лет назад

    Bro i have been watching your content for a while now n let me tell you its bomb for students getting started with system design, keep up and thanks for the tutorials.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!

  • @rakeshyadav-ho7vt
    @rakeshyadav-ho7vt 5 лет назад +2

    how come the complexity of insertion came to be of order of 1 in second algo at 3:00

    • @amber_p
      @amber_p 4 года назад +1

      hashmap lookup is order 1 : O(1), barring heaps most DS would have "insertion complexity" 1 :D

  • @mayadharpandit862
    @mayadharpandit862 4 года назад

    Can you please clarify how u make the buckets

  • @asakhala
    @asakhala 3 года назад

    I have a question let’s say you’ve gotten user ids between 1-15 now when you go on to represent this range in binary your smallest will be 0001 and largest will be 1111 but if we follow the logic of keeping track of maximum number of 0’s we’ve seen in right then that will come from 1000 which is 2^3 =8 . But in reality we’ve seen a range of 15 user ids but we are saying we’ve seen only 8. Did I understand it wrong?

    • @pspkfan7653
      @pspkfan7653 2 года назад +1

      why are you ignoring 0000. If you consider this number then, the number of zeros would be 4 which gives 2^4 = 16.

    • @onebyte827
      @onebyte827 2 года назад

      You completely ignored the "Hash" function that converts the number into binary. It will not be that straightforward.

  • @RandomVideos-ll6jz
    @RandomVideos-ll6jz 5 лет назад +1

    Hey Gaurav, This video is great. Kindly suggest me a Database Design Techniques book. Thanks in Advance!!!

    • @jayanthmanklu8642
      @jayanthmanklu8642 5 лет назад +2

      Designing Data Intensive Applications by Martin Kleppman is a fantastic reference to algorithms and storage structures used in popular distributed systems

  • @bharathpreetham310
    @bharathpreetham310 5 лет назад

    good video

    • @gkcs
      @gkcs  5 лет назад

      Thank you!

  • @RAJESH2010able
    @RAJESH2010able 5 лет назад

    Request, will it be possible for you to do a video on 'Design Online food ordering service (Uber eats/doordash/zomato) and explain how to integrate it with existing (Uber) ride-sharing service'

  • @shadab-hassan
    @shadab-hassan 5 лет назад

    Hi Gaurav, how is number of zeros in user Id related to the number of users on a particular box?

    • @gkcs
      @gkcs  5 лет назад +1

      I explained this in the video.

  • @Girish9777
    @Girish9777 5 лет назад

    Is it necessary to be from an engineering background to get an IT job in India, what you think about coding bootcamps in India?

  • @pratikthacker
    @pratikthacker 5 лет назад

    please make a video on system design irctc

  • @devran4169
    @devran4169 4 года назад

    Gauray , please add subtitle,my friend. Thx.

  • @godwinsmith5615
    @godwinsmith5615 4 года назад

    I have only one common question for you. How do you identify these algorithms and find explanations to it ? Just being curious.

    • @gkcs
      @gkcs  4 года назад

      Check out the sources in the description :)

  • @sunnybarua5061
    @sunnybarua5061 5 лет назад

    Bro, I have an ambition for doing developer job in Canada or New York base software company. At the beginning of my university life, I did problem-solving. After 1 year I discarded to problem-solving. Now Learning Node JS and all stuff related front end development. Can you suggest me the real path so that I will definitely get a job in an international company? Or will you suggest to start problem-solving again?? Thanks in advance!!

  • @rsrssanthosh751
    @rsrssanthosh751 Год назад

    Some what poorly explained is what I felt, compared to the other videos.

  • @uneeb123
    @uneeb123 5 лет назад

    You should have explained with hashing first. This is a bit confusing. Cool algorithm though.

  • @peasant12345
    @peasant12345 Год назад

    I dont think merging is that simple. we need to consider the correlation between two bucket columns.

  • @rajanbhandari607
    @rajanbhandari607 5 лет назад +4

    something new and good . make me laugh at 7.32

    • @gkcs
      @gkcs  5 лет назад

      😁

  • @arunkumardave1333
    @arunkumardave1333 3 года назад

    I used this algo to my existing dataset - using Presto's "approx_distinct" function, and got results to be +(3-4)% of the actual number. And this error rate was gradually increasing when I applied a "groupby" to the same query - Can you throw some light on this, why?

    • @gkcs
      @gkcs  3 года назад +1

      Interesting. I don't know the internal workings of Presto, especially for group results with approximates.
      Would love it if someone could help in the comments. Or I'll try and read up on this soon :)

  • @juniorbansal
    @juniorbansal 5 лет назад +4

    Gaurav - 10/4 is 2.5. Isn’t it?

    • @gkcs
      @gkcs  5 лет назад

      Oh damn, sorry and thanks haha!

    • @prsh9610
      @prsh9610 5 лет назад

      In the flow,it just happens sometimes.. Not a big deal..

  • @malavipande6693
    @malavipande6693 Год назад

    Man you are talking like Pratham Mittal ;)

  • @vasupatel635
    @vasupatel635 5 лет назад

    Where is your brother. ? are you not making videos with your brother anymore ? After getting job in Uber. ?

    • @prsh9610
      @prsh9610 5 лет назад

      He is taking about log of log.. He had the guts to take it on before we do..

    • @prsh9610
      @prsh9610 5 лет назад

      He is taking about log of log..

  • @nishgupta29
    @nishgupta29 5 лет назад

    Are you an interviewbit graduate ?

  • @ijaz2020
    @ijaz2020 4 года назад

    Since no one talks about this. Let me do the revelation 10/4 = 2.5

  • @myjiosim2208
    @myjiosim2208 5 лет назад

    @3:02 how is it hashmap when you have a table in db?

    • @gkcs
      @gkcs  5 лет назад +2

      We can create an index on the column and assume a fast response on an "exists" query.

    • @myjiosim2208
      @myjiosim2208 5 лет назад

      @@gkcs so index would take logn time and not constant right?

    • @gkcs
      @gkcs  5 лет назад +1

      @@myjiosim2208 They can also be stored using hash data structures. UserID as key.
      That's O(1) amortized.

  • @Algoritmik
    @Algoritmik 2 года назад

    09:00 10/4 = 2.5

  • @lijlijlil1898
    @lijlijlil1898 2 года назад

    Hey, 10/4 is 2.5, not 2.25 lol
    9:01

  • @fictionstudios6876
    @fictionstudios6876 5 лет назад

    Thanks bro. I love your videos.Videos were really helpful for me for understanding designing high scaling applications.
    I have a doubt. I have learn about CDN and Redis Cache Server.
    In my understanding both perform similar kind of functions (at least 70% same).
    Which will be better ?
    Are both needed for really high scalable application?
    If you need to explain the differences clearly ,if you have time you can make a video on it.😊😊

  • @adipratapsinghaps
    @adipratapsinghaps 5 лет назад +2

    Elasticsearch uses this for aggregations 😄😄

    • @gkcs
      @gkcs  5 лет назад +1

      Really? Wow...

    • @adipratapsinghaps
      @adipratapsinghaps 5 лет назад

      @@gkcs www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
      Uses for cardinality aggregation. It mentions some hyperlog++ algorithm. God knows what hyperlog++ is. Infact, it must have started using the ++ version for 7.3 version. Before that, plain hyperloglog.

  • @vipulpatel9235
    @vipulpatel9235 5 лет назад

    0-64 - Can't they be represented by 6 bits? @7:16, Should be 6 bits

    • @gkcs
      @gkcs  5 лет назад +1

      0-63 = 64 numbers in range. 6 bits required.
      0-64 = 128> 65 >64 numbers in range. 7 bits required.

  • @kimapplegate5066
    @kimapplegate5066 9 месяцев назад

    Facebook did not create the hyperloglog process , mathematicians did this en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algorithm. Please change the title to give them credit!

  • @sumonmal009
    @sumonmal009 3 года назад

    THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    requirement 0:13
    brute force complexity 1:50
    sol1 1:59
    problem 3:17
    problem statement 4:31
    sol 5:15