Hyperloglog: Facebook's algorithm to count distinct elements

Поделиться
HTML-код
  • Опубликовано: 18 янв 2025

Комментарии • 118

  • @YOTUBE8848
    @YOTUBE8848 5 лет назад +24

    9:00 *"10 by 4 is 2.25" lol. Talk about professorial performance pressure! JK... Great job Gaurav*

  • @antonkolupayev726
    @antonkolupayev726 2 года назад +1

    Thank you for speeding moments when you don't talk(drawing), and thanks for your fast and understandable speech so I don't fall asleep, and Iam hooked all the time. Bravo!

  • @HimanshuSingh_92
    @HimanshuSingh_92 5 лет назад +31

    At 8:10, how you calculated 10, 01, 01 and 00 for those 4 userid's?

    • @gkcs
      @gkcs  4 года назад +4

      I am assuming a hash function which takes those numbers and outputs 10, 01, 01 and 10. So 00000 and 10011 give the same hash value of 01.

    • @iotax5
      @iotax5 3 года назад +5

      yes this part was poorly explained rest of video was good though

    • @asakhala
      @asakhala 3 года назад

      @@gkcs but is that based on some logic or you just made them up?

    • @asakhala
      @asakhala 3 года назад

      @@gkcs and how did you know those 2 numbers will give you 01

    • @zixiangwang5514
      @zixiangwang5514 3 года назад +1

      ​@@asakhala I think it's just a assumption, and the logic does not matter

  • @shravanm7131
    @shravanm7131 5 лет назад +29

    So the Crux of this algo is smart bit manipulation , Very happy to see the usefulness of CP problems in real life ! 😄

    • @gkcs
      @gkcs  5 лет назад +1

      Yup!

  • @eldoprano
    @eldoprano 3 года назад +2

    wtf, I spent like 3 weeks reading papers about this structure, and I finally was able to understand it thanks to you.. Thanks!

    • @gkcs
      @gkcs  3 года назад

      :D

  • @ChristopherJereza
    @ChristopherJereza 5 лет назад +2

    this thumbnail looks really good haha keep it up man! Thanks for these videos

    • @gkcs
      @gkcs  5 лет назад

      Thanks! 😁

  • @Charles-m7j
    @Charles-m7j 7 месяцев назад

    Great walkthrough. I’ve seen a few others and they didn’t quite click. Watched this and it makes perfect sense. Thanks!

    • @gkcs
      @gkcs  7 месяцев назад +2

      Cheers!

  • @DarKCremeTai
    @DarKCremeTai 5 лет назад +2

    Read a lot about this, but understood very well by watching this. Thanks for covering it in such good depth!

  • @minhnguyennhat5112
    @minhnguyennhat5112 5 лет назад +23

    10/4 = 2.25 lol what a shitty channel...jk Gaurav, keep up the good work buddy, your videos are extremely helpful!!

    • @gkcs
      @gkcs  5 лет назад

      Hahaha thanks!

  • @darrensapalo
    @darrensapalo 4 года назад

    What a fantastic discussion, at a fantastic pace. Thank you!

  • @inak4760
    @inak4760 3 года назад

    Thank you, Gaurav Sen!

  • @Dimich1993
    @Dimich1993 3 года назад

    You're like the rockstar of algorithms.

  • @oldoctopus393
    @oldoctopus393 2 месяца назад

    you nailed it, thanks!

  • @tiagoavila7818
    @tiagoavila7818 Год назад

    Thanks a lot for the explanation, it was awesome!

    • @gkcs
      @gkcs  Год назад

      You are welcome!

  • @siddhantrao6017
    @siddhantrao6017 3 года назад +2

    Can someone explain to me why the time complexity is O(1)? If I have 100 bitstrings, I still have to check all 100 to see which one has the most zeroes, making the time complexity O(n). Or am I missing something?

  • @chinmaychandak
    @chinmaychandak 5 лет назад +2

    Hi Gaurav, at 4:31 the hint say about bloom filter, but when you explained it, it seemed to me more like Flajolet Martin algorithm with some little bit of Bloom Filter, does that seem right?

  • @albertteplitsky7629
    @albertteplitsky7629 5 лет назад +4

    Hi Gaurav, this was an awesome video and I am watching the rest of the system design classes. I have having trouble with why the answer is loglog(7) at 7:28. The logic I used to justify is that the values from 0-64 have 65 numbers, which can only be expressed with 7 bits. Is this the correct logic?
    Thanks again and looking forward to more content.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!
      That's correct, the log of log of N is at most 7. We hence need at most 7 bits to represent this number.

  • @bratanvonehre3559
    @bratanvonehre3559 3 года назад

    Brilliant video, thanks!

  • @priyanshudas568
    @priyanshudas568 5 лет назад

    Awesome concise explanation! Do more of these!

  • @janetsmith2000
    @janetsmith2000 2 года назад

    Thanks @Graurav. Is the bucket approach requires multiple servers? Or they all sit in the same server?

  • @myjiosim2208
    @myjiosim2208 5 лет назад +5

    @9:27 shouldn't it be 2.5?

    • @gkcs
      @gkcs  5 лет назад +2

      Yes. Messed up there 😅

  • @subirchowdhury8052
    @subirchowdhury8052 4 года назад

    Your videos are amazing, I keep learning a lot everytime.

  • @loveOstriches
    @loveOstriches 3 года назад

    Thank you so much! It was really easy to understand and really helpful!

  • @nalamda3682
    @nalamda3682 5 месяцев назад

    How do you remove elements from hyperloglog?

  • @golangNinja29
    @golangNinja29 5 лет назад +1

    Brotherrrrrr because of u I got placed , thank you so much for your system designing , seriously fan of yours 😍😍😍😍😍

    • @gkcs
      @gkcs  5 лет назад +2

      Congratulations 😁

  • @sethmarcus1784
    @sethmarcus1784 4 года назад

    You explained how this algorithm works better than my Prof!

  • @OliveBZH
    @OliveBZH 5 лет назад

    Impressive demonstration ! Thks a lot

  • @PBrrtrn
    @PBrrtrn 5 лет назад

    Amazing explanation, really down to earth and well driven. Thank you very much.

  • @abhinavsharma2086
    @abhinavsharma2086 5 лет назад +2

    What do you mean by "hash the values based on the next two bits from the right side"? Can you please clarify sir?

  • @vidhansharma1787
    @vidhansharma1787 5 лет назад +2

    I didn't understood much about the topic......but i understood the bit manipulation part......😃

  • @rakeshyadav-ho7vt
    @rakeshyadav-ho7vt 5 лет назад +2

    how come the complexity of insertion came to be of order of 1 in second algo at 3:00

    • @amber_p
      @amber_p 4 года назад +1

      hashmap lookup is order 1 : O(1), barring heaps most DS would have "insertion complexity" 1 :D

  • @bhanumanagadeep
    @bhanumanagadeep 5 лет назад

    Great explanation... Thanks Gaurav.

  • @JohnWick-ep7qr
    @JohnWick-ep7qr 4 года назад +1

    Dear Gaurav. Hyperloglog uses leading zeroes.Not trailing zeroes.

    • @gkcs
      @gkcs  4 года назад

      For a random number, either one is good.

  • @manikandanchinnathambi728
    @manikandanchinnathambi728 5 лет назад +1

    Hey Gaurav.. I just feel like you are covering advanced topics.. I personally expect you to post some videos for a beginner/intermediate coder like me.

  • @amulyam0000
    @amulyam0000 5 лет назад +1

    I have used this Algorithm in my UDF. But the way you explained it man.. wow.. concepts are clear now.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!

  • @AliFeroz-ps7cz
    @AliFeroz-ps7cz 8 месяцев назад

    why isnt harmonic mean used here ?

  • @deb232
    @deb232 5 лет назад

    Maybe a noob question, the 2nd solution which has a hashmap where we store users, wouldn't we need to remove users if connection isn't active anymore? In general, whats the strategy around dead connections?

    • @gkcs
      @gkcs  5 лет назад

      We could mark the connections as closed, or delete them on the database. I personally prefer marking them closed.

  • @bhanumanagadeep
    @bhanumanagadeep 5 лет назад +1

    How does this algorithm be expanded when we want multiple metrics (ex: Num of distinct active users in each region (US,EU etc)). I could think of a solution but not sure if this probabilistic algorithm can be expanded in this way. Instead of storing max zeroes in each server, we may store hash map of max zeroes observed in each region. Then consolidate for overall. Does this work? Also, what happens when a user goes offline? How is the substraction from total done?

  • @mayadharpandit862
    @mayadharpandit862 4 года назад

    Can you please clarify how u make the buckets

  • @venkteshv377
    @venkteshv377 5 лет назад

    I think they additionally use skip lists ( sorted sets in redis ) for live updates of count .

    • @gkcs
      @gkcs  5 лет назад +1

      Interesting stuff.

  • @shadab-hassan
    @shadab-hassan 5 лет назад

    Hi Gaurav, how is number of zeros in user Id related to the number of users on a particular box?

    • @gkcs
      @gkcs  5 лет назад +1

      I explained this in the video.

  • @kennethcarvalho3684
    @kennethcarvalho3684 2 года назад

    So white was your fav color during this period

    • @gkcs
      @gkcs  2 года назад

      Lol, I have few colors :p

  • @pratikthacker
    @pratikthacker 5 лет назад

    please make a video on system design irctc

  • @nuti09to
    @nuti09to 4 года назад

    We are eventually using a hash table(buckets per node) of size m, for storing the max count. Then how does this imply that we have reduced the space complexity to O(1) ? Thank you.

  • @siddharthgaglani4445
    @siddharthgaglani4445 5 лет назад

    Hii Gaurav. I've been watching your videos since a month and before that I had never started programming. Your videos are a little tough for me to understand quickly..can you provide any links or resources for programming ? Also what did you refer for programming?

    • @gkcs
      @gkcs  5 лет назад +1

      I learnt programming in college, so I had 4 years and teachers to learn.
      I'll try and find some links to start programming, and will post them in a future video 🙂

    • @samundrakc2533
      @samundrakc2533 5 лет назад

      go with this teachyourselfcs.com

  • @arunkumardave1333
    @arunkumardave1333 4 года назад

    I used this algo to my existing dataset - using Presto's "approx_distinct" function, and got results to be +(3-4)% of the actual number. And this error rate was gradually increasing when I applied a "groupby" to the same query - Can you throw some light on this, why?

    • @gkcs
      @gkcs  4 года назад +1

      Interesting. I don't know the internal workings of Presto, especially for group results with approximates.
      Would love it if someone could help in the comments. Or I'll try and read up on this soon :)

  • @sunnybarua5061
    @sunnybarua5061 5 лет назад

    Bro, I have an ambition for doing developer job in Canada or New York base software company. At the beginning of my university life, I did problem-solving. After 1 year I discarded to problem-solving. Now Learning Node JS and all stuff related front end development. Can you suggest me the real path so that I will definitely get a job in an international company? Or will you suggest to start problem-solving again?? Thanks in advance!!

  • @RandomVideos-ll6jz
    @RandomVideos-ll6jz 5 лет назад +1

    Hey Gaurav, This video is great. Kindly suggest me a Database Design Techniques book. Thanks in Advance!!!

    • @jayanthmanklu8642
      @jayanthmanklu8642 5 лет назад +2

      Designing Data Intensive Applications by Martin Kleppman is a fantastic reference to algorithms and storage structures used in popular distributed systems

  • @myjiosim2208
    @myjiosim2208 5 лет назад

    @3:02 how is it hashmap when you have a table in db?

    • @gkcs
      @gkcs  5 лет назад +2

      We can create an index on the column and assume a fast response on an "exists" query.

    • @myjiosim2208
      @myjiosim2208 5 лет назад

      @@gkcs so index would take logn time and not constant right?

    • @gkcs
      @gkcs  5 лет назад +1

      @@myjiosim2208 They can also be stored using hash data structures. UserID as key.
      That's O(1) amortized.

  • @godwinsmith5615
    @godwinsmith5615 4 года назад

    I have only one common question for you. How do you identify these algorithms and find explanations to it ? Just being curious.

    • @gkcs
      @gkcs  4 года назад

      Check out the sources in the description :)

  • @asakhala
    @asakhala 3 года назад

    I have a question let’s say you’ve gotten user ids between 1-15 now when you go on to represent this range in binary your smallest will be 0001 and largest will be 1111 but if we follow the logic of keeping track of maximum number of 0’s we’ve seen in right then that will come from 1000 which is 2^3 =8 . But in reality we’ve seen a range of 15 user ids but we are saying we’ve seen only 8. Did I understand it wrong?

    • @pspkfan7653
      @pspkfan7653 2 года назад +1

      why are you ignoring 0000. If you consider this number then, the number of zeros would be 4 which gives 2^4 = 16.

    • @onebyte827
      @onebyte827 2 года назад

      You completely ignored the "Hash" function that converts the number into binary. It will not be that straightforward.

  • @RAJESH2010able
    @RAJESH2010able 5 лет назад

    Request, will it be possible for you to do a video on 'Design Online food ordering service (Uber eats/doordash/zomato) and explain how to integrate it with existing (Uber) ride-sharing service'

  • @nishgupta29
    @nishgupta29 5 лет назад

    Are you an interviewbit graduate ?

  • @abhishekyakhmi
    @abhishekyakhmi 5 лет назад

    Bro i have been watching your content for a while now n let me tell you its bomb for students getting started with system design, keep up and thanks for the tutorials.

    • @gkcs
      @gkcs  5 лет назад

      Thanks!

  • @Girish9777
    @Girish9777 5 лет назад

    Is it necessary to be from an engineering background to get an IT job in India, what you think about coding bootcamps in India?

  • @devran4169
    @devran4169 4 года назад

    Gauray , please add subtitle,my friend. Thx.

  • @juniorbansal
    @juniorbansal 5 лет назад +4

    Gaurav - 10/4 is 2.5. Isn’t it?

    • @gkcs
      @gkcs  5 лет назад

      Oh damn, sorry and thanks haha!

    • @prsh9610
      @prsh9610 5 лет назад

      In the flow,it just happens sometimes.. Not a big deal..

  • @rajanbhandari607
    @rajanbhandari607 5 лет назад +4

    something new and good . make me laugh at 7.32

    • @gkcs
      @gkcs  5 лет назад

      😁

  • @uneeb123
    @uneeb123 5 лет назад

    You should have explained with hashing first. This is a bit confusing. Cool algorithm though.

  • @vasupatel635
    @vasupatel635 5 лет назад

    Where is your brother. ? are you not making videos with your brother anymore ? After getting job in Uber. ?

    • @prsh9610
      @prsh9610 5 лет назад

      He is taking about log of log.. He had the guts to take it on before we do..

    • @prsh9610
      @prsh9610 5 лет назад

      He is taking about log of log..

  • @lijlijlil1898
    @lijlijlil1898 2 года назад

    Hey, 10/4 is 2.5, not 2.25 lol
    9:01

  • @peasant12345
    @peasant12345 Год назад

    I dont think merging is that simple. we need to consider the correlation between two bucket columns.

  • @Algoritmik
    @Algoritmik 3 года назад

    09:00 10/4 = 2.5

  • @bharathpreetham310
    @bharathpreetham310 5 лет назад

    good video

    • @gkcs
      @gkcs  5 лет назад

      Thank you!

  • @rsrssanthosh751
    @rsrssanthosh751 2 года назад

    Some what poorly explained is what I felt, compared to the other videos.

  • @ijaz2020
    @ijaz2020 4 года назад

    Since no one talks about this. Let me do the revelation 10/4 = 2.5

  • @malavipande6693
    @malavipande6693 Год назад

    Man you are talking like Pratham Mittal ;)

  • @adipratapsinghaps
    @adipratapsinghaps 5 лет назад +2

    Elasticsearch uses this for aggregations 😄😄

    • @gkcs
      @gkcs  5 лет назад +1

      Really? Wow...

    • @adipratapsinghaps
      @adipratapsinghaps 5 лет назад

      @@gkcs www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
      Uses for cardinality aggregation. It mentions some hyperlog++ algorithm. God knows what hyperlog++ is. Infact, it must have started using the ++ version for 7.3 version. Before that, plain hyperloglog.

  • @fictionstudios6876
    @fictionstudios6876 5 лет назад

    Thanks bro. I love your videos.Videos were really helpful for me for understanding designing high scaling applications.
    I have a doubt. I have learn about CDN and Redis Cache Server.
    In my understanding both perform similar kind of functions (at least 70% same).
    Which will be better ?
    Are both needed for really high scalable application?
    If you need to explain the differences clearly ,if you have time you can make a video on it.😊😊

  • @vipulpatel9235
    @vipulpatel9235 5 лет назад

    0-64 - Can't they be represented by 6 bits? @7:16, Should be 6 bits

    • @gkcs
      @gkcs  5 лет назад +1

      0-63 = 64 numbers in range. 6 bits required.
      0-64 = 128> 65 >64 numbers in range. 7 bits required.

  • @kimapplegate5066
    @kimapplegate5066 11 месяцев назад

    Facebook did not create the hyperloglog process , mathematicians did this en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algorithm. Please change the title to give them credit!

  • @sumonmal009
    @sumonmal009 3 года назад

    THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    requirement 0:13
    brute force complexity 1:50
    sol1 1:59
    problem 3:17
    problem statement 4:31
    sol 5:15