Count min sketch | Efficient algorithm for counting stream of data | system design components

Поделиться
HTML-код
  • Опубликовано: 28 сен 2024

Комментарии • 132

  • @mattleahy3951
    @mattleahy3951 5 лет назад +5

    I appreciate this; I was unaware of this algorithm until I saw your easy to understand explanation.

  • @ankitgarg4273
    @ankitgarg4273 5 лет назад +18

    For the failure case where output is probable, letter "B" frequency can be seen. It should be 1 but from the CSM it is coming as 2. So, that's the reason it is called probable.

    • @waylonbradley1345
      @waylonbradley1345 3 года назад

      pro trick : you can watch movies at flixzone. I've been using them for watching loads of movies recently.

    • @cadejose7770
      @cadejose7770 3 года назад

      @Waylon Bradley Yup, I've been watching on flixzone} for since december myself :D

  • @shreyasns1
    @shreyasns1 2 года назад

    Count Min Sketch explained with Min complexity, Great Job Naren. Good luck for your videos

  • @tassoskat8623
    @tassoskat8623 4 года назад +1

    Great video! You are better than my professor

  • @anandkumar2058
    @anandkumar2058 5 лет назад +15

    video suggestion : Design maps (google map for example ) and finding the shortest route from A to B with/without traffic

  • @karl6892
    @karl6892 2 года назад

    Great video Naren! Very clear explanation

  • @anupamtamrakar3985
    @anupamtamrakar3985 5 лет назад +3

    Great job. Really like the way you explained

  • @apoorvasaxena4522
    @apoorvasaxena4522 5 лет назад +6

    Thank you for this video... You did a great job... :)

  • @mydodethailung395
    @mydodethailung395 Год назад

    Great explanation !!

  • @mohanishmedar5332
    @mohanishmedar5332 4 года назад

    Perfect explanation! Good job

  • @leozhang7341
    @leozhang7341 3 года назад +1

    '''
    Assumption:
    input stream could be single numbers or lower letters: 36 possibilities
    8 columns
    4 hash functions:
    1. md5
    2. sha1
    3. sha256
    4. sha384
    '''
    import hashlib
    from typing import List
    class CountMinSketch:
    def __init__(self):
    self.cols = 8
    self.rows = 4
    self.h1 = lambda x: hashlib.md5(x).hexdigest()
    self.h2 = lambda x: hashlib.sha1(x).hexdigest()
    self.h3 = lambda x: hashlib.sha256(x).hexdigest()
    self.h4 = lambda x: hashlib.sha384(x).hexdigest()
    self.hash_funcs = [self.h1, self.h2, self.h3, self.h4]
    self.matrix = [[0] * self.cols for _ in range(self.rows)]
    def __get_hashes(self, c: str) -> List[int]:
    hash_results = []
    for func in self.hash_funcs:
    # Generate has value from each hash function as string
    hash_result = func(c.encode())
    # Convert it into number and divide by columns size
    hash_results.append(int(hash_result, 16) % 8)
    return hash_results
    def process_element(self, c: str) -> None:
    hash_results = self.__get_hashes(c)
    # Update count values in the matrix
    for i, val in enumerate(hash_results):
    self.matrix[i][val] += 1
    def get_freq(self, c: str) -> int:
    hash_results = self.__get_hashes(c)
    freqs = []
    # Find frequence value for each hash function
    for i, val in enumerate(hash_results):
    freqs.append(self.matrix[i][val])
    return min(freqs)
    if __name__ == '__main__':
    # Use list for simplicity
    input = ['1', 'v', 'd', 'c', '2', '1', 'c', 'a',
    '1', '0', 'v', '1', '1', '0', '1', '6', 'd']
    cms = CountMinSketch()
    # Assume the last element of the list in the start of the data stream
    while len(input) > 0:
    c = input.pop()
    cms.process_element(c)
    print(cms.get_freq('1'))
    Great video! Thanks for the explanation. Just a side note: if 'e' is the most tolerented error rate, the width of the table is: w = 2 / e , and the height of the table is: h = log(e) / log(0.5). (ref: ieeexplore.ieee.org/abstract/document/6042851?casa_token=r2yECvKp2esAAAAA:4V0CffEkxHBXXe1rvs928EebsoOg79dKrxfIjhsjHMroV2t_-fiQrnDt1qmU2dqvV0NIHCkn4A)

  • @dpsky666
    @dpsky666 Год назад

    Great explanation! Thank you!

  • @mayankrathore7558
    @mayankrathore7558 5 лет назад +1

    Awesome algo and ofcourse nicely explained thanks . :)

  • @jonarmani8654
    @jonarmani8654 2 года назад

    Replace the counters with bloom filters, and you've got yourself a RAMBO.

  • @GSSMoney
    @GSSMoney 3 года назад

    Given a count-Min sketch - how to get the top K? Would be nice to mention what else we need to store in along with the count?

  • @barrymullan6363
    @barrymullan6363 2 года назад

    Great explanation, thanks!

  • @mahdiarghavani1804
    @mahdiarghavani1804 5 лет назад

    You describe it well.
    Thanks

  • @sebastiandanielripari6457
    @sebastiandanielripari6457 5 лет назад

    Nice explanation. Great Job !

  • @satishvedpathak9972
    @satishvedpathak9972 5 лет назад

    for hashFn = 1 to 4
    for hashValue = 1 to n
    MAT(hashFn,hashValue) = 0;
    end for
    end for
    while (inpStream)
    for hashFn = 1 to 4
    HV = s1(hashFn(inpStream))
    MAT(hashFn,HV) = MAT(hashFn,HV)++;
    end for
    end while
    findFrequncey(inputSteam)
    for hashFn = 1 to 3
    if ( MAT(hashFn,HV) > MAT(hashFn++,HV)
    freq = MAT(hashFn++,HV)
    else
    freq = MAT(hashFn,HV)
    end if
    end for
    return freq

  • @ThoRicHeLLi
    @ThoRicHeLLi 5 лет назад +1

    Keep up the good work!

  • @antrapurohit8010
    @antrapurohit8010 2 года назад

    Thank you for the explanation. Just have one doubt. Can we use this data structures to implement top K URLs accessed in last 1hr?

  • @rameshnraghavan
    @rameshnraghavan 5 лет назад

    Very nice explanation.. Thank you..

  • @singaravelann3671
    @singaravelann3671 3 года назад

    What should be the modules value for each hash function??

  • @DorinCiobanu007
    @DorinCiobanu007 5 лет назад +2

    How is CMS different from a hash map with no collision verification and with smaller than the data output space size of the hash function?
    Example: hash(n) = n mod 17, int freq[17]
    freq[hash("your input")]++;

    • @AmanSingh-nu6ng
      @AmanSingh-nu6ng 5 лет назад +1

      I don't think it is different, conceptually. CMS with one hash function is just hash map with no collision verification.

    • @usamahashmi7274
      @usamahashmi7274 5 лет назад

      @@AmanSingh-nu6ng It is different because you are not storing against each object. The total size of your data structure is the size of the sketch in this case. With hashmap, your hashmap will store all distinct elements which can be much greater than the size of your sketch.

    • @AmanSingh-nu6ng
      @AmanSingh-nu6ng 5 лет назад

      @@usamahashmi7274 no hash map with no collision verification will not store all the elements distinctly. If n keys will have same hash they will be stored as 1 key value. CMS with 1 hash function is same as hashmap with no collision verification. The key part is no collision verification.

  • @Cosciug1234
    @Cosciug1234 6 лет назад +2

    video suggestion: file storage and synchronization service system design (dropbox, google drive)

  • @moshelevi5485
    @moshelevi5485 3 года назад

    Amazing job!

  • @arayanamit123
    @arayanamit123 6 лет назад

    @Tech Dummies , In this I what should be the max value my Hash function should return, or should we consider using modulo operation to keep the max value in check. what would be the impact of this max value on the collision.

  • @AmolGautam
    @AmolGautam 8 месяцев назад

    Thanks

  • @saitrinathdubba
    @saitrinathdubba 4 года назад

    Concise !! Thank you

  • @大盗江南
    @大盗江南 4 года назад

    Thanks bro! We love u!

  • @Tasteofgyan
    @Tasteofgyan 5 лет назад

    Great job!! Thank you!

  • @bharath_v
    @bharath_v 3 года назад

    As you pointed out, if the hash value is same for both 'A' and 'B', then it might overcount, in such case we should increase the number of hash function.

  • @varunganesh2028
    @varunganesh2028 5 лет назад

    Is it possible to evict entries from the countmin sketch? Isn't that an important criterion while handling streams? If I subtract for all the hashes on removal, will the "min" property be preserved?

    • @TechDummiesNarendraL
      @TechDummiesNarendraL  5 лет назад

      I am not sure, but min property might be violated!! I will try to get more info and update you

  • @chinna311983
    @chinna311983 4 года назад

    What if we use a hash function which uses string each literal ascii multiplied by index of the letter where it appears, and add all together, doesn't that guarantee the unique hash value all the time?

    • @thedankest1974
      @thedankest1974 2 года назад

      No, because of the pigeon hole principle. Remember the purpose of this data structure is to support counting frequencies of elements in sub-linear time, by trading-off accuracy. If we have unbounded strings mapping to a finite amount of space, there will always be a collision.

  • @bhulakshmidevanamaina3001
    @bhulakshmidevanamaina3001 4 года назад

    Is it also called the kth min value sketch algorithm

  • @nareshm8291
    @nareshm8291 5 лет назад

    can anyone please tell me what would be the space and time complexity of count min sketch algorithm

    • @nzs1
      @nzs1 5 лет назад

      Kind of complex question. But a quick answer: According to wikipedia, the space is decided beforehand and is only based on how big of an array to allocate (where rows = # hashes, cols = # of hash values).

  • @KrishnaList
    @KrishnaList 5 лет назад

    Gr8

  • @wendyhome6055
    @wendyhome6055 2 года назад

    三味中药泡 茶

  • @escobyte
    @escobyte 3 года назад

    Matrix 🤣

  • @rinkalagrawal
    @rinkalagrawal 3 года назад +13

    Thanks for this simplistic explanation of the algorithm. I've watched your other videos as well and it has helped a lot. Appreciate the work you are doing. Thank you!

  • @sumonmal009
    @sumonmal009 3 года назад +1

    THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sublinear 1:00
    Why we need count min sketch 2:30
    Count sketch 5:50 8:30 9:40
    calculate frequency13:20 to 15:06

  • @mayuresh247
    @mayuresh247 6 лет назад +3

    This is somewhat similar to bloom filters..
    Thanks for the video.. Keep it up !!
    Suggestion: food delivery application like swiggy/uberEats

  • @hidekazushidara320
    @hidekazushidara320 6 лет назад +4

    video suggestion: microservice architecture vs traditional architecture

  • @fernandozago
    @fernandozago Год назад +1

    Pretty coooll!!!
    Is pretty close algorithm like KTop of Redis....
    But, the only difference between this and Ktop from redis is that KTop uses two Values on each array element (Like: Count and Hash) and when a collision happens, it does some calculation to "decay" (meaning Decrement the count) from the data on the sketch instead of always increase the count.
    Very gool explanation !! =)

  • @sarthakuiit
    @sarthakuiit 2 года назад

    I am not able to understand how does it take sub linear space if we have to maintain a reasonable amount of accuracy?
    For 6x4 matrix it can’t store count of more than 24 elements accurately.
    (that too when hash functions are working perfectly, leaving at least 1 count accurately in the matrix)
    Then how is it different from a regular hashMap. There could be collisions.. yes.. but then also search time is LOG(N) for few of the elements which underwent collision.
    But then, let’s consider cost for counting in CMS -
    We need to use k hash functions to both put and get values with a risk of some inaccuracy.
    So I fail to understand, what great CMS is achieving when we call it ‘sub linear’ data structure.

  • @tutsdriveonline2632
    @tutsdriveonline2632 5 месяцев назад

    Won't it be better if increase number of possible columns i.e. from 0-6 to like 0-100 instead of increasing the hasfunctions ?

  • @SHASHALIBRA
    @SHASHALIBRA 2 года назад

    Did not understand how executing 4 hash function can have lower time complexity than map. Hash functions will always be excited 4 times but a hash function can return the result in O(1). Somebody plz help me understand

  • @YT-yt-yt-3
    @YT-yt-yt-3 2 года назад

    Thanks. Helpful. However the video explains how count min sketch works but not why it works.

  • @viputdBeast
    @viputdBeast 4 года назад +1

    No one ..!! Seriously no one does it better than you.. You are GOAT sir when it comes to system design....

  • @awaissindhu
    @awaissindhu 4 года назад +1

    how would we use this for trending data? for example, where we want to see the data, say, for the last 24 hours? do we subtract entries from these values that are older than 24 hours?

  • @yuenyiupang
    @yuenyiupang 2 года назад

    that mean u need to know what you looking for, say i wanna know the freq of "A", can it apply to i wanna the top 20 freq item?

  • @pixelizedworld
    @pixelizedworld 3 года назад

    my first comment here... just THANK YOU.

  • @janina542
    @janina542 5 лет назад +2

    Nice explanation. keep it Up

  • @sivamunnaluri2263
    @sivamunnaluri2263 3 года назад

    How you are concluded more hash function will have more accurate ?

  • @李雨蓁-i4t
    @李雨蓁-i4t Год назад

    very good!!!!! Thanks a lot over 1000 times!

  • @aprilfool302
    @aprilfool302 2 года назад

    You do awesome bro, but sometimes there is issue with sound, your voice becomes feeble. Keep it up

  • @satang500
    @satang500 5 лет назад +1

    Thanks a lot of your videos.
    Suggestions:
    1. your video can be categorized into two:
    1) fundamentals and glossary ( which is building blocks to design other systems such as consistent hashing, caching, load balancing... )
    2) system design ( such as design uber, facebook, youtube, netflix... ).
    So how about creating two groups and put videos into these groups?
    2. request of video lecture: proxy, load balancing, web crawler, server and slave architecture for availability

  • @prateekthakur4060
    @prateekthakur4060 6 лет назад +2

    This one is new for me,Thanks for sharing!
    Also for video suggestion: How to design multiplayer gaming system

  • @vincentmax4571
    @vincentmax4571 5 лет назад +2

    Good tutorial.

  • @dataguy7013
    @dataguy7013 4 года назад +1

    Naren, I am officially your fan. Nice video. You simplify and explain really well. Thanks!

  • @chitthiaayeehai
    @chitthiaayeehai 5 лет назад +1

    Super

  • @smithabs
    @smithabs 26 дней назад

    Very good explanation

  • @FengZhu-ym6qu
    @FengZhu-ym6qu Год назад

    Thanks bro, that really helps!

  • @Templesify
    @Templesify 7 месяцев назад

    awesome 👌

  • @jhnflory
    @jhnflory 2 года назад

    Thanks...nice explanation!

  • @mgayatri
    @mgayatri 5 лет назад +1

    Very similar to the Consistent Hash ring solution you explained.

  • @gauravdudeja
    @gauravdudeja 3 года назад

    Just curious why it works?😳

  • @olenaolenaolena
    @olenaolenaolena 3 года назад

    wow, smart and cute)

  • @Beadyboyworld
    @Beadyboyworld 2 года назад

    Good explanation 😎

  • @kumarc4853
    @kumarc4853 3 года назад

    thank you Sys Guru!

  • @nik15oc
    @nik15oc 6 лет назад +1

    very nicely explained. This data structure is frequently used in the caches

  • @ghazalabd8890
    @ghazalabd8890 Год назад

    great!

  • @shortstories765
    @shortstories765 3 года назад

    what is the advantage of count-min sketch over hashtable???
    we can store the count in hashtable, same time and space complexity

  • @ameynaik2743
    @ameynaik2743 3 года назад +1

    Nice video, could you please elaborate on how you decided on the number of columns for count-min sketch?

    • @_ityadi
      @_ityadi 3 года назад

      Good question. I was also wondering on that coz in reality the hash function might return a really huge value which we would have to fit into our window size. But how do we calibrate the window size, that is the question

    • @RandomShowerThoughts
      @RandomShowerThoughts Год назад

      @@_ityadi I don't think you can ever change them, but it sounded like you needed to take the max of all the hash functions

  • @AbhishekSharma-si8ui
    @AbhishekSharma-si8ui 4 года назад +1

    AWESOME

  • @satishvedpathak9972
    @satishvedpathak9972 5 лет назад

    while (inpStream)
    for hashFn = 1 to 4
    HV = s1(hashFn(inpStream))
    MAT(hashFn,HV) = MAT(hashFn,HV)++;
    end for
    end while
    findFrequncey(inputSteam)
    for hashFn = 1 to 4
    freq = freq + MAT(hashFn,HV)
    end for
    return freq/hashFn

  • @satishvedpathak9972
    @satishvedpathak9972 5 лет назад

    Updated: for hashFn = 1 to 4
    for hashValue = 1 to n
    MAT(hashFn,hashValue) = 0;
    end for
    end for
    while (inpStream)
    for hashFn = 1 to 4
    HV = s1(hashFn(inpStream))
    MAT(hashFn,HV) = MAT(hashFn,HV)++;
    end for
    end while
    findFrequncey(inputSteam)
    for hashFn = 1 to 3
    if ( MAT(hashFn,HV) > MAT(hashFn++,HV)
    freq = MAT(hashFn++,HV)
    else
    freq = MAT(hashFn,HV)
    end if
    end for
    return freq

  • @pramodsingh4668
    @pramodsingh4668 2 года назад

    Fully clear explanation Narendra. One suggestion, you could have reduced the video to 6-7 minutes instead of 19 minute.

  • @yazicib1
    @yazicib1 4 года назад

    Counting characters in a stream is not O(n) it is O(m) space where m is the number of characters in the alphabet. Hashing has no benefit here. You could have counted precisely and faster using just an array once cell for each character...
    4:30 Also there is no data loss in compression... unless it is lossy image compression or something...
    5:00 A good hash function shouldnt have too many collisions... if your character set is english alphabet, as i said above, you don’t even need a has function. There are great 64 bit hash function with amazing no-collision properties... you can always use twoof these to minimize even further...

  • @mahesh116
    @mahesh116 6 лет назад +1

    Great work!!!

  • @rozawar
    @rozawar 3 года назад

    Thanks for the video explanation. How to query top K using this algorithm? it shows object to count mapping. What about reverse?

  • @chikudholu
    @chikudholu 6 лет назад +1

    Great work!!

  • @PrateekMehtaABDFAN
    @PrateekMehtaABDFAN 2 года назад

    Great videos sir . I came here from the caching for distributed system video where you mentioned this as a use-case for calculating the frequency of data in eviction policies . You kind of dint mention that in the applications . So, CACHING can also be an application for the same .

  • @meghadave9363
    @meghadave9363 2 года назад

    Can you please also post a video on concurrency management in database as well as code for situations like concurrent ticket reservation. Also what is the best strategy for handling millions of operation(update/insert) in a relational database scenario

  • @moulisiramdasu6753
    @moulisiramdasu6753 2 года назад

    Thank you. Before watching this I found it difficult to understand but by the end of the video, it's crystal clear.

  • @studentforlife5502
    @studentforlife5502 3 года назад

    U SAVED MY LIFE!! please provide videos for all sketching algorithms

  • @AnuragSingh7
    @AnuragSingh7 3 года назад

    If we use reservoir sampling with a fairly large sample size wouldn't we be guaranteed to find the true frequent elements?
    I mean the frequent elements present in the sample would also be the frequent elements in the stream.

  • @ShabnamKhan-cj4zc
    @ShabnamKhan-cj4zc 3 года назад

    Thank you easy explanation.. It feels like good teacher explaning to students.

  • @saurabhgohil2232
    @saurabhgohil2232 5 лет назад

    This is little bit similar to bloom filter. but instead of Boolean value we keep count in our sketch.

  • @harshav3902
    @harshav3902 3 года назад

    Great Explanation. Thank you.

  • @abhishek_sengupta
    @abhishek_sengupta 4 года назад

    Very nice explanation! Thank You!!

  • @vitaliano01
    @vitaliano01 3 года назад

    Thank you for the good explanation

  • @rajeshji7698
    @rajeshji7698 4 года назад

    Can u do videos for how Facebook like button works ?

  • @prashidell8980
    @prashidell8980 5 лет назад

    Hi, I really enjoy your videos. Count Min Sketch is using basic idea of bloom filter, right?

  • @biboswanroy6699
    @biboswanroy6699 4 года назад

    Scalability vs Accuracy

  • @harsha.m4026
    @harsha.m4026 4 года назад

    Quick question. How do we decide the sketch's size?

  • @rishabhgupta2513
    @rishabhgupta2513 5 лет назад

    The frequency of B is coming wrong via this CSM.How can we get the most accurate solution?

  • @arvindaaswani1303
    @arvindaaswani1303 4 года назад

    Nice explanation 👍

  • @satishvedpathak9972
    @satishvedpathak9972 5 лет назад

    you have excellent skills to simplify the complex concepts ; this was better than explain in my big data training course .

  • @агатакристи-г3ы
    @агатакристи-г3ы 5 лет назад

    Interesting, is there an improvement of such algo which dinamically increase the matrix size (horizontally or vertically or both) proportionally, say log(N) to keep collision probability near constant..
    (I assume here the key space is also has infinite size.)