No one, seriously no one, does it better than you. You are the GOAT when it comes to system design, sir!
Thanks for this simplistic explanation of the algorithm. I've watched your other videos as well and it has helped a lot. Appreciate the work you are doing. Thank you!
Thank you. Before watching this I found it difficult to understand but by the end of the video, it's crystal clear.
I appreciate this; I was unaware of this algorithm until I saw your easy to understand explanation.
video suggestion : Design maps (google map for example ) and finding the shortest route from A to B with/without traffic
U SAVED MY LIFE!! please provide videos for all sketching algorithms
Great video! You are better than my professor
Count Min Sketch explained with Min complexity, Great Job Naren. Good luck for your videos
Pretty cool!!!
This algorithm is pretty close to Top-K in Redis.
The only difference between this and Top-K from Redis is that Top-K stores two values in each array element (a count and a hash fingerprint), and when a collision happens it does some calculation to "decay" (meaning decrement) the count stored in the sketch instead of always incrementing it.
Very good explanation!! =)
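For readers curious how that decay works, here is a minimal sketch of the idea (HeavyKeeper-style, which is what Redis's Top-K is based on). The class name, cell layout, and decay base are illustrative assumptions, not the exact Redis implementation:

import hashlib
import random

class DecaySketch:
    def __init__(self, cols=8):
        self.cols = cols
        # Each cell holds (fingerprint, count) instead of a bare count.
        self.cells = [(None, 0)] * cols

    def add(self, item: str):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16)
        i = h % self.cols          # which cell this item maps to
        fp = h // self.cols        # fingerprint used to detect collisions
        stored_fp, count = self.cells[i]
        if stored_fp is None or stored_fp == fp:
            self.cells[i] = (fp, count + 1)        # same item: increment
        elif random.random() < 0.9 ** count:
            # Collision: decay the occupant's count with a probability
            # that shrinks as its count grows, instead of incrementing.
            count -= 1
            self.cells[i] = (fp, 1) if count <= 0 else (stored_fp, count)

The effect is that only genuinely frequent items can hold on to a cell, which is why this style of sketch tracks heavy hitters well at a small, fixed size.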
Naren, I am officially your fan. Nice video. You simplify and explain really well. Thanks!
For the failure case where the output is probabilistic, look at letter "B"'s frequency: it should be 1, but from the CMS it comes out as 2. That's the reason it is called probabilistic.
Great video Naren! Very clear explanation
Thank you for the easy explanation. It feels like a good teacher explaining to students.
You have an excellent ability to simplify complex concepts; this was better than the explanation in my big data training course.
Fully clear explanation, Narendra. One suggestion: you could have reduced the video to 6-7 minutes instead of 19.
Great job. Really like the way you explained
This is somewhat similar to bloom filters..
Thanks for the video.. Keep it up !!
Suggestion: food delivery application like swiggy/uberEats
Thanks and Sure, I will work on it
'''
Assumptions:
- the input stream contains single digits or lowercase letters: 36 possibilities
- 8 columns
- 4 hash functions:
  1. md5
  2. sha1
  3. sha256
  4. sha384
'''
import hashlib
from typing import List


class CountMinSketch:
    def __init__(self):
        self.cols = 8
        self.rows = 4
        self.h1 = lambda x: hashlib.md5(x).hexdigest()
        self.h2 = lambda x: hashlib.sha1(x).hexdigest()
        self.h3 = lambda x: hashlib.sha256(x).hexdigest()
        self.h4 = lambda x: hashlib.sha384(x).hexdigest()
        self.hash_funcs = [self.h1, self.h2, self.h3, self.h4]
        self.matrix = [[0] * self.cols for _ in range(self.rows)]

    def __get_hashes(self, c: str) -> List[int]:
        hash_results = []
        for func in self.hash_funcs:
            # Generate a hash value from each hash function as a hex string
            hash_result = func(c.encode())
            # Convert it to a number and take it modulo the column count
            hash_results.append(int(hash_result, 16) % self.cols)
        return hash_results

    def process_element(self, c: str) -> None:
        hash_results = self.__get_hashes(c)
        # Increment the count in each row at the hashed column
        for i, val in enumerate(hash_results):
            self.matrix[i][val] += 1

    def get_freq(self, c: str) -> int:
        hash_results = self.__get_hashes(c)
        freqs = []
        # Collect the frequency value for each hash function
        for i, val in enumerate(hash_results):
            freqs.append(self.matrix[i][val])
        return min(freqs)


if __name__ == '__main__':
    # Use a list for simplicity
    input_stream = ['1', 'v', 'd', 'c', '2', '1', 'c', 'a',
                    '1', '0', 'v', '1', '1', '0', '1', '6', 'd']
    cms = CountMinSketch()
    # Treat the last element of the list as the start of the data stream
    while len(input_stream) > 0:
        c = input_stream.pop()
        cms.process_element(c)
    print(cms.get_freq('1'))
Great video! Thanks for the explanation. Just a side note: if 'e' is the maximum tolerated error rate, the width of the table is w = 2 / e, and the height of the table is h = log(e) / log(0.5). (ref: ieeexplore.ieee.org/abstract/document/6042851?casa_token=r2yECvKp2esAAAAA:4V0CffEkxHBXXe1rvs928EebsoOg79dKrxfIjhsjHMroV2t_-fiQrnDt1qmU2dqvV0NIHCkn4A)
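As a worked example of those formulas (note this follows the cited paper's convention; other sources size the table as w = ceil(e/ε) with Euler's e and d = ceil(ln(1/δ))):

import math

def sketch_dimensions(e: float):
    # Width and height for a maximum tolerated error rate e,
    # using the formulas from the comment above.
    w = math.ceil(2 / e)
    h = math.ceil(math.log(e) / math.log(0.5))
    return w, h

print(sketch_dimensions(0.01))  # (200, 7): 200 columns, 7 hash functions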
Very good explanation
Very nicely explained. This data structure is frequently used in caches.
Perfect explanation! Good job
Great explanation !!
Great explanation! Thank you!
video suggestion: microservice architecture vs traditional architecture
Sure
very good!!!!! Thanks a lot over 1000 times!
while (inpStream)
    element = next(inpStream)
    for hashFn = 1 to 4
        HV = hashFn(element) mod numCols
        MAT(hashFn, HV) = MAT(hashFn, HV) + 1
    end for
end while

findFrequency(element)
    freq = 0
    for hashFn = 1 to 4
        HV = hashFn(element) mod numCols
        freq = freq + MAT(hashFn, HV)
    end for
    return freq / 4    // average of the four counters
Very nice explanation! Thank You!!
Thank you for the good explanation
Thanks a lot of your videos.
Suggestions:
1. your video can be categorized into two:
1) fundamentals and glossary ( which is building blocks to design other systems such as consistent hashing, caching, load balancing... )
2) system design ( such as design uber, facebook, youtube, netflix... ).
So how about creating two groups and put videos into these groups?
2. request of video lecture: proxy, load balancing, web crawler, server and slave architecture for availability
Great Explanation. Thank you.
Updated:
for hashFn = 1 to 4
    for hashValue = 1 to numCols
        MAT(hashFn, hashValue) = 0
    end for
end for

while (inpStream)
    element = next(inpStream)
    for hashFn = 1 to 4
        HV = hashFn(element) mod numCols
        MAT(hashFn, HV) = MAT(hashFn, HV) + 1
    end for
end while

findFrequency(element)
    freq = INFINITY
    for hashFn = 1 to 4
        HV = hashFn(element) mod numCols
        if (MAT(hashFn, HV) < freq)
            freq = MAT(hashFn, HV)
        end if
    end for
    return freq    // minimum of the four counters
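A toy example of why the update above switches from averaging to the minimum: a collision can only inflate a counter, never deflate it, so the smallest counter is the tightest upper bound on the true count (the numbers below are made up for illustration):

# Counters selected for item 'A' in a 4-row sketch; rows 2 and 4
# collided with other items, so their counts are inflated.
counts = [5, 9, 5, 7]                      # true frequency of 'A' is 5

avg_estimate = sum(counts) / len(counts)   # 6.5 -- biased upward by collisions
min_estimate = min(counts)                 # 5   -- tightest upper bound
print(avg_estimate, min_estimate)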
Good explanation 😎
Great videos, sir. I came here from the caching for distributed systems video, where you mentioned this as a use case for calculating the frequency of data in eviction policies. You didn't mention that in the applications, so CACHING can also be an application for the same.
This one is new for me,Thanks for sharing!
Also for video suggestion: How to design multiplayer gaming system
Sure, I will work on it
Thanks...nice explanation!
Thanks bro, that really helps!
Nice explanation 👍
Thank you for this video... You did a great job... :)
Nice explanation. keep it Up
thank you Sys Guru!
Very similar to the Consistent Hash ring solution you explained.
Awesome algo, and of course nicely explained, thanks. :)
Nice explanation. Great Job !
Replace the counters with bloom filters, and you've got yourself a RAMBO.
Amazing job!
Keep up the good work!
Thank you for the explanation. I just have one doubt: can we use this data structure to implement top K URLs accessed in the last hour?
Very nice explanation.. Thank you..
Great work!!
Good tutorial.
You describe it well.
Thanks
my first comment here... just THANK YOU.
Can you please also post a video on concurrency management in databases, as well as code for situations like concurrent ticket reservation? Also, what is the best strategy for handling millions of operations (updates/inserts) in a relational database scenario?
Thanks for the video explanation. How do we query the top K using this algorithm? It gives an object-to-count mapping; what about the reverse?
very good explanation! thanks
You do an awesome job, bro, but sometimes there is an issue with the sound: your voice becomes feeble. Keep it up.
awesome 👌
Nice video, could you please elaborate on how you decided on the number of columns for count-min sketch?
Good question. I was also wondering about that, because in reality the hash function might return a really huge value, which we would have to fit into our window size. But how do we calibrate the window size? That is the question.
@@_ityadi I don't think you can ever change them, but it sounded like you needed to take the max of all the hash functions
video suggestion: file storage and synchronization service system design (dropbox, google drive)
@cos thanks a lot, I will consider the topics
@Cos Sure, I will work on it
@Tech Dummies, in this case, what should be the max value my hash function returns? Or should we consider using a modulo operation to keep the max value in check? What would be the impact of this max value on collisions?
What is the advantage of count-min sketch over a hashtable???
We can store the count in a hashtable with the same time and space complexity.
What should be the modulus value for each hash function??
How did you conclude that more hash functions will be more accurate?
How would we use this for trending data? For example, where we want to see the data for, say, the last 24 hours. Do we subtract entries older than 24 hours from these counters?
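The video doesn't cover this, but one common approach is to keep one sketch per time bucket and sum only the recent buckets at query time, resetting expired buckets rather than subtracting individual entries. A rough sketch under that assumption, reusing the CountMinSketch class posted above (WindowedCMS and its method names are hypothetical):

from collections import deque

class WindowedCMS:
    """Hypothetical 24-hour window: one CountMinSketch per hour."""
    def __init__(self, hours=24):
        self.buckets = deque(CountMinSketch() for _ in range(hours))

    def rotate(self):
        # Called once per hour: drop the oldest bucket, add a fresh one.
        self.buckets.popleft()
        self.buckets.append(CountMinSketch())

    def process_element(self, c: str):
        self.buckets[-1].process_element(c)   # write to the current hour

    def get_freq(self, c: str) -> int:
        # Sum of per-bucket estimates; a slight overestimate, like CMS itself.
        return sum(b.get_freq(c) for b in self.buckets)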
Concise !! Thank you
If it is half of the space, it is still linear. Sublinear means that as the data grows, the difference gets bigger and bigger.
The space won't grow, though; it will be a grid of fixed width and height (height = number of hash functions, width = any arbitrarily large number).
Thanks bro! We love u!
Given a count-Min sketch - how to get the top K? Would be nice to mention what else we need to store in along with the count?
Wouldn't it be better to increase the number of possible columns, i.e. from 0-6 to something like 0-100, instead of increasing the number of hash functions?
If we use reservoir sampling with a fairly large sample size wouldn't we be guaranteed to find the true frequent elements?
I mean the frequent elements present in the sample would also be the frequent elements in the stream.
Super
Thanks. Helpful. However the video explains how count min sketch works but not why it works.
Great job!! Thank you!
This is a little bit similar to a bloom filter, but instead of a Boolean value we keep a count in our sketch.
Thanks
That means you need to know what you are looking for, say I want to know the frequency of "A". Can it be applied when I want the top 20 most frequent items?
How is CMS different from a hash map with no collision verification and with a size smaller than the output space of the hash function?
Example: hash(n) = n mod 17, int freq[17]
freq[hash("your input")]++;
I don't think it is different, conceptually. CMS with one hash function is just hash map with no collision verification.
@@AmanSingh-nu6ng It is different because you are not storing against each object. The total size of your data structure is the size of the sketch in this case. With hashmap, your hashmap will store all distinct elements which can be much greater than the size of your sketch.
@@usamahashmi7274 No, a hash map with no collision verification will not store all the elements distinctly. If n keys have the same hash, they will be stored as one key-value pair. A CMS with 1 hash function is the same as a hash map with no collision verification. The key part is "no collision verification".
Hi @Tech Dummies, which kinds of data stream applications will benefit from using count-min sketches?
Think about the task of estimating heavy hitters, the most frequent events in an event stream. Events can come from IoT devices or from airplane sensors; the source doesn't really matter, only the fact that there are a lot of such events of different types. If we need to find the most frequent events in the traditional way, we need to count all of them and maintain counters for all of them, even the rare ones. Instead, we can use a Count-Min Sketch and maintain only a few counters for some small number of "promising" elements.
If you are interested in such data structures, check out my recent book "Probabilistic Data Structures and Algorithms in Big Data Applications" (pdsa.gakhov.com).
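Several questions in this thread ask how to get the top K back out, since the sketch only answers point queries (item -> estimated count). The usual trick is to keep a small min-heap of candidate items alongside the sketch while consuming the stream. A sketch of that idea, reusing the CountMinSketch class posted above (the helper names are illustrative, and the lazy heap refresh is simplified for brevity):

import heapq

def top_k(stream, k):
    # Track the k items with the largest estimated counts.
    cms = CountMinSketch()
    heap = []                       # min-heap of (estimated_count, item)
    members = set()                 # items currently on the heap
    for item in stream:
        cms.process_element(item)
        est = cms.get_freq(item)
        if item in members:
            # Refresh the item's estimate (lazy O(k) rebuild).
            heap = [(c, x) if x != item else (est, x) for c, x in heap]
            heapq.heapify(heap)
        elif len(heap) < k:
            heapq.heappush(heap, (est, item))
            members.add(item)
        elif est > heap[0][0]:
            # New item beats the current weakest candidate: swap them.
            _, evicted = heapq.heapreplace(heap, (est, item))
            members.discard(evicted)
            members.add(item)
    return sorted(heap, reverse=True)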
Hi, I really enjoy your videos. Count Min Sketch is using basic idea of bloom filter, right?
Is it possible to evict entries from the countmin sketch? Isn't that an important criterion while handling streams? If I subtract for all the hashes on removal, will the "min" property be preserved?
I am not sure, but the min property might be violated!! I will try to get more info and update you.
As you pointed out, if the hash value is the same for both 'A' and 'B', then it might overcount; in that case we should increase the number of hash functions.
Is it also called the kth min value sketch algorithm
great!
Quick question. How do we decide the sketch's size?
Thanks for this, great video! One question: why do you call it sublinear? It looks to me like all the operations are constant except getting the min. So is the time complexity dependent on the number of hash functions?
Sublinear refers to the relationship between data and memory. As the data grows, the memory required to store it does not grow in a linear fashion, since we are using a fixed set of hash functions and hash outputs.
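A concrete illustration of that relationship (the numbers are made up for the example):

# Illustrative numbers only: the sketch's memory is fixed by its
# dimensions, while a hash map grows with the number of distinct keys.
rows, cols = 10, 2000
sketch_counters = rows * cols      # 20,000 counters, no matter how much data arrives

distinct_items = 100_000_000       # e.g. videos on a large platform
hashmap_counters = distinct_items  # one entry per distinct key, plus the keys themselves
print(sketch_counters, hashmap_counters)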
AWESOME
Hi Narendra,
The video was quite informative, but I am still unsure how this takes less space. How do we decide the size of the hash here? And isn't taking 5-6 hashes equivalent to taking a single hash for maintaining the count?
It was maybe not as clear as it could be with the small example size, but the idea is this: say you have millions of videos and you are counting likes. You have a set of hash functions and a number of columns that are much smaller than the total number of videos, whereas a normal hash table would be linear in the number of videos.
I am not able to understand how it takes sublinear space if we have to maintain a reasonable amount of accuracy.
A 6x4 matrix can't store the counts of more than 24 elements accurately
(and that's when the hash functions are working perfectly, leaving at least one accurate count in the matrix).
Then how is it different from a regular hash map? There could be collisions, yes, but even then the search time is log(N) for the few elements which underwent collisions.
Now consider the cost of counting in CMS:
we need to use k hash functions for both puts and gets, with a risk of some inaccuracy.
So I fail to understand what great thing CMS is achieving when we call it a 'sublinear' data structure.
What if we use a hash function that multiplies each character's ASCII value by the index where it appears in the string and adds them all together? Doesn't that guarantee a unique hash value every time?
No, because of the pigeonhole principle. Remember, the purpose of this data structure is to support counting frequencies of elements in sub-linear space, by trading off accuracy. If we have unbounded strings mapping to a finite amount of space, there will always be a collision.
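In fact, a concrete counterexample for the proposed hash (assuming 1-based positions): "ab" and "ca" already collide.

def additive_hash(s: str) -> int:
    # Each character's ASCII value times its 1-based position, summed.
    return sum(ord(c) * i for i, c in enumerate(s, start=1))

print(additive_hash("ab"))  # 97*1 + 98*2 = 293
print(additive_hash("ca"))  # 99*1 + 97*2 = 293 -> same value, different string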
Interesting. Is there an improvement of this algorithm which dynamically increases the matrix size (horizontally, vertically, or both) proportionally, say log(N), to keep the collision probability near constant?
(I assume here that the key space also has infinite size.)
Counting characters in a stream is not O(n), it is O(m) space, where m is the number of characters in the alphabet. Hashing has no benefit here. You could have counted precisely and faster using just an array with one cell for each character...
4:30 Also, there is no data loss in compression... unless it is lossy image compression or something...
5:00 A good hash function shouldn't have too many collisions... if your character set is the English alphabet, as I said above, you don't even need a hash function. There are great 64-bit hash functions with amazing no-collision properties... you can always use two of these to minimize collisions even further...
THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY, WATCH THE FULL VIDEO
----------------------------------------------------------------------
Sublinear 1:00
Why we need count-min sketch 2:30
Count sketch 5:50, 8:30, 9:40
Calculate frequency 13:20 to 15:06
I did not understand how executing 4 hash functions can have a lower time complexity than a map. The hash functions will always be executed 4 times, but a map's hash function can return the result in O(1). Somebody please help me understand.
Please answer me: how about B? Why is it 2 (when I calculate it) when it appears only one time in the stream?
collision
What kind of hash functions should be used in a count-min sketch?
Here is the list: en.wikipedia.org/wiki/List_of_hash_functions
But the catch is to select hash functions which have the property of uniform distribution.
Scalability vs Accuracy
Can you do a video on how the Facebook like button works?
The frequency of B comes out wrong via this CMS. How can we get the most accurate solution?
Just curious why it works?😳