Count min sketch | Efficient algorithm for counting stream of data | system design components

Tech Dummies Narendra L

Просмотров 99 тыс.

Добавить в
- Мой плейлист
- Посмотреть позже
Поделиться

HTML-код

Размер видео:

Показать панель управления

Автовоспроизведение

Автоповтор

Опубликовано: 16 ноя 2024

Комментарии • 132

@viputdBeast 4 года назад ⁺¹
No one ..!! Seriously no one does it better than you.. You are GOAT sir when it comes to system design....
@rinkalagrawal 3 года назад ⁺¹⁴
Thanks for this simplistic explanation of the algorithm. I've watched your other videos as well and it has helped a lot. Appreciate the work you are doing. Thank you!
@moulisiramdasu6753 2 года назад
Thank you. Before watching this I found it difficult to understand but by the end of the video, it's crystal clear.
@mattleahy3951 5 лет назад ⁺⁵
I appreciate this; I was unaware of this algorithm until I saw your easy to understand explanation.
@anandkumar2058 5 лет назад ⁺¹⁵
video suggestion : Design maps (google map for example ) and finding the shortest route from A to B with/without traffic
@studentforlife5502 4 года назад
U SAVED MY LIFE!! please provide videos for all sketching algorithms
@tassoskat8623 5 лет назад ⁺¹
Great video! You are better than my professor
@shreyasns1 2 года назад
Count Min Sketch explained with Min complexity, Great Job Naren. Good luck for your videos
@fernandozago Год назад ⁺¹
Pretty coooll!!!
Is pretty close algorithm like KTop of Redis....
But, the only difference between this and Ktop from redis is that KTop uses two Values on each array element (Like: Count and Hash) and when a collision happens, it does some calculation to "decay" (meaning Decrement the count) from the data on the sketch instead of always increase the count.
Very gool explanation !! =)
@dataguy7013 4 года назад ⁺¹
Naren, I am officially your fan. Nice video. You simplify and explain really well. Thanks!
@ankitgarg4273 5 лет назад ⁺¹⁹
For the failure case where output is probable, letter "B" frequency can be seen. It should be 1 but from the CSM it is coming as 2. So, that's the reason it is called probable.
@waylonbradley1345 3 года назад
pro trick : you can watch movies at flixzone. I've been using them for watching loads of movies recently.
@cadejose7770 3 года назад
@Waylon Bradley Yup, I've been watching on flixzone} for since december myself :D
@karl6892 2 года назад
Great video Naren! Very clear explanation
@ShabnamKhan-cj4zc 3 года назад
Thank you easy explanation.. It feels like good teacher explaning to students.
@satishvedpathak9972 5 лет назад
you have excellent skills to simplify the complex concepts ; this was better than explain in my big data training course .
@pramodsingh4668 2 года назад
Fully clear explanation Narendra. One suggestion, you could have reduced the video to 6-7 minutes instead of 19 minute.
@anupamtamrakar3985 5 лет назад ⁺³
Great job. Really like the way you explained
@mayuresh247 6 лет назад ⁺³
This is somewhat similar to bloom filters..
Thanks for the video.. Keep it up !!
Suggestion: food delivery application like swiggy/uberEats
@TechDummiesNarendraL 6 лет назад ⁺²
Thanks and Sure, I will work on it
@leozhang7341 4 года назад ⁺¹
'''
Assumption:
input stream could be single numbers or lower letters: 36 possibilities
8 columns
4 hash functions:
1. md5
2. sha1
3. sha256
4. sha384
'''
import hashlib
from typing import List
class CountMinSketch:
def __init__(self):
self.cols = 8
self.rows = 4
self.h1 = lambda x: hashlib.md5(x).hexdigest()
self.h2 = lambda x: hashlib.sha1(x).hexdigest()
self.h3 = lambda x: hashlib.sha256(x).hexdigest()
self.h4 = lambda x: hashlib.sha384(x).hexdigest()
self.hash_funcs = [self.h1, self.h2, self.h3, self.h4]
self.matrix = [[0] * self.cols for _ in range(self.rows)]
def __get_hashes(self, c: str) -> List[int]:
hash_results = []
for func in self.hash_funcs:
# Generate has value from each hash function as string
hash_result = func(c.encode())
# Convert it into number and divide by columns size
hash_results.append(int(hash_result, 16) % 8)
return hash_results
def process_element(self, c: str) -> None:
hash_results = self.__get_hashes(c)
# Update count values in the matrix
for i, val in enumerate(hash_results):
self.matrix[i][val] += 1
def get_freq(self, c: str) -> int:
hash_results = self.__get_hashes(c)
freqs = []
# Find frequence value for each hash function
for i, val in enumerate(hash_results):
freqs.append(self.matrix[i][val])
return min(freqs)
if __name__ == '__main__':
# Use list for simplicity
input = ['1', 'v', 'd', 'c', '2', '1', 'c', 'a',
'1', '0', 'v', '1', '1', '0', '1', '6', 'd']
cms = CountMinSketch()
# Assume the last element of the list in the start of the data stream
while len(input) > 0:
c = input.pop()
cms.process_element(c)
print(cms.get_freq('1'))
Great video! Thanks for the explanation. Just a side note: if 'e' is the most tolerented error rate, the width of the table is: w = 2 / e , and the height of the table is: h = log(e) / log(0.5). (ref: ieeexplore.ieee.org/abstract/document/6042851?casa_token=r2yECvKp2esAAAAA:4V0CffEkxHBXXe1rvs928EebsoOg79dKrxfIjhsjHMroV2t_-fiQrnDt1qmU2dqvV0NIHCkn4A)
@smithabs 2 месяца назад
Very good explanation
@nik15oc 6 лет назад ⁺¹
very nicely explained. This data structure is frequently used in the caches
@mohanishmedar5332 4 года назад
Perfect explanation! Good job
@mydodethailung395 Год назад
Great explanation !!
@dpsky666 2 года назад
Great explanation! Thank you!
@hidekazushidara320 6 лет назад ⁺⁴
video suggestion: microservice architecture vs traditional architecture
@TechDummiesNarendraL 6 лет назад
Sure
@李雨蓁-i4t Год назад
very good!!!!! Thanks a lot over 1000 times!
@satishvedpathak9972 5 лет назад
while (inpStream)
for hashFn = 1 to 4
HV = s1(hashFn(inpStream))
MAT(hashFn,HV) = MAT(hashFn,HV)++;
end for
end while
findFrequncey(inputSteam)
for hashFn = 1 to 4
freq = freq + MAT(hashFn,HV)
end for
return freq/hashFn
@abhishek_sengupta 4 года назад
Very nice explanation! Thank You!!
@vitaliano01 3 года назад
Thank you for the good explanation
@satang500 5 лет назад ⁺¹
Thanks a lot of your videos.
Suggestions:
1. your video can be categorized into two:
1) fundamentals and glossary ( which is building blocks to design other systems such as consistent hashing, caching, load balancing... )
2) system design ( such as design uber, facebook, youtube, netflix... ).
So how about creating two groups and put videos into these groups?
2. request of video lecture: proxy, load balancing, web crawler, server and slave architecture for availability
@harshav3902 3 года назад
Great Explanation. Thank you.
@satishvedpathak9972 5 лет назад
Updated: for hashFn = 1 to 4
for hashValue = 1 to n
MAT(hashFn,hashValue) = 0;
end for
end for
while (inpStream)
for hashFn = 1 to 4
HV = s1(hashFn(inpStream))
MAT(hashFn,HV) = MAT(hashFn,HV)++;
end for
end while
findFrequncey(inputSteam)
for hashFn = 1 to 3
if ( MAT(hashFn,HV) > MAT(hashFn++,HV)
freq = MAT(hashFn++,HV)
else
freq = MAT(hashFn,HV)
end if
end for
return freq
@Beadyboyworld 2 года назад
Good explanation 😎
@PrateekMehtaABDFAN 2 года назад
Great videos sir . I came here from the caching for distributed system video where you mentioned this as a use-case for calculating the frequency of data in eviction policies . You kind of dint mention that in the applications . So, CACHING can also be an application for the same .
@prateekthakur4060 6 лет назад ⁺²
This one is new for me,Thanks for sharing!
Also for video suggestion: How to design multiplayer gaming system
@TechDummiesNarendraL 6 лет назад
Sure, I will work on it
@jhnflory 2 года назад
Thanks...nice explanation!
@FengZhu-ym6qu Год назад
Thanks bro, that really helps!
@arvindaaswani1303 4 года назад
Nice explanation 👍
@apoorvasaxena4522 6 лет назад ⁺⁶
Thank you for this video... You did a great job... :)
@janina542 6 лет назад ⁺²
Nice explanation. keep it Up
@kumarc4853 3 года назад
thank you Sys Guru!
@mgayatri 5 лет назад ⁺¹
Very similar to the Consistent Hash ring solution you explained.
@satishvedpathak9972 5 лет назад
for hashFn = 1 to 4
for hashValue = 1 to n
MAT(hashFn,hashValue) = 0;
end for
end for
while (inpStream)
for hashFn = 1 to 4
HV = s1(hashFn(inpStream))
MAT(hashFn,HV) = MAT(hashFn,HV)++;
end for
end while
findFrequncey(inputSteam)
for hashFn = 1 to 3
if ( MAT(hashFn,HV) > MAT(hashFn++,HV)
freq = MAT(hashFn++,HV)
else
freq = MAT(hashFn,HV)
end if
end for
return freq
@mayankrathore7558 6 лет назад ⁺¹
Awesome algo and ofcourse nicely explained thanks . :)
@sebastiandanielripari6457 5 лет назад
Nice explanation. Great Job !
@jonarmani8654 2 года назад
Replace the counters with bloom filters, and you've got yourself a RAMBO.
@moshelevi5485 3 года назад
Amazing job!
@ThoRicHeLLi 6 лет назад ⁺¹
Keep up the good work!
@antrapurohit8010 2 года назад
Thank you for the explanation. Just have one doubt. Can we use this data structures to implement top K URLs accessed in last 1hr?
@rameshnraghavan 5 лет назад
Very nice explanation.. Thank you..
@chikudholu 6 лет назад ⁺¹
Great work!!
@vincentmax4571 6 лет назад ⁺²
Good tutorial.
@mahdiarghavani1804 5 лет назад
You describe it well.
Thanks
@pixelizedworld 3 года назад
my first comment here... just THANK YOU.
@meghadave9363 3 года назад
Can you please also post a video on concurrency management in database as well as code for situations like concurrent ticket reservation. Also what is the best strategy for handling millions of operation(update/insert) in a relational database scenario
@rozawar 3 года назад
Thanks for the video explanation. How to query top K using this algorithm? it shows object to count mapping. What about reverse?
@felipegutierrez7856 5 лет назад
very good explanation! thanks
@aprilfool302 2 года назад
You do awesome bro, but sometimes there is issue with sound, your voice becomes feeble. Keep it up
@Templesify 9 месяцев назад
awesome 👌
@ameynaik2743 3 года назад ⁺¹
Nice video, could you please elaborate on how you decided on the number of columns for count-min sketch?
@_ityadi 3 года назад
Good question. I was also wondering on that coz in reality the hash function might return a really huge value which we would have to fit into our window size. But how do we calibrate the window size, that is the question
@RandomShowerThoughts Год назад
@@_ityadi I don't think you can ever change them, but it sounded like you needed to take the max of all the hash functions
@Cosciug1234 6 лет назад ⁺²
video suggestion: file storage and synchronization service system design (dropbox, google drive)
@TechDummiesNarendraL 6 лет назад
@cos thanks alot, I will consider the topics
@TechDummiesNarendraL 6 лет назад ⁺¹
@Cos Sure, I will work on it
@arayanamit123 6 лет назад
@Tech Dummies , In this I what should be the max value my Hash function should return, or should we consider using modulo operation to keep the max value in check. what would be the impact of this max value on the collision.
@shortstories765 3 года назад
what is the advantage of count-min sketch over hashtable???
we can store the count in hashtable, same time and space complexity
@singaravelann3671 3 года назад
What should be the modules value for each hash function??
@sivamunnaluri2263 3 года назад
How you are concluded more hash function will have more accurate ?
@awaissindhu 4 года назад ⁺¹
how would we use this for trending data? for example, where we want to see the data, say, for the last 24 hours? do we subtract entries from these values that are older than 24 hours?
@saitrinathdubba 4 года назад
Concise !! Thank you
@zhongyuanzhang5611 3 года назад
if it is half of the space, it is still linear, sublinear means when the data grows, the difference will be bigger and bigger
@snr7131 3 года назад
space wont grow though, it will be a grid fixed width and height (height = no of hash functions, width= any arbitrary large number)
@大盗江南 4 года назад
Thanks bro! We love u!
@GSSMoney 3 года назад
Given a count-Min sketch - how to get the top K? Would be nice to mention what else we need to store in along with the count?
@tutsdriveonline2632 7 месяцев назад
Won't it be better if increase number of possible columns i.e. from 0-6 to like 0-100 instead of increasing the hasfunctions ?
@AnuragSingh7 3 года назад
If we use reservoir sampling with a fairly large sample size wouldn't we be guaranteed to find the true frequent elements?
I mean the frequent elements present in the sample would also be the frequent elements in the stream.
@chitthiaayeehai 5 лет назад ⁺¹
Super
@YT-yt-yt-3 2 года назад
Thanks. Helpful. However the video explains how count min sketch works but not why it works.
@Tasteofgyan 5 лет назад
Great job!! Thank you!
@saurabhgohil2232 5 лет назад
This is little bit similar to bloom filter. but instead of Boolean value we keep count in our sketch.
@AmolGautam 9 месяцев назад
Thanks
@yuenyiupang 2 года назад
that mean u need to know what you looking for, say i wanna know the freq of "A", can it apply to i wanna the top 20 freq item?
@DorinCiobanu007 6 лет назад ⁺²
How is CMS different from a hash map with no collision verification and with smaller than the data output space size of the hash function?
Example: hash(n) = n mod 17, int freq[17]
freq[hash("your input")]++;
@AmanSingh-nu6ng 5 лет назад ⁺¹
I don't think it is different, conceptually. CMS with one hash function is just hash map with no collision verification.
@usamahashmi7274 5 лет назад
@@AmanSingh-nu6ng It is different because you are not storing against each object. The total size of your data structure is the size of the sketch in this case. With hashmap, your hashmap will store all distinct elements which can be much greater than the size of your sketch.
@AmanSingh-nu6ng 5 лет назад
@@usamahashmi7274 no hash map with no collision verification will not store all the elements distinctly. If n keys will have same hash they will be stored as 1 key value. CMS with 1 hash function is same as hashmap with no collision verification. The key part is no collision verification.
@felipegutierrez7856 5 лет назад
Hi @Tech Dummies, which kind of data stream applications will benefit from ising count-min sketches?
@gakhov 5 лет назад ⁺²
Think you about the task of estimating heavy hitters or the most frequent events in an event stream. Events can come from the IoT devices or from airplane sensors, the source doesn't really matter, matters only the fact that there are a lot of such events of different types. If we need to find the most frequent events in a traditional way, we need to count all of them and maintain such counters for all of them, even for rare once. Instead, we can use Count-Min Sketch and maintain only a few counters with some small number of "promising" elements.
If you are interested in such data structures, check out my recent book "Probabilistic Data Structures and Algorithms in Big Data Applications" (pdsa.gakhov.com).
@prashidell8980 5 лет назад
Hi, I really enjoy your videos. Count Min Sketch is using basic idea of bloom filter, right?
@varunganesh2028 5 лет назад
Is it possible to evict entries from the countmin sketch? Isn't that an important criterion while handling streams? If I subtract for all the hashes on removal, will the "min" property be preserved?
@TechDummiesNarendraL 5 лет назад
I am not sure, but min property might be violated!! I will try to get more info and update you
@bharath_v 3 года назад
As you pointed out, if the hash value is same for both 'A' and 'B', then it might overcount, in such case we should increase the number of hash function.
@bhulakshmidevanamaina3001 4 года назад
Is it also called the kth min value sketch algorithm
@ghazalabd8890 Год назад
great!
@harsha.m4026 4 года назад
Quick question. How do we decide the sketch's size?
@carlaludgate6597 3 года назад
Thanks for this, great video! One question: why do you call it sublinear? Because it looks to me like all your operations are constant except for getting the min. So is the time complexity dependent on the amount of hash functions?
@yogeshverma9267 2 года назад
Sublinear is the relationship between data and memory. As the data grows, the memory required to store it does not grow in a linear fashion since we are using a fixed set of hash functions and hash outputs.
@AbhishekSharma-si8ui 4 года назад ⁺¹
AWESOME
@prakashmishra 5 лет назад
Hi Narendra,
The video was quite informational. But I am yet unsure how this is taking less space. How are we deciding the size of hash here? And ain't the taking the 5-6 hashes will be equivalent to taking single hash for maintaining count?
@arthursimeon2620 2 года назад
it was maybe not as clear as it could be with the small example size, but the idea would be that maybe you have millions of videos and you are counting likes, then you have a set of hash functions and then columns that are much smaller than the total number of videos, whereas a normal hash table would be linear with the number of the videos.
@sarthakuiit 2 года назад
I am not able to understand how does it take sub linear space if we have to maintain a reasonable amount of accuracy?
For 6x4 matrix it can’t store count of more than 24 elements accurately.
(that too when hash functions are working perfectly, leaving at least 1 count accurately in the matrix)
Then how is it different from a regular hashMap. There could be collisions.. yes.. but then also search time is LOG(N) for few of the elements which underwent collision.
But then, let’s consider cost for counting in CMS -
We need to use k hash functions to both put and get values with a risk of some inaccuracy.
So I fail to understand, what great CMS is achieving when we call it ‘sub linear’ data structure.
@chinna311983 5 лет назад
What if we use a hash function which uses string each literal ascii multiplied by index of the letter where it appears, and add all together, doesn't that guarantee the unique hash value all the time?
@thedankest1974 3 года назад
No, because of the pigeon hole principle. Remember the purpose of this data structure is to support counting frequencies of elements in sub-linear time, by trading-off accuracy. If we have unbounded strings mapping to a finite amount of space, there will always be a collision.
@агатакристи-г3ы 5 лет назад
Interesting, is there an improvement of such algo which dinamically increase the matrix size (horizontally or vertically or both) proportionally, say log(N) to keep collision probability near constant..
(I assume here the key space is also has infinite size.)
@yazicib1 4 года назад
Counting characters in a stream is not O(n) it is O(m) space where m is the number of characters in the alphabet. Hashing has no benefit here. You could have counted precisely and faster using just an array once cell for each character...
4:30 Also there is no data loss in compression... unless it is lossy image compression or something...
5:00 A good hash function shouldnt have too many collisions... if your character set is english alphabet, as i said above, you don’t even need a has function. There are great 64 bit hash function with amazing no-collision properties... you can always use twoof these to minimize even further...
@sumonmal009 3 года назад ⁺¹
THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sublinear 1:00
Why we need count min sketch 2:30
Count sketch 5:50 8:30 9:40
calculate frequency13:20 to 15:06
@SHASHALIBRA 2 года назад
Did not understand how executing 4 hash function can have lower time complexity than map. Hash functions will always be excited 4 times but a hash function can return the result in O(1). Somebody plz help me understand
@donyayasir2681 4 года назад
Please answer me .. how about B ?? why it is 2 (when i calculate it ) and in the stream appears only one time ??
@NoWarForever 4 года назад
collision
@flyingonsnow 6 лет назад
What kind of hash functions to use in count in sketch.
@TechDummiesNarendraL 6 лет назад
Here is the list: en.wikipedia.org/wiki/List_of_hash_functions
But the catch is to select the hash functions which has property of equal distribution.
@biboswanroy6699 4 года назад
Scalability vs Accuracy
@rajeshji7698 4 года назад
Can u do videos for how Facebook like button works ?
@rishabhgupta2513 5 лет назад
The frequency of B is coming wrong via this CSM.How can we get the most accurate solution?
@gauravdudeja 3 года назад
Just curious why it works?😳

Следующие

Автовоспроизведение

Bloom filter for System Design | Bloom filter applications | learn bloom filter easily