What are Bloom Filters? - Hashing

Поделиться
HTML-код
  • Опубликовано: 7 сен 2024

Комментарии • 75

  • @othmanee.h4738
    @othmanee.h4738 4 года назад +41

    Straight to the point, excellent content !
    Allow me to add a quick reminder for myself in case I forget and for others in case they want to look into the details, probability requires a bit of mental warm up sometimes to fully understand :
    A bloom filter can be seen as F = {0,1}^m
    For each word, hashing means setting k indexes in F to 1, such as F[ h_i(w) ] = 1 where h_i is the i-th hash function in { h_1, ... , h_k }
    We do this for all words in the dictionary/list to input
    Initially we put all values of F to 0, so F={0}^m
    Then we apply one hash like h_1(w) = some_index
    So now F[ some_index ] = 1
    Then the probability of finding a bit at one, is literally one in m since there is only a single 1 in F:
    p1 = 1/m
    So that the probability of finding a bit at zero would be the inverse :
    p0 = 1 - p1 = 1 - 1/m
    (**) So after applying all k hash functions to the first word the probability of finding a bit at zero becomes :
    p' =(1-1/m)^k
    Then after hashing all words (or until the iteration i of the list of words), the probability of finding a bit at zero will be :
    p=(1-1/m)^(k*i)
    So now the probability of getting a one for a random word w, p(F[h_i(w)]=1) is the inverse of p which is :
    q = 1 - p = 1 - (1-1/m)^(k*i)
    Therefore if we want the probability of getting a false positive = an error, we want the probability of getting k times the value 1 in succession.
    So that F[h_1(w)] = 1 AND F[h_2(w)] = 1 AND F[h_3(w)] = 1 ....AND F[h_k(w)] =1
    But we know from previous that p(F[h_i(w)]=1) = q = 1 - (1-1/m)^(k*i)
    Finally the probability of an error is
    e = p(F[h_1(w)]=1)*p(F[h_2(w)]=1)*...*p(F[h_k(w)]=1) = q*q*q...*q = q^k = ( 1 - (1-1/m)^(k*i) )^k
    ** In this calculation one could argue that after completing the first word, why wouldn't the probability of finding a zero be
    p' = 1 - k/m
    It's a valid remarque, we could say that after applying each h_1(w) to h_k(w) we've been adding a new 1 to F, so p1 = k/m
    While that is highly possible, since h is built so that each hash gives a different result, it wouldn't be mathematically correct , because that would violate the case where h_1(w) = h_2(w) for instance. (If you assume all other h_i(w) are distinct, then p' = 1 - (k-1)/m in that case)
    Then what if three hashes are the same, what if any random number n of hashes are the same, this formula would lead to incoherence.
    => My intuition to solve this question is that when you don't know how many ones are in your filter the experience of drawing a zero can be simulated as not drawing one for all possible values of h_i(w).
    So the probability of not drawing one (or drawing 0) for the first index h_1(w) is (1-1/m), then for h_2(w) it's the same (1-1/m) etc... until h_k(w)
    And so drawing a zero when you don't know exactly how many ones there are in your Filter is like not drawing one for the first hashed index, and not drawing one for the second etc... until not drawing one for the last hashed index.
    In fact the more I think about this the more I find it counterintuitive, I'm not 100% sure how to interpret this formula probabilistically, because when you say the probability of getting a 0 for a single hash, that's one operation, but when you say p = (1-1/m)^k that's the probability of multiple events happening in succession.

    • @Md_sadiq_Md
      @Md_sadiq_Md 5 месяцев назад +1

      I was writing blog on bloom Filters and thanks for this man 🔥

  • @salookie8000
    @salookie8000 6 лет назад +14

    Excellent comment you made that it doesn't guarantee string has been searched but it does guarantee it has not been searched.

    • @gkcs
      @gkcs  6 лет назад +2

      It takes some time to roll of the toungue :)

    • @shivas9911
      @shivas9911 6 лет назад

      Are you saying that with the right number of hash functions, the "cc" search should respond as not been searched?

  • @fiveyearclub6024
    @fiveyearclub6024 5 лет назад +7

    i got bloom filters in an interview. i had read a wikipedia article on it, but i just didn't remember enough to say anything about it. this video was clear! you're a rock star.

    • @gkcs
      @gkcs  5 лет назад

      Thank you 😁

  • @rajeevanand6870
    @rajeevanand6870 4 года назад +2

    Nice simple explanation of Bloom Filter. Only he got bit confused during 23:12-23:24. It's simply "1 divided by 1 is 1, and a minus 1 is 0... so on". Anyways good talk.

  • @abhaypatil2000
    @abhaypatil2000 4 года назад

    This is the best channel for a cs student. We have many youtube channels which solve competitive programming problems. But who will make up for rest of the concepts. Especially when the IIT professors are not so great😅😅

  • @PiyushSingh-vx7bx
    @PiyushSingh-vx7bx 4 года назад +2

    U r one of the best computer science RUclipsr bro

  • @mairasultan884
    @mairasultan884 6 лет назад +3

    A simple and easy to understand talk. thumbs up

  • @ThePujjwal
    @ThePujjwal 2 года назад +6

    17:32 Small tip.
    If sum of digits of a number is divisible by 9 it will be divisible by 9.

  • @sujitkumar-xs2wy
    @sujitkumar-xs2wy 6 лет назад +3

    one of ur''s best lecture ...@gaurav

  • @sagnikbilltrim4716
    @sagnikbilltrim4716 4 года назад +7

    I liked the video very much, except the fact u forgot "p" comes after "o" not "q" :P

  • @saaqibz
    @saaqibz 5 лет назад +6

    Great video and clear explanations, thanks! small note: wouldn't comparing two strings be a complexity of O(min(m,n))? since you don't really need to compare past the end of one string after the end of the other is finished.

    • @gkcs
      @gkcs  5 лет назад +1

      Yes, that's correct. I am assuming n to be the length of the smaller one :)

  • @jairajkhushalani99
    @jairajkhushalani99 4 года назад +1

    Great stuff Gaurav!!

  • @Atpugtihsrah
    @Atpugtihsrah 5 лет назад +10

    1/1 is 0 :D
    Bdw amazing video!
    Thanks!

    • @gkcs
      @gkcs  5 лет назад +1

      😁

  • @abhishek_sengupta
    @abhishek_sengupta 3 года назад

    Very nicely explained!! Thanks.

  • @priyankagrawal9333
    @priyankagrawal9333 2 года назад

    Amazing!! Thanks for sharing!!

  • @BRAJAGOP
    @BRAJAGOP 6 лет назад +5

    Good talk on bloom filter and liked the math part, would have loved if you solved the optimal part as well, that is something for me to look up ;) .

    • @gkcs
      @gkcs  6 лет назад +1

      That math was too complicated to talk about here :)
      Thanks!

  • @satviknema8629
    @satviknema8629 4 месяца назад

    having some trouble understanding this:
    Probability of finding that a bit is set after 'i' iterations of 'k' hash functions is
    1 - (1-1/m)^(k*i)
    Why is this value not simply
    (1/m)^(k*i)

  • @PrateekMehtaABDFAN
    @PrateekMehtaABDFAN 2 года назад

    Great video sir . Huge fan of your content .
    Quick question : Rather than Multiple bloom filter wouldn't it be better to implement count min sketch or do we have an advantage of using multiple bloom filters over count min sketch ?

  • @shreyageek
    @shreyageek Месяц назад

    no one explains a single question with bloom filter , but its actually asked in microsoft interview

  • @krishnapatni
    @krishnapatni 5 лет назад +1

    very informative and you made it simple to understand. can the bloom filter indicate that the string being searched is a substring of another string that was searched before ? for e.g. if matthew was searched before mat, then m a and t bits have already been set and this shows mat is a substring for some another string already searched..

  • @ayushverma3863
    @ayushverma3863 6 лет назад +1

    Well explained Gaurav. I was curious how companies like Facebook handle the false positives while using Bloom filters.

    • @gkcs
      @gkcs  6 лет назад

      Thanks Ayush!

    • @muffinproject
      @muffinproject 6 лет назад

      In the case presented here (caching a search) they probably wouldn't do anything about it. It's just used to cut down on unnecesary caching, NOT to definitively prove a string has already been searched.
      Let's say that implementing this generates a 'not in set' result 90% of the time, and has a 50/50 chance of being a false positive. Instead of caching 100% of cases, we now toss away 90% of them - saving equally as much storage space. Yes, we do save twice as much as is ideal, but we're saving a lot. Ideally we'd only save 5% but then we'd have to compare the actual string to a very very large array of previous searches which takes longer to compare (and, ironically, use up some storage space in the form of a larger string-array than the bloom array)
      Compare it a bit to the game 'guess who?' where you have a bunch of people and have to find out whoever your opponent picked. You could ask naïvely; is it Adam? Is it Bertha? Is it Cedric? Is it Diane? - but that takes a long time. You could instead ask; Do they have blue eyes? Are they male? Etcetera. Which reduces the possibilities a lot faster.

  • @himalayan006
    @himalayan006 5 лет назад

    Very informative, really enjoyed.

  • @randomizing-life7016
    @randomizing-life7016 4 года назад

    great video, great explanation on topic that im just starting to get familiar with. but i couldnt lay my eyes off you. first YT engineer lecturer that is handsome like hell.I know purpose of the video is educational and all comments are about your lecture but damn i had to comment your good looks too :)

    • @gkcs
      @gkcs  4 года назад

      Hahaha, thanks!

  • @ruhinapatel6530
    @ruhinapatel6530 5 лет назад

    My quest is that basically from what I understood..bloom filter tells u whether the string is searched or not..n if yes then it stores it in the cache..but isn't there a lot of overhead for the string search..and performance issues..

  • @siarez
    @siarez 5 лет назад +2

    Thanks, that's a great explanation!
    What does 'n' mean @23:55? Is that the size of our vocabulary? Why do we need it here?

    • @gkcs
      @gkcs  5 лет назад +2

      Thanks!
      N is the number of words we will be searching in the bloom filter. 😁

  • @jeffreyqiu5128
    @jeffreyqiu5128 3 года назад

    absolute fire

  • @vivekmit
    @vivekmit 3 года назад

    I am not sure about your derived probability formula for error rate in bloom filter but not sure whether the ordering of bloom hashing bit set.... like hashing 1,2 ,3 for any key is (1,1,3 ; 3,1,1; 3 3,1; 333, 111.....any combination of 1,3) all falls under same consideration and treated as key already being searched...I don't think in practice it is a justified in any system design component.

  • @aravindm9687
    @aravindm9687 6 лет назад +1

    great explanation!
    @13:10 how about incrementing the array index value from 1 to 2 to later find out whether the string is matched twice ? Instead of handling multiple arrays

    • @gkcs
      @gkcs  6 лет назад +4

      aravind kumar That's a good idea and while work most cases. However, the second filter could use a different hash function, making collisions even less likely.

  • @Kavishkhullar
    @Kavishkhullar 9 месяцев назад

    @gaurav (1- 1/m)^(k*i) is the probability of us not not setting a bit after running i string through k hash functions. subtracting that from 1 gives us the probability of us setting the bit to 1. Could you explain "the bymistake" part? how it that the probability of us by mistakly setting that bit?

  • @7301G
    @7301G 6 месяцев назад

    @11:42 I have 1 question if the string is "anas" and the other string is "sana" so sana name string is new but when we make it go through hash function we will get all its bits set to 1 but its never have been searched ?

  • @parvathysuresh2821
    @parvathysuresh2821 4 года назад +1

    ur expressions are cute and ur explanations also good bro...

  • @mattb9916
    @mattb9916 5 лет назад +1

    Excellent!

    • @gkcs
      @gkcs  5 лет назад

      Thank you!

  • @prarabdhgupta1117
    @prarabdhgupta1117 4 года назад

    @Gaurav Sen, That link for Cache in the Description is hitting to a dead end, If you can update it, it'll be helpful.

    • @gkcs
      @gkcs  4 года назад +1

      Done. It's the wikipedia page on caching.

  • @rembautimes8808
    @rembautimes8808 2 года назад

    Interesting idea, and perhaps can be paired with BitSet in Python

  • @shuvopaul3089
    @shuvopaul3089 5 лет назад +2

    Expression😘😘

    • @gkcs
      @gkcs  5 лет назад +1

      😛

    • @shuvopaul3089
      @shuvopaul3089 5 лет назад

      Sir ...you communicate really well...which makes it more interesting

  • @gudduagrawal8523
    @gudduagrawal8523 8 месяцев назад +3

    when i heard 1/1 = 0😶‍🌫

  • @pratyushranjan277
    @pratyushranjan277 5 лет назад

    Hi gaurav bhai..i am a developer but i want to make my programming skills better...is there any algo book or programming book you would suggest..or any other way to improve my logic and programming??
    Thanks

  • @snehakavinkar2240
    @snehakavinkar2240 4 года назад

    What if we have to check for millions of such words, do we just need to increase the size of bit array?

    • @gkcs
      @gkcs  4 года назад

      To reduce false positives, yes.

  • @lakshyaagarwal4044
    @lakshyaagarwal4044 4 года назад

    The video was great but if you had used Blue instead of Red it would have been better.

  • @GelinLuo
    @GelinLuo 5 лет назад

    Curious to ask when do you call the `reset()` function to reset your bloom filter that determines whether a string has been searched or not in facebook

    • @gkcs
      @gkcs  5 лет назад

      A threshold on false positives is a good idea. Over time, the bits in a bloom filter are all set.
      Here is a thread discussing some resets: github.com/ben-manes/caffeine/issues/85

  • @reactdeveloper2368
    @reactdeveloper2368 Год назад

    If anyone has worked with mongoose implemeting bloom filters for full text search it will be nice! Thank you in advance

  • @vijayantpawar3137
    @vijayantpawar3137 Год назад

    What is the need of using multiple hash functions?

    • @bite076_tania3
      @bite076_tania3 5 месяцев назад

      to avoid probability of collision (two strings having same hash value)

  • @sixfacesenthil
    @sixfacesenthil 4 года назад

    when k=3 ,17%,k=4 ,1.18.how error rate is 5.4% . k=20 ,5.4 %

  • @anshashakkeer9277
    @anshashakkeer9277 5 лет назад

    sir Can u plz explain how to find hash of a string .. i didnt get thru dis lecture i cant clear

  • @Amol175
    @Amol175 4 года назад

    Derived it with respect to what?🤔

  • @awadheshsingh7299
    @awadheshsingh7299 5 лет назад +2

    1/1 is 1 and not 0, made a blunder there lol

    • @gkcs
      @gkcs  5 лет назад

      I did? Damn 🙈

  • @youtuberfanz7887
    @youtuberfanz7887 Год назад

    Cat this guy become a teacher? It really suits him if this is what he wants.

  • @kumarc4853
    @kumarc4853 Год назад

    bhakt jano, bloom filter baad mei seekh leha, , Lekin pehle check karlo ki Cache mei kahi *AMOEBA* toh nahi hai?!

    • @gkcs
      @gkcs  Год назад

      Hahaha 😂