Python Hash Sets Explained & Demonstrated - Computerphile

  • Published: 1 Jan 2025

Comments • 164

  • @Computerphile  11 months ago +13

    Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp

    • @olivier2553  11 months ago

      I get the skyscraper, but I don't get the route thing.
      Oh, I got it now. The font is too small, I was misreading.

  • @Davourflave  11 months ago +258

    "O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D

    • @guinea_horn  11 months ago +17

      + c

    • @styleisaweapon  11 months ago +13

      in modern times, the constants can frequently matter most (when you can perform more than several hundred operations (3+ ipc) during that cache miss, for example)

    • @mina86  11 months ago +10

      @@styleisaweapon, yeah, which is why you should virtually never use linked lists even though they offer O(1) operations.

    • @antonf.9278  11 months ago +12

      @@mina86 Linked lists are only O(1) if you have a pointer to the position you want to access. Unless you are already traversing the list, you only have that for the first and last element.
      Appending to an array-based list is already amortized constant, and prepending is as well in array-based queues.
      In conclusion, linked lists are even worse.

    • @styleisaweapon  11 months ago +1

      @@mina86 I have never written code that used a linked list! For trees I most often use stop-bit encoding (heap-style implicit tree addressing), and when sparse I back it with a hash table instead of an array....

  • @ejc4684  11 months ago +89

    Always love a Mike Pound video!!!

    • @programorprogrammed  11 months ago +6

      You've been Pounded!

    • @ejc4684  11 months ago

      @@programorprogrammed Nice!!!

  • @Michael-sh1fb  11 months ago +44

    I see Mike Pound, I click

  • @ibrahimmahrir  11 months ago +32

    12:02 in the *___contains___* function there shouldn't be an *_else:_* before the *_return False_* at the end; otherwise, if the list *_self._data[idx]_* is not *_None_* and the item is not in that list, the return value won't be a boolean.

    • @williamli55  11 months ago +3

      Good point. It should allow both of the above if statements to fall through, and simply return False as a default. It will still work for __contains__ though because None is falsey. It will evaluate to false when using the "in" operator. It's not a consistent return type though, so yeah, he would probably be better off just removing the else statement.

    • @vadrif-draco  11 months ago +4

      While that is correct, it seems that the *___contains___* special function returns *False* by default if you don't return anything yourself (I tried it on an online interpreter). So, there shouldn't be an *else:* statement or a *return* statement at all. If you go to 15:52, it seems to have worked even though it was the exact scenario you described: the list at the specified hash value does exist ( is not *None* ), namely the list of items whose hash value is *999999* ; however, the item itself ( *999999* ) is not in that list, as it had just been removed.

    • @Ceelvain  11 months ago

      Thanks pep8. :D

    • @jamesarthurkimbell  11 months ago +4

      @@vadrif-draco Any Python function returns None by default, and None is falsey (i.e. it works with conditionals the same way that False does)

    • @michaelpound9891  11 months ago +6

      Great spot! You’re absolutely right, I messed up the implementation :) even if technically it works this time, the code isn’t super clear due to the bug, correcting it would also make it more readable I think.
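
      For reference, a minimal corrected version (a sketch only, assuming the video's class keeps its buckets in self._data and computes bucket indices with a _hash helper):

      def __contains__(self, item):
          idx = self._hash(item)      # bucket index for this item
          bucket = self._data[idx]
          if bucket is not None:
              return item in bucket   # explicit True/False from the bucket list
          return False                # empty bucket: the item is definitely absent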

  • @gustafsundberg  11 months ago +18

    I implemented this in ADA back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.

    • @halfsourlizard9319  11 months ago +4

      Extremely surprised that people were still using ADA that late in an educational context -- would have suspected that 'trendy' and 'new' Java would have been on offer.

    • @tracyrreed  11 months ago

      @@halfsourlizard9319 Java was released, brand new and buggy with no support or books yet, in 1995. Two years later is barely enough time for books to be written and schools to re-orient their curriculum to a new language. I used Java back then and it was very rough and didn't work well at all.

    • @gustafsundberg  11 months ago

      @@halfsourlizard9319 ADA was used for algorithms; C was used for OS, networking, etc. We were using Sun workstations, and I believe Java became available in 1996. Universities are probably not so fast in switching languages.

    • @centerfield6339  10 months ago

      @@halfsourlizard9319 I went to uni in 2000 and they were just introducing Java.

  • @MichaelKingsfordGray  10 months ago +26

    I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things:
    1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically.
    2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
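
    Point 1 can be sketched with the standard bisect module (illustrative only; it assumes the keys in a bucket are mutually orderable):

    import bisect

    def bucket_contains(bucket, key):
        # the bucket is kept sorted, so membership is a binary search: O(log k)
        i = bisect.bisect_left(bucket, key)
        return i < len(bucket) and bucket[i] == key

    def bucket_insert(bucket, key):
        # insert at the position that keeps the bucket sorted
        i = bisect.bisect_left(bucket, key)
        if i == len(bucket) or bucket[i] != key:
            bucket.insert(i, key)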

    • @anon_y_mousse  10 months ago

      Keys are unique in a hash table, and generally immutable. Even if you're thinking of a multimap-style hash table, you still only use one instance of a key. In such a case, a key/value pair would be best, with the value side of that equation being a balanced tree structure instead of the usual array.

    • @DanieleCapellini  9 months ago

      1) is debatable and depends on your number of insertions vs lookups, since if you have a lot of insertions, sorting every time there's an insertion collision can be more expensive than searching over an unsorted list every time there's a lookup collision

    • @anon_y_mousse  9 months ago

      @@DanieleCapellini If you have a lot of collisions, then it being sorted or not will be the least of your problems. However, in a tiny array, say with 16 or fewer elements, then sorting can be done fastest with an insertion sort algorithm and it'll be faster to lookup and eat less memory as an array. Of course, if you don't resize your table adequately and collision chains back up to an insane degree, then you'd be better off with a tree data structure all around.

  • @trevinbeattie4888  11 months ago +6

    Side note: if your list of values is static and known in advance, the GNU “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.

  • @Loki-  11 months ago +11

    I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol

  • @Richardincancale  11 months ago +11

    Implemented exactly this simple form in a commercial compiler around 1980 to store the symbol table (the list of all identifiers defined in the program being compiled, their types, sizes, etc.). Chosen for lookup speed, as the symbol table is accessed frequently in the compilation process.

  • @cinderwolf32  11 months ago +1

    The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!

  • @bensmith9253  11 months ago +3

    I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁

  • @prkhrsrvstv1  11 months ago +18

    Bloom filters could be a good follow-up to this.

    • @Checker8763  11 months ago +4

      And the HyperLogLog

    • @styleisaweapon  11 months ago +2

      yes bloom filters are good, but first you gotta cover how to do universal hash functions (and therefore can create an arbitrary number of them) .. I think Erik Demaine covers it well enough but his stuff is lecture format

    • @styleisaweapon  11 months ago

      @@Checker8763 any probabilistic counting .. no need for the added complexity of HLL

  • @barneylaurance1865  11 months ago +3

    The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.

    • @vlc-cosplayer  8 months ago

      "Arguably a bit too widely used sometimes" - There's a name for code that overuses primitive types, "stringly typed" 😆

  • @JaapVersteegh  11 months ago +10

    Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging

  • @exchange4918  11 months ago +1

    Thanks for your videos Mike keep it rolling 🎉

  • @KipIngram  9 months ago +1

    It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things.
    Of course, that 80% full table gives you a 64% chance of colliding on your first two tries, a 51.2% chance of failing on the third probe, a 41% chance of failing on the fourth, and so on. The average length of the chains you wind up with goes up sharply as you push up utilization.
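
    The arithmetic above is easy to check (a small sketch; "failing" means every probe so far hit an occupied slot, assuming a uniform hash and an 80%-full table):

    load = 0.8
    for probes in range(1, 5):
        # chance that the first `probes` slots examined are all occupied
        print(f"{probes} probe(s): {load ** probes:.1%}")
    # expected number of probes until a free slot at this load
    print("expected probes:", 1 / (1 - load))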

  • @jfftck  11 months ago +4

    I watched a talk on Python dictionaries; the guy who worked on the new implementation went into detail about how they are more closely related to databases than hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.

    • @PhilippeCarphin  11 months ago +1

      That sounds really interesting, might I trouble you for a link to this talk?

    • @rngwrldngnr  9 months ago

      @@PhilippeCarphin Probably Modern Dictionaries by Raymond Hettinger. It's on YouTube. It's from 2016, so it's possible it no longer represents the state of the art.

  • @jonny__b  11 months ago +5

    Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.

  • @cacheman  11 months ago

    13:30 I get that it's easier to generate numbers, but I think pedagogically it makes much more sense to use strings.
    import random
    with open('/usr/share/dict/words') as f: words = [w.rstrip() for w in f]
    print(random.choices(words, k=10))

  • @skittles6949  10 months ago

    I would love to see a follow up video for dictionaries/hashmaps :)

  • @DjoumyDjoums  11 months ago +1

    To add to the video, a few key points about hash sets and maps:
    - Because of the hash and modulo, the contained items cannot be ordered in any way (sorted hash sets and maps exist, but are more costly).
    - Looping through the items is a lot slower than looping on a simple array, because of data locality.
    - The target load threshold is often 70% for optimal performance, meaning the capacity is about 50% greater than the number of contained items. This impacts memory consumption in large sets/maps, or when you have a lot of them.
    - To maintain the load threshold, the hash set may resize itself (increase its capacity), which requires a full rehash of the items, because the capacity on which we use a modulo has changed. This can be very costly, and is why adding an item to a hash set is said to be O(1) amortized (= is O(1) unless you hit a resize threshold)
    - Hence, specifying the initial capacity when you know it is a very good practice, and avoids any rehashing
    - From what I've seen, the best values for capacities are either prime numbers or powers of 2. .NET uses prime numbers, for example.
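
    A minimal sketch of the resize/rehash behaviour described above (illustrative names and thresholds; real implementations are more careful):

    class ResizingHashSet:
        LOAD_THRESHOLD = 0.7

        def __init__(self, capacity=8):
            self._buckets = [None] * capacity
            self._count = 0

        def add(self, item):
            # grow before the load factor crosses the threshold
            if (self._count + 1) / len(self._buckets) > self.LOAD_THRESHOLD:
                self._resize(2 * len(self._buckets))
            idx = hash(item) % len(self._buckets)
            if self._buckets[idx] is None:
                self._buckets[idx] = []
            if item not in self._buckets[idx]:
                self._buckets[idx].append(item)
                self._count += 1

        def _resize(self, new_capacity):
            # every item must be rehashed: its index depends on the capacity
            old = self._buckets
            self._buckets = [None] * new_capacity
            self._count = 0
            for bucket in old:
                for item in bucket or ():
                    self.add(item)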

    • @gdclemo  11 months ago

      You can store the original un-modulo'd hash value with the key. This speeds up resizing at the expense of some RAM.

    • @kristik432  11 months ago

      Yeah, Python dicts maintain insertion order by also keeping a list of keys behind the scenes.

    • @DjoumyDjoums  11 months ago

      @@kristik432 Since 3.7 yes, they're now ordered, but as I said this is more costly than keeping it unordered.
      Probably not a big deal here anyway, because when you care about performance and memory consumption, Python isn't exactly the first candidate 😅

  • @cannaroe1213  11 months ago

    At 7:00, I used to explain it this way, but I came to the understanding that it's simpler to say: imagine there's a second hashing function, more random than the last lol, whose 32-bit hash gets stored at the location of the first hash instead of the object. Now it's a fixed-length structure, you get counting/fingerprinting of fingerprints, and you don't need to make an object. Well, in Python you do.

    • @cannaroe1213  11 months ago

      or just a 64-bit hash that you split into halves, quarters, etc.

  • @olivergurka2456  11 months ago +1

    Hi, could you do a follow-up video on the characteristics of a good hash function for hash tables, and on how such functions are evaluated?

  • @martinparidon9056  10 months ago

    Very interesting, thank you!

  • @princevijay871  11 months ago

    Just what I wanted right now. Thanks a lot :)

  • @casperes0912  11 months ago

    I did that skyscraper puzzle as my first algorithm challenge at university

  • @Ceelvain  11 months ago +2

    A follow-up on the amortized complexity would be nice. Because it's a bit disingenuous to call the hashmap insertion O(1) if the underlying table doesn't grow. ^^

  • @DaniErik  11 months ago +3

    If you use a naive hashing algorithm, I suppose a nefarious actor could engineer a stream of data where each entry collides. This could presumably be used in a DoS attack.

    • @whisperpone  11 months ago +1

      Hash collisions are used in quite a few different attacks; there is no need to correctly guess someone's password (for example) when you can just guess any value whose hash matches that password, increasing the chances of success for the hacker.

  • @KipIngram  9 months ago

    You do also need to compare the object that's sitting there in your hash table with your search spec - if a lookup of, say, foo hash-collides with a prior insertion bar, well, those don't match. You need to report that foo isn't in the table - not report bar as the successful search result.

  • @mstieferman  4 months ago

    Can you add sections to videos when different subtopics are covered?

  • @seapearl3175  8 months ago

    If the hashes of 2 objects are the same and they end up in the list for that hash, how would I obtain the correct object from it later?

  • @billywoods1688  11 months ago

    Hi Mike, Sean and all, thanks for the video! I have a question about the claim that lookup in a hash table is O(1), at least as implemented in the video. Suppose the capacity of your hash set is c, and the number of data points you have is n. Then, to look up x, it takes O(log(c)) operations to find the list corresponding to the hash h(x), and O(n/c) operations to search through that (perhaps unsorted) list for x, plus maybe some small overhead for the hashing function -- that is, O(log(c) + n/c) in total, right?
    If so, then when n is much larger than c, the resulting complexity is still O(n). When n is smaller than log(c), we have O(log(c)), but that isn't actually any better than O(n). So it seems that the useful case is when n and c scale *together* such that log(c) < n < c. In this case, it takes O(log(c)) operations - but in this case O(log(c)) is no better than O(log(n)). Am I misunderstanding something? 😅 Thanks!

  • @LAGHRISSIAdnane  9 months ago

    Thank you Mike

  • @jpopelish  11 months ago

    As a school boy I heard about casting out 9s, but never learned what it was. I'm now 74 and have finally learned about casting out 9s. I now have a pastime of casting out 9s from every license plate I stop behind. So I could use the remainder produced by casting out 9s as a hash to store all these license codes into 9 sub-lists, instead of putting all the codes into one big list. Have I got this right?
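
    That is the idea, yes: casting out 9s is just the digit sum mod 9, so it works as a 9-bucket hash (a tiny sketch with made-up plates):

    def cast_out_nines(plate):
        # sum the digits on the plate and take the remainder mod 9
        return sum(int(ch) for ch in plate if ch.isdigit()) % 9

    sublists = [[] for _ in range(9)]
    for plate in ["ABC 123", "XYZ 907", "DEF 451"]:  # hypothetical plates
        sublists[cast_out_nines(plate)].append(plate)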

  • @RonJohn63  11 months ago +1

    Big question: how do Dictionaries auto-expand?

    • @goaserer  10 months ago +1

      Usually the whole internal data structure gets rehashed into a new hash table of increased size. Since capacity plays a role in where entries get stored, copying everything into a new hash table under the hood is usually the only consistent option.
      Doing this can cost some time, especially on bigger structures, which is why pre-sizing dictionaries and sets is recommended in most languages.

    • @RonJohn63  10 months ago

      @@goaserer Does pre-sizing dictionaries exist in Python? I don't remember seeing it last time I looked.

    • @goaserer  10 months ago

      @@RonJohn63 For Python's standard dicts specifically, I think you are right.
      An option (or workaround) would be to use dict.fromkeys, which prevents rehashing but probably only makes sense from a performance standpoint if you have all the keys in a separate list anyway.
      Well, since a dict only makes sense if data is read more often than written, it's probably not worth losing too much sleep over the rehashing. While it is not free, it's still an in-memory operation.

  • @SurjitSingh-n4e  21 days ago

    Perfect !!!

  • @Alchemetica  11 months ago +1

    Oh, I must not have watched an episode for a while, as that is the first time I have seen a sponsor acknowledgement. Is Mike building an AI supercomputer in his garage?

  • @Vixikats  10 months ago

    Constant time does not mean "fastest time". Because of their overhead, most constant-time operations are significantly slower than even checking an array one element at a time, unless the data structure gets beyond some breakpoint in size. You might need a million elements in an array before a hash set starts to be the fastest solution. This is why profiling your code is always necessary.

  • @sledgex9  11 months ago

    Both the add() and contains() functions have a code branch that doesn't return a value if reached, i.e. when the first if is true but the nested if is false. However, Python has implicit returns. In this particular case, contains() will return None, which is evaluated as False in a boolean context and works by accident. Not sure how this affects the add() function.

  • @SangaPommike  11 months ago

    There's an issue with lines 26/27 on the HashSet. If there's something in the bucket for the object other than the object that's queried, the method will return None. Now, this will of course evaluate to False rather than True, but it's not really the expected value.

  • @idogendel  11 months ago

    ...but how do you implement a true O(1) "in" operator, unless you allocate enough memory in advance for all possible hash values? That's a serious price to pay for the performance.

    • @digama0  11 months ago +1

      You are supposed to resize the array when the filling factor becomes larger than a fixed constant like 0.75.

    • @idogendel  11 months ago

      @@digama0 ok, so it's still a speed/memory tradeoff. I come from an embedded background, where such shenanigans are not welcome 🙂

    • @digama0  11 months ago

      @@idogendel I guess the embedded version of that is to just crash or start dropping items once the filling factor gets too large

    • @idogendel  11 months ago

      @@digama0 My guess would be a crash 🙂 I wonder how many python programmers add explicit safeguards against out-of-memory errors, instead of just hoping the computer has enough RAM.

  • @harold2718  11 months ago

    Please do a video on LLL, maybe show Coppersmith's attack on RSA, maybe knapsack

  • @jsomhorst  11 months ago

    I'm wondering, when using objects instead of integers, how would you handle collisions?

    • @Grimlock1979  11 months ago

      A hash is just an index in an array. In the end, it has to be an int.

  • @aikumaDK  11 months ago

    I understood some of the words Mike used

  • @williamli55  11 months ago +1

    Recently, I needed ("wanted" would actually be a more fitting word) to create a class where I would be able to map multiple keys to the same values (i.e. multiple keys with different hashes reference the same memory address). It was an absolute nightmare to get right, and it likely wasn't worth the effort, but it was a fun experiment which produced something which was actually more useful than I anticipated. I'm really fascinated by hash maps. Dictionaries are by far my favorite data structure in Python, especially now that they're ordered and support set operations on their views.

  • @tomkmb4120  20 days ago

    At the end of the video I thought it was Dr Pound putting on an Australian accent

  • @adambomb1337  11 months ago

    If two strings get hashed to the same key, and each key gets an .add(obj), how does it know which obj to pick when I .get(oneOfThoseStrings)?
    edit: nvm, I just realized this is a set, hah. I'm curious then how hashmaps resolve this. Looking forward to that one!

    • @digama0  11 months ago

      After identifying the right bucket, you just compare the key to every key in the bucket using ==. In the hashset code this is handled by delegating to the contains function on lists.
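
      A sketch of that lookup for a hash map, where each bucket holds (key, value) pairs (illustrative names, not the video's code):

      def map_get(buckets, key):
          bucket = buckets[hash(key) % len(buckets)]
          for k, v in bucket or ():
              if k == key:          # full equality check resolves collisions
                  return v
          raise KeyError(key)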

  • @xv0047  11 months ago +1

    I wish I could explain concepts like this.

  • @steveHoweisno1  11 months ago

    Why is there not the following bug? In the __contains__ method, suppose the list is nonempty for a given idx but the element is not there. Then the function returns None, since the return keyword is not encountered. I suppose when you call a in b for some object b with a __contains__ method, if it returns None that is interpreted as False.
    Still seems sloppy. I would prefer if you wrote return i in **list**

  • @gidi1899  11 months ago

    Is there a "recursive hash" method?!

  • @dgo4490  11 months ago +2

    The downside is it requires around 50% data overhead to retain good performance characteristics.

  • @Little-bird-told-me  9 months ago

    It's sunny in the UK

  • @betterinbooks  11 months ago +1

    It's not just Python hash sets but hash sets in general.

  • @4115steve  11 months ago

    Why aren't hash sets used inside sorting algorithms to reduce insertion time?

    • @phiefer3  11 months ago +2

      Probably because hashsets sidestep the concept of 'sorting' altogether. They don't care about what 'order' things are arranged in; instead they just have a designated place for any possible object. So when you give them an object, either for insertion or searching, they don't need to worry about searching for that object: they just check its proper location, and it's either there or it isn't.
      Hashes aren't really useful for sorting because they're not sorted in any meaningful way (unless you are just working with integers). Instead, objects will be arranged in what appears to be a random distribution according to the hash function that you're using. There's not really a straightforward way of extracting the contents of a hashset into a sorted structure; you'd probably have to literally just search the hashset for every possible object and then append any that you find to your list. For example, if it was a list of strings, you'd need to do a search for every possible string that could be made (you can't just search for the strings from your list, because you'd need to search for them in order, which would require that you already have them sorted). Basically, it'd be like going through the dictionary one entry at a time in order to alphabetize a list.
      Hashsets are primarily used for datasets that either don't have a natural way to order them, or where their order itself doesn't matter, but you still want a fast way to access each element. As mentioned, a software library is likely implemented with a hashset or something similar. Sure, you could store the contents of the library alphabetically, but there's not really anything gained by knowing where each element is in relation to the others; the order does not matter. So by using something like a hash you reduce the time it takes to find what you need from O(log n) to O(1) without losing anything useful. If you actually care about the 'order', then chances are that a hashset isn't right for the job.

    • @4115steve  11 months ago

      @@phiefer3 So you can't use an insertion sort algorithm that also includes a hash for things that have already been sorted?
      So if I were sorting things alphabetically with insertion sort, it wouldn't make the sorting algorithm faster to have what has already been sorted sit alongside a hash table. I was thinking I could use both, a hash table of abc's and the sorting algorithm, so the sorting algorithm didn't have to go through each letter; it could just use the hash table to go to the starting letter and then continue sorting.

    • @phiefer3  11 months ago

      @@4115steve That's not how hashsets work. The objects in the hashset are not in any meaningful order. There is no concept of the "start" of the hashset, or the "start of a letter" in the hashset. Nor is there a concept of what comes "next". Objects in the hashtable are effectively randomly distributed according to the hash function. If you were to store the words 'dog' and 'dogs' in a hashtable they would not be anywhere near each other (well, they could be, depending on the hash function, but even then it would be just coincidental).
      You could do something like take your already sorted list, and then make a hashtable where you store their indexes keyed by the strings themselves, i.e. you search the hashtable for 'dog' and it gives you the index where dog is stored in the list. This would in theory allow you to search the list in O(1) time, but it would not really help that much with insertion or with sorting. The act of actually moving things around in the list in order to add, remove or sort would still require the same complexity as normal; the hashtable wouldn't help with that, and every time you alter the list you would need to rehash the ENTIRE list to update it with the new positions of all the objects.
      What it sounds like you're actually talking about is having a list of lists, or a 2-dimensional array, where you can use the first letter of a word to first identify which sublist to search. However, this would still have the same big-O complexity as a regular list, since in general the size of each sublist would be roughly proportional to the total number of objects. And I suppose, if you squinted really hard, you might be able to call this a very simplistic form of hashing, but in practice it's not really.

    • @trevinbeattie4888  11 months ago +1

      @@4115steve What you're describing kind of sounds like a Bucket Sort, but there's no need for hashing in that case; you would just define the ranges of values for each bucket so the buckets are sorted in advance, then walk through the buckets in order to get the sorted items. Most hashing functions (other than binning) are inherently unsorted, so you would not use this in any sorting algorithm.

  • @gregoryfenn1462  1 month ago

    It's funny how you use a hash map to define a set, which could later be made into a dictionary. In Python, a set can be thought of as a dictionary where every value is constant (probably True). This is because the keys of a dictionary form a set, mathematically.
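
    (CPython's built-in set is actually its own C implementation rather than a dict, but the conceptual equivalence is easy to demonstrate:)

    items = ["a", "b", "c"]
    as_dict = {k: True for k in items}   # a dict standing in for a set
    print("b" in as_dict)                # membership works the same way
    print(set(as_dict) == set(items))    # the dict's keys are the set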

  • @halfsourlizard9319  11 months ago +4

    The difference between 'O(1)' and 'doing precisely a single operation' seems to be a subtlety that evades people ... I can't say how many dev candidates I've interviewed who've said that their 2-pass algorithm had 'O(2n)' perf -- that's not a thing, and it suggests that you've completely misunderstood asymptotes / limits.

    • @styleisaweapon  11 months ago +1

      What do you think of dev candidates that give an estimated expected latency instead? People that specialize in optimization barely consider big-O, because they are coding for a known target architecture, and what they care about is the latency, not silly metrics that don't translate well to real-world perf.

    • @halfsourlizard9319  11 months ago +1

      @@styleisaweapon Depends very much on the use case -- absolutely that'd matter way more if it were some low-level hard-real-time system or a high-volume trading system or something ... and you're quite right -- even for say a frontend thing (and this is way outside of things I'm familiar with), I know that good devs grab flame graphs and are trying to cut out needlessly expensive ops in their JavaScript -- which is probably 5 levels of abstraction away from bare metal. But, I'd still expect someone with a comp sci background to be conversant in asymptotic complexity -- even if just to discuss why it misses things ... and why writing a quadratic solution to a thing that could be linear is probably (although not always) not a great idea. All abstractions ignore things; some abstractions are useful.

    • @hanifarroisimukhlis5989  11 months ago

      Add difference between actual O(1) and amortized O(1) (and how that amortization can lie to you).

    • @halfsourlizard9319  11 months ago

      @@hanifarroisimukhlis5989 Oh, yeah, that's super important for a lot of things that come up in practice with resizable data structures 👍👍

    • @halfsourlizard9319  11 months ago

      (Fav case that I see a lot is: People doing mid-sequence insertion into something that's backed by an array ... and they just assume that it's efficient ... With good people, asking 'so, how would you implement insertion in a contiguous-mem-backed container?' will get them to realise it ... but I've had candidates refuse to believe that it's not const time.)

  • @403digital  6 months ago

    Did A LOT of hash sorting in college..

  • @winstonsmith1977  11 months ago

    I am quite new to Python. One issue with hash() is that 'salting' is added for strings and some other types. Can someone elaborate on this? Thanks in advance!

    • @antonf.9278  11 months ago +2

      Salting means putting known extra data into the hash function. This is normally done with random data when hashing passwords, so that multiple users with the same password have different hashes. (For a better explanation, look up the "how not to store passwords" video with Tom Scott.)
      I don't know about Python, but I could see reasons for a dynamically typed language to salt data with its type. This would avoid collisions between objects of different types but the same representation.

    • @HTH565  11 months ago +2

      Python generates a random value on startup and combines it with your string when you call hash(). This is intended as a security measure to prevent attackers from forcing collisions. For instance, if you write a web application that uses a hashmap somewhere, and the keys of the hashmap can be influenced by the web clients, then without salting an attacker could perhaps send specific requests that cause lots of collisions, and suddenly your web process would consume 100% CPU walking those collision chains. If the attacker doesn't know the salt, however, it is much more difficult to do that.
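
      You can see the salting from a shell (an illustrative session; only str/bytes hashes are randomized, and PYTHONHASHSEED pins the seed):

      $ python3 -c "print(hash('hello'))"   # some value
      $ python3 -c "print(hash('hello'))"   # a different value on this run
      $ PYTHONHASHSEED=0 python3 -c "print(hash('hello'))"   # now repeatable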

  • @mikefochtman7164  11 months ago

    Always fascinated by hash-related things. Perhaps discuss actual 'dictionaries' that use actual strings? Perhaps you've already done this in another video; sorry if you have and I just missed it.
    We had a system that tracked all the variables in a system of over 250000 lines of code. The 'variable database' used a hash on each variable name (the code had on the order of 50000 variables) and stored in each hash location a short list of all the source-code modules that contained the variable. I know today most folks would just 'grep' for things like this, but at the time (early 80's) this was a far superior way to track variable usage, since direct searches were slower.

    • @williamli55  11 months ago

      That sounds difficult to maintain. How did you get the data into the database? Did you have a process that routinely scanned the source and updated the database?

    • @mikefochtman7164  11 months ago

      ​@@williamli55 It was only half of the database. The other half maintained a 'reverse list' of all variables used in a module. So when recompiling a module, the first step used second set to 'remove' the module name from each variable's list of places it was referenced.
      Then after the module compiled successfully, a 'post compiler' step would parse the source file looking for all variables now referenced by it and add the module name to each of those variable's entry.
      Because it used its own data to keep itself up to date (the reverse list to know what to delete), it did occasionally get corrupted. We'd live with it a bit, but at some point we'd just 'recompile the world' overnight to restore everything.
      (sorry I've left out a lot of the details, but hopefully you get the 'gist' of it)

    • @williamli55  11 months ago

      @@mikefochtman7164 Okay I see. Parsing after compilation makes sense, plus it sounds like the 'post compiler' was only scanning what it needed. Thanks for the explanation!

  • @mrlithium69  11 months ago +2

    The differences between Hash Set and Hash Table in different languages can be confusing

    • @JoQeZzZ  11 months ago

      A hash set is just that: a set, so it can only contain a single copy of each item. A hash map then maps each item onto a value. This is also called a hash table, because a mathematical table is another way to display a surjection.

    • @Grimlock1979  11 months ago

      You could also say that a Hash Set is a Hash Table where a key is the same as the value.

  • @aidanthompson5053  11 months ago

    Tomrocksmaths | hash map 8:26

  • @KangoV  7 months ago

    Awesome. So the takeaway is that Python is slow ;)

  • @_aullik  11 months ago

    "if (boolean expression) return True else return False" is really strange code.

  • @anon_y_mousse  10 months ago

    When implementing a hash table, you should use a table size that's a power of two so that you can do a `bitwise and` on the hash code to index the table with, but also you should store the unnormalized hash with the corresponding keys. Everyone always recommends using a prime number for the table size, but that's the wrong approach. You can always make the hashing function more complicated, but the indexing should be as simple as possible. In case you're wondering why you should save the unnormalized hash code, that's so that you can resize the table as necessary when too many or too few collisions occur and rebuild the table without further querying the hashing function. There's a lot of balancing to be done in implementing a hash table, regardless of whether it's a set or map, but these two things should always be a part of the implementation.
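
    A sketch of the power-of-two indexing described above, with the unnormalized hash kept alongside the key as suggested (illustrative names):

    CAPACITY = 16                 # must be a power of two
    MASK = CAPACITY - 1           # 0b1111

    def slot_for(h):
        # bitwise AND replaces the modulo: h % 16 == h & 15 for Python ints
        return h & MASK

    key = "some key"
    entry = (key, hash(key))      # store the full hash with the key for cheap resizing
    idx = slot_for(entry[1])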

  • @gameeverything816  11 months ago

    Neat

  • @tonywakuu  11 months ago

    👏👏👏👏👏

  • @the7Cofficial  11 months ago +1

  • @chaoslab  11 months ago

    "a few moments later..."

  • @mikelwrnc  11 months ago

    Ease up on the skepticism that AI is AI guys. I’m a PhD cognitive scientist & statistician with moderate experience with the latest AI approaches and I’ve come to the assessment that what’s out now warrants the moniker “AI”. Previously we could look at a shallow net or forest or linear regression and assert confidently the limits thereof that made them feel qualitatively different from human information processing. Not so with the latest approaches.

  • @nHans  11 months ago +1

    _"... once the dater is sautéed ..."_ From _1,000 Tasty Ways to Enjoy Your Date_ by Dr. Lecter.
    (It's okay. As an Indian, I'm allowed to make fun of other people's English accents. It's on a reciprocating basis.)

  • @pavelnoryk3823  11 months ago +1

    The main thing you didn't explain is why getting an element from a set is O(1), because you said it's just looking for the element's hash in your collection. But that could be done in different ways, with different complexity. E.g. "el in [...]" is also looking for an element, but it's O(n).
    It's O(1) because you know:
    - the structure you created stores only ints, because of the hash function's implementation
    - the fixed byte size of any int on your machine, and that the elements are stored sequentially in a single block of memory
    If you have the address of the first element, and you know all the elements are the same size, you can get the address of the n-th element by adding n times the fixed int size to that address.
    For example (super simplified), you have an array with 4 int elements. An int takes 4 bytes to store = 32 bits. When you create it, memory is allocated for these 4 elements, 4 bytes each. The 1st is stored at 0x0000, so the 2nd element's address is 0x0000 + 4 = 0x0004.
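
    The address arithmetic in that last paragraph, as a sketch (addresses and sizes are illustrative):

    base = 0x0000        # address of element 0
    item_size = 4        # a 4-byte int
    for i in range(4):
        print(f"element {i} lives at {base + i * item_size:#06x}")
    # element 1 is at 0x0004, element 2 at 0x0008, element 3 at 0x000c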

  • @lasdernas  11 months ago

    corrected title: Hash Sets Explained & Demonstrated Using Python

  • @guanche011  11 months ago

    I'm probably going to get hate for this, but computer scientists are so bad at choosing variable names. And I've seen this so many times... In an interpreted language it doesn't make a difference, and still they insist on max 3 letters 😅 (I'm saying this with a wink btw)

  • @robertnabholz2062  11 months ago

    dark mode please!

  • @nowster  11 months ago

    All this without mentioning the standard Python "set" type which provides this functionality (and more). [Edit: as mentioned below, I was wrong about this]

    • @Jester01  11 months ago +4

      All this without watching the video? It is mentioned at 9:00

    • @nowster  11 months ago +1

      @@Jester01 True. I was wrong. Something must have distracted me at that point.