"O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D
in modern times, the constants can frequently matter most (when you can perform more than several hundred operations (3+ ipc) during that cache miss, for example)
@@mina86Linked lists are only O(1) if you have a pointer to the position you want to access. Unless you are already traversing the list you only have that for the first and last element. Appending to an array based list is already amortized constant and prependiding is as well in array based queues. In conclusion, linked lists are even worse.
@@mina86 I have never written code that used a linked list! and trees I most often use stop bit encoding (heap-style implicit tree addressing), and when sparse I back it with a hash table instead of an array....
12:02 in the *___contains___* function there shouldn't be an *_else:_* before the *_return False_* at the end, otherwise in case if the list *_self._data[idx]_* is not *_None_* and the item is not in that list, the return value won't be a boolean.
Good point. It should allow both of the above if statements to fall through, and simply return False as a default. It will still work for __contains__ though because None is falsey. It will evaluate to false when using the "in" operator. It's not a consistent return type though, so yeah, he would probably be better off just removing the else statement.
While that is correct, it seems that the *___contains___* special function returns *False* by default if you don't return anything yourself (I tried it on an online interpreter). So, there shouldn't be an *else:* statement or a *return* statement at all. If you go to 15:52, it seems to have worked even though it was the same exact scenario you described: the list at the specified hash value does indeed exist ( is not *None* ) -- the list of items whose hash value is *999999* , however, the item itself ( *999999* ) is not in that list as it had just been removed.
Great spot! You’re absolutely right, I messed up the implementation :) even if technically it works this time, the code isn’t super clear due to the bug, correcting it would also make it more readable I think.
I implemented this in ADA back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.
Extremely surprised that people were still using ADA that late in an educational context -- would have suspected that 'trendy' and 'new' Java would have been on offer.
@@halfsourlizard9319Java was released, brand new and buggy with no support or books yet, in 1995. Two years later is barely enough time for books to be written and schools to re-orient their curriculum to a new language. I used Java back then and it was very rough and didn't work well at all.
@@halfsourlizard9319 ADA was used for algorithms, C-language was used for OS, Networking etc. We where using Sun Workstations and I believe Java became available at 1996. Universities are probably not so fast in switching languages.
I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things: 1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically. 2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
Keys are unique in a hash table, and generally immutable. If you're confusing a multiway hash table, you still only use one instance of a key. In such a case, a key/value pair would be best with the value side of that equation being a balanced tree structure instead of the usual array.
1) is debatable and depends on your number of insertions vs lookups, since if you have a lot of insertions, sorting every time there's an insertion collision can be more expensive than searching over an unsorted list every time there's a lookup collision
@@DanieleCapellini If you have a lot of collisions, then it being sorted or not will be the least of your problems. However, in a tiny array, say with 16 or fewer elements, then sorting can be done fastest with an insertion sort algorithm and it'll be faster to lookup and eat less memory as an array. Of course, if you don't resize your table adequately and collision chains back up to an insane degree, then you'd be better off with a tree data structure all around.
Side note: if your list of values is static and known in advance, the Gnu “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
Implemented exactly the simple form of this is a commercial compiler around 1980 to store the symbol table (list of all identifiers defined in the program being compiled, what type, size etc.). Chosen for lookup speed as the symbol table is accessed frequently in the compilation process
yes bloom filters are good, but first you gotta cover how to do universal hash functions (and therefore can create an arbitrary number of them) .. I think Erik Demaine covers it well enough but his stuff is lecture format
The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.
It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things. Of course, that 80% full table gives you a 64% chance of colliding on your first two tries, a 51.2% chance of failing on the third probe, a 41% chance of failing on the fourth, and so on. The average length of the chains you wind up with goes up sharply as you push up utilization.
I watched a talk on Python dictionaries, the guy that worked on the new implementation had gone into detail how they are more closely related to databases than hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.
@@PhilippeCarphinprobably Modern Dictionaries by Raymond Hettinger. It's on RUclips. It's from 2016, so it's possible it no longer represents the state-of-the-art.
13:30 I get that it's easier to generate numbers, but I think pedagogically it makes much more sense to use strings. import random with open('/usr/share/dict/words') as f: words = [w.rstrip() for w in f] print(random.choices(words, k=10))
To add to the video, a few key points about hash sets and maps : - Because of the hash and modulo, the contained items cannot be ordered in any way (sorted hash sets and maps exist, but are more costly). - Looping through the items is a lot slower than looping on a simple array, because of data locality. - The target load threshold is often 70% for optimal performance, meaning the capacity is about 50% greater than the number of contained items. This impacts memory consumption in large sets/maps, or when you have a lot of them. - To maintain the load threshold, the hash set may resize itself (increase its capacity), which requires a full rehash of the items, because the capacity on which we use a modulo has changed. This can be very costly, and is why adding an item to a hash set is said to be O(1) amortized (= is O(1) unless you hit a resize threshold) - Hence, specifying the initial capacity when you know it is a very good practice, and avoids any rehashing - From what I've seen, the best values for capacities are either prime numbers or powers of 2. .Net uses prime numbers for example.
@@kristik432 Since 3.7 yes, they're now ordered, but as I said this is more costly than keeping it unordered. Probably not a big deal here anyway, because when you care about performance and memory consumption, Python isn't exactly the first candidate 😅
At 7:00, i used to explain it this way, but i came to the understanding it's simpler to say imagine theres a second hashing function, more random than the last lol, that gets its 32bit hash stored at the location of the first hash instead of Obj. Now it's a fixed-length structure, and you get count/fingerprinting of fingerprints, and you dont need to make an Obj. Well in Python you do.
A follow-up on the amortized complexity would be nice. Because it's a bit disingenuous to call the hashmap insertion O(1) if the underlying table doesn't grow. ^^
If you use a naive hashing algorithm, I suppose a nefarious actor could engineer a stream of data where each entry collides. This could presumably be used in a DOS attack.
hash collisions are used in quite a few different attacks, there is no need to correctly guess someone's password (for example) when you can just guess any value whose hash matches that password, increasing the chances of success for the hacker
You do also need to compare the object that's sitting there in your hash table with your search spec - if a lookup of say, foo hash collides with a prior insertion bar, well, those don't match. You need to report that foo isn't in the table - not report bar as the successful search result.
Hi Mike, Sean and all, thanks for the video! I have a question about the claim that lookup in a hash table is O(1), at least as implemented in the video. Suppose the capacity of your hash set is c, and the number of data points you have is n. Then, to look up x, it takes O(log(c)) operations to find the list corresponding to the hash h(x), and O(n/c) operations to search through that (perhaps unsorted) list for x, plus maybe some small overhead for the hashing function -- that is, O(log(c) + n/c) in total, right? If so, then when n is much larger than c, the resulting complexity is still O(n). When n is smaller than log(c), we have O(log(c)), but that isn't actually any better than O(n). So it seems that the useful case is when n and c scale *together* such that log(c) < n < c. In this case, it takes O(log(c)) operations - but in this case O(log(c)) is no better than O(log(n)). Am I misunderstanding something? 😅 Thanks!
As a school boy I heard about casting out 9s, but never learned what it was. I'm now 74 and have finally learned about casting out 9s. I now have a pass time of casting out 9s from every license plate I stop behind. So I could use the remainder produced by casting out 9s as a hash to store all these license codes into 9 sub-lists, instead of putting all the codes into one big list. Have I got this right?
Usually the whole internal data structure gets rehashed into a new hash table of increased size. Since capacity plays a role where entries get stored, copying everything into a new hashtable under the hood is usually the only consistant option. Doing this can cost some time esp. on bigger structures, which is why pre-sizing dictionaries and sets is recommended in most languages
@@RonJohn63 For Python standard dict's especially I think you are right. An option (or workaround) would be to use dict.fromkeys - which prevents rehashing but probably only makes sense from a performance standpoint if you got all keys in a separate list anyway. Well, since the dict itself only makes sense if data is more often read than written probably it's not worth loosing too much sleep about the rehashing. While it is not for free, it's still a in-memory operation
Oh, I must not have watched an episode for a while as that is the first time, I have seen a sponsor acknowledgement. Is Mike building an AI supercomputer in his garage?
Constant time does not mean "fastest time". The overhead from most constant time operations is usually significantly slower than even checking an array 1 element at a time unless the data structure gets beyond some breakpoint in size. You might need a million elements in an array before a hash set starts to be the fastest solution. This is why profiling your code is always necessary.
Both the add() and contains() functions have a code branch that doesn't return a value if reached. Aka the first if is true but the nested if is false. However, Python has implicit returns. In this particular case for contains() it will return None which is evaluated as False in a boolean context and works by accident. Not sure how this affects the add() function.
There's an issue with lines 26/27 on the HashSet. If there's something in the bucket for the object other than the object that's queried, the method will return None. Now, this will of course evaluate to False rather than True, but it's not really the expected value.
...but how do you implement a true O(1) "in" operator, unless you allocate enough memory in advance for all possible hash values? That's a serious price to pay for the performance.
@@digama0 My guess would be a crash 🙂 I wonder how many python programmers add explicit safeguards against out-of-memory errors, instead of just hoping the computer has enough RAM.
Recently, I needed ("wanted" would actually be a more fitting word) to create a class where I would be able to map multiple keys to the same values (i.e. multiple keys with different hashes reference the same memory address). It was an absolute nightmare to get right, and it likely wasn't worth the effort, but it was a fun experiment which produced something which was actually more useful than I anticipated. I'm really fascinated by hash maps. Dictionaries are by far my favorite data structure in Python, especially now that they're ordered and support set operations on their views.
if two strings get hashed to the same key, and each key gets an .add(obj), how does it know which obj to pick from when I .get(oneOfThoseStrings)? edit: nvm i just realized this is a set, hah. Im curious then how hashmaps resolve this. looking forward fo that one!
After identifying the right bucket, you just compare the key to every key in the bucket using =. In the hashset code this is handled by delegating to the contains function on lists.
Why is there not the following bug? In the __contains__ method, suppose the list is nonempty for given idx but the element is not there. Then the function returns None since return keyword is not encountered. I suppose when you call a in b for some object b with a __contains__ method, if it retruns None that is interpreted as False Still seems sloppy. I would prefer if you wrote return i in **list**
Probably because hashsets sidestep the concept of 'sorting' altogether. They don't care about what 'order' things are arranged, and instead they just have a designated place for any possible object. So when you give them an object, either for insertion or searching, they don't need to worry about searching for that object, they just check its proper location and it's either there or it isn't. Hashes aren't really useful for sorting because they're not sorted in any meaningful way (unless you are just working with integers). Instead, objects will be arranged in what appears to be a random distribution according to the hash function that you're using. There's not really a straightforward way of extracting the contents of a hashset into a sorted structure, you'd probably have to literally just search the hashset for ever possible object and then append any that you find to your list. For example, if it was a list of strings, you'd need to do a search for every possible string that could be made (you can't just search for the strings from your list, because you'd need to search for them in order, which would require that you already have them sorted). Basically, it'd be like going through the dictionary one entry at a time in order to alphabetize a list. Hashsets are primarily used for datasets that either don't have a natural way to order them, or where their order itself doesn't matter; but you still want to have a fast way to access each element. Like as mentioned a software library is likely implemented with a hashset or something similar. Sure you could store the contents of the library alphabetically but there's no really anything gained by knowing where each element is in relation to the others; the order does not matter. So by using something like a hash you reduce the time it takes to find what you need form O(logn) to O(1) without loosing anything useful. If you actually care about the 'order', then chances are that a hashset isn't right for the job.
@@phiefer3 So you cant use an inserstion sort algorithm that also includes a hash for things that have already been sorted? So if I were sorting things alpabetically with insertion sort, it wouldn't make the sorting algorithm faster to have what has already been sorted be sorted along side a hash table. I was thinking I could use both, a hash table of abc's, and the sorting algorithm, so then sorting algorithm didn't have to go through each letter, it could just use the hash table to go to the starting letter then continue sorting
@@4115steve That's not how hashsets work. The objects in the hashset are not in any meaningful order. There is no concept of the "start" of the hashset, or "start of a letter" in the hashset. Nor is there a concept of what comes "next". Objects in the hashtable are effectively randomly distributed according to the hashfunction. Like if you were to store the words 'dog' and 'dogs' in a hashtable they would not be anywhere near eachother (well, they could be depending on the hashfunction, but even then it would be just coincidental). You could do something like take your already sorted list, and then make a hashtable where you stored their indexes using the strings themselves, ie you search the hashtable for 'dog' and it gives you the index where dog is stored in the list. This would in theory allow you to search the list in O(1) speed, but it would not really help that much with insertion or with sorting. The act of actually moving things around in the list in order to add, remove or sort the list would still require the same complexity as normal, the hashtable wouldn't help with that; but then every time you alter the list you would then need to rehash the ENTIRE list again in order to update it with the new positions of all the objects. What it sounds like you're actually talking about is having a list of lists, or a 2-dimensional array where you can use the first letter of a word to first identify which sublist to search. However, this would still have the same big O complexity as a regular list, since in general the size of each sublist would be roughly proportional to the total number of objects. And I suppose, if you squinted really really hard you might be able to cal this a very simplistic form of hashing, but in practice it's not really.
@@4115stevewhat you’re describing kind of sounds like a Bucket Sort, but there’s no need for hashing in that case; you would just define the ranges of values for each bucket so the buckets are sorted in advance, then walk through the buckets in order to get the sorted items. Most hashing functions (other than binning) are inherently unsorted, so you would not use this in any sorting algorithm.
It funny how you use a hash map to define a set which could later be made into a dictionary. In python a set is defined as a dictionary where every value is constant (probable True). This is because the keys of a dictionary are a set, mathematically.
The difference between 'O(1)' and 'doing precisely a single operation' seems to be a subtlety that evades people ... I can't say how many dev candidates I've interviewed who've said that their 2-pass algorithm had 'O(2n)' perf -- that's not a thing, and it suggests that you've completely misunderstood asymptotes / limits.
what do you think of dev candidates that give an estimated expected latency instead? people that specialize in optimization barely consider big-O because they are coding for the known target architecture and what they care about is the latency, not silly metrics that dont translate well to real world perf?
@@styleisaweapon Depends very much on the use case -- absolutely that'd matter way more if it were some low-level hard-real-time system or a high-volume trading system or something ... and you're quite right -- even for say a frontend thing (and this is way outside of things I'm familiar with), I know that good devs grab flame graphs and are trying to cut out needlessly expensive ops in their JavaScript -- which is probably 5 levels of abstraction away from bare metal. But, I'd still expect someone with a comp sci background to be conversant in asymptotic complexity -- even if just to discuss why it misses things ... and why writing a quadratic solution to a thing that could be linear is probably (although not always) not a great idea. All abstractions ignore things; some abstractions are useful.
(Fav case that I see a lot is: People doing mid-sequence insertion into something that's backed by an array ... and they just assume that it's efficient ... With good people, asking 'so, how would you implement insertion in a contiguous-mem-backed container?' will get them to realise it ... but I've had candidates refuse to believe that it's not const time.)
I am quite new to python. One issue about hash() is, that 'salting' is added for strings and other types. Can someone elaborate about this? Thanks in advance!
Salting means putting known extra data into the hash function. This is normally done with random data when hashing passwords, so that multiple users with the same password have different hashes. (for a better explanation look up the "how not to store passwords" video with Tom Scott) I don't know about python, but I could see reasons for salting data in a dynamically typed language with the type. This would avoid collision between objects of different type but same representation.
Python generate a random string on startup and combine it ith your string when you cal hash(). This is intended as a security measure to prevent attacker to force collisions. For instance if you write a web application that use a hashmap somewhere and where the keys of the hashmap can be influenced by the web clients, if there were no salting, an attacker could perhaps send some specific requests that would cause lots of collisions and suddently your web process would consume 100% CPU trying to find walk those collision chains. If the attacker doen't know the salt howerver it is much more difficult to do that.
Always fascinated by hash related things. Perhaps discussing actual 'dictionaries' that use actual strings? Perhaps you've already done this in another video, sorry if you have already and I just missed it. We had a system that tracked all the variables in a system of over 250000 lines of code. The 'variable database' used a hash on each variable name (code had on the order of 50000 variables) and stored in each hash location a short list of all the source-code modules that contained the variable. I know today most folks would just 'grep' for things like this, but at the time (early 80's) this was far superior way to track variable usage when doing the direct searches was slower.
That sounds difficult to maintain. How did you get the data into the database? Did you have a process that routinely scanned the source and updated the database?
@@williamli55 It was only half of the database. The other half maintained a 'reverse list' of all variables used in a module. So when recompiling a module, the first step used second set to 'remove' the module name from each variable's list of places it was referenced. Then after the module compiled successfully, a 'post compiler' step would parse the source file looking for all variables now referenced by it and add the module name to each of those variable's entry. Because it used it's own data to keep itself up to date (the reverse list to know what to delete), it did occasionally get corrupted. We'd live with it a bit, but at some point we'd just 'recompile the world' overnight to restore everything. (sorry I've left out a lot of the details, but hopefully you get the 'gist' of it)
@@mikefochtman7164 Okay I see. Parsing after compilation makes sense, plus it sounds like the 'post compiler' was only scanning what it needed. Thanks for the explanation!
A hash set is just that: a set. Therefore it can only contain a single type of item. A hash map then maps this item onto a value. This is also called a hash table because a mathematical table is another way to display a surjection
When implementing a hash table, you should use a table size that's a power of two so that you can do a `bitwise and` on the hash code to index the table with, but also you should store the unnormalized hash with the corresponding keys. Everyone always recommends using a prime number for the table size, but that's the wrong approach. You can always make the hashing function more complicated, but the indexing should be as simple as possible. In case you're wondering why you should save the unnormalized hash code, that's so that you can resize the table as necessary when too many or too few collisions occur and rebuild the table without further querying the hashing function. There's a lot of balancing to be done in implementing a hash table, regardless of whether it's a set or map, but these two things should always be a part of the implementation.
Ease up on the skepticism that AI is AI guys. I’m a PhD cognitive scientist & statistician with moderate experience with the latest AI approaches and I’ve come to the assessment that what’s out now warrants the moniker “AI”. Previously we could look at a shallow net or forest or linear regression and assert confidently the limits thereof that made them feel qualitatively different from human information processing. Not so with the latest approaches.
_"... once the dater is sautéed ..."_ From _1,000 Tasty Ways to Enjoy Your Date_ by Dr. Lecter. (It's okay. As an Indian, I'm allowed to make fun of other people's English accents. It's on a reciprocating basis.)
The main thing u didn't explain is why getting an element from set is O(1) Coz you told that it's just looking for the elements hash in your collection. But it could be done in different ways and with different complexity. E.g. "el in [...]" it's too looking for an el, but it has O(n) It's O(1) coz u know: - the structure u created saves only ints, coz of implementation of hash function - the fixed bytes size of any int on your machine and know thst it's stores sequentially in a single block of memory If you have an adress of the first element, and know that all of the elements are the same block size, u can get an adress of n-th element by increasing this address n times of fixed int size For example (super simplified) u have an array with 4 int elements. Int takes 4 bytes to be stored = 32 bits. When u creates it, the cpu allocate memory for this 4 elements where everyone is 4 bytes. 1th stored in 0x0000. The 2th element would have 0000+32 = 0x0032. It's the 2th element adress
I'm probably going to get hate for this but; Computer scientists are so bad at choosing variable names. And i've seen this so many times.. In an interpreted language is doesn't make a difference and, still they insist on max 3 letters 😅.. (i'm saying this with a wink btw)
All this without mentioning the standard Python "set" type which provides this functionality (and more). [Edit: as mentioned below, I was wrong about this]
Jane Street skyscraper puzzle (and info on the AMP program) at bit.ly/computerphile-amp
I get the skyscraper, but I don't get the route thing.
Oh, I got t now. The font is too small, I was misreading.
"O(1) means that there is no relationship between the speed of the lookup and the number of elements in your collection". Couldn't have said it better, as often with big O, the devil is in the constant you dropped during notation :D
+ c
in modern times, the constants can frequently matter most (when you can perform more than several hundred operations (3+ ipc) during that cache miss, for example)
@@styleisaweapon, yeah, which is why you should virtually never use linked lists even though it offers O(1) operations.
@@mina86Linked lists are only O(1) if you have a pointer to the position you want to access. Unless you are already traversing the list you only have that for the first and last element.
Appending to an array based list is already amortized constant and prependiding is as well in array based queues.
In conclusion, linked lists are even worse.
@@mina86 I have never written code that used a linked list! and trees I most often use stop bit encoding (heap-style implicit tree addressing), and when sparse I back it with a hash table instead of an array....
Always love a Mike Pound video!!!
You've been Pounded!
Nice!!!@@programorprogrammed
I see Mike Pound, I click
12:02 in the *___contains___* function there shouldn't be an *_else:_* before the *_return False_* at the end, otherwise in case if the list *_self._data[idx]_* is not *_None_* and the item is not in that list, the return value won't be a boolean.
Good point. It should allow both of the above if statements to fall through, and simply return False as a default. It will still work for __contains__ though because None is falsey. It will evaluate to false when using the "in" operator. It's not a consistent return type though, so yeah, he would probably be better off just removing the else statement.
While that is correct, it seems that the *___contains___* special function returns *False* by default if you don't return anything yourself (I tried it on an online interpreter). So, there shouldn't be an *else:* statement or a *return* statement at all. If you go to 15:52, it seems to have worked even though it was the same exact scenario you described: the list at the specified hash value does indeed exist ( is not *None* ) -- the list of items whose hash value is *999999* , however, the item itself ( *999999* ) is not in that list as it had just been removed.
Thanks pep8. :D
@@vadrif-draco Any Python function returns None by default, and None is falsey (i.e. it works with conditionals the same way that False does)
Great spot! You’re absolutely right, I messed up the implementation :) even if technically it works this time, the code isn’t super clear due to the bug, correcting it would also make it more readable I think.
I implemented this in ADA back in 1997 during a class in computer science. I think 73 was a pretty good prime to use while hashing to minimize collisions.
Extremely surprised that people were still using ADA that late in an educational context -- would have suspected that 'trendy' and 'new' Java would have been on offer.
@@halfsourlizard9319Java was released, brand new and buggy with no support or books yet, in 1995. Two years later is barely enough time for books to be written and schools to re-orient their curriculum to a new language. I used Java back then and it was very rough and didn't work well at all.
@@halfsourlizard9319 ADA was used for algorithms, C-language was used for OS, Networking etc. We where using Sun Workstations and I believe Java became available at 1996. Universities are probably not so fast in switching languages.
@@halfsourlizard9319I went to uni in 2000 and they were just introducing Java.
I invented and implemented this very scheme in 1978, on an HP9845A in HP Basic with a 20 MB hard disk, and discovered a few things:
1) Hash collisions are best stored in a sorted list, so that a binary search can be done, reducing the search time dramatically.
2) Hashing integers as themselves is a disaster in the real world, where initial keys of 0 proliferate. (Amongst other common integers, such as -1.)
Keys are unique in a hash table, and generally immutable. If you're confusing a multiway hash table, you still only use one instance of a key. In such a case, a key/value pair would be best with the value side of that equation being a balanced tree structure instead of the usual array.
1) is debatable and depends on your number of insertions vs lookups, since if you have a lot of insertions, sorting every time there's an insertion collision can be more expensive than searching over an unsorted list every time there's a lookup collision
@@DanieleCapellini If you have a lot of collisions, then it being sorted or not will be the least of your problems. However, in a tiny array, say with 16 or fewer elements, then sorting can be done fastest with an insertion sort algorithm and it'll be faster to lookup and eat less memory as an array. Of course, if you don't resize your table adequately and collision chains back up to an insane degree, then you'd be better off with a tree data structure all around.
Side note: if your list of values is static and known in advance, the Gnu “gperf” tool can come up with a “perfect” hash function that gives you the minimum array size with no collisions. It generates C/C++ code, but the output should be portable to most other languages with a small amount of effort.
I'm not new to programming, but I'm new to Python and I was just literally looking into what uses hash tables in Python. Thanks. Lol
Implemented exactly the simple form of this is a commercial compiler around 1980 to store the symbol table (list of all identifiers defined in the program being compiled, what type, size etc.). Chosen for lookup speed as the symbol table is accessed frequently in the compilation process
The topics discussed on this channel have been the ones that really specifically interest me as of late. This is cool, thank you!
I saw an interview question video yesterday about these - really good timing for me this video. 😁😁😁
Bloom filters could be a good follow-up to this.
And the HyperLogLog
yes bloom filters are good, but first you gotta cover how to do universal hash functions (and therefore can create an arbitrary number of them) .. I think Erik Demaine covers it well enough but his stuff is lecture format
@@Checker8763 any probabilistic counting .. no need for the added complexity of HLL
The built in `array` structure in PHP is mostly a hashmap, and is extremely widely used. Arguably a bit too widely used sometimes, since programmers often use it with strings that they have chosen in advance as the keys, and data supplied at runtime only as the values. In that situation replacing it with a class, with a predefined set of properties known to the compiler, both improves performance and can make the program easier to understand.
"Arguably a bit too widely used sometimes" - There's a name for code that overuses primitive types, "stringly typed" 😆
Not really interested in the topic, because I already know this, but still watched because Mike's presentations are always engaging
Thanks for your videos Mike keep it rolling 🎉
It's important to point out here that probability of collision can be reduced by increasing table size, but then your "utilization" of your table space will be lower. It's absolutely a trade-off and you can't improve one without degrading the other. As your table fills up, collisions will become more and more common. For example, if your table is 80% full, then almost by definition there's an 80% chance of a new item colliding with an existing one. The uniformity of the hash function pretty much guarantees it. There's a lot of nice probability theory analysis you can do around these things.
Of course, that 80% full table gives you a 64% chance of colliding on your first two tries, a 51.2% chance of failing on the third probe, a 41% chance of failing on the fourth, and so on. The average length of the chains you wind up with goes up sharply as you push up utilization.
I watched a talk on Python dictionaries, the guy that worked on the new implementation had gone into detail how they are more closely related to databases than hash maps. It was done to increase performance, and since almost everything in Python has a backing dictionary, it made a large difference in runtime.
That sounds really interesting, might I trouble you for a link to this talk?
@@PhilippeCarphinprobably Modern Dictionaries by Raymond Hettinger. It's on RUclips. It's from 2016, so it's possible it no longer represents the state-of-the-art.
Oftentimes on modern hardware, particularly on small datasets, a linear scan can be faster than a hashmap lookup because the hashing function is slow.
13:30 I get that it's easier to generate numbers, but I think pedagogically it makes much more sense to use strings.
import random
with open('/usr/share/dict/words') as f: words = [w.rstrip() for w in f]
print(random.choices(words, k=10))
I would love to see a follow up video for dictionaries/hashmaps :)
To add to the video, a few key points about hash sets and maps :
- Because of the hash and modulo, the contained items cannot be ordered in any way (sorted hash sets and maps exist, but are more costly).
- Looping through the items is a lot slower than looping on a simple array, because of data locality.
- The target load threshold is often 70% for optimal performance, meaning the capacity is about 50% greater than the number of contained items. This impacts memory consumption in large sets/maps, or when you have a lot of them.
- To maintain the load threshold, the hash set may resize itself (increase its capacity), which requires a full rehash of the items, because the capacity on which we use a modulo has changed. This can be very costly, and is why adding an item to a hash set is said to be O(1) amortized (= is O(1) unless you hit a resize threshold)
- Hence, specifying the initial capacity when you know it is a very good practice, and avoids any rehashing
- From what I've seen, the best values for capacities are either prime numbers or powers of 2. .Net uses prime numbers for example.
You can store the original un-modulo'd hash value with the key. This speeds up resizing at the expense of some RAM.
yea python dict maintains insert order by also keeping a list of keys behind the scenes
@@kristik432 Since 3.7 yes, they're now ordered, but as I said this is more costly than keeping it unordered.
Probably not a big deal here anyway, because when you care about performance and memory consumption, Python isn't exactly the first candidate 😅
At 7:00, i used to explain it this way, but i came to the understanding it's simpler to say imagine theres a second hashing function, more random than the last lol, that gets its 32bit hash stored at the location of the first hash instead of Obj. Now it's a fixed-length structure, and you get count/fingerprinting of fingerprints, and you dont need to make an Obj. Well in Python you do.
or just a 64bit hash and you split it in half, quaters, etc.
Hi, could you do a video on characteristics of a good hash function used in hashtables and their evaluation as a followup video?
Very interesting, thank you!
Just what i wanted now. thanks a lot :)
I did that skyscraper puzzle as my first algorithm challenge at university
A follow-up on the amortized complexity would be nice. Because it's a bit disingenuous to call the hashmap insertion O(1) if the underlying table doesn't grow. ^^
If you use a naive hashing algorithm, I suppose a nefarious actor could engineer a stream of data where each entry collides. This could presumably be used in a DOS attack.
hash collisions are used in quite a few different attacks, there is no need to correctly guess someone's password (for example) when you can just guess any value whose hash matches that password, increasing the chances of success for the hacker
You do also need to compare the object that's sitting there in your hash table with your search spec - if a lookup of say, foo hash collides with a prior insertion bar, well, those don't match. You need to report that foo isn't in the table - not report bar as the successful search result.
Can you add sections to videos when different subtopics are covered?
If the hash of 2 objects is the same and they end up as a list of that hash, how would I obtain the correct object from it later?
Hi Mike, Sean and all, thanks for the video! I have a question about the claim that lookup in a hash table is O(1), at least as implemented in the video. Suppose the capacity of your hash set is c, and the number of data points you have is n. Then, to look up x, it takes O(log(c)) operations to find the list corresponding to the hash h(x), and O(n/c) operations to search through that (perhaps unsorted) list for x, plus maybe some small overhead for the hashing function -- that is, O(log(c) + n/c) in total, right?
If so, then when n is much larger than c, the resulting complexity is still O(n). When n is smaller than log(c), we have O(log(c)), but that isn't actually any better than O(n). So it seems that the useful case is when n and c scale *together* such that log(c) < n < c. In this case, it takes O(log(c)) operations - but in this case O(log(c)) is no better than O(log(n)). Am I misunderstanding something? 😅 Thanks!
Thank you mike
As a school boy I heard about casting out 9s, but never learned what it was. I'm now 74 and have finally learned about casting out 9s. I now have a pass time of casting out 9s from every license plate I stop behind. So I could use the remainder produced by casting out 9s as a hash to store all these license codes into 9 sub-lists, instead of putting all the codes into one big list. Have I got this right?
Big question: how do Dictionaries auto-expand?
Usually the whole internal data structure gets rehashed into a new hash table of increased size. Since capacity plays a role where entries get stored, copying everything into a new hashtable under the hood is usually the only consistant option.
Doing this can cost some time esp. on bigger structures, which is why pre-sizing dictionaries and sets is recommended in most languages
@@goasererDoes presizing dictionaries exist in Python? I don't remember seeing it, last time I looked.
@@RonJohn63 For Python standard dict's especially I think you are right.
An option (or workaround) would be to use dict.fromkeys - which prevents rehashing but probably only makes sense from a performance standpoint if you got all keys in a separate list anyway.
Well, since the dict itself only makes sense if data is more often read than written probably it's not worth loosing too much sleep about the rehashing. While it is not for free, it's still a in-memory operation
Perfect !!!
Oh, I must not have watched an episode for a while as that is the first time, I have seen a sponsor acknowledgement. Is Mike building an AI supercomputer in his garage?
Constant time does not mean "fastest time". The overhead from most constant time operations is usually significantly slower than even checking an array 1 element at a time unless the data structure gets beyond some breakpoint in size. You might need a million elements in an array before a hash set starts to be the fastest solution. This is why profiling your code is always necessary.
Both the add() and contains() functions have a code branch that doesn't return a value if reached. Aka the first if is true but the nested if is false. However, Python has implicit returns. In this particular case for contains() it will return None which is evaluated as False in a boolean context and works by accident. Not sure how this affects the add() function.
There's an issue with lines 26/27 on the HashSet. If there's something in the bucket for the object other than the object that's queried, the method will return None. Now, this will of course evaluate to False rather than True, but it's not really the expected value.
...but how do you implement a true O(1) "in" operator, unless you allocate enough memory in advance for all possible hash values? That's a serious price to pay for the performance.
You are supposed to resize the array when the filling factor becomes larger than a fixed constant like 0.75.
@@digama0 ok, so it's still a speed/memory tradeoff. I come from Embedded background, where such shenanigans are not welcomed 🙂
@@idogendel I guess the embedded version of that is to just crash or start dropping items once the filling factor gets too large
@@digama0 My guess would be a crash 🙂 I wonder how many python programmers add explicit safeguards against out-of-memory errors, instead of just hoping the computer has enough RAM.
Please do a video on LLL, maybe show Coppersmith's attack on RSA, maybe knapsack
I'm wondering, when using objects instead of integers, how would you handle collisions?
A hash is just an index in an array. In the end, it has to be an int.
I understood some of the words Mike used
Recently, I needed ("wanted" would actually be a more fitting word) to create a class where I would be able to map multiple keys to the same values (i.e. multiple keys with different hashes reference the same memory address). It was an absolute nightmare to get right, and it likely wasn't worth the effort, but it was a fun experiment which produced something which was actually more useful than I anticipated. I'm really fascinated by hash maps. Dictionaries are by far my favorite data structure in Python, especially now that they're ordered and support set operations on their views.
At the end of the video I thought it was Dr Pound putting on an Australian accent
if two strings get hashed to the same key, and each key gets an .add(obj), how does it know which obj to pick from when I .get(oneOfThoseStrings)?
edit: nvm i just realized this is a set, hah. Im curious then how hashmaps resolve this. looking forward fo that one!
After identifying the right bucket, you just compare the key to every key in the bucket using =. In the hashset code this is handled by delegating to the contains function on lists.
I wish I could explain concepts like this.
Why is there not the following bug? In the __contains__ method, suppose the list is nonempty for given idx but the element is not there. Then the function returns None since return keyword is not encountered. I suppose when you call a in b for some object b with a __contains__ method, if it retruns None that is interpreted as False
Still seems sloppy. I would prefer if you wrote return i in **list**
Is there a "recursive hash" method?!
The downside is it requires around 50% data overhead to retain good performance characteristics.
Its Sunny in UK
it's not just python hashsets but hashsets in general.
why arent hash sets use inside sorting algorithms to reduce insertion time?
Probably because hashsets sidestep the concept of 'sorting' altogether. They don't care about what 'order' things are arranged, and instead they just have a designated place for any possible object. So when you give them an object, either for insertion or searching, they don't need to worry about searching for that object, they just check its proper location and it's either there or it isn't.
Hashes aren't really useful for sorting because they're not sorted in any meaningful way (unless you are just working with integers). Instead, objects will be arranged in what appears to be a random distribution according to the hash function that you're using. There's not really a straightforward way of extracting the contents of a hashset into a sorted structure, you'd probably have to literally just search the hashset for ever possible object and then append any that you find to your list. For example, if it was a list of strings, you'd need to do a search for every possible string that could be made (you can't just search for the strings from your list, because you'd need to search for them in order, which would require that you already have them sorted). Basically, it'd be like going through the dictionary one entry at a time in order to alphabetize a list.
Hashsets are primarily used for datasets that either don't have a natural way to order them, or where their order itself doesn't matter; but you still want to have a fast way to access each element. Like as mentioned a software library is likely implemented with a hashset or something similar. Sure you could store the contents of the library alphabetically but there's no really anything gained by knowing where each element is in relation to the others; the order does not matter. So by using something like a hash you reduce the time it takes to find what you need form O(logn) to O(1) without loosing anything useful. If you actually care about the 'order', then chances are that a hashset isn't right for the job.
@@phiefer3 So you cant use an inserstion sort algorithm that also includes a hash for things that have already been sorted?
So if I were sorting things alpabetically with insertion sort, it wouldn't make the sorting algorithm faster to have what has already been sorted be sorted along side a hash table. I was thinking I could use both, a hash table of abc's, and the sorting algorithm, so then sorting algorithm didn't have to go through each letter, it could just use the hash table to go to the starting letter then continue sorting
@@4115steve That's not how hashsets work. The objects in the hashset are not in any meaningful order. There is no concept of the "start" of the hashset, or "start of a letter" in the hashset. Nor is there a concept of what comes "next". Objects in the hashtable are effectively randomly distributed according to the hashfunction. Like if you were to store the words 'dog' and 'dogs' in a hashtable they would not be anywhere near eachother (well, they could be depending on the hashfunction, but even then it would be just coincidental).
You could do something like take your already sorted list, and then make a hashtable where you stored their indexes using the strings themselves, ie you search the hashtable for 'dog' and it gives you the index where dog is stored in the list. This would in theory allow you to search the list in O(1) speed, but it would not really help that much with insertion or with sorting. The act of actually moving things around in the list in order to add, remove or sort the list would still require the same complexity as normal, the hashtable wouldn't help with that; but then every time you alter the list you would then need to rehash the ENTIRE list again in order to update it with the new positions of all the objects.
What it sounds like you're actually talking about is having a list of lists, or a 2-dimensional array where you can use the first letter of a word to first identify which sublist to search. However, this would still have the same big O complexity as a regular list, since in general the size of each sublist would be roughly proportional to the total number of objects. And I suppose, if you squinted really really hard you might be able to cal this a very simplistic form of hashing, but in practice it's not really.
@@4115stevewhat you’re describing kind of sounds like a Bucket Sort, but there’s no need for hashing in that case; you would just define the ranges of values for each bucket so the buckets are sorted in advance, then walk through the buckets in order to get the sorted items. Most hashing functions (other than binning) are inherently unsorted, so you would not use this in any sorting algorithm.
It funny how you use a hash map to define a set which could later be made into a dictionary. In python a set is defined as a dictionary where every value is constant (probable True). This is because the keys of a dictionary are a set, mathematically.
The difference between 'O(1)' and 'doing precisely a single operation' seems to be a subtlety that evades people ... I can't say how many dev candidates I've interviewed who've said that their 2-pass algorithm had 'O(2n)' perf -- that's not a thing, and it suggests that you've completely misunderstood asymptotes / limits.
what do you think of dev candidates that give an estimated expected latency instead? people that specialize in optimization barely consider big-O because they are coding for the known target architecture and what they care about is the latency, not silly metrics that dont translate well to real world perf?
@@styleisaweapon Depends very much on the use case -- absolutely that'd matter way more if it were some low-level hard-real-time system or a high-volume trading system or something ... and you're quite right -- even for say a frontend thing (and this is way outside of things I'm familiar with), I know that good devs grab flame graphs and are trying to cut out needlessly expensive ops in their JavaScript -- which is probably 5 levels of abstraction away from bare metal. But, I'd still expect someone with a comp sci background to be conversant in asymptotic complexity -- even if just to discuss why it misses things ... and why writing a quadratic solution to a thing that could be linear is probably (although not always) not a great idea. All abstractions ignore things; some abstractions are useful.
Add difference between actual O(1) and amortized O(1) (and how that amortization can lie to you).
@@hanifarroisimukhlis5989 Oh, yeah, that's super important for a lot of things that come up in practice with resizable data structures 👍👍
(Fav case that I see a lot is: People doing mid-sequence insertion into something that's backed by an array ... and they just assume that it's efficient ... With good people, asking 'so, how would you implement insertion in a contiguous-mem-backed container?' will get them to realise it ... but I've had candidates refuse to believe that it's not const time.)
Did A LOT of hash sorting in college..
I am quite new to python. One issue about hash() is, that 'salting' is added for strings and other types. Can someone elaborate about this? Thanks in advance!
Salting means putting known extra data into the hash function. This is normally done with random data when hashing passwords, so that multiple users with the same password have different hashes. (for a better explanation look up the "how not to store passwords" video with Tom Scott)
I don't know about python, but I could see reasons for salting data in a dynamically typed language with the type. This would avoid collision between objects of different type but same representation.
Python generate a random string on startup and combine it ith your string when you cal hash(). This is intended as a security measure to prevent attacker to force collisions. For instance if you write a web application that use a hashmap somewhere and where the keys of the hashmap can be influenced by the web clients, if there were no salting, an attacker could perhaps send some specific requests that would cause lots of collisions and suddently your web process would consume 100% CPU trying to find walk those collision chains. If the attacker doen't know the salt howerver it is much more difficult to do that.
Always fascinated by hash related things. Perhaps discussing actual 'dictionaries' that use actual strings? Perhaps you've already done this in another video, sorry if you have already and I just missed it.
We had a system that tracked all the variables in a system of over 250000 lines of code. The 'variable database' used a hash on each variable name (code had on the order of 50000 variables) and stored in each hash location a short list of all the source-code modules that contained the variable. I know today most folks would just 'grep' for things like this, but at the time (early 80's) this was far superior way to track variable usage when doing the direct searches was slower.
That sounds difficult to maintain. How did you get the data into the database? Did you have a process that routinely scanned the source and updated the database?
@@williamli55 It was only half of the database. The other half maintained a 'reverse list' of all variables used in a module. So when recompiling a module, the first step used second set to 'remove' the module name from each variable's list of places it was referenced.
Then after the module compiled successfully, a 'post compiler' step would parse the source file looking for all variables now referenced by it and add the module name to each of those variable's entry.
Because it used it's own data to keep itself up to date (the reverse list to know what to delete), it did occasionally get corrupted. We'd live with it a bit, but at some point we'd just 'recompile the world' overnight to restore everything.
(sorry I've left out a lot of the details, but hopefully you get the 'gist' of it)
@@mikefochtman7164 Okay I see. Parsing after compilation makes sense, plus it sounds like the 'post compiler' was only scanning what it needed. Thanks for the explanation!
The differences between Hash Set and Hash Table in different languages can be confusing
A hash set is just that: a set. Therefore it can only contain a single type of item. A hash map then maps this item onto a value. This is also called a hash table because a mathematical table is another way to display a surjection
You could also say that a Hash Set is a Hash Table where a key is the same as the value.
Tomrocksmaths | hash map 8:26
Awesome. So the takeaway is that Python is slow ;)
"if (boolean expression) return True else return False" is really strange code.
When implementing a hash table, you should use a table size that's a power of two so that you can do a `bitwise and` on the hash code to index the table with, but also you should store the unnormalized hash with the corresponding keys. Everyone always recommends using a prime number for the table size, but that's the wrong approach. You can always make the hashing function more complicated, but the indexing should be as simple as possible. In case you're wondering why you should save the unnormalized hash code, that's so that you can resize the table as necessary when too many or too few collisions occur and rebuild the table without further querying the hashing function. There's a lot of balancing to be done in implementing a hash table, regardless of whether it's a set or map, but these two things should always be a part of the implementation.
Neat
👏👏👏👏👏
❤
"a few moments later..."
Ease up on the skepticism that AI is AI guys. I’m a PhD cognitive scientist & statistician with moderate experience with the latest AI approaches and I’ve come to the assessment that what’s out now warrants the moniker “AI”. Previously we could look at a shallow net or forest or linear regression and assert confidently the limits thereof that made them feel qualitatively different from human information processing. Not so with the latest approaches.
_"... once the dater is sautéed ..."_ From _1,000 Tasty Ways to Enjoy Your Date_ by Dr. Lecter.
(It's okay. As an Indian, I'm allowed to make fun of other people's English accents. It's on a reciprocating basis.)
The main thing u didn't explain is why getting an element from set is O(1) Coz you told that it's just looking for the elements hash in your collection. But it could be done in different ways and with different complexity. E.g. "el in [...]" it's too looking for an el, but it has O(n)
It's O(1) coz u know:
- the structure u created saves only ints, coz of implementation of hash function
- the fixed bytes size of any int on your machine and know thst it's stores sequentially in a single block of memory
If you have an adress of the first element, and know that all of the elements are the same block size, u can get an adress of n-th element by increasing this address n times of fixed int size
For example (super simplified) u have an array with 4 int elements. Int takes 4 bytes to be stored = 32 bits. When u creates it, the cpu allocate memory for this 4 elements where everyone is 4 bytes. 1th stored in 0x0000. The 2th element would have 0000+32 = 0x0032. It's the 2th element adress
corrected title: Hash Sets Explained & Demonstrated Using Python
I'm probably going to get hate for this but; Computer scientists are so bad at choosing variable names. And i've seen this so many times.. In an interpreted language is doesn't make a difference and, still they insist on max 3 letters 😅.. (i'm saying this with a wink btw)
dark mode please!
All this without mentioning the standard Python "set" type which provides this functionality (and more). [Edit: as mentioned below, I was wrong about this]
All this without watching the video? It is mentioned at 9:00
@@Jester01 True. I was wrong. Something must have distracted me at that point.