How do Spell Checkers work? Levenshtein Edit Distance

Поделиться
HTML-код
  • Опубликовано: 27 сен 2024

Комментарии • 144

  • @zactron1997
    @zactron1997 3 года назад +142

    What's really interesting about the full-matrix version of this algorithm is when you compute the values you have the option to make a much smarter system for little extra work. Instead of weighting each edit operation equally, you can weight them based on the probability of the user having mistakenly performed the inverse operation. It's easy to type a "w" instead of and "e" on a QWERTY keyboard for example, but much harder to fumble an "m"
    Additionally, you could assign weights to the words themselves based on the probability the user would've typed the word at all. A user is far more like to type "absolutely" than "aardvark", for example.
    A very cool algorithm with lots of room for extension!

    • @WhatsACreel
      @WhatsACreel  3 года назад +32

      Brilliant insight! I think the original Lev used different weights too. It cost 2 to perform the sub operation, if I'm not mistaken. Your suggestion is very flexible!! Cheers for watching, and cheers for sharing :)

    • @nate_d376
      @nate_d376 3 года назад +6

      I was thinking about this very thing before I saw your comment. It breaks down a little if a user is guessing at the spelling though, instead of just a typo.
      Maybe flipping those weights if the weights get too large for common words, like if someone was guessing at spelling an uncommon word, thus it would do the comparison on them instead. Just a thought.

    • @mikec518
      @mikec518 3 года назад +6

      Yeah, cool modification idea!
      One thing to note with variations on the original unit distances idea, is that Levenshtein distance will no longer behave as a metric. In other words, the computed distance will no longer satisfy the triangle inequality.

    • @gummansgubbe6225
      @gummansgubbe6225 3 года назад +3

      People looking at protein sequences are doing that. One is called "Grantham's distance", but there are others.

    • @maxfinnian
      @maxfinnian 2 года назад +3

      Another very useful modification of this is to weight each character by their visual similarity in some font, that way you can easily account for mistakes in a machine learning OCR model. Surprisingly though there is little resources online for creating this visual edit distance, but it can be fairly simply implemented by printing each character of the font to a small image and calculating the SSIM metric between each character and normalizing them into a lookup matrix. This matrix can then be used in a modified levenshtein function for weighting each swap distance (I found that using the quartic of this matrix is the most useful). Because of how it is set up it is effectively a normalized pixel swap distance between the current word and the target word. This makes working with OCR mistakes such as I->l, I->T, 1->I, etc. much less of a pain.

  • @shards1627
    @shards1627 3 года назад +129

    "this right here is... ummm... code" why yes, yes it is creel, good job

    • @msoulforged
      @msoulforged 3 года назад +18

      Hmm, yes...the code here is made of code.

  • @TheTheHellbean
    @TheTheHellbean 3 года назад +82

    For a spellchecker, since you are checking the same word against your entire dictionary, you can also optimize by reusing the first bits of the matrix up until the nth row, where n is the length of the common substring between the word you checked before this and the one you are checking now. No need to recompute the entire matrix after checking "dinosaur" if you are now checking "dinosaurs"

    • @WhatsACreel
      @WhatsACreel  3 года назад +17

      Wow! That would certainly be much faster! Cheers for sharing this idea and cheers for watching :)

    • @mikec518
      @mikec518 3 года назад +3

      This idea lends itself well to the use of a trie or finite state transducer.
      Andrew Gallant has a really cool write-up on this idea, regarding his transducer library: blog.burntsushi.net/transducers/

  • @Stelios.Posantzis
    @Stelios.Posantzis Год назад +2

    Best quick explanation of the Ficher-Wagner method that is sufficiently detailed to convince one that it works and demonstrates how it works in less than ten minutes.

  • @bird536
    @bird536 3 года назад +26

    I love this algorithm, the first time I used it was for comparing strands of DNA, which took a lot of memory without optimizations. We had to compare two strings that were 10000000+ characters. You do the math for the space required.

    • @WhatsACreel
      @WhatsACreel  3 года назад +7

      Wow! Pure class mate! Well done ya :)

    • @idjles
      @idjles 3 года назад +15

      @@WhatsACreel You can massively speed up Lenvenshtein for very large strings with this recursive approach that I invented (I use this for comparing texts between two versions of a book). It is 1000 times faster for 1,000,000 elements and uses 1,000,000 times less memory. And it spits out the subs/insert/del for each DNA step.
      1. Break the DNA strand into groups of 5 or 10 or whatever.
      2. Build a hash of these groups and remove from the hash any duplicates. O(n)
      3. Do the same with the other strand. O(n)
      4. Compare the two hashes and remove elements that are not in the other. O(n)
      5. You now have a hash of elements that are unique in each strand and are also in the other strand.
      6. Now find the longest stretches in both that are indentical. O(n)
      7. Split the strand into the left substrand, the middle identical substrand and the right substrand. We are going to recurse the left and right sub-strands.
      8. If the longest matching section of a substrand is less than some value X then perform Levensthein on that sequence. O(n^2) on short sections. otherwise recurse from step 6.
      9. Calculate that best backtracking path by prefering the step the gets you closer to the origin each step) through the matrix and store into the Output. O(n)
      So with basic Levensthein, you have 1,000,000 characters and 10^12 comparisons in a matrix of 10^12
      With my method you have 100,000 sections, and you can be sure that the sections that don't match are under 1000. so the comparisions are 1000^2*1000 = 10^9, and the matrix is only about 10^6. This is 1000 times faster than basic Levensthein, and uses a million times less memory.
      You MUST keep the entire Matrix

    • @dstgre
      @dstgre 7 месяцев назад

      I am realizing that this operation can be also done piecemeal. And as someone else pointed out, the similar comparisons can be memoized out

  • @ir2001
    @ir2001 3 года назад +16

    Aside, this video could also be a very good introduction for those who are looking for the whys and hows of Dynamic Programming

  • @AnyVideo999
    @AnyVideo999 2 года назад +2

    For anyone looking for the general lesson to apply to other algorithms, this is a classic case where the first version is easy to understand, but as it compares the execution to the problem on several small sub-executions, usually the accumulative recursion version works better.

  • @waaaaaaah5135
    @waaaaaaah5135 3 года назад +9

    Levenshtein edit distance is how I got my current job :)

    • @mikec518
      @mikec518 3 года назад

      Wow really? Did they ask it during your interview or something?

    • @waaaaaaah5135
      @waaaaaaah5135 3 года назад +2

      @@mikec518 Nope, I was given a project to do, and the levenshtein edit distance was an important part of it

    • @mikec518
      @mikec518 3 года назад +1

      Sound like a lot of fun, hope that project goes well

  • @dstgre
    @dstgre 7 месяцев назад +1

    I like this explanation much better as a lot of the others are just passing on rote information with missing component of real understanding

  • @streetcube-x3h
    @streetcube-x3h 2 года назад +1

    The best explanation for the Levenshtein Edit Distance on the Interwebs. Kudos to you for explaining it so good!

  • @jeroenodb
    @jeroenodb 3 года назад +10

    He Creel, loved this video! I think the time it takes to evaluate the original levenshtein recurrence is actually omega(3^n) when you have two words of n size, so quite a bit slower than n cubed.

    • @WhatsACreel
      @WhatsACreel  3 года назад +6

      Ouch!! I knew it was slow, but that's really something! Cheers for sharing and watching :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 года назад +1

      well, you would usually write 2^n or e^n instead of 3^n since 3^n=e^{n ln(3)}

    • @jeroenodb
      @jeroenodb 3 года назад +4

      @@gideonmaxmerling204 Actually with logarithms this works, but with Theta(a^n) and Theta( b^n) it does not work. These are really different complexities. You can't rewrite 3^n as C*2^n. 3^n will eventually outgrow c*2^n whichever constant c you choose. Also my original bound is not totally correct, but say you have the strings aa...aa and bb...bb. This is an example where the running time is omega(3^n)

    • @user-qq6si7zv3t
      @user-qq6si7zv3t 2 года назад

      @@jeroenodb I'm not 100% certain, but as someone who has taken a course on algorithm design and analysis, I think it would be valid to simply write Theta(exp(n)) or Theta(e^n). The parameter of the exponential function is not important. All that matters is the pattern of growth. Another example, 3n is bigger than 2n, but we still just write Theta(n).

    • @klobiforpresident2254
      @klobiforpresident2254 2 года назад

      @@user-qq6si7zv3t
      I'll just for the moment drop Theta and use O (since Theta follows from O and Omega that should be fine).
      The was this works is that if we have a function f(n) describing the class in O, commonly n, n^2, log(n), and so forth, and a function g(n) describing the algorithm's running time then the algorithm is in O(f(n)) if, and only if, we can find a constant C and an n\_{0} such that C f(n) ≥ g(n) for all n > n\_{0}.
      This means g(n) = 2n and g(n) = 3n will be in O(n) if we set C = 3 (here n\_{0} 1). Indeed they're also in O(n^2) since for C = 3 and n\_{0} = 1 the same logic works.
      For logarithms we can change the Bede simply. logb(x) = loga(x) / loga(b). This means we can set C = 1 / loga(b).
      I am unaware of such a system for exponents. A previous post here talks about a way of introducing a constant factor into the exponent. That is indeed possible but that unfortunately does not suffice.

  • @MrUtak
    @MrUtak 3 года назад +1

    Man, I recently had to find a code of levenshtein Distance for excel because the quiz-level of my class didn't check for plagiarism. So I had to run it myself, on huge strings. And it works!!!

    • @MrUtak
      @MrUtak 3 года назад

      Checking the levenshtein distance between a whole document takes a lot of brain power for excel, but it does work remarkably fast. all things considered. It must use one of the later two algorithms.

  • @lepatenteux592
    @lepatenteux592 3 года назад +5

    That was really interesting! Thank you for your enthusiasm on such a niche subject!

  • @lcarliner
    @lcarliner 2 года назад

    There is one non-word that would trip up so many early technology spelling checkers was “irregardless!”

  • @black_platypus
    @black_platypus Год назад

    This is so satisfying! 🤩
    Abstracting away everything until we just need to check for equality of a character, this is giving me quite the software hard-on 😁

  • @mikec518
    @mikec518 3 года назад +2

    There are a few additional optimizations you can make to the Wagner-Fischer algorithm too!
    For instance, if you don’t need to keep track of the edit operations, you only need one row in memory at a time.
    Also, there is a threshold variation, requiring only distance subcomputations that are within the threshold away from the main diagonal.
    This is one of my favorite things to research in my spare time, thanks for making a video on it! :)

  • @MagicGonads
    @MagicGonads 2 года назад

    The checkerboard pattern made me think that "Spell Checkers" was a variant on the game of Checkers

  • @msoulforged
    @msoulforged 3 года назад +1

    This was very eye-opening and fun to watch, thanks! And we cannot wait for the priority queue and maybe few other important/common data structures video :)

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @tahacaliskanileenbasitseki4356
    @tahacaliskanileenbasitseki4356 4 месяца назад

    thank you so much from Turkey

  • @phillies4eva
    @phillies4eva 2 года назад +1

    Great video! Can you explain why Google nearly 100% of the time gives me the correct spelling when the default spell check fails?

  • @joaquinglez6889
    @joaquinglez6889 3 года назад +1

    Wow! You alway inspire me! Looks like a great algorithm to create some sort of super-grep version for linux, where you type a word.and you might add as a parameter how close you want the word of the returned line be similar to your original word. Now I wonder why is it not even there

  • @jackkensik7002
    @jackkensik7002 3 года назад +2

    great video, this is the best teaching of cs concepts I have seen keep it up!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @PeriOfTheGee
    @PeriOfTheGee 2 года назад

    What I believe could speed up the full matrix version and the 1 row too to a comparable speed, minus cache locality, is if we reused the same matrix instance rather than allocating a new one for each call.

  • @Dave_thenerd
    @Dave_thenerd 3 года назад +1

    This seems like it could be easily parallelized with Multi-Threading and AVX.

  • @karoshi2
    @karoshi2 2 года назад +1

    Did a full text indexing project once and actually had to implement a trie by hand. That's kind of a tree which you fill with all the words known from your dictionary: take the root node, link it to another one "consuming" the first letter, continue with the remaining string from then on and mark the final node for this word so you know it is contained in your dictionary.
    Once built up its rather simple to traverse: take the word you search for, see if there is a node following the first letter. If there is one, edit distance remains unchanged. Then follow all the other possible routes with increased distance.
    When you have a maximum edit distance it terminates quite fast: if the maximum distance is used up stop the process and terminate no result. Otherwise return all the results found so far.
    Recursive but quite easy to understand and implement, also somewhat object oriented.
    Was fun to implement and must admit I was quite proud as I came up with the algorithm and only then realised that it already had been invented the very same way but there was no implementation in the language we used at that time.
    Vanilla implementation that came with the search engine library did a brute force Levenshtein on every known word, then threw away those that were too distant. Took approx 30 seconds for every run. Trie took a few ms only.

  • @nytrex2001
    @nytrex2001 2 года назад

    So is this the algorithm they use for code-completion in IDE's when they list possible functions/commands you may be intending to type?

  • @coomservative
    @coomservative 2 года назад

    Can you do a video on how grammar checkers work? So much content about parsing/trees and I just want someone to walk through the problem first which is How Do You Make a Word Processor Detect Incorrect Grammar? The stuff out there just assumes familiarity with the concepts

  • @incription
    @incription Год назад

    I always thought they calculated this based on the distance to the key on the keyboard, I guess this is the case more for autocorrect

  • @redjr242
    @redjr242 3 года назад +2

    This algorithm looks to be symmetric, in that d(a,b)=d(b,a). Shouldn't that mean that you only have to compute half the matrix?

    • @WhatsACreel
      @WhatsACreel  3 года назад +2

      What a fantastic observation! It seems we could indeed get away with computing half the matrix! Nice one :)

  • @ricos1497
    @ricos1497 3 года назад

    Well, that was a good viewing. Thanks! Have you ever done a video on your background? Covering things like subjects you studied etc, the type of thing that has led you to do in your work and so forth. That would make for an interesting video.

    • @WhatsACreel
      @WhatsACreel  3 года назад

      I studied music and software engineering. My main area of interest is discrete maths, boolean algebra and sets, that kind of thing. Cheers for watching :)

    • @ricos1497
      @ricos1497 3 года назад

      @@WhatsACreel thanks for the reply! Interesting. I've always been too indiscreet with my maths. People are always saying that about me. Put your maths away Rico, you're making a fool of yourself, they say. Now I know where I went wrong.

  • @bpansky
    @bpansky 3 года назад

    spell checkers always tell me (even right now in this comment) that my writing of "dissapears" should be...dispersal...

  • @devtthompson3646
    @devtthompson3646 2 года назад

    Well presented with a great explanation and detail. Nice one. :)

  • @ChrisHalden007
    @ChrisHalden007 Год назад

    Great video. Thanks

  • @briandhackney
    @briandhackney 2 года назад

    Great content, love the channel, wish you did one a week ;)

  • @BlokeBritish
    @BlokeBritish 2 года назад

    im building a google sheets app for my users to do a search.
    i have a database of around 2 million terms or sentences.
    the user types something and all closely matching terms from the list must be displayed.
    will this be the best technique to use ?
    i just tested this and it seems really quick.
    my only concern is that it doesnt account for swapped letters, probably mistyped.
    for eg. the edit distance between
    'the' and 'teh'
    returns '2'
    now the words are 2 letters off but really its just a mistyped swap.
    i was thinking of comparing the users entry by additionlly swapping their search words too.
    so if they searched for 'debit'
    then i will additionally search for
    edbit
    dbeit
    deibt
    debti
    so even if the mistakenly swapped 2 adjacent letters i will catch that first.
    if not caught ,
    only then compare the edit distance.
    in my case the search could be a set of words.
    'debit card declined'
    so the results must include all those which have at least one word from the searched term.
    so something like google search.
    i think this levenstein techinique will be best suitable for me with little tweaks

  • @treyquattro
    @treyquattro 3 года назад

    from now on I'm calling dinosaurs "noshers"! "Did you see that T-Rex? It was the king of the noshers." Good on yer, mate!

  • @mohammedalsayed3508
    @mohammedalsayed3508 Год назад

    I wish you were my professor

  • @CheezePie
    @CheezePie 6 месяцев назад

    Great video.. you earned a subscriber!

  • @coder2k
    @coder2k 3 года назад

    Very interesting topic, great explanation as always! Thanks! :)

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @jupjyotsinghkhosla5681
    @jupjyotsinghkhosla5681 2 года назад

    Amazing Video!!!

  • @IronCandyNotes
    @IronCandyNotes 3 года назад +1

    ...so does the same work for images?

    • @mikec518
      @mikec518 3 года назад

      It would take a very long time, but yeah!
      This works on any two sequences, where each elements in one is comparable with each element of the other.
      For example, you could represent both images as sequences of bytes. For two 500px x 500px images, each sequence would be 250k bytes long, and the table would have 62.5 billion cells lol

    • @IronCandyNotes
      @IronCandyNotes 3 года назад

      @@mikec518 maybe using GPU or multiple cores? no idea just guessing here...

  • @quantumastrologer5599
    @quantumastrologer5599 2 года назад

    Thx!

  • @joinedupjon
    @joinedupjon 3 года назад

    Some problems seem to lend themselves to a recursive solution very naturally. But that's not necessarily efficient. I don't think you've shown how to optimise recursion or convert a recursive algorithm to iteration, or if there are ways of writing high level language that make it more likely the compiler will optimise some grievous recursion away.

    • @joinedupjon
      @joinedupjon 3 года назад

      Above is meant to be a suggestion for possible future episodes. Not trying to have a dig.
      I'm enjoying the series.

  • @yash1152
    @yash1152 Год назад

    10:53 will watch later
    it looks tooo similar to the LCS problem's solving algorithm

  • @mapclickerandy
    @mapclickerandy 3 года назад

    Brilliant video, keep it up !

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @tekhiun
    @tekhiun 3 года назад

    Hey Creel, have you ever tried using cheat engine ?? I am actually using it to learn asm and it would be interesting to see at least a video where you used a decompiler or CE it self to look at the ASM code of some software or game and hear what you have to say about it such as is it optimal code or not. Turns out hacking games can be as much fun as playing the game and I think you could make some quite interesting videos on the subject.

  • @Jkauppa
    @Jkauppa 2 года назад

    dont need, just construct the possible trees to get the distance

  • @lunalma
    @lunalma Год назад

    I thought this was a video about magic on 'checkers' (the boardgame)...

  • @DFPercush
    @DFPercush 3 года назад +1

    This could make a pretty good binary diff algorithm, could it not? Like when games update and need to patch huge files, this would be the way to do it, rather than downloading the entire gigabyte+ data.

    • @SolarShado
      @SolarShado 3 года назад

      Distributing software updates as diffs can get tricky , unless you're certain that everyone will only ever be one version behind. You have to keep old diffs around, for example v1->v2, _and_ v2->v3, and so on, which can get expensive (disk space is cheap, but not free). But then what if someone's updating from v1 to v3? Should you also have a diff for that? It's more storage on the server end, but, for larger version gaps, possibly less bandwidth (which is both cheaper _and_ faster).
      And you'll probably want to compress... something... to save bandwidth and/or disk space. You can compress the diffs easily enough (though they may or may not compress well), but do you diff compressed or unpacked versions of the files? A small change in an unpacked file could potentially result in a (much) larger change between the compressed versions. And compression isn't just used "on the wire"; often times part of the reason a program (especially games) take as long as they do to open is that they're decompressing some assets from disk into ram.
      I guess what I'm getting at is that it's a more complicated problem than it may, at first, appear. Don't get me wrong, binary diffs are absolutely used in such systems, but they're far from a magic bullet.

    • @meneldal
      @meneldal 3 года назад +1

      Binary diffs are more or less easy depending on what kind of files you have. Compression is where it tends to perform poorly. With a zip archive, as each file is compressed entirely independently, you can easily make a diff to fix just one file. But if the compression not per file but for the whole archive, unless you force your software to reuse the same dictionary, everything will be changed, so diffs aren't practical at all.
      For video assets, if you want to edit just a couple frames there are ways to avoid reencoding everything, but most people wouldn't bother so pretty much the whole video will have to be redownloaded.
      In the end, it depends a lot on the filesystem structure you use and how you compress your data.

    • @DFPercush
      @DFPercush 3 года назад

      @@meneldal Of course. Zip the diff, don't diff the zip. :p That's what git does.

    • @meneldal
      @meneldal 3 года назад

      @@DFPercush That's what makes sense for code, that doesn't mean it works best with all kinds of assets. You're probably not shipping code to consumers, it's compiled binaries.

    • @DFPercush
      @DFPercush 3 года назад

      True, but suppose you want to update one texture in a game assets package, or just a couple of polygons in a mesh. The rest of the (uncompressed) data would remain unchanged, so a good binary diff would take care of that. If you want an example see the Blizzard/Battle.net updater.

  • @KelvinNishikawa
    @KelvinNishikawa 3 года назад +1

    Could you A* your way from top-left to bottom-right in the matrix?

    • @mikec518
      @mikec518 3 года назад +1

      An interesting idea, but you don’t have the table values upfront.
      For each cell, you can keep track of which previous cell you used for the calculation, and once you reach the bottom right, simply follow the path backwards to get the edits necessary. No A* necessary :)

  • @cmilkau
    @cmilkau 3 года назад

    So an algorithm just applying dynamic programming to a pre-existing recursive definition has a name? Wow, I could get so many algorithms carry my name right now.
    EDIT: spelling

  • @casperes0912
    @casperes0912 3 года назад

    Fancy name for just using dynamic programming on the original algorithm

  • @gideonmaxmerling204
    @gideonmaxmerling204 3 года назад

    Hey creel, I was going through x86-64 instructions the other day for fun and saw this AVX512 instruction:
    www.felixcloutier.com/x86/vpternlogd:vpternlogq
    this instruction lets you perform custom 3-bit ternary logic on 3 zmmwords/ymmwords/xmmwords.
    do you have any idea for any good use case for this?
    I myself though that maybe a logic circuit simulator would be a ble to compile a circuit into a set of vpternlog operations and get some fucking intense performance but that would be very difficult to do and requires compiling code at runtime.
    do you have a better idea?

    • @WhatsACreel
      @WhatsACreel  3 года назад

      I reckon logic circuit simulations would be an excellent use!!
      I did a bunch of automaton simulations a while back. They worked with boolean/nand and an evolution algorithm. They generated code to divide and factor integers, and then killed each other or reproduced based on how accurate they were. Maybe this instruction would have helped that type of thing? It was very good fun, but my little automatons were very slow to train.
      It would be cool if there was a matching scattered load, or the imm8 could be an 8 bit reg.
      Great instruction :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 года назад

      @@WhatsACreel you could also modify the code at runtime.
      I thought that this would be useful for compiling circuits and running them intensely fast but that would not be easy and require runtime compilation.

    • @WhatsACreel
      @WhatsACreel  3 года назад

      @@gideonmaxmerling204 That's an excellent idea mate!! It would absolutely fly :)

    • @gideonmaxmerling204
      @gideonmaxmerling204 3 года назад

      @@WhatsACreel the only other use for it that I saw online is for calculating md5 and sha1 hashes but why would you do that?
      might also be able to do sha2 but I'm not sure about that.

  • @brianb2308
    @brianb2308 3 года назад

    That was awesome
    "C, a, b, r, a, b, a, b, a, b, b, b" -creel
    When you dropping your next record?

  • @coshvjicujmlqef6047
    @coshvjicujmlqef6047 3 года назад +1

    soviet union was the best country in history.

  • @nayjames123
    @nayjames123 3 года назад +14

    There's also another optimization available for things like spell checks. If you have a maximum allowed edit distance, you can break out once you know the edit distance will be greater than that. A trivial short circuit is if the length of 2 strings is greater than that, then do no work, but you can work out the smallest value in the row and take that as the minimum possible edit distance.

  • @RickeyBowers
    @RickeyBowers 3 года назад +14

    Nice presentation.
    The compiler doesn't know that the edit distance will never be greater than the maximum string length (probably < 256). This fact increases the parallelization possible.
    Tarjan's SCC algorithm also reduces the problem to a single array - could make an interesting video as well. (c:

    • @mikec518
      @mikec518 3 года назад +3

      The maximum possible edit distance for two sequences is the length of the longer. Maybe an assertion that both lengths are under 256 could help the compiler in that way?
      And that’s a really interesting fact about Tarjan’s SCC algorithm! I had no idea it was related

    • @meneldal
      @meneldal 3 года назад +1

      The actual maximum edit distance doesn't really matter for a spell checker.
      You ought to abort when you go over 5, don't bother computing more than that.

  • @bart2019
    @bart2019 3 года назад +2

    I was wondering, the scan of the dictionary itself is brute force, one word at a time.
    I assume there must be some way to scan the dictionary smarter, by subdividing it into groups of words that are rather similar, so you can eliminate a whole lot of them as not interesting simply because you found one word that is quite similar...?
    Well, I would hope so, though I would not immediately know how.

    • @user-sl6gn1ss8p
      @user-sl6gn1ss8p 2 года назад

      I think alphabetical order would actually be a decent start, since it order words by initial substring, which you can then reuse, so effectively you're only calculating one or a couple new letter combinations per word (supposing a "dense" dictionary). Depending on the relative size of the words this also lets you know that the distance can only grow if the non-fixed word grows. But maybe there's some sort of pre-treatment that could help extend that?

  • @zperk13
    @zperk13 3 года назад +4

    I wrote the first function (levenshtein) in code and also added a thing to track stuff. I did the "cabbages" and "rabbit" like you did. The deepest depth of the recursion calls was 14 and the number of times the function was called was 7,749

  • @cmilkau
    @cmilkau 3 года назад +1

    What you actually want to do is build an index for the dictionary that plays nicely with the edit distance, don't you? Do you really scan the entire dictionary computing edit distances?

  • @DrakiniteOfficial
    @DrakiniteOfficial 3 года назад +3

    Wow. I'm astonished that all that calculation can be done in 50 milliseconds. Modern processors are amazing.

  • @au7weeng534
    @au7weeng534 2 года назад +1

    the second one looks like a memoized version of the first one

  • @NicosLeben
    @NicosLeben 3 года назад +3

    Can you modify it in a way that some substitutions have a higher score than others? For example if you have a typo you mostly hit a character in the neighbourhood of the character you were targeting for. It is not very likely to substitute a Q by a O, it is more like you tried to hit W, A or S. Of course this depends on the keyboard layout you are using.

    • @klobiforpresident2254
      @klobiforpresident2254 2 года назад +1

      Yes, this is very much possible to do. Modern spellcheckers often do this. In theory it's as simple as giving adjacent keys (for example {r, t, d, g c} for f on a standard QWERTY or QWERTZ layout) a lower weight of e.g. half. More complex approaches exist but this should illustrate the point.

  • @burnsy96
    @burnsy96 7 месяцев назад +1

    _colour_

  • @willofirony
    @willofirony 3 года назад +2

    Awesome! So, its fast enough for autocomplete on an edit box; especially as the dictionary count will be in the 100's rather than 10,000's. I have GOT to play with that. Thanks, Mate.

    • @WhatsACreel
      @WhatsACreel  3 года назад

      I reckon it would be excellent for that application!

  • @klobiforpresident2254
    @klobiforpresident2254 2 года назад

    At 00:56 I, who is way too tired and forgot the video title, audibly wondered, "wouldn't they just use a weighted edit distance?"
    Yeah. Good night, folks.

  • @msclrhd
    @msclrhd 2 года назад

    Given that you are comparing multiple words, I wonder if it is possible to store them in something like a trie data structure and reuse computed Levenshtein distances for the same prefixes.

  • @devinschlegel1763
    @devinschlegel1763 2 года назад

    i know i am late, but if you have a massive dictionary, could you not spend time precomputing up to a certain length to save time?

  • @williamrutherford553
    @williamrutherford553 2 года назад +1

    Looking at the recursive definition actually gives you insight into the three boxes you look at! Instead of doing a recursive call, it's just the value to the right, up, or diagonal. When the characters are equal, that's just the call where you return the distance of tail(a) and tail(b). It's not really important to knowing the algorithm, but it's a great way of understanding WHY it works.
    I think this is also a great example of why recursive algorithms can be so inefficient, and why iterative is almost always better. The number of repeated operations is massive. Generating a Fibonacci sequence is another great example, a recursive function goes from the top down, an iterative solution goes from the bottom up. Just like how you can do Levenshtein distance in two or one rows, you can calculate Fibonacci numbers using just the previous two numbers.

  • @familytp8390
    @familytp8390 2 года назад

    I'm having problems running this code idk why

  • @PaulaBean
    @PaulaBean Год назад

    Very informative, like many of your videos!

  • @JohnnyWednesday
    @JohnnyWednesday 2 года назад +1

    You're a fantastic teacher - master of analogies - thanks for sharing your hard earned knowledge :)

  • @nate_d376
    @nate_d376 3 года назад +1

    Very cool. Thank you for explaining this. I guess us engineers find this stuff interesting.

  • @ElfinaAshfield
    @ElfinaAshfield 3 года назад +1

    Ah, the magic of dynamic programming :)

  • @akashparhi7004
    @akashparhi7004 3 года назад

    Thank you

  • @nightpicnicwii
    @nightpicnicwii 3 года назад

    Very cool!

  • @IronCandyNotes
    @IronCandyNotes 3 года назад

    neat

  • @aroncanapa5796
    @aroncanapa5796 3 года назад

    👍

  • @lx2222x
    @lx2222x 3 года назад

    HLSL tutorial when? 😁

  • @theexplosionist2019
    @theexplosionist2019 3 года назад

    62ms for "2 rows" for "aaaaaaaaaa" using std::string to store the dictionary.
    Using the "1 row" version and fixed size array chars instead of std::string still take 62ms.

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @damian_smith
    @damian_smith 3 года назад

    I enjoyed that - thank you. The algorithm assumes that dropping, adding, or mistyping to an arbitrary letter are all equally likely, but I suspect that's not a valid assumption and the actual error rates for the 3 classes of error, and for which letter gets swapped are different. Do other distance metrics for offering choices work by weighting those errors differently, or even via a different approach? I'm very impressed by the swiping keyboards on mobile phones, for example - I imagine they might be doing something like this?

    • @mikec518
      @mikec518 3 года назад +3

      Yeah, you could probably adjust the value added on mismatch based on anything you wanted, like distance on a keyboard.
      Though, keep in mind that this will probably affect the triangle-inequality conforming property of the standard Levenshtein distance

  • @alex_smallet
    @alex_smallet 3 года назад

    Beethoven notes on the table adds credibility.

  • @darske1
    @darske1 3 года назад

    Man, your videos are awesome, thanks and keep it up!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @-notakil
    @-notakil 3 года назад

    Very interesting!

    • @posterfemiemmanuel1039
      @posterfemiemmanuel1039 3 года назад

      Beloved, I don't know you in person but God knows you. God ministered to me in a revelation when I was on your profile to see things around you,I saw blessings but spiritual attacks holding onto them,in prayers,i saw a woman in the realm of the spirit monitoring and plotting delay in your life, with an evil mirror, and with motive to destroy. But as I speak to you now her time is up, Render hand of favour with Anything you can afford or give to these motherless foundation (Godstime MOTHERLESS FOUNDATION) in kebbi state Nigeria before 2DAYS with faith, as I Rise my hands towards heaven and pray for you they shall serve as point of contact wherever you are, you will receive double portion of grace to excel and total restoration of breakthrough in your life and in the life of your family. Ask for their acct details and help them call the MD in charge of the orphanage to get their details on (WhatsApp or call them now on +2349056513338) him I sent an you. For it is not by might nor by in power but of the spirit faith the lord (zechariah 4:6). You shall testify to the Glory of God in your life. God bless you.

  • @lemdixon01
    @lemdixon01 3 года назад +1

    Annoyingly.

  • @DukePaprikar
    @DukePaprikar 3 года назад

    11:40 Isn't it three edits?
    1) delete C
    2) delete the second B
    3) insert R
    How could you do it in two edits?
    Edit (pun intended): I forgot that one of the allowed operations is substitution. Btw, that's cheating.

  • @diconicabastion5790
    @diconicabastion5790 3 года назад

    Interesting kind of poor system though. A Phonetic comparison would be better.