Puzzled by this. Rust has _no_ standard regex library, and rust-lang discussion groups trumpets this as a feature, not a bug. Or has something changed.
I now understand why i'm good at regex where most people fail I never use backreferences Like idk why but they never felt right logically Even when i made crazy things like a youtube URL parser to clean those in regex, i've found ways to just do it without backreference when i could have used some And it's kinda cause when i'm building the regex, i'm running it in my mind and backreference just makes that impossible Like tracking what it does become too complex So big thanks for this vid, very informative and great !
for most CFGs I've always found it easiest to just build a tokenizer with regex and a parser with other techniques leaving the abstract computer science world and entering practical software engineering world - it is possible to express any CFG as just that, a parser, but the tokenizer/parser distinction often yields faster, more readable code with advisable properties for most practical applications of parsers (e.g. DSLs such as SQL or serialization formats such as JSON)
@@JordanManfrey I've had teachers and profs like this before. totally disconnected from the real world. they treat regexes like the devil, while pretending that memory safety errors in languages like C/C++ are a non-issue. regexes are an incredibly powerful tool if you know how and where to use them. definitely, part of "how and where" is knowing when you as a programmer, or your team, may be too error-prone to use the tool in confidence. I definitely would avoid using regex, or at least seriously consider and verify my use of it, in any high-availability or high-reliability context; but it's not so bad, all things said and done
@@EunakriaBizarrely enough, I've more often encountered new learners going the other direction - they "in practice" want to abuse regex features, because it's built in, and looks powerful, so it must be fast. (Think "I built a calculator in Minecraft" style shenanigans, and Parker's usual stuff for that matter.) And parsing and tokenizing is more of an abstraction for them, especially if they never got a sense for how to use the toolkit of functional programming.
I can't believe this video has only 31K views. All this work the amazing visualizations, the quality of the explanations, the lined exercises in the description. I truly hope all this work would get rewarded some day. Thank you so much.
I've always had a sneaking suspicion that regexes were too versatile and did too many things to actually be performant, turns out I was right though I did not expect back references to break the entire concept of a language being regular will probably use Google's regex engine in the future since I barely ever use backreferences anyway
i got a (.*?) tattoo when i barely knew what regexes were lol. thank you so much for this series, it has done wonders for my ability to live with that decision (and is also some of the best comp sci content I've ever seen on this platform
@@TVDaJaaccepts anything, but makes everything after it backtrack as much as possible if it fails. An incredibly efficient way to be as inefficient as possible!
Why not use the lockstep algorithm, when the regex has no backreference? It would be easy to just store a boolean with the automaton, that indicates whether it has backreferences, and pick the algorithm accordingly. This would limit catastrophic cases to regexes that actually use backreferences, which could be taught as something to be avoided.
wanted to comment this, but looked through the comments first. Seems very obvious, but I can't think of the reason why. Maybe the automata structures get reused when possible and you can't know where to use which engine.
The only reason I can imagine is that with and without backreferences would be two literally different algorithms and it's going to be hard to make the algorithms otherwise have exactly the same outputs for the same patterns & inputs. I've never used a backreference before, I can't really conceptualize why I would, but in part that's because I already know that regex can't parse html or email addresses, so I've never tried to do a thing you would use them for with regex; the obvious answer to me is basically what you said, but instead of a flag, the language has two literally distinct regex implementations that you're responsible for picking between, and writing regex that produces the result you want for the one you're using. One is fast, one has this weird trick. I do a lot of manual regexing stuff in text editors, spreadsheets and microsoft's PowerRenamer, so I'm already used to tailoring patterns and methodology to the environment I'm in. PowerRenamer in particular has an alternate engine it can use that's supposedly faster, but they're not 100% congruent in features and behaviors. In particular, PowerRenamer's default, a wrapping of .Net regex does something wrong with lookbacks so if you want lookbacks to work, you have to use the other one.
This might arguably be the best solution, but it's still far from ideal. Adding a backreference would completely change the performance characteristics of the regex, which is pretty unintuitive. It's better to have lockstep by default and force the programmer to use some special syntax for regexes with backtracking to make it clear that they have different performance characteristics. You might also want to use the ones with backtracking in some special cases anyway, since they can have a better best-case performance on certain input strings.
I use a regex to finite state machine code generation tool occasionally for tricky problems, and it's always violently obvious when I've accidentally added nondeterminism to my input regex - the state machine that comes out of the other side blows up spectacularly.
My personal position is that once one has spent a few hours trying to convince various HTML parsing libraries to _only_ parse the input string and stop reformatting it, parsing with a regex starts looking pretty good.
This is amazing!! I read about Thompson's algorithm last week when I was studying non-deterministic automata and the fact that regex engines in most modern and/or popular programming languages are slower than it and suffer from exponential blowup for longer expressions (if I remember correctly). The visualizations of algos was great and helpful in understanding them. Thumbs up for that!! All this increases my respect for these giants: programming all these using ed, the standard editor, on a teletype connected to a computer which was much slower than our present day handheld gadgets.
What a great overview of this! Great refresher of things I haven't though much about since college, and explained more concisely than any of my professors managed to.
I first learned about the concept of regular expressions at university. (yeah, my interests are not in parsing text, so i literally never saw them until then.) I was taught about the whole hierarchy, with context-free grammars and all that. a year later, i am working on a python project made by other students across the years, and it uses regular expressions all over the place. i sit down and properly learn about them, immediately notice that almost all of the code can be improved (such as some regexes potentially matching the wrong things), and of course, notice that the entire concept of matching groups and back references just... doesn't make sense. it's called a regular expression, but it actually isn't? the fact that every regex engine is forced to use the backtracking algorithm Precisely Because modern every-day regex isn't a regular language is just the cherry on top. this video mentions that people want to use regex to match html tags... just write a parser!? html is dead simple, it's the easiest thing on the planet to parse. i have manually written parsers for more complex things for fun. besides, i never liked the way regex strings look. they are just this ugly mess of punctuation and backslashes. i know some people who exclusively let ChatGPT write regex strings for them because they trust an LLM more with that mess than themselves.
0:40 lemme guess, backtracking is that least efficient method most Rgex engines use to support the backreference matching, with exception of RE2 (in c++) made by google's engineer with doesnt support that feature. Fun fact, backref matching is not a regex capability in first place.
This was a fantastic video and I did the first exercise and found strings that caused issues that made them go over 5 seconds. Most of the strings were well under 80 characters and I think the max was around 300. I likely wasn't efficient with the 300 one and could have made a shorter one. I think that this really made me understand the issue much better, but I will have to look into some more. Thank you, for making such an excellent set of resources.
watching this video is making me very glad i've spent the past few months reimplementing my friend's 32-line sed across multiple hundred json files into a program that actually parses and understands json properly. who would've guessed that this works infinitely better and is infinitely more capable?
I want to meet Kay Lack Turing Award Winner. I'll wait a bit. So much fun watching early 'go' videos about how Russ Cox, Rob Pike, Ken Thompson and Rob Griessimer used their decades of experience. Plus lots more refugees from Bell Labs....
video got me to start using this option in C#: [GeneratedRegex("[^a-zA-Z0-9]", RegexOptions.NonBacktracking)] from microsoft docs: Enable matching using an approach that avoids backtracking and guarantees linear-time processing in the length of the input.
Do i know enough to understand this video? No, absolutely not. Did your voice and explanations make this a really calming video? Yes, 100%. So despite the fact I'm probably not your target audience, I'm absolutely dropping a sub 😄
seems like plan9 is the branch of reality where ken thompson and ross cox fixed their regexps. I was confused why it was always greedy and didn't have back references.
I can only validate you by experience and not the full level of theory you know, but I’ve had to refactor SEVERAL notebooks containing regex once I learned about what you’re talking about (albeit not as well as you’ve explained it here) please take this as a sign you are not just teaching people who know nothing, but also people who know some, but need to be better (people like me!) Thanks!
Some things are problems that need solutions. Some things are solutions that are looking for a problem. RegExes are problems trying to convince you they are a solution.
I haven't done the math so I am not sure about this, but my guess would be that the resulting deterministic automata would be the same as traversing with backtracking through the nondeterministic one. The memory consumption of a deterministic automata can also grow quite quickly, so it would likely let you write OutOfMemory dos attacks.
regex defaults always seemed weird to me, the usual case is N serial chunks of: start parsing when x, stop parsing when y, check that what is captured along the way follows some rules. I feel like this idea could be expressed so much better if regex didn't try to be so smart, it would also make the runtime faster while avoiding the troublesome edge cases. I'd be really interested to find out if there's a legitimate production use case for the smarter regexes (even the 'a repeated pattern from start to end' one you mentioned seems far-fetched to me)
I love your channel! Just recently wrote my first parser in Fortran, language everyone tells me that I should not waste my life on, but it was possible!
Actually a really good idea by the way, many people would recommend more modern alternatives like Python or Julia in terms of languages (even though they have enough differences not to necessarily be considered straight up replacements) for it but considering the time in which it came out, I would say FORTRAN is more akin to Lisp than COBOL in terms of it's use cases. Alike Haskell, probably wonderful to look at not necessarily for it being a swiss army knife (sometimes even getting outclassed in terms of specialization when it comes to which language is "the right tool for the job") but for what it does theoretically and technically, as a construct that holds up throughout time in that fashion. I'm wishing I also get to using it sooner or later alike Python and Julia, although the latter two from what I've gathered are probably going to come first for me. Doesn't change the fact that it's a language to take notes from nonetheless.
Setting aside the maintenance of two code paths, it seems like a regex library should be able to detect if back references are used, and if not, use the more efficient algorithm. Are there any of that do that?
It seems like a good idea to do lockstep by default and then go to backtracking if a back reference is included in the regex? Most of the time you don't use back references, but if you do it's very useful, but obviously then causes the inability to use the efficient algorithm. Since regexes are usually compiled ahead of time, the algorithm can be chosen ahead of time depending on if back references are included. Just include a warning in your documentation that backrefs can cause major slowdown.
yeeah this is why I avoid regexes in most languages. as soon as they break the limits of their chomsky hierarchy I can't seem to wrap my head around them. they turn into this brutally terse unreadable embedded language that I find impossible to maintain. great explanations in this video. I was under the assumption that regex parsing already has exponential time complexity but learning that that is only actually tied to the "bonus" features that are the very reason I avoid regex was enlightening!
I think it's pretty common in search stuff to evaluate regexes as finite automata. This lets them intersect the automata with the search index to iterate all marching strings. That's nice because you can OR together the list of matching documents for all strings and get the list of matching documents for the whole regex. Also interesting, there is another sort of attack. Usually these algorithms want a DFA but regex make an NFA. Converting between the two *can* use a ton of memory. I accidentally crashed search on Wikipedia using it. That was exciting.
Thank you so much for producing this kind of content! Seconding those who enjoyed the transition to the html srackoverflow post. You explain things very well
1. use a regex that matches if a regex causes catastrophic backtracking 2. if it doesnt, match to see if it uses backreferences 3. if it uses backreferences, use backtracking 4. if it doesnt use backreferences, use lockstep
thanks for the interesting video! i used to hate regex cause it looked like arcane magic but the theory is so interesting. I've been trying to figure out how to implement capturing groups and I'm a bit stuck. I have some ideas about storing more state and I've optimized it so that it only takes O(capture groups^2) more memory but I don't know if that's good or not. I skimmed Russ Cox's papers you linked and I can't seem to find the part about implementing capture groups. Could you point me to which section talks about capture groups or some other link about that?
Hi Lucas! This is the one in which he talks about it: swtch.com/~rsc/regexp/regexp2.html - roughly the idea is to treat each active state as a ‘thread’ and then maintain a list of all the unique threads active. You don’t have to go the whole way implementing a virtual machine to get something to work but it’s an interesting exercise. If that doesn’t help though drop me an email at hello@0de5.net and I can share some code.
Being unable to negate generation in the same way you can negate recognition is reminiscent of how SAT is in NP, but UNSAT is in EXP because there’s no way to read a sufficient proof in non-exponential time.
I just avoid regexes as first thing when I parse something. If you ever thought to parse XML *or* HTML (they are quite not the same and I hate it) with regexes, consider to read specs on them, realise that people who wrote all that crap are just sociopaths and hate you, and go back to just splitting file and writing your automata. At least in that case you will find a bug right away or find it more easielly, almost always you will know what parts of the spec you are ignored, and in best case it will be much simpler and faster since you can say «nagh, I know I parsing this xml from that program, I just gonna be sure to preserve information that program produces and just skip parsing of the XML all-together».
If I wanted to right unreadable code, I'd enter an obfuscated C contest. I don't care what the theory is. I don't care how efficient you can make a regEx implementation. If you want to make things as efficient as possible you can always write in assembler, but people don't do that unless they absolutely have to (or for fun) and the reasons are obvious. For those same reasons, try to avoid regEx and realize they are from another time and solved a different problem than the ones we need to solve today.
Some tools probably do, but the specific website-taking-down cases were all backreferences related, so it wouldn't help with the really icky edge cases.
Maybe because I learnt compilers (Dragon book, yay!) before I programmed in a language (or used an OS) that had regexes, I always thought people coverted their NFAs into DFAs before recognition.
I think at best you can do a poor-man's tokenizer somewhat sanely using regexps. The whole grammar includes full recursion etc so no you can't properly parse arbitrary html just using a regexp. You can scrape stuff from html that follows known pattern or template but that can be extremely fragile when even minor changes to the format is done.
nothing is black and white, nothing is known for sure, our concept of objective is how repeatable something is in one or more cases but there will always be one or more cases that repeatable thing is not... repeatable. When we say "yes" in response to a "does this happen?" actually means, "More than likely yes"
You keep saying "in big-O notation..." but then giving the time complexity's *name* rather than its big-O notation... e.g. quadratic time would be O(n²)
Regular Expression. It's a computer program that lets you tell if a piece if text fits a common format, such as an email address, a phone number, etc. Theoretically it works with any kind of text pattern. The downside is that it looks nonsensical to a human reader
This video, while I would love to learn about regex efficiency as I use regex a lot, kinda lost me in talking a lot about different Algs and backtracking and such and not really relating it back to regex. I basically just went "ok, so just don't use lookahead or lookbehind probably?" Edit: I just noticed that every graph you show past like, halfway into the video, is showing what regex expression it is. Now I feel dumb.
Why does it matter at all? Stop transvestigating her. She's a woman, you don't need to know more than that. It's just you satiating your curiosity about the privacy of her life. This is the problem with society, you say we shove it down your throats - but you bring it up every opportunity you can and treat it like a dumb detective act. Let people have privacy.
@@humanfactorsio honestly if you can't answer that question but also are showing your appearance, you're not doing privacy right. I think it's a natural question to ask when you can look at her but then after hearing her you can't make the match.
Because they wont be jailed for it unlike in your country
Месяц назад
If You're so smart, just write better and be the hero. Huh? I guess not a chance, right? The complexity and unefficiency probably has its purpose. Well. But You just needed views, right?
The video literally explains the use of backtracking, and how removing it could lead to safer, faster code, admitting that doing so would involve removing back references and their expressive power. Besides, it's comparing the performance between two algorithms, both of which were written by the same person, so it's hardly a callout.
You are genuinely one of the greatest video creators I've seen in so long (also you are goals in general girl 🥹💕). I have a massive holiday watchload now. Thank you so so so so much for your hard work
i understand nothing but lowkey enjoying what is happening
if you keep at it you'll understand most of it pretty soon
kay is so pretty how am i supposed to focus 😵💫
That’s most developers experience using RegEx in a nutshell
It makes more sense if you've watched the first two videos :)
Please don't stop uploading. I found this channel by chance and it has to be one of the best things happening this year
Sounds like you should consider sponsoring the channel so it doesn’t happen 😊
FYI, Rust's standard regex library uses an O(mn) algorithm without backreferences.
Puzzled by this. Rust has _no_ standard regex library, and rust-lang discussion groups trumpets this as a feature, not a bug. Or has something changed.
The de-facto default regex crate "regex" does not implement look-around or backreferences
@@GregWhiteley The regex crate is not de jure standard, but de facto the one everyone uses.
I now understand why i'm good at regex where most people fail
I never use backreferences
Like idk why but they never felt right logically
Even when i made crazy things like a youtube URL parser to clean those in regex, i've found ways to just do it without backreference when i could have used some
And it's kinda cause when i'm building the regex, i'm running it in my mind and backreference just makes that impossible
Like tracking what it does become too complex
So big thanks for this vid, very informative and great !
for most CFGs I've always found it easiest to just build a tokenizer with regex and a parser with other techniques
leaving the abstract computer science world and entering practical software engineering world - it is possible to express any CFG as just that, a parser, but the tokenizer/parser distinction often yields faster, more readable code with advisable properties for most practical applications of parsers (e.g. DSLs such as SQL or serialization formats such as JSON)
I’ve never used them because when I learned them in school they would bring up stuff like that as being dangerous bullshit
@@JordanManfrey I've had teachers and profs like this before. totally disconnected from the real world. they treat regexes like the devil, while pretending that memory safety errors in languages like C/C++ are a non-issue.
regexes are an incredibly powerful tool if you know how and where to use them. definitely, part of "how and where" is knowing when you as a programmer, or your team, may be too error-prone to use the tool in confidence. I definitely would avoid using regex, or at least seriously consider and verify my use of it, in any high-availability or high-reliability context; but it's not so bad, all things said and done
@@EunakriaBizarrely enough, I've more often encountered new learners going the other direction - they "in practice" want to abuse regex features, because it's built in, and looks powerful, so it must be fast. (Think "I built a calculator in Minecraft" style shenanigans, and Parker's usual stuff for that matter.) And parsing and tokenizing is more of an abstraction for them, especially if they never got a sense for how to use the toolkit of functional programming.
Maybe your brain is simulating a finite automaton when doing regex.
YT algorithms just decided that my constant work with regex deserves this video. Thank you algorithm.
It was very pleasant to watch.
I can't believe this video has only 31K views. All this work the amazing visualizations, the quality of the explanations, the lined exercises in the description. I truly hope all this work would get rewarded some day. Thank you so much.
Reminded of what JWZ said: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
I've always had a sneaking suspicion that regexes were too versatile and did too many things to actually be performant, turns out I was right
though I did not expect back references to break the entire concept of a language being regular
will probably use Google's regex engine in the future since I barely ever use backreferences anyway
i got a (.*?) tattoo when i barely knew what regexes were lol. thank you so much for this series, it has done wonders for my ability to live with that decision (and is also some of the best comp sci content I've ever seen on this platform
that's a really good idea for a tattoo
Someone please explain for my lil brain
@@TVDaJaaccepts anything, but makes everything after it backtrack as much as possible if it fails. An incredibly efficient way to be as inefficient as possible!
@@SimonBuchanNz u can just put it between stuff
@@9_-_-_-_-_swo yeah, it's fine for input you know is going to be a reasonable size.
Why not use the lockstep algorithm, when the regex has no backreference? It would be easy to just store a boolean with the automaton, that indicates whether it has backreferences, and pick the algorithm accordingly. This would limit catastrophic cases to regexes that actually use backreferences, which could be taught as something to be avoided.
wanted to comment this, but looked through the comments first. Seems very obvious, but I can't think of the reason why. Maybe the automata structures get reused when possible and you can't know where to use which engine.
It's possible some systems do, these days. The problematic cases all have backreferences, so it wouldn't actually help in those edge cases.
The only reason I can imagine is that with and without backreferences would be two literally different algorithms and it's going to be hard to make the algorithms otherwise have exactly the same outputs for the same patterns & inputs. I've never used a backreference before, I can't really conceptualize why I would, but in part that's because I already know that regex can't parse html or email addresses, so I've never tried to do a thing you would use them for with regex; the obvious answer to me is basically what you said, but instead of a flag, the language has two literally distinct regex implementations that you're responsible for picking between, and writing regex that produces the result you want for the one you're using. One is fast, one has this weird trick.
I do a lot of manual regexing stuff in text editors, spreadsheets and microsoft's PowerRenamer, so I'm already used to tailoring patterns and methodology to the environment I'm in. PowerRenamer in particular has an alternate engine it can use that's supposedly faster, but they're not 100% congruent in features and behaviors. In particular, PowerRenamer's default, a wrapping of .Net regex does something wrong with lookbacks so if you want lookbacks to work, you have to use the other one.
This might arguably be the best solution, but it's still far from ideal. Adding a backreference would completely change the performance characteristics of the regex, which is pretty unintuitive. It's better to have lockstep by default and force the programmer to use some special syntax for regexes with backtracking to make it clear that they have different performance characteristics. You might also want to use the ones with backtracking in some special cases anyway, since they can have a better best-case performance on certain input strings.
They can and do. That boolean flags the difference between O(n) and O(2^n). That's the problem.
I use a regex to finite state machine code generation tool occasionally for tricky problems, and it's always violently obvious when I've accidentally added nondeterminism to my input regex - the state machine that comes out of the other side blows up spectacularly.
I studied this 4 years ago at uni and totally forgot this is how we reason about regexes! Super helpful video thank you!
Where [regex] blood makes [regex] blood unclean! First-time watcher and this was great.
My personal position is that once one has spent a few hours trying to convince various HTML parsing libraries to _only_ parse the input string and stop reformatting it, parsing with a regex starts looking pretty good.
Thank you for sharing the stackoverflow answer, I’ve never felt so seen
This is amazing!! I read about Thompson's algorithm last week when I was studying non-deterministic automata and the fact that regex engines in most modern and/or popular programming languages are slower than it and suffer from exponential blowup for longer expressions (if I remember correctly). The visualizations of algos was great and helpful in understanding them. Thumbs up for that!!
All this increases my respect for these giants: programming all these using ed, the standard editor, on a teletype connected to a computer which was much slower than our present day handheld gadgets.
What a great overview of this! Great refresher of things I haven't though much about since college, and explained more concisely than any of my professors managed to.
I'm really drunk and I don't know what you're talking about but I'm enjoying it
Best way to enjoy regexes, really.
I first learned about the concept of regular expressions at university. (yeah, my interests are not in parsing text, so i literally never saw them until then.) I was taught about the whole hierarchy, with context-free grammars and all that. a year later, i am working on a python project made by other students across the years, and it uses regular expressions all over the place. i sit down and properly learn about them, immediately notice that almost all of the code can be improved (such as some regexes potentially matching the wrong things), and of course, notice that the entire concept of matching groups and back references just... doesn't make sense. it's called a regular expression, but it actually isn't? the fact that every regex engine is forced to use the backtracking algorithm Precisely Because modern every-day regex isn't a regular language is just the cherry on top.
this video mentions that people want to use regex to match html tags... just write a parser!? html is dead simple, it's the easiest thing on the planet to parse. i have manually written parsers for more complex things for fun.
besides, i never liked the way regex strings look. they are just this ugly mess of punctuation and backslashes. i know some people who exclusively let ChatGPT write regex strings for them because they trust an LLM more with that mess than themselves.
0:40 lemme guess, backtracking is that least efficient method most Rgex engines use to support the backreference matching, with exception of RE2 (in c++) made by google's engineer with doesnt support that feature.
Fun fact, backref matching is not a regex capability in first place.
This was a fantastic video and I did the first exercise and found strings that caused issues that made them go over 5 seconds. Most of the strings were well under 80 characters and I think the max was around 300. I likely wasn't efficient with the 300 one and could have made a shorter one.
I think that this really made me understand the issue much better, but I will have to look into some more.
Thank you, for making such an excellent set of resources.
10:25 this transition cracked me up haha great video!
great video, i read that russ cox page years ago and it was fascinating. its surprising that so many languages still use slow regex algorithms
Chomsky hirearchy is not a guideline. If you adhere to proper definitions, it is provable. My formal languages teacher would tear you apart for this 😂
watching this video is making me very glad i've spent the past few months reimplementing my friend's 32-line sed across multiple hundred json files into a program that actually parses and understands json properly. who would've guessed that this works infinitely better and is infinitely more capable?
I want to meet Kay Lack Turing Award Winner. I'll wait a bit. So much fun watching early 'go' videos about how Russ Cox, Rob Pike, Ken Thompson and Rob Griessimer used their decades of experience. Plus lots more refugees from Bell Labs....
video got me to start using this option in C#:
[GeneratedRegex("[^a-zA-Z0-9]", RegexOptions.NonBacktracking)]
from microsoft docs:
Enable matching using an approach that avoids backtracking and guarantees linear-time processing in the length of the input.
Do i know enough to understand this video? No, absolutely not. Did your voice and explanations make this a really calming video? Yes, 100%. So despite the fact I'm probably not your target audience, I'm absolutely dropping a sub 😄
seems like plan9 is the branch of reality where ken thompson and ross cox fixed their regexps. I was confused why it was always greedy and didn't have back references.
I can only validate you by experience and not the full level of theory you know, but I’ve had to refactor SEVERAL notebooks containing regex once I learned about what you’re talking about (albeit not as well as you’ve explained it here)
please take this as a sign you are not just teaching people who know nothing, but also people who know some, but need to be better (people like me!)
Thanks!
yikes i think i missed some prerequisites
Haven't even watched this yet but a new Kay Lack video is automatic thumbs up.
Really interesting stuff, thank you for sharing.
Use the KISS approach when creating your regexes and you'll be fine. For anything extremely complicated, use a parser library.
Or just parse. It's not that hard, really.
Using regex is like the opposite of KISS. Parsing is as simple as you can get
Some things are problems that need solutions. Some things are solutions that are looking for a problem.
RegExes are problems trying to convince you they are a solution.
Why are nondetermenistic automata not compiled to deterministic ones? Wouldnt it solve the algorithmic complexity?
I haven't done the math so I am not sure about this, but my guess would be that the resulting deterministic automata would be the same as traversing with backtracking through the nondeterministic one. The memory consumption of a deterministic automata can also grow quite quickly, so it would likely let you write OutOfMemory dos attacks.
Given an NFA you can make a DFA that is equivalent, however the number of states in the DFA may be exponentially large.
how can it grow quickly? you just need to store the state and the rest string@@imacds
@@Vaaaaadimyeah this, it just sets off the “I’m creating a monster” alarm bells so you don’t do it
It's mathematically impossible, as far as we're aware
regex defaults always seemed weird to me, the usual case is N serial chunks of: start parsing when x, stop parsing when y, check that what is captured along the way follows some rules.
I feel like this idea could be expressed so much better if regex didn't try to be so smart, it would also make the runtime faster while avoiding the troublesome edge cases.
I'd be really interested to find out if there's a legitimate production use case for the smarter regexes (even the 'a repeated pattern from start to end' one you mentioned seems far-fetched to me)
I love your channel! Just recently wrote my first parser in Fortran, language everyone tells me that I should not waste my life on, but it was possible!
Actually a really good idea by the way, many people would recommend more modern alternatives like Python or Julia in terms of languages (even though they have enough differences not to necessarily be considered straight up replacements) for it but considering the time in which it came out, I would say FORTRAN is more akin to Lisp than COBOL in terms of it's use cases. Alike Haskell, probably wonderful to look at not necessarily for it being a swiss army knife (sometimes even getting outclassed in terms of specialization when it comes to which language is "the right tool for the job") but for what it does theoretically and technically, as a construct that holds up throughout time in that fashion.
I'm wishing I also get to using it sooner or later alike Python and Julia, although the latter two from what I've gathered are probably going to come first for me. Doesn't change the fact that it's a language to take notes from nonetheless.
Setting aside the maintenance of two code paths, it seems like a regex library should be able to detect if back references are used, and if not, use the more efficient algorithm. Are there any of that do that?
Is there a way to analyze regexp to use the backtracking algorithm only when back references are used ? It doesn't to seem to be used that often.
It seems like a good idea to do lockstep by default and then go to backtracking if a back reference is included in the regex? Most of the time you don't use back references, but if you do it's very useful, but obviously then causes the inability to use the efficient algorithm. Since regexes are usually compiled ahead of time, the algorithm can be chosen ahead of time depending on if back references are included. Just include a warning in your documentation that backrefs can cause major slowdown.
yeeah this is why I avoid regexes in most languages. as soon as they break the limits of their chomsky hierarchy I can't seem to wrap my head around them. they turn into this brutally terse unreadable embedded language that I find impossible to maintain.
great explanations in this video. I was under the assumption that regex parsing already has exponential time complexity but learning that that is only actually tied to the "bonus" features that are the very reason I avoid regex was enlightening!
I think it's pretty common in search stuff to evaluate regexes as finite automata. This lets them intersect the automata with the search index to iterate all marching strings. That's nice because you can OR together the list of matching documents for all strings and get the list of matching documents for the whole regex.
Also interesting, there is another sort of attack. Usually these algorithms want a DFA but regex make an NFA. Converting between the two *can* use a ton of memory. I accidentally crashed search on Wikipedia using it. That was exciting.
Thank you so much for producing this kind of content! Seconding those who enjoyed the transition to the html srackoverflow post. You explain things very well
1. use a regex that matches if a regex causes catastrophic backtracking
2. if it doesnt, match to see if it uses backreferences
3. if it uses backreferences, use backtracking
4. if it doesnt use backreferences, use lockstep
Cool gem of a channel
14:30 appreciate this story. I never realized it was Stream Ed, which is cool as I use sed a lot.
Has anyone done the quantum computation version of this?
Great video! I liked very much "... from somebody called Rob Pike..." 🙂I'm sure he doesn't mind!
keep up the great work
"I object to binary searches because I'm non-binary" -Pasty Zoomer
thanks for the interesting video! i used to hate regex cause it looked like arcane magic but the theory is so interesting. I've been trying to figure out how to implement capturing groups and I'm a bit stuck. I have some ideas about storing more state and I've optimized it so that it only takes O(capture groups^2) more memory but I don't know if that's good or not. I skimmed Russ Cox's papers you linked and I can't seem to find the part about implementing capture groups. Could you point me to which section talks about capture groups or some other link about that?
Hi Lucas! This is the one in which he talks about it: swtch.com/~rsc/regexp/regexp2.html - roughly the idea is to treat each active state as a ‘thread’ and then maintain a list of all the unique threads active. You don’t have to go the whole way implementing a virtual machine to get something to work but it’s an interesting exercise. If that doesn’t help though drop me an email at hello@0de5.net and I can share some code.
Being unable to negate generation in the same way you can negate recognition is reminiscent of how SAT is in NP, but UNSAT is in EXP because there’s no way to read a sufficient proof in non-exponential time.
I just avoid regexes as first thing when I parse something. If you ever thought to parse XML *or* HTML (they are quite not the same and I hate it) with regexes, consider to read specs on them, realise that people who wrote all that crap are just sociopaths and hate you, and go back to just splitting file and writing your automata. At least in that case you will find a bug right away or find it more easielly, almost always you will know what parts of the spec you are ignored, and in best case it will be much simpler and faster since you can say «nagh, I know I parsing this xml from that program, I just gonna be sure to preserve information that program produces and just skip parsing of the XML all-together».
Ehat about lookaheads and lookbehinds?
Stumbled here somehow. Quality of production and depth of content are absolutely unmatched.
I love your videos and contents so much Kay!! :^)
I love you kay
The presence of backreferences are a signpost that you have moved beyond the lexer and should be writing a parser
I think raku's approach to more verbose regexes is a step in the right direction.
I love your hair and your accent. But to echo others, how do you do your animations?
Great video! Not sure how it’s only at 1k views.
Very disappointed to discover after this video that the game engine I was using only supports PCRE2
If I wanted to right unreadable code, I'd enter an obfuscated C contest.
I don't care what the theory is. I don't care how efficient you can make a regEx implementation. If you want to make things as efficient as possible you can always write in assembler, but people don't do that unless they absolutely have to (or for fun) and the reasons are obvious. For those same reasons, try to avoid regEx and realize they are from another time and solved a different problem than the ones we need to solve today.
Did not know about redos, thx
what's a regular language tho. I don't think you ever got into the definitions in the series.
A regular language is one that can be recognized by a finite state automaton
I'm a feynman bro except instead of feynman it's regex
Thank God for RegEx
Why not scan the regex for back references and use lockstep if none exist?
Some tools probably do, but the specific website-taking-down cases were all backreferences related, so it wouldn't help with the really icky edge cases.
Maybe because I learnt compilers (Dragon book, yay!) before I programmed in a language (or used an OS) that had regexes, I always thought people coverted their NFAs into DFAs before recognition.
OK, but I need to know: can you parse HTML with a regex?
I think at best you can do a poor-man's tokenizer somewhat sanely using regexps. The whole grammar includes full recursion etc so no you can't properly parse arbitrary html just using a regexp. You can scrape stuff from html that follows known pattern or template but that can be extremely fragile when even minor changes to the format is done.
I hate this, the kind of "yes, but actually no" type stuff that, like the SO answer, consumes all that is pure and beautiful
would you mind saying a little more about this @ms-fk6eb?
nothing is black and white, nothing is known for sure, our concept of objective is how repeatable something is in one or more cases but there will always be one or more cases that repeatable thing is not... repeatable. When we say "yes" in response to a "does this happen?" actually means, "More than likely yes"
@@logickedmazimoon6001 things literally are known for sure but ok
@@Criz454 In what sense?
@@Criz454no. Everything has it's limits of applicability. Any physical model breaks apart or is too complicated to be computed. Sometimes both.
You keep saying "in big-O notation..." but then giving the time complexity's *name* rather than its big-O notation... e.g. quadratic time would be O(n²)
how do you create your animations?
Perl predates Linux by 3 or 4 years.
10:25
no way its actually been there the whole time holy shit
Wait, do you issue regex licenses?
"Someone called Rob Pike"
That stackoverflow answer is hilarious. Thanks for sharing xD
i agree
bro you are GORGEOUS omg
0:35 yeah let's hack !!!
I still don't know what a regex is.
Regular Expression. It's a computer program that lets you tell if a piece if text fits a common format, such as an email address, a phone number, etc. Theoretically it works with any kind of text pattern. The downside is that it looks nonsensical to a human reader
or do you just like looking like a medieval english lord
r u enby
u look like finlay christie
banger
How they got catastrophic xD
G … N … U ?!?
thank the Gods for gpt doing all my regex when I need it
Why do you know so much about regex?
This video, while I would love to learn about regex efficiency as I use regex a lot, kinda lost me in talking a lot about different Algs and backtracking and such and not really relating it back to regex.
I basically just went "ok, so just don't use lookahead or lookbehind probably?"
Edit: I just noticed that every graph you show past like, halfway into the video, is showing what regex expression it is. Now I feel dumb.
Love me some diversity in the mathmetics field
Yo draw a line between 1 and 5 and 3 and 7 and flip 1:45 90 degrees clockwise. If you recognize it, wtf
What, it looks like a cube?
Is this Tunic?????????????????
I'm guessing the tree of life. It turns out judaism was a finite state machine all along!
@@volbla Yeah I guess that's more likely than Tunic... Good game though.
most things done with regexes should instead be done with plain, readable, testable, debuggable code.
Ugh. Can’t do it.
I *have* to ask a question that's just on my mind. Are you a trans? It's just your voice and how you look I cannot add up.
Why does it matter at all? Stop transvestigating her. She's a woman, you don't need to know more than that. It's just you satiating your curiosity about the privacy of her life. This is the problem with society, you say we shove it down your throats - but you bring it up every opportunity you can and treat it like a dumb detective act. Let people have privacy.
@@humanfactorsio honestly if you can't answer that question but also are showing your appearance, you're not doing privacy right.
I think it's a natural question to ask when you can look at her but then after hearing her you can't make the match.
You lookin for a date or something lmao
what a handsome young lady
😂
Why do you need to an an absolute bigot on here? Show some respect FFS.
What are you???
Man or woman???
Why does it matter? You're focusing on the wrong thing in this video.
Isn't it obvious? 😞
@@flyingzeppo he wants to know obviously if he asked
M8 we're just trying to live. Can you leave her alone?
I honestly would have preferred the video a lot more if it were just narrated.
зачем он вырядился как баба?
Because they wont be jailed for it unlike in your country
If You're so smart, just write better and be the hero. Huh? I guess not a chance, right? The complexity and unefficiency probably has its purpose. Well. But You just needed views, right?
The video literally explains the use of backtracking, and how removing it could lead to safer, faster code, admitting that doing so would involve removing back references and their expressive power. Besides, it's comparing the performance between two algorithms, both of which were written by the same person, so it's hardly a callout.
Let me guess, self-taught
You are genuinely one of the greatest video creators I've seen in so long (also you are goals in general girl 🥹💕). I have a massive holiday watchload now. Thank you so so so so much for your hard work