Always remember the adage: if you think regex is the solution to your problem, you now have 2 problems.
Very nice video, thank you, Arjan. I was just wondering: What specifically is bad about REGEX_1, and why are REGEX_2 and REGEX_3 better?
Unclassed greedy operators.
Love the new format! Thanks for the content.
I'm glad you're enjoying the new content! :)
Thanks! More on optimizing regex please.
PYtips.
Great job Arjan.
My recommendation from experience is to have _a tonne_ of tests for your regex, especially if it's very important for your application like email checking is. In Python with pytest, you can use parametrised tests to load valid and invalid emails from text files and just check that the output is correct. As time goes by and you find some false positives and negatives, you can add them to your test data to ensure you've fixed the bug.
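A minimal sketch of that setup, under stated assumptions: the file names `valid_emails.txt` and `invalid_emails.txt`, the helper `load_cases`, and the placeholder check `is_plausible_email` are all hypothetical and stand in for whatever your application actually uses.

```python
import re
from pathlib import Path

import pytest

# Placeholder check (an assumption): swap in the regex you really want to test.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+")


def is_plausible_email(address: str) -> bool:
    return EMAIL_RE.fullmatch(address) is not None


def load_cases(path: str) -> list[str]:
    """Read one address per line, skipping blank lines; a missing file yields no cases."""
    file = Path(path)
    if not file.exists():
        return []
    lines = (line.strip() for line in file.read_text(encoding="utf-8").splitlines())
    return [line for line in lines if line]


# Hypothetical data files: append every false positive/negative you find to them.
@pytest.mark.parametrize("address", load_cases("valid_emails.txt"))
def test_valid_addresses_are_accepted(address):
    assert is_plausible_email(address)


@pytest.mark.parametrize("address", load_cases("invalid_emails.txt"))
def test_invalid_addresses_are_rejected(address):
    assert not is_plausible_email(address)
```

Each line in the data files becomes its own test case, so a regression shows up with the exact address that broke.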
Absolutely! Treat regex as if it's a malicious enemy.
Valid email addresses aren't really possible to match with regular expressions, anyway. At least, not all possible addresses as allowed by the RFC. For that reason I don't know what regex I should use for email addresses, if anything at all.
From what I've heard, it is best just to check for exactly one @ sign with something before it and something after it. Now I wonder how to check for only one @ character in a string...
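For the "only one @" question, plain string methods are probably enough, no regex needed. A small sketch (the function name `has_single_at` is made up for illustration):

```python
def has_single_at(address: str) -> bool:
    """True if there is exactly one "@" with at least one character on each side."""
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")
    return bool(local) and bool(domain)
```

One caveat: the RFC allows an @ inside a quoted local part, so a strict "exactly one" rule can reject a few technically valid addresses.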
What I've had the most success with is doing a trivial check on the input for optional form validation, and then actually trying to send an email to the address.
For the pattern, checking for non-whitespace characters, then an @, followed by more non-whitespace characters, then a period, then more non-whitespace characters is generally sufficient. A false positive match isn't really all that harmful, and you shouldn't get any false negatives, so it tends to ensure users have put in something vaguely, potentially correct before submitting the form. Or, like @Dragetta said, just check for an @ symbol in the string and be done with it.
Then you try sending an email, and if it's successfully delivered, it's valid. If it fails to deliver, you can either stop there, or look into more robust retry logic e.g. using a pending registrations table in the DB that you try to verify several times before removing to avoid cluttering up your user table.
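The pattern described above could look roughly like this. It is only a sketch of the loose pre-check, not a validator, and the names `LOOSE_EMAIL_RE` and `looks_like_email` are assumptions:

```python
import re

# non-whitespace, "@", non-whitespace, ".", non-whitespace
LOOSE_EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def looks_like_email(text: str) -> bool:
    """Cheap form-level check; real validation is sending the email."""
    return LOOSE_EMAIL_RE.fullmatch(text) is not None
```

Note that this version rejects dotless domains such as `user@localhost` and anything containing a space, so it can produce false negatives in those rarer cases.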
@MichaelONeillIrish I like the idea of checking by sending an email.
However, domains don't need to have a period and spaces are allowed in the email address. That is what makes validating email addresses such a pain.
Tip of the week
Good video format! Short & informative.
Thank you! 😊
Very very interesting! Sparked many thoughts ❤
Tuesday tips? Did you mean to say Code Snippets by ArjanCodes? 😁
Great video again. Thanks a lot. Is there a way in Python to limit the execution time of a regex, to prevent a scenario like a ReDoS attack?
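The standard library `re` module has no timeout parameter, so one possible workaround is to run the match in a child process and kill it when it takes too long. The sketch below assumes that spawning a Python subprocess per check is acceptable; `fullmatch_with_timeout` and `_CHILD_SCRIPT` are hypothetical names. The third-party `regex` package also documents a `timeout` argument, which may be simpler if an extra dependency is fine.

```python
import json
import subprocess
import sys

# Tiny script run in the child: reads [pattern, text] as JSON, prints true/false.
_CHILD_SCRIPT = """
import json, re, sys
pattern, text = json.load(sys.stdin)
print(json.dumps(re.fullmatch(pattern, text) is not None))
"""


def fullmatch_with_timeout(pattern: str, text: str, timeout_s: float = 1.0):
    """Return True/False if the match finished, or None if the time limit was hit."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", _CHILD_SCRIPT],
            input=json.dumps([pattern, text]),
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(completed.stdout)
```

Process startup is slow compared with a normal match, so this only makes sense for untrusted input or untrusted patterns, not for hot paths.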
Write A LOT of tests!
In terms of regex readability, isn't adding comments to your regex using re.VERBOSE and raw strings just standard practice? Do you find it helpful when coding complex matches?
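For anyone who hasn't seen it, this is roughly what `re.VERBOSE` with a raw string looks like. The pattern itself is just an illustrative loose email check, not a recommendation:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment,
# so a literal space or "#" would have to be escaped.
LOOSE_EMAIL_RE = re.compile(
    r"""
    (?P<local>\S+)     # anything non-whitespace before the @
    @                  # the separator
    (?P<domain>\S+)    # anything non-whitespace after the @
    """,
    re.VERBOSE,
)

match = LOOSE_EMAIL_RE.fullmatch("user@example.com")
if match:
    print(match.group("local"), match.group("domain"))
```

Named groups combine well with this style, since the pattern then documents both what each part matches and what it is called.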
On the naming question for the new series: stay with "Tuesday tips". Reason? The first thought is the best one most of the time.
Writing Regex is kind of like writing raw sql, why does sql have abstraction libraries but regex doesn't?
Maybe do a video on Panel or Panel vs Dash
I'd say it should be "Arjan in shorts", I trust you can come up with your thumbnails 😊
And be careful when you write your own parsing routine just to avoid a regular expression: such routines can be hard to read, can be sub-optimal, and can contain an infinite loop. Or even several of them.
Why not "Arjan Tips"? Nice, simple and reminds the channel name kk
Codjan Tips or Code-jan Tips
arjan's ardvice
Since we are in the realm of web, maybe some review on usage of HTMX for python folks out there? Would be great to see that here! :)
Maybe everyday tips?
how about calling it:
Arjan's Tips
weekly tips
developers tips
Cue the XKCD about regular expressions….
As long as it's "Perl problems" or "regex golf" and not "save the day"
Automata Theory (DFA) helps to write better regex expressions ;)
Only very simple regexes can be directly converted to/from DFAs. As soon as you get into stuff like backtracking and non-greedy matching, the conversion becomes convoluted. I believe it's rarely worth it.
Can't catch which regex is evil, just like my regexes can't catch valid strings
If you use such a long regex, you probably shouldn't use one.
Also, to catch all possible triggers for an "infinite" loop, use a timeout.
Length is not a problem. A regex match can't run forever, but it can take exponential time, even with a short pattern.
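A small illustration of that point, using the classic nested-quantifier pattern `(a+)+b` as an example. `NESTED_RE`, `FLAT_RE` and `time_failed_match` are names made up for this sketch. Both patterns accept the same strings, but on input that cannot match, the nested one makes Python's backtracking engine try every way of splitting the run of "a"s.

```python
import re
import time

NESTED_RE = re.compile(r"(a+)+b")  # short, but exponential on failing input
FLAT_RE = re.compile(r"a+b")       # same language, no nested quantifier


def time_failed_match(pattern: re.Pattern, n: int) -> float:
    """Seconds spent failing to match n "a"s with no trailing "b"."""
    text = "a" * n
    start = time.perf_counter()
    pattern.fullmatch(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the nested pattern's time.
    for n in (14, 16, 18, 20):
        print(n, time_failed_match(NESTED_RE, n), time_failed_match(FLAT_RE, n))
```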
@QwDragon Length might not be a technical problem, but it is for sure a human one.
Easy Python Pills (to swallow) for the name?
It's obvious that the first one is bad because of the construction before the @ sign, which has the form ([smth]?[other]*)*
I don't think REGEX_1 will catch all email addresses
Regex is the perfect AI use case: something fairly easy for a computer to translate from plain written language, but that looks crazy to a human.
There's a significant chance that the AI will hallucinate the regex, and it's really hard for the human to catch that.
... if the AI could be trusted to get it right... which it can't.
Regexes are so tedious that they need to be verified with unit tests.
A "clever" regex is a guaranteed way to show-off what a 10X ninja rockstar developer you are... and your team will "thank" you for it for many many years after you write it.
Regexes are OK, but I consider them write-only code except in the most trivial cases. They're easy to write, but hard to read.
Then the double videos per week started... Great!
Greedy operators are nearly always bad, except when they're at the very end.
Oooh yeah. We've ReDoS'd ourselves before.
DickVanDyne Tips