Generic HTML Sanitizer Bypass Investigation

LiveOverflow

140K views


Published: Jun 28, 2024
I stumbled over a weird HTML behavior on Twitter and started to investigate it. Did I just stumble over a generic HTML Sanitizer bypass?
Get my handwritten font shop.liveoverflow.com (advertisement)
Checkout our courses on hextree.io (advertisement)
The Tweet: / 1662701541680136195
Google XSS: • XSS on Google Search -...
HTML Spec: html.spec.whatwg.org/multipag...
Chapters:
00:00 - Intro
01:09 - Sanitizing vs. Encoding
02:32 - Developing HTML Sanitizer Bypass
05:03 - Attacking DOMPurify
07:08 - Attacking Server-side Sanitizer
08:31 - HTML Parse Error Specification
10:08 - Potential Impact
11:55 - hextree.io
=[ ❤️ Support ]=
→ per Video: / liveoverflow
→ per Month: / @liveoverflow
2nd Channel: / liveunderflow
=[ 🐕 Social ]=
→ Twitter: / liveoverflow
→ Streaming: twitch.tv/LiveOverflow/
→ TikTok: / liveoverflow_
→ Instagram: / liveoverflow
→ Blog: liveoverflow.com/
→ Subreddit: / liveoverflow
→ Facebook: / liveoverflow

Comments • 196

@Fasguy 1 year ago ⁺⁵²⁷
I'm generally someone who likes to implement stuff themselves, instead of using an external dependency, but stuff like this is why i normally don't touch security related things (like HTML sanitization) myself and go for an existing solution instead.
@motbus3 1 year ago ⁺¹⁸
Once I did as a personal challenge and then you can get a list of common bugs that might lead to other problems.
You can work one by one but then you get another list and you get another kick.
But it is fun to try
@arnevaneycken2878 1 year ago ⁺¹⁰
I just remember log4j
@rosco3 1 year ago ⁺¹⁷
Same, I prefer to implement myself most things but Date/Timezone management and security related stuff are 2 topics I don't want to touch if possible
@Fasguy 1 year ago ⁺⁷
@@rosco3 Oh god yes, f*ck Time management of any kind. Especially in JS.
@apIthletIcc 11 months ago
Fun fact, html sanitation and this specific weirdness is what prevented a hacker's malware from working on my phone(s). 😅
They shit in their own hand on that one
@PixelOverload 1 year ago ⁺⁸⁷
I'd read before that valid HTML tags can't start with a number so I wasn't surprised that was the root issue, I wasn't aware of how the specification detailed parsing the situation however and now I'm wondering the logic behind _why_ they specify it should be handled like _that_ of all things 🤔
My only guess is to make sure it mangles the output sufficiently as to hopefully make the developer notice something's wrong and fix it, but surely that could be done more gracefully... or maybe it's just grandfathered in from the quirky behaviour of some early parser?
@SimonBuchanNz 1 year ago ⁺¹⁰
The HTML 5 spec is nearly entirely trying to nail down the least broken interpretation of existing content written against the wacky browsers of the time.
@D0Samp 1 year ago ⁺²³
, etc. are a rare way to count items and basically ignoring , ... by parsing them as comments might be a concession to broken markup cleaners that tried to close those non-tags.
@0marble8 11 months ago ⁺¹⁵
About the Chomsky hierarchy: what we call regex is not actually🤓 type-3/regular, it often has operations, such as repeating the match group with \1, that are only present in type-1 grammars. There is a somewhat well-known regex for prime numbers, and it is impossible to construct a corresponding state machine for it.
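A minimal Python sketch of the prime-number regex mentioned above, assuming the commonly cited unary form (the number n is written as n copies of "1"). The names `NOT_PRIME` and `is_prime` are illustrative only. The backreference `\1` forces the same captured group to repeat, which is the feature that takes the pattern beyond regular languages.

```python
import re

# Matches unary strings whose length is 0, 1, or a composite number.
# (11+?) captures a candidate divisor d >= 2; \1+ requires the rest of the
# string to be one or more further copies of that same divisor.
NOT_PRIME = re.compile(r"1?|(11+?)\1+")


def is_prime(n: int) -> bool:
    """Return True if n is prime, using the unary backreference regex."""
    return NOT_PRIME.fullmatch("1" * n) is None
```

This is a curiosity rather than a practical primality test: the backtracking cost grows quickly with n.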
@dave7244 1 year ago ⁺⁷²
This video is great. I've been a full stack web developer now for about 15 years and I learned quite a bit about HTML parsing. The onerror attribute isn't something I would think of at all because quite frankly I write JavaScript Event and Error handlers the recommended/modern way.
@_nikeee 1 year ago ⁺¹⁶³
There is also a new JS method on all HTML elements: setHTML(input, options).
It's basically innerHTML, but sanitizes the input. So i think it's just like DOMpurify, but natively in the browser.
@vaisakhkm783 1 year ago ⁺⁹
as it's really new... it might contain issues....
@ET_AYY_LMAO 1 year ago ⁺⁵²
@@vaisakhkm783 Hey if you find any I bet there is a reward from google...
@ET_AYY_LMAO 1 year ago ⁺²⁸
Cool, but it's not supported in FF or Safari.
@spicybaguette7706 1 year ago ⁺²⁰
@@vaisakhkm783 IDK, I'd trust browser authors more with sanitizing HTML, since they wrote their own parser. It wouldn't surprise me if it used some of the browser's own HTML parsing logic
@whannabi 1 year ago
@@spicybaguette7706 until people figure out some no click exploit
@MLeoDaalder 1 year ago ⁺⁴²
The popular Java library for this, Jsoup, also looks to handle this correctly. The basic input turns into <22> though it strips the comment for the closing tag.
@JordanPlayz158 1 year ago ⁺¹
Oh that is nice, I love jsoup
@SchonKonnie 1 year ago ⁺¹⁰
When I saw your thumbnail, I instantly tried it out with different numbers and I also tried random tags, starting with numbers.
I thought okay, variables can't start with numbers, so I guess that also applies to HTML tags.
I would have never thought about any security bypasses by my own but then I became curious and watched your video.
@soviut303 1 year ago
The UI for your courses looks really nice!
@wartab 1 year ago ⁺⁸
I was nervous when you tried dompurify, cause we heavily rely on it in some of our projects.
@json_bourne3812 1 year ago ⁺¹
8:00 funny you mentioned the syntax highlighting, because it was the FIRST thing my brain said when you first wrote it in your editor at the start of the video! 😂
@kiyov09 11 months ago ⁺¹
Good video. You are never defeated if you end up learning something 💪
@jimdiroffii 1 year ago ⁺⁵
If you wasted your time researching this, then what have I done by watching!? Haha, great vid, interesting results.
@IllIl 1 year ago
I learned something new, very interesting thanks.
@eero8879 11 months ago ⁺¹⁰
We should only allow standard ( and whitelisted/predefined custom tags ) and explicitly close them. And just refuse to parse and throw an error if document is invalid. It's just stupid to allow arbitrary syntax and try to parse it.
@seanthesheep 11 months ago ⁺⁴
Markup languages seldom throw errors. It'd also be annoying to have an entire document not render for you just because they used something your renderer doesn't allow
But you can enforce well formed HTML as a style guide for your project, which many people already do
@namibjDerEchte 11 months ago
@@seanthesheep Or block execution in case of malformed syntax?
@Mitsunee_ 11 months ago ⁺⁷
I instantly remembered that Astro lets you define a variable to alias a component or html tag, so I instantly went and tried Wat=22. Got the same behaviour described in the spec, but Astro leaked the classname into the HTML content, so I guess I got a bug report to make...
EDIT: I tried around and there is just enough sanitization to break any XSS I could think of so far. Shoutouts to the Astro team I guess 😹
@mjerez6029 1 year ago ⁺⁸
Say thanks to the big brains who didn't want to go with xhtml which was waaaaay more restrictive than html5
@maker0824 11 months ago
I am just learning html, and I was so confused when you were calling an html tag. Only for the conclusion to the video be as simple as “it’s not one”. Like wow, what a shocker
@D0Samp 1 year ago ⁺³
In Python, both the original htmllib.HTMLParser, which was built on top of the SGML parser and no longer exists in Python 3, and the current html.parser.HTMLParser handle this according to the specification.
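A small sketch for checking the comment above against the standard library's `html.parser.HTMLParser`. The class name `EventLogger` and the helper `parse_events` are made up for this illustration; they only record which callbacks the parser fires. Exact chunking of text callbacks may differ between Python versions, so the data pieces are best compared after joining them.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every parser callback as an (event, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def parse_events(markup: str):
    """Parse markup and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.events
```

Calling `parse_events("<22>test</22>")` should report no start or end tag at all: the opening `<22>` arrives as plain text, and the closing `</22>` arrives through `handle_comment`, which is the behavior the video traces back to the specification.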
@ThePowerRanger 1 year ago ⁺¹
You learn something everyday.
@xorlop 1 year ago
I wish hextree was open... I am so excited!
@farismazlan5157 11 months ago
Great content for junior 👍🏽
@ET_AYY_LMAO 1 year ago ⁺¹⁵
I remember back in the days many people sanitized for javascript links by checking if the url starts with javascript: that makes sense I guess... but then IE7 allowed for a tab (Or was it some other char? can't remember) characters in front of the javascript: part.
@ET_AYY_LMAO 1 year ago ⁺⁵
Another totally unrelated discovery I found common a decade ago or so is to not escape float arguments in SQL, right around where Gmap API was the next thing everyone wanted, I found so many sites where you could SQL inject the lat lng arguments on the endpoint for map data. This included the largest private buy and sell site in my country at the time, but their parent company scolded me at a job interview so I never told them about it >:)
Nowadays everybody thankfully uses data binding instead of concatenation when building SQL in their applications..
(And yes I was able to get the user table by reading the database structure from INFORMATION_SCHEMA table in mysql and absolutely pwn the shit out of them, but I'm a nice guy that does this shit just for bragging rights)
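A minimal sketch of the data-binding approach mentioned above, using Python's built-in `sqlite3` module. The table `places`, the helper names, and the sample rows are all invented for illustration. The coordinates are converted with `float()` first, so non-numeric input is rejected before it reaches the database, and the values are then passed as bound parameters instead of being concatenated into the SQL string.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory database with a hypothetical 'places' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, lat REAL, lng REAL)")
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?)",
        [("Cafe", 52.52, 13.40), ("Harbor", 53.55, 9.99)],
    )
    return conn


def find_places(conn, lat, lng, radius=0.5):
    """Return names of places within `radius` degrees of (lat, lng)."""
    # float() raises ValueError for anything that is not a number,
    # e.g. "52.5 OR 1=1".
    lat, lng, radius = float(lat), float(lng), float(radius)
    cursor = conn.execute(
        "SELECT name FROM places "
        "WHERE ABS(lat - ?) <= ? AND ABS(lng - ?) <= ? "
        "ORDER BY name",
        (lat, radius, lng, radius),  # bound parameters, never string concatenation
    )
    return [row[0] for row in cursor]
```

Either measure alone would stop the injection described in the comment; doing both keeps the query safe even if the validation is later loosened.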
@ET_AYY_LMAO 1 year ago
@@blenderpanzi I really wanted to reply but youtube keeps deleting it lol.
@blenderpanzi 1 year ago ⁺¹
@@ET_AYY_LMAO You can't include any URLs in YouTube comments. They get auto-deleted.
@ET_AYY_LMAO 1 year ago
@@blenderpanzi Urls cover more than http, there is other protocols and pseudo protocols like mailto that could be 100% legit use cases as well as relative urls. But yes, always whitelist!
@blenderpanzi 1 year ago ⁺²
@@ET_AYY_LMAO Yes, as I said, you might over-block, but that is not as bad as having an injection. Add mailto: to the list of allowed protocols if you want to allow that. :D
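A rough Python sketch of the allow-list idea from this thread. `ALLOWED_SCHEMES` and `is_allowed_url` are hypothetical names, and the set of schemes is only an example. Before the scheme is checked, the input is normalized the way URL parsers in browsers do: leading and trailing control characters and spaces are stripped, and tabs and newlines are removed everywhere. That normalization is what defeats the "tab in front of javascript:" trick mentioned earlier.

```python
import re

# Example allow-list; extend it deliberately, e.g. add "tel" if needed.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# A URL scheme: an ASCII letter followed by letters, digits, "+", "." or "-".
_SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Characters U+0000 through U+0020 (C0 controls and space).
_C0_AND_SPACE = "".join(map(chr, range(0x21)))


def is_allowed_url(url: str, allow_relative: bool = False) -> bool:
    """Accept a URL only if its scheme is explicitly allow-listed."""
    cleaned = url.strip(_C0_AND_SPACE)          # strip leading/trailing junk
    cleaned = re.sub(r"[\t\n\r]", "", cleaned)  # drop tabs/newlines anywhere
    match = _SCHEME.match(cleaned)
    if match is None:
        # No scheme at all: a relative (or scheme-relative) URL.
        return allow_relative
    return match.group(1).lower() in ALLOWED_SCHEMES
```

Rejecting by default means some legitimate links get blocked, which matches the trade-off stated above: over-blocking is less harmful than an injection.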
@lancemarchetti8673 1 year ago
This is so cool!
@boomknuffelaar 1 year ago ⁺⁷
Hey LiveOverflow, how about CTF challenges as hextree courses? I think those would nicely build onto your existing youtube video's.
@chocolateimage 1 year ago ⁺⁴²²
"one day i was scrolling through twitter", twitter does not exist anymore
@Rhidayah 1 year ago ⁺¹¹
Elon change to Tweslla
@ET_AYY_LMAO 1 year ago ⁺⁵⁸
Twitter in 2006: "Its like text-messages, but for companies to do PSAs and engage with their audience like 'Our website is down, sorry!' or '25% discount at XYZ on friday'"
Twitter in 2023: "yOU mUsT sIgN uP tO sEe tHiS tWeEt"
@jaydeep-p 1 year ago ⁺²³
@@ET_AYY_LMAO
Twitter in 2006: nobody cared
Twitter in 2023: nobody cares
@sylv512 11 months ago ⁺³
@@jaydeep-p ok
@partlyblue 11 months ago ⁺⁹
I can't wait for the day that someone responds to this with something along the lines of, "wow, I can't believe you predicted the future!!!". Twitter will fall and after a couple awkward years (shit, months?) the Internet collective will find a new hate machine where people say funny little things and receive death threats in response. What a time to be alive 😔
@untitled8027 1 year ago
good to know thanks
@shigekax 1 year ago ⁺¹
I can't remember any instance of variables starting with a number being valid.
Also, we did do a simple html parser in 3rd year cs and the alpha as a first letter is the first thing we put in 😂
@270jonp 1 year ago ⁺²
thank you thank you thank you for showing a "failure".
@tg7943 1 year ago ⁺²
Push!
@matthewrease2376 11 months ago
Love how this wouldn't even be an issue if you just turned ALL the < into &lt;
@nixel1324 11 months ago
You seem to have missed the bit at 1:15.
@MrGuppiSocks 1 year ago
You're awesome
@paulcasanova1909 11 months ago ⁺¹
Yeah it's the same for any other programming language. Every variable name cannot start with a number. Html's tags are no different
@Cdaprod 11 months ago
I think I’ve seen this happen in a bug converting markdown to html before too
@luketurner314 1 year ago ⁺⁹
11:01 I think if you add [\s]*[a-z][a-zA-Z0-9]* to the regex right after the < it should make it more spec compliant
@Victor_Marius 1 year ago ⁺⁵
You can have hiphens in tag names. Custom elements need it as in
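A hedged Python sketch of how the start of a tag could be recognized more closely to the HTML tokenizer rules discussed in this thread. It rests on my reading of the specification: a `<` only opens a start tag when it is immediately followed by an ASCII letter (no whitespace in between), and the tag name then runs until whitespace, `/`, or `>`, so hyphens and digits are fine after the first character. The names `START_TAG_OPEN` and `start_tag_name` are invented here.

```python
import re

# "<" followed directly by an ASCII letter, then any characters up to
# whitespace, "/" or ">". Hyphens and digits are allowed after the first letter.
START_TAG_OPEN = re.compile(r"<([A-Za-z][^\t\n\f\r />]*)")


def start_tag_name(text: str):
    """Return the lowercased tag name if text begins with a start tag, else None."""
    match = START_TAG_OPEN.match(text)
    return match.group(1).lower() if match else None
```

This only classifies the opening of a single token. It is not a parser, and it says nothing about attributes, end tags, or nesting.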
@bynariizminecraftenplusfun4181 9 months ago
Ad popped one minute in the video, great ...
@zdazeeeh 1 year ago
How did I just find out that we share first names :D
@TheDiveO 11 months ago
pedantry corner here: it's not a letter, but an ASCII letter. öäüß are letters.
@Gastell0 1 year ago ⁺²
My first thought was - html tags can't start with digits, so it interprets it as text literal.
the becoming a comment is a surprise though
@outseeker 11 months ago
neat! :)
@clehaxze 1 year ago ⁺¹
I just wrote my own HTML minimizer. When the video started I instantly knew what the issue was. The HTML spec is like C++. They are insane. IMHO, the HTML grammar is simply * . There's no way to get regex (or even EBNF) to HTML without edge cases of syntax error.
@beeble2003 11 months ago ⁺¹
There's no way to get a regex for HTML, period. It is provably impossible, and applies to any language where the syntax requires you to match opening and closing brackets or any equivalent thing such as tags.
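To illustrate the point about matching opening and closing tags, here is a small Python sketch under simplifying assumptions: tags are lowercase, have no attributes, and the helper name `is_balanced` is made up. The regex is used only to find individual tags. The nesting check needs a stack whose size grows with the nesting depth, which is exactly the unbounded memory a finite automaton (and therefore a true regular expression) does not have.

```python
import re

# Tokenizer only: finds "<name>" and "</name>" for lowercase names.
TAG = re.compile(r"<(/?)([a-z]+)>")


def is_balanced(markup: str) -> bool:
    """Check that every opening tag is closed in the right order."""
    stack = []
    for closing, name in TAG.findall(markup):
        if not closing:
            stack.append(name)       # remember the open tag
        elif not stack or stack.pop() != name:
            return False             # close without a matching open
    return not stack                 # leftover open tags mean unbalanced
```

Real HTML is more forgiving than this check (optional end tags, void elements, error recovery), so the sketch shows the counting argument, not how a browser parses.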
@tsalVlog 11 months ago
weirdly, I am more aware of Chomsky hierarchy through language and culture studies, and not through computer science. And my day job title is software engineer.
@leyasep5919 11 months ago
thanks for sharing "anyway" 🙂
@williamm200 1 year ago
LiveOverFlow!!!! !!!!
@typedeaf 10 months ago
@LiveOverflow Do you know of any good resources for help with finding bugs? What method do you use to find a bug that looks like a potential vulnerability?
@kn19ht_s3c 1 year ago
Last smile 😂
@motbus3 1 year ago ⁺³
Hextree signup is disabled. Is this part of the test? 😅
@amorcomorco 9 months ago
so interesting and fun to watch, thx, kiss lol
@Hofer2304 1 year ago
Is it possible to tell the browser how it should handle errors and dubious code? Is it possible to let the browser check the syntax first, and only when it succeeds, it is allowed to render it.
@JOHN-um2 11 months ago
This is expected behavior
@blenderpanzi 1 year ago
But if your buggy non-standard HTML parser then spits out normalized HTML with any < > & properly encoded and any tag/attribute/attribute value that is not explicitly allow-listed removed, no injections should be possible either, right?
@drkwrk5229 11 months ago
Club-Mate bro :)
@prescientdove 1 year ago
CLUB MATE!
@serialkiller8783 1 year ago ⁺¹
is the invite code to hextree meant to be a challenge to be bruteforced ?
@Epinardscaramel 1 year ago ⁺¹
Club Mate spotted 😅
@TheNullBox 1 year ago
Server side sanitizers are a thing of the past. Client side sanitization bypasses are more interesting.
@bdot02 11 months ago
When is hextree going to open up?
@intron9 1 year ago ⁺¹⁴
@vaisakhkm783 1 year ago ⁺²
😂 trying to defeat yt??
@_erayerdin 1 year ago
bravo, you win the internet :D
@intron9 1 year ago ⁺³
Guys, I actually found something, I know the comment looks ok, but when some of you liked it, my android notification showed "someone liked your comment ''".
So, it proves that these problems are everywhere.
@crimsonmegumin 11 months ago
@@intron9 prob character limit?
@hangingwithvoid360 1 year ago
holy shit
@nodnarb 1 year ago
Will you continue the Minecraft Hacked series
@Grstearns 11 months ago
Markup shenanigans: ✔
Found out about debuggex: 😲
Club Mate in the promo: 🤯
Video rank: 🅰 ➕
@Jdbye 11 months ago
Seems odd to me that the opening and closing tags are treated differently. It would make sense if the closing tag was also treated as text.
I suppose what is happening here is that the default behavior when the parser encounters a closing tag that is missing a corresponding and valid opening tag, is to turn it into a comment. But this check happens before the check for whether the tag name is valid. So that check is never made on the closing tag, because it's already been turned into a comment by the previous check.
That makes sense, but why do this intentionally and make it part of the spec? Intuitively, it would make much more sense to require that the opening and closing tags of an invalid tag name be treated the same, and have this check happen before the other check. Then the output would have made more sense, you probably wouldn't have questioned the behavior, and you wouldn't have spent an hour on figuring out why.
@Zadagu 1 year ago
12:14 is this an advertisement for hextree and club mate at the same time? :D
@Sollace 1 year ago ⁺³
These kinds of things are why I designed my html/bbcode parser in the way I did. It can read whatever it's given into a dom tree, but when outputting that to html/bbcode it only returns what I've specifically allowed it to.
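The design described above (parse whatever comes in, but emit only what is explicitly allowed) can be sketched in Python with the standard `html.parser` module. This is an illustration under stated assumptions, not a production sanitizer: `ALLOWED_TAGS`, `AllowListSerializer`, and `sanitize` are invented names, all attributes are dropped, and Python's parser is not guaranteed to agree with a browser's parser in every edge case.

```python
import html
from html.parser import HTMLParser

# Example allow-list; anything not listed here is never written out as a tag.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}


class AllowListSerializer(HTMLParser):
    """Re-serializes parsed input, emitting only allow-listed tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)  # attributes are dropped entirely

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always re-encoded, so stray < > & cannot form new markup.
        self.out.append(html.escape(data))

    # Comments, doctypes and processing instructions use the default
    # no-op handlers, so they are silently dropped.


def sanitize(markup: str) -> str:
    """Return markup rebuilt from allow-listed tags and escaped text only."""
    parser = AllowListSerializer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Because the output is rebuilt from scratch rather than patched, input such as `<22>t</22>` cannot smuggle a tag through: the pieces the parser treats as text come out encoded, and the bogus comment is discarded.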
@felipemartins6433 1 year ago
that regex html verifier seems like something people would use in amp pages because of their shitty js support
@kn19ht_s3c 1 year ago
Registration for hextree is not open? I really want to try out hardware
@frosty1433 1 year ago
You’re best off using bbcode, markdown, or writing your own parser.
@mokhtardz9889 1 year ago
(Developing a TCP Network Proxy - Pwn Adventure 3) I have problem
@bugbountyhunter-eh8rq 1 year ago
test
@Vampirat3 1 year ago
I respect your snipe
@gprime3113 1 year ago ⁺³
Ya, I have found some strange XSS like this, you never know how the backend will process the input. for example '-alert(1)-' and '+alert(1)+' did not work.... but '-alert(1)+' did!?!, can't explain it.
@thecamlayton 11 months ago
Michael Cera,that you?
@takeshikovacs667 11 months ago
Will you open hextree to external content creator?
@zoenagy9458 1 year ago
too much sunlight
@jamesflames6987 11 months ago
test
@notavoicechanger1808 5 months ago
11:24 I see this breaking if the parsing /segmentation of the document is not done "as the entirety"
Example:
If your parser has a size limit on how large the page can be, and your tag is theoretically longer than that, it would need to correctly correlate the opening and closing tag segments from two different segments of computed data, a much easier situation to engineer a flaw.
So say you create an html tag that is longer than a parser can handle as a single 'bite'.
"< 'x'*21474834648 >" - extremely long data
I believe the best solution to this problem would be a logical equivalent of 'just don't talk with your mouth full' as any legitimate code would have no business surpassing a theoretical limit on a parsers maximum length.
@jay25inteserve 11 months ago
I saw what you hid. The behavior is different when stored
@nickp82 1 year ago
Starting tags with numbers is invalid HTML / XML. Why would you ever need to do that?
@emireri2387 1 year ago ⁺²
hi
@velox__ 1 year ago ⁺¹
sanitize deez
@andrewdunbar828 1 year ago
Chomsky != Komsky
@TwoThreeFour 11 months ago
Test 22
@zuctivazenci 1 year ago
foo 22
@MuscleTeamOfficial 1 year ago
yurrrrrrrrrrr first
@rtg_onefourtwoeightfiveseven 11 months ago
Question from someone who knows very little HTML: Why does the get parsed as a comment?
@virinom 1 year ago
What happened to video "DONT USE ALERT(1) FOR XSS"?
@LiveOverflow 1 year ago
Nothing happened?
@LiveOverflow 1 year ago
Heh. I get it now. Haha
@MenkoDany 1 year ago ⁺¹
as a programmer that worked as web developer, I knew from the start, anything that starts with a number is not a valid html tag
@oxymonster1337 1 year ago
till 3:10 i think yeah nothing interesting or new.
At 3:20 holup!
@va1iduser682 10 months ago
I am Disliking all videos on multiple accounts until minecraft hacked comes back!!!
@rokutv-2023 1 year ago
Is your minecraft server still up
@bugkilla84 1 year ago
Nice try!
@bryld_ 1 year ago ⁺¹
not first
@Gamesaucer 1 year ago
I tried this in PHP and the PHP DOMDocument class doesn't actually handle it correctly. It just... kind of eats the entire 22 tag when parsing, bizarrely outputting the text ">22> (with only one quote, by the way!) as the result.
PHP never ceases to disappoint me.
EDIT: It's worse than I thought, it fails to handle basically every parse error correctly. Invalid CDATA doesn't become a comment. A character reference that's out of unicode range doesn't become U+FFFD. It even turns attributes in closing tags into text. And that's just a few examples.
Something like is actually a vulnerability in the most recent PHP.
@beeble2003 11 months ago
Re your Hextree chat at the end, don't go down the rabbit hole of trying to write a Python tutorial just because some of your users won't know Python. There are a bazillion existing Python tutorials. If you go down that route, shouldn't you also write an English tutorial for your users who don't speak English? And cooking tutorials so your users aren't hungry and can concentrate better? Focus on the actual purpose of the site.
@crimsonmegumin 11 months ago
why does it turn into a comment? This was not explained
@DefineSyntax994 11 months ago
It was explained at 8:53
@crimsonmegumin 11 months ago
@@DefineSyntax994 ahh I see, it treats it differently when it's a closing tag, thanks!
@maxdemian6312 1 year ago ⁺³
HTML sanitizers should be whitelist based and reject obscure tags like
@jonopens 1 year ago
Why does the browser do that with the invalid 22 'tag' instead of just discarding it? Is it related to custom elements and shadow DOM standards? So weird!
@LiveOverflow 1 year ago ⁺²
because the HTML specification says the browser has to do that ;)
@apIthletIcc 11 months ago
There's a specific 4 char string you can use for your wifi network ssid that will cause it to display "unknown" ssid in a lot of apps. Not really that useful for me but someone could find it useful. So I keep it a secret 😂
@djthdinsessions 1 year ago ⁺²
HTML tag names can't start with numbers
@robstamm60 1 year ago
Arguably one of the biggest mistakes in the history of "the web". Web browsers should have been designed like compilers - if it doesn't comply with the specified standard show an error and nothing else!
@anon_y_mousse 1 year ago ⁺³
This is why it's important to actually know the specs to avoid running into brick walls. However, I still say that HTML/CSS/JS should be replaced with a singular language that incorporates most of their collective feature-set, and that it should *not* be a tagged markup language but an actual programming language with built-in support for styling and document structuring, something that could be used in both a relative and absolute manner to replace as many document formats as possible at the same time, such as PDF's and Office documents. Yeah, I know, it'll never happen and if anyone reads this they'll have an invalid criticism because no one wants to do the hard work to replace things that are wrong.
@ssokolow 11 months ago
Does it count as a valid criticism that such a solution would probably be as fragile as its most fragile facet (the JavaScript successor), and, despite the people behind it, XHTML failed in the market because it had XML-style parse errors rather than HTML-style parsing recovery?
That's a big reason I never use anything client-side templated like React/Angular/Vue/etc. in my own projects. If a transient network fault causes a CSS subresource to fail to load, the page will be ugly, but probably still usable. If an IT department's application firewall hasn't added the CORS header needed to serve up my font to their HTML header whitelist, the page will be ugly but should still function. If a templating or sanitizing mistake results in malformed HTML, there's still an opportunity for things to work, and for another layer like CSP to prevent any sanitizing mistake from being exploitable. I can use uMatrix and uBlock Origin to block ads and potential exploit vectors without the site doing a web equivalent to segfaulting.
etc. etc. etc.
Hell, Postscript *is* an actual turing-complete programming language in the vein of what you seem to be asking for and, when it was adapted into PDF, they took it further *away* from what you're asking for. Various exploits have occurred because of the need to run turing-complete code to render a PDF document, even with the bolted-on JavaScript support turned off. Various people prefer SVG over Postscript specifically because it's a more HTML-like declarative solution for describing a document. etc. etc. etc.
Likewise, TeX is a programming language for document creation... people have been migrating away from writing it directly to writing declarative things like Markdown and reStructuredText and then programmatically translating to TeX code when they need to access its ecosystem of document typesetting extensions.
...not to mention that you're going to need an HTML-like DOM either way, because that's how screen readers and other accessibility tools see the world (GTK applications on Linux actually have an equivalent to a DOM explorer built in for their widget tree and ready to be turned on by an environment variable), and HTML has been drifting toward closer alignment with the accessibility DOM it generates over the last decade.
@anon_y_mousse 11 months ago
@@ssokolow Fragility would depend on the design. So if a committee designs it, it'll take a decade and start out okay, but then turn fragile as more crap is bolted on. If a group of OSS nerds develop it, then it'll be fragile from the start and over time converge on being hardy, but become more and more bloated than even a committee could make it and it would still suck. It would require that a single person with a good vision for it to stringently design it and direct its implementations. This is something that will never happen because corporations would never adopt such a system even if it was the best system. So to answer your first question, yes *and* no, with the dependence on which answer is correct based on who gets to design it.
@ssokolow 11 months ago
@@anon_y_mousse I disagree. I have yet to be convinced that it's not a technical impossibility to make anything practical which incorporates turing-complete code (i.e. the part which does JavaScript-y things) as gracefully degrading as the HTML+CSS side of what we already have, regardless of who designs it.
It's similar to how the power of a static type system or of Rust's "static, compile-time garbage collection" is in *restricting* what the system is capable of, to bring it to a point the computer can better understand your intentions.
@anon_y_mousse 11 months ago
@@ssokolow It's not a technical impossibility, it's a human adoption problem. But if you're a Rustacean then you won't understand, so there's no point in discussing it with you.
@ssokolow 11 months ago
@@anon_y_mousse Wow. That's a big assumption out of nowhere. I could have just as easily pointed to C and C++ forbidding unconstrained GOTO instead. (It used to be commonplace for high-level languages to give you assembly language's freedom to jump into the middle of a function, bypassing its beginning.) I just wanted a second example to complement what static typing brings.
