Reverse Engineering Data Files

  • Published: Nov 23, 2024

Comments • 106

  • @nashiora
    @nashiora 1 year ago +28

    I'm already using nob as my primary build tool for all of my projects, and I am fully prepared to migrate to the final version when it comes out. I've really enjoyed using it so far.

  • @qwqeqrqtqz
    @qwqeqrqtqz 1 year ago +13

    It has been a few years, but I also programmed something like this after watching that video. I had to chuckle when you called it binviz, because I called it binvis.

    • @anon_y_mousse
      @anon_y_mousse 1 year ago +1

      Are you British or from a former British colony? I do find it interesting that he used Z instead of S, but maybe he learned English from watching US TV shows.

  • @CrossbowBeta
    @CrossbowBeta 1 year ago +12

    Such a cool topic, this is my favorite zozin session so far

    • @metaltyphoon
      @metaltyphoon 1 year ago +1

      Is that what he says on his intros? I can't ever tell :D

  • @frechjo
    @frechjo 1 year ago +12

    Those patterns make sense for the most part.
    In sounds, each sample will probably be more or less normally distributed, with the highest probability near the center and smaller probabilities near the extremes. Sound is usually normalized, and it's a wave that crosses zero periodically (127 likely represents 0 in this case).
    And what we see is that correlations involving two samples in the middle are more frequent than those involving one or two samples at the extremes.
    For images, the lines parallel to the major diagonal are likely due to pixels having the same proportions in their color components (the major diagonal being black to white), so monochromatic gradients.
    For x86 code, it's really interesting. Vertical and horizontal lines should form when there's some particular value associated with a set of bytes, and when the values in that set only change in the lower bits. As pairs interleave (probably instructions and operands), verticals and horizontals switch for each pair.
    I think the x86 instruction set encodes registers in the lowest 3 or 4 bits? I don't really know anything about it, but that sounds like something I've seen somewhere.
    ASCII is the most obvious one.
    Now, ogg is amazing. My guess is that it has some very frequent bit width in its encoding, which ends up aligning a lot with byte boundaries, and that's what's causing those squares. But that's not a full explanation, just a general direction in which it could make sense.
    The compressed files (other than ogg) could be explained by entropy (a good compression should eliminate patterns as much as possible), but I think that's not all of it. I think a bigger reason for the lack of discernible patterns is that they won't align with byte boundaries: compression schemes typically emit variable-width bit codes. PNG in particular uses some LZ scheme with Huffman encoding iirc. If there are patterns, they'll happen across bytes, each time appearing in different value positions.
    Very interesting for sure :)

    • @ratchet1freak
      @ratchet1freak 1 year ago +2

      for x86 I believe the main culprit would be prefix bytes which are common for instructions operating on 32 and 64 bit registers.

    • @frechjo
      @frechjo 1 year ago

      @ratchet1freak Lots of different bytes appearing immediately after and before a few repeating ones. Makes sense, you are probably right :)

    • @SerBallister
      @SerBallister 1 year ago

      @ratchet1freak ARM32 had something similar: each instruction had a condition field like LE or GT, but in the general case it was AL (always), so every 32 bits you would see E at the start of the opcode, making it really easy to eyeball ARM code in a hex dump.

    • @flleaf
      @flleaf 10 months ago

      interestingly, wav files can be stored with floats, so i wonder what that would look like. also he did ffprobe on his wav files and it said signed 16 (on the stream)

    • @frechjo
      @frechjo 10 months ago +1

      @flleaf Ah I probably missed that. 16b signed, that's a bit puzzling.
      IIRC, the wav files were symmetrical, with the highest frequencies towards the center.
      2 byte two's complement numbers, that's interesting. Haven't thought about that one, or why it could look like that at all.
      For floats, I guess some pattern should emerge from a few repeating or similar exponents? (Depending how they decompose into bytes). That would be a nice one to look at :D
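
For anyone who wants to poke at the patterns discussed in this thread, here is a minimal sketch of the underlying picture. It is not the code from the stream: it assumes the visualization simply counts every pair of consecutive bytes (x, y) in a 256x256 grid, and the name `bigram_histogram` is made up for illustration.

```python
def bigram_histogram(data: bytes) -> list[list[int]]:
    """Count how often byte x is immediately followed by byte y."""
    grid = [[0] * 256 for _ in range(256)]
    for x, y in zip(data, data[1:]):
        grid[y][x] += 1  # row = second byte, column = first byte
    return grid


# Smoothly changing 8-bit samples keep consecutive bytes close together,
# so every hit lands on or right next to the main diagonal.
ramp = bytes(list(range(256)) + list(range(255, -1, -1)))
grid = bigram_histogram(ramp)
off_diagonal = sum(
    grid[y][x] for y in range(256) for x in range(256) if abs(x - y) > 1
)
print(off_diagonal)  # 0
```

Feeding it 16-bit or float samples instead would interleave low and high bytes in the pairs, which is probably a good first experiment for the questions raised above.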

  • @johanngambolputty5351
    @johanngambolputty5351 1 year ago +19

    Apart from where you transition from the end of one row to the beginning of another, it makes a lot of sense that the pattern is diagonal, because images are normally smooth, so the value of one pixel and a neighbouring one is usually not too different (for just an array of pixels anyway, although I guess rgb triplets might be packed next to one another, which could jump?).

    • @mire6134
      @mire6134 1 year ago +7

      Right, as well as it makes sense that ASCII characters always form the same pattern as certain characters are a lot more frequent than others and the frequency of each character is, to an extent, predictable given a long enough piece of text.

    • @ratchet1freak
      @ratchet1freak 1 year ago

      it's more the values of the color channels though if there is a certain hue used often at a range of brightnesses over the image then that's gonna be 2 diagonals, one for the RG pair and one for the GB pair.

  • @para-be4bf
    @para-be4bf 1 year ago +6

    1:52:07 you would probably enjoy making yourself a "slugify" command for such scenarios, basically makes any string into an acceptable filename, only thing you'd have to take care of is somehow handling the / character
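
A rough sketch of what such a "slugify" helper could look like. The function name and the exact rules (which characters survive, the `unnamed` fallback) are assumptions for illustration, not an existing tool; the `/` character is handled by treating it like any other separator.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Turn an arbitrary string into a conservative file name."""
    # Fold accented letters to plain ASCII where a decomposition exists.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Every run of other characters, including '/', becomes a single '-'.
    text = re.sub(r"[^A-Za-z0-9._]+", "-", text)
    # Drop separators and dots at both ends; fall back if nothing is left.
    return text.strip("-.").lower() or "unnamed"


print(slugify("Reverse Engineering/Data Files.png"))
# reverse-engineering-data-files.png
```

Collapsing `/` into `-` loses the directory structure on purpose; a variant that keeps it would have to split on `/` first and slugify each component.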

  • @ivanjermakov
    @ivanjermakov 1 year ago +1

    Regarding why data looks the way it does:
    - Executable format lines represent instructions and registers; they form a line because names are usually multiple bytes
    - Images usually look smooth, thus adjacent bytes are similar and diagonal lines appear
    - Soundwave data consists of byte-sized samples, a low value following a high value, thus it "sticks" to the sides of the view

  • @cobbcoding
    @cobbcoding 1 year ago +3

    12:56 this is the best home folder I've ever seen

  • @ce5983
    @ce5983 1 year ago +4

    Such an interesting topic Zozi, hadn't heard of this at all

  • @tiberiumihairezus417
    @tiberiumihairezus417 1 year ago +1

    Thank you so much, the value from your videos is astonishing

  • @EliasOjeda-mv6cg
    @EliasOjeda-mv6cg 1 year ago +2

    after watching your videos, i got back to programming in c for fun.

  • @rodelias9378
    @rodelias9378 1 year ago +4

    Really interesting stream. Thanks and keep up with the good work!

  • @JohnKouts
    @JohnKouts 4 months ago

    We love you man, don't ever stop making these videos! We also need merchandise!!

  • @kuyajj68
    @kuyajj68 1 year ago +77

    Tsoding is the person I wish I were 😂😂

    • @jordixboy
      @jordixboy 1 year ago +28

      then go grind and code a lot, don't just wish, take action lol

    • @anon-fz2bo
      @anon-fz2bo 1 year ago +16

      only thing we have in common is duckduckgo

    • @kuyajj68
      @kuyajj68 1 year ago

      @jordixboy I code a lot, and I use arch btw.

    • @ce5983
      @ce5983 1 year ago +7

      Just get on the command line and start experimenting and learning, it would be great to have another Tsoding making interesting content

    • @kuyajj68
      @kuyajj68 1 year ago +6

      @anon-fz2bo I use vanilla emacs 😂

  • @UltravioletMind
    @UltravioletMind 1 year ago +4

    I use a jetbrains editor and if you make the mistake of opening multiple large repos, you will DEFINITELY run out of memory. when you restart the IDE, it restores your previous workspaces as expected. but instead of indexing just the current window and queuing the rest, it immediately tries to index all the repos you had opened previously. and then i get the popup on mac about running low on memory, and everything grinds to a halt

  • @chiefxtrc
    @chiefxtrc 1 year ago +6

    I think you could have bumped up the base "brightness" of the pixels to make it easier to see the lower values

  • @olekbeluga314
    @olekbeluga314 1 year ago +5

    You guys are like doing a spectrum analyzer for files, awesome :)

  • @brissance
    @brissance 1 year ago +1

    Sir,
    thank you for the excellent videos.

  • @TheMASTERshadows
    @TheMASTERshadows 1 year ago +3

    map[y][x] + 1 is a better fit, to omit sub 1 values, edit: also I was thinking maybe normalizing the values by dividing by the squared deviation would make the result less contrasty and remove the overshadowing of the high frequency values

  • @bbq1423
    @bbq1423 1 year ago +13

    Another way to visualize 3d stuff would be to map the 3rd dimension to a color. That way you don't need a fancy 3d viewer to look at the visualization.

    • @CD4017BE
      @CD4017BE 1 year ago +4

      That method is not possible in this case because the dataset to visualize is effectively `float[256][256][256]`. If you put that on a 256 x 256 image then each pixel still needs to represent float[256] which doesn't work with only 3 (or 4) color channels (you would need 256 color channels).
      But what you could do is make a video with 256 frames of 256 x 256 images displaying gray-scale.
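
The "256 frames" idea above can be sketched roughly as follows. This assumes the 3D dataset is a count of every three consecutive bytes, and that frame number z is the 256x256 slice where the third byte equals z; `trigram_counts` and `frame` are hypothetical names, and writing the frames out as a video is left aside.

```python
from collections import Counter


def trigram_counts(data: bytes) -> Counter:
    """Count every run of three consecutive bytes (x, y, z)."""
    return Counter(zip(data, data[1:], data[2:]))


def frame(counts: Counter, z: int) -> list[list[int]]:
    """Return the 256x256 grayscale slice for a fixed third byte z."""
    grid = [[0] * 256 for _ in range(256)]
    for (x, y, third), n in counts.items():
        if third == z:
            grid[y][x] = n
    return grid


counts = trigram_counts(b"abcabcabd")
frame_c = frame(counts, ord("c"))
print(frame_c[ord("b")][ord("a")])  # 2, because "abc" occurs twice
```

A sparse `Counter` is used instead of a dense `float[256][256][256]` array so that the sketch stays cheap for small inputs; each frame is only materialized when it is requested.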

  • @bbq1423
    @bbq1423 1 year ago +9

    Maybe it's time to create a binary format that always looks like Rick Astley when viewed via binviz

  • @vantadaga
    @vantadaga 1 year ago +3

    Reverse engineering is a really interesting field

  • @KitsuneAlex
    @KitsuneAlex 1 year ago +2

    PNG is an interesting one because afaik it's essentially a bitmap inside a GZip archive.

  • @nkusters
    @nkusters 1 year ago

    as for "why are raw audio files like that?", it's a wave, so the most values will be at the extremes as it slows down and moves to the other side again, where you'll have more values near 0 and 255, as it's a wave and slows down at those points.

  • @jtucker87
    @jtucker87 1 year ago

    Hey, that's Derbycon! I'm in Louisville! Wasn't expecting that.

  • @mire6134
    @mire6134 1 year ago +9

    How about training a generative model on this kind of data, then having it generate raw bytes and seeing what they look/sound like, depending on whether the model was trained on images or sounds.

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 1 year ago +6

      Sadly, it probably won't look very interesting, because the structure of the resulting 256x256 image simply describes probabilities for each pair of bytes. It's just not enough to capture meaningful details of the real data, so the generative model won't significantly outperform an actual bigram model
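
To make the "actual bigram model" mentioned above concrete, here is a small sketch that generates bytes from pair statistics alone. The names `build_model` and `sample`, the fixed seed, and the toy input are all assumptions for illustration; the point is only that each new byte depends on nothing but the previous one.

```python
import random
from collections import Counter, defaultdict


def build_model(data: bytes) -> dict[int, Counter]:
    """For each byte value, count which bytes follow it in the data."""
    model: dict[int, Counter] = defaultdict(Counter)
    for x, y in zip(data, data[1:]):
        model[x][y] += 1
    return model


def sample(model: dict[int, Counter], start: int, length: int, seed: int = 0) -> bytes:
    """Generate bytes by repeatedly drawing a successor of the last byte."""
    rng = random.Random(seed)
    out = [start]
    # Stop early if the last byte never had a successor in the source.
    while len(out) < length and model.get(out[-1]):
        followers = model[out[-1]]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        out.append(nxt)
    return bytes(out)


source = b"the theory of the thing then thins"
model = build_model(source)
print(sample(model, start=ord("t"), length=40))
```

The output only ever contains byte pairs that occurred in the source, which is why it tends to look locally plausible and globally meaningless.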

  • @omarojmb
    @omarojmb 1 year ago

    yet another fucking good recreation programming session

  • @Anonymous44223usa
    @Anonymous44223usa 1 year ago +1

    Tsoding Daily, which I search daily for programming magic.

  • @adolfocarrillo248
    @adolfocarrillo248 1 year ago +3

    Man, you've a keen mind😂😂. Amazing what you can do with a computer language!!

  • @danielleontiev7134
    @danielleontiev7134 1 year ago

    My assumption as to why images look the way they do, is because any image element is a run of pixels in an x direction, spanning multiple sections of the same section downward in the y direction. In the output it looks like the byte patterns are shifted right and down, because all images take their origin of 0.0 at the top left corner 🤔

  • @Ivan-qw4mn
    @Ivan-qw4mn 1 year ago +1

    I am really eager to understand THE MATH behind this stuff (to be precise it is probably math statistics) -- any ideas where to start precisely?

  • @arcxm
    @arcxm 1 year ago

    Very interesting stuff as always! Would love a follow up on this topic

  • @Lofen
    @Lofen 1 year ago +5

    Why not put nob in its own repo and simply use it as a submodule for new projects?

  • @rogo7330
    @rogo7330 1 year ago +5

    My solution for dependencies: just steal the code and fix it yourself. If you can't fix it yourself: dismember it and write it again. God bless MIT licenses, with which you can forget that you stole someone's code and you will not be crucified by the GNU church or sold on the black market by big corp.

  • @skr-kute1677
    @skr-kute1677 1 year ago +1

    I think u could just apply log AFTER you normalize and it would work just fine
    However, normalize to a max value of like 10 so that the log curve is actually "used"
    And you can avoid the undefined log of 0 by adding 1 to all the values
    It doesn't affect the pattern much
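
One possible reading of the recipe above, as a sketch: normalize counts to a small range first, add 1, take the log, then rescale to [0, 1]. The function name `log_brightness` and the `scale` parameter (the "max value of like 10") are assumptions, not what the stream's code does.

```python
import math


def log_brightness(grid: list[list[int]], scale: float = 10.0) -> list[list[float]]:
    """Map raw pair counts to brightness values in the range [0, 1]."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0.0] * len(row) for row in grid]
    # Normalize to [0, scale] first, then take log(1 + v) so that empty
    # cells stay at 0 and log(0) never has to be evaluated.
    top = math.log1p(scale)
    return [[math.log1p(c / peak * scale) / top for c in row] for row in grid]


counts = [[0, 1], [10, 1000]]
for row in log_brightness(counts):
    print([round(v, 3) for v in row])
```

Raising `scale` bends the curve harder, so rare pairs become brighter relative to the most frequent pair; with a very small `scale` the result is close to plain linear normalization.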

  • @0ne87
    @0ne87 1 year ago +5

    "I don't even see the code. All I see is blonde, brunette, red-head."

  • @volbla
    @volbla 1 year ago +1

    Haha, i did this on a text file and only got single pixel lines at the top and left of the image. It turned out the file was encoded in UTF-16, so every other byte was just zero ^ -^
    Another loss for UTF-16. Great stuff.
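
The effect described above is easy to reproduce. A minimal sketch, assuming ASCII-only text and little-endian UTF-16 without a byte order mark:

```python
text = "Hello, bigrams!"
data = text.encode("utf-16-le")  # every ASCII character becomes (byte, 0x00)

pairs = set(zip(data, data[1:]))
# Each pair contains a zero byte, so it lies on the x = 0 or y = 0 edge.
on_edge = all(x == 0 or y == 0 for x, y in pairs)
print(on_edge)  # True
```

Text with many non-ASCII characters would fill in the high bytes and move some pairs away from the edges.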

  • @SiddheshPardeshi-mp9cr
    @SiddheshPardeshi-mp9cr 1 year ago +1

    Tsoding poggers as usual

  • @ThatGuyexe
    @ThatGuyexe 1 year ago +5

    Love this guy ❤

  • @aemogie
    @aemogie 1 year ago +1

    is it possible to use nob's GO_REBUILD_URSELF from the main file instead of a separate build script?

  • @rogo7330
    @rogo7330 1 year ago +1

    I feel like man pages will produce very specific pattern when doing that shit. Also, thinking while watching video, you probably want to make most popular pixels brighter, while rare hits must have low value, because they are just random hits.

  • @KalinRangelov
    @KalinRangelov 1 year ago

    Very cool stuff.
    But kind of missing the point on images. Binviz should recognize the file type. In order to recognize an image, you need to know it's an image and read it raw.

  • @kevinquintana3085
    @kevinquintana3085 5 months ago

    Amazing

  • @forayer
    @forayer 1 year ago

    Very cool topic! Ty

  • @GlobalYoung7
    @GlobalYoung7 11 months ago

    Thank you 😊

  • @ChaoticNeutral6
    @ChaoticNeutral6 1 year ago

    This is an amazing video, although I can't get the line pattern at all when I tried this out at home (the x64 pattern works fine).
    I wonder if you used steganography to hide an executable inside an image, would this sort of visualisation technique be able to identify that? Maybe it would just see higher entropy in the picture but no other signs

  • @LeysTeamProsperity
    @LeysTeamProsperity 1 year ago

    ❤ Nice topic

  • @hoteny
    @hoteny 1 year ago

    8:58 i would love a NN that can guess structures from a multistructured file though. Idk how that would work, though. Like, how to guess when a structure ends and the next begins.

    • @belst_
      @belst_ 1 year ago +1

      you can probably scan small slices of the file, then when u find a pattern, increase the slice until the pattern gets more blurry, and then mark the area as that file structure, then continue after the marked area with a smaller slice again

    • @hoteny
      @hoteny 1 year ago

      @belst_ yeah but how do you decide the slice size and slice incrementation amount to minimize risk while not making this operation take hundreds of years?
      Also can another approach be possible or an ai approach?
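
A much simpler stand-in for the slice-scanning idea in this thread, offered only as a sketch: instead of growing slices and comparing bigram patterns, it uses fixed-size windows and flags offsets where the byte entropy jumps. The window size, the `jump` threshold, and the function names are arbitrary assumptions, and it costs a single linear pass over the file.

```python
import math
from collections import Counter


def entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte values in chunk, in bits per byte."""
    total = len(chunk)
    return -sum(n / total * math.log2(n / total) for n in Counter(chunk).values())


def find_boundaries(data: bytes, window: int = 256, jump: float = 2.0) -> list[int]:
    """Return offsets where entropy changes by more than `jump` bits."""
    offsets = range(0, len(data) - window + 1, window)
    scores = [entropy(data[o:o + window]) for o in offsets]
    return [
        o for o, before, after in zip(offsets[1:], scores, scores[1:])
        if abs(after - before) > jump
    ]


blob = b"A" * 1024 + bytes(range(256)) * 4
print(find_boundaries(blob))  # [1024]
```

Entropy alone cannot tell two structures with similar randomness apart (for example two different compressed formats), so comparing the per-window bigram grids would be the natural next step.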

  • @angelomarano8458
    @angelomarano8458 1 year ago

    I'm trying to make it 3D and obviously i can't. Any suggestions?

  • @MACAYCZ
    @MACAYCZ 1 year ago +1

    I love you and your awesome content!♥

  • @cobbcoding
    @cobbcoding 1 year ago +1

    1:58:10 super smash bros in executable confirmed?

  • @ecosta
    @ecosta 1 year ago

    1:14:14 - This weirdness in MSVC is true for a bunch of POSIX stuff. They have "stat" as "_stat" and other shite like that. Stupid vendor-locks are stupid.

  • @Author-Bangladesh
    @Author-Bangladesh 1 year ago

    Why don't you use doom emacs or spacemacs? How does raw emacs save you time?

  • @josephcbs6510
    @josephcbs6510 1 year ago +2

    This msvc complex numbers thing is so fucking annoying
    Every time I need to do something physics related, I need to reimplement complex numbers because of this stupid msvc
    I remember my reaction the first time I tried to compile something with c99 complex numbers on msvc and it did not compile. The sensation I felt the moment I opened the msvc documentation and saw how complex numbers work on msvc is burned into my mind. I still have nightmares about it from time to time

  • @ymathh3808
    @ymathh3808 1 year ago +1

    unfortunately the video has no subtitles :[ I'm not 100% fluent in English 😢

    • @iamdozerq
      @iamdozerq 1 year ago +2

      His accent is VERY easy to understand.

  • @Maximxls
    @Maximxls 9 months ago

    I'm pretty sure your application of log is VERY wrong. You basically set the maximum to the log of itself, while not changing the data in any way. I think the right way to do it would've been to just replace all the numbers with their logs (leave zeros as is).

  • @diegorocha2186
    @diegorocha2186 1 year ago

    Who needs sha256 if we have file fingerprints like this!

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 1 year ago +2

    10:25 why do most c hello worlds use printf("Hello, world\n") instead of puts("Hello, world")?

    • @TsodingDaily
      @TsodingDaily 1 year ago +11

      Because if everyone was using puts("Hello, world") you would be asking why not printf("Hello, world\n")

    • @RandomGeometryDashStuff
      @RandomGeometryDashStuff 1 year ago

      @TsodingDaily no, because printf is more complicated because of the f, like %s, %d, %%, %c...

    • @Anubis10110
      @Anubis10110 1 year ago +1

      😅

    • @anon_y_mousse
      @anon_y_mousse 1 year ago +1

      @RandomGeometryDashStuff It's because printf is like the gateway drug into C.

    • @CD4017BE
      @CD4017BE 1 year ago +1

      Probably because the function name `printf` is more intuitive than `puts` for a person that just starts programming for the first time.

  • @PeterJepson123
    @PeterJepson123 1 year ago +4

    Here is a thought. With a powerful Neural Network it might be possible to reverse this process and produce an executable binary from the image. Lol. And then, we could apply the gaussian diffusion process which is used by midjourney et al. to mix different images based on labels and produce entirely new binary files. Then we could skip the programming and compiling altogether and simply text-prompt the features we want and produce a binary application. I imagine that would be very difficult to program but it certainly seems possible.
    Good video as always. Cheers.

    • @drdca8263
      @drdca8263 1 year ago +2

      These images are only a statistical summary of the file. They always are 256 by 256 regardless of the size of the input file.
      It isn’t like a spectrogram of an audio file, which includes enough info about the file to recover a large amount of the original sound (or possibly even all of it if you keep the phase information?).
      At most, you could see each of these visual representations as being like, a simple statistical model (specifically a Markov chain) created from a single file, and you could sample bytes according to it,
      but there would be tons of possible files fitting with the same bigram statistics, and most of them would be complete garbage.
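
The point about many files sharing the same bigram statistics can be shown with a tiny example. This sketch assumes the picture depends only on the counts of consecutive byte pairs; the two inputs below are different byte strings that produce exactly the same counts.

```python
from collections import Counter


def bigram_counts(data: bytes) -> Counter:
    """Count every pair of consecutive bytes in data."""
    return Counter(zip(data, data[1:]))


first = b"abaca"
second = b"acaba"
print(first != second)                                # True
print(bigram_counts(first) == bigram_counts(second))  # True
```

Both strings contain the pairs ab, ba, ac and ca exactly once, just in a different order, so no model can tell them apart from the image alone.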

  • @zahash1045
    @zahash1045 1 year ago +16

    Yeah right, an elite Russian programmer that's "definitely not a hacker". Nice try.

  • @mrcrafter_y
    @mrcrafter_y 1 year ago +1

    13:09 Epic scheiße

  • @0ne87
    @0ne87 1 year ago

    cheese viz

  • @ShanyGolan
    @ShanyGolan 1 year ago +1

    Let's ggoooooo

  • @noctavel
    @noctavel 1 year ago

    I start with a like and then every time i see something cool, i dislike and like it again. avg like rate: 7 likes per video. you're welcome

  • @ilikegeorgiabutiveonlybeen6705
    @ilikegeorgiabutiveonlybeen6705 1 year ago +1

    antiplagiat 0_o

  • @kawaikaede2269
    @kawaikaede2269 1 year ago

    💀

  • @upbeatsarcastic8217
    @upbeatsarcastic8217 1 year ago +4

    This guy nobs

  • @satchelfrost6531
    @satchelfrost6531 1 year ago

    lol 1:24:58

  • @Joorin4711
    @Joorin4711 1 year ago

    To claim that there are many programming languages because some company wanted to own some market is, at best, naive and, at worst, stupid. Have some companies tried to own some market by trying to introduce their own version of a language? Yes. But to extrapolate from that is just not valid.
    Programming languages are tools that often are designed with a specific problem in mind and that has been true all the way from ALGOL, PROLOG, FORTH and LISP up to Rust and Python and the rest of them all.

    • @anon_y_mousse
      @anon_y_mousse 1 year ago +9

      At worst it's an exaggeration. It can't be denied that nearly every product that Microsoft has put out was an attempt to dominate the market. While he may have exaggerated by saying programming language, had he restrained himself to just Microsoft products he'd be 100% spot-on.

    • @TsodingDaily
      @TsodingDaily 1 year ago +6

      > Programming languages are tools that often are designed with a specific problem in mind
      Speaking of stupid, how does that contradict what I've said?
      And thinking that Rust, for instance, is not an attempt at expanding someone's influence is speaking of naive. 🤡

  • @iivarimokelainen
    @iivarimokelainen 1 year ago

    as someone using modern tools like an IDE... this was so painful to watch. it's like coding on a C64 and calling it productive