Zip works Better than AI

  • Published: 15 Sep 2024
  • Science

Comments • 160

  • @meatcow417
    @meatcow417 8 months ago +276

    This is hilarious because my teachers back in college used this exact formula with Gzip to calculate plagiarism between programming assignments.
    Guess some people knew that Gzip could do this 10 years ago, and we just forgot it worked.

    • @rigen97
      @rigen97 8 months ago +34

      I hope your teachers read this paper and collectively facepalm, because they just didn't realize it was something they could've published.

    • @ludoviclagouardette7020
      @ludoviclagouardette7020 8 months ago +21

      @@rigen97 As a teacher I am internally facepalming; I am terribly and deeply saddened that I didn't publish that first

    • @AlgoFodder
      @AlgoFodder 8 months ago

      The underlying idea goes back at least a couple of decades. You can google the paper: "Language Trees and Zipping" by Dario Benedetto, Emanuele Caglioti, Vittorio Loreto, which appeared in Phys. Rev. Letters in 2001

    • @AvanaVana
      @AvanaVana 7 months ago +11

      The point of the paper is not so much that compression algorithms _can_ be used to calculate the degree of identicalness of data, but rather that they can use that ability at scale to classify large sets of data; moreover, in this task _they outperform complicated neural networks_ with millions of parameters, purpose-built to do this. Part of the reason it never occurred to your teachers to publish on this 10 years ago is that 10 years ago, “AI” wasn’t yet the buzzword/hot topic that it is today, garnering so much hype, focus, and $$$. In the current climate, any simple idea can become a novel research paper or an alluring pitch in Silicon Valley if it is recontextualized and repackaged (given that it is actually useful) as part of the burgeoning “AI” dialogue/industry.

  • @Geosquare8128
    @Geosquare8128 8 months ago +117

    this is my favorite genre of cs research right now, the fact that compression is a useful clustering metric is so funny but also makes so much sense.
    stuff like this makes me hopeful that one day we'll escape the "scale everything" model of ML research and lean into the information/learning theory side of AI research :))

    • @billowen3285
      @billowen3285 8 months ago +4

      Didn't expect to see you here but I guess it checks out

    • @Geosquare8128
      @Geosquare8128 8 months ago +8

      lol yeah this is what I do most now

    • @shivashankar28
      @shivashankar28 8 months ago +3

      hey, I am currently doing some reading on compression algorithms for embedded systems like the STM32 and nRF52 microcontrollers. Do you have any suggestions please? My current contenders are miniz and Heatshrink, but I want an expert to weigh in or at least point me in some direction :))

    • @andrewferguson6901
      @andrewferguson6901 7 months ago +1

      some would even go as far as saying that compression and intelligence, at the extreme, are nearly indistinguishable

    • @BlackDoggo-dy6lp
      @BlackDoggo-dy6lp 7 months ago

      are you tired of people saying "oh look it's the dream guy"

  • @Alexander_Sannikov
    @Alexander_Sannikov 8 months ago +94

    Computing Kolmogorov complexity is not NP-hard. No, it's literally uncomputable. Not in polynomial time, not in exponential time; it's just not computable in any finite time. Wikipedia has more details as to why, but TL;DR: you can't even brute-force iterate through all possible programs, because some of them will never terminate, and just figuring out whether a given program will terminate is already an uncomputable problem.

    • @JonasDonald
      @JonasDonald 7 months ago +16

      True, good point! Just a minor nitpick: it still is technically NP-hard, since NP-hard just means, roughly speaking, "as hard as or harder than" the "hardest" problem in NP (with regard to computational complexity)

    • @NXTangl
      @NXTangl 7 months ago +1

      Also, there's a pretty easy argument even without getting into the halting problem: if you could compute Kolmogorov complexity with a finite program, then for any n there would exist a program which finds the smallest number with Kolmogorov complexity greater than n and outputs it. But that program only needs about log(n) bits to encode n, plus a constant, so for large n it is shorter than n, meaning the number it outputs has complexity less than n. Contradiction.

    • @AmaroqStarwind
      @AmaroqStarwind 7 months ago

      What if you used a quantum computer?

    • @NXTangl
      @NXTangl 7 months ago +3

      @@AmaroqStarwind a quantum computer can be simulated by a classical computer with exponential overhead. So no. However, the decision version of the problem, asking whether Complexity(str)

    • @andrejjjj2008
      @andrejjjj2008 7 months ago +1

      Can you explain to an idiot:
      Why can't you get the length of the program that produces a piece of text?
      Why is it uncomputable?
      I broke my brain reading the definition like a dozen times at this point.
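
      A compact version of the argument from the replies above (a sketch in LaTeX, paraphrasing the standard Berry-paradox proof; not from the video):

      % Assume a finite program P computes K(x) for every string x.
      % For each n, let Q_n enumerate strings in order and output the
      % first x with K(x) > n. Q_n only has to encode n and reuse P, so
      $$ |Q_n| \le |P| + O(\log n) < n \quad \text{for large } n, $$
      % yet Q_n outputs a string x with K(x) > n, while by definition
      % K(x) \le |Q_n| < n. Contradiction, so P cannot exist.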

  • @viacheslavprokopev8192
    @viacheslavprokopev8192 8 months ago +50

    Yes, it works because if X and Y are similar (have a bunch of words in common), their normalised compressed length will be small, and if the two texts are different, it will be large. Think of it as "more common words = smaller compressed representation".

    • @OCPyrit
      @OCPyrit 8 months ago +3

      Yes, and I think it's not only common words but also things like similar overall structure etc. This meta-information is derived from the raw data. And category is such meta-information as well.
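
      For reference, the NCD this thread is describing can be sketched in a few lines of C with zlib (a rough, untested sketch; link with -lz; toy buffer sizes):

      #include <stdio.h>
      #include <string.h>
      #include <zlib.h>

      // Deflated size of buf: a computable stand-in for C(x).
      static size_t gz_len(const unsigned char *buf, size_t len)
      {
          unsigned char out[1 << 16];
          uLongf outlen = sizeof(out);
          compress2(out, &outlen, buf, len, Z_BEST_COMPRESSION);
          return outlen;
      }

      // NCD(x,y) = (C(xy) - min(C(x),C(y))) / max(C(x),C(y))
      static double ncd(const char *x, const char *y)
      {
          unsigned char xy[1 << 15];
          size_t xl = strlen(x), yl = strlen(y);
          memcpy(xy, x, xl);
          memcpy(xy + xl, y, yl);
          size_t cx  = gz_len((const unsigned char *)x, xl);
          size_t cy  = gz_len((const unsigned char *)y, yl);
          size_t cxy = gz_len(xy, xl + yl);
          size_t lo = cx < cy ? cx : cy;
          size_t hi = cx < cy ? cy : cx;
          return (double)(cxy - lo) / (double)hi;
      }

      int main(void)
      {
          // Texts with more words in common should score lower (more similar).
          printf("%f\n", ncd("the cat sat on the mat", "the cat sat on the hat"));
          printf("%f\n", ncd("the cat sat on the mat", "quarterly revenue grew a lot"));
          return 0;
      }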

  • @mindasb
    @mindasb 8 months ago +14

    Knowledge is compression. Your mind creates a model of the world in your head that is a compressed version of the world.

    • @anon_y_mousse
      @anon_y_mousse 8 months ago +4

      @@theboomerperspective2478 Sure it does, otherwise you'd have perfect recall.

    • @EvilTim1911
      @EvilTim1911 7 months ago +3

      And it happens to be a very lossy compression

    • @anon_y_mousse
      @anon_y_mousse 7 months ago

      @@belgianwaffles197 I'm not sure why you brought that up, or how you think any scientist knows that about chimps, but okay.

    • @anon_y_mousse
      @anon_y_mousse 7 months ago

      @@belgianwaffles197 Again, I'm asking you, but in a different way, relevance?

    • @anon_y_mousse
      @anon_y_mousse 7 months ago

      @@belgianwaffles197 Ah nice, so you take offense at everything. Got it. What you said was neither correct nor relevant. And I still don't know why so many of you have this weird brain disease. But thanks for helping me feel less inclined to be sad when humanity goes extinct in the next 100 years. I won't even miss half of you.

  • @stephaneduhamel7706
    @stephaneduhamel7706 8 months ago +21

    Having a value of 2 for k is strange. It's going to cause lots of cases where two classes have the same score... Ideally, it shouldn't be a multiple of 2 or 3, to avoid these kinds of conflicts.
    Personally, I would pick some prime number bigger than the number of classes, for example 5

    • @azai.mp4
      @azai.mp4 8 months ago +2

      And then if you get a sample like (A, A, B, B, C), where A and B are tied, you can rerun the k-nearest-neighbors but only considering neighbors in classes A or B. And you'll get a new sample like (A, A, A, B, B) for example, where you can now pick A as the "winner".
      As long as you use a prime number larger than the number of classes, this always reduces the number of classes under consideration, so if you repeat the process enough times, you'll always reach a winner.
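
      A minimal sketch of that re-vote scheme in C (hypothetical names; `sorted` is assumed to hold the klass of every training sample, ordered by ascending NCD to the test sample):

      #include <stdbool.h>
      #include <stddef.h>

      #define NKLASSES 4   // hypothetical klass count

      static int knn_vote(const int *sorted, size_t n, size_t k)
      {
          bool live[NKLASSES];
          for (int c = 0; c < NKLASSES; c++) live[c] = true;

          for (;;) {
              // Vote over the k nearest neighbours whose klass is still live.
              size_t votes[NKLASSES] = {0};
              size_t taken = 0;
              for (size_t i = 0; i < n && taken < k; i++)
                  if (live[sorted[i]]) { votes[sorted[i]]++; taken++; }

              int best = 0;
              for (int c = 1; c < NKLASSES; c++)
                  if (votes[c] > votes[best]) best = c;

              // Unique winner: done. Otherwise drop everything below the tie
              // and re-vote; with k prime and larger than NKLASSES, each round
              // strictly shrinks the live set, as argued above.
              int tied = 0;
              for (int c = 0; c < NKLASSES; c++)
                  if (live[c] && votes[c] == votes[best]) tied++;
              if (tied == 1) return best;

              for (int c = 0; c < NKLASSES; c++)
                  if (votes[c] < votes[best]) live[c] = false;
          }
      }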

  • @joshuascholar3220
    @joshuascholar3220 7 months ago +4

    One point is that one of the reasons this works is THE KIND of compression algorithm that gzip is.
    There are programs that compress text BETTER than gzip does WHICH WOULD BE WORSE for measuring similarity between texts, because they're better at taking advantage of global properties of the text (for instance, recognizing the language) and less focused on taking advantage of context within the text.

  • @marvos420
    @marvos420 8 months ago +15

    Tsoding, I really love your content dude, so genuine and interesting, you're really inspiring!
    I've wanted to ask whether you're into signal processing, and if yes, it would be cool to see you do a nice tsoding session about the Fourier transform and even wavelets: how they're similar yet very different in the sense of how they give information about both time and frequency :D You could even demonstrate how wavelets can be used to analyze features in pictures, like blood vessels in medical imaging, and so on!
    I'd really find that interesting coming from you and the way you explain and talk about complex things in an understandable way.
    Much appreciated and continue doing what you're doing!

    • @adriaanj
      @adriaanj 8 months ago

      He already did some videos on this for musializer iirc

  • @hr3nk
    @hr3nk 8 months ago +25

    WARNING: I have to warn anyone who believes the paper's claim that this approach is better than AI - it's not. The paper's code has been debunked: the authors made a critical mistake in the metric calculation when comparing their method against BERT. I don't know if this is mentioned on stream, but it felt important.

    • @chibi_bb9642
      @chibi_bb9642 7 months ago +11

      source?

    • @droidBasher
      @droidBasher 7 months ago

      @@chibi_bb9642
      Ken Schutte has a summary on his blog; search for "gzip-knn-paper2 kenshutte" for a link to it.
      In summary: K=2 is weird, it causes lots of ties, and a flaw in the paper's code causes ties to resolve toward the true labels. There may also be train/test data contamination.

    • @cat47
      @cat47 7 months ago +3

      with the updated metric, it is slightly outperformed by BERT

    • @manawa3832
      @manawa3832 6 months ago

      hello source?

    • @droidBasher
      @droidBasher 6 months ago +1

      I've provided a source a few times, but my message gets deleted. A Google search will easily find the flaws in the gzip classifier paper.

  • @lizardy2867
    @lizardy2867 7 months ago

    I am glad I came across your channel today. I am currently doing research and development on a project which attempts to model Kolmogorov complexity. I hope to see more of your journey as I go along with mine.
    One approach I have decided to utilize is compression by reference; pointers are one implementation of this method. A word, for example, is composed of letters, each a single byte value. A reference frame determines the level of compression possible. This is why gzip and k-NN are very successful: by caching the reference frame, you cache the reverse program to decompress the data.

  • @shrddr
    @shrddr 8 months ago +9

    to optimize `deflate(a+b)` maybe you can save the state of deflator right after `a`, and continue from that checkpoint for every `b`?
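
    zlib has deflateCopy for exactly this kind of thing: feed `a` once with Z_NO_FLUSH, then clone the stream state for every `b`. A rough, untested sketch (link with -lz; error checks omitted, buffer size arbitrary):

    #include <string.h>
    #include <zlib.h>

    static unsigned char sink[1 << 20];    // scratch output; only sizes matter

    // Feed `a` into a fresh deflate stream and leave the stream open.
    static void checkpoint_after_a(z_stream *s, const unsigned char *a, size_t alen)
    {
        memset(s, 0, sizeof(*s));
        deflateInit(s, Z_DEFAULT_COMPRESSION);
        s->next_in   = (unsigned char *)a;
        s->avail_in  = (uInt)alen;
        s->next_out  = sink;
        s->avail_out = sizeof(sink);
        deflate(s, Z_NO_FLUSH);            // `a` is consumed, stream still open
    }

    // C(ab): clone the checkpoint, finish the stream with `b`.
    static size_t c_ab(z_stream *checkpoint, const unsigned char *b, size_t blen)
    {
        z_stream s;
        deflateCopy(&s, checkpoint);       // duplicates the whole internal state
        s.next_in   = (unsigned char *)b;
        s.avail_in  = (uInt)blen;
        s.next_out  = sink;                // fine to overwrite, we never read it
        s.avail_out = sizeof(sink);
        deflate(&s, Z_FINISH);
        size_t total = s.total_out;        // includes the output already produced for `a`
        deflateEnd(&s);
        return total;
    }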

  • @OREYG
    @OREYG 7 months ago +1

    43:20 - Pascal WAS a competitor to C, and it almost won. It didn't win because:
    a) arrays start from 1
    b) lack of macros (macros were considered an uber-feature at the time)
    c) C was available out-of-the-box in Unix
    Zig is a good candidate to fill the C niche because it DOES focus on simplicity, but it won't take off while it's too focused on its own ideals (not until there is a C-style for loop, and some 'errors' are recategorized as warnings; for example, an unused variable is an error)
    44:00 it doesn't matter, because fread (and the OS) does caching for you, and realloc takes some extra capacity

  • @anon_y_mousse
    @anon_y_mousse 8 months ago +3

    I agree that Pascal is the only real potential substitute for C, and you mentioning that reminded me that Niklaus Wirth died on the first.

  • @wernersmidt3298
    @wernersmidt3298 8 months ago +19

    This paper is a beautiful troll. I love how you think about how something works and turn the paper into a sort of puzzle. I'm stealing this technique without shame 😂
    "Why the fuck would this even work?" is a question that really needs to be asked more by the deep learning kool-aid kids. (Let's be honest - we use these models when we lack the capacity to explain whatever system we're..."modelling")

  • @JonnyDeRico
    @JonnyDeRico 8 months ago +5

    wow this is cool shit, I heard geohot talking about how AI is basically compression :D And here it is clearly demonstrated.

  • @Mozartenhimer
    @Mozartenhimer 8 months ago +1

    Regarding the annoyance with custom printers: you can give them your compiler's printf attribute so you get nice warnings.
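
    For example, with GCC/Clang (a minimal sketch; the attribute tells the compiler to type-check the variadic arguments against the format string, as with printf itself):

    #include <stdarg.h>
    #include <stdio.h>

    __attribute__((format(printf, 1, 2)))
    static void log_msg(const char *fmt, ...)
    {
        va_list args;
        va_start(args, fmt);
        vfprintf(stderr, fmt, args);
        va_end(args);
    }

    // log_msg("%s: %d\n", "count", 42);   // ok
    // log_msg("%s\n", 42);                // warning: format/argument mismatch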

  • @MatejDujava
    @MatejDujava 8 months ago +4

    When k=2 and the two neighbors are different, the predicted klass will be picked based on the order of the classes (which I think is the problem), so I would say k = nklass + 1 should be better

  • @顔boom
    @顔boom 8 months ago +1

    Just a stream of thoughts...
    Given the formula: NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    I think it's quite educational to look at the minimal test case of samples: "aabb", "aa", "bbcc"
    Assuming a naive length based on dictionary size 2, entries D and dictionary lookups L
    "aa", "aabb" => (2D3L - min(1D1L, 2D2L)) / max(1D1L, 2D2L) = 1D2L / 2D2L
    "aa", "bbcc" => (3D3L - min(1D1L, 2D2L)) / max(1D1L, 2D2L) = 2D2L / 2D2L
    "aabb", "bbcc" => (3D4L - min(2D2L, 2D2L)) / max(2D2L, 2D2L) = 1D2L / 2D2L
    So, as expected, "aa" has as much in common with "aabb" as "aabb" does with "bbcc". Of course, it's far from the complete picture, but it's decent enough to illustrate the point.
    Not sure how useful it would be, but there's a parallel gzip called pigz. If nothing else, the concatenation process might be useful. You could store the compressed form of every entry and then, instead of doing C(xy) it might be enough to approximate by compressing y initialized with the dictionary from x. So halving the workload for the calculation for the given samples at least.
    k = 2 is probably used due to the short samples causing too much variability for larger neighbourhoods I would think?
    The moon landing being classified as sports isn't all that strange. At least I'm pretty sure I've heard such phrases as "landing (a goal)", "landing (a hit)", etc.
    Would be interesting if this also works for image classification, or maybe even face recognition, using jpeg.
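
    The dictionary-priming idea above maps onto zlib's preset-dictionary API (a rough, untested sketch; zlib only keeps the last 32 KiB of the dictionary, so this is an approximation):

    #include <string.h>
    #include <zlib.h>

    // Deflated size of y when the LZ77 window is primed with x,
    // a stand-in for C(xy) - C(x) that avoids recompressing x each time.
    static size_t deflate_len_primed(const unsigned char *x, size_t xlen,
                                     const unsigned char *y, size_t ylen)
    {
        static unsigned char out[1 << 16];
        z_stream s;
        memset(&s, 0, sizeof(s));
        deflateInit(&s, Z_DEFAULT_COMPRESSION);
        deflateSetDictionary(&s, x, (uInt)xlen);   // must precede the first deflate()
        s.next_in   = (unsigned char *)y;
        s.avail_in  = (uInt)ylen;
        s.next_out  = out;
        s.avail_out = sizeof(out);
        deflate(&s, Z_FINISH);
        size_t n = s.total_out;
        deflateEnd(&s);
        return n;                                  // roughly C(xy) - C(x)
    }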

  • @pastenml
    @pastenml 8 months ago +3

    CPU cycles have never been utilized more efficiently. Take that Sam Altman!

  • @cyanmargh
    @cyanmargh 8 months ago +2

    Calculating Kolmogorov complexity is not only NP-hard, it's actually uncomputable (and even Wikipedia contains a proof of this)

  • @Raspredval1337
    @Raspredval1337 8 months ago +4

    44:30 stdio already reads in chunks. So it kinda reads a chunk and then copies it into the user's read buffer. And yes, it is inefficient; you could probably do better with raw 'read' syscalls

  • @byt3blitz
    @byt3blitz 8 months ago +2

    24:21 One of your machine learning videos a while back inspired me to recreate a Kaggle contest. Not sure if it's considered a contest, more of a preliminary challenge before moving to some of the more difficult ones: it's the Titanic prediction model. I solved it using Python and immediately thought, wow, that was stupid. All the functions I used were completely abstracted and I had no clue what they were doing under the hood. As a means of learning, and an excuse to not use Python, I decided to cause immense pain to myself and reimplement the model in C! I spent a decent amount of time writing a CSV parser for the training data. The way the data was structured really pissed me off: nested double quotes and all sorts of crazy shit they packed into a given string lmfao. I ended up using a stack-based approach, and it worked perfectly. I'm still working on building the rest of the model; it's been challenging, but I'm learning a lot. It's videos like these that inspire me to dig a bit deeper.

  • @creatorofimages7925
    @creatorofimages7925 8 months ago +1

    I love your little rant about mathematicians giving stuff overly technical, complex names to make things seem complex. Mathematicians are really good at abstracting things into easier problems for themselves and really good at making them as complex as possible for others.

  • @ecosta
    @ecosta 7 months ago

    I love how I get exposed to different papers by watching Tsoding. It feels like I'm back at college (but without the stress of tests, etc).

  • @thegreywun
    @thegreywun 8 months ago +1

    you could get a large speed-up by caching the compressed samples - as you will only need to re-compress the combination of test+sample inside the loop, instead of all three each time
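
    A sketch of that caching (hypothetical names; `gz_len` is the deflated-size helper from the NCD snippet earlier, and `gz_len_concat` would be the same thing over a concatenation):

    // Precompute C(train[i]) once, outside the loops.
    for (size_t i = 0; i < ntrain; i++)
        c_train[i] = gz_len(train[i], train_len[i]);

    for (size_t j = 0; j < ntest; j++) {
        size_t c_test = gz_len(test[j], test_len[j]);   // once per test sample
        for (size_t i = 0; i < ntrain; i++) {
            // Only the concatenation is compressed inside the hot loop.
            size_t c_xy = gz_len_concat(test[j], test_len[j],
                                        train[i], train_len[i]);
            size_t lo = c_test < c_train[i] ? c_test : c_train[i];
            size_t hi = c_test < c_train[i] ? c_train[i] : c_test;
            ncd[j][i] = (double)(c_xy - lo) / (double)hi;
        }
    }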

  • @gianlucalocri
    @gianlucalocri 7 months ago

    Hi! New subscriber here!
    You have one of the freaking best channels on YT. Your content is beyond priceless!
    I'm speechless.
    BTW you really need a simple piece of software to display onscreen the number of times you say "right" 😂. (no offense intended!)
    Greetings from Italy!

  • @dromedda6810
    @dromedda6810 8 months ago +2

    loved watching this live

  • @cobbcoding
    @cobbcoding 8 months ago +9

    1:11:45 read all the data into memory and compressed it faster than GitHub can load a single page

  • @tomasruzicka9835
    @tomasruzicka9835 7 months ago

    It's interesting. I started getting really good at learning fast recently. I think it has to do with my ability to find a good enough "compression algorithm" and then compress the topic I want to learn. And yes, I first have to understand the concept, but I never really had a problem with that at school. What I did have a problem with was remembering a semester of detailed literature verbatim (times the number of courses I attended that semester) so I could do well on exams. (Stupid, really.) But in the end the process became: parse and index the issue for common concepts, generalize the differences as much as possible, remember the general concept once, remember the few variables to bend the concept to my will. Which, if you think about it, is like a compression algorithm: find common subsets, encode them by "keywords", remember the common keywords once (like a dictionary), remember the variables (the keys into the dictionary).

  • @Blubb3rbub
    @Blubb3rbub 8 months ago +2

    You could maybe also speed up the calculation of C(ab) by feeding the test sample into a deflate stream, making a full copy of the state of that deflate stream before feeding in the other sample, and then copying that backup again to get back to the state where the test sample was already deflated.

  • @punitkaushik7862
    @punitkaushik7862 7 months ago +1

    I am working on a DL project and it's 4 am and I am fucking dying because I don't have enough compute on Kaggle to run the models properly... and YouTube recommends me this. These recommendation systems are getting out of hand, man.

  • @varshneydevansh
    @varshneydevansh 8 months ago +6

    I was already smiling just from reading the title 💪🏻😁

  • @SuperFinGuy
    @SuperFinGuy 7 months ago

    It seems that data compression basically mimics cognitive abstraction, or how we form concepts. We find the similarities among objects in order to classify them. For example, we call something a chair because it meets certain criteria.

  • @CalebWetherell
    @CalebWetherell 8 months ago

    To get an intuition for setting k, I would look at the sorted NCD table for the first n records. If the klassifier is good, the klasses of the examples at the top of the sorted NCD table should often align with the test example.

  • @nephew_tom
    @nephew_tom 8 months ago +3

    Interestingly, when you say "state of the art AI language models and shit like that", the YouTube auto-generated subtitles do not show the "shit"...
    The YouTube guys implemented a shit-collector... 😅

  • @blackasthesky
    @blackasthesky 7 months ago

    What a hilarious paper, love it!

  • @nil0bject
    @nil0bject 7 months ago

    PASCAL was a good competitor to C. I'm surprised people like us still exist. GG

  • @buny0n
    @buny0n 7 months ago

    @00:12:31 their code also contains syntax errors (a missing ')' after every use of ...`encoded()`).

  • @MystycCheez
    @MystycCheez 17 days ago

    Are names a language's way of implementing (an approximation of) Kolmogorov complexity?

  • @SegevKlein
    @SegevKlein 7 months ago

    this video was awesome

  • @ossianarn9175
    @ossianarn9175 8 months ago

    You could try implementing the MDS algorithm using the NCD in order to project the points to a lower dimension. Then you could use k-means and see if the data can be clustered correctly.

  • @ivanjermakov
    @ivanjermakov 8 months ago

    By the same analogy, the brain is an interpreter that uses the body's sensory systems as IO. Fascinating!

  • @thenwhoami
    @thenwhoami 5 months ago

    "The internet feels increasingly smaller and smaller and smaller..." yes :(

  • @titaniumtomato7247
    @titaniumtomato7247 8 months ago

    I guess with k = 2 it's hard to show the "confidence" of the system, but that would have been a cool addition so that you can evaluate the output better

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 8 months ago +1

    22:46 but GitHub also needs to display stuff, not just load it into memory

  • @Pyronimous
    @Pyronimous 8 months ago +1

    That's freaking brilliant! I wonder if this would work for handwritten character recognition, like with the MNIST dataset or similar?

    • @hendrikd2113
      @hendrikd2113 8 months ago

      Bro, just google "mnist by gzip".
      The first result will tell you that this is possible, but 78% accuracy is kinda mid.

    • @stephaneduhamel7706
      @stephaneduhamel7706 8 months ago +1

      Maybe using PNG instead of gzip?

  • @thatstupiddoll
    @thatstupiddoll 8 months ago +1

    How does he beatbox so good

  • @xd_metrix
    @xd_metrix 8 months ago +2

    Just a little reminder: if you divide the video into 120 chapters, the division is useless because it doesn't tell a person anything. Perhaps it would be better to combine some parts and generalize

  • @mythacker4365
    @mythacker4365 8 months ago +4

    What a thumbnail it is 😂😂😂

  • @apppples
    @apppples 8 months ago

    another idea... could you preprocess the train data and find the centroid of each class, maybe 10 or so samples per class, that maximizes the distance from other classes while minimizing the distance to other elements of its class, then classify on this reduced set? though at that point maybe you'd just get the most common words per class... idk

  • @h65119
    @h65119 8 months ago

    Maybe it won't matter, but at 45:24 you were measuring the performance of the whole application (reading the file into memory and then counting the lines). Maybe counting the lines takes more time than opening the file, hence you cannot decide whether reading the whole file into memory is faster than reading it in chunks of 32k bytes. Unless maybe the OS is also reading the file into memory using 32k-byte chunks? Or maybe the allocator allocates pages of 32k bytes at a time and handles the page faults? IDK, but what I'm saying is maybe the measurement was flawed. Maybe I'm wrong; I am just curious, so please correct me if so.

  • @ianknowles
    @ianknowles 7 months ago

    Holy phek... here's me being an idiot using embeddings. This makes total sense of how it works. I'm only 10 mins in and I totally want to stop and try to implement it myself. I suspect for simple query-and-passage problems this would be useless, but for large bodies of text it would work a charm.
    Gives me an idea to use a thesaurus and/or terminologies to create domain-specific querying without an AI model

  • @ThankYouESM
    @ThankYouESM 4 months ago

    I might have proof of being the first person to find this solution, which was just to see if kNN can work with zip compression. By the way... I am not at all a mathematician, and I also might be the first to have replaced the sigmoid with the ReLU.

  • @ADRENELINEDUDE
    @ADRENELINEDUDE 7 months ago

    Any time someone says "AI" it's like using a generic term: "Peaches are better than fiber". "AI" is a generic term that can literally mean anything. I can make a garbage AI where, yes, "zip will work better"; however, a very well crafted ML compression system would then be better.

  • @diegoestrada35
    @diegoestrada35 8 months ago

    11:23 this reminds me of a video I saw of George Hotz, where he said exactly that. I think it was called "the days that changed my life" or something

  • @apppples
    @apppples 8 months ago

    try implementing HNSW for this to speed up your classification? it'd probably be a lot of fun to implement

  • @matthias916
    @matthias916 8 months ago +1

    Is it possible to show the probabilities of the classes? Or does it not calculate that

  • @jgfjfgjfhjf
    @jgfjfgjfhjf 3 months ago

    science is basically reverse debugging: find out why something works

  • @nyyakko
    @nyyakko 8 months ago +1

    @ 27:30 and this is why I'm always using -Werror in my projects 😅😅

  • @sergarrick
    @sergarrick 7 months ago

    I can't help but notice that all these papers that detract from modern LLMs use models that pre-date GPT-3. Like, no one ever claimed OPT or text-davinci were exciting, revolutionary, or game-changing. It wasn't until GPT-3/4, Llama, Mistral, Yi, etc. that LLMs started to feel magic. The ouroboros paper was the same thing: finetuning OPT with OPT outputs makes for a dumber model, while finetuning Llama with Llama outputs makes a far smarter model. Pretty sure that's how WizardLM was trained. Likewise, I'm not surprised gzip is smarter than BERT. Who is out here claiming BERT is the future?

  • @wzqdhr
    @wzqdhr 7 months ago

    Wait for the day someone comes up with an AI idea called "hash is all you need" and it's going to be more awesome than crypto

  • @owlmostdead9492
    @owlmostdead9492 7 months ago

    23:00 thank JS for that

  • @hubstrangers3450
    @hubstrangers3450 6 months ago

    Thank you....

  • @puzzlinggamedev
    @puzzlinggamedev 7 months ago

    Today I used this trick in the puzzle game I'm making :)

  • @ruroruro
    @ruroruro 8 months ago

    7:58 it's not NP-hard, it requires solving the halting problem

  • @dnkreative
    @dnkreative 8 months ago

    It works because NNs are actually lossy compressors, and with enough coefficients they approach a lossless compression state. And zip is just a lossless compressor. That seems obvious.

    • @dnkreative
      @dnkreative 8 months ago

      Btw, this has nothing to do with "intelligence" or "understanding"

    • @DelgardAlven
      @DelgardAlven 8 months ago

      @@dnkreative true, and still this raw gzip demo by zoze captures the meaning kinda more precisely than the average person in the chat LOL

  • @guptadagger896
    @guptadagger896 8 months ago

    11:04 Is this true? Does the ideal compression of a piece of text, for example, necessarily have the same entropy as the smallest program (in a specific language) that generates that text?

    • @guptadagger896
      @guptadagger896 8 months ago

      I think you'd have to take the minimum over all programming languages, at the very least. Do they all collapse to the same value, is the question
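
      For what it's worth, the standard answer to that question is the invariance theorem: for any two universal languages U and V there is a constant c_{U,V}, independent of x, with

      $$ |K_U(x) - K_V(x)| \le c_{U,V} \quad \text{for all } x, $$

      essentially the length of an interpreter for one language written in the other. So the values don't collapse to a single number, but they only ever disagree by an additive constant.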

  • @marcelbricman
    @marcelbricman 7 months ago

    it doesn't have anything to do with complexity. As you also realized, it's about the common parts of A and B. The LZ77 matching will do the heavy lifting here, basically looking for common substrings and reducing their occurrences to back-references (with Huffman coding on top).

  • @thatstupiddoll
    @thatstupiddoll 8 months ago

    this program can be further optimized by pre-caching the gzip of the training (and test?) samples

  • @koktszfung
    @koktszfung 8 months ago +1

    Nice thumbnail

  • @rigen97
    @rigen97 8 months ago

    tangent re text files: it's so so so stupid how "hard" it is for text editors to open files in the megabytes range. it's a minuscule fraction of a fraction of the RAM available in our modern systems, yet notepad++ for example lags so hard when you feed it files in the megabytes. while ed just chads through it.

  • @PedroPabloCalvoMorcillo
    @PedroPabloCalvoMorcillo 7 months ago

    You're fucking awesome.

  • @platin2148
    @platin2148 7 months ago

    Hmm, should take a look at Oodle I guess?

  • @DonAlcohol
    @DonAlcohol 8 months ago

    The difference between compression and AI is the same as the difference between ALAC/FLAC/WavPack and AAC/MP3/Vorbis: the first is obligated to reproduce the source material exactly, while the second tries to create a new thing that is perceived as being like the original. I recently had a discussion with someone about something related. I claimed that Nvidia in essence proclaimed that they hit the ceiling of their abilities by pushing all-in on AI, since if you take a step back and look at the process, they switched from increasing GPU power to improve image quality and frame rate, to prerendering the games at ultra quality on their server farm and shipping the frames the cards can't generate (due to lack of power) to the customer in a highly compressed form disguised as AI. If they just gave you hard disks full of complete prerendered frames, and your PC only had to pick the right frames to match the actions you make in the game, you'd be calling it a disgrace, because the AI can only do what it was trained to do; similarly, a compressed prerender can only approximate its original source. That makes the x-thousand AI accelerators pretty useless on their own for anything other than image restoration, and definitely for anything requiring accuracy... and I'm thinking of doing Fourier transforms in massive parallel (something gfx cards should be good at, and they are, but they won't get better than they are now), or protein folding, possibly even emulating synchronous logic systems or simulating old console hardware (similar to what FPGAs are used for today, e.g. a modern Game Boy).

  • @dinoscheidt
    @dinoscheidt 8 months ago

    12:10 😀 me as an AI researcher: 100% we needed to do worse to crush space to fit into the max page length. We shall apologize to all engineers ❤️

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 8 months ago +3

    05:00 can NCD(x,y)!=NCD(y,x) happen?

    • @igricRBX
      @igricRBX 8 months ago

      By definition, no.

    • @leiko7986
      @leiko7986 8 months ago +2

      Well in practice yes lmao

    • @DylanNelsonSA
      @DylanNelsonSA 8 months ago +3

      @@igricRBX What do you mean? The definition has C(xy) in it, which depends on the order of xy. I would find it surprising if the smallest program that outputs xy always has the same length as the smallest program that outputs yx. And even if we forget about Kolmogorov complexity and just focus on gzip, I did an experiment where I concatenated two latex files from university assignments that happened to be on my laptop, and used gzip to compress the result and looked at the size, and then did the same but with the concatenation in the other order, and the order did matter. (Although the difference was only about 10 bytes)

    • @DylanNelsonSA
      @DylanNelsonSA 8 months ago +2

      We could make the definition symmetric by replacing C(xy) with 1/2 (C(xy) + C(yx)). I don't know if it really matters. I think knn is already sensitive to the order in which you add new items.

    • @anon_y_mousse
      @anon_y_mousse 8 months ago +2

      @@DylanNelsonSA That's because of how it processes the data. The sliding window is only so large and if data repeats in certain ways it can compress better or worse. You can even generate data that can bork the processing time.
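
      That asymmetry is easy to check with the helpers sketched earlier (hypothetical names): compress both concatenation orders and compare.

      // C(xy) and C(yx) usually differ by a few bytes in practice, so
      // NCD(x,y) != NCD(y,x), even though the idealized Kolmogorov version
      // is symmetric up to an additive logarithmic term.
      size_t c_xy = gz_len_concat(x, xlen, y, ylen);
      size_t c_yx = gz_len_concat(y, ylen, x, xlen);
      printf("C(xy)=%zu C(yx)=%zu\n", c_xy, c_yx);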

  • @SteveRowe
    @SteveRowe 8 months ago

    I'd try to reproduce their results with their data.

  • @sukina5066
    @sukina5066 8 months ago

    42:50 true, I used to be a fanboy and it hurts but it's true

  • @ToniMorton
    @ToniMorton 8 months ago

    so theoretically information compression is the emergence of awareness/understanding?

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 8 months ago

    43:21 C is still easier for dynamic linking than Pascal

  • @marusdod3685
    @marusdod3685 7 months ago

    couldn't you define a macro to automatically create an array struct type, something like (needs <stdint.h>):
    #define vector(T) \
        typedef struct { \
            T *data; \
            uint16_t length, capacity; \
        } T##s

  • @ignaciovittorinifennema9791
    @ignaciovittorinifennema9791 8 months ago

    I wonder if you could make a basic spam/phishing filter using this

  • @flamendless
    @flamendless 8 months ago

    Parsing complex CSVs is easier than signing up on Kaggle, it seems 😂

  • @user-hc4we4kb4j
    @user-hc4we4kb4j 8 months ago

    Hello
    I want to learn OpenCV, but I have an i5 CPU, 8 GB of RAM and an Intel HD 520 GPU. Is that enough?

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 8 months ago

    40:53 fread and ferror don't set errno, so it would log an incorrect error message

  • @jjonojj
    @jjonojj 8 months ago +1

    good video

  • @helloworld9018
    @helloworld9018 8 months ago

    Can someone clarify which main fields of math and CS this man is interested in?

  • @elrichjr301
    @elrichjr301 8 months ago

    Here's a crackpot idea: AI is lossy compression for text?

  • @blackhaze3856
    @blackhaze3856 8 months ago +1

    I thought gzip used Huffman. Or, most probably, I just didn't understand shit xD

  • @paulpruenster5884
    @paulpruenster5884 8 months ago

    3:06 FACTS hahaha bothers me so much sometimes

    • @grivza
      @grivza 8 months ago

      I mean, "k nearest neighbors classifier" is a pretty descriptive name; what else would you want them to call it?

  • @luandkg
    @luandkg 8 months ago

    Omg, Tsoding, are you learning Portuguese? “cinco” 👉🏼👈🏼

  • @muhammedyasinhanyasar539
    @muhammedyasinhanyasar539 8 months ago

    I was watching your videos, and you said in one of your previous streams that "the less you know about a subject, the better you are than the others; you get the job done quickly." And then I just ran across this video: ruclips.net/video/mBcBoGhFndY/видео.html --> "I built an image search engine". You know what? You were right. Did he really build an image search engine? Nah, he didn't. But at the same time he did, without purely understanding the billions and zibillions of bits of shit under the hood. You were right...

  • @RandomGeometryDashStuff
    @RandomGeometryDashStuff 8 months ago

    24:59 Shell 76.5%

  • @Alguem387
    @Alguem387 8 months ago +1

    Try Odin

  • @MemeConnoisseur
    @MemeConnoisseur 8 months ago

    Woah

  • @terakiei
    @terakiei 7 months ago

    Yeah, but it takes much more time and processing

  • @v2ike6udik
    @v2ike6udik 8 months ago

    1:21:54 !

  • @thatstupiddoll
    @thatstupiddoll 8 months ago

    AGI

  • @tommyshelby6277
    @tommyshelby6277 8 months ago +2

    IF THERE IS A DISCORD SERVER, HOW TF CAN'T I FIND IT ANYWHERE!!!

    • @ade5324
      @ade5324 8 months ago +1

      www.twitch.tv/tsoding/about

    • @H3llsHero
      @H3llsHero 8 months ago +4

      You ok bro? You can find an invite link on his twitch page

    • @dieSpinnt
      @dieSpinnt 8 months ago +3

      @@H3llsHero Do you think this is a good idea with all that shouting?
      Hehehehe:)

  • @avi7278
    @avi7278 8 months ago

    No, no it isn't....dumb question.