Making a Language Model From 0 (You Can Run It Too!)

  • Published: 16 Dec 2024

Comments • 142

  • @8AAFFF
    @8AAFFF  5 дней назад +10

    Go to piavpn.com/8AAFFF to get 83% off Private
    Internet Access with 4 months free (and support me :D)!
    thanks for watching!
    also discord server now: discord.gg/MC4wTeb4

    • @thefcraft8763
      @thefcraft8763 3 дня назад

      It's nice, but I think your architecture has some flaws. Suppose you have a text like "This is a ...", and there are several possible next-word predictions, like "dog", "cow", "mountain". "Dog" and "cow" are nearby in the vocab embedding space, but "mountain" might be far apart, and if you train your model on such cases it will average out the result and might give some nonsense or hallucinate (basically it might output the midpoint vector of cow, dog and mountain).

    • @Myexpectationsarerealistic
      @Myexpectationsarerealistic 2 дня назад

      I did similar. Not touching Rust.

    • @AndroidFerret
      @AndroidFerret 2 дня назад

      The production and information value of this video is insane. How long did you edit this?? Fantastic

  • @scoffpickle9655
    @scoffpickle9655 5 дней назад +72

    The reason the 160k-batch REAN was worse on the graphics-card prompt is that the network is overfitting. I'd recommend holding out a test set of prompts and choosing the model that performs best on that test set instead of just picking the one trained on the most batches (see the sketch below).
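    A minimal sketch of that checkpoint-selection idea, assuming a PyTorch-style model that outputs one embedding per prompt and a held-out loader of (prompt, target-embedding) pairs; `model`, `val_loader`, and the checkpoint filenames are hypothetical:
    import torch
    import torch.nn.functional as F

    def validation_loss(model, val_loader, device="cuda"):
        # average cosine distance between predicted and target embeddings on held-out prompts
        model.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for prompts, targets in val_loader:  # assumed DataLoader yielding (inputs, target embeddings)
                preds = model(prompts.to(device))
                cos = F.cosine_similarity(preds, targets.to(device), dim=-1)
                total += (1.0 - cos).sum().item()
                count += cos.numel()
        return total / count

    # pick the checkpoint that generalizes best instead of the most-trained one
    best_loss, best_ckpt = float("inf"), None
    for ckpt_path in ["rean_40k.pt", "rean_80k.pt", "rean_160k.pt"]:  # hypothetical checkpoint files
        model.load_state_dict(torch.load(ckpt_path))
        loss = validation_loss(model, val_loader)
        if loss < best_loss:
            best_loss, best_ckpt = loss, ckpt_path
    print("best checkpoint:", best_ckpt, "val loss:", best_loss)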

    • @8AAFFF
      @8AAFFF  5 дней назад +17

      ur right its most likely overfitted, the weird thing is that most other test prompts i was running were generally getting better with more batches so idk

    • @scoffpickle9655
      @scoffpickle9655 5 дней назад +5

      @8AAFFF It sounds like a data problem, then: too little or insufficiently general data would lead to worse curve fitting. I suppose there wasn't much data about graphics cards, so it freaked tf out and kept spamming "graphics"

    • @8AAFFF
      @8AAFFF  5 дней назад +6

      maybe, also possible that the graphics cards knowledge just got overshadowed because it was in the beginning of the dataset. i did some more tests today and basically it just seems to have some knowledge points that it tries sticking to no matter what the prompt is

    • @PaulanerStudios
      @PaulanerStudios 3 дня назад +3

      @8AAFFF Are you using any sort of speculative decoding or temperature scaling? That wasn't mentioned in the video and does make quite a difference.

    • @NoSubsWithContent
      @NoSubsWithContent 3 дня назад

      ​@@8AAFFF what if you used an existing super efficient model like the granite MoE with 400M active parameters to comb through a different dataset like fineweb EDU and produce a list of knowledge it could access during training via RAG or something?
      if you figure out a way to do that I feel like it'd get much better performance because it doesn't have to spend so much of its weights on memorizing stuff, instead it can learn actual patterns, intelligence even?

  • @mrpro7737
    @mrpro7737 3 дня назад +20

    The editing skills in this video are harder than that new architecture 😂

  • @gilbertenevoldsen4469
    @gilbertenevoldsen4469 3 дня назад +17

    This video is super well made and informative! But I'm a bit curious why you chose the architecture that you did. The reason this way of outputting words isn't typically used in large language models is that it's useful for the model to have multiple high-probability candidates for the next word that aren't necessarily close to each other in vector space.
    For example, say a sentence comes up in training like "My favorite hobby is...". There are a lot of possibilities for the next word, so the model would be optimized to output the average vector of those possible answers, which likely isn't a sensible continuation of the sentence (see the sketch below).
    I would love to see what you could make by doing it the traditional way, and how good a model you can train as a single person.
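    A tiny illustration of that averaging effect, with made-up 3-dimensional "embeddings": when one prompt has several valid continuations and the model is trained with a mean-squared-error objective, the loss-minimizing prediction is the mean of the targets, which may not sit near any real word.
    import numpy as np

    # made-up 3-dim embeddings for possible continuations of "My favorite hobby is ..."
    targets = np.array([
        [0.9, 0.1, 0.0],   # "painting"
        [0.8, 0.2, 0.1],   # "drawing"
        [0.0, 0.1, 0.9],   # "skydiving"
    ])

    # for a fixed input, the prediction p minimizing sum_i ||p - t_i||^2 is the mean of the targets
    p = targets.mean(axis=0)
    print(p)                                    # lands between the clusters
    print(np.linalg.norm(targets - p, axis=1))  # distance from p to each real continuation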

    • @simonwillover4175
      @simonwillover4175 3 дня назад +1

      or um maybe reward it for simply choosing any word close to any option rather than the average?

    • @WoolyCow
      @WoolyCow 2 дня назад

      @@simonwillover4175 could just be weighted by distance as well, or even add in some error on purpose to get some more divergent responses

  • @salad_txt
    @salad_txt 5 дней назад +30

    You are so underrated it is actually insane, keep it up dude. Great stuff.

  • @zaj007
    @zaj007 5 дней назад +27

    18:25 Bro there has gyat to be a better way! I'm crying 😭😭 wtf is that timeline 💀💀

    • @8AAFFF
      @8AAFFF  5 дней назад +21

      bro did the tower of babel editing technique ahh

  • @IceMetalPunk
    @IceMetalPunk 2 дня назад +13

    I mean... your network only predicts the most likely next token, whereas GPT models predict the probability of all tokens and sample from there (they don't just choose the highest-probability token); and your tokens are just entire words from the corpus. So it's like a GPT model that (a) always has a temperature of 0, and (b) can't understand anything that's not a word present in the corpus. I think we can see from that why GPT and its similar models didn't go this route 😅
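    For reference, a minimal sketch of the GPT-style decoding step described above: the model gives logits over the whole vocabulary, they're scaled by a temperature, and the next token is sampled from the resulting distribution rather than taken by argmax. The `logits` tensor here is just a random placeholder:
    import torch

    logits = torch.randn(340_000)                 # placeholder scores over every token in the vocab

    def sample_next_token(logits, temperature=0.8):
        if temperature == 0:                      # temperature 0 collapses to plain argmax
            return int(torch.argmax(logits))
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))

    print(sample_next_token(logits))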

    • @jairjuliocc
      @jairjuliocc 2 дня назад +2

      With respect to temperature, it's possible to find the k most similar neighboring vectors and assign each a probability based on its similarity score. That way you can mimic temperature.
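      A minimal sketch of that idea, assuming a trained gensim KeyedVectors model `w2v` and a predicted embedding `predicted_vec` (both placeholders); gensim's similar_by_vector returns (word, cosine similarity) pairs:
      import numpy as np

      def sample_word_from_vector(w2v, predicted_vec, k=10, temperature=0.5):
          # turn the k nearest vocabulary words into a softmax distribution over their similarity scores
          neighbors = w2v.similar_by_vector(predicted_vec, topn=k)
          words = [w for w, _ in neighbors]
          sims = np.array([s for _, s in neighbors])
          probs = np.exp(sims / temperature)
          probs /= probs.sum()
          return np.random.choice(words, p=probs)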

    • @IceMetalPunk
      @IceMetalPunk 2 дня назад +7

      @jairjuliocc Yet that would result in a very different probability space. It'll be "the probability of words that are most related to the most likely next word" instead of "the probability of words that are likely to be the next word".

    • @pacoalsal
      @pacoalsal 22 часа назад +1

      @jairjuliocc worth remembering that these embeddings can’t differentiate between senses of the same word. So “fly” the insect and “fly” the verb share the same point in the embedding space, e.g. somewhere in between an “animals” cluster and a “forms of locomotion” cluster. Sampling as you say, you’d get a mix of words closely related in one sense or another but you can’t distinguish which sense is relevant to the current context.

    • @IceMetalPunk
      @IceMetalPunk 17 часов назад

      @@pacoalsal Well, no, not quite. That's what the attention heads are for: they push and pull the embedding away from its initial position -- for instance, that midpoint between "locomotion" and "animals" -- based on the context, therefore separating the resulting vectors for each meaning. So the resulting vector would definitely encode the context-specific meaning. The problem here isn't that it fails to disambiguate context; it's just that the probability space would be based on the one most likely output rather than all the different likely outputs.

  • @jondoe6608
    @jondoe6608 3 дня назад +11

    Out of curiosity, are you aware of the RWKV architecture? It's an LLM based on a type of RNN; its main advantage is removing the hard context limit, making longer contexts possible on weaker devices, since it uses a constant amount of memory. Your idea of using embeddings as the input and output is really cool, especially since it further reduces VRAM requirements.

  • @slowpoke101
    @slowpoke101 5 дней назад +6

    Great video, these longer videos are always nice to see. Thank you for open-sourcing the code.

  • @toofardoug2188
    @toofardoug2188 3 дня назад +1

    This is so high quality it's nuts! The editing is excellent. The explanations are crisp. The relative context to the SOTA for each design choice is excellent, and the origin-and-evolution framing of concepts is extremely valuable, such as the beginning of tokenization and how it becomes embeddings.

  • @hasirciogli
    @hasirciogli День назад

    MAN YOUR ANIMATIONS SO PURFECT

  • @jaythecoderx4623
    @jaythecoderx4623 3 дня назад +5

    This should have millions of views what the hell this is epic, very well edited too

  • @brams06
    @brams06 3 дня назад +3

    I was shocked to see that this video has so few views. I feel so lucky to have come across this gem.

  • @lionlight9514
    @lionlight9514 5 дней назад +3

    This is so cool man! Please, keep going.

  • @v0idbyt3
    @v0idbyt3 3 дня назад +1

    damn you made davinci resolve go kaboom at the end
    btw cool video! i hope this architecture eventually gets a remake or succeeds, because this could be a way better alternative to GPT architecture.

  • @OscarTheStrategist
    @OscarTheStrategist День назад

    Excellent video. Thank you for explaining everything in such detail, despite the setbacks. Would love to see more of this architecture being refined if you deem it worthy of continued development.

  • @ClayShoaf
    @ClayShoaf 2 дня назад

    Switching the end from a token list to a vector output is pretty genius. This will probably make some non-natural language things (like coding and markdown formatting) more spotty, but for keeping the final layer small, it seems like it's worth a shot.

  • @aamindehkordi
    @aamindehkordi 3 дня назад +1

    Insane time spent and crazy W video. don't worry about compression or pacing this is gas and should blow up soon

  • @WoolyCow
    @WoolyCow 2 дня назад +1

    was 784 a reference to mnist? loved the vid, well explained and beautifully edited :D dropped a sub

    • @8AAFFF
      @8AAFFF  2 дня назад

      Nice XD someone saw it
      Thx for the sub :)

  • @sandded7962
    @sandded7962 3 дня назад +2

    Hi , Can you elaborate on the 12:43 part where the circular text says the following:
    “I’m edging, I’m edging , I’m edging , I’m edging”

  • @4.0.4
    @4.0.4 3 часа назад

    The problem with efficiency optimizations like RWKV, Mamba, BitNet etc is that the ones making huge models are reluctant (understandably) to train a 7-70B on them.

  • @Pratikology
    @Pratikology 3 дня назад +1

    Wtf, why isn’t This at a million views? keep it up bro what a fantastic blend of knowledge and creativity 🔥

  • @PseudoProphet
    @PseudoProphet 3 дня назад +2

    Wow, it could work, 😮😮
    You just need a better and more complete dataset.
    You should have also tried asking it questions that you knew were in its training data, to see its performance.

  • @Aragamin
    @Aragamin 3 дня назад

    Dude, this is wonderful work.
    Glad to see that enthusiasm sometimes turns not just into a hobby but into serious development :)
    Keep going!
    P.S.: the video design is really cool, top marks.

  • @kotway2
    @kotway2 5 дней назад +1

    Very cool video and project man!

  • @Quozul
    @Quozul 2 дня назад

    This is an amazing project! And I love the graphics and visuals of the video too!

  • @TeamDman
    @TeamDman 3 дня назад +1

    Your animations are awesome :o

  • @TheLucaz37
    @TheLucaz37 День назад

    This is amazing... ive always been fascinated by how AIs work

  • @mrcavas
    @mrcavas 2 дня назад

    Such a fire video! I love the style

  • @A_Me_Amy
    @A_Me_Amy 2 дня назад

    I like your ad, or rather your general artistic style. Also, for the model, I think the idea of the vocabulary space makes sense. There is research that came out today from Meta about LCMs as opposed to LLMs that could pair with this fairly well; it takes small sentences with limited tokens, and I could imagine translating any sentence into the 768 vocab space, or something like this... I'm not technically aware enough to contribute more than to say this. Perhaps a word2word2vec2word2word process, so that it can still speak and understand the full vocab list but processes the core essence in the smaller architecture. I do think figuring this out is the sort of future, or that there is a lot possible... Oh, and the same person who talked about this paper today also mentioned other research from Princeton about slow and shorter training leading to more in-context learning (ICL), and that at some point when training weights a model loses the ability to do this... But yeah, the most capable reasoning model at the lowest possible size is the new version of the old halving-in-size, doubling-in-power trend (Moore's law): AI gets twice as smart every 2 years and half as large. I am quite sure this will be the trend, to be honest.

    • @w花b
      @w花b 2 дня назад

      That's nice to have something that's not manim for once

  • @devbuffer0112
    @devbuffer0112 3 дня назад

    Creel style visuals, cool bgm and hot topics related to CS. You're gonna become huge

  • @rkpstam
    @rkpstam 2 дня назад +2

    Good work, Oleg

  • @that_guy1211
    @that_guy1211 День назад

    I remember watching a channel explain how ChatGPT became wicked because somebody multiplied a variable by -1. In that video they described a syntax teacher and a reason teacher: the AI gets points for using "good" words (i.e. being censored and not using "bad" words), and the syntax teacher gives a punishment if the grammar is off or if it repeats words too much. That's how ChatGPT got so good. Maybe you could implement something similar, but with the reason teacher being something other than a censor filter? IDK

  • @vassa-vn7xz
    @vassa-vn7xz 3 дня назад +3

    How is this even working? Why does it not collapse to a single embedding for all words?

  • @bedahtpro
    @bedahtpro 3 дня назад

    Great quality video man!

  • @ainet8415
    @ainet8415 6 часов назад

    Instead of training a rean model, try to take a model like llama 3.2 1b and add your idea (rean) as a layer and train this layer. Basically, fine tune it and use a high quality dataset

  • @AllExistence
    @AllExistence 4 дня назад +7

    You seem to have gone a weird route with training. Normally, networks are first trained on plain text to learn normal language, then fine-tuned with "human/assistant" data to actually answer questions instead of talking to themselves.

    • @8AAFFF
      @8AAFFF  3 дня назад +2

      yeah thats true
      its just that the higher quality human/assistant dataset was so big that i didnt need to first train on raw text

  • @driss1227
    @driss1227 2 дня назад

    The graphics are so great. Curious what you used to produce this video? Looks like expertly used manim.

  • @lewisbeith
    @lewisbeith 2 дня назад

    Very good editing!

  • @juliansantos1900
    @juliansantos1900 2 дня назад

    Crazy work, not to mention crazier animation. I know the concepts behind these AIs but don't have the extensive knowledge to write one on my own without LM libs 😆

  • @toofardoug2188
    @toofardoug2188 3 дня назад

    I wonder if there's a better sampling mechanism when you're using the word2vec model? If you watch the GPT-from-scratch video from Andrej Karpathy, you'll see that he doesn't just take the highest-predicted token; he sometimes samples from among the top few values.

  • @DallasMcMillan
    @DallasMcMillan 3 дня назад

    Incredible project and so many insights into ai in a fun and digestible way!
    Thanks !!!! ❤❤❤

  • @foreignconta
    @foreignconta 2 дня назад

    It's just a transformer which uses pre learnt embeddings from word2vec. The attention is still quadratic.

  • @alisyoung2741
    @alisyoung2741 3 дня назад

    I have been working on one as well but am currently running into issues! So exciting!

    • @8AAFFF
      @8AAFFF  3 дня назад +1

      yooo gl man
      are you doing like a custom architecture?

    • @alisyoung2741
      @alisyoung2741 2 дня назад

      Yes! I customized the standard U-Net architecture by rebuilding the bridge to process input using a Bi-LSTM, a memory system, and an attention mechanism before re-upsampling.

    • @alisyoung2741
      @alisyoung2741 2 дня назад

      Your video actually inspired me to try and work on a kind of single token tokenizer that will produce a single unique token for any given input of a certain size, hopefully really large.

  • @xorcise
    @xorcise 2 дня назад

    8:20 ah yes
    good birchrunville Cassiano

  • @MrNootka
    @MrNootka 3 дня назад

    Hello! Nice video,
    In the "Final word2vec Results" section, at 11:14 and 11:28, you had a space inside the argument to similar_by_word in one and not in the other... I wonder if the space changes the results

    • @v0idbyt3
      @v0idbyt3 3 дня назад

      a space in the code would make the compiler or interpreter think that its something else, so it would make an error (which is a difference)

    • @8AAFFF
      @8AAFFF  3 дня назад

      thanks :)
      also well done for noticing, the space does change the results because its a slightly different token in the word2vec (but they are really close to each other). i dont know why its there its probably by accident but if ur curious this is the output for "2021" with no space:
      [('2020', 0.7180283069610596),
      ('2022', 0.6416824460029602),
      ('2021 ', 0.6249533295631409),
      ('2019', 0.6035624742507935),
      ('October ', 0.5840676426887512),
      ('october ', 0.5773099660873413),
      ('January ', 0.5399696230888367),
      ('2020 ', 0.5389090776443481),
      ('2018', 0.5194795727729797),
      ('July ', 0.5182425379753113)]

    • @MrNootka
      @MrNootka 3 дня назад

      @@8AAFFF Thanks for the reply, I mainly asked because of your "tokenization" approach. Anyway, I believe what you have cooked up here has some serious potential! When I found your channel yesterday I binge-watched most of your videos, and this one and the dimensions simulator are my top favorites 😆. I am working on something similar, keep up the good work!

  • @Kwenen
    @Kwenen 2 дня назад

    4:00 It's intuitive to do it this way; I'm surprised that the big companies still choose token-based autoregressive output instead.

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 2 дня назад

      Because you need the model to be able to produce different outputs. For example if you have "Cat sits on a tree" and "Cat sits on a sofa" in the training data, this trained model will always predict (tree + sofa) / 2 when given "Cat sits on a" as a prompt, and there is no remedy for this issue

    • @Kwenen
      @Kwenen 2 дня назад

      @@ЕгорКолов-ч5с
      I don't think it matters much, because what we usually want is *an* output, and we don't care about the exact value of a specific token (for example, 0~1 for emotion recognition). Current models also run into situations where two tokens are both at 0.5, and that gets handled by throwing a weighted die when an output is needed.
      The vector in the video, (tree + sofa) / 2, also shows that the sentence can be followed by two different words.
      So I think the model can still learn the usage of language well: when calculating the similarity with the output, if both are at 0.5, just throw a die and everything is fine.
      I guess that in the video the maximum value is always chosen, whereas there should be a chance of outputting the other words when the probabilities are roughly half and half. It's like using a Markov chain, but letting the maximum value determine the transition.
      :)

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 2 дня назад

      @@Kwenen I don't really understand what you are referring to. I'm just relaying a simple fact: the output of this model is a 784-value embedding that corresponds to a vector in word2vec space, which is not as powerful as a probability distribution over tokens. Generating the next word is just taking the closest word in word2vec space to the generated embedding. Because of the way word2vec works, 1) embeddings of contextually close words will be closer in word2vec space, so even if you randomly choose 1 out of the 10 closest words to the embedding you will get synonyms at best and gibberish at worst, and 2) because word2vec doesn't care about word order, a model trying to predict the next token this way will tend to produce chains of the same words over and over. The main reason nobody uses this method is that it fundamentally doesn't work.

    • @Kwenen
      @Kwenen 2 дня назад

      @@ЕгорКолов-ч5с
      If the correction word2vec brings to the model isn't strong enough and training is too slow, that would really make me give up on this path.
      Maybe what I said was a bit off-focus. I mean, we only need the language model's output to carry the right meaning, even if it lands on a synonym. So I thought that outputting a vector (a fuzzy meaning) and then selecting from similar words should be enough to support communication, and that such a model might put a lighter load on the graphics card.
      Of course, if the model gives meaningless vectors, then choosing among their neighbors will really only produce gibberish, and then I can only say that's very sad.
      And I naively thought that positional encoding of the input plus self-attention would be enough to make the output position-sensitive.
      So the idea of doing next-token prediction on vectors doesn't really work? It just feels intuitive to me: it's easy to imagine a sequence of arrows in hyperspace, pointing to different words in turn.
      As you point out, this seems to be inefficient at the learning level. After all, words are too discrete; even if the output layer gets easier, it doesn't mean the overall problem has become easier, right?

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 2 дня назад

      @@Kwenen Maybe this could work, but there is no such thing as a free lunch in ML. For this approach to be viable, you would probably need to operate in an embedding space as large as a common vocabulary (50,000+ tokens for GPT-3) instead of 784 dimensions; then there would probably be enough redundancy to make it competitive with next-token prediction, but the approach would lose its memory efficiency (and training such a big w2v model also becomes too hard).

  • @yeyito7040
    @yeyito7040 День назад

    2:51 MNIST MENTIONED!!!!!!!!

  • @VioletGiraffe
    @VioletGiraffe 5 дней назад

    Even your animations are cool, how did you make them? Or do you have another neural net to do that for you? :)

    • @8AAFFF
      @8AAFFF  5 дней назад +1

      thanks :), basically with just images / clips in davinci resolve.
      I put the almost final timeline at the end 18:26

  • @takeraparterer
    @takeraparterer 5 дней назад +4

    ruclips.net/video/_B2RImihdUI/видео.html that's not correct. gpt models predict every "next word" from a sequence at the same time

    • @8AAFFF
      @8AAFFF  5 дней назад +1

      yeah 100% correct
      i just lied about it in the beginning for the explanation to be easier, but i do later correct myself
      well done for noticing :)

  • @kamertonaudiophileplayer847
    @kamertonaudiophileplayer847 2 дня назад

    You need to patent your approach. It's very interesting, although I use a slightly modified one.

  • @KristoferPettersson
    @KristoferPettersson 3 дня назад

    Aren't you using any sampling when picking the next token?

    • @ЕгорКолов-ч5с
      @ЕгорКолов-ч5с 2 дня назад

      His architecture doesn't output a probability distribution, where is he supposed to sample from?

  • @yyhhttcccyyhhttccc6694
    @yyhhttcccyyhhttccc6694 21 час назад

    what if you just made it output letters and looped it a bunch of times to spell words?

  • @Moshi74533
    @Moshi74533 3 дня назад

    sick bro, absolutely sick

  • @hypercoder-gaming
    @hypercoder-gaming 3 дня назад

    With more training, this could definitely be very powerful.

  • @60pluscrazy
    @60pluscrazy 3 дня назад

    Amazing..how did you animate 👌🎉🎉🎉

    • @8AAFFF
      @8AAFFF  3 дня назад

      thanks :) all the animations are pretty much fully made up of davinci resolve images and clips and stuff
      i put the timeline at 18:26 if you want to see

  • @AverusMuto
    @AverusMuto 2 дня назад

    This is very useful. Thank you.

  • @FoxGoesBrrr
    @FoxGoesBrrr 16 часов назад

    content quality so high i think i need to pay for ts 😭🙏

  • @Leb369
    @Leb369 5 дней назад

    Very good video, the only flaw is the sound quality.

  • @rikhendrix261
    @rikhendrix261 3 дня назад

    3:11 i thought chat gpt 3 had a 12288 embedding size. You are saying as high as 4096.

    • @8AAFFF
      @8AAFFF  3 дня назад +1

      tbh i asked chatgpt whats its embedding dim XD so idk if its correct
      i looked it up again and ur right the biggest version of gpt3 is 12k embedding dim, and openai is really secretive about gpt4 so im probably completly wrong on that. thanks for pointing out :)

    • @rikhendrix261
      @rikhendrix261 3 дня назад

      ​@@8AAFFF Its okay I tought I might have been wrong. At 4:57 you are saying that you are going to compare the vector of the word it wants to predict to all the words in your database with its vector representation (RAG with cosine similarity). But words like Mole for example can be an animal, soemthing on your cheeck or arm, or it can be something having to do with the total number of molecules 6.02 x 10 ^23. Does this mean that your word database has these words written down multiple times?
      And at some point you said you had 340.000 words in your database?? instead of the 40.000 from openai?
      Im also interested to know what the most important thing you learned during this time was? I have only been learning about AI recently so im all ears.

    • @8AAFFF
      @8AAFFF  3 дня назад +1

      ah i get what ur saying. basically the word2vec model has a bunch of tokens in its vocabulary. every token can appear only once, but usually there are some accidental half duplicates like "mole", "mole ", " mole" etc...
      usually the duplicates have pretty much the same meaning as the "true" word just because they appear in exactly the same contexts.
      because the words are not written down multiple times there are some misplaced words that have meanings in two different areas so they are just awkwardly put in the middle of both "meaning areas".
      this doesnt really hurt the neural net because im relying on it understanding that even if a word's vector doesnt 100% match the situation, its still good because of the context of whatever its typing.
      as for the vocab size being 340k instead of something smaller like 40k its due to me using a tokenizer that splits the text into bigger tokens, usually the size of full words, instead of smaller half words like what openai uses.
      so for me "hello, world!" would be split into something like: "hello" "," " world" "!"
      and for openai same thing would be split into something like: "hel" "lo" ", " "wo" "rld" "!"
      so openai needs less of these tokens to fully encode english
      and probably the bigggest thing i learned with this was how to properly use tensorboard. its a library that lets you track stuff like the loss graphs in real time, compare how two different models trained and more stuff like that.
      the best way to learn about ai stuff is to do big projects like this because you actually encounter problems in the wild, then solve them, instead of just learning about solutions to problems you never had

    • @rikhendrix261
      @rikhendrix261 3 дня назад

      ​@@8AAFFF Wow, very interesting! Yes i now understand why your token count was higher. This would also mean that for you a simple "normal" english sentence would consist of less total tokens than openai which would save on the compute.
      Do you by chance follow "Discover AI" he has some very interesting videos on new Test-Time compute and Test-Time training which according to the literature saves a lot of time and has great results, but my level of AI understanding isn't at that point yet. Maybe that you would be able to combine that tactic with your own?
      I'll follow you and see what more you will post.

  • @mtlr3803
    @mtlr3803 3 дня назад

    crazy video!!

  • @TimeLordRaps
    @TimeLordRaps 3 дня назад

    bros cracked. thank you fellow.

  • @user-qw1rx1dq6n
    @user-qw1rx1dq6n 3 дня назад

    you should probably use a cosine similarity loss
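    A minimal sketch of what that could look like in PyTorch, with `pred` and `target` as placeholders for the network's output embedding and the word2vec target embedding:
    import torch
    import torch.nn.functional as F

    pred = torch.randn(32, 784, requires_grad=True)   # batch of predicted embeddings (placeholder)
    target = torch.randn(32, 784)                     # batch of word2vec target embeddings (placeholder)

    # 1 - cosine similarity: 0 when the vectors point the same way, 2 when they point opposite ways
    loss = (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()
    loss.backward()
    print(loss.item())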

  • @tevincoburn1469
    @tevincoburn1469 2 дня назад

    Dude. Great video but like... Bump up your audio by like 4db. You're so quiet I have to max out my speakers.

    • @8AAFFF
      @8AAFFF  2 дня назад

      Thanks! Alot of people said that, was the general audio too quiet or just the voiceover?

  • @CC1.unposted
    @CC1.unposted День назад

    The reason GPT and other transformers output a distribution over every token is that it avoids weird outputs (like the blurring effect in image generation when a model is trained on multiple similar outputs for the same input).
    You could just use renderable ASCII. Why use word2vec? You're using word2vec to work around this problem, yet you still end up with a worse transformer: the model still needs to memorize every word vector, because now it has to return a new vector. The standard setup gives the model extra capacity precisely by letting it not learn the vector embeddings exactly, only understand them as much as it needs to.
    You're just building a worse transformer model.
    The current challenge for AI is generalization, for which you need some kind of time-dependent architecture, not this universal vector-to-vector approximator. Human brains do this; meta-learning is also this, but it's too computationally intensive, since you don't want to store that much user-specific data, it would be infeasible.
    You should just abandon this project! You could instead try making some code that mutates or trains a JS string so users could write test cases and get a function back. I tried this too using basic mutation in Node.js, but it was painfully slow because I didn't make the mutation regressive (changing keywords or syntax-friendly characters instead of random chars).

  • @ВладЧорний-ч4и
    @ВладЧорний-ч4и 3 дня назад

    This is fire

  • @_XoR_
    @_XoR_ День назад

    Add Byte-Pair Encoding to it :P
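    In case it's useful, a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library; corpus.txt and the vocab size are placeholders:
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=40_000, special_tokens=["[UNK]", "[PAD]"])
    tokenizer.train(["corpus.txt"], trainer)          # placeholder corpus file

    print(tokenizer.encode("hello, world!").tokens)   # subword pieces instead of whole-word tokens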

  • @yyhhttcccyyhhttccc6694
    @yyhhttcccyyhhttccc6694 21 час назад

    tutorial?

  • @fortaber
    @fortaber 5 дней назад +2

    The editing of the video is just amazing!!

  • @lobiqpidol818
    @lobiqpidol818 3 дня назад

    🤓 well AksUaLly each embedding vector takes up space on the device. So while you save space by vector quantizing the output embeddings the vocabulary is still limited by GPU space. Also you lose the ability to do some calculations on the output like temperature. Good video

    • @yoavco99
      @yoavco99 3 дня назад

      You can probably just have it not be on the gpu, and just check the closest token on like the CPU or whatever.
      Also can't you just easily recreate temperature with this?

    • @8AAFFF
      @8AAFFF  3 дня назад +1

      yeah thats exactly what im doing
      the word2vec weights are stored in regular RAM and are only used to translate tokens back and forth.
      so the only stuff on the GPU VRAM is the network and the already translated vectors.
      its true that i dont really have regular temperature like other GPT models but i can sort of recreate it by either adding noise to the input or selecting the 2nd / 3rd closest word to the network output instead of the 1st :)

    • @user-qw1rx1dq6n
      @user-qw1rx1dq6n 3 дня назад

      You can absolutely recreate temperature if you just train the embedding model differently

    • @lobiqpidol818
      @lobiqpidol818 3 дня назад

      @@8AAFFF What I've been thinking about: what if you use very small embedding vectors, only 2 dims for example, to represent words, and then immediately expand them to more dimensions with linear layers inside the model? Does the model see this as the same thing or as something completely different?

  • @Vine_Zx
    @Vine_Zx 3 дня назад

    Remember me when you make it big!

  • @callowaysutton
    @callowaysutton 2 дня назад

    I think you're running into issues at the front of the pipeline. When "translating" from the vocabulary to the tokens, try just blacklisting tokens already mentioned in the past 3 tokens up to the point you're translating at (see the sketch below).
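    A minimal sketch of that blacklist during decoding, assuming a `model` that returns a predicted embedding for the next word and a gensim-style `w2v.similar_by_vector` that returns ranked (word, similarity) candidates; all names are placeholders:
    def decode_with_repeat_blacklist(model, w2v, prompt_tokens, steps=50, window=3):
        tokens = list(prompt_tokens)
        for _ in range(steps):
            vec = model(tokens)                    # predicted embedding for the next word (placeholder call)
            blacklist = set(tokens[-window:])      # words used in the last `window` positions
            for word, _sim in w2v.similar_by_vector(vec, topn=10):
                if word not in blacklist:          # take the closest word that isn't a recent repeat
                    tokens.append(word)
                    break
            else:
                # everything nearby was a recent repeat; fall back to the single closest word
                tokens.append(w2v.similar_by_vector(vec, topn=1)[0][0])
        return tokens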

    • @8AAFFF
      @8AAFFF  2 дня назад +1

      To be honest i didnt think of that
      This could also work as some sort of temperature like GPTs have

    • @callowaysutton
      @callowaysutton 2 дня назад

      @@8AAFFF Temperature would be more equivalent to a left-tailed random distribution over the list of tokens for the given category; this would just be a repeat penalty.

  • @TheTruthOfAI
    @TheTruthOfAI 3 дня назад

    Hahaha funny guy.. it's like reading a long gpt4 hallucination

  • @averesenso
    @averesenso 5 дней назад +3

    Your voice is quiet on my speakers

  • @sandded7962
    @sandded7962 3 дня назад

    That’s crazyyyyy

    • @8AAFFF
      @8AAFFF  3 дня назад

      cdn.discordapp.com/attachments/888851399490797620/1242133470235594853/attachment.gif?ex=675e49b1&is=675cf831&hm=dc928ebc5d6bb49010b1d0ce10dd3a420fbc86c69d8aeed38906f4dc3a526d0a&

  • @MommysGoodPuppy
    @MommysGoodPuppy 3 дня назад

    holy GOATED

  • @epicman9105
    @epicman9105 День назад

    ur sick dude

  • @idf3da
    @idf3da 5 дней назад

    top!

  • @absentmindhere
    @absentmindhere 2 дня назад

    nice

  • @RasmusSchultz
    @RasmusSchultz 3 дня назад

    interesting idea, except... it doesn't seem to work? 😅

  • @Myexpectationsarerealistic
    @Myexpectationsarerealistic 2 дня назад

    I did the same thing.

  • @raihanhossain3423
    @raihanhossain3423 3 дня назад

    Is that your real credit card number? 🙃

    • @8AAFFF
      @8AAFFF  3 дня назад

      one way to find out

  • @piglava
    @piglava День назад

    I am writing this comment to comment a comment comment comment comment comment comment comment, and comment comment, comment...

  • @that_guy1211
    @that_guy1211 3 дня назад

    Bro, not trynna be mean or anything, but your AI looks dumb as hell.... Keep working on it mah dude, would love to see this architecture get better with an actual decent LLM on it bruv!

  • @Tenraiden
    @Tenraiden 3 дня назад +2

    Speak louder!!