How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

  • Published: 29 Nov 2024

Comments • 113

  • @cesarsantos854
    @cesarsantos854 1 year ago +37

    This is top-notch content among ML and AI on RUclips, showing us how it really works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Thank you, I’m glad it’s helpful!

  • @AemonAlgiz
    @AemonAlgiz  1 year ago +4

    Comedy dataset update! I have found an approach I think I like for it, though I didn't have time to complete it for this video. So, I will also cover that in today's live stream!

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +9

    Okay, so after a cup of coffee and watching a couple of times, WOW. You helped me so much, thank you. This has been driving me nuts and you make it look so easy to fix. I wish I were as smart as you. Thank you again. 🎉

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      You always ask the best questions, so keep them coming :)

  • @pelaus01
    @pelaus01 1 year ago +10

    Amazing work... this channel is pure gold, exactly the right amount of concepts, everything is spot on. Nothing beats teaching by experience like you do.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it was helpful and thank you for the comment :)!

  • @leont.17
    @leont.17 1 year ago +1

    I very much appreciate that you always have this way of listing the most important bullet points at the beginning.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it’s helpful! I figured it would be nice to give a quick overview

  • @fabsync
    @fabsync 6 months ago +1

    Finally some freaking great tutorial! Practical, straight to the point and it works!!

  • @smellslikeupdog80
    @smellslikeupdog80 1 year ago +5

    I knew I subscribed here for good reason. This is consistently extremely high-quality information -- not the regurgitated stuff. This is super educational and has immensely improved my understanding.
    Please keep going bud, this is great.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! It’s greatly appreciated

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 10 months ago +1

    Dude, seriously, your content is so clear and easy to follow. Keep it up!

  • @RAG3Network
    @RAG3Network 6 months ago

    You’re literally a genius! I appreciate you taking the time to share the knowledge with us! Exactly what I was looking for… how to create a dataset and in such a well put together video. Thank you

  • @boogfromopenseason
    @boogfromopenseason 7 months ago +1

    I would pay a lot of money for this information, thank you.

  • @jonmichaelgalindo
    @jonmichaelgalindo 1 year ago +3

    The appeal has been processed by the approval AI... And it passed! The prescription will now be covered. 😊
    (Thank you for the video! I think datasets and installing dependencies are ML's greatest pain points at the moment.)

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :)

  • @rosenangelow6082
    @rosenangelow6082 1 year ago +2

    Great explanation with the right level of details and depth. Good stuff. Thanks!

  • @Hypersniper05
    @Hypersniper05 1 year ago +2

    That's awesome! And you can even save the new appeal to create more data!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Indeed! It becomes a very nice self-reinforcing model; this is why I really like the fine-tuning and embedding approach.

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +3

    Wow, how do you make everything look so easy? Nice, thanks. So East Coast, man, you're an early bird.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I live in MST, haha. I just wake up very early :)

  • @kenfink9997
    @kenfink9997 1 year ago +2

    How would building a training set on a codebase look? Is there a good example of automating generation of a Q&A training set based on code? How do you chunk it to fit in the context window - break it up by functions and classes? Where would extraneous stuff go, like requirements, imports, etc.? Thanks for the great content!
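
    Building on the question above, a minimal sketch of chunking a Python codebase by function and class with the standard-library ast module; the file name and the idea of gathering imports and other module-level statements into a shared "preamble" chunk are illustrative assumptions, not something shown in the video:

    ```python
    import ast
    from pathlib import Path

    def code_chunks(path):
        """Yield one chunk per top-level function or class in a Python file."""
        source = Path(path).read_text(encoding="utf-8")
        tree = ast.parse(source)
        preamble = []  # imports, constants, and other module-level statements
        for node in tree.body:
            segment = ast.get_source_segment(source, node)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                yield {"name": node.name, "kind": type(node).__name__, "text": segment}
            elif segment:
                preamble.append(segment)
        if preamble:
            yield {"name": "<preamble>", "kind": "Module", "text": "\n".join(preamble)}

    for chunk in code_chunks("my_module.py"):
        print(chunk["kind"], chunk["name"], len(chunk["text"]))
    ```

    Each chunk can then be passed to a model with a "write a question and answer about this code" prompt, the same way the book text is turned into Q&A pairs.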

  • @kaymcneely7635
    @kaymcneely7635 1 year ago +1

    Superb presentation. As always. 😊

  • @arinco3817
    @arinco3817 1 year ago +2

    This video was awesome! I'm finally starting to wrap my head round this stuff. At the same time I'm realising the power that is being unleashed onto the world!
    BTW, did you see this new paper: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. Looks like it's right up your alley!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I'm glad it's helpful :D
      I have not seen this, but it's super cool, thank you for pointing me to it! I would love to see some implementation of pruning in LLMs. Quantization is incredibly powerful, but we can only compress so much until we hit the limit. With pruning plus weight compression, we could run 30/65B parameter models on a single consumer GPU.

  • @flowers134
    @flowers134 1 year ago +1

    Amazing, thanks a lot for sharing your reflections on your work and experience! It is much appreciated! First time I check something like this while quickly browsing and stick with it without having to review/study and come back later. I am able to get a bird's-eye view of the topic, the options available for work, and the underlying purpose. 🥇Pure gold. Definitely subscribed!

  • @onurdatascience
    @onurdatascience 1 year ago +2

    Awesome video!

  • @danielmz99
    @danielmz99 1 year ago +1

    Hey man, thanks for your videos, they are instructive. I am new to LLMs and I think there is a significant gap in RUclips content on the new LLMs. I know there are videos on fine-tuning GPT-3, but I can't find anything like a walkthrough of fine-tuning a larger, newer open-source model like Falcon-40B Instruct. If there were a playlist going through the process - Q&A fine-tune data definition, synthetic data production, fine-tuning, and testing - I am sure others like myself would be very keen followers.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’ll make a playlist today!

  • @PhantasyAI0
    @PhantasyAI0 10 months ago +1

    Do you have a video on how to prepare a dataset for creative writing?

  • @Tranquilized_
    @Tranquilized_ 1 year ago +3

    You are an Angel. 💜

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :) I do like how you left your name that haha

  • @MohamedElGhazi-ek6vp
    @MohamedElGhazi-ek6vp 1 year ago +1

    It's so helpful, thank you. What if I have multiple PDF files at the same time and each one of them has its own subject - can I do the same for them?

  • @SamuelJohnKing
    @SamuelJohnKing 1 year ago +11

    I really love the concept, but whatever I have tried, I get ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048)
    Could you please update it? It would be of immense value to me :)

  • @babyfox205
    @babyfox205 8 months ago

    Great explanations, thanks a lot for your efforts making this great content!

  • @redbaron3555
    @redbaron3555 1 year ago

    Awesome content!! Thank you very much!!👏🏻👏🏻👍🏻

  • @mohammedanfalvp8691
    @mohammedanfalvp8691 1 year ago +3

    I am getting an error like this.
    Token indices sequence length is longer than the specified maximum sequence length for this model (546779 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk.

  • @AadarshRai2
    @AadarshRai2 4 months ago

    top notch content

  • @bleo4485
    @bleo4485 1 year ago +2

    Hi Aemon, I am new to setting up a local LLM API. Could you explain a little how to set it up? Thanks

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Hey there! From the OobaBooga web application you can enable extensions, including the API. It will run on port 5000 by default!

    • @champ8142
      @champ8142 1 year ago +1

      Hi Aemon, I checked api and public_api on the flags/extensions page - any idea why I can't connect to port 5000?
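
      For anyone stuck at this step, a rough sketch of calling the webui's legacy blocking API from Python. The endpoint path and payload fields are assumptions based on the old api extension and they change between webui versions (newer builds expose an OpenAI-compatible API instead), so check the docs for the version you run:

      ```python
      import requests

      def generate(prompt, max_new_tokens=200):
          # Assumes text-generation-webui was launched with the api extension
          # enabled and is listening on the default port 5000.
          resp = requests.post(
              "http://127.0.0.1:5000/api/v1/generate",
              json={"prompt": prompt, "max_new_tokens": max_new_tokens},
              timeout=120,
          )
          resp.raise_for_status()
          return resp.json()["results"][0]["text"]

      print(generate("Summarize why this claim denial should be appealed."))
      ```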

  • @filipbottcher4338
    @filipbottcher4338 1 year ago +1

    Well done, but how do you handle the max model length of tokenizer.encode?

  • @ĐôNguyễnThành-r1v
    @ĐôNguyễnThành-r1v 1 year ago +1

    Hi, I have some confusion about your content on leveraging embeddings. My understanding so far is that the embedding approach simply means "few-shot learning". The pipeline is, say, I have a query, I embed the query into a vector and then search for similar vectors which represent relevant examples in a vector db; now I have my initial query + some examples of (query, answer) from the db. Then I somehow cleverly concat my query with the retrieved examples to form a long instruction/prompt, feed it to the LLM, and just wait for the output. Did I get my understanding right?
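
    That matches the approach described in the video: embed the query, pull the nearest stored examples or passages from the vector store, and prepend them to the prompt. A minimal sketch of that retrieve-then-prompt loop; the model name, the toy examples, and the prompt layout are illustrative assumptions rather than the video's exact setup:

    ```python
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy "vector db": stored (question, answer) examples plus their embeddings.
    examples = [
        ("How do I appeal a denied claim?", "Cite the plan language and the treating physician's notes ..."),
        ("What counts as medical necessity?", "Care that is appropriate and consistent with the diagnosis ..."),
    ]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    example_vecs = embedder.encode([q for q, _ in examples], normalize_embeddings=True)

    def build_prompt(query, k=2):
        qvec = embedder.encode([query], normalize_embeddings=True)[0]
        scores = example_vecs @ qvec         # cosine similarity (vectors are normalized)
        best = np.argsort(scores)[::-1][:k]  # indices of the k nearest examples
        shots = "\n\n".join(f"Q: {examples[i][0]}\nA: {examples[i][1]}" for i in best)
        return f"{shots}\n\nQ: {query}\nA:"

    print(build_prompt("How should I word an appeal for a denied prescription?"))
    ```

    The fine-tune teaches the task format; the retrieved text supplies the specifics the model was never trained on.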

  • @unshadowlabs
    @unshadowlabs 1 year ago +1

    When you uploaded the additional data in superbooga, did you have to prep it first in a question and answer format like you did for the fine-tuning, or were you able to just upload books, files, etc. for that part? Also, thanks for doing these videos! These are by far the most informative on how this stuff works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      I just naively dumped the entire file, which I wouldn’t do for a more sophisticated application. Though superbooga will just chunk the files for you, so you can just drag and drop massive files.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @@AemonAlgiz Thanks. How do you deal with more complex formatted material, such as research papers? Are the parsers good enough to handle them without a lot of data cleaning or prep work on the paper first?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@unshadowlabs this has been my area of expertise for years! I worked in scientific publishing for over a decade, so what I find is that trying to naively parse them works to some extent, especially with research papers since they tend to be very topically dense. What you may find challenging is keeping all of the context densely packed, so it may be worth trying to split on taxonomic/ontological concepts.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @@AemonAlgiz Awesome, thanks for the reply! A suggestion for a video: I would love to see how you deal with different types of content and sources, what type of data processing, wrangling, or cleaning you do, and what tools you recommend given your expertise, background, and experience.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      This is a great idea; I have dealt with some nightmarish formats.
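
      A crude illustration of the "split on concepts" idea for papers, using section headings as split points; the heading list and regex are assumptions, and PDFs usually need a structure-aware extraction pass before text splitting like this is useful:

      ```python
      import re

      SECTION_RE = re.compile(
          r"^(?:\d+(?:\.\d+)*\s+)?(Abstract|Introduction|Related Work|Background|"
          r"Methods?|Experiments?|Results|Discussion|Conclusions?|References)\s*$",
          re.IGNORECASE | re.MULTILINE,
      )

      def split_on_sections(text):
          """Split extracted paper text into (heading, body) chunks."""
          matches = list(SECTION_RE.finditer(text))
          if not matches:
              return [("document", text)]
          chunks = []
          for i, m in enumerate(matches):
              end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
              chunks.append((m.group(1).title(), text[m.start():end].strip()))
          return chunks
      ```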

  • @cmosguy1
    @cmosguy1 1 year ago

    Hey @AemonAlgiz - How did you create the instruction set of data for the CYPHER query examples? Did you do that all manually?

  • @mygamecomputer1691
    @mygamecomputer1691 1 year ago

    Hi, I was listening to your description of raw text and then how you converted it. But can you just upload a very short story that has the style you like, take all the defaults on the training tab, use the plain TXT file, and make a LoRA that will be useful in that it will simulate the style I like in the model I want to use?

  • @bleo4485
    @bleo4485 1 year ago +2

    Aemon, what time will your live stream be?

  • @adriangabriel3219
    @adriangabriel3219 1 year ago +1

    Hi @AemonAlgiz, great video! I am using a similar approach (I use langchain for handing the documents over to an LLM) and I have tried a WizardLM model which hasn't performed too great. What strategies (fine-tuning, in-context learning, or other models?) would you recommend to improve the performance of answering a question given the retrieved documents? Can you recommend specific models (Flan-T5 or others)?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Gorilla is specifically tuned for use with langchain, so that may be an interesting model to test with. What kind of data do you want to use? That may influence my answer here.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      @@AemonAlgiz I haven't heard of Gorilla, so thanks for pointing that out! I would like to answer questions given paragraphs of a technical manual.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      Hi @@AemonAlgiz, I don't quite understand how to use Gorilla with an existing vector database. Could you make a video on that or do you have guidance for that? Am I supposed to use the OpenAI API for that use case?

  • @天蓝蓝的
    @天蓝蓝的 1 year ago

    Amazing work! I would like to know if it is possible to use langchain to load PDFs and batch-generate instruction datasets?

  • @d_b_
    @d_b_ 1 year ago

    Could you clarify the performance of the LLMs where you provide context but don't do a fine-tune? Was that last oobabooga medical appeal demo with a fine-tuned model, or was it just using the additional embedded context?

  • @CallisterPark
    @CallisterPark 1 year ago +1

    Hi @aemonAlgiz - how long did it take to fine-tune stablelm-base-alpha-7b? On what hardware?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      Howdy! Not very long for this, since it was a fairly small fine-tune, about an hour. I use an AMD 7950X3D CPU and an RTX 4090.

  • @srisai00123
    @srisai00123 8 months ago

    Token indices sequence length is longer than the specified maximum sequence length for this model (249345 > 2048). Running this sequence through the model will result in indexing errors
    I am facing this issue, please help with a resolution.

  • @aditiasetiawan563
    @aditiasetiawan563 8 months ago

    Can you explain the code to convert a PDF to JSON? I don't know how you're doing that. It's great and that's what we need. Thanks in advance.
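
    The rough shape of that step is: extract the text, chunk it, ask a model to write a question/answer pair about each chunk, and dump the pairs to JSON. A sketch with pypdf, where generate_qa is a hypothetical placeholder for whatever LLM call you use:

    ```python
    import json
    from pypdf import PdfReader

    def pdf_to_chunks(path, max_chars=3000):
        """Extract a PDF's text and split it into roughly fixed-size chunks."""
        text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    def generate_qa(chunk):
        # Hypothetical: send the chunk to your model and ask it to return a
        # {"question": ..., "answer": ...} pair about the passage.
        raise NotImplementedError

    pairs = [generate_qa(chunk) for chunk in pdf_to_chunks("book.pdf")]
    with open("dataset.json", "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=2, ensure_ascii=False)
    ```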

  • @protectorate2823
    @protectorate2823 1 year ago

    Hey Aemon, how can I structure my dataset so it outputs answers in a specific format every time? Is this possible?
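
    One way to encourage this, sketched under the assumption of an instruction-style dataset: give every training example the same instruction template and wrap every answer in the same small schema, so the structure itself is what the model learns to reproduce. The field names and template below are made up for illustration:

    ```python
    import json

    TEMPLATE = (
        "### Instruction:\n{question}\n\n"
        "### Response:\n{{\"answer\": \"{answer}\", \"citation\": \"{citation}\"}}"
    )

    raw = [
        {"question": "Was the prescription medically necessary?",
         "answer": "Yes", "citation": "Plan section 4.2"},
    ]
    dataset = [{"text": TEMPLATE.format(**row)} for row in raw]
    print(json.dumps(dataset, indent=2))
    ```

    At inference time you parse the structured part out of the response; the more uniformly the training data follows the schema, the more reliably the model sticks to it.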

  • @xspydazx
    @xspydazx 7 months ago

    Hmm, I would like to be able to update the LLM, i.e. by extracting the documents in a folder, extracting the text, and fine-tuning it in. I suppose the best way would be to inject it as a text dump - how? (Please.) I.e. take the whole text and tune for a single epoch only, as well as saving my chat history as an input/response dump, again a single epoch only.
    Question: each time we fine-tune, does it take the last layer, make a copy, train the copy, and replace the last layer? As the model weights are FROZEN, does this mean that they don't get updated? If so, is the LoRA applied to this last layer, essentially replacing it? If we keep replacing the last layer, do we essentially wipe over the previous training?
    I have seen that you can target specific layers. How do you determine which layers to target, and then create the config to match those layers?
    Question: how do we create a strategy for regular tuning without destroying the last training? Should we be targeting different layers each fine-tune?
    Also, why can we not tune it live, i.e. while we are talking to it, or discuss with the model and adjust the model whilst talking? Is adjusting the weights done by autograd in PyTorch with the optimization, i.e. the Adam optimizer? With each turn we can produce the loss from the input by supplying the expected outputs to compare with similarity, so if the output is over a specific threshold it would fine-tune according to the loss (optimize this once), i.e. switching between training and evaluation (freezing a specific percentage of the model) - essentially working with a live brain?
    How can we update the LLM with conversation? By giving it the function (function calling) to execute a single training optimization based on user feedback, i.e. positive and negative votes, and the current response chain? I.e. if RAG was used then the content should be tuned in?
    Sorry for the long post, but it all connects to the same thing.
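
    On the "does LoRA replace the last layer" part: it does not. The base weights stay frozen and small low-rank adapter matrices are added alongside the modules named in target_modules, so a new adapter does not overwrite earlier layers (though training the same adapter repeatedly on new data can still drift it away from what it learned before). A sketch with the peft library; the rank, alpha, and module names are illustrative, and module names differ by architecture (q_proj/v_proj for LLaMA-style models, query_key_value for NeoX-style ones such as StableLM-alpha):

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-7b")

    config = LoraConfig(
        r=16,                                # adapter rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # NeoX-style attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)    # base weights remain frozen
    model.print_trainable_parameters()       # only the adapter weights are trainable
    ```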

  • @othmankabbaj9960
    @othmankabbaj9960 1 year ago

    When training on a dataset, it seems the Q&A is too specific to the book. Wouldn't that make the model too specific to the use case you're training for?

  • @wilfredomartel7781
    @wilfredomartel7781 1 year ago +2

    Amazing work! I'm still trying to understand the embeddings approach. 😊

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Basically, we would rather teach the model how to use information than try to teach it everything. So, if we can give the model enough examples of what a procedure looks like, it can learn how to better follow it.
      So, take for example a paralegal or a lawyer. They're well educated on how to write legal briefs, though they're not aware of every law in existence. They have learned how to research and leverage information, which is what we're trying to do with this approach.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      The only way you'll understand it is by trying it yourself

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago

      @@Hypersniper05 you are right.

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago +1

      @@AemonAlgiz Thanks for the explanation, that clears up my doubt. I will try to reproduce it in my Colab Pro.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Let me know how the experiment goes!

  • @darklikeashadow6626
    @darklikeashadow6626 8 months ago +1

    Hi @aemonAlgiz, I am new to Python (and LLMs) and wanted to try creating a dataset from a book as well. However, when running the provided code, I got a warning:
    "Token indices sequence length is longer than the specified maximum sequence length for this model (181602 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk." (which happened a lot).
    The new .json file was empty. I tried changing "model_max_length" from 2048 to 200000 in the tokenizer_config for my model, but that only made the warning disappear (the result was the same).
    Would love it if anyone has a solution to this :)
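
    For everyone hitting the "Token indices sequence length is longer than the specified maximum" message: the warning itself is harmless at encode time, but prompts built from an unsplit book will exceed the model's 2048-token context, which is likely what causes the retries and the empty JSON; raising model_max_length only hides the warning. One way around it, sketched under the assumption of a Hugging Face tokenizer and a 2048-token model, is to chunk by token count before building any prompts:

    ```python
    from transformers import AutoTokenizer

    # Use the tokenizer that matches your model; gpt2 is only a stand-in here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def split_by_tokens(text, max_tokens=1500):
        """Split text so each piece leaves headroom inside a 2048-token context."""
        ids = tokenizer.encode(text)
        return [tokenizer.decode(ids[i:i + max_tokens])
                for i in range(0, len(ids), max_tokens)]

    chunks = split_by_tokens(open("book.txt", encoding="utf-8").read())
    print(len(chunks), "chunks")
    ```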

  • @tatsamui
    @tatsamui 1 year ago +2

    What's the difference between this and chat with documents?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      That's a great question! You can encourage the model to "behave" in a particular way. Though of course you're not really imbuing the model with knowledge; you're causing a preference for tokens that satisfy some requirement. For example, if I had enough samples for a solid fine-tune on appeals, it would write in a near human-like way in the process.
      So, combining the influence on the model's behavior with additional context from documents, you get a more modern version of an expert system. This is a technique we have been using in industry to get models to fulfill very specific use-cases.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      Think of it as if you were using Bing but the search results are very specific. This is good for closed domains and very specific tasks. I use it for work as well, on closed-domain data.

  • @AadeshKulkarni
    @AadeshKulkarni 1 year ago

    Which model did you use on oobabooga?

  • @LoneRanger.801
    @LoneRanger.801 1 year ago

    Waiting for new content 😊

  • @GamingDaveUK
    @GamingDaveUK 1 year ago +1

    So with superbooga you could just drop in the file with the Q&A from the book, add an injection point in your prompt, and the LLM has access to the data?
    That sounds too easy lol
    So say you want to have oobabooga be a storytelling AI - can you add the injection point in that opening prompt, feed it a Q&A made from Stargate scripts, and then have it use that data in responses to set tone and characters?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Superbooga makes it pretty easy! They have a drag and drop embedding system and it handles the rest for you. It’s not going to be optimal for all use-cases but it works well in general

  • @LikithVibes
    @LikithVibes 1 year ago

    @AemonAlgiz How do you enable the Superbooga API?

  • @PromptoraApps
    @PromptoraApps 1 year ago

    I am getting this error: "Max retries exceeded. Skipping this chunk."

  • @LeonvanBokhorst
    @LeonvanBokhorst 1 year ago

    🙏 thanks

  • @li-pingho1441
    @li-pingho1441 1 year ago

    thank you soooooo much

  • @amortalbeing
    @amortalbeing 1 year ago

    thanks man

  • @JAIRREVOLUTION7
    @JAIRREVOLUTION7 1 year ago

    Thanks for your awesome video. If you someday want to work as a mentor for our startup, write me, dude.

  • @РыгорБородулин-ц1е

    I still understood literally nothing. What do vector databases have to do with embedding vectors in language models? And how do they get utilized anyway? This video feels like "we mentioned them in adjacent sentences and this shows they can work together".

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Howdy! I’m happy to try and explain anything that’s not clear. Where are things not making sense?

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @@AemonAlgiz The whole thing, the entire pipeline, especially for the QA purpose. Like, if I have a huge document put into a vector database, an embedding for a question about this document can very well be really far away from any relevant vector in the database, thus making the chances of getting a relevant vector from the database smaller. If this vector affects further model generation, then we won't get an answer to this question. It's also not clear how exactly this vector is getting used within the model anyway. Is this concatenation? Or is it used as a bias vector? Or is it a soft prompt?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@РыгорБородулин-ц1е this is a great question! This is why we have the tags around different portions of the input, mainly to control the documents that are queried for. Since we can wrap the input, we have explicit control over what portion of the input text gets embedded for the query. Does that make more sense?
      Also, the way we chunk inputs helps to prevent getting portions of the document that aren’t relevant. The way I embedded in this example was naive, though we can use very intricate chunking methodologies to have a higher assurance of topical density.

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @@AemonAlgiz In that case, if we need explicit control over which documents/portions of documents are queried, it looks like the queries in question are more like queries to old-fashioned databases and less like questions to a language model, with a lot of manual labour and engineering knowledge required to make fruitful requests.

  • @caseygoodrich9717
    @caseygoodrich9717 1 year ago +1

    Lip-sync issue with your audio.

  • @pedro336
    @pedro336 1 year ago

    Did you skip the training process?

  • @stephenphillips8782
    @stephenphillips8782 1 year ago

    I am going to get fired if you don't come back

  • @vicentegimeno6806
    @vicentegimeno6806 1 year ago +5

    Hi, I'm new to Python and getting an error related to the token sequence length exceeding the maximum limit of the model, could you please help me to solve the problem?
    ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048). Running this sequence through the model will result in indexing errors 2023-08-24 10:41:54.890169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

    • @SamuelJohnKing
      @SamuelJohnKing 1 year ago

      Would also love an answer to the token indices issue.

  • @fndTenorio
    @fndTenorio 1 year ago

    So in the embedding approach the embeddings are just additional information that is injected into the prompt itself? In other words, the fine-tuned model knows how to do something, but I can use extra help (the embedding info) to generate a better prompt? If so, we are optimizing the prompt, right? Thanks for the video!

  • @leemark7739
    @leemark7739 1 year ago

    UnboundLocalError: local variable ‘iter’ referenced before assignment

  • @linuxbrad
    @linuxbrad 1 year ago +10

    Wasted 10 minutes to find out you're using an API, "oobabooga", instead of actually telling us how.