How To Create Datasets for Finetuning From Multiple Sources! Improving Finetunes With Embeddings.

  • Published: 29 Nov 2024

Comments • 113

  • @cesarsantos854
    @cesarsantos854 1 year ago +37

    This is top-notch content among ML and AI on RUclips, showing us how it really works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Thank you, I’m glad it’s helpful!

  • @AemonAlgiz
    @AemonAlgiz  1 year ago +4

    Comedy dataset update! I have found an approach I think I like for it, though I didn't have time to complete it for this video. So, I will also cover that in today's live stream!

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +9

    Okay, so after a cup of coffee and watching a couple of times, WOW. You helped me so much, thank you. This has been driving me nuts and you make it look so easy to fix. I wish I were as smart as you. Thank you again. 🎉

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      You always ask the best questions, so keep them coming :)

  • @pelaus01
    @pelaus01 1 year ago +10

    Amazing work... this channel is pure gold, exactly the right amount of concepts, everything is spot on. Nothing beats teaching by experience like you do.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it was helpful and thank you for the comment :)!

  • @leont.17
    @leont.17 1 year ago +1

    I very much appreciate that you always have this way of listing the most important bullet points at the beginning.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’m glad it’s helpful! I figured it would be nice to give a quick overview

  • @fabsync
    @fabsync 6 months ago +1

    Finally some freaking great tutorial! Practical, straight to the point and it works!!

  • @smellslikeupdog80
    @smellslikeupdog80 1 year ago +5

    I knew I subscribed here for good reason. This is consistently extremely high-quality information -- not the regurgitated stuff. This is super educational and has immensely improved my understanding.
    Please keep going bud, this is great.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! It’s greatly appreciated

  • @HistoryIsAbsurd
    @HistoryIsAbsurd 10 months ago +1

    Dude, seriously, your content is so clear and easy to follow. Keep it up!

  • @RAG3Network
    @RAG3Network 6 months ago

    You’re literally a genius! I appreciate you taking the time to share the knowledge with us! Exactly what I was looking for… how to create a dataset and in such a well put together video. Thank you

  • @boogfromopenseason
    @boogfromopenseason 7 months ago +1

    I would pay a lot of money for this information, thank you.

  • @jonmichaelgalindo
    @jonmichaelgalindo 1 year ago +3

    The appeal has been processed by the approval AI... And it passed! The prescription will now be covered. 😊
    (Thank you for the video! I think datasets and installing dependencies are ML's greatest pain points at the moment.)

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :)

  • @rosenangelow6082
    @rosenangelow6082 1 year ago +2

    Great explanation with the right level of details and depth. Good stuff. Thanks!

  • @Hypersniper05
    @Hypersniper05 1 year ago +2

    That's awesome! And you can even save the new appeal to create more data!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Indeed! It becomes a very nice self-reinforcing model; this is why I really like the fine-tuning and embedding approach.

  • @timothymaggenti717
    @timothymaggenti717 1 year ago +3

    Wow, how do you make everything look so easy? Nice, thanks. So East Coast, man, you're an early bird.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I live in MST, haha. I just wake up very early :)

  • @kenfink9997
    @kenfink9997 1 year ago +2

    How would building a training set on a codebase look? Is there a good example of automating generation of a Q&A training set based on code? How do you chunk it to fit in the context window - break it up by functions and classes? Where would extraneous stuff go, like requirements, imports, etc.? Thanks for the great content!
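
    Building on the question above, a minimal sketch of chunking a Python codebase by function and class with the standard-library ast module; the file name and the idea of gathering imports and other module-level statements into a shared "preamble" chunk are illustrative assumptions, not something shown in the video:

    ```python
    import ast
    from pathlib import Path

    def code_chunks(path):
        """Yield one chunk per top-level function or class in a Python file."""
        source = Path(path).read_text(encoding="utf-8")
        tree = ast.parse(source)
        preamble = []  # imports, constants, and other module-level statements
        for node in tree.body:
            segment = ast.get_source_segment(source, node)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                yield {"name": node.name, "kind": type(node).__name__, "text": segment}
            elif segment:
                preamble.append(segment)
        if preamble:
            yield {"name": "<preamble>", "kind": "Module", "text": "\n".join(preamble)}

    for chunk in code_chunks("my_module.py"):
        print(chunk["kind"], chunk["name"], len(chunk["text"]))
    ```

    Each chunk can then be passed to a model with a "write a question and answer about this code" prompt, the same way the book text is turned into Q&A pairs.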

  • @kaymcneely7635
    @kaymcneely7635 1 year ago +1

    Superb presentation. As always. 😊

  • @arinco3817
    @arinco3817 1 year ago +2

    This video was awesome! I'm finally starting to wrap my head round this stuff. At the same time I'm realising the power that is being unleashed onto the world!
    BTW, did you see this new paper: SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. Looks like it's right up your alley!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I'm glad it's helpful :D
      I have not seen this, but it's super cool, thank you for pointing me to it! I would love to see some implementation of pruning in LLMs. Quantization is incredibly powerful, but we can only compress so much until we hit the limit. With pruning plus weight compression, we could run 30/65B parameter models on a single consumer GPU.

  • @flowers134
    @flowers134 1 year ago +1

    Amazing, thanks a lot for sharing your reflections on your work and experience! It is much appreciated! First time I check something like this while quickly browsing and stick with it without having to review/study and come back later. I am able to get a bird's-eye view of the topic, the options available for work, and the underlying purpose. 🥇Pure gold. Definitely subscribed!

  • @onurdatascience
    @onurdatascience 1 year ago +2

    Awesome video!

  • @danielmz99
    @danielmz99 1 year ago +1

    Hey man, thanks for your videos, they are instructive. I am new to LLMs and I think there is a significant gap in RUclips content on the new LLMs. I know there are videos on fine-tuning GPT-3, but I can't find anything like a walkthrough of fine-tuning a larger, newer open-source model like Falcon-40B Instruct. If there were a playlist going through the process - Q&A fine-tune data definition, synthetic data production, fine-tuning, and testing - I am sure others like myself would be very keen followers.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      I’ll make a playlist today!

  • @PhantasyAI0
    @PhantasyAI0 10 months ago +1

    Do you have a video on how to prepare a dataset for creative writing?

  • @Tranquilized_
    @Tranquilized_ 1 year ago +3

    You are an Angel. 💜

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Thank you! I’m glad it was helpful :) I do like how you left your name that haha

  • @MohamedElGhazi-ek6vp
    @MohamedElGhazi-ek6vp 1 year ago +1

    It's so helpful, thank you. What if I have multiple PDF files at the same time and each one of them has its own subject - can I do the same for them?

  • @SamuelJohnKing
    @SamuelJohnKing 1 year ago +11

    I really love the concept, but whatever I have tried, I get ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048)
    Could you please update it? It would be of immense value to me :)

  • @babyfox205
    @babyfox205 8 months ago

    Great explanations, thanks a lot for your efforts making this great content!

  • @redbaron3555
    @redbaron3555 1 year ago

    Awesome content!! Thank you very much!!👏🏻👏🏻👍🏻

  • @mohammedanfalvp8691
    @mohammedanfalvp8691 1 year ago +3

    I am getting an error like this.
    Token indices sequence length is longer than the specified maximum sequence length for this model (546779 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk.

  • @AadarshRai2
    @AadarshRai2 4 months ago

    top notch content

  • @bleo4485
    @bleo4485 1 year ago +2

    Hi Aemon, I am new to setting up a local LLM API. Could you explain a little how to set it up? Thanks

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Hey there! From the OobaBooga web application you can enable extensions, including the API. It will run on port 5000 by default!

    • @champ8142
      @champ8142 1 year ago +1

      Hi Aemon, I checked api and public_api on the flags/extensions page - any idea why I can't connect to port 5000?
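
      For anyone stuck at this step, a rough sketch of calling the webui's legacy blocking API from Python. The endpoint path and payload fields are assumptions based on the old api extension and they change between webui versions (newer builds expose an OpenAI-compatible API instead), so check the docs for the version you run:

      ```python
      import requests

      def generate(prompt, max_new_tokens=200):
          # Assumes text-generation-webui was launched with the api extension
          # enabled and is listening on the default port 5000.
          resp = requests.post(
              "http://127.0.0.1:5000/api/v1/generate",
              json={"prompt": prompt, "max_new_tokens": max_new_tokens},
              timeout=120,
          )
          resp.raise_for_status()
          return resp.json()["results"][0]["text"]

      print(generate("Summarize why this claim denial should be appealed."))
      ```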

  • @filipbottcher4338
    @filipbottcher4338 1 year ago +1

    Well done, but how do you handle the max model length of tokenizer.encode?

  • @ĐôNguyễnThành-r1v
    @ĐôNguyễnThành-r1v 1 year ago +1

    Hi, I have some confusion about your content on leveraging embeddings. My understanding so far is that the embedding approach simply means "few-shot learning". The pipeline is, say, I have a query, I embed the query into a vector and then search for similar vectors which represent relevant examples in a vector db; now I have my initial query + some examples of (query, answer) from the db. Then I somehow cleverly concat my query with the retrieved examples to form a long instruction/prompt, feed it to the LLM, and just wait for the output. Did I get my understanding right?
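
    That matches the approach described in the video: embed the query, pull the nearest stored examples or passages from the vector store, and prepend them to the prompt. A minimal sketch of that retrieve-then-prompt loop; the model name, the toy examples, and the prompt layout are illustrative assumptions rather than the video's exact setup:

    ```python
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy "vector db": stored (question, answer) examples plus their embeddings.
    examples = [
        ("How do I appeal a denied claim?", "Cite the plan language and the treating physician's notes ..."),
        ("What counts as medical necessity?", "Care that is appropriate and consistent with the diagnosis ..."),
    ]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    example_vecs = embedder.encode([q for q, _ in examples], normalize_embeddings=True)

    def build_prompt(query, k=2):
        qvec = embedder.encode([query], normalize_embeddings=True)[0]
        scores = example_vecs @ qvec         # cosine similarity (vectors are normalized)
        best = np.argsort(scores)[::-1][:k]  # indices of the k nearest examples
        shots = "\n\n".join(f"Q: {examples[i][0]}\nA: {examples[i][1]}" for i in best)
        return f"{shots}\n\nQ: {query}\nA:"

    print(build_prompt("How should I word an appeal for a denied prescription?"))
    ```

    The fine-tune teaches the task format; the retrieved text supplies the specifics the model was never trained on.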

  • @unshadowlabs
    @unshadowlabs 1 year ago +1

    When you uploaded the additional data in superbooga, did you have to prep it first in a question and answer format like you did for the fine-tuning, or were you able to just upload books, files, etc. for that part? Also, thanks for doing these videos! These are by far the most informative on how this stuff works!

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      I just naively dumped the entire file, which I wouldn’t do for a more sophisticated application. Though superbooga will just chunk the files for you, so you can just drag and drop massive files.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @@AemonAlgiz Thanks. How do you deal with more complex formatted material, such as research papers? Are the parsers good enough to handle them without a lot of data cleaning or prep work on the paper first?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@unshadowlabs this has been my area of expertise for years! I worked in scientific publishing for over a decade, so what I find is that trying to naively parse them works to some extent, especially with research papers since they tend to be very topically dense. What you may find challenging is keeping all of the context densely packed, so it may be worth trying to split on taxonomic/ontological concepts.

    • @unshadowlabs
      @unshadowlabs 1 year ago +1

      @@AemonAlgiz Awesome, thanks for the reply! A suggestion for a video: I would love to see how you deal with different types of content and sources, what type of data processing, wrangling, or cleaning you do, and what tools you recommend given your expertise, background, and experience.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      This is a great idea; I have dealt with some nightmarish formats.
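
      A crude illustration of the "split on concepts" idea for papers, using section headings as split points; the heading list and regex are assumptions, and PDFs usually need a structure-aware extraction pass before text splitting like this is useful:

      ```python
      import re

      SECTION_RE = re.compile(
          r"^(?:\d+(?:\.\d+)*\s+)?(Abstract|Introduction|Related Work|Background|"
          r"Methods?|Experiments?|Results|Discussion|Conclusions?|References)\s*$",
          re.IGNORECASE | re.MULTILINE,
      )

      def split_on_sections(text):
          """Split extracted paper text into (heading, body) chunks."""
          matches = list(SECTION_RE.finditer(text))
          if not matches:
              return [("document", text)]
          chunks = []
          for i, m in enumerate(matches):
              end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
              chunks.append((m.group(1).title(), text[m.start():end].strip()))
          return chunks
      ```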

  • @cmosguy1
    @cmosguy1 1 year ago

    Hey @AemonAlgiz - How did you create the instruction set of data for the CYPHER query examples? Did you do that all manually?

  • @mygamecomputer1691
    @mygamecomputer1691 1 year ago

    Hi, I was listening to your description of raw text and then how you converted it. But can you just upload a very short story that has the style you like, take all the defaults on the training tab, use the plain TXT file, and make a LoRA that will be useful in that it will simulate the style I like in the model I want to use?

  • @bleo4485
    @bleo4485 1 year ago +2

    Aemon, what time will your live stream be?

  • @adriangabriel3219
    @adriangabriel3219 1 year ago +1

    Hi @AemonAlgiz, great video! I am using a similar approach (I use langchain for handing the documents over to an LLM) and I have tried a WizardLM model which hasn't performed too great. What strategies (fine-tuning, in-context learning, or other models?) would you recommend to improve the performance of answering a question given the retrieved documents? Can you recommend specific models (Flan-T5 or others)?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Gorilla is specifically tuned for use with langchain, so that may be an interesting model to test with. What kind of data do you want to use? That may influence my answer here.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      @@AemonAlgiz I haven't heard of Gorilla, so thanks for pointing that out! I would like to answer questions given paragraphs of a technical manual.

    • @adriangabriel3219
      @adriangabriel3219 1 year ago

      Hi @@AemonAlgiz, I don't quite understand how to use Gorilla with an existing vector database. Could you make a video on that or do you have guidance for that? Am I supposed to use the OpenAI API for that use case?

  • @天蓝蓝的
    @天蓝蓝的 1 year ago

    Amazing work! I would like to know if it is possible to use langchain to load PDFs and batch-generate instruction datasets?

  • @d_b_
    @d_b_ 1 year ago

    Could you clarify the performance of the LLMs where you provide context but don't do a fine-tune? Was that last oobabooga medical appeal demo with a fine-tuned model, or was it just using the additional embedded context?

  • @CallisterPark
    @CallisterPark 1 year ago +1

    Hi @aemonAlgiz - how long did it take to fine-tune stablelm-base-alpha-7b? On what hardware?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      Howdy! Not very long for this, since it was a fairly small fine-tune, about an hour. I use an AMD 7950X3D CPU and an RTX 4090.

  • @srisai00123
    @srisai00123 8 months ago

    Token indices sequence length is longer than the specified maximum sequence length for this model (249345 > 2048). Running this sequence through the model will result in indexing errors
    I am facing this issue, please help with a resolution.

  • @aditiasetiawan563
    @aditiasetiawan563 8 months ago

    Can you explain the code to convert a PDF to JSON? I don't know how you're doing that. It's great and that's what we need. Thanks in advance.
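
    The rough shape of that step is: extract the text, chunk it, ask a model to write a question/answer pair about each chunk, and dump the pairs to JSON. A sketch with pypdf, where generate_qa is a hypothetical placeholder for whatever LLM call you use:

    ```python
    import json
    from pypdf import PdfReader

    def pdf_to_chunks(path, max_chars=3000):
        """Extract a PDF's text and split it into roughly fixed-size chunks."""
        text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    def generate_qa(chunk):
        # Hypothetical: send the chunk to your model and ask it to return a
        # {"question": ..., "answer": ...} pair about the passage.
        raise NotImplementedError

    pairs = [generate_qa(chunk) for chunk in pdf_to_chunks("book.pdf")]
    with open("dataset.json", "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=2, ensure_ascii=False)
    ```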

  • @protectorate2823
    @protectorate2823 1 year ago

    Hey Aemon, how can I structure my dataset so it outputs answers in a specific format every time? Is this possible?
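
    One way to encourage this, sketched under the assumption of an instruction-style dataset: give every training example the same instruction template and wrap every answer in the same small schema, so the structure itself is what the model learns to reproduce. The field names and template below are made up for illustration:

    ```python
    import json

    TEMPLATE = (
        "### Instruction:\n{question}\n\n"
        "### Response:\n{{\"answer\": \"{answer}\", \"citation\": \"{citation}\"}}"
    )

    raw = [
        {"question": "Was the prescription medically necessary?",
         "answer": "Yes", "citation": "Plan section 4.2"},
    ]
    dataset = [{"text": TEMPLATE.format(**row)} for row in raw]
    print(json.dumps(dataset, indent=2))
    ```

    At inference time you parse the structured part out of the response; the more uniformly the training data follows the schema, the more reliably the model sticks to it.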

  • @xspydazx
    @xspydazx 7 months ago

    Hmm, I would like to be able to update the LLM, i.e. by extracting the documents in a folder, extracting the text, and fine-tuning it in. I suppose the best way would be to inject it as a text dump - how? (Please.) I.e. take the whole text and tune for a single epoch only, as well as saving my chat history as an input/response dump, again a single epoch only.
    Question: each time we fine-tune, does it take the last layer, make a copy, train the copy, and replace the last layer? As the model weights are FROZEN, does this mean that they don't get updated? If so, is the LoRA applied to this last layer, essentially replacing it? If we keep replacing the last layer, do we essentially wipe over the previous training?
    I have seen that you can target specific layers. How do you determine which layers to target, and then create the config to match those layers?
    Question: how do we create a strategy for regular tuning without destroying the last training? Should we be targeting different layers each fine-tune?
    Also, why can we not tune it live, i.e. while we are talking to it, or discuss with the model and adjust the model whilst talking? Is adjusting the weights done by autograd in PyTorch with the optimization, i.e. the Adam optimizer? With each turn we can produce the loss from the input by supplying the expected outputs to compare with similarity, so if the output is over a specific threshold it would fine-tune according to the loss (optimize this once), i.e. switching between training and evaluation (freezing a specific percentage of the model) - essentially working with a live brain?
    How can we update the LLM with conversation? By giving it the function (function calling) to execute a single training optimization based on user feedback, i.e. positive and negative votes, and the current response chain? I.e. if RAG was used then the content should be tuned in?
    Sorry for the long post, but it all connects to the same thing.
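
    On the "does LoRA replace the last layer" part: it does not. The base weights stay frozen and small low-rank adapter matrices are added alongside the modules named in target_modules, so a new adapter does not overwrite earlier layers (though training the same adapter repeatedly on new data can still drift it away from what it learned before). A sketch with the peft library; the rank, alpha, and module names are illustrative, and module names differ by architecture (q_proj/v_proj for LLaMA-style models, query_key_value for NeoX-style ones such as StableLM-alpha):

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-7b")

    config = LoraConfig(
        r=16,                                # adapter rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # NeoX-style attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)    # base weights remain frozen
    model.print_trainable_parameters()       # only the adapter weights are trainable
    ```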

  • @othmankabbaj9960
    @othmankabbaj9960 1 year ago

    When training on a dataset, it seems the Q&A is too specific to the book. Wouldn't that make the model too specific to the use case you're training for?

  • @wilfredomartel7781
    @wilfredomartel7781 1 year ago +2

    Amazing work! I'm still trying to understand the embeddings approach. 😊

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Basically, we would rather teach the model how to use information than try to teach it everything. So, if we can give the model enough examples of what a procedure looks like, it can learn how to better follow it.
      So, take for example a paralegal or a lawyer. They're well educated on how to write legal briefs, though they're not aware of every law in existence. They have learned how to research and leverage information, which is what we're trying to do with this approach.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      The only way you'll understand it is by trying it yourself

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago

      @@Hypersniper05 you are right.

    • @wilfredomartel7781
      @wilfredomartel7781 1 year ago +1

      @@AemonAlgiz Thanks for the explanation, that clears up my doubt. I will try to reproduce it in my Colab Pro.

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      Let me know how the experiment goes!

  • @darklikeashadow6626
    @darklikeashadow6626 8 months ago +1

    Hi @aemonAlgiz, I am new to Python (and LLMs) and wanted to try creating a dataset from a book as well. However, when running the provided code, I got a warning:
    "Token indices sequence length is longer than the specified maximum sequence length for this model (181602 > 2048). Running this sequence through the model will result in indexing errors
    Max retries exceeded. Skipping this chunk." (which happened a lot).
    The new .json file was empty. I tried changing "model_max_length" from 2048 to 200000 in the tokenizer_config for my model, but that only made the warning disappear (the result was the same).
    Would love it if anyone has a solution to this :)
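
    For everyone hitting the "Token indices sequence length is longer than the specified maximum" message: the warning itself is harmless at encode time, but prompts built from an unsplit book will exceed the model's 2048-token context, which is likely what causes the retries and the empty JSON; raising model_max_length only hides the warning. One way around it, sketched under the assumption of a Hugging Face tokenizer and a 2048-token model, is to chunk by token count before building any prompts:

    ```python
    from transformers import AutoTokenizer

    # Use the tokenizer that matches your model; gpt2 is only a stand-in here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    def split_by_tokens(text, max_tokens=1500):
        """Split text so each piece leaves headroom inside a 2048-token context."""
        ids = tokenizer.encode(text)
        return [tokenizer.decode(ids[i:i + max_tokens])
                for i in range(0, len(ids), max_tokens)]

    chunks = split_by_tokens(open("book.txt", encoding="utf-8").read())
    print(len(chunks), "chunks")
    ```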

  • @tatsamui
    @tatsamui 1 year ago +2

    What's the difference between this and chat with documents?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +2

      That's a great question! You can encourage the model to "behave" in a particular way. Though of course you're not really imbuing the model with knowledge; you're causing a preference for tokens that satisfy some requirement. For example, if I had enough samples for a solid fine-tune on appeals, it would write in a near human-like way in the process.
      So, combining the influence on the model's behavior with additional context from documents, you get a more modern version of an expert system. This is a technique we have been using in industry to get models to fulfill very specific use-cases.

    • @Hypersniper05
      @Hypersniper05 1 year ago +1

      Think of it as if you were using Bing but the search results are very specific. This is good for closed domains and very specific tasks. I use it for work as well, on closed-domain data.

  • @AadeshKulkarni
    @AadeshKulkarni 1 year ago

    Which model did you use on oobabooga?

  • @LoneRanger.801
    @LoneRanger.801 1 year ago

    Waiting for new content 😊

  • @GamingDaveUK
    @GamingDaveUK 1 year ago +1

    So with superbooga you could just drop in the file with the Q&A from the book, add an injection point in your prompt, and the LLM has access to the data?
    That sounds too easy lol
    So say you want to have oobabooga be a storytelling AI - can you add the injection point in that opening prompt, feed it a Q&A made from Stargate scripts, and then have it use that data in responses to set tone and characters?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Superbooga makes it pretty easy! They have a drag and drop embedding system and it handles the rest for you. It’s not going to be optimal for all use-cases but it works well in general

  • @LikithVibes
    @LikithVibes 1 year ago

    @AemonAlgiz How do you enable the Superbooga API?

  • @PromptoraApps
    @PromptoraApps 1 year ago

    I am getting this error: "Max retries exceeded. Skipping this chunk."

  • @LeonvanBokhorst
    @LeonvanBokhorst 1 year ago

    🙏 thanks

  • @li-pingho1441
    @li-pingho1441 1 year ago

    thank you soooooo much

  • @amortalbeing
    @amortalbeing 1 year ago

    thanks man

  • @JAIRREVOLUTION7
    @JAIRREVOLUTION7 1 year ago

    Thanks for your awesome video. If you someday want to work as a mentor for our startup, write me, dude.

  • @РыгорБородулин-ц1е

    I still understood literally nothing. What do vector databases have to do with embedding vectors in language models? And how do they get utilized anyway? This video feels like "we mentioned them in adjacent sentences and this shows they can work together".

    • @AemonAlgiz
      @AemonAlgiz  1 year ago

      Howdy! I’m happy to try and explain anything that’s not clear. Where are things not making sense?

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @@AemonAlgiz The whole thing, the entire pipeline, especially for the QA purpose. Like, if I have a huge document put into a vector database, an embedding for a question about this document can very well be really far away from any relevant vector in the database, thus making the chances of getting a relevant vector from the database smaller. If this vector affects further model generation, then we won't get an answer to this question. It's also not clear how exactly this vector is getting used within the model anyway. Is this concatenation? Or is it used as a bias vector? Or is it a soft prompt?

    • @AemonAlgiz
      @AemonAlgiz  1 year ago +1

      @@РыгорБородулин-ц1е this is a great question! This is why we have the tags around different portions of the input, mainly to control the documents that are queried for. Since we can wrap the input, we have explicit control over what portion of the input text gets embedded for the query. Does that make more sense?
      Also, the way we chunk inputs helps to prevent getting portions of the document that aren’t relevant. The way I embedded in this example was naive, though we can use very intricate chunking methodologies to have a higher assurance of topical density.

    • @РыгорБородулин-ц1е
      @РыгорБородулин-ц1е 1 year ago

      @@AemonAlgiz In that case, if we need explicit control over which documents/portions of documents are queried, it looks like the queries in question are more like queries to old-fashioned databases and less like questions to a language model, with a lot of manual labour and engineering knowledge required to make fruitful requests.

  • @caseygoodrich9717
    @caseygoodrich9717 1 year ago +1

    Lip-sync issue with your audio.

  • @pedro336
    @pedro336 1 year ago

    Did you skip the training process?

  • @stephenphillips8782
    @stephenphillips8782 1 year ago

    I am going to get fired if you don't come back

  • @vicentegimeno6806
    @vicentegimeno6806 1 year ago +5

    Hi, I'm new to Python and getting an error related to the token sequence length exceeding the maximum limit of the model, could you please help me to solve the problem?
    ERROR: Token indices sequence length is longer than the specified maximum sequence length for this model (194233 > 2048). Running this sequence through the model will result in indexing errors 2023-08-24 10:41:54.890169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

    • @SamuelJohnKing
      @SamuelJohnKing 1 year ago

      Would also love an answer to the token indices issue.

  • @fndTenorio
    @fndTenorio 1 year ago

    So in the embedding approach the embeddings are just additional information that is injected into the prompt itself? In other words, the fine-tuned model knows how to do something, but I can use extra help (the embedding info) to generate a better prompt? If so, we are optimizing the prompt, right? Thanks for the video!

  • @leemark7739
    @leemark7739 1 year ago

    UnboundLocalError: local variable ‘iter’ referenced before assignment

  • @linuxbrad
    @linuxbrad 1 year ago +10

    Wasted 10 minutes to find out you're using an API, "oobabooga", instead of actually telling us how.