Converting a LangChain App from OpenAI to OpenSource

Sam Witteveen

Просмотров 16 тыс.

Добавить в
- Мой плейлист
- Посмотреть позже
Поделиться

HTML-код

Размер видео:

Показать панель управления

Автовоспроизведение

Автоповтор

Опубликовано: 1 фев 2025

Комментарии • 69

@julian-fricker Год назад ⁺⁵
This is exactly why I'm learning langchain and creating tools with data I don't care about for now. I know one day I'll flick a switch and have the ability to do all of this locally with open source tools and not worry about the security of my real data.
This is the way!
@vhater2006 Год назад
Good Luck on Privacy ;)
@robxmccarthy Год назад ⁺¹⁰
Thank you so much for doing all of this work!
Would be really interesting to compare the larger models. If GPT-3.5-Turbo is based on a 176b parameter model, it's going to be very difficult for a 13b model to stack up.
13b models seem more appropriate for fine tuning, where the limited parameter count can be focused on specific context and domains - such as these texts and a QA structure for answering questions over the text. The example QA instructions and labels could be generated using OpenAI to ask questions over the text as in your first example.
This is all very expensive and time consuming though... So I think you'd really need a real world business use case to justify the experimentation and development time required.
@Rems766 Год назад ⁺⁸
Mate you are doing all the work I planned, for me. Thanks a lot.
@jarekmor Год назад ⁺²
Unique content and format. Practical examples. Something amazing! Don't stop making new videos.
@georgep.8478 Год назад ⁺¹
This is great. Please follow up on fine tuning a smaller model on the text and epub
@acortis Год назад ⁺¹
This was very helpful! Thanks so much for doing these videos. May I suggest that you do a video on the things that are needed to fine-tune some of the LLMs having a specific goal in mind? not sure that this is something that can be done on a colab, but knowing what are the steps and the required resources might be very helpful. Thanks again!
@samwitteveenai Год назад ⁺¹
I will certainly make some more fine-tuning vids. anything good examples you mean by "having a specific goal in mind"?
@acortis Год назад ⁺¹
@@samwitteveenai I saw your video on fine-tuning with PEFT on the English quotes, and I thought the final result was a bit of a hit-and-miss. I was wondering what specific type of datasets would be needed for, say, reasoning or data extraction (a la squadv2). Overall, I have the sense that LLMs are trying to train on too much data (why in the world we are trying to get exact arithmetic is beyond me!). I think that it would be more efficient if there was a more specific model just dedicated to learning English grammar and then smaller, topic-specific, models. Just my gut feeling.
@samwitteveenai Год назад ⁺¹
@@acortis This is something I am working on a lot. The PEFT task was partially due to me not training it very long, it was just to give people something they could use to learn on. Reasoning is a task that normally requires bigger models etc. for few shot tasks. I am currently training models around 3B for very specific types of tasks around ReACT and PAL. I totally agree about the arithmetic etc. what I am interested in though is models that can do the PAL tasks etc. I have a video on that from about 2 months ago. I will make some more fine tuning content. I want to show QLoRA and some other cool stuff in PEFT as well.
@tejaswi1995 Год назад
The video I was most waiting for on your channel 🔥
@DaTruAndi Год назад ⁺⁴
Can you look into using the quantized models (GPTQ 4 bit or GGML 4.1) for example with langchain?
@tensiondriven Год назад ⁺³
This might be trivial; but I’d love a video on the difference between running a notebook and running a cli vs api. All the demos use notebooks, but to make it useful we need apis and cli!
@theh1ve Год назад ⁺¹
I'd like to see this too. I want my model inferences running on one network machine and a GUI running on another with API calls.
@thewimo8298 Год назад
Thank you Sam! Appreciate the guide with the non-OpenAI LLMs!
@reinerheiner1148 Год назад ⁺⁴
I've really wondered how open source models would perform with langchain vs gpt 3.5 turbo so thanks for making that video. I suspected that the open source models would probably not perform as well but I did not think it would be that bad. Could you maybe provide us with a list of LLM's you tried that didnt work out, so we can cross them off our list of models to try for langchain?
In any case thanks for making this notebook, it'll make it so much easier for me to mess around with open source models and langchain!
@PleaseOpenSourceAI Год назад ⁺³
Great job, but these HF models are really large - even 7B ones take more than 12Gb of memory, so can't really run them on local cuda core. I'm almost at the point of beginning to try to figure out how to use GPTQ models for these purposes). It's been a month already and seems like no one is doing it for some reason. Do you know if there is some big obvious roadblock on this path?
@fv4466 Год назад ⁺³
As a new comer, your discussion on the difference among models and prompt tuning is extremely helpful. Your video pins down the shortcoming of the current Retrieval-Augmented Language Modeling. It is very informative. Is any good way just to digest the html as raw? Is it always better to convert the html pages to text and following your process described in your video? Any tools do you recommend?
@henkhbit5748 Год назад ⁺²
Great video, love the comparison with open source. Would be Nice if u can show how to fine tune an os model, small model, with your own instruct dataset.
BTW: how to add new embeddings in a currrent chroma DB? DB.sdd(....)?
@clray123 Год назад ⁺³
I think it will get interesting when people start tuning these open source models with QLoRa and some carefully designed task-specific datasets. If you browse through the chat-based datasets these models are pretrained with, there's a lot of crap in there, so no wonder the outputs are not amazing. I believe the jury is still out to what extent a smaller finetuned model could outperform a large general one on a highly specialized task. Although based on the benchmarks of the Guanaco model family, it seems that the raw model size also matters a lot.
@pubgkiller2903 Год назад ⁺²
Biggest drawback is QLoRA will take a long time to generate the answer from Context
@yousif_12312 Год назад ⁺¹
Is it optimal to pass the user query to the retriever directly? Wouldn't asking the language model to decide what to search for (like using a tool) be better?
Also, if 3 chunks in 1 doc were found, I wonder if its better to order them sequentially as they show up in the doc..
@ЕгорГуторов-р7я Год назад ⁺¹
Thank you for such content. Is there any possibility to do the same without using only cloud-native platform and GPU? If I wanna launch smth similar on-premises with CPU?
@rudy9546 Год назад ⁺¹
Top tier content
@Danpage04 Год назад ⁺¹
Fine tuning is hard. But RLHF is what takes the model to the next level and on par with the top commercial models. Wanna try to do it?
@samwitteveenai Год назад ⁺²
RLHF isn't the panacea that most people make it out to be it. I have tried it for some things. I will make a video about it at some point.
@Danpage04 Год назад
@@samwitteveenai I guess RLHF is hard to implement and is still in research territory.
@Danpage04 Год назад ⁺¹
Could you please point me to a video you’ve done abouy how the embedding model works? Specifically, I want to know how does it transform a whole chunk of data (paragraph) into 1 embedding vector (instead of multiple vectors per token)?
@darshitmehta3768 Год назад
Hello Sam, Thank you for this amazing video.
I am also facing issue for open source model like the same way in video. The open source models are giving answers them self if the data is not present in the PDF or chromadb. Are you having idea how can we achieve thing like openai for open source and which model we can use for that?
@DaTruAndi Год назад ⁺²
Wouldn’t it make more sense to chunk tokenized sequences instead of the untokenized text? You don’t know the length of the tokenizations of each chunk, but maybe you should.
Also handling of special sequences, like ### Assistant, would they be represented as special tokens? If so, handling them in the token space eg as additional stop tokens for the next answer may make sense.
@samwitteveenai Год назад
Yes, but honestly most the time it doesn't matter than much. The tokens way is a perfectly valid way to do it but here I was trying to keep it simple.
You can use fancier ways for things like interviews. I have one project that has one set of docs that are financial interviews where I took the time to write a custom splitter for question / answer chunks and it certainly helps.
Another challenge with custom open source models too are the different tokenizers. Eg. the original LLaMA models have a 32k vocab tokenizer but the fully open source ones are using 50K+ etc. So we want to make the indexes once but test them on multiple models. So in cases like this using token indexing doesn't always help too much.
Often the key thing is to have a good overlap size and that should be tested
@bobchelios9961 Год назад
i would love some information on the RAG models you mentioned near the end
@ygshkmr123 Год назад ⁺¹
Hey Sam, Do you have any idea how can reduce inference time on open-source LLM model
@samwitteveenai Год назад
Multiple GPUs, Quantization, Flash attention and other hacks. I am thinking about doing a video about this . Any particular model you are using ?
@pranjuls-dt1sp Год назад ⁺¹
Excellent stuff!!🔥🔥 Just curious to know, is there a way to extract unstructured information like invoice data extraction / receipt labels / medical bills info description etc. Using open source LLMs? Like using langchain + wizard/vicuna to perform such nlp tasks?
@samwitteveenai Год назад ⁺¹
you can try the Unstructed package or something like an open source OCR model
@user-wr4yl7tx3w Год назад ⁺¹
Which LLM is instruct embeddings compatible with? Is it a common standard?
@samwitteveenai Год назад ⁺²
It will work with any LLM you use for conversational part. Embedding models are independent of the conversation LLM, they are for retrieval.
@Borbby Год назад
Hello, thank you for the great work !
I have a confusion about tokenizer and LLM, can they use the same model, like at 11:00 in the video, or can I use another model? Is there any difference between them?
@123arskas Год назад
Hey Sam, awesome work. I wanted to ask you something:
1- Suppose we've a lot of call transcripts of multiple agents
2- I want to summarize the transcripts of a month (lets say January)
3- The call transcripts can be from 5 to 600 in a month for a single agent
4- I want to use GPT-3.5 models not the other GPT models.
How would I use LangChain to deal with that much amount of data using Async Programming? I want the number of Tokens and number of Requests to OpenAI API to be below the recommended level so nothing crashes. Any place where I can learn to do this sort of task?
@samwitteveenai Год назад ⁺¹
Take a look at the summarization vids I made and especially the map_reduce stuff so that would do lots of small summaries which you can then do summaries of summaries etc.
@123arskas Год назад
@@samwitteveenai Thank you
@vhater2006 Год назад
Hello Thank your for sharing , so if i want to use langchain and HF "Just Open" a pipelines finally it get it .why not using big models from HF on you example a 40b 65b to get "better" results
?
@samwitteveenai Год назад
mostly because people won't have the GPUs to serve them. Also HF doesn't serve most the big models for free on their API
@dhruvilshah9881 Год назад
Hi, Sam. Thank you for all the videos - I have been with you from the first video. Learned so much from these tutorials. Can you create a video on Fine Tuning LLaMA/Alpaca/ VertexAI(text-bison) or any other feasible LLM for retrieval purposes? Retrieval purposes could be - 1) Asking something about the private data (in GBs/TBs) on local repository. 2) Extracting some specific information from the local data.
@samwitteveenai Год назад
Thanks for being around from the start :D. I want to get back more in to showing Fine-tuning especially now the truly open LLaMA models are out. I try to show something that people can run in Colab so probably won't do TBs of data. Do you have any suggested datasets I could use?
@Danpage04 Год назад ⁺¹
Can you try with the falcon-40B ?
@cdgaeteM Год назад ⁺¹
Thanks, Sam; your channel is great! I have developed a couple of APIs. Gorilla seems to be very interesting. I would love to hear your opinion through a video. Best!
@samwitteveenai Год назад ⁺³
Yes Gorilla does seem interesting I read the abstract a few days ago, need to go back and check it out properly. Thanks for reminding me!
@rakeshpurohit3190 Год назад
Will this be able to give insights into the given doc like writing pattern, tone, language etc?
@samwitteveenai Год назад
It will use the those from the docs and you can set those in the prompts
@HimanshuSingh-ov5gw Год назад
How much time would this e5 embedding model take to embed large files or larger no. of files like 1500 text files?
@samwitteveenai Год назад
1500 isn't that large, on a decent GPU probably looking at 10s of mins max. Probably a lot shorter depending on each file length. Of course once indexed just save them to use in the future etc.
@HimanshuSingh-ov5gw Год назад
@@samwitteveenai Thanks! Btw your videos are very helpful!
@kumargaurav2170 Год назад
The kind of understanding wrt what user is exactly looking out for is currently best performed by OpenAI & PaLM APIs between all the hype.
@samwitteveenai Год назад ⁺²
Totally agree. Lots of people are looking for open source models and it can work for certain uses, but GPT3/4, Palm Bison/Unicorn and Claude are the ones that work the best for this kind of thing.
@adriangabriel3219 Год назад
What dataset would you use for fine-tuning?
@samwitteveenai Год назад
Depends on the task. Mostly I use internal datasets for fine tuning.
@user-wr4yl7tx3w Год назад
Have you tried the Falcon LLM model?
@samwitteveenai Год назад ⁺¹
Yes Falcon7B was the original model I wanted to make the video with but it didnt work well.
@alexdantart Год назад
Please, tell me your Collab environment... even in Collab Pro I get:
OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 15.77 GiB total capacity; 14.08 GiB
already allocated; 100.12 MiB free; 14.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
@samwitteveenai Год назад
I usually use an A100. You will need the Colab Pro+ to run on Colab
@pubgkiller2903 Год назад ⁺²
Thanks Sam …. It’s great. Would you please implement the same concept with Falcon ?
@samwitteveenai Год назад ⁺²
I did try to do the video with Falcon7B but the outputs weren't that good at all.
@pubgkiller2903 Год назад
@@samwitteveenai one question, are these big models like Falcon, Stable Vicuña etc can work on windows laptop on Jupyter Notebook? Or they require Unix system only?
@fv4466 Год назад
@@samwitteveenai Wow! I thought it was highly praised.
@andrijanmoldovan Год назад ⁺¹
Would this work with "TheBloke/guanaco-33B-GPTQ" 4-bit GPTQ model for GPU inference(or other GPTQ model)?
@samwitteveenai Год назад ⁺¹
possibly but would need different loading code etc.

Следующие

Автовоспроизведение

OpenAI's Game Changing Updates. New Features! Bigger Savings!