LangChain + Retrieval Local LLMs for Retrieval QA - No OpenAI!!!

  • Published: 22 Aug 2024

Comments • 205

  • @sharifhamza25
    @sharifhamza25 a year ago +98

    Finally a tutorial without OpenAI. Open Source rules. Perfect.

  • @srikarpamidi1946
    @srikarpamidi1946 a year ago +47

    I’ve been following along since your first video and I can say that this is the best zero-to-hero course I’ve tried. Seriously thank you for making us feel like we’re on the cutting edge with you. There’s so much to learn in this field, and every day 100 new things are happening. Thank you for making it easy to follow and code along.

  • @MichaelDude12345
    @MichaelDude12345 a year ago +6

    Couldn't wait for it! This is the tech I have been waiting for! This is how LLMs can be powerful for us as developers even without OpenAI or any other big companies! I am here for it.

  • @alessandrorossi1294
    @alessandrorossi1294 a year ago +4

    Amazing, I've literally been spending the past day trying to learn exactly this and you boiled it down nicely!

  • @KA-kp1me
    @KA-kp1me a year ago +9

    Decrease the chunk size when splitting text, so that when the retriever pulls, say, 3 chunks out of the DB, those will be smaller and will fit into the context ;-) Awesome content!
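
    A minimal sketch of that idea in classic LangChain; the loader, chunk_size and chunk_overlap values are illustrative, not taken from the video:

      # Smaller chunks mean the retrieved chunks are more likely to fit the model's context window.
      from langchain.document_loaders import TextLoader
      from langchain.text_splitter import RecursiveCharacterTextSplitter

      documents = TextLoader("my_docs.txt").load()         # hypothetical source file
      text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
      texts = text_splitter.split_documents(documents)     # these chunks then go into the vector store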

  • @redfield126
    @redfield126 a year ago +4

    I started from your past tutorial and spent hours trying to switch to a local model (I am a noob 😅). Anyway, this is perfect timing with this video to check my homework. Eager to dive into it as soon as I leave the office!

    • @redfield126
      @redfield126 a year ago

      It was not me. I got similar weird output with other models. Trade off is the right conclusion. It is really hard to get rid of open ai or other giant models so far. Anyway. Let’s keep optimistic as huge progress is being made on a daily basis in the open source galaxy. Thanks again for the perfect content I was searching for ! Let’s keep testing and exploring as you say

  • @autonomousreviews2521
    @autonomousreviews2521 a year ago

    Thank you so much for this video! Been waiting for this :) Thumbs up before I even watch. Hoping for more on this general local llm vein, especially as new models are almost constantly coming out!

  • @microgamawave
    @microgamawave a year ago +3

    Please make another video with other models; I would be happy to see bigger models, or 13B ones. Love your videos❤

  • @disarmyouwitha
    @disarmyouwitha a year ago

    Thank you so very much for this! I have been wanting to dip my toes in with LangChain and embeddings -- this was very helpful and made it feel a lot more accessible.

  • @tejaswi1995
    @tejaswi1995 a year ago

    Thank you so much for this video Sam. Found the video at the right time!

  • @user-zi1xx6lr1b
    @user-zi1xx6lr1b a month ago

    Thank you for this best tutorial video

  • @thejohnhoang
    @thejohnhoang 5 months ago

    you are a life saver for this ty!

  • @MrOldz67
    @MrOldz67 a year ago

    Just want to add a comment to congratulate you: you're doing a great job for the community, and I personally learn much more watching your videos than from hours spent hunting for information and doing my own testing on GitHub. Thanks for this. I will give it a try with a Bloom model and report here so you guys know.

    • @MrOldz67
      @MrOldz67 a year ago

      Seems like text2text-generation doesn't support Bloom. Do you have a list of supported models for this purpose?

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      Thanks for the kind words. Bloom is a decoder-only model, so you will just use one of the notebooks that has 'text-generation' in the pipeline, not 'text2text-generation'.

    • @MrOldz67
      @MrOldz67 a year ago

      @@samwitteveenai hey Sam, thanks for the answer.
      Does that mean that I only have to change this setting to make a decoder-only model work?
      What will I lose by using text-generation instead of text2text?
      Or is it better to use an encoder-decoder model?
      Thanks for the answer!

    • @samwitteveenai
      @samwitteveenai  a year ago

      Better to look for one of my notebooks with the text-generation examples in there, as there are other things you need to change in how you load the model too.
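
      For reference, a rough sketch of loading a decoder-only model through a 'text-generation' pipeline; the checkpoint and generation settings are illustrative, and the exact loading steps differ per model as noted above:

        from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
        from langchain.llms import HuggingFacePipeline

        model_id = "bigscience/bloom-560m"   # a small Bloom checkpoint, just for illustration
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id)  # decoder-only => causal LM class

        pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128)
        local_llm = HuggingFacePipeline(pipeline=pipe)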

  • @Zale370
    @Zale370 a year ago

    Thank you for this unique and super high quality content you put out for us!

  • @MartinStrnad1
    @MartinStrnad1 a year ago +1

    This is amazing stuff, I am learning a lot from the colabs! For my use case, it would be amazing to see work with larger models too (databricks Dolly comes to mind)

  • @vijaybudhewar7014
    @vijaybudhewar7014 a year ago +1

    That's something unique... you are the best. I think you just missed one important thing: LangChain supports only text-generation and text2text-generation models from Hugging Face.

  • @alx8439
    @alx8439 a year ago

    Incredible! Thank you so much for putting it together

  • @mdeutschmann
    @mdeutschmann 6 months ago

    Would be great to have a video about the top 5 open-source LLMs that allow commercial usage. 🤓👏

  • @unclecode
    @unclecode a year ago +1

    It's great to see a focus on open source models. By the way, it would be better to use a stop token (such as "### Human:" and "### Assistant:") for "StableVicunaLM" instead of using regular expressions or other cleaning methods. Great video :)
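
    A minimal sketch of the stop-token idea using the transformers StoppingCriteria API; the tokenizer, model and inputs are assumed to come from the notebook, and the stop string follows the "### Human:" convention mentioned above:

      from transformers import StoppingCriteria, StoppingCriteriaList

      class StopOnTokens(StoppingCriteria):
          def __init__(self, stop_token_ids):
              self.stop_token_ids = stop_token_ids
          def __call__(self, input_ids, scores, **kwargs):
              # Stop as soon as the generated sequence ends with any of the stop token id sequences
              return any(input_ids[0][-len(ids):].tolist() == ids for ids in self.stop_token_ids)

      stop_ids = [tokenizer(s, add_special_tokens=False).input_ids for s in ["### Human:"]]
      output = model.generate(**inputs, max_new_tokens=256,
                              stopping_criteria=StoppingCriteriaList([StopOnTokens(stop_ids)]))

    (Multi-token stop strings can be sensitive to tokenization boundaries, so some regex cleanup may still be needed as a fallback.)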

  • @TomanswerAi
    @TomanswerAi a year ago

    Wild that there are so many models out there that can already be run locally

  • @DevonAIPublicSecurity
    @DevonAIPublicSecurity a year ago

    You are the man. I was working on this myself, but I had started out using BERT and basic NLP.

  • @Nick_With_A_Stick
    @Nick_With_A_Stick a year ago +2

    I've been following Auto-GPT repos such as Agent-LLM, which allows Auto-GPT to run with local models. It's very experimental; the models would do significantly better if they were trained with retrieval example questions. Better prompting may help, and a larger context window would help too. I believe a version of MPT-7B StoryWriter trained with instruction examples and Auto-GPT examples, plus a dataset of Python code Q&A already available in Alpaca format on Hugging Face, would allow the model to be trained to do many automated tasks, and for a very long time thanks to its context window.

    • @taiconan8857
      @taiconan8857 a year ago

      Has this presumption panned out as you portrayed it? I had a similar impression at the time but got derailed. Trying to pick up the pieces as it were and curious if this has been at least theoretically encouraging? (Lots to still catch up on.)

    • @Nick_With_A_Stick
      @Nick_With_A_Stick a year ago +1

      @@taiconan8857 There haven't been recent attempts at making the longer-context models more accurate at instruction-based questions and answers, given it would be an expensive dataset to make unless you did it all with the GPT-3.5 Turbo API. I suggest you read Microsoft's Orca paper: it's essentially the 13B LLaMA model trained on a freaking 5 million questions, where GPT-3.5 and GPT-4 created step-by-step instructions and explanations of why they answer the way they do. So instead of just mimicking ChatGPT, it is actually learning and getting better at reasoning. In their paper it was shown to perform similarly to ChatGPT. They plan to release the model weights and are currently working with their lawyers. This was about a week ago, so OpenAI might be a little angry 😂. But if this works out, open-source models and AI agents are going to be actually useful.

    • @taiconan8857
      @taiconan8857 a year ago +1

      @nicolasmejia-petit7179 Very nice! Really appreciate all this info. I've often wondered (since I don't precisely understand the methodology connecting token weights and the training data's weights): is it not possible that when token weights don't sufficiently resemble the inputs or expected outputs, the model begins to favor responses that include "I'm under the impression that" or simply "I don't know"?
      Thanks again for this *great* response! 👍💎⛏️

    • @Nick_With_A_Stick
      @Nick_With_A_Stick a year ago +1

      @@taiconan8857 Yeah, that is done with OpenAI's models, but the current open-source models are called "imitation models" because they only imitate the way the bigger model works (there's a paper about this); they have no understanding of reasoning, so if they're asked to perform something outside their training they are likely to hallucinate. This is something Orca likely fixed; we won't know until they publicize their weights. So it is important that models know their own limits, but sometimes, in the case of uncensored models, the performance is better than their censored counterparts, meaning any phrase along the lines of "as an AI model I cannot" is removed.

    • @taiconan8857
      @taiconan8857 a year ago

      @@Nick_With_A_Stick Sounds like a classic case of "working to treat the symptoms rather than the cause" but with data instead of disease. Doesn't seem to me like people don't have sufficient tools to do this, but perhaps it's more work than they're willing to put in. 😉

  • @patrafter1999
    @patrafter1999 a year ago

    Hi Sam. Thanks a lot for your awesome videos. Your channel is probably one of the best (if not the best) sources to stay on top of LLMs in this crazy time. I'd like to make a suggestion on your content.
    When we want to use LLMs in production, we do not rely on the 'knowledge' they output since it's not reliable. We need accuracy and up-to-dateness in production. And since people have custom libraries of their own, the immediate benefit comes from using those libraries rather than relying on coder-LLM outputs, which is largely troublesome.
    So instead of experimenting with 'What's the capital of England?', 'Tell me about Harry Potter', and 'Write a letter to Sam Altman', I think the following would be of more value to the audience.
    """
    Write a summary from the data focusing on the commonalities, differences and anomalies.
    library:
    function get_dataframe_from_csv(csvpath:str) -> pd.DataFrame
    function get_commonalities(dataframe:pd.DataFrame) -> list
    function get_differences(dataframe:pd.DataFrame) -> list
    function get_anomalies(dataframe:pd.DataFrame) -> list
    """
    The output is expected to be a few lines of code using those library functions in the prompt. And running the automatically generated code would be quite similar to what OpenAI Code Interpreter offers. The core idea here is that this is the way I would pursue if I were to develop a production system using an LLM, rather than relying on the outputs directly generated by the LLM.
    You could perhaps do some training instead of putting the library function definitions in the prompt.
    It would be exciting to see the outputs automatically correlated and organised in the way you wanted on your own data.
    I hope this helps.

  • @ozzykampha2776
    @ozzykampha2776 a year ago +3

    Can you test the mpt-7B Model?

  • @ChrisadaSookdhis
    @ChrisadaSookdhis a year ago

    This is perfect for what I am trying to do!

  • @julian-fricker
    @julian-fricker a year ago

    How are you doing this? My YouTube watchlist can't cope with all your content. 😂
    Thanks though, this stuff is fantastic.

  • @redneq
    @redneq a year ago

    I've binge-watched every video and have caught up, but really sir, did you have to be this good! ;)

  • @rudy.d
    @rudy.d a year ago +1

    8:14 you might want to build a template=PromptTemplate(template="your template with a {context} and a {question}", input_variables=["context", "question"]) and pass it as a prompt=template param to your QA chain instead of overriding the .prompt.template

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes I just wanted people to see that all it is underneath is a string and that string can be directly manipulated.
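
      For anyone following along, a minimal sketch of the suggestion above in classic LangChain; local_llm and retriever are assumed from the notebook, and the template wording is just an example:

        from langchain.prompts import PromptTemplate
        from langchain.chains import RetrievalQA

        template = "Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}\nHelpful answer:"
        prompt = PromptTemplate(template=template, input_variables=["context", "question"])

        qa_chain = RetrievalQA.from_chain_type(
            llm=local_llm,
            chain_type="stuff",
            retriever=retriever,
            chain_type_kwargs={"prompt": prompt},  # pass the prompt in, rather than editing .prompt.template afterwards
        )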

  • @ryanbthiesant2307
    @ryanbthiesant2307 a year ago

    Hi, I am having the same issue as you: there is a difference between information retrieval and information analysis. These models are unable to compare and contrast two documents, so you will run into a brick wall. They cannot think critically, and that is probably our problem, because they seem like they can think, but they are only running through similar expected chains of words.
    Perhaps in one of your models you are able to set it up to think critically. I imagine it would be like when AutoGPT is talking to itself? But you would code the bots to chat with each other in order to generate the critical evaluation of two documents, or ideas. I hope you get what I mean. For example, you would write some code to look for keyword identifiers, like "critically", "evaluate", or "compare", or any of those essay-type terms. Then, on hearing that identifier, you engage another chat personality. Those two personalities would then work out the opposing views and place this in the chat. There are a lot of examples of rhetorical analysis that you can use as a template for a prompt.

  • @picklenickil
    @picklenickil a year ago

    Congratulations on a brilliant video

  • @Atlas3D
    @Atlas3D a year ago +2

    it would be interesting to give a small model a "tool" of using a larger model for more context when needed - so that we could almost like ratchet up the intelligence needed on a per generation basis. Maybe even doing parts of the generation on smaller models then passing that information up to larger and larger models until some set threshold is reached. I feel like that is going to require some kind of lora fine tuning tho or else it will be fairly brittle.

  • @utuberay007
    @utuberay007 a year ago

    A good benchmark chart will help

  • @backbackduck5167
    @backbackduck5167 a year ago

    Appreciate your sharing. Embeddings with hkunlp/instructor-xl and an LLM with MBZUAI/LaMini-Flan-T5-248M are the best combination I have tested. Embedding takes some time, but searching basically takes a few seconds with this tiny model.
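
    A rough sketch of wiring up that combination in classic LangChain; 'texts' (the split documents) is assumed from earlier in the notebook and the max_length value is illustrative:

      from transformers import pipeline
      from langchain.embeddings import HuggingFaceInstructEmbeddings
      from langchain.vectorstores import Chroma
      from langchain.llms import HuggingFacePipeline

      embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
      vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db")

      # LaMini-Flan-T5 is a seq2seq model, so it goes through a text2text-generation pipeline
      pipe = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-248M", max_length=256)
      local_llm = HuggingFacePipeline(pipeline=pipe)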

  • @remikapler
    @remikapler a year ago

    Hey, love the content. When you talk about an earlier video, it might make sense to add a popup on the video showing where the earlier video is. Otherwise, well done, and thank you.

    • @samwitteveenai
      @samwitteveenai  a year ago

      good point. Will try to do this more going forward.

  • @MrPsycic007
    @MrPsycic007 a year ago

    Much awaited video

  • @thequantechshow2661
    @thequantechshow2661 a year ago

    The hashes are Markdown format! That’s actually how openai responds as well

  • @silvacarl
    @silvacarl 4 months ago

    Very cool!

  • @mahroushkagaurav3601
    @mahroushkagaurav3601 a year ago

    fantastic - thank you!

  • @jakekill8715
    @jakekill8715 a year ago

    This looks great! Thank you for the tutorials! Have you thought about trying langchain with different architectures like mpt? The MPT-7b-chat model seems to be quite good at coding and might have autogptq support to be able to be quantized soon so running it with langchain would be great! Thank you again for all the help

    • @snippars
      @snippars a year ago

      Yes also interested in the MPT 7b chat. Gave it a try but failed to get it to work due to what seems to be a corrupt config file in the repo. Can't spot where the error is.

  • @mimori.com_
    @mimori.com_ a year ago

    Very helpful walkthrough! Your vid makes me think I can do it myself if I follow you! 🎉 On the video production, one thing I noticed is that the voice audio is distorted and weakened several times. I am not a native English speaker, so this makes me lose context. Your microphone may be too near, or the recording FX / mic attenuator / auto leveler may need adjusting.

  • @AdrienSales
    @AdrienSales a year ago

    It's gonna be a busy week-end !😂

  • @DevonAIPublicSecurity
    @DevonAIPublicSecurity a year ago

    You are the man ...!!!

  •  a year ago

    Thank you for yet another excellent video! It's great to learn about other models and how to get started with them. Can you please provide more information about why you chose Flan-T5 and the associated modeling utilities? I apologize for my lack of technical expertise, but I would appreciate a better understanding of how they operate in tandem.

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      I am making a part 2 to this and will explain more in there.

    • @mukkeshmckenzie7386
      @mukkeshmckenzie7386 a year ago

      ​@@samwitteveenai any ETA on the part 2?

  • @whiteshadow5881
    @whiteshadow5881 11 months ago

    I hope there is a future edit where this is done using the LLaMA 2 model

    • @samwitteveenai
      @samwitteveenai  11 months ago

      I actually have done some of these with LLaMA 2. Look for the RAG & LLaMA-2 vids

  • @abhijitkadalli6435
    @abhijitkadalli6435 a year ago +1

    Gotta try using the MPT-7B-StoryWriter model, right? The 65k token length will surely help?

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      I showed StoryWriter briefly in the MPT-7B video. It is super slow, not really great for this kind of use.

  • @onroc
    @onroc a year ago

    FINALLY!! Thanks!

  • @waleed5849
    @waleed5849 a year ago

    Thanks. U r the best

  • @joser100
    @joser100 a year ago

    Have you considered, instead of using embeddings (regardless of the model), trying to fine-tune a model? More specifically, I heard the idea of using S-BERT (a simple encoder-only model) to get the embeddings via fine-tuning instead. This would be efficient especially when the corpus of data is not very large (say below a million words); the idea is that fine-tuning a small model would be very fast and would eventually deliver even better results, be easy to update with new data, etc. Have you heard of and considered this alternative?

  • @snippars
    @snippars a year ago

    Brilliant thanks!

  • @dadimanoj9051
    @dadimanoj9051 a year ago

    Great work thanks, can you also try red pajama or MPT models

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes I will look at those soon for this kind of task

  • @ynboxlive
    @ynboxlive a year ago

    Awesome content!

  • @ugurkaraaslan9285
    @ugurkaraaslan9285 5 months ago

    Hello, thank you so much for your video.
    I am getting an error on the local_llm("text") line. I think something is missing in the LangChain HuggingFacePipeline.
    ValueError: The following `model_kwargs` are not used by the model: ['max_lenght', 'return_full_text'] (note: typos in the generate arguments will also show up in this list)
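
    The error message's own hint applies here: 'max_lenght' is a misspelling of 'max_length', and transformers rejects generation kwargs it does not recognise. A minimal sketch of building the pipeline directly with the corrected argument (the model name is just a placeholder):

      from transformers import pipeline
      from langchain.llms import HuggingFacePipeline

      pipe = pipeline(
          "text2text-generation",
          model="google/flan-t5-large",   # whichever model the notebook actually loads
          max_length=256,                 # note the spelling: max_length, not max_lenght
      )
      local_llm = HuggingFacePipeline(pipeline=pipe)
      print(local_llm("What is the capital of England?"))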

  • @MrAmack2u
    @MrAmack2u a year ago

    unlimiformer might be interesting to look at next...unlimited context works with encoder/decoder based models.

  • @reezlaw
    @reezlaw a year ago +1

    Sorry if this is an ignorant question, but is there a reason for not using 4-bit quantised models?

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      Not an ignorant question at all. In my tests for other things, the current way of doing 4-bit tends to change the output, usually just making it shorter. I will start to make some vids with the 4-bit models soon, as I can see it allows a lot more people to use the models with free Colab etc. Also there is a new 4-bit library coming in the next few weeks.

    • @reezlaw
      @reezlaw a year ago

      @@samwitteveenai 4bit models have been a godsend for me as I like to run them in TGUI locally

  • @fernandosanchezvillanueva4762
    @fernandosanchezvillanueva4762 a year ago

    The best video!!

  • @ComicBookPage
    @ComicBookPage a year ago

    Great video. I'm curious whether certain types of input format for the knowledge-base documents work better than others: things like longer sentences versus short sentences, and so on.

    • @samwitteveenai
      @samwitteveenai  a year ago

      generally you want things with clear meaning in each section of text, but don't let that stop you, just try it out with a variety of docs etc.

  • @somnathdey107
    @somnathdey107 a year ago

    Thank you so much for this nice video. Much appreciated and needed as a lot of things are happening in this area nowadays. However, I am trying the same code of WizardLM in Azure ML, but I am not getting the kind of result as you have shown in your video. A little help is much appreciated.

    • @samwitteveenai
      @samwitteveenai  a year ago

      Try running the Colab. What outputs are you getting different?

  • @alizhadigerov9599
    @alizhadigerov9599 a year ago

    Great video, Sam! One question: is it worth using another embedding model for text vectorization and use gpt-3.5-turbo model for the rest of the tasks? (agents, qa etc.) The reason behind that is - text vectorization takes too long when using openai's embedding model.

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes, totally. I think the Instructor embeddings are actually doing better than OpenAI's in many cases as well.

    • @alizhadigerov9599
      @alizhadigerov9599 a year ago

      @@samwitteveenai Including vectorization speed? I understand, it depends on specs, but what specs would you recommend to achieve much faster model than openai's one?

  • @mjgolab
    @mjgolab a year ago

    Sam you are the best

  • @clray123
    @clray123 a year ago

    One disadvantage that needs to be mentioned vs. using OpenAI's API (besides of having to wrangle dumber/buggy models) is the model startup time. If you want to have an always-ready model loaded in the GPU waiting for your input, it gets expensive pretty quick. In fact the most baffling thing about OpenAI is how they can afford such availability at scale. But perhaps they are just bleeding cash like crazy, who knows...

    • @samwitteveenai
      @samwitteveenai  a year ago

      Agree, lots of disadvantages of using self hosted models.

  • @rikardotoro
    @rikardotoro 10 months ago

    Thanks for the videos! Do you know how the models can be "unloaded"? Should I just delete the folders? They are taking up a lot of space on my drive and I haven't been able to figure out the best way of deleting them.

  • @ikjb8561
    @ikjb8561 a year ago +1

    Is there a way we can store the models locally instead of downloading when script runs? I am running on command prompt in Windows. Also how do we get bits and bytes optimized for cuda? Thanks

  • @user-jn1gq3wc2q
    @user-jn1gq3wc2q 8 months ago

    Hello there. How do we validate the source documents that we receive in the response?
    I'm getting non-matching documents as sources in the RetrievalQA chain.

  • @marloparlo8594
    @marloparlo8594 a year ago

    Amazing. I was already using WizardLM following your guide to use Huggingface Embeddings. Good to know I made the right call.
    Could you show us how to implement it with Chat memory using this method? I already have the Chat Memory tutorial from you but I don't know how to create the correct chain to combine both.

    • @samwitteveenai
      @samwitteveenai  a year ago

      Good idea I will add that to a future video

    • @prakaashsukhwal1984
      @prakaashsukhwal1984 a year ago

      Great video Sam thanks for the relentless effort in helping us all.. +1 for the chat memory request and conversational QA.. hoping to see a video soon!

  • @piotrzakrzewski2913
    @piotrzakrzewski2913 a year ago

    Great content! Have you already tried fine-tuning / training your own small specialised model?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes that is how I do it at work, but I can't release those models currently. I am working on a solution like that to show soon

  • @MariuszWoloszyn
    @MariuszWoloszyn a year ago

    At least in principle, one could use the same model to extract embeddings and answer questions.

    • @samwitteveenai
      @samwitteveenai  a year ago

      This doesn't usually work well and the training for the 2 tasks is quite different.

  • @henkhbit5748
    @henkhbit5748 a year ago

    Great comparison of a complete open-source solution with different LLMs. What about MPT-7B with 65k tokens? If an LLM is big, I thought you could stream it from Hugging Face, or am I wrong? Thanks for sharing and showing us the new innovations in the fast-moving world of LLMs 👏

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      65k is too slow. You can stream some from HF, but you are limited as to how much you can stream, the context size etc. Also your data is then still going to the cloud. I will make a part 2 to this video with some other new models this week.

    • @henkhbit5748
      @henkhbit5748 a year ago

      @@samwitteveenai Thanks, looking forward for the new LLM's..

  • @cinematheque20
    @cinematheque20 a year ago

    What is the difference between HuggingFaceEmbeddings and HuggingFaceInstructEmbeddings? Are there any pros to using one over the other?

  • @DevStephenW
    @DevStephenW a year ago

    Would love to see you try different models with the question answer system. This was great. Glad to see you found one that works reasonably well. I tried T5 and mrm8488/t5-base-finetuned-question-generation-ap which worked decently well, slightly better than the plain T5. Will try the wizard but may not be able to on my GPU. Looking forward to your other tests.

    • @samwitteveenai
      @samwitteveenai  a year ago +3

      This is great that you shared what you tried as well. I think if we all mention what we have tried it would be great and help everyone to find the best models.

  • @jawadmansoor6064
    @jawadmansoor6064 a year ago +1

    Any idea how to use ggml such as TheBloke/Wizard-Vicuna-13B-Uncensored-GGML (or even smaller for llama.cpp models like 7b Wizard)?

    • @jawadmansoor6064
      @jawadmansoor6064 a year ago +1

      or this TheBloke/wizardLM-7B-GGML. please make a tutorial for this.

  • @nithints302
    @nithints302 11 months ago

    Can you do something on Flan-T5-XXL? My Flan-T5-XL performs better than the XXL, so I am missing something and want to understand why.

  • @cruzo333
    @cruzo333 a year ago

    Great video man, tx !! however, I couldn't run it on my Macbook Pro (not M1). when trying to pip install transformers and xformers im getting "ERROR: Could not build wheels for xformers, which is required to install pyproject.toml-based projects" 😞

  • @kevinehsani3358
    @kevinehsani3358 a year ago

    Thanks for the great video. I am a bit confused about the 3 tokens used for the retriever! Is that the number of tokens fed into the model? I believe ChatGPT does not explicitly include a retriever component; I may be wrong. If you could explain the function of these 3 tokens it would be highly appreciated; it seems extremely low for any task.

    • @Ripshot14
      @Ripshot14 a year ago +1

      You must be referring to the line `retriever = vectordb.as_retriever(search_kwargs={"k": 3})`. This argument is not passed to ChatGPT or any LLM for that matter. The 3 specifies the number of documents that the "retriever" should retrieve from the Chroma vector database prepared earlier in the video. Effectively, the 3 documents found to be most relevant to a given query will be pulled up and included as additional information ("context") when querying the LLM.
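
      A minimal illustration of what that retriever call does on its own, assuming the vectordb built earlier in the notebook (the query text is made up):

        retriever = vectordb.as_retriever(search_kwargs={"k": 3})
        docs = retriever.get_relevant_documents("What is Flan-T5?")
        print(len(docs))   # -> 3 chunks, which the chain then stuffs into the prompt as context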

  • @atultiwari88
    @atultiwari88 a year ago

    Hi, thank you for your tutorials. I have been following them for quite some time now and have watched your whole playlist on this. However, I am unable to figure out the most economical approach for my use case.
    I want to create a Q&A chatbot on Streamlit which answers only from my custom single document of about 500 pages. The document is final and won't change. From my understanding so far, I should choose either LangChain or LlamaIndex, but I would have to use the OpenAI API to get the best answers, and that API is quite costly for me. So far I have thought of using Chroma for embeddings and somehow storing the vectors as pkl or JSON on Streamlit itself for reuse, so I don't have to pay again for vectors/indexing. I don't have enough credits to test different methods myself.
    Kindly guide me. Thank you.

  • @chineseoutlet
    @chineseoutlet a year ago

    hi Sam, I tried to load WizardLM in my free Colab account. But, I tried couple times and it all failed to load. I wonder if I need to have pro or pro+ account to load your examples?

  • @user-ej9wq6tt3j
    @user-ej9wq6tt3j a year ago

    Can you please do a video on conversational QA with memory? I have tried it, but it is not working as per the documentation.

  • @geekyinnovator3437
    @geekyinnovator3437 a year ago

    Is token size an issue if we want to train over large examples

  • @Ruzzeem
    @Ruzzeem a year ago

    great !

  • @john.knappster
    @john.knappster a year ago

    Are there any examples of LangChain working well to make code changes (or suggesting to make code changes) to medium to large codebases based on a prompt? For example, modify the code to make the given test pass.

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      This is an interesting idea, I haven't tried it. It might work better with a code model like Starcoder or GPT-4. The bigger models can certainly do unit tests etc.

  • @ahmedkamal9695
    @ahmedkamal9695 10 months ago

    I have a question: does retrieval bring data from the vector DB and then pass it to the model to generate text?

    • @samwitteveenai
      @samwitteveenai  10 months ago

      yes it puts it in the context of the prompt

  • @TheAnna1101
    @TheAnna1101 a year ago

    thanks for such a great tutorial. could you show how to do the same with ConversationalRetrievalChain with return_source_documents=True? thank you

  • @____r72
    @____r72 9 months ago

    absolute beginner q, how does one go about pouring this concoction into a flask?

    • @____r72
      @____r72 9 months ago

      or rather can a notebook be retrofitted to have REST functionality? I've googled but trying to get the lay of the land before embarking on making the first thing that comes to mind

  • @user-yw6fq5nd7m
    @user-yw6fq5nd7m a year ago

    I tried this code and it's working fine for the docs that I have given, but it's also giving answers to other questions whose details are not present in the docs. How can I restrict it to only my docs, so that for outside questions it answers something like "I don't know"? Can you please guide me? I have used the WizardLM model.

    • @darshitmehta3768
      @darshitmehta3768 a year ago

      @Sam Witteveen Do you have any idea about this? Can you please guide me?

  • @alipuccio3603
    @alipuccio3603 a year ago

    Is WizardLM an encoder-decoder model? Or just a decoder model? Does this distinction matter? If anybody could get back to me, I would be so appreciative.

    • @samwitteveenai
      @samwitteveenai  a year ago

      If it is the one based on LLaMA then it is a decoder model

  • @almirbolduan
    @almirbolduan a year ago

    Hi Sam! In your opinion, which of these models would perform better for the Portuguese language? Nice video! Thanks!

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      The vector stores will be the same; it really comes down to what embeddings you use. Check out the leaderboard at huggingface.co/spaces/mteb/leaderboard and you probably want to try some multilingual models.
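
      For example, a multilingual sentence-transformers model can be dropped in where the video uses the Instructor embeddings; the model below is just an illustration of the kind of model listed on that leaderboard, with 'texts' assumed from the notebook:

        from langchain.embeddings import HuggingFaceEmbeddings
        from langchain.vectorstores import Chroma

        embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
        )
        vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db_pt")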

  • @theh1ve
    @theh1ve a year ago

    How would you go about updating or removing a document from chroma. Say if a document is updated or is now out of date?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes take a look at the API they have functions to delete a doc etc.

    • @theh1ve
      @theh1ve a year ago

      @@samwitteveenai Thanks 👍 Had a gander and can see how you can delete collections and return the IDs based on a where clause for metadata search; guess you could then take the returned IDs and delete or update them. Might be worth a video covering this, as I can't find a single video on YouTube for it, just a thought!

  • @user-sq8qi7cr8x
    @user-sq8qi7cr8x 4 months ago

    Can we use these codes in a kivy Python application??

  • @MichaelDude12345
    @MichaelDude12345 a year ago

    I have been struggling to figure out if there is a performant way to tune and run a 13B model on a modern graphics card with 12 GB of VRAM (I think I have seen the 4-bit mode suggested for this, but thought that the results might be poorer than just using a 7B model). I love the speed on the device, but I know for some stuff I will be bumping right up against the limits of what I can do with a 7B model. I could run the model on a dual-GPU setup, as I have seen shown to be possible by @Jeff Heaton, but I am under the impression there would be a significant performance hit. I also really like having the model run on the one GPU I have, since it is very conservative with power and I could conceivably afford to run it and host some self-hosted services with it. Does anyone have any suggestions? Thank you!

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      Some new 4bit stuff is coming in a few weeks. That might help you a lot.
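
      The 4-bit loading that later landed in transformers/bitsandbytes looks roughly like the sketch below; the checkpoint is a placeholder, and recent transformers, accelerate and bitsandbytes versions are assumed:

        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        model_id = "TheBloke/wizardLM-7B-HF"            # placeholder 7B causal LM
        bnb_config = BitsAndBytesConfig(load_in_4bit=True)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=bnb_config,
            device_map="auto",   # spread layers across the available GPU(s) and CPU
        )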

  • @loicbaconnier9150
    @loicbaconnier9150 a year ago

    Thanks. How about using a local LLM to find chunks using embeddings, and OpenAI for the final step, with the chunk tokens and the question as input? Would that be good enough?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yes I did that in the previous video before this, it works well.
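
      A rough sketch of that hybrid setup, with local Instructor embeddings for retrieval and an OpenAI model only for the final answer; 'texts' is assumed from the notebook and the model choices are illustrative:

        from langchain.embeddings import HuggingFaceInstructEmbeddings
        from langchain.vectorstores import Chroma
        from langchain.chat_models import ChatOpenAI
        from langchain.chains import RetrievalQA

        embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large")
        vectordb = Chroma.from_documents(texts, embeddings)

        qa_chain = RetrievalQA.from_chain_type(
            llm=ChatOpenAI(temperature=0),   # only the retrieved chunks + question go to OpenAI
            chain_type="stuff",
            retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
        )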

  • @vijaybudhewar7014
    @vijaybudhewar7014 a year ago

    Why are we using instruction embedding and T5 models which are different?

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      There is no reason to use the same model for the embeddings and the main LLM; embedding models are trained for different tasks. Actually, Instructor uses a T5 architecture underneath (but that's not related to the other LLM).

    • @vijaybudhewar7014
      @vijaybudhewar7014 a year ago

      @@samwitteveenai Thank you so much!!

  • @jasonl2860
    @jasonl2860 a year ago +1

    can I load the models in cpu?

    • @samwitteveenai
      @samwitteveenai  a year ago

      probably but you will need a lot of ram and it will be very slow.

  • @FrancChen
    @FrancChen a year ago

    What would you use with any of these to create a UI for the chat? Gradio?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Yeah, you could use Gradio or Streamlit, or build something like a Next.js app for a proper frontend.
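
      A minimal Gradio sketch around the chain from the notebook; qa_chain is assumed to already exist:

        import gradio as gr

        def answer(question):
            return qa_chain.run(question)   # the RetrievalQA chain built earlier

        gr.Interface(fn=answer, inputs="text", outputs="text",
                     title="Local Retrieval QA").launch()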

  • @BOMBOMBASE
    @BOMBOMBASE a year ago

    Awesome, thanks for sharing. Can Wizard run on CPU?

  • @NavyaVedachala
    @NavyaVedachala a year ago

    Which videos are a prerequisite for this one?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Take a look at the LangChain playlist and the few ones before this on that playlist.

  • @giraymordor
    @giraymordor a year ago

    Thank you :D I wonder how I can increase the answer size? Is it possible to get a longer answer?

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      This depends a lot on the model and the prompt we use on it. I plan to make a part 2 to this video this week.

    • @giraymordor
      @giraymordor a year ago

      @@samwitteveenai Thank you, we are looking forward to it :D

  • @harshitgoyal9341
    @harshitgoyal9341 8 months ago

    Why does this Colab file not work? When I am trying to run it, it shows multiple errors.

    • @samwitteveenai
      @samwitteveenai  8 months ago

      A lot of the code has been updated, so that could be the reason. I will try to make some newer vids on this topic soon.

  • @charsiu8444
    @charsiu8444 a year ago

    Attempting to run the StableVicuna and WizardLM Models and I get the error: Make sure you have enough GPU RAM to fit the quantized model. What GPU/VRAM are you running, and how do I set it as a parameter like 'gpu_memory_0'? Also, is there site which documents all parameter inputs in the transformers? (Sorry.. N00b here in Python).

    • @samwitteveenai
      @samwitteveenai  a year ago +1

      it sounds like your GPU isn't big enough to run that one.

    • @charsiu8444
      @charsiu8444 a year ago

      @@samwitteveenai Thanks. How much VRAM do I need to run those examples? Looking to get another graphics card soon.

  • @hiranga
    @hiranga a year ago

    Hey Sam, what is the licensing like to use these other models for commercial usage?

    • @samwitteveenai
      @samwitteveenai  a year ago

      Some are really good, like the T5 models; ones like Wizard etc. are not for commercial use. I am planning a part 2 to this video which will address this.

  • @imanonymus5745
    @imanonymus5745 a year ago

    How can we fine-tune a local LLM? And can you please make another video using a search engine or other tools?

  • @aziz-xd4de
    @aziz-xd4de a year ago

    What can I do if I want to do it with French documents and questions?

    • @samwitteveenai
      @samwitteveenai  a year ago

      I would use a fine-tuned version of the new LLaMA 2 model.