Fully local RAG agents with Llama 3.1
- Published: 12 Sep 2024
- With the release of Llama3.1, it's increasingly possible to build agents that run reliably and locally (e.g., on your laptop). Here, we show how to build reliable local agents using LangGraph and Llama3.1-8b from scratch. We build a simple corrective RAG agent with Llama3.1-8b and compare its performance to the larger models Llama3-70b and GPT-4o. We test our Llama3.1-8b agent on a corrective RAG challenge and show performance and latency versus a few competing models. On our small / toy challenge, Llama3.1-8b performs on par with much larger models, with only slightly increased latency. Overall, the Llama3.1-8b model is a strong option for local execution and pairs well with LangGraph to implement agentic workflows.
Blog post:
ai.meta.com/bl...
Ollama:
ollama.com/lib...
Code:
github.com/lan...
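The corrective RAG control flow described above can be sketched in plain Python. Everything here is a hypothetical stand-in for the actual LangGraph nodes in the repo (the real agent asks Llama3.1-8b to grade relevance and uses a real web-search tool); this just illustrates the retrieve → grade → fall-back-to-web-search loop:

```python
# Toy sketch of a corrective RAG control flow. All functions are
# illustrative stand-ins, not the repo's actual code.

def retrieve(question, index):
    # Naive keyword match standing in for a vector-store lookup.
    words = question.lower().split()
    return [doc for doc in index if any(w in doc.lower() for w in words)]

def grade_documents(question, docs):
    # Binary relevance grade; the real agent asks the LLM for a yes/no.
    words = question.lower().split()
    return [d for d in docs if any(w in d.lower() for w in words)]

def web_search(question):
    # Placeholder for the web-search fallback node.
    return [f"web result for: {question}"]

def corrective_rag(question, index):
    docs = grade_documents(question, retrieve(question, index))
    if not docs:              # no relevant docs -> fall back to web search
        docs = web_search(question)
    return docs               # the real agent would now generate an answer

index = ["LangGraph builds agent workflows as graphs.",
         "Ollama runs Llama models locally."]
print(corrective_rag("what is langgraph", index))
print(corrective_rag("capital of france", index))
```

In the video this loop is expressed as a LangGraph state graph, with the grade step deciding which edge to follow.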
I really like the LangSmith test section in the package. Great job!
OOhh nice! I had some issues with llama3-groq-tool-use even after pulling with ollama and trying, it kept returning an empty list instead of the actual tool calls. Just tested this code though and it works great! Love it! Thanks!!! Love the videos from the channel!
Thanks so much for this open information ❤
I really like the way you explain it, makes it easy to learn the concepts.
Nice. I need to test this as well on a more complicated agent setup. I had a case where some models would not complete, ran into loops, and had too many errors trying to call tools... I'll have to give 3.1 a go at it.
True, it goes into loops; I tested it.
Llama will likely make its way into online AI products (it already does). But until someone builds a one-click Llama download and install, the general public will likely never run a local AI. And they will certainly never jump deep into coding just to build simple agents. It is just way over the heads of most general computer users. And if a one-click install is not done soon, people will gravitate towards the online subscription proprietary AI offerings (OpenAI, Claude, Gemini, etc.) and never look back.
I think that is what really killed Linux in competing in the OS space. Mac and Windows, even in the 90's, were basically a one-click installation process, whereas Linux used the command line, bin this, bash that, and installation was cumbersome and not easy... I know because I did it in the mid-90's. People will go for what is easy (Mac, Windows, an online AI model), and once they are hooked into a particular AI model, it will be darn hard to get them to change.
It's a real shame because Llama is a pretty terrific LLM, but local installation is just a nightmare for the majority of the general computing public.
My guess is that Llama will be running on so many different devices without the consumer noticing, the same way Linux is powering consumers' fridges & dishwashers without them knowing about it.
I think there is a missing step in the RAG flow. If the user knows he is "talking" to some documents, he might prompt "What is the summary?". In this case, the grader will always answer "No", and the web search will be useless. There has to be an additional step to evaluate the question: if it is similar to the one I just mentioned, then you would simply fetch a block of text from those documents and send it to the LLM to summarize.
I have a BERT classifier built specially for this on HF: cnmoro/bert-tiny-question-classifier
Great video, including the evaluation and closed-source comparisons
A few things have to change:
1. Use:
from langchain_ollama import OllamaEmbeddings  # instead of: from langchain_nomic.embeddings import NomicEmbeddings
...
embedding=OllamaEmbeddings(model='nomic-embed-text'),
I would like to see a sample of how to use this to produce large texts that follow a structure or script... without losing coherence, by re-evaluating the progress as it goes.
Thanks for your great video. Would you also recommend Llama3.1 for RAG on documents in the German language? Every time we tried this, the results were much worse compared to using LLMs like GPT-4o or GPT-4o-mini. And could you explain why you are using the OpenAI embeddings? If I want to use this demo as a RAG app for asking questions about local documents, do I only have to replace the WebBaseLoader with a DocumentLoader?
Image embeddings not working (text is fine). Bro, LLaVA multimodal RAG, pls!
Any chance you might compare hosted api Llama 3.1 405b and Mistral Large 2 123b on same evals?
super cool
I find the fascination with parameter numbers boring. What I would like is a way to measure how much data a model can *hold*. Is there anything like that out there?
The parameter count is kind of a proxy for how much data the model is holding. The more parameters, the more the model "remembers". Andrej Karpathy plainly states that it is kind of a compression of all the data it has seen in training.
You should check out 3b1b's latest video about LLMs.
He kind of explains how LLMs store information, and with his explanation it makes sense why the number of parameters is such a big deal for LLM performance and the ability to memorize.
Why always OpenAI embeddings? Why not use FAISS and an open-source one instead 😊
I think he said that he was just trying to keep it consistent for evaluation against the other Llama and GPT model runs that he was comparing benchmarks with towards the end of the video.
But yes, I agree on using locally installed embeddings. I have utilized FAISS; however, the most recent embeddings that I just figured out how to install locally are the BGE embeddings on HuggingFace. Good luck!
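As a toy illustration of what a local-embedding retrieval step does, here is a pure-Python cosine-similarity search over hand-made vectors. The 3-d vectors and document names are invented for the example; in practice the vectors would come from a local model (e.g., BGE or nomic-embed-text) and the brute-force loop would be replaced by a FAISS index:

```python
import math

# Toy nearest-neighbor search standing in for FAISS + a local embedding
# model. Vectors below are hand-made, purely illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {
    "ollama docs":    [0.9, 0.1, 0.0],
    "faiss tutorial": [0.1, 0.9, 0.0],
    "bge model card": [0.0, 0.2, 0.9],
}

def search(query_vec, top_k=1):
    # Rank documents by cosine similarity to the query vector.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:top_k]

print(search([0.85, 0.15, 0.05]))
```

Swapping OpenAI embeddings for local ones only changes where the vectors come from; the retrieval math stays the same.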
Thank you for sharing this informative video and its hands-on code. I've hit this connection error: "Error running target function: [WinError 10061] No connection could be made because the target machine actively refused it". Can anyone please guide me on how to fix it?
The link to code on GitHub is broken. Can you please fix it?
Is it just me or is the code not accessible?
Can I run 405B LLMs with 8gb of ram? 🤣
No😂
Yes, but forget abt ur system 😂
Unfortunately, no. I tried this months ago with the Llama-2 70b model on a workstation that has an NVIDIA RTX 3080 and 16GB of system memory. I tried to use the locally installed LLM with a conversational RAG chain. The first few questions ran relatively quickly, but it would start to hallucinate after that. It was also probably the fact that I was attempting to load really big PDF files, which I am sure ate up all the resources. And we are working with a narrow but deep sub-domain of medical information.
Go to Groq and get an API key for the Llama 3 models if you are really interested. I am using a Llama 3 API from Groq for single-shot question evaluation on a CSV file (reasonable size: 250 rows with 4 columns of metadata, single-word labels, and one "description" column that contains longer explanations with semantic meaning), and it works well; meaning that the chain returns results that I would expect, or reasonably so, at this point. Constant work in progress. Good luck! Lance does a great job here, but I also like Sam's channel at @samwitteveenai
Use Groq
lol… try with 200gb
You can get away with running the 7B model
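Rough arithmetic backs up the replies above: weights alone take roughly (parameters × bits per parameter / 8) bytes, so a 405B-parameter model is about 810 GB at fp16 and still about 200 GB at 4-bit quantization, while a 7B model at 4 bits is around 3.5 GB and fits in 8 GB of RAM:

```python
def weight_gb(params_billion, bits_per_param):
    # Approximate memory for the weights alone: params * bits / 8 bytes.
    # Ignores KV cache and activations, which add more on top.
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"405B @ fp16 : {weight_gb(405, 16):.0f} GB")
print(f"405B @ 4-bit: {weight_gb(405, 4):.1f} GB")
print(f"7B   @ 4-bit: {weight_gb(7, 4):.1f} GB")
```

That is why the 405B model is realistically an API/hosted option, while 7B-8B models are the ones Ollama users run on laptops.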
I got this error while running the example code: "ValueError: Node `retrieve` is not reachable".
Can anyone help me figure out what happened?
Never mind, the code I pulled from GitHub has one typo:
# Build graph
workflow.add_edge(START, retrieve)  -->  must change to: workflow.add_edge(START, "retrieve")
thanks for this! Now get off the toilet and go put some clothes on