I have found filters that answer yes or no to be not much help. For example, I have embeddings of tech docs and embeddings of an order processing system. When the filter is set and a random query like "can I order pizza with it?" is submitted, the model thinks the context is related to order processing and returns YES, which is totally wrong.
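For illustration, a minimal sketch of a stricter yes/no relevance filter than a default one; the prompt wording, the model choice, and the `retrieved_chunk_text` placeholder are assumptions, not anything from the video:

```python
# Minimal sketch (assumption): a stricter yes/no relevance check that is told
# explicitly to reject off-topic questions, instead of relying on a generic filter prompt.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

strict_filter_prompt = PromptTemplate.from_template(
    "You are a strict relevance judge for a knowledge base about {domain}.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer YES only if the context itself can answer the question.\n"
    "If the question is off-topic for this knowledge base (e.g. ordering food),\n"
    "answer NO. Answer with a single word: YES or NO."
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
relevance_chain = LLMChain(llm=llm, prompt=strict_filter_prompt)

verdict = relevance_chain.run(
    domain="an order processing system",
    context=retrieved_chunk_text,  # placeholder: text of one retrieved chunk
    question="can i order pizza with it?",
)
keep_chunk = verdict.strip().upper().startswith("YES")
```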
Astonishing content, Sam 💯💯 Thanks for sharing your knowledge with us (thanks for the subtitles too 😄). Thumbs up from Brazil 👍👍👍
Great content. Thanks for taking the time to make such videos. I've been learning a lot from them.
this is actually an interesting idea...
So in short, in order to make the new revolutionary AI actually useful, you must meticulously hardcode the thinking it is supposed to be doing for you. Feels almost like crafting expert systems in the '80s! Imagine the expected explosion in productivity from applying that same process! Or let the AI imagine for you (imagination is what it's really good for).
Yeah. But in some cases I've seen, we don't need that much sophistication and a bare-bones approach works well 😊 Peace.
RAG is built on retrieval, and retrieval is another word for search. Search is a very hard problem. The difficulty of searching, ranking, and filtering to get a good-quality set of candidate documents to reason over is underestimated; that's where the complexity lies. Vector search doesn't directly solve these issues. Search engines like Google have hundreds of ranking factors, including vector search, re-ranking with cross-encoder models, and quality factors. TL;DR: vector search makes for a good demo and proof of concept. For true production systems, there is a lot of complexity and engineering required to make them work in practice.
LLMs are not the solution to any problem; as always, it's the engineering part that brings the actual results.
13:09 Sounds like you should be using an LLM to narrow down that prompt for each case.
Thoughts on the "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" paper?
Interesting paper. I am currently traveling, but will try to make a video about the paper or show some of the ideas in a project when I get a chance
Thank you for the amazing tutorial! I was wondering, instead of using ChatOpenAI, how can I utilize a Llama 2 model locally? Specifically, I couldn't find any implementation, for example, for contextual compression, where you pass compressor = LLMChainExtractor.from_llm(llm) with the ChatOpenAI (llm). How can I achieve this locally with Llama 2? My use case involves private documents, so I'm looking for solutions using open-source LLMs.
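For what it's worth, a rough, untested sketch of how a local Llama 2 model (via llama-cpp-python and LangChain's `LlamaCpp` wrapper) could be dropped in where `ChatOpenAI` was used; the model path and the `vectorstore` retriever are placeholders, and how well a local model follows the extractor prompt will vary:

```python
# Rough sketch (assumptions: model path, vectorstore): swap ChatOpenAI for a
# local Llama 2 model served through llama-cpp-python.
from langchain.llms import LlamaCpp
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = LlamaCpp(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,
    temperature=0,
)

compressor = LLMChainExtractor.from_llm(llm)  # same call as with ChatOpenAI
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(),  # placeholder: your existing vector store
)

docs = compression_retriever.get_relevant_documents("your question here")
```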
I'm facing the same problem; I'm wondering if you've found any solutions?
Thank you for another great video:)
Thanks for the video about fine-tuning RAG. Personally, I think the Self-RAG solution is more generic because it's embedded in the LLM...
So instead of using an 'extractive QA model' you prompt an LLM into doing the same thing... amazing how flexible these LLMs are... in this case you are basing your hopes on the model's 'reasoning'...
As long as someone else pays for it...
Hm, when you were going over those instructions like "don't change the text, repeat it exactly the same," and how hard it is to convince it to write the same text out, I thought: why make it do that at all? If we just numbered the sentences, it could respond with the numbers of the sentences to include, or something. Maybe that would save output tokens as well as not give it any chance to imagine things.
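A quick sketch of that "reply with sentence numbers" idea; the prompt wording is an assumption, the sentence splitting is deliberately naive, and `chunk_text`, `question`, and `llm` are placeholders:

```python
# Sketch: number the sentences, ask only for the indices to keep, then rebuild
# the compressed chunk locally so the model never has to copy text verbatim.
sentences = [s.strip() for s in chunk_text.split(". ") if s.strip()]
numbered = "\n".join(f"{i}: {s}" for i, s in enumerate(sentences))

prompt = (
    "Below are numbered sentences from a document, followed by a question.\n"
    "Reply with ONLY the numbers of the sentences needed to answer the question,\n"
    "comma-separated (e.g. 0,3,4). Reply NONE if no sentence is relevant.\n\n"
    f"{numbered}\n\nQuestion: {question}"
)

reply = llm.predict(prompt)  # any LangChain LLM or chat model
if reply.strip().upper() != "NONE":
    keep = [sentences[int(i)] for i in reply.split(",") if i.strip().isdigit()]
    compressed_chunk = ". ".join(keep)
```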
This is really interesting. My only worry is that this makes it prohibitively slow; the longest part of RAG is often the call to the LLM. It would be interesting if you could review some companies which have faster models than OpenAI but still have decent performance.
If I was making a chatbot and needed it to not lag before responding, I'd just fake it, like how Windows has twelve different bars go across and various things slowly fade in so it doesn't seem like it's taking forever to boot XD. I'd send the request simultaneously to both the thoughtful process and a model that just has instructions to respond immediately, echoing the user: "OK, so what you're saying you want is..." Personally, I'd even want it to be transparent about what's happening, like saying that it's looking stuff up right now. I'd think about feeding the agent that's keeping the user busy some data about how much we've retrieved and how we've processed it so far, so it can say computery things like "I have discovered 8475 documents relevant to your query, and I am currently filtering and compressing them to find the most relevant information"... But you could also just fake it by pretending you have the answer and you're just a little slow at getting to the point, like stalling for a few seconds with a cookie-cutter disclaimer about how you're just a hapless AI :D
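As a toy sketch of that "answer instantly while the real pipeline runs" pattern, using asyncio; `run_rag_pipeline`, `quick_ack_llm`, and `send_to_user` are hypothetical stand-ins, not real library calls:

```python
# Toy sketch (hypothetical helpers): kick off the slow RAG pipeline and a fast
# "acknowledgement" call at the same time, streaming status text to the user.
import asyncio

async def answer_with_fast_ack(question: str) -> str:
    slow_task = asyncio.create_task(run_rag_pipeline(question))   # thoughtful path
    ack = await quick_ack_llm(f"Briefly restate what the user is asking: {question}")
    send_to_user(ack)                      # e.g. "OK, so you're asking about..."
    send_to_user("Searching the documents now...")
    return await slow_task                 # the real, slower answer
```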
@@mungojelly aha, cool. But this doesn't make a difference to when I use it, e.g., for studying at Uni.
@wiltedblackrose If it's for your own use and there are no customers to offend, then you could make it quick and dirty in other ways. I'd think about giving random raw retrieved documents to a little cheap, hallucinatey model to see if it gets lucky and can answer right away, then getting answers from progressively slower chains of reasoning. If it was for my own use, I'd definitely make it so there's visual feedback about what it found and what it's doing, since if I made it myself, even obscure visual feedback where documents flash by too quickly to read would make sense to me, because I'd know exactly what it's doing.
First of all, thanks for the great video! As some of the comments have rightfully pointed out, while I see some merits for offline use cases, this will be very challenging for real-time use cases. Also, I'm curious how much this depends on the chosen LLM understanding and following the default prompts. It seems the LLM choice can make it or break it, which is quite brittle.
Good ideas Sam 👌
Wouldn't it be simpler to just use a small chunk_size for the initial splitter function when you embed the documents into the vector database?
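For reference, a minimal sketch of splitting with a smaller chunk_size up front; the sizes are arbitrary examples, and the trade-off is that very small chunks can lose surrounding context, which is part of what the compression step tries to preserve:

```python
# Minimal sketch: smaller chunks at indexing time (sizes are arbitrary examples).
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
small_chunks = splitter.split_documents(documents)  # placeholder: your loaded docs
```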
Great. How about cross-encoders and re-ranking?
I use it, and my experience is that it improves retrieval a lot! The out-of-fashion SentenceTransformers perform amazingly there!
I am doing some benchmark testing on Arabic datasets, and I am getting super results with ME5 embeddings plus the Cohere reranker.
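For anyone curious, a small sketch of re-ranking retrieved chunks with a sentence-transformers cross-encoder; the model name is one commonly used public checkpoint given as an example, and `query` and `candidate_docs` are placeholders:

```python
# Sketch: score (query, chunk) pairs with a cross-encoder and keep the top few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

pairs = [(query, doc.page_content) for doc in candidate_docs]  # candidate_docs: retriever output
scores = reranker.predict(pairs)
reranked = [doc for _, doc in sorted(zip(scores, candidate_docs),
                                     key=lambda pair: pair[0], reverse=True)]
top_docs = reranked[:4]
```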
Yes I still have a number more coming in this series.
Thanks for the video.
thanks a lot, keep it up!
It seems like there's a huge disconnect in understanding of how state-of-the-art RAG works, e.g., document upload in the ChatGPT 4 UI vs. all the LangChain tutorials on RAG. I feel like the community doesn't understand that OpenAI is getting far better results and seems to be processing embeddings in a way that's much more advanced than LangChain-based systems do, and that the community isn't even aware that 'LangChain RAG' and 'OpenAI internal RAG' are completely different animals. E.g., it seems uploaded docs are added as embeddings into a ChatGPT 4 query orthogonally to the context window, yet all the LangChain examples I see end up returning text from a 'retriever' and shoving this output into the LLM context. I don't think good RAG even works that way...
Typo in the thumbnail. It's 4 not 5