Hey Santiago! Just wanted to drop a comment to say that you're absolutely killing it as an instructor. Your way of breaking down the code and the whole process into simple, understandable language is pure gold, making it accessible for newcomers like me. Wishing you all the success and hoping you keep blessing the community with your valuable content!
Aside from the teaching side, have you tried creating a micro-SaaS based on these technologies? It seems to me that you are halfway there, and it could be a great opportunity to expand your business.
Thanks for taking the time and letting me know! I have not created any micro-SaaS applications, but you are right; that could be a great idea.
THANK YOU
I greatly appreciate the release of the new videos. The clarity of the explanations and the logical sequence of the content are exceptional.
Glad to see you involved pytest in the end; it is like a surprise dessert 🍰 after a great meal.
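Since the comment above mentions the pytest part of the video, here is a minimal sketch of how a RAG pipeline could be exercised from pytest. This is an illustration under assumptions: `answer_question` and the `QUESTIONS` list are hypothetical placeholders standing in for your own pipeline and generated test set, not the code from the video.

```python
# Minimal sketch: driving RAG checks from pytest.
# `answer_question` and QUESTIONS are hypothetical placeholders.
import pytest

QUESTIONS = [
    {
        "question": "What does the documentation say about rate limits?",
        "reference_answer": "Requests are limited to 100 per minute.",
    },
]


def answer_question(question: str) -> str:
    """Stand-in for your RAG pipeline; replace with a real call."""
    return f"Answer to: {question}"


@pytest.mark.parametrize("case", QUESTIONS)
def test_rag_returns_an_answer(case):
    answer = answer_question(case["question"])
    assert answer.strip(), "The pipeline returned an empty answer"
```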
Thanks! Exactly what I was looking for. I've been racking my brain over how the hell to test a RAG system. How the hell is the business going to give me 1000+ questions to test, and how can a human verify the responses? Top content.
This is my first time watching your videos. It is great. Thank you.
We appreciate your work a lot, my man.
FYI, keep an eye on the mic volume levels! Sounds like it was clipping
Thanks. You are right. Will adjust.
Oh man, the way you explained these complex topics is mind-blowing. I just wanted to say thank you for making videos like this.
Love the video. Great breakdown. I would like to see more detail on the evaluation results (e.g., is 0.73 good? WTH...!?), how tweaking the pipeline changes the eval results, and a comparison of, e.g., Ragas versus Giskard.
Superb video. Great content from start to finish. Thank you.
Great stuff!
What are your preferred open source alternatives to all tools used in this tutorial?
You are using an LLM to create a question, an LLM to get another answer, and then letting an LLM evaluate both answers, but how do you evaluate the output of the initial tests? At this point you are trusting the facts of an LLM by trusting the answers of the LLM.
Damn, you explained each step really well! Love it!
Hello Santiago, your explanation was thorough and I understood it really well. Now I have a question: is there any tool other than Giskard (one that is open source and does not require an OpenAI API key) to evaluate my LLM or RAG model?
Thank you in advance😊
Great stuff Santiago! You've used Giskard to create the test cases. These test cases themselves are created using an LLM. In a real application, would we have to manually vet the test cases to ensure they themselves are 100% accurate?
Super important topic you covered here man!
Just awesome instruction, Santiago. I am a beginner, but you make learning digestible and clear! Sorry if this is an ignorant question, but is it possible to use FAISS, Postgres, MongoDB, Chroma DB, or another free, open-source option as a substitute for Pinecone to save money? If so, which would you recommend for ease of implementation with LangChain?
Yes, you can! Any of them will work fine. FAISS is very popular.
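For anyone who wants to try the swap, here is a minimal sketch of using FAISS as a free, local vector store with LangChain. It assumes the langchain-community, langchain-openai, langchain-core, and faiss-cpu packages; the documents and the embedding model are placeholder choices, not the setup from the video.

```python
# Minimal sketch: FAISS as a local, open-source replacement for Pinecone.
# Requires: pip install langchain-community langchain-openai faiss-cpu
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Placeholder documents; load your own data instead.
docs = [
    Document(page_content="Pinecone is a managed vector database."),
    Document(page_content="FAISS is an open-source library for similarity search."),
]

embeddings = OpenAIEmbeddings()  # any LangChain embedding model works here
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

print(retriever.invoke("open-source alternative to Pinecone"))
```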
Great video on LLMs and RAG.
Hey Santiago, thank you for this course, in which you explained all the concepts of RAG evaluation in a very clear way. However, I have a question about the reference answers. How have they been generated, and based on what (is it an LLM)? If that is the case, say we have a question that needs a specific piece of information that exists only in the knowledge base; how can another LLM generate such an answer? And how do we know that the reference questions are correct and are what we are looking for? Thank you in advance.
So the one thing you learn training ML models is that you don't evaluate your model on training data, and you have to be careful of data leaking. Here, you're providing Giskard your embedded documentation, meaning Giskard is likely using its own RAG system to generate test cases, which you then use to evaluate your own RAG system. Can you please explain how this isn't nonsense? Do you evaluate the accuracy of the Giskard test cases beyond the superficial "looks good to me" method that you claim to be replacing? What metrics do you evaluate Giskard's test cases against, since its answers are also subjective? You're just now entrusting that subjective evaluation to another LLM.
Perhaps the purpose of testing in software development is different from ML testing. In software engineering you're ensuring that changes made to a system don't break existing functionality; in ML you test on data your model hasn't trained on to prove it generalises to unseen, novel samples, since that's how it will have to perform in deployment. Maybe the tests you're doing here fit into the software engineering bucket, and therefore LLMs may be perfectly capable of auto-generating test cases; and since we aren't trying to test how well the generated material "generalises" (that doesn't make sense in this context), that's okay… I'm a little confused.
I’m new to gen ai, background in ML some years back, apologies if I come off hostile or jaded.
@maxisqt I think these are good questions, actually. Maybe the way to think about RAG, at least in the present scenario, is that it is really a type of information retrieval and there is no need to generalise, as you say - we just want to be able to find relevant information in a predefined set of documents.
The way that frameworks like Giskard try to solve the problem of evaluating LLMs/RAG with LLMs that are not necessarily better than the ones being evaluated is through the way the test sets are generated.
To give one example, the framework might ask an LLM to generate a question-and-answer pair, then ask it to rephrase the question to make it harder to understand without changing its meaning or what the answer will be. It will then ask your LLM the harder version of the question and compare its output to the original answer. This can work despite the fact that their LLM is not necessarily more powerful than yours, because rephrasing an easy question into a hard one is an easier problem than interpreting the hard question.
(A good analogy might be that a person can create puzzles that are hard to solve even for much smarter people by starting from the solution and then working backwards to the question.)
Note that the test data does not need to be perfect; it just needs to be generally better than the outputs we will get from our models/pipelines. The point of these tools is not to evaluate whether the outputs we are getting are actually true, but simply whether they improve when we make changes to the pipeline.
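To make the generate-then-rephrase idea above concrete, here is a rough sketch written directly against the OpenAI Python client rather than Giskard's internals (which may work differently); the prompts and the model name are illustrative assumptions.

```python
# Rough sketch of "generate an easy Q&A pair, then make the question harder".
# Illustration of the idea only, not Giskard's actual implementation.
from openai import OpenAI

client = OpenAI()        # expects OPENAI_API_KEY in the environment
MODEL = "gpt-4o-mini"    # illustrative model choice


def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


context = "FAISS is an open-source library for vector similarity search."

# 1. Generate an easy question answerable only from the context.
question = ask(f"Write one question answerable only from this text:\n{context}")

# 2. Generate the reference answer from the same context.
reference = ask(
    f"Answer the question using only this text.\nText: {context}\nQuestion: {question}"
)

# 3. Rephrase the question to be harder without changing its meaning.
hard_question = ask(
    f"Rephrase this question so it is harder to parse but keeps the same meaning: {question}"
)

# (hard_question, reference) becomes one test case for your RAG pipeline.
print(hard_question, reference, sep="\n")
```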
Ouroboros
Amazing! Can you also explain how to do the same type of evaluation on Vision Language Models that use images?
explained amazingly
Do we need a paid subscription to the OpenAI APIs to be able to use Giskard?
Thank you for your video; it is very helpful. How can we use Giskard if we want to use a local LLM in our RAG system, like Llama 3?
This is so well done
That's great. You rock!!!
This is gold ❤
I think that all the "AI experts" in the wild just "explain" common concepts of AI/LLM systems. It would be nice to understand other aspects a bit more, like (good choice) evaluation. It would be interesting to have some relevant courses on that. I know it is the secret sauce, but it could be useful.
BTW, are you teaching causal ML in your course?
I'm not teaching causal ML, no. The program focuses on ML Engineering.
@underfitted I want to do it, but I don't have time. I hope there will be more cohorts in the near future.
How can I use Hugging Face LLMs to generate the test set?
Is it OK to use generative AI to test generative AI? What about the accuracy of Giskard? I'm not sure about this.
The accuracy is as good as the model they use (which is GPT-4). Yes, this is how you can test the results of a model.
I loved it!
Thanks for the amazing content! Can we use Giskard without an OpenAI key?
It's awesome; we need more content on working with open-source models.
Amazing!
Hi, excellent tutorial; I wouldn't anticipate any less. I ran your notebook with an open-source LLM; however, generating the test set with giskard.rag is calling the OpenAI API (timestamp 19:11). Any workaround?
Giskard will always use GPT-4, regardless of the model you use in your RAG app.
How does the GPT instance that generates the questions and the answers know the validity of those answers? If they are actually accurate, why would you build the RAG in the first place, when you can create a GPT instance that is accurate enough (using one simple prompt: 18:33, agent description)? I don't understand; can someone explain, please? Do you see the paradox here?
Because GPT-4 is quite expensive, you wouldn't want to use it in production if 3.5 or any other open-source model does the job correctly. This library uses GPT-4 as the best LLM for producing the reference RAG answers. That is why they use it to build the test cases: to see whether, for your specific application, a cheaper or free open-source model is more or less okay.
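As a rough illustration of that comparison step, here is a sketch in which an LLM judge checks whether a cheaper model's answer agrees with a GPT-4-generated reference answer; the prompt and judge model are illustrative assumptions, and real evaluation frameworks use more elaborate prompts and metrics.

```python
# Rough sketch: an LLM judge compares a candidate answer to a reference answer.
# Illustrative only; not how any specific framework implements its checks.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def agrees(candidate: str, reference: str) -> bool:
    prompt = (
        "Reference answer:\n" + reference + "\n\n"
        "Candidate answer:\n" + candidate + "\n\n"
        "Does the candidate convey the same facts as the reference? Reply YES or NO."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")


print(agrees("Requests are capped at 100/min.", "The rate limit is 100 requests per minute."))
```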
The primary purpose of Retrieval-Augmented Generation (RAG) is to enable the development of applications tailored to your enterprise. Foundational models like GPT-3.5 or GPT-4 are not specifically trained on your enterprise data, so to adapt an LLM effectively for your organization, RAG or fine-tuning may be necessary. This allows the model to interact with and utilize your enterprise data seamlessly.
Very useful video. Thank you.
What happens when you take away OpenAI and a module?
Can you build this with a local model and your own code?
Hey, can you make a video that uses an open-source LLM to build a Q&A chatbot for a website page?
Thanks!
Nice explanation.
How new is Giskard?
Thanks ❤
Can you do this using Gemini Pro?
LangChain sucks.