Been following your tutorials since last year, every single video has been super helpful and provides complete knowledge. THANK YOU and I hope you never stop!!
Agreed!! This man is a savior!!
haha thanks man I appreciate this a lot!
Hey James, I tried running Ragas with a RAG system using your AI-chunked database, and it obliterated my OpenAI API funds ($20 within just 8% of the test run). Am I doing something wrong? Do you think there is a way of calculating the cost beforehand?
Thank you so much for your videos
You can check out what is going on by using a web proxy, like webm. There you can see exactly which API calls Ragas is making under the hood and whether something is going wrong (for example, JSON extraction failing and, as a result, some calls being made multiple times...).
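On calculating the cost beforehand: here is a minimal sketch that counts the tokens you already know you will send, using tiktoken. The price constant and the whole approach are my assumptions rather than anything built into Ragas; Ragas wraps your data in its own prompts and may retry calls, so treat the number as a lower bound.

import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005  # assumed gpt-3.5-turbo input price; check OpenAI's current pricing

rows = [  # replace with your own evaluation data
    {"question": "What is RAG?",
     "answer": "Retrieval-augmented generation.",
     "contexts": ["RAG combines retrieval with generation."]},
]

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
total_tokens = sum(
    len(enc.encode(r["question"])) + len(enc.encode(r["answer"]))
    + sum(len(enc.encode(c)) for c in r["contexts"])
    for r in rows
)
print(f"~{total_tokens} input tokens, at least ${total_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.4f}")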
Hey there James. Thank you for the content.
I am confused about the retrieval measures. Specifically, it seems like we don't feed the ground-truth contexts to the Ragas evaluator (we only feed the ground-truth answers), so how can it decide whether a chunk retrieved by the RAG system is actually a positive or a negative?
Even if we did feed them, I would still be confused about how to compare a retrieved context/chunk with a ground-truth context. In your example we have a single, long ground-truth context, while the RAG system retrieves 5 smaller chunks, so how do we decide whether a single retrieved chunk is a positive or not?
And lastly, how do we even obtain the ground-truth context at all? To answer a question there might be many useful chunks inside our source documents, right? How do we decide which one or ones are the best, how do we decide their length, and so on?
I would appreciate any kind of answer. Thanks in advance. :)
I like your video even before I view it because I know it will be awesome!
haha thanks!
great description
Where does the ground truth come from? Is this a human-annotated property? I understand the ground truth in RAGAS refers to the correct answer to the question, and it's typically used for the context_recall metric. But how do we get it? Human in the loop? LLM-generated? More documents from the retrieval? Thank you!
Awesome video! RAGAS looks very nice as we work on building an automated evaluation framework. Understood that we need manual test cases during development, but it is also not realistic to scale a manual evaluation process once the RAG apps go to production.
I am aware you mentioned RAGAS can generate question and ground-truth pairs based on the provided data. In my use case we have thousands of PDFs and HTML files in the RAG application. Does that mean we need to supply every single doc to RAGAS to generate these pairs? Just wondering how feasible it is for a chatbot where a user could ask any query after go-live, and how to generate these metrics effectively. Would love to hear your thoughts!
I have a dataset from Stack Overflow containing questions and answers. How can I prepare the data to be used for this Ragas evaluation?
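For anyone wondering what shape the evaluation data should take, here is a minimal sketch mapping Stack Overflow style question/answer pairs into a datasets.Dataset. The example values are made up, and the column names ("ground_truth" vs "ground_truths", etc.) are assumptions that differ between Ragas versions, so check the docs for the version you have installed.

import pandas as pd
from datasets import Dataset

eval_df = pd.DataFrame({
    "question": ["How do I reverse a list in Python?"],                        # the Stack Overflow question
    "ground_truth": ["Use my_list[::-1] or my_list.reverse()."],               # accepted answer as the reference
    "answer": ["You can slice the list with my_list[::-1]."],                  # what your RAG pipeline answered
    "contexts": [["Slicing with [::-1] returns a reversed copy of a list."]],  # chunks your retriever returned
})
eval_dataset = Dataset.from_pandas(eval_df)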
Hi James, can you make a tutorial on how to integrate RAGAS with local RAG pipelines?
For evaluation, are the contexts that we provide to Ragas the predicted contexts or the ground-truth contexts?
The ground-truth contexts are the actual positives (p) and the predicted contexts are the predictions (p_hat); we use both in different places.
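To make the p / p_hat distinction concrete, here is a toy sketch of the classical recall and precision definitions behind those names (this is only the underlying idea; Ragas judges relevance with an LLM over the text rather than by matching chunk IDs).

p = {"doc1#3", "doc2#7"}                 # ground-truth (actually relevant) chunks
p_hat = {"doc1#3", "doc4#1", "doc5#2"}   # chunks the retrieval pipeline returned

true_positives = p & p_hat
recall = len(true_positives) / len(p)         # how much of the relevant material was found -> 0.5
precision = len(true_positives) / len(p_hat)  # how much of what was returned is relevant  -> 0.33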
Hi James. I would like to ask: how do you cut the videos so that the end result looks almost as if you talked through it without any mistakes? Clearly, there are many cuts, but is it fully automated? And which tool do you use for this?
We have been experimenting with RAGAS for evaluating our RAG pipelines but for us the metrics (especially the ones for context retrieval) seem to be very unreliable. Using a test set of 50 questions the recall would go from 0.45 to 0.85 between runs without us changing any of the parameters. For the time being we have stopped using RAGAS because of this. What have your experiences been? Would be interested to know if it's maybe something we have been doing wrong.
I don't think the retrieval metrics (context recall and precision) should vary between runs (assuming you are getting the same output from the retrieval pipeline). The generative metrics rely on generative elements from the LLMs and so they will tend to change between runs
Did you specify the LLM? Because Ragas uses only GPT-3.5 Turbo by default... you have to set it to a GPT-4 version.
But if you have a more complex use case, it probably won't work well even with GPT-4... For all of the more complex projects I've been involved in, I've ended up building custom evaluation systems tailored to the use case, and they outperformed Ragas by far at measuring real quality...
@julianrosenberger1793 Yeah, I agree. RAGAS is a nice first step, but for good, accurate evaluation you need to create your own test cases (which you can use RAGAS to initially generate, but you should be prepared to modify them a lot).
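For the point above about specifying the evaluator LLM, a minimal sketch, assuming a Ragas version whose evaluate() accepts an llm argument (older releases configure the model on each metric instead) and assuming eval_dataset is a datasets.Dataset prepared as discussed elsewhere in the comments.

from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

result = evaluate(
    dataset=eval_dataset,  # your prepared datasets.Dataset
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4"),  # override the default judge model
)
print(result)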
Hi James, thank you for your work. Ragas seems similar to the concepts behind the DSPy framework. Have you tested it yet, and if so, what's your opinion of it? Thank you again 👍
Thank you!
I am getting this error while calling the evaluate function:
AttributeError: 'DataFrame' object has no attribute 'rename_columns'
For those getting this error (and other errors): run the evaluation one row at a time, e.g.

from datasets import Dataset
from ragas import evaluate

for index in range(len(eval_df)):  # note: range(len(eval_df)-1) would skip the last row
    eval_dataset = Dataset.from_pandas(eval_df.iloc[index:index+1], preserve_index=False)
    result = evaluate(
        dataset=eval_dataset,  # evaluate() needs a datasets.Dataset, not a pandas DataFrame, hence the error
    )  # pass metrics and other arguments here as before
thanks
you're welcome!
RAGGA TWINS STEP OUT! BO! BO! BO!
RAGAS must go. I don't really understand why such a weak method is promoted so much.