AI Agent Evaluation with RAGAS

  • Published: 18 Jan 2025

Comments • 28

  • @utkarshkapil
    @utkarshkapil 9 months ago +5

    Been following your tutorials since last year, every single video has been super helpful and provides complete knowledge. THANK YOU and I hope you never stop!!

    • @dil6953
      @dil6953 9 months ago

      Agreed!! This man is a savior!!

    • @jamesbriggs
      @jamesbriggs  9 months ago

      haha thanks man I appreciate this a lot!

  • @javifernandez8736
    @javifernandez8736 9 months ago +2

    Hey James, I tried running RAGAS with a RAG system using your AI-chunked database, and it obliterated my OpenAI API funds ($20 within just 8% of the test run). Am I doing something wrong? Do you think there is a way of calculating the cost beforehand?
    Thank you so much for your videos.

    • @julianrosenberger1793
      @julianrosenberger1793 8 months ago

      You can check what is going on by using a web proxy, like webm. There you can see exactly which API calls RAGAS is making under the hood and whether something is going wrong (for example, failing to extract JSON and, as a result, making some calls multiple times...).
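
      A rough way to estimate the cost before a full run is to count the tokens in your evaluation rows with tiktoken and multiply by your model's input price, scaled by the number of metrics (each metric sends roughly one prompt per row). The rows, price, and metric count below are placeholders, not RAGAS internals.

      import tiktoken

      # Hypothetical evaluation rows in the shape RAGAS expects.
      eval_rows = [
          {"question": "What is RAGAS?",
           "contexts": ["RAGAS is a framework for evaluating RAG pipelines."],
           "answer": "RAGAS evaluates RAG pipelines."},
      ]

      enc = tiktoken.get_encoding("cl100k_base")
      price_per_1k_input = 0.01  # placeholder; check your model's current pricing
      num_metrics = 4            # e.g. faithfulness, answer relevancy, context precision/recall

      total_tokens = 0
      for row in eval_rows:
          text = row["question"] + " ".join(row["contexts"]) + row["answer"]
          total_tokens += len(enc.encode(text))

      # Scale input tokens by the metric count; output tokens add extra cost on top.
      est_cost = (total_tokens / 1000) * price_per_1k_input * num_metrics
      print(f"~{total_tokens} input tokens, roughly ${est_cost:.4f} before output tokens")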

  • @erdoganyildiz617
    @erdoganyildiz617 9 months ago +4

    Hey there James. Thank you for the content.
    I am confused about the retrieval measures. Specifically, it seems like we don't feed the ground truth contexts to the RAGAS evaluator (we only feed the ground truth answers), so how can it decide whether a chunk retrieved by the RAG pipeline is actually positive or negative?
    Even if we fed it, I would still be confused about how to compare a retrieved context/chunk with a ground truth context. In your example it seems like we have a single, long ground truth context, while the RAG pipeline retrieves 5 smaller chunks, so how do we decide whether a single retrieved chunk is positive or not?
    And lastly, how do we even obtain the ground truth context at all? To answer a question there might be many useful chunks inside our source documents, right? How do we decide which one or ones are the best, and how do we decide their length, and so on?
    I would appreciate any kind of answer. Thanks in advance. :)
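
    On the first question: RAGAS's context_recall does not need ground-truth contexts; it splits the ground-truth answer into statements and asks an LLM whether each statement can be attributed to the retrieved chunks, so recall is the supported fraction. Below is a minimal sketch of that idea only (not RAGAS's actual prompts or code; a naive keyword check stands in for the LLM judge).

    # Conceptual sketch: score retrieved chunks against the ground-truth ANSWER,
    # not against a ground-truth context. `is_supported` would be an LLM call
    # in practice; here a naive keyword check stands in for it.
    def context_recall_sketch(ground_truth_answer, retrieved_chunks, is_supported):
        statements = [s.strip() for s in ground_truth_answer.split(".") if s.strip()]
        context = "\n".join(retrieved_chunks)
        supported = sum(1 for s in statements if is_supported(s, context))
        return supported / len(statements) if statements else 0.0

    score = context_recall_sketch(
        "Paris is the capital of France. It hosts the Louvre.",
        ["Paris is the capital of France.", "The Louvre is in Paris."],
        is_supported=lambda stmt, ctx: all(w.lower() in ctx.lower() for w in stmt.split() if len(w) > 3),
    )
    print(score)  # 0.5 with this naive check: "hosts" never appears in the retrieved chunks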

  • @waterangel273
    @waterangel273 9 months ago

    I like your video even before I view it because I know it will be awesome!

  • @bradleywoolf3351
    @bradleywoolf3351 12 days ago

    great description

  • @yerson557
    @yerson557 3 months ago +1

    Where does ground truth come from? Is this a human-annotated property? I understand the ground truth in RAGAS refers to the correct answer to the question, and it's typically used for the context_recall metric. But how do we get this? Human in the loop? LLM-generated? More documents from retrieval? Thank you!

  • @sivi3883
    @sivi3883 7 months ago

    Awesome video! RAGAS looks very nice as we stumble through building an automated evaluation framework. I understand we need manual test cases during development, but it is also not realistic to scale a manual evaluation process once the RAG apps go to production.
    I am aware you mentioned RAGAS can generate question and ground truth pairs based on the provided data. In my use case we have thousands of PDFs and HTML files in the RAG application. Does that mean we need to supply every single doc to RAGAS to generate these pairs? Just wondering how feasible this is for a chatbot where users could ask any query after go-live, and how to generate these metrics effectively. Would love to hear your thoughts!
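
    You generally don't need to feed every document; sampling a representative subset of the corpus is usually enough for a starter test set. The sketch below assumes a ragas 0.1.x-style TestsetGenerator and a LangChain DirectoryLoader; import paths and method names have changed across RAGAS releases, and the "docs/" path and sizes are placeholders.

    import random
    from langchain_community.document_loaders import DirectoryLoader
    from ragas.testset.generator import TestsetGenerator

    # Load the corpus (placeholder path), then sample a manageable subset.
    docs = DirectoryLoader("docs/", glob="**/*.pdf").load()
    sampled = random.sample(docs, k=min(100, len(docs)))

    # Generate question / ground-truth pairs from the sampled docs only.
    generator = TestsetGenerator.with_openai()
    testset = generator.generate_with_langchain_docs(sampled, test_size=50)
    testset.to_pandas().to_csv("eval_testset.csv", index=False)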

  • @bald_ai_dev
    @bald_ai_dev 9 months ago

    I have a dataset from Stack Overflow containing questions and answers. How can I prepare the data to be used for this RAGAS evaluation?

  • @shreerajkulkarni
    @shreerajkulkarni 9 months ago

    Hi James, can you make a tutorial on how to integrate RAGAS with local RAG pipelines?

  • @DiljyotSingh-vn6wo
    @DiljyotSingh-vn6wo 9 months ago

    For evaluation, the contexts that we provide to RAGAS: are they the predicted contexts or the ground truth contexts?

    • @jamesbriggs
      @jamesbriggs  9 months ago +1

      ground truth contexts are the actual positives (p) and predicted contexts are the predicted positives (p_hat); we use both in different places
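
      As a tiny set-based illustration of that distinction (classic IR recall/precision; RAGAS's own metrics are LLM-judged rather than exact-match):

      ground_truth_contexts = {"doc_3", "doc_7"}            # p: the actual positives
      retrieved_contexts = {"doc_3", "doc_5", "doc_9"}      # p_hat: what the pipeline returned

      hits = ground_truth_contexts & retrieved_contexts
      recall = len(hits) / len(ground_truth_contexts)       # fraction of p that was recovered
      precision = len(hits) / len(retrieved_contexts)       # fraction of p_hat that was relevant
      print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.50, precision=0.33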

  • @alivecoding4995
    @alivecoding4995 9 months ago

    Hi James. I would like to ask: how do you cut the videos so that the end result looks almost as if you talked through it without any mistakes? Clearly, there are many cuts. But is it fully automated? And which tool provides this?

  • @DoktorUde
    @DoktorUde 9 months ago

    We have been experimenting with RAGAS for evaluating our RAG pipelines but for us the metrics (especially the ones for context retrieval) seem to be very unreliable. Using a test set of 50 questions the recall would go from 0.45 to 0.85 between runs without us changing any of the parameters. For the time being we have stopped using RAGAS because of this. What have your experiences been? Would be interested to know if it's maybe something we have been doing wrong.

    • @jamesbriggs
      @jamesbriggs  9 months ago

      I don't think the retrieval metrics (context recall and precision) should vary between runs (assuming you are getting the same output from the retrieval pipeline). The generative metrics rely on generative elements from the LLMs, and so they will tend to change between runs.

    • @julianrosenberger1793
      @julianrosenberger1793 8 months ago

      Did you specify the LLM? Because RAGAS uses only GPT-3.5 Turbo by default... you have to set it to a GPT-4 version.
      But if you have a more complex use case, it probably won't work even with GPT-4... for all the more complex projects I've been involved in, I've ended up building custom evaluation systems tailored to the use case, and they outperformed RAGAS by far in measuring real quality...
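
      A sketch of pinning the judge model explicitly, assuming a ragas 0.1.x-style evaluate() that accepts LangChain llm/embeddings objects (the exact keyword arguments differ between releases):

      from datasets import Dataset
      from langchain_openai import ChatOpenAI, OpenAIEmbeddings
      from ragas import evaluate
      from ragas.metrics import answer_relevancy, faithfulness

      # One-row example dataset in the column layout RAGAS expects.
      eval_dataset = Dataset.from_dict({
          "question": ["What does RAGAS evaluate?"],
          "contexts": [["RAGAS scores RAG pipelines on retrieval and generation."]],
          "answer": ["It scores RAG pipelines."],
      })

      result = evaluate(
          eval_dataset,
          metrics=[faithfulness, answer_relevancy],
          llm=ChatOpenAI(model="gpt-4o"),   # judge LLM, instead of the default
          embeddings=OpenAIEmbeddings(),    # used by answer_relevancy
      )
      print(result)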

    • @jamesbriggs
      @jamesbriggs  8 months ago +2

      @@julianrosenberger1793 yeah I agree, RAGAS is a nice first step, but for good and accurate evaluation you need to be creating your own test cases (which you can use RAGAS to initially create, but you should be prepared to modify them a lot)

    • @BriceGrelet
      @BriceGrelet 7 months ago

      Hi James, thank you for your work. RAGAS seems similar to the concepts behind the DSPy framework. Have you tested it yet, and if so, what's your opinion of it? Thank you again 👍

  • @realCleanK
    @realCleanK 9 months ago

    Thank you!

  • @Anonymous-bu9ch
    @Anonymous-bu9ch 5 months ago

    I am getting this error while calling the evaluate function:
    AttributeError: 'DataFrame' object has no attribute 'rename_columns'

    • @Anonymous-bu9ch
      @Anonymous-bu9ch 5 months ago

      Those who are getting this error (and other errors): run evaluate on the rows one by one, e.g.

      from datasets import Dataset
      from ragas import evaluate

      for index in range(len(eval_df)):  # iterate over every row of the pandas DataFrame
          # convert the one-row slice into the datasets.Dataset that evaluate() expects
          eval_dataset = Dataset.from_pandas(eval_df.iloc[index:index + 1])
          result = evaluate(dataset=eval_dataset)
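
      The error usually means a pandas DataFrame was passed where RAGAS expects a Hugging Face datasets.Dataset, so converting the whole frame once also works (minimal sketch; eval_df is your DataFrame with the RAGAS column names):

      from datasets import Dataset
      from ragas import evaluate

      eval_dataset = Dataset.from_pandas(eval_df)  # convert the whole DataFrame once
      result = evaluate(dataset=eval_dataset)      # default metrics, or pass metrics=[...]
      print(result)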

  • @duongkstn
    @duongkstn 1 month ago

    thanks

  • @scharlesworth93
    @scharlesworth93 9 months ago

    RAGGA TWINS STEP OUT! BO! BO! BO!

  • @АнтонБ-х9у
    @АнтонБ-х9у 1 month ago

    RAGAS must go. I don't really understand why such a weak method is promoted so much.