How to evaluate an LLM-powered RAG application automatically.

  • Published: 27 Nov 2024

Comments • 61

  • @aleksandarboshevski 8 months ago +14

    Hey Santiago! Just wanted to drop a comment to say that you're absolutely killing it as an instructor. Your way of breaking down the code and the whole process into simple, understandable language is pure gold, making it accessible for newcomers like me. Wishing you all the success and hoping you keep blessing the community with your valuable content!
    Aside from the teaching side, have you tried to create a micro-SaaS based on these technologies? It seems to me you are halfway there, and it could be a great opportunity to expand your business.

    • @underfitted 8 months ago +1

      Thanks for taking the time and letting me know! I have not created any micro-SaaS applications, but you are right; that could be a great idea.

  • @TooyAshy-100 8 months ago +5

    THANK YOU
    I greatly appreciate the release of the new videos. The clarity of the explanations and the logical sequence of the content are exceptional.

  • @liuyan8066 8 months ago +1

    Glad to see you included pytest at the end; it is like a surprise dessert 🍰 after a great meal.
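(A minimal illustration of what wrapping the evaluation in pytest can look like. The `answer` function and the question list below are hypothetical stand-ins, not the code from the video; pytest collects any function whose name starts with `test_`.)

```python
# Hypothetical stand-in for the RAG pipeline under test.
def answer(question: str) -> str:
    return "Paris" if "France" in question else "I don't know"

# (question, expected substring) pairs; in a real setup these would
# come from a generated test set rather than being hand-written.
CASES = [
    ("What is the capital of France?", "Paris"),
]

# pytest picks this up automatically because its name starts with "test_".
def test_rag_answers_contain_expected_facts():
    for question, expected in CASES:
        assert expected in answer(question)
```

Running `pytest` on a file like this fails the suite whenever a pipeline change regresses one of the stored question/answer pairs.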

  • @AmbrishYadav 4 months ago

    Thanks! Exactly what I was looking for. I've been cracking my head over how the hell to test a RAG system. How the hell is the business going to give me 1000+ questions to test, and how can a human verify the responses? Top content.

  • @mohammed333suliman 8 months ago +3

    This is my first time watching your videos. It is great. Thank you.

  • @TPH310 8 months ago +1

    We appreciate your work a lot, my man.

  • @TheScott10012 8 months ago +6

    FYI, keep an eye on the mic volume levels! Sounds like it was clipping

    • @underfitted 8 months ago

      Thanks. You are right. Will adjust.

  • @dikshantgupta5539 6 months ago +1

    Oh man, the way you explained these complex topics is mind-blowing. I just wanted to say thank you for making these kinds of videos.

  • @peterhjvaneijk1670 4 months ago

    Love the video. Great breakdown. Would like to see more detail on the evaluation results (e.g., the score is now 0.73; is that good or not?!), on how tweaking the pipeline changes the eval results, and on comparisons such as Ragas versus Giskard.

  • @tee_iam78 4 months ago

    Superb video. Great content from start to finish. Thank you.

  • @bald_ai_dev 7 months ago +3

    Great stuff!
    What are your preferred open-source alternatives to the tools used in this tutorial?

  • @kloklojul 4 months ago +3

    You are using an LLM to create a question, an LLM to generate another answer, and then letting an LLM evaluate both answers. But how do you evaluate the output of the initial tests? At that point you are trusting the facts of one LLM by trusting the answers of another LLM.

  • @maxnietzsche4843 5 months ago

    Damn, you explained each step really well! Love it!

  • @VikasChaudhary-x1y 5 months ago +3

    Hello Santiago, your explanation was thorough and I understood it really well. Now I have a question: is there any tool other than Giskard (one that is open source and does not require an OpenAI API key) to evaluate my LLM or RAG model?
    Thank you in advance 😊

  • @CliveFernandesNZ 7 months ago +3

    Great stuff, Santiago! You've used Giskard to create the test cases. These test cases themselves are created using an LLM. In a real application, would we have to manually vet the test cases to ensure they themselves are 100% accurate?

  • @alextiger548 6 months ago

    Super important topic you covered here man!

  • @theacesystem 8 months ago +1

    Just awesome instruction, Santiago. I am a beginner, but you make learning digestible and clear! Sorry if this is an ignorant question, but is it possible to substitute FAISS, Postgres, MongoDB, Chroma DB, or another free open-source option for Pinecone to save money, and if so, which would you recommend for ease of implementation with LangChain?

    • @underfitted 8 months ago

      Yes, you can! Any of them will work fine. FAISS is very popular.
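(All of those stores do essentially the same job, nearest-neighbour search over embedding vectors, which is why they are largely interchangeable behind LangChain's vector-store interface. A dependency-free sketch of that core idea, with a toy character-count "embedding" standing in for a real embedding model:)

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class TinyVectorStore:
    """Minimal stand-in for Pinecone/FAISS/Chroma: store (vector, text)
    pairs and return the k texts whose vectors are closest to the query."""
    def __init__(self, embed):
        self.embed = embed          # function: text -> list[float]
        self.items = []             # list of (vector, text)

    def add(self, text):
        self.items.append((self.embed(text), text))

    def search(self, query, k=1):
        qv = self.embed(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine(item[0], qv), reverse=True)
        return [text for _, text in ranked[:k]]

def toy_embed(text):
    """Toy 'embedding': letter-frequency vector. A real system would
    call an embedding model here instead."""
    text = text.lower()
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

store = TinyVectorStore(toy_embed)
store.add("pytest runs the tests")
store.add("pinecone stores vectors")
print(store.search("vector store", k=1))  # → ['pinecone stores vectors']
```

Swapping FAISS or Chroma for Pinecone changes only which library implements `add` and `search`; the retrieval step of the RAG pipeline stays the same.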

  • @litan5006 3 months ago

    Great video on LLMs and RAG.

  • @aliassim8774 7 months ago +1

    Hey Santiago, thank you for this course, in which you explained all the concepts of RAG evaluation in a very clear way. However, I have a question about the reference answers. How were they generated, and based on what (is it an LLM)? If so, say we have a question that needs a specific piece of information that exists only in the knowledge base; how can another LLM generate such an answer? And how do we know that the reference questions are correct and are what we are looking for? Thank you in advance.

  • @maxisqt 8 months ago +4

    So the one thing you learn training ML models is that you don't evaluate your model on training data, and to be careful of data leakage. Here, you're providing Giskard your embedded documentation, meaning Giskard is likely using its own RAG system to generate test cases, which you then use to evaluate your own RAG system. Can you please explain how this isn't nonsense? Do you evaluate the accuracy of the Giskard test cases beyond the superficial "looks good to me" method that you claim to be replacing? What metrics do you evaluate Giskard's test cases against, since its answers are also subjective? You're just entrusting that subjective evaluation to another LLM.

    • @maxisqt 8 months ago

      Perhaps the purpose of testing in software development is different from ML testing. In software engineering you're ensuring that changes made to a system don't break existing functionality; in ML you test on data your model hasn't trained on to prove it generalises to unseen, novel samples, since that's how it will have to perform in deployment. Maybe the tests here fit into the software engineering bucket, and therefore LLMs may be perfectly capable of auto-generating test cases; and since we aren't trying to test how well the generated material "generalises" (that doesn't make sense in this context), that's okay… I'm a little confused.

    • @maxisqt 8 months ago +1

      I'm new to gen AI, with a background in ML from some years back; apologies if I come off hostile or jaded.

    • @mikaelhuss5080 8 months ago

      @maxisqt I think these are good questions, actually. Maybe the way to think about RAG, at least in the present scenario, is that it is really a type of information retrieval, and there is no need to generalise, as you say; we just want to be able to find relevant information in a predefined set of documents.

    • @u4tiwasdead 8 months ago

      The way that frameworks like Giskard try to solve the problem of evaluating LLMs/RAG using LLMs that are not necessarily better than the ones being evaluated is through the way the test sets are generated.
      To give one example, the framework might ask an LLM to generate a question-and-answer pair, then ask it to rephrase the question to make it harder to understand without changing its meaning or its answer. It will then ask the LLM the harder version of the question and compare the result to the original answer. This can work even though their LLM is not necessarily more powerful than yours, because rephrasing an easy question into a hard one is an easier problem than interpreting the hard question.
      (A good analogy might be that a person can create puzzles that are hard to solve even for much smarter people than themselves, by starting from the solution and then constructing the question.)
      Note that the test data does not need to be perfect; it just needs to be generally better than the outputs we will get from our models/pipelines. The point of these tools is not to evaluate whether the outputs we are getting are actually true, but simply whether they improve when we make changes to the pipeline.

    • @trejohnson7677 7 months ago +1

      Ouroboros

  • @MohammadEskandari-do6xy 8 months ago +1

    Amazing! Can you also explain how to do the same type of evaluation on Vision Language Models that use images?

  • @arifkarim768 5 months ago

    Explained amazingly.

  • @dhrroovv 5 months ago +1

    Do we need a paid subscription to the OpenAI APIs to be able to use Giskard?

  • @AliMohammadjafari97 22 days ago

    Thank you for your video; it is very helpful. How can we use Giskard if we want to use a local LLM in our RAG system, like Llama 3?

  • @proterotype 7 months ago

    This is so well done

  • @theacesystem 8 months ago

    That's great. You rock!!!

  • @ergun_kocak 8 months ago

    This is gold ❤

  • @JonathanLoscalzo 5 months ago

    I think all the "AI experts" in the wild just "explain" common concepts of AI/LLM systems. It would be nice to understand other aspects in more depth, like evaluation (good choice here). It would be interesting to have some relevant courses on that. I know it is the secret sauce, but it could be useful.
    BTW, are you teaching causal ML in your course?

    • @underfitted 5 months ago +1

      I'm not teaching causal ML, no. The program focuses on ML engineering.

    • @JonathanLoscalzo 5 months ago

      @underfitted I want to do it, but I don't have time. I hope there will be more cohorts in the near future.

  • @francescofisica4691 5 months ago +1

    How can I use Hugging Face LLMs to generate the test set?

  • @utkarshgaikwad2476 8 months ago +1

    Is it OK to use generative AI to test generative AI? What about the accuracy of Giskard? I'm not sure about this.

    • @underfitted 8 months ago

      The accuracy is only as good as the model they use (which is GPT-4). And yes, this is how you can test the results of a model.

  • @aryamasingh3413 1 month ago

    I loved it!

  • @PratheekBabu 4 months ago

    Thanks for the amazing content. Can we use Giskard without an OpenAI key?

  • @sabujghosh8474 7 months ago

    It's awesome. We need more walkthroughs with open-source models.

  • @AjayKumar-hs2li 21 days ago

    Amazing!

  • @caesarHQ 8 months ago

    Hi, excellent tutorial; wouldn't anticipate any less. I ran your notebook with an open-source LLM; however, generating the test set with giskard.rag still calls the OpenAI API (timestamp 19:11). Any workaround?

    • @underfitted 8 months ago +1

      Giskard will always use GPT-4, regardless of the model you use in your RAG app.

  • @StoryWorld_Quiz 7 months ago +1

    How does the GPT instance that generates the questions and the answers know the validity of those answers? If they are actually accurate, why would you build the RAG in the first place, when you can create a GPT instance that is accurate enough (using one simple prompt: 18:33, the agent description)? I don't understand; can someone explain, please? Do you see the paradox here?

    • @mehmetbakideniz 7 months ago

      Because GPT-4 is quite expensive, you wouldn't want to use it in production if GPT-3.5 or any other open-source model does the job correctly. This library uses GPT-4, as the best available LLM, to produce the reference RAG answers. That is why they use it as the test baseline: to see whether, for your specific application, a cheaper or free open-source model is more or less okay.

    • @mahimahesh1945 1 month ago

      The primary purpose of Retrieval-Augmented Generation (RAG) is to enable the development of applications tailored to your enterprise. Foundational models like GPT-3.5 or GPT-4 are not specifically trained on your enterprise data, so to adapt an LLM effectively for your organization, RAG or fine-tuning may be necessary. This allows the model to interact with and utilize your enterprise data seamlessly.

  • @sridharm4254 6 months ago

    Very useful video. Thank you.

  • @sbacon92 3 months ago

    What happens when you take away OpenAI and a module?
    Can you build this with a local model and your own code?

  • @gauravpratapsingh8840 5 months ago

    Hey, can you make a video that uses an open-source LLM to build a Q&A chatbot for a website page?

  • @tee_iam78 4 months ago

    Thanks!

  • @datascienceandaiconcepts5435 3 months ago

    Nice explanation.

  • @fintech1378 8 months ago +1

    How new is this Giskard library?

  • @not_amanullah 8 months ago

    Thanks ❤

  • @pratheekbabu272 4 months ago

    Can you do one using Gemini Pro?

  • @JTMoustache 8 months ago

    LangChain sucks