Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools (Paper Explained)

  • Published: 28 Jun 2024
  • #rag #hallucinations #legaltech
    An in-depth look at a recent Stanford paper examining the degree of hallucinations in various LegalTech tools that incorporate LLMs.
    OUTLINE:
    0:00 - Intro
    1:58 - What are legal research tools and how are large language models used by them?
    5:30 - Overview and abstract of the paper
    9:29 - What is a hallucination and why do they occur?
    15:45 - What is retrieval augmented generation (RAG)?
    25:00 - Why LLMs are a bad choice when reasoning is involved
    29:16 - The products that were tested
    32:00 - Some shady practices by the researchers in the back and forth with the legal research companies
    37:00 - Legal technology companies’ marketing claims to eliminate or solve hallucination risk
    45:27 - Researchers' evaluation of RAG for legal research and the requirement of specialized education to use the research tools
    55:27 - How the researchers propose to measure accuracy and the problems of measuring accuracy
    1:09:20 - Researchers' conclusion
    Paper: arxiv.org/abs/2405.20362
    Abstract:
    Legal practice has witnessed a sharp rise in products incorporating artificial intelligence (AI). Such tools are designed to assist with a wide range of core legal tasks, from search and summarization of caselaw to document drafting. But the large language models used in these tools are prone to "hallucinate," or make up false information, making their use risky in high-stakes domains. Recently, certain legal research providers have touted methods such as retrieval-augmented generation (RAG) as "eliminating" (Casetext, 2023) or "avoid[ing]" hallucinations (Thomson Reuters, 2023), or guaranteeing "hallucination-free" legal citations (LexisNexis, 2023). Because of the closed nature of these systems, systematically assessing these claims is challenging. In this article, we design and report on the first preregistered empirical evaluation of AI-driven legal research tools. We demonstrate that the providers' claims are overstated. While hallucinations are reduced relative to general-purpose chatbots (GPT-4), we find that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate between 17% and 33% of the time. We also document substantial differences between systems in responsiveness and accuracy. Our article makes four key contributions. It is the first to assess and report the performance of RAG-based proprietary legal AI tools. Second, it introduces a comprehensive, preregistered dataset for identifying and understanding vulnerabilities in these systems. Third, it proposes a clear typology for differentiating between hallucinations and accurate legal responses. Last, it provides evidence to inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains a central open question for the responsible integration of AI into law.
    Authors: Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning, Daniel E. Ho
    Links:
    Homepage: ykilcher.com
    Merch: ykilcher.com/merch
    YouTube: / yannickilcher
    Twitter: / ykilcher
    Discord: ykilcher.com/discord
    LinkedIn: / ykilcher
    If you want to support me, the best thing to do is to share out the content :)
    If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
    SubscribeStar: www.subscribestar.com/yannick...
    Patreon: / yannickilcher
    Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
    Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
    Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
    Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
  • Science

Comments • 86

  • @AaronKaufmann-v3x
    @AaronKaufmann-v3x 23 hours ago +9

    Virtually hallucination-free AI is totally doable. In digital marketing you have Lemon AI with AI Reports, which almost never hallucinates

  • @Dogo.R
    @Dogo.R 2 days ago +19

    Technically correct but made to deceive; Apple marketers are bleeding edge at that technique.

  • @serta5727
    @serta5727 2 days ago +9

    It would need something like git versioning with commits in order to let the LLM see the history of the relevant laws and the changes that were made to them.
    It would be like retrieval, but with the past included
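    A rough sketch of that idea (all names here are hypothetical, not from any existing product): store each amendment as a dated revision, then retrieve the full revision history up to a cutoff date, so the prompt includes the past and not just the latest text.

    from dataclasses import dataclass

    @dataclass
    class StatuteRevision:
        statute_id: str
        version: int          # monotonically increasing, like a commit number
        effective_date: str   # ISO date the amendment took effect
        text: str

    def retrieve_with_history(corpus, statute_id, as_of_date):
        # Return every revision of the statute up to the given date, oldest first,
        # so a RAG prompt can include the amendment history, not just the latest text.
        revisions = [r for r in corpus
                     if r.statute_id == statute_id and r.effective_date <= as_of_date]
        return sorted(revisions, key=lambda r: r.version)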

    • @ACLozMusik
      @ACLozMusik 2 days ago +1

      Documentation but with time-dimension

    • @clray123
      @clray123 1 day ago +1

      It has always shocked me that when important legal texts are altered, there is no publicly available record of where each line and word of legal text originated. Yes, you can see the new version of a law, but how interesting would it be to see that Mr. X edited that line or that Lobby Organisation Y contributed that very paragraph. As we know from source code version control, it is technically very possible, even trivial, to track such information, but I'm afraid that our dear lawmakers have zero interest in such levels of accountability and transparency.

  • @Mordenor
    @Mordenor 2 days ago +6

    Thank you Yannic for sharing your insights on legal language models.

  • @gregsLyrics
    @gregsLyrics 2 days ago +3

    perfect timing! Full of wisdom.

  • @gokhanh.himmetoglu1740
    @gokhanh.himmetoglu1740 2 days ago +6

    Thanks for explaining RAG in such a simple form. Can you make a video that explains various other approaches that attempt to solve hallucination?

    • @RoulDukeGonzo
      @RoulDukeGonzo 2 days ago

      I don't know any others... There are 1000 flavours of RAG, my favourite being RAPTOR.
      Perhaps multi-agent systems? Or ICL+?

  • @Alistair
    @Alistair 2 days ago +22

    congrats on 256K subs!

  • @Anonymous-lw1zy
    @Anonymous-lw1zy 2 days ago +12

    At 39:00 - lawyers scamming lawyers with technically correct wording that misleadingly advertises, seeks to contractually entrap, and thereby defraud. Hilarious!!!
    And thanks for pulling the rug out from RAG's frequently outrageous claims.
    Well deserved 256k!

    • @hieroben
      @hieroben 1 day ago

      I would argue that the claim of LexisNexis is not even "technically correct." They assert that their responses are "grounded" in trustworthy documents, which can obviously not be the case if the system hallucinates.

    • @MacGuffin1
      @MacGuffin1 1 day ago

      RAG-pull?

    • @clray123
      @clray123 1 day ago

      @@hieroben The "those responses" phrase has no valid semantic connection to the previous things mentioned in the same sentence ("legal citations" or "source documents"). It is just slimy marketing mumbo jumbo not worth paying any attention to. But you can bet that such subtleties will fly over the heads of many lawyers, with all their love for logic and precise language.

  • @marinepower
    @marinepower 2 days ago +6

    LLMs hallucinate because they have zero grounding, but also because they are trained with next-token prediction. In essence, this means that whatever is in the context is maximally correct -- including whatever crap the LLM itself generated. This is why, if your context is 'does 3+5=9? Yes, 3+5 equals 9 because', then it's trivial to see that the model will hallucinate, because it must match the existing context. I don't know what you were talking about with LLMs not being able to build a world model, being 'just statistical' models, etc. Humans are also statistical models with a lifetime of training data; that has nothing to do with hallucinations.
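    As a small illustration of that point (a sketch assuming the Hugging Face transformers library and the small open GPT-2 model, not any of the legal tools discussed here): once a false premise is already in the context, greedy next-token prediction tends to keep justifying it rather than correct it.

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    prompt = "Does 3+5=9? Yes, 3+5 equals 9 because"
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    # The model typically continues the false premise with a made-up justification
    # instead of correcting it, because it is trained to match the existing context.
    print(out[0]["generated_text"])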

    • @doppelrutsch9540
      @doppelrutsch9540 2 days ago +1

      That is both kind of true and not true in practice. A lot of gen AI tools these days are not LLMs in the strict sense of the word: after receiving additional training like RLHF or related techniques, the distribution of the output is significantly shifted from the simple "most likely" continuation. You can observe this very strongly with Claude, for example, which will sometimes correct itself when it makes a mistake in its output even without human prompting - and of course the classic "answering this question would be unethical or dangerous" refusal responses.

    • @marinepower
      @marinepower 2 days ago

      @@doppelrutsch9540 I suppose RLHF with the model making mistakes and then backpedaling could very well solve what I'm describing. Historically, no one ever trained on models making mistakes and then correcting them; instead we would simply train on 'clean' data. But, I suppose, if a mask was applied such that the training loss was only applied to the correction tokens, with the actual bad data tokens masked (so that the model wouldn't learn to generate bad data, only the corrections to the bad data), then it could work.

    • @doppelrutsch9540
      @doppelrutsch9540 1 day ago

      @@marinepower I wasn't describing a hypothetical; models *do* work like that and have for years. And they are also trained on data that is purposely incorrect with corrections added later, I am quite sure.

    • @clray123
      @clray123 1 day ago +1

      There is no law that the model "must match the existing context". But when training on the next token prediction task, models "discover" by themselves that copying from context more often than not minimizes the average loss across batch and they develop structures suitable to perform such copying ("induction heads" - in small models it can be even empirically observed when they develop during training and they can be ablated away).
      You could construct your training data in such a way that copying would not be an effective loss-minimizing strategy. I think the bigger problem is the "average" part in minimizing average loss. A training outcome where you are slightly wrong on each token is arithmetically the same as an outcome where you are very right on many tokens and very wrong on others. Add to that the fact that with cross-entropy loss only the probability of the correct token is considered in the loss calculation - and maximizing this probability can also raise the probability of similar (but incorrect) tokens as a side effect (as mentioned in the ORPO paper).
      So overall, the impression is that our easy-to-compute loss function is crap, but it is hard to devise better ones (especially differentiable and efficient ones). That is where the PPO/RLHF algorithms try to compensate, but the later devised DPO/ORPO/etc. show that they do not really do such a great job either (because you can obtain similar "alignment" quality using these simpler non-PPO algorithms - what seems like an achievement really just highlights how the original approach "also sucks").
      It could be that the core problem of "rare truths" is just not possible to represent well using statistical distributions. Imagine a training dataset in which you have a million sequences which end in one way (and that's correct for them), but just one sequence which ends in a different way, based on some slight difference in the constituent tokens. How exactly do you teach a statistical model to pay attention to that one example and please disregard the mountain of what looks like "contrary" evidence opposing it? Possibly by assigning a million times the weight to that one example, but then, if you are forced to apply such tricks and think about exceptions yourself, what good is machine learning? Why not just write an if-else rule or a database of cases if you already know what to look for?
      I think the public is slowly coming to the realization that there is no free lunch in AI (probably something which seasoned ML researchers have known all along, but they certainly have been quite silent about it recently because of a huge conflict of interest).
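      To make the "average loss" point concrete, here is a toy PyTorch sketch (numbers are illustrative only): one badly wrong prediction on a rare case barely moves the batch-averaged cross-entropy when the common cases are predicted well.

      import torch
      import torch.nn.functional as F

      # Toy next-token predictions over a 5-token vocabulary.
      logits = torch.tensor([
          [4.0, 0.0, 0.0, 0.0, 0.0],   # common case, correct target 0
          [4.0, 0.0, 0.0, 0.0, 0.0],   # common case, correct target 0
          [4.0, 0.0, 0.0, 0.0, 0.0],   # common case, correct target 0
          [4.0, 0.0, 0.0, 0.0, 0.0],   # rare exception: target is 4, model still predicts 0
      ])
      targets = torch.tensor([0, 0, 0, 4])

      per_token = F.cross_entropy(logits, targets, reduction="none")
      print(per_token)         # roughly [0.07, 0.07, 0.07, 4.07]: one large error
      print(per_token.mean())  # about 1.07: the batch average dilutes the single large error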

    • @marinepower
      @marinepower 1 day ago

      ​@@clray123 I do think the next-token prediction training regime is sort of ass. It is nice in the sense that (with temporal masking), it essentially allows you to train each token as an independent sample (sort of), so it's parallelizable, but there can definitely be improvements made.
      One thing that could maybe work is diffusion-based transformer models. In essence, instead of autoregressive token generation, you generate an entire sentence at a time (via iterative diffusion). Basically, you first train a CLIP-like model where you have a corpus of sentences, you use an LLM to reword said sentences while keeping the same semantic meaning, then train a model such that the original and reworded sentences maximize their cosine similarity between each other and minimize the cosine similarity between other sentences. Then, during training, you noise your input embedding (instead of using a one-hot encoding representing just that token), and pass the thing through your transformer decoder. The final loss is signal_strength * logit_loss + (1 - signal_strength) * similarity_loss. In essence, in low signal regimes, we want our model to predict a sentence that is semantically similar (even if it's not using the same tokens), whereas, when the noise is low we want it to predict the actual tokens as to what is in our training set.
      I haven't thought too deeply about this so maybe there's some sort of crucial issue with this methodology but I think it makes sense.
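      A rough sketch of the blended objective described above (all names, such as signal_strength and the sentence embeddings, follow the commenter's idea and are hypothetical, not an existing method):

      import torch
      import torch.nn.functional as F

      def blended_loss(logits, target_tokens, pred_sentence_emb, target_sentence_emb, signal_strength):
          # Token-level term, as in ordinary language-model training.
          logit_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       target_tokens.reshape(-1))
          # Semantic term: 1 minus cosine similarity between sentence embeddings.
          similarity_loss = 1.0 - F.cosine_similarity(pred_sentence_emb,
                                                      target_sentence_emb, dim=-1).mean()
          # At high noise (low signal) favor semantic similarity; at low noise favor exact tokens.
          return signal_strength * logit_loss + (1.0 - signal_strength) * similarity_loss

      # Example with random tensors: batch 2, sequence 8, vocab 100, embedding dim 32.
      logits = torch.randn(2, 8, 100)
      targets = torch.randint(0, 100, (2, 8))
      pred_emb, tgt_emb = torch.randn(2, 32), torch.randn(2, 32)
      print(blended_loss(logits, targets, pred_emb, tgt_emb, signal_strength=0.3))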

  • @ryanengel7736
    @ryanengel7736 12 hours ago

    Yannic you are a great academic and youtuber. I appreciate your videos, and as an NLP graduate student myself, I find your content intellectual and entertaining. Keep it up man

  • @luisluiscunha
    @luisluiscunha 1 day ago

    I remember your 2017 review of "Attention Is All You Need", first on YouTube, then on my run after making an mp3 of it. Good memories.

  • @adfaklsdjf
    @adfaklsdjf 2 days ago +1

    22:43 To be clear, when we say "paste references", ChatGPT and Claude can't retrieve the contents of URLs, so a list of URLs as references doesn't work with those afaik. I paste the full text of the material I want it to use.

  • @alan2here
    @alan2here 1 day ago +1

    Useful practical comparisons/tests between models:
    Typical questions.
    Well-crafted questions with lots of prompt crafting, clearing the context window where needed, looking up and providing some material yourself, etc…
    Sloppy questions.
    Questions about cases that involve domain-specific knowledge outside of law, such as number theory, cryptography, or structural engineering.
    Questions about gruesome cases to test for refusals.
    Large and small context window utilisation.
    Easy vs. hard questions.
    Questions that are controversial or where no case law exists yet.
    And such…

  • @JoanFiguerolaHurtado
    @JoanFiguerolaHurtado 2 days ago

    From personal experience building AskPandi, the issue with hallucinations is more about lack of data rather than the model itself (assuming great QA capabilities). It's equally true that most QA models are not trained to say "I don't know", which complicates things too...

  • @luke2642
    @luke2642 2 days ago +1

    Interesting video. How many years are we away from building a complete, semantic, logical knowledge graph of all legal precedents based on a large set of documents?

  • @alan2here
    @alan2here 1 day ago +1

    "Lex" is excellent at including the human in the loop.

  • @josephmacdonald1255
    @josephmacdonald1255 1 day ago

    Yannic has provided a fairly good summary of the risk. I may have missed him saying it, but in my opinion, having LLMs generate different answers to identical prompts at different times, when none of the facts or rules have changed, exponentially increases the risk of a negligence claim. There is also the issue of rulings and legislation being superseded by later legislation and later rulings. I also believe "pertinent" is more accurate than "relevant".

  • @florianhoenicke4858
    @florianhoenicke4858 1 day ago

    You said something like "don't use LLMs for reasoning" and I agree that you need a human in the loop. But I also know from experience that GPT-4 can be used for reasoning if I do a lot of handholding and split the task into smaller reasoning tasks.

  • @InstaKane
    @InstaKane 2 days ago

    Nice, feel like I’m up to speed on this topic at a high level, cheers

  • @arowindahouse
    @arowindahouse 2 days ago +1

    In non-English-speaking countries, we tend to say RAG to avoid the mouthful.

  • @alan2here
    @alan2here 1 day ago +1

    Requests like "help me with this specific neurology research" or "here's a vague half-remembered thing, what's the correct terminology for it" ⭐️⭐️⭐️⭐️💫. Requests like "[Baba is You level 7 screenshot] let's work step by step to solve it" ⭐️

  • @xaxfixho
    @xaxfixho 2 days ago +3

    5:00 Everyone is shady? Well, if you take off those sunglasses INDOORS, it might help.

  • @BenoitStPierre
    @BenoitStPierre 2 days ago +3

    Have we tried putting all the legal textbooks necessary to read to become a lawyer in-context yet? Maybe they can be post-trained to learn the definitions and the approach?

    • @RoulDukeGonzo
      @RoulDukeGonzo 2 days ago +4

      That would be worth doing, and it would give you simple facts, but as Yannic says, you need to think (reason) over the facts. For sure, in some form the whole background should be used, either ICL+ or a foundation model.

    • @andytroo
      @andytroo 2 days ago +3

      The problem is that the complete body of case law (even the potentially relevant case law) can be bigger than the context window of the LLM, and anything past that is only statistically stored in the dot products between idea-association vectors in the internal weights. Even within large context windows, models answer statistically and are only "often" right.

    • @clray123
      @clray123 1 day ago

      @@andytroo ehh your context window is also "just" a bunch of vectors

  • @tantzer6113
    @tantzer6113 9 hours ago

    Eliminating passages that are backed up by hallucinated citations/links seems like a useful improvement even if the things being cited may still be hallucinated. Personally, I think that Lexis's marketing was accurate and fair. A customer who cares about the issue and who is intelligent will pay attention to the words used to describe the product.

  • @alan2here
    @alan2here 1 day ago +1

    ChatGPT already RAGs when it thinks it needs to, with web searches, analysis steps, and such.

  • @RobertSobieski
    @RobertSobieski 1 day ago

    Conclusion is the best, so surprising 😎

  • @scottmiller2591
    @scottmiller2591 2 days ago

    I remember when Lexis/Nexis was just a keyword database for patents, with the "(a OR NOT b) c"-type query language. It looks like it's aspirationally come a long way, but it does seem they've forgotten (or lost) the knowledge of where they came from (database management), and are dazzled by the Eye of AI.

  • @zeeshanahmed5624
    @zeeshanahmed5624 2 days ago

    Hi, love your videos on LLMs and hallucinations - I'm new to machine learning (doing a masters in CS) so I find this channel very useful.
    I understand that LLMs aren't designed specifically for reasoning, hence we shouldn't expect them to perform well in QA-like tasks.
    So what are LLMs fundamentally designed for, then?

  • @andytroo
    @andytroo 2 days ago

    Lexis: delivering 99.6% hallucination-free advice, connected by citations to 100% hallucination-free citations.

  • @covertassassin1885
    @covertassassin1885 2 days ago +1

    Are you at AI Engineer World's Fair in SF right now by any chance?

  • @gody7334
    @gody7334 14 hours ago

    Fine-tuning on lots of domain-specific documents might help improve performance??

  • @theprofessionalfence-sitter
    @theprofessionalfence-sitter 2 days ago

    I wonder whether some sort of transformer model built to move an embedding of the retrieved results towards what is hopefully an embedding of the answer guided by the question would work better for retrieval augmented generation.
    I feel like one major problem with using LLMs in this context is that they are also trained to answer questions by themselves, rather than having to stick to the retrieved results, which could make them more likely to produce hallucinations.

    • @RoulDukeGonzo
      @RoulDukeGonzo 2 days ago

      There are some papers on RAG vs. fine-tuning. The balance of relevant info in the pre-training / fine-tuning / RAG database is critical.

  • @henrythegreatamerican8136
    @henrythegreatamerican8136 1 day ago

    Yes, they hallucinate often. I type the same prompt into Claude, ChatGPT, and PerplexityAI. More often than not I'll get different responses. But then I play them all against each other by repeating what the others said in a follow-up prompt. Eventually, they'll all agree, but sometimes they agree on the WRONG ANSWER!!!!
    And the problem with Claude is it doesn't have memory. So if you close the website and return to ask the same question, it will repeat the same wrong response it gave you the first time, without any reference to your follow-up questions.

  • @errorbool531
    @errorbool531 6 hours ago

    Like your spicy comments 🔥 It hurts but it is true.

  • @jaakko3083
    @jaakko3083 2 days ago

    What about finetuning instead of using RAG?

  • @wolpumba4099
    @wolpumba4099 2 days ago +22

    *Summary*
    This video essay critiques a Stanford paper that examined the reliability of AI-powered legal research tools.
    *Here's a breakdown:*
    * *0:00* - *The Problem:* AI legal research tools, which use LLMs (like ChatGPT) to answer legal questions based on public data, are prone to "hallucinations" - generating inaccurate or misleading information.
    * *15:45* - *The Claim:* Some legal tech companies advertise their products as "hallucination-free" or as "eliminating" hallucinations because they use RAG (retrieval-augmented generation).
    * *15:45* - *RAG Explained:* RAG enhances LLMs by incorporating a search engine to fetch relevant documents alongside user queries. It essentially allows the LLM to "refer to notes" while answering. (A code sketch of this loop follows this summary.)
    * *29:16* - *The Study:* The paper's authors tested popular legal research tools (Lexis+ AI, Westlaw AI, Casetext), comparing them to GPT-4 with and without RAG.
    * *~30:00* - *The Findings:* While RAG did reduce hallucinations compared to a standalone LLM, these tools still hallucinated between 17% and 33% of the time.
    * *32:00* - *Critique of the Study:*
      * *32:00* - *Shady Practices:* Kilcher argues both the researchers and the companies engaged in shady behavior. The researchers allegedly evaluated the wrong product (Practical Law instead of Westlaw) and only gained access to the correct tool after publishing their findings.
      * *55:27* - *Misleading Metrics:* The paper's definition of "hallucination" is criticized, as is its choice of metrics, which might misrepresent the tools' capabilities.
    * *~49:00* - *Kilcher's Main Argument:*
      * *LLMs are not designed for complex reasoning tasks.* Expecting them to solve intricate legal questions end-to-end is unrealistic.
      * *The focus should be on human-AI collaboration.* Combining the strengths of search engines, LLMs, and human legal expertise is a more effective approach.
    * *1:09:20* - *Conclusion:*
      * AI legal research tools are not a replacement for human lawyers.
      * Users should always verify the information generated by these tools.
    *Key Takeaway:* While AI can be a valuable tool in legal research, it's crucial to be aware of its limitations and use it responsibly. Overhyping AI capabilities and neglecting human oversight can lead to flawed conclusions.
    I used Gemini 1.5 Pro to summarize the transcript.
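    A minimal sketch of the RAG loop from the "RAG Explained" bullet above. Here search_caselaw and llm_complete are hypothetical placeholders for a retriever and an LLM call; the commercial tools in the paper add ranking, citation checking, and post-filtering on top of a skeleton like this.

    def answer_with_rag(question, search_caselaw, llm_complete, top_k=5):
        # Retrieval step: fetch candidate documents for the query.
        documents = search_caselaw(question, top_k=top_k)
        # Stuff the retrieved sources into the prompt ("refer to notes").
        context = "\n\n".join(doc["text"] for doc in documents)
        prompt = (
            "Answer the legal question using ONLY the sources below. "
            "Cite the source you rely on, and say 'I don't know' if the sources are insufficient.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        # Generation step: the LLM answers conditioned on the retrieved sources.
        return llm_complete(prompt)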

    • @marcoramponi8462
      @marcoramponi8462 2 days ago +3

      This summary is NOT hallucination-free

    • @wolpumba4099
      @wolpumba4099 2 days ago +2

      @@marcoramponi8462 Where? I will make a note

  • @gr8ape111
    @gr8ape111 2 days ago

    congrats on 2^8 * 1000 subs

  • @Alex-gc2vo
    @Alex-gc2vo 2 days ago

    so what you're telling me is LLMs would benefit from Mixture Density layers

  • @adamholter1884
    @adamholter1884 2 days ago

    What about Lamini Memory Tuning?

    • @patrickl5290
      @patrickl5290 2 days ago

      Is this publicly accessible yet? Seems so cool

    • @adamholter1884
      @adamholter1884 2 days ago

      Not open source, but you can get it if you pay them money, I believe. There's a Lamini x Meta deal for Llama.

  • @andytroo
    @andytroo 2 days ago

    How do you find good documents :D
    Sounds like you need an LLM to evaluate the relevancy of all the documents...

  • @bjarke7886
    @bjarke7886 1 day ago

    ESM3!

  • @agentds1624
    @agentds1624 2 days ago +1

    36:36 shots fired

  • @BuFu1O1
    @BuFu1O1 2 days ago +1

    my neurons are firing

  • @hasko_not_the_pirate
    @hasko_not_the_pirate 1 day ago

    As the old saying goes: If you only have a hammer, every problem looks like a nail.

  • @JumpDiffusion
    @JumpDiffusion 1 day ago

    Lots of strong claims/language ("garbage", "that's not what these models are for") without any arguments to back them up...

  • @user-iv8fq5zl9o
    @user-iv8fq5zl9o 2 days ago

    As a recent law graduate I’m siding with you over the academics and lawyers lmao

  • @hitechconnect
    @hitechconnect 1 day ago

    Thanks for explaining this paper to all. So you did not actually evaluate this yourself? The fact that these tools are better than ChatGPT shows that they have improved, which they did using e.g. RAG but also other technologies. My strong guess is that over time these tools will get better and better, and who knows how close they will get to 100% correct answers. I strongly suppose DeepJudge cannot claim 100% accuracy either. Can you share where you are there? I agree, though, that claiming so is a bad business practice.

  • @gileneusz
    @gileneusz 2 days ago

    Some claim that grokking is better

    • @RoulDukeGonzo
      @RoulDukeGonzo 2 days ago +1

      Never been done on a big corpus... Would be great if it can be done.

  • @JoeTaber
    @JoeTaber 2 days ago

    Not that Lexis is deserving, but a charitable reading of that marketing spiel could see it as an answer to that recent case where a lawyer who didn't know what he was doing tried to use GPT-4 to find case law and it completely hallucinated references multiple times. I think they ended up getting fined and reprimanded, besides being rather embarrassing.
    I'm neither a lawyer nor ML researcher, just a regular old programmer, but this spiel looks to be targeted at those lawyers who are largely put off of AI tools altogether by this case and don't really understand/care if it's basically the same as current search techniques as long as it avoids this one big (in their eyes) flaw.

  • @samdirichlet7500
    @samdirichlet7500 1 day ago

    The whining about how private companies won't share access for free reminds me of dealing with a junior engineer who expects to be handed an X and y matrix before starting work on any project.
    On another note, I'm writing AI to predict the ideal structure for better aerospace materials. Why won't Alcoa give me access to their proprietary database of alloy properties?

  • @darshanpandit
    @darshanpandit 13 hours ago

    Christopher Manning is a co-author. I am pretty sure you are missing something fundamental. 😂

  • @wwkk4964
    @wwkk4964 2 days ago

    First!

  • @nebson
    @nebson 11 hours ago

    "you have no right" to pressure these powerful companies into submitting their platforms for scientific evaluation? Worst take.

  • @zhandanning8503
    @zhandanning8503 2 days ago +4

    Here is me trying to be smart now, but does anyone else think GenAI should be shortened to GAI? and then the pronunciation should be close to similar words like "hail" or "snail" but "GAI" and capped for shouting it as well

  • @billxu9799
    @billxu9799 16 hours ago

    cringe cringe

  • @RoulDukeGonzo
    @RoulDukeGonzo 2 days ago

    First!

  • @AirSandFire
    @AirSandFire 2 days ago +1

    First!