This is the voice of a man who's been trying to explain these exact technology problems to customers and stakeholders for months.
perfect timing! Full of wisdom.
Yannic, you are a great academic and YouTuber. I appreciate your videos, and as an NLP graduate student myself, I find your content intellectual and entertaining. Keep it up, man
Thank you for this insightful video, which indeed shows the challenges we face in Legal AI. A while ago, I did my own research on LLMs in Law and came to very similar conclusions, except that at the time I was not aware of "Retrieval-Augmented Generation". LLMs in Law currently do require a human in the loop.
I presented my work at ALP 2023: Workshop on AI, Law and Philosophy at the JURIX conference, and we had some debate on this, where many lawyers agreed that "Justifiability" is currently a good way to go.
The paper's title is: "Justifiable artificial intelligence: Engineering large language models for legal applications".
In this, I also look at the so-called "emergent capabilities" of LLMs. It is almost entertaining to see the ongoing debate on whether LLMs can perform legal reasoning, with papers referencing and debunking each other's claims.
congrats on 256K subs!
Thank you Yannic for sharing your insights in legal language models.
I remember your 2017 revision of the Attention is All you Need, first on RUclips, then on my run, after making an mp3 of it. Good memories.
Thanks for explaining the RAG in a such simple form. Can you make a video that explains various other approaches that attempt to solve hallucination ?
I don't know any others... There are 1000 flavours of RAG, my favourite being RAPTOR.
Perhaps multi agent systems? Or icl+?
The section you highlighted in the paper as defining "hallucinations" is a general introduction and a reference to another paper. The working definition of "hallucination" for this research is much more nuanced and is defined in section 4.3 of the paper: "We now adopt a precise definition of a hallucination in terms of the above variables. A response is considered hallucinated if it is either incorrect or misgrounded. In other words, if a model makes a false statement or falsely asserts that a source supports a statement, that constitutes a hallucination".
You would need something like Git versioning with commits, so the LLM can see the history of the relevant laws and the changes that were made to them.
It would be like retrieval, but with the past included.
Documentation but with time-dimension
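To make the idea above concrete, here is a minimal sketch of "retrieval with the past included": each statute section stored as a series of dated versions, commit-style, so retrieval can return both the text in force at a given date and the change history. The LawVersion record and law_history store are hypothetical, not any real system's API.

```python
# Sketch only: versioned law sections with commit-style history.
from dataclasses import dataclass
from datetime import date

@dataclass
class LawVersion:
    effective: date      # when this wording came into force
    text: str            # full wording of the section at that time
    change_note: str     # who/what changed it (the "commit message")

# toy history of one statute section, oldest first
law_history = {
    "sec_12": [
        LawVersion(date(2015, 1, 1), "Original wording of section 12.", "enacted"),
        LawVersion(date(2021, 7, 1), "Amended wording of section 12.", "amendment act 2021"),
    ]
}

def version_in_force(section: str, on: date) -> LawVersion:
    """Return the latest version whose effective date is on or before `on`."""
    versions = [v for v in law_history[section] if v.effective <= on]
    if not versions:
        raise ValueError(f"{section} did not exist on {on}")
    return max(versions, key=lambda v: v.effective)

def history_for_prompt(section: str) -> str:
    """Render the full change history so an LLM sees past wordings, not just the current one."""
    lines = [f"{v.effective}: {v.change_note}\n{v.text}" for v in law_history[section]]
    return "\n\n".join(lines)

print(version_in_force("sec_12", date(2018, 6, 1)).text)
print(history_for_prompt("sec_12"))
```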
It has always shocked me that when important legal texts are altered, there is no publicly available record of where each line and word of legal text originated. Yes, you can see the new version of a law, but how interesting would it be to see that Mr. X edited that line or that Lobby Organisation Y contributed that very paragraph. As we know from source code version control, it is technically very possible, even trivial, to track such information, but I'm afraid that our dear lawmakers have zero interest in such levels of accountability and transparency.
@@clray123 It just works "fine" as it is and is too complicated to enhance regardless of the benefits; also, lawyers need to eat, I guess
Superb!! I think every answer from an LLM should pass through a legal expert where critical reasoning and thinking are needed.
Technically correct but made to deceive; Apple marketers are bleeding-edge at that technique.
Easier to fool than to convince that Apple is fooling their customers ;)
LLMs hallucinate because they have zero grounding, but also because they are trained with next-token prediction. In essence, this means that whatever is in the context is treated as maximally correct -- including whatever crap the LLM itself generated. This is why, if your context is 'does 3+5=9? Yes, 3+5 equals 9 because', then it's trivial to see that the model will hallucinate, because it must match the existing context. I don't know what you were talking about with LLMs not being able to build a world model, being 'just statistical' models, etc. Humans are also statistical models with a lifetime of training data; that has nothing to do with hallucinations.
That is both kind of true and not true in practice. A lot of gen AI tools these days are not LLMs in the strict sense of the word: after receiving additional training like RLHF or related techniques, the distribution of the output is significantly shifted from the simple "most likely" continuation. You can observe this very strongly with Claude, for example, which will sometimes correct itself when it makes a mistake in its output even without human prompting - and of course the classic "answering this question would be unethical or dangerous" refusal responses.
@@doppelrutsch9540 I suppose RLHF with the model making mistakes and then backpedaling could very well solve what I'm describing. Historically, no one ever trained on models making mistakes and then correcting them; instead, we would simply train on 'clean' data. But, I suppose, if a mask was applied such that the training loss was only applied to the correction tokens, with the actual bad-data tokens masked (so that the model wouldn't learn to generate bad data, only the corrections to the bad data), then it could work.
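A minimal sketch of the masking idea described above, using PyTorch's standard ignore_index mechanism; the tensors are toy stand-ins for real model logits and token ids, not a training recipe.

```python
# Sketch: compute next-token loss only on "correction" tokens, ignore the bad ones.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 8
logits = torch.randn(1, seq_len, vocab_size)          # model outputs for one sequence
labels = torch.randint(0, vocab_size, (1, seq_len))   # target next tokens

# suppose positions 0-4 hold the intentionally wrong statement and 5-7 the correction
is_correction = torch.tensor([[0, 0, 0, 0, 0, 1, 1, 1]], dtype=torch.bool)
masked_labels = labels.clone()
masked_labels[~is_correction] = -100                  # ignored by cross_entropy

loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    masked_labels.view(-1),
    ignore_index=-100,
)
print(loss)  # gradient only flows through the correction positions
```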
@@marinepower I wasn't describing a hypothetical, models *do* work like that and have for years. And they are also trained on data that is purposefully incorrect with corrections added afterwards, I am quite sure.
There is no law that the model "must match the existing context". But when training on the next-token prediction task, models "discover" by themselves that copying from context more often than not minimizes the average loss across a batch, and they develop structures suitable to perform such copying ("induction heads" - in small models it can even be empirically observed when they develop during training, and they can be ablated away).
You could construct your training data in such a way that copying would not be an effective loss-minimizing strategy. I think the bigger problem is the "average" part in minimizing average loss. A training outcome where you are slightly wrong on each token is arithmetically the same as an outcome where you are very right on many tokens and very wrong on others. Add to that that with cross-entropy loss only the probability of the correct token is considered in the loss calculation - and maximizing this probability can also raise the probability of similar (but incorrect) tokens as a side effect (as mentioned in the ORPO paper).
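A tiny illustration of the point about cross-entropy only looking at the correct token: two very different distributions over the wrong tokens give exactly the same loss as long as the correct token keeps the same probability mass. Numbers are made up for the example.

```python
# Sketch: cross-entropy is blind to how the wrong-token mass is distributed.
import torch

p_correct = 0.4
# distribution A: remaining 0.6 spread evenly over 3 wrong tokens
dist_a = torch.tensor([p_correct, 0.2, 0.2, 0.2])
# distribution B: remaining 0.6 dumped on one plausible-but-wrong token
dist_b = torch.tensor([p_correct, 0.6, 0.0, 0.0])

target = 0  # index of the correct token
loss_a = -torch.log(dist_a[target])
loss_b = -torch.log(dist_b[target])
print(loss_a, loss_b)  # identical losses
```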
So overall, the impression is that our easy-to-compute loss function is crap, but it is hard to devise better ones (especially differentiable and efficient ones). That is where the PPO/RLHF algorithms try to compensate, but the later devised DPO/ORPO/etc. show that they do not really do such a great job either (because you can obtain similar "alignment" quality using these simpler non-PPO algorithms - what seems like an achievement really just highlights how the original approach "also sucks").
It could be that the core problem of "rare truths" is just not possible to represent well using statistical distributions. Imagine a training dataset in which you have a million sequences which end in one way (and that's correct for them), but just one sequence which ends in a different way, based on some slight difference in the constituent tokens. How exactly do you teach a statistical model to pay attention to that one example and please disregard the mountain of what looks like "contrary" evidence opposing it? Possibly by assigning a million times the weight to that one example, but then, if you are forced to apply such tricks and think about exceptions by yourself, then what good is machine learning - why not just write an if-else rule or a database of cases if you already know what to look for?
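For what it's worth, the "assign a million times the weight" trick is just per-example loss weighting; a purely illustrative sketch, with toy tensors standing in for a real batch.

```python
# Sketch: per-example weights on a standard cross-entropy loss.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # 4 examples, 10-class toy problem
targets = torch.tensor([1, 1, 1, 7])        # the last example is the rare exception
weights = torch.tensor([1.0, 1.0, 1.0, 1e6])

per_example = F.cross_entropy(logits, targets, reduction="none")
loss = (weights * per_example).sum() / weights.sum()
print(loss)  # the rare example now dominates the gradient, for better or worse
```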
I think the public is slowly coming to the realization that there is no free lunch in AI (probably something which seasoned ML researchers have known all along, but they certainly have been quite silent about it recently because of a huge conflict of interest).
@@clray123 I do think the next-token prediction training regime is sort of ass. It is nice in the sense that (with temporal masking), it essentially allows you to train each token as an independent sample (sort of), so it's parallelizable, but there can definitely be improvements made.
One thing that could maybe work is diffusion-based transformer models. In essence, instead of autoregressive token generation, you generate an entire sentence at a time (via iterative diffusion). Basically, you first train a CLIP-like model: take a corpus of sentences, use an LLM to reword those sentences while keeping the same semantic meaning, then train a model such that the original and reworded sentences maximize their cosine similarity with each other and minimize their cosine similarity with other sentences. Then, during training, you noise your input embedding (instead of using a one-hot encoding representing just that token) and pass the thing through your transformer decoder. The final loss is signal_strength * logit_loss + (1 - signal_strength) * similarity_loss. In essence, in low-signal regimes we want our model to predict a sentence that is semantically similar (even if it's not using the same tokens), whereas when the noise is low we want it to predict the actual tokens that are in our training set.
I haven't thought too deeply about this so maybe there's some sort of crucial issue with this methodology but I think it makes sense.
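A rough sketch of the blended objective described above, under the commenter's own assumptions; all shapes and names are illustrative placeholders, not a tested recipe.

```python
# Sketch: blend a token-level loss with a sentence-embedding similarity loss.
import torch
import torch.nn.functional as F

batch, seq_len, vocab, dim = 2, 16, 1000, 64
logits = torch.randn(batch, seq_len, vocab)       # decoder token predictions
target_tokens = torch.randint(0, vocab, (batch, seq_len))
pred_sentence_emb = torch.randn(batch, dim)       # pooled embedding of the prediction
target_sentence_emb = torch.randn(batch, dim)     # embedding of the (reworded) target

signal_strength = 0.3                             # 1.0 = no noise, 0.0 = pure noise

logit_loss = F.cross_entropy(logits.view(-1, vocab), target_tokens.view(-1))
similarity_loss = 1 - F.cosine_similarity(pred_sentence_emb, target_sentence_emb).mean()

loss = signal_strength * logit_loss + (1 - signal_strength) * similarity_loss
print(loss)  # high noise -> similarity dominates; low noise -> exact tokens dominate
```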
Yannic has provided a fairly good summary of the risk. I may have missed him saying it, but in my opinion, having LLMs generate different answers to identical prompts at different times, when none of the facts or rules have changed, exponentially increases the risk of a negligence claim. There is also the issue of rulings and legislation being superseded by later legislation and later rulings. I also believe "pertinent" is more accurate than "relevant".
Got this link from a colleague. I liked your video a lot! We are on the same wavelength! The "AI" hype is like advertising very narrow-spectrum, intensity-enhanced glasses to see the world in "pink", instead of seeing the world in its beautiful spectrum from red to violet :) I guess someone profits by making the new stock bubble.
22:43 To be clear, when we say "paste references": ChatGPT and Claude can't retrieve the contents of URLs, so a list of URLs as references doesn't work with those, AFAIK. I paste the full text of the material I want it to use.
*Summary*
This video essay critiques a Stanford paper that examined the reliability of AI-powered legal research tools.
*Here's a breakdown:*
* *0:00* - *The Problem:* AI legal research tools, which use LLMs (like ChatGPT) to answer legal questions based on public data, are prone to "hallucinations", i.e. generating inaccurate or misleading information.
* *15:45* - *The Claim:* Some legal tech companies advertise their products as "hallucination-free" or "eliminating" hallucinations due to using RAG (Retrieval Augmented Generation).
* *15:45* - *RAG Explained:* RAG enhances LLMs by incorporating a search engine to fetch relevant documents alongside user queries. It essentially allows the LLM to "refer to notes" while answering (a minimal sketch of this pattern follows the summary).
* *29:16* - *The Study:* The paper's authors tested popular legal research tools (Lexis+ AI, Westlaw AI, Casetext), comparing them to GPT-4 with and without RAG.
* *~30:00* - *The Findings:* While RAG did reduce hallucinations compared to a standalone LLM, these tools still exhibited hallucinations between 17% and 33% of the time.
* *32:00* - *Critique of the Study:*
* *32:00* - *Shady Practices:* Kilcher argues both the researchers and companies engaged in shady behavior. The researchers allegedly evaluated the wrong product (Practical Law instead of Westlaw) and only gained access to the correct tool after publishing their findings.
* *55:27* - *Misleading Metrics:* The paper's definition of "hallucination" is criticized, as is their choice of metrics, which might misrepresent the tools' capabilities.
* *~49:00* - *Kilcher's Main Argument:*
* *~49:00* - *LLMs are not designed for complex reasoning tasks.* Expecting them to solve intricate legal questions end-to-end is unrealistic.
* *~49:00* - *The focus should be on human-AI collaboration.* Combining the strengths of search engines, LLMs, and human legal expertise is a more effective approach.
* *1:09:20* - *Conclusion:*
* AI legal research tools are not a replacement for human lawyers.
* Users should always verify the information generated by these tools.
*Key Takeaway:* While AI can be a valuable tool in legal research, it's crucial to be aware of its limitations and use it responsibly. Overhyping AI capabilities and neglecting human oversight can lead to flawed conclusions.
I used Gemini 1.5 Pro to summarize the transcript.
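A minimal sketch of the RAG pattern from the bullet above: retrieve the top-k documents for the query, paste them into the prompt, and ask the model to answer only from them. The embed stub, the toy corpus, and build_prompt are placeholders for illustration, not any product's API.

```python
# Sketch: retrieve-then-generate with a constrained, citation-demanding prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    # placeholder: a real system would call an embedding model here
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

corpus = {
    "case_a": "Text of case A ...",
    "case_b": "Text of case B ...",
}
doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scored = sorted(doc_vecs, key=lambda d: float(q @ doc_vecs[d]), reverse=True)
    return scored[:k]

def build_prompt(query: str) -> str:
    docs = retrieve(query)
    context = "\n\n".join(f"[{d}] {corpus[d]}" for d in docs)
    return (
        "Answer the question using ONLY the sources below, and cite them.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_prompt("Is a verbal contract enforceable?"))
# the prompt would then be sent to an LLM; generation itself is out of scope here
```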
This summary is NOT hallucination-free
@@marcoramponi8462 Where? I will make a note
At 39:00 - lawyers scamming lawyers with technically correct wording that misleadingly advertises, seeks to contractually entrap, and thereby defraud. Hilarious!!!
And thanks for pulling the rug out from RAG's frequently outrageous claims.
Well deserved 256k!
I would argue that the claim of LexisNexis is not even "technically correct." They assert that their responses are "grounded" in trustworthy documents, which can obviously not be the case if the system hallucinates.
RAG-pull?
@@hieroben The phrase "those responses" has no valid semantic connection to the previous things mentioned in the same sentence ("legal citations" or "source documents"). It is just slimy marketing mumbo jumbo not worth paying any attention to. But you can bet that such subtleties will fly over the heads of many lawyers, for all their love of logic and precise language.
Lexis's marketing blurb is fine. Lexis's answers are grounded in, i.e. referenced to, sources that are real (and thus generally useful and authoritative), exactly as advertised, as opposed to fictional sources. Lexis's marketing is reasonably accurate and fair, but incomplete. The "incompleteness" may be missed by customers who don't pay attention to niceties, but most of the people who pay attention and are reasonably intelligent will not miss it. Lexis eliminates responses that are backed up by hallucinated, non-existent citations. This is a significant improvement over tools that do not do so, since it reduces the amount of verbiage from the sources you have to read and evaluate, but you still have to read the sources cited, because the ideas attributed to them might be hallucinated.
Have we tried putting all the legal textbooks necessary to read to become a lawyer in-context yet? Maybe they can be post-trained to learn the definitions and the approach?
That would be worth doing, and it would give you simple facts, but as Yannic says, you need to think (reason) over the facts. For sure, in some form the whole background should be used, either via ICL+ or a foundation model.
The problem is that the complete body of case law (even the potentially relevant case law) can be bigger than the context window of the LLM, and anything past that is only statistically stored in the dot products between idea-association vectors in the internal weights. Even with large context windows, models answer statistically, and are only "often" right.
@@andytroo ehh your context window is also "just" a bunch of vectors
I remember when Lexis/Nexis was just a keyword database for patents, with the "(a OR NOT b) c"-type query language. It looks like it's aspirationally come a long way, but it does seem they've forgotten (or lost) the knowledge of where they came from (database management), and are dazzled by the Eye of AI.
Useful practical comparisons/tests between models:
Typical questions.
Well-crafted questions with lots of prompt crafting, clearing the context window where needed, looking up and providing some material yourself, etc…
Sloppy questions.
Questions about cases that involve domain-specific knowledge outside of law, such as number theory, cryptography, or structural engineering.
Questions about gruesome cases, to test for refusals.
Large and small context window utilisation.
Easy vs. hard questions.
Questions that are controversial or where no case law exists yet.
And such…
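One way the comparison proposed above could be organised, as a sketch; ask_tool is a placeholder for however each product would actually be called, and the question sets are empty stubs.

```python
# Sketch: a small matrix of question categories run against each tool,
# with answers stored for later human review.
question_sets = {
    "typical": ["..."],
    "well_crafted": ["..."],
    "sloppy": ["..."],
    "cross_domain": ["..."],     # e.g. cryptography or structural engineering angles
    "gruesome": ["..."],         # refusal testing
    "long_context": ["..."],
    "controversial_or_no_case_law": ["..."],
}

def ask_tool(tool_name: str, question: str) -> str:
    raise NotImplementedError("placeholder for the real tool call")

def run_benchmark(tools: list[str]) -> dict:
    results = {}
    for tool in tools:
        for category, questions in question_sets.items():
            for q in questions:
                try:
                    results[(tool, category, q)] = ask_tool(tool, q)
                except NotImplementedError:
                    results[(tool, category, q)] = None
    return results
```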
Hi Yannic!
Thanks for your insightful videos.
You are right that hallucinations are inherent in GPT-based models, and it's challenging to limit them with in-context learning. However, I think that retrieval-augmented generation for querying large corpora of documents might not be the best use case for LLMs, even if it's the most popular. In my opinion, where LLMs truly shine is in emulating human tasks based on conversation. I demonstrated some interesting use cases in my recent article:
"A Conversational Agent with a Single Prompt?"
published on the social network that starts with an "L."
I'm curious to hear your thoughts on it.
Giorgio
Interesting video. How many years are we away from building a complete, semantic, logical knowledge graph of all legal precedents based on a large set of documents?
Hi, love your videos on LLMs and hallucinations - I'm new to machine learning (doing a master's in CS), so I find this channel very useful.
I understand that LLMs aren't designed specifically for reasoning, which is why we shouldn't expect them to perform well on QA-like tasks.
So what are LLMs fundamentally designed for then?
Imitation
Nice, feel like I’m up to speed on this topic at a high level, cheers
"Lex" is excellent at including the human in the loop.
Very interesting. Thank you.
Very interesting, thank you. I enjoyed it very much and recommended it to my colleagues. What sparked my interest is the benchmark. Like with any expert opinion, many people I listen to hallucinate (imagine, brainstorm, we can use a lot of other non-loaded words here) a lot, but that's a valuable perk. They make me think about the subject from a new perspective.
What are better ways to evaluate the quality of answers? I understand that we should at least distinguish between explorative questions/answers and definitive ones. And the quality of those will be different. But what else?
I have seen prompts where the user literally begs "please do not hallucinate results". How does that help if at all?
Thanks Yannic and team... it was mentioned that there are better ways to solve this problem... (not with an LLM, as I understood). Can you comment on what these other methods are?
Probably something involving a proper database of information and something AI-based to assist searching in it. A model isn't a good way to store explicit information; it encodes information implicitly into regions of numbers, and neither we nor the AI can know what any of them specifically mean. We also can't know whether a region of numbers contains information that isn't true but is simply an artifact of wiggling adjacent numbers into place during training.
There is some evidence that LLMs do have some world models, just not as sophisticated as human world models. It makes sense, though: how would an LLM be able to generate (at least sometimes) multiple sentences that are completely true and which are not in the training dataset? This kind of interpolation only works if you have some world model.
I wonder whether some sort of transformer model built to move an embedding of the retrieved results toward (what is hopefully) an embedding of the answer, guided by the question, would work better for retrieval-augmented generation.
I feel like one major problem with using LLMs in this context is that they are also trained to answer questions by themselves, rather than having to stick to the retrieved results, which could make them more likely to produce hallucinations.
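A very rough sketch of the embedding-guidance idea above; every name, shape, and training target here is hypothetical, just to make the proposal concrete.

```python
# Sketch: a small module maps (retrieved embedding, question embedding) toward the
# embedding of the reference answer, so generation stays anchored to retrieved material.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256

class RetrievalGuide(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, retrieved_emb, question_emb):
        return self.mix(torch.cat([retrieved_emb, question_emb], dim=-1))

guide = RetrievalGuide(dim)
retrieved = torch.randn(8, dim)         # embedding of the retrieved results
question = torch.randn(8, dim)          # embedding of the question
answer = torch.randn(8, dim)            # embedding of the reference answer

pred = guide(retrieved, question)
loss = 1 - F.cosine_similarity(pred, answer).mean()
loss.backward()                         # trainable end to end with toy data
```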
There are some papers on RAG vs. fine-tuning. The balance of relevant info in the pre-training / fine-tuning / RAG database is critical.
Simply Amazing!
Lexis: delivering 99.6% hallucination-free advice connected via citations to 100% hallucination-free citations.
Requests like "help me with this specific neurology research" or "here's a vauge half-remembered thing, what's the correct termoniligy for it" ⭐️⭐️⭐️⭐️💫. Requests like "[Baba is You level 7 screenshot] lets work step by step to solve it" ⭐️
ChatGPT already does RAG when it thinks it needs to, with web searches, analysis steps, and such.
Great video
In non-English-speaking countries, we tend to say RAG to avoid the mouthful.
Are you at AI Engineer World's Fair in SF right now by any chance?
Yes.
Fine-tuning on lots of domain-specific documents might help improve performance??
Not that Lexis is deserving, but a charitable reading of that marketing spiel could see it as an answer to that recent case where a lawyer who didn't know what he was doing tried to use GPT-4 to find case law and it completely hallucinated references multiple times. I think they ended up getting fined and reprimanded, besides being rather embarrassing.
I'm neither a lawyer nor ML researcher, just a regular old programmer, but this spiel looks to be targeted at those lawyers who are largely put off of AI tools altogether by this case and don't really understand/care if it's basically the same as current search techniques as long as it avoids this one big (in their eyes) flaw.
Yes, they hallucinate often. I type the same prompt into Claude, ChatGPT, and PerplexityAI. More often than not I'll get different responses. But then I play all of them against each other by repeating what the others said in a follow-up prompt. Eventually, they'll all agree, but sometimes they agree on the WRONG ANSWER!!!!
And the problem with Claude is it doesn't have a memory. So if you close the website and return to ask the same question, it will repeat the same wrong response it gave you the first time, without any reference to your follow-up questions.
I guess I've already been doing this myself by cutting and pasting the output of --help from a command, or web pages full of documentation I'm too lazy to read, before asking a question. (22:00)
What about finetuning instead of using RAG?
Thanks for an awesome channel! Will you do a video on KAN: Kolmogorov-Arnold Networks? Would be nice to listen to your take on it!
"Ohh it hallucinates" ... Technically they always hallucinate. If you don't agree with the answer it's a hallucination. So control comes from whether you leave it to figure things out completely on its own or you steer it... Like a good expert should.
Hopefully we see a change in understanding of what these things are good at because even with that limitation, the outsized gains in productivity we'll have from using them in law or any field will be better than not having these sorts of tools
so what you're telling me is LLMs would benefit from Mixture Density layers
5:00 everyone is shady, well if you take off those sunglasses INDOORS it might help
An LLM is always smarter than retrieval tech... so we rely on the LLM to better judge whether a document is relevant to a given query. The LLM should be able to juggle relevant and irrelevant documents... it mostly does, to my surprise.
55:00 "Hallucinations! Hallucinations!" seems like someone is sick of the current state of ML as well.
26:50 , 😭😭😭 You didn't have to do me like that 😭😭
14:01 : I’m not sure how to make this claim precise.
What about Lamini Memory Tuning
Is this publicly accessible yet? Seems so cool
Not open source, but you can get it if you pay them money, I believe. There's a Lamini x Meta deal for Llama.
Thanks for explaining this paper to all. So you did not actually evaluate this yourself? The fact that these tools are better than ChatGPT shows that they improved, which they did using e.g. RAG but also other technologies. My strong guess is that over time these tools will get better and better, and who knows how close to 100% correct answers they will get. I strongly suppose DeepJudge cannot claim 100% accuracy either. Can you share where you stand there? I agree, though, that claiming so is a bad business practice.
I like your spicy comments 🔥 It hurts, but it is true.
how do you find good documents :D
Sounds like you need an LLM to evaluate the relevance of all the documents...
With a vector database
congrats on 2^8 * 1000 subs
From personal experience building AskPandi, the issue with hallucinations is more about a lack of data than about the model itself (assuming great QA capabilities). It's equally true that most QA models are not trained to say "I don't know", which complicates things too...
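One cheap way to bolt on an "I don't know", as a sketch: abstain when the best retrieval score falls below a threshold instead of letting the model improvise. The retrieve_with_scores helper and the threshold value are assumptions for illustration, not AskPandi's actual approach.

```python
# Sketch: refuse to answer when no sufficiently relevant source is retrieved.
def retrieve_with_scores(query: str) -> list[tuple[str, float]]:
    # placeholder: return (passage, similarity score) pairs from your index
    return [("some passage", 0.41)]

def answer_or_abstain(query: str, threshold: float = 0.6) -> str:
    hits = retrieve_with_scores(query)
    if not hits or max(score for _, score in hits) < threshold:
        return "I don't know - no sufficiently relevant source was found."
    # otherwise build the grounded prompt and call the model as usual
    return "call the LLM with the retrieved passages here"

print(answer_or_abstain("What did section 12 say in 2016?"))
```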
As the old saying goes: If you only have a hammer, every problem looks like a nail.
You said something like "don't use LLMs for reasoning", and I agree that you need a human in the loop. But I also know from experience that GPT-4 can be used for reasoning if I do a lot of handholding and split the task into smaller reasoning tasks.
The whining about how private companies won't share access for free reminds me of dealing with a junior engineer who expects to be handed an X and y matrix before starting work on any project.
On another note, I'm writing AI to predict the ideal structure for better aerospace materials. Why won't Alcoa give me access to their proprietary database of alloy properties?
ESM3!
As a recent law graduate I’m siding with you over the academics and lawyers lmao
my neurons are firing
some claims that grokking is better
Never been done on a big corpus... Would be great if it can be done.
36:36 shots fired
came for the cringe. not disappointed. keep it up, professor!
The limitations are in the illogical structure of English. Try the much more precise Esperanto and see if it hallucinates.
That's why Western politicians constantly hallucinate that the West is still relevant ;)
Esperanto? Really? If you wanted a *logical* language, I would think you would go with lojban.
@@drdca8263 Lojban is insignificant vs. Esperanto, which has a huge Wikipedia presence and media. A new, better language can be created, but what people accept is the most important thing. No one wanted to improve Esperanto - just use it everywhere!!!
Once we pass that stage, then we can agree to improve to the best at that time.
@@DimitarBerberu If your main concern is number of users, why not go with Mandarin, Hindi, or English? :P
[referential humor]Though I suppose Esperanto does have those “new radio shows”.[/referential humor]
@@drdca8263 I speak 5 languages (incl. English) & I started Esperanto due to frustration with English as the most illogical & ambiguous. With my Maths/Informatics education & love of Holistic Philosophy (DiaMat), I have a mission to push Esperanto as the AI language for full communication between 8 billion people & >16 billion AI robots (coming soon), to make this world much better & more humane than ever imagined (by any communist ;) Any national language has nuances that will never be accepted by >80% of the population, who won't submit to any imperial language.
Chinese scientists learn English to better understand others' achievements, Indians and others to get more Anglo jobs (until BrExit/Trumpism), but if you want to sell to other nations you have to learn their language. iPhones have to learn Simplified Chinese, so forget the 1.4 billion using English as a necessary language, same for >85% of non-Western people/nations…
To simplify & be the most productive, the minimal auxiliary communication language is Esperanto. Google Translate took 100x less effort to include Esperanto, so why waste time with so many limited languages before we have Esperanto as the universal auxiliary communication language, with 2x better-understood AI robots in the community?
Virtually hallucination-free AI is totally doable. In digital marketing you have Lemon AI with AI Reports, which almost never hallucinates
Does it use knowledge graphs or just a vector DB?
Stop spamming your SaaS product here. 😂
😅
"almost never". But we never know exactly when it does not hallucinate, so we have to manually fact-check every time, rendering the automatization in more work or just a different kind of work.
Almost
Lots of strong claims/language ("garbage", "that's not what these models are for") without any arguments to back them up...
Yannic, please don't pick a paper like this again.
First!
There's a more fundamental problem with what this paper is trying to do: RAG doesn't even pull synonymous data very well. RAG will fail spectacularly at retrieving relevant docs, especially in a jargon-heavy field like legal, and there's no way around this. Forget hallucination - any current RAG-based system for legal is going to be useless in the real world; it just won't find the actual relevant information. I know this because hundreds of companies (like Yannic's) have tried it and none have succeeded to a useful degree.
If you use a generic embedding model, it may struggle with synonyms. However, if you train or fine-tune the model on jargon-heavy documents, it will handle synonyms effectively.
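A sketch of that fine-tuning suggestion using the sentence-transformers library, assuming you can mine pairs of legal jargon and plain-language paraphrases from your own corpus; the pairs below are toy examples, and the base model choice is just one common default.

```python
# Sketch: fine-tune an embedding model so jargon and its paraphrases embed close together.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    InputExample(texts=["motion for summary judgment",
                        "request to decide the case without a full trial"]),
    InputExample(texts=["tortfeasor", "the party who committed the civil wrong"]),
]

loader = DataLoader(pairs, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)   # pulls paired texts together

model.fit(train_objectives=[(loader, train_loss)], epochs=1, warmup_steps=10)
model.save("legal-embedding-model")
```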
That's just not true. I have my own AI setup with RAG and I use it every day at my firm. From deposition summaries to casts of characters, it can do almost all of it. Hallucinations have stopped being a problem; now the only issue is the output limit.
Wait, if it involves thinking and reasoning it shouldn't be done by AI? You just made a ton of this-changes-everything AI influencers cry, you monster.
Last 50 years lol.
Here's me trying to be smart now, but does anyone else think GenAI should be shortened to GAI? And then the pronunciation should be close to similar words like "hail" or "snail", but "GAI", and capped for shouting it as well.
This is dumb
That acronym is GAI.
💀
Christopher Manning is a co-author. I am pretty sure you are missing something fundamental. 😂
cringe cringe