Hamel Husain
  • Videos: 40
  • Views: 75,827
Build Applications For LLMs in Python
Building minimal applications around your fine-tuned models is critical for success. However, there aren't great tools for doing this in Python. We are going to announce something special in this workshop that will give you new superpowers.
Views: 2,918

Videos

Axolotl Office Hours
Views: 176 · 14 hours ago
See parlance-labs.com/education/ for more resources
LangChain/LangSmith Office Hours
Views: 191 · 19 hours ago
General office hours and Q&A About LangSmith / LangChain
FSDP, DeepSpeed and Accelerate
Views: 298 · 21 hours ago
Advanced techniques and practical considerations for fine-tuning large language models, comparing tools, discussing model precision and optimization, and exploring best practices for effective training and deployment. Slides, links and additional resources: parlance-labs.com/education/fine_tuning/zach.html *0:00 Axolotl vs. Hugging Face AutoTrain* Zach discusses the differences between Axolotl ...
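
The workflow discussed here centers on Hugging Face Accelerate's `prepare()` pattern, where the same training step runs on one GPU or sharded across many with FSDP or DeepSpeed. A minimal sketch, assuming an illustrative model name and a toy batch (the distributed strategy is chosen via `accelerate config`, not in code):

```python
# Minimal Accelerate training step (sketch): the distributed strategy
# (FSDP, DeepSpeed, single GPU) comes from `accelerate config`, not code.
# The model name and toy batch below are illustrative assumptions.
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# prepare() wraps the model and optimizer for whichever strategy was configured.
model, optimizer = accelerator.prepare(model, optimizer)

# Toy causal-LM batch; in practice, iterate over a prepared DataLoader instead.
batch = tokenizer(["Hello world"] * 4, return_tensors="pt").to(accelerator.device)
batch["labels"] = batch["input_ids"].clone()

loss = model(**batch).loss
accelerator.backward(loss)   # replaces loss.backward() so gradients sync correctly
optimizer.step()
optimizer.zero_grad()
```
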
Fine-Tuning with Axolotl
Views: 650 · 1 day ago
This lesson illustrates an end-to-end example of fine-tuning a model using Axolotl to understand a domain-specific query language. Guest speakers include Wing Lian, creator of Axolotl, and Zach Mueller, lead developer of Hugging Face Accelerate. Notes, slides, and additional resources: parlance-labs.com/education/fine_tuning_course/workshop_2.html This is lesson 2 of a 4-lesson course on applied fine-tu...
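
For readers new to Axolotl: a run is driven by a single YAML config. A hedged sketch of common config keys, written here as a Python dict dumped to YAML (the base model, dataset path, and values are placeholders, and exact key names can vary across Axolotl versions):

```python
# Sketch of a typical Axolotl config, expressed as a Python dict and dumped
# to YAML. Keys reflect common Axolotl options; exact names may vary by
# version, and the paths and values here are placeholders.
import yaml

config = {
    "base_model": "mistralai/Mistral-7B-v0.1",   # model to fine-tune (placeholder)
    "datasets": [{"path": "data/train.jsonl",    # placeholder dataset
                  "type": "alpaca"}],            # prompt template format
    "adapter": "lora",                           # LoRA instead of full fine-tuning
    "lora_r": 16,
    "lora_alpha": 32,
    "sequence_len": 2048,
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_epochs": 3,
    "learning_rate": 2e-4,
    "output_dir": "./out",
}

with open("config.yml", "w") as f:
    yaml.safe_dump(config, f)
# Then typically: accelerate launch -m axolotl.cli.train config.yml
```
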
Instrumenting & Evaluating LLMs
Views: 1.1K · 1 day ago
This lesson discusses instrumentation and evaluation of LLMs. Guest speakers Bryan Bischof and Eugene Yan describe how they think about LLM evaluation in industry. Finally, Shreya Shankar discusses her research on LLM eval systems. Slides, notes, and additional resources are available here: parlance-labs.com/education/fine_tuning_course/workshop_3.html This is lesson 3 of a 4-lesson course on applied...
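
One concrete instrumentation idea in this style of eval work is cheap assertion-based checks run over every model output before any human or LLM-as-judge grading. A minimal sketch, with a stub standing in for the real model call and checks that assume a JSON-producing task:

```python
# Assertion-style evals (sketch): cheap checks that catch obvious failures
# before expensive human or LLM-judge evaluation. `generate` is a stub
# standing in for your model; the JSON check assumes a JSON-output task.
import json

def generate(prompt: str) -> str:
    return '{"summary": "stub"}'    # placeholder model call

def check_output(output: str) -> list[str]:
    failures = []
    if not output.strip():
        failures.append("empty output")
    if "as an ai language model" in output.lower():
        failures.append("boilerplate refusal")
    try:
        json.loads(output)          # assumed output contract: valid JSON
    except json.JSONDecodeError:
        failures.append("invalid JSON")
    return failures

for prompt in ["Summarize ticket #123 as JSON"]:
    for failure in check_output(generate(prompt)):
        print(f"FAIL [{failure}]: {prompt!r}")
```
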
Deploying Fine-Tuned Models
Views: 703 · 1 day ago
We will discuss inference servers, backends, and platforms like Replicate that you can host models on. This is lesson 4 of a 4-lesson course on applied fine-tuning: 1. When & Why to Fine-Tune: ruclips.net/video/cPn0nHFsvFg/видео.html 2. Fine-Tuning w/Axolotl: ruclips.net/video/mmsa4wDsiy0/видео.html 3. Instrumenting & Evaluating LLMs: ruclips.net/video/SnbGD677_u0/видео.html 4. Deploying Fine-Tuned LL...
Replicate Office Hours
Views: 91 · 1 day ago
Replicate Office Hours
Fine Tuning OpenAI Models - Best Practices
Views: 881 · 1 day ago
Best practices for fine-tuning OpenAI models. Notes, links, and more resources available here: parlance-labs.com/education/fine_tuning/steven.html *00:00 What is Fine-Tuning* Fine-tuning a model involves training it on specific input/output examples so that it can respond appropriately to similar inputs in the future. This section includes an analysis of when and when not to fine-tune. *02...
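
Mechanically, an OpenAI fine-tuning run is a two-step API call: upload a JSONL file of chat-formatted examples, then create a job. A minimal sketch with the official Python SDK (the file name and base model are illustrative; check the docs for currently fine-tunable models):

```python
# Launch an OpenAI fine-tuning job (sketch). Requires OPENAI_API_KEY in the
# environment; the file name and base model here are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Upload training data: JSONL of chat-formatted input/output examples.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job against a supported base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",   # check OpenAI docs for currently supported models
)
print(job.id, job.status)
```
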
Modal Office Hours
Views: 128 · 1 day ago
Modal Office Hours
Getting the Most Out of Your LLM Experiments
Views: 188 · 14 days ago
Reproducibility is critical to iterating on your machine-learning pipelines, and keeping track of everything is hard. We'll explore your shared fine-tuning projects, uncover hidden insights, and showcase advanced features for experimenting with LLMs in the Weights & Biases workspace. You will also discover our new LLM tracing tool, Weave.
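
The tracking pattern underneath all of this is small: initialize a run with its config, log metrics per step, and compare runs in the workspace. A minimal sketch with the wandb client (project name, config, and the fake loss curve are placeholders):

```python
# Minimal Weights & Biases experiment tracking (sketch). Project name,
# config values, and the fake loss curve are placeholders.
import random
import wandb

run = wandb.init(
    project="llm-finetuning",              # placeholder project name
    config={"lr": 2e-4, "lora_r": 16},     # hyperparameters to compare runs by
)

for step in range(100):
    loss = 1.0 / (step + 1) + random.random() * 0.01   # stand-in for a train step
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```
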
LLM Eval For Text2SQL
Views: 720 · 14 days ago
Ankur from Braintrust discusses the systematic evaluation and enhancement of text-to-SQL models. Highlighting key components like data preparation and scoring mechanisms, Ankur demonstrates their application with the NBA dataset. The presentation emphasizes iterative refinement through advanced scoring and model-generated data, offering insights into practical AI evaluation pipelines. *00:00 In...
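
A common scoring mechanism for text-to-SQL evals, independent of any particular tool, is execution match: run the generated and reference queries and compare result sets. A minimal sketch against an in-memory SQLite table standing in for the NBA dataset:

```python
# Execution-match scoring for text-to-SQL (sketch): a generated query is
# "correct" if it returns the same rows as the reference query.
import sqlite3

def execution_match(conn, generated_sql: str, reference_sql: str) -> bool:
    try:
        got = conn.execute(generated_sql).fetchall()
        want = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return False                      # SQL that errors counts as a failure
    return sorted(got) == sorted(want)    # order-insensitive comparison

# Toy schema standing in for the NBA dataset used in the talk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT)")
conn.executemany("INSERT INTO players VALUES (?, ?)",
                 [("Curry", "GSW"), ("James", "LAL")])

print(execution_match(conn,
                      "SELECT name FROM players WHERE team = 'GSW'",
                      "SELECT name FROM players WHERE team = 'GSW'"))  # True
```
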
Napkin Math For Fine Tuning Pt. 2 w/Johno Whitaker
Views: 151 · 14 days ago
See Part 1: ruclips.net/video/-2ebSQROew4/видео.html Chapters, links and resources: parlance-labs.com/education/fine_tuning/napkin_math_2.html
Building LLM Applications w/Gradio
Views: 418 · 14 days ago
Freddy, a software engineer at Hugging Face, demonstrates ways to build AI applications with Gradio, an open-source Python package. He builds applications like a chatbot interface with just 50 lines of Python, discusses Gradio's versatility in handling various media types and its seamless integration with Hugging Face's ecosystem, and covers comparisons with tools like S...
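
The "50 lines of Python" claim is easy to picture with `gr.ChatInterface`, which turns a single response function into a full chat UI. A minimal sketch (the echo function is a placeholder for a real model call):

```python
# A complete Gradio chatbot UI in a few lines (sketch). The respond
# function is a placeholder; swap in a real model or API call.
import gradio as gr

def respond(message, history):
    # `history` holds the prior (user, assistant) turns; ignored here.
    return f"You said: {message}"

gr.ChatInterface(respond).launch()   # serves a chat UI locally
```
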
When and Why to Fine Tune an LLM
Views: 1.5K · 14 days ago
This session introduces the course "Fine Tuning for Data Scientists and Software Engineers". It introduces the concept of fine-tuning and establishes a basic intuition for when it might be applied. Notes, links, and resources are here: hamel.quarto.pub/parlance/education/fine_tuning_course/workshop_1.html This is lesson 1 of a 4-lesson course on applied fine-tuning: 1. When & Why to Fine-Tune: ruclip...
Predibase Office Hours
Views: 64 · 14 days ago
Predibase Office Hours
Train (almost) Any LLM Model Using 🤗 Autotrain
Views: 448 · 14 days ago
Train (almost) Any LLM Model Using 🤗 Autotrain
Prompt Engineering Workshop
Views: 2.2K · 14 days ago
Prompt Engineering Workshop
Modal: Simple Scalable Serverless Services
Views: 394 · 14 days ago
Modal: Simple Scalable Serverless Services
Systematically improving RAG applications
Views: 1.7K · 21 days ago
Systematically improving RAG applications
A Deep Dive on LLM Evaluation
Views: 1.2K · 21 days ago
A Deep Dive on LLM Evaluation
Creating, Curating, and Cleaning Data for LLMs
Views: 1.6K · 21 days ago
Creating, Curating, and Cleaning Data for LLMs
Inspect, an OSS Framework for LLM Evals
Views: 2.4K · 21 days ago
Inspect, an OSS Framework for LLM Evals
Slaying OOMs with PyTorch FSDP and torchao
Views: 715 · 21 days ago
Slaying OOMs with PyTorch FSDP and torchao
Fine Tuning LLMs for Function Calling w/Pawel Garbacki
Views: 719 · 28 days ago
Fine Tuning LLMs for Function Calling w/Pawel Garbacki
From Prompt to Model: Fine-tuning when you've already deployed LLMs in prod w/Kyle Corbitt
Views: 692 · 28 days ago
From Prompt to Model: Fine-tuning when you've already deployed LLMs in prod w/Kyle Corbitt
Why Fine Tuning is Dead w/Emmanuel Ameisen
Views: 28K · 1 month ago
Why Fine Tuning is Dead w/Emmanuel Ameisen
Back to Basics for RAG w/ Jo Bergum
Views: 3.2K · 1 month ago
Back to Basics for RAG w/ Jo Bergum
Napkin Math For Fine Tuning Pt. 1 w/Johno Whitaker
Views: 1.8K · 1 month ago
Napkin Math For Fine Tuning Pt. 1 w/Johno Whitaker
Beyond the Basics of Retrieval for Augmenting Generation (w/ Ben Clavié)
Views: 5K · 1 month ago
Beyond the Basics of Retrieval for Augmenting Generation (w/ Ben Clavié)

Comments

  • @briancase9527 · 3 hours ago

    Really good and useful talk.

  • @jeremyh2083 · 7 hours ago

    I'm in healthcare; my data isn't publicly available. Fine-tuning has pretty decent results because most things are single-shot. Agents/RAG should be fun, but again, our data being segregated is a hassle. I think Bloomberg has to be similar.

  • @vincenthenderson3733 · 12 hours ago

    Around 46:00: context size is mostly a vanity metric, AFAICT. I'd like to see data on how accuracy varies with the percentage of total nominal context actually used by the prompts. In fact, this could be one of the most beneficial uses of fine-tuning: avoiding filling up the context with very long instructions.

  • @vincenthenderson3733 · 14 hours ago

    Around 24:00: another insight here, regarding the question from the life-sciences guy, is that when we say "RAG", we tend to assume out-of-the-box embedding-match RAG. But in many special cases RAG is best implemented with dedicated software components that take the LLM's query output and use other domain-specific NLP and business-rules software to actually retrieve what you care about. In other words, build LLM workflows that are not only using LLMs. Get the LLM to do a task, then use that output to drive the advanced semantic retrieval machinery that you know works and that embeds a lot of your subject-matter expertise, then use that output, which will typically be much more precise than a vanilla embedding match, to build your next LLM prompt. I would have advised the life-sciences guy that he's very likely not going to get much benefit from fine-tuning. You can't train a knowledge representation into an LLM using fine-tuning. Fine-tuning helps with task-specific input and output simplification and formatting, pruning, compliance, that sort of thing, not with the actual "logical" inference that the model does.
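
A sketch of the workflow shape this comment describes, with the LLM extracting something retrievable, domain-specific retrieval doing the real lookup, and the result seeding the next prompt; `llm` and `lookup` here are hypothetical stubs, not any particular API:

```python
# Multi-stage LLM workflow (sketch): the LLM only extracts a structured
# query; domain-specific retrieval does the actual lookup; the result
# builds the next prompt. `llm` and `lookup` are hypothetical stand-ins.
def llm(prompt: str) -> str:
    """Stand-in for any chat/completion API call."""
    return "stub response"

def lookup(entity: str) -> list[str]:
    """Stand-in for domain-specific retrieval: business rules, NLP, a search index."""
    return [f"curated passage about {entity}"]

def answer(user_question: str) -> str:
    # Step 1: LLM turns free text into something retrievable.
    entity = llm(f"Extract the key domain entity from: {user_question}")
    # Step 2: precise, expertise-encoding retrieval (not vanilla embeddings).
    context = "\n".join(lookup(entity))
    # Step 3: the retrieved context builds the next, better-grounded prompt.
    return llm(f"Answer using only this context:\n{context}\n\nQ: {user_question}")

print(answer("Which compounds inhibit kinase X?"))
```
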

  • @vincenthenderson3733 · 14 hours ago

    Around 19:10: absolutely, fine-tuning works better for some tasks, and it certainly doesn't work for knowledge injection, for which you must use RAG. But you also have to take into account the economics and logistics of it. Fine-tuning is a task-specific thing that you have to do and then maintain across all your fine-tuned models, which costs money and introduces complexity. RAG and prompting are far more nimble. It's not often in life that the easier solution is in fact better than the complex one.

  • @vincenthenderson3733 · 15 hours ago

    Around 8:30 to 10:10: the RAG picture absolutely turns the problem into a search problem that is at least as important as the prompting problem. This is a far less trivial problem than most people realize. Using RAG requires deep thinking about the retrieval part, and this is notoriously difficult using embeddings only, at least if you want to optimize your token consumption and the overall inference time of your prompt chain. You'd greatly boost your RAG-based workflow by not only using embeddings but also sticking a real search index behind it, configured for the retrieval you care about. That's a kind of LLM workflow optimization that I feel is not being talked about.
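
A sketch of the "real search index behind the embeddings" idea: a lexical stage narrows candidates, then embeddings rerank only the shortlist. The toy `embed` and `keyword_score` functions are stand-ins for a real embedding model and a real index such as BM25:

```python
# Hybrid retrieval (sketch): lexical scoring narrows candidates, embeddings
# rerank. `embed` and `keyword_score` are toy stand-ins for real components.
import math

def embed(text: str) -> list[float]:
    """Toy embedding: character-frequency vector. Use a real model in practice."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Crude lexical score standing in for a real search index (BM25, etc.)."""
    terms = set(query.lower().split())
    words = doc.lower().split()
    return sum(words.count(t) for t in terms) / math.sqrt(len(words) + 1)

def retrieve(query: str, corpus: list[str], k: int = 10, final: int = 2) -> list[str]:
    # Stage 1: the "real search index" narrows candidates lexically.
    candidates = sorted(corpus, key=lambda d: keyword_score(query, d), reverse=True)[:k]
    # Stage 2: embeddings rerank only the shortlist.
    qv = embed(query)
    return sorted(candidates, key=lambda d: cosine(qv, embed(d)), reverse=True)[:final]

docs = ["Steph Curry plays for GSW", "LeBron James plays for LAL", "GSW won in 2022"]
print(retrieve("Which team does Curry play for?", docs))
```
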

  • @ayushman_sr · 1 day ago

    For some use cases, having a better prompt adds more tokens, hence more cost for the same inference. Any thoughts?

  • @tankieslayer6927 · 1 day ago

    I love how all midwits think RAG is some kind of miracle without basic understanding of freshman level math. RAG is literally the most low eye-queue idea one can come up with.

  • @Steve-lu6ft · 1 day ago

    When you say we should be spending days working on prompts, how so? I assume you have a high-level picture in mind of how these prompts should be structured, but can you break it down and simplify it for me?

  • @notfaang4702 · 2 days ago

    I was expecting more from this talk. The TL;DR was: do RAG first, maybe fine-tune later, but it also depends very much on the case and what you want to do with the model.

  • @RobertJohnson-xg5kh · 3 days ago

    Thanks for sharing this course! Could you please post the discord link? Thanks!

  • @nyan-cp5du · 3 days ago

    The problem is if you don't do the cool thing, you help your company continue to generate revenue, but you don't get promoted and you spend the rest of your life writing SQL queries for shit pay

  • @codevacaphe3763 · 3 days ago

    I have some experience with fine-tuning. Fine-tuning and transfer learning are still great techniques, but you have to have in-depth knowledge of the math to know how to apply them. LLMs are really hard to fine-tune; some of the time it breaks the parameters learned from previous context (just my opinion, though). Some CNN applications, on the other hand, get good fine-tuning results, since the CNN captures the shapes and patterns of the image, and after the CNN layers the key features are still maintained and captured.

  • @Jay-wx6jt · 4 days ago

    Hamel, I just can't thank you enough for making this public.

  • @kcm624 · 4 days ago

    The questions that constantly keep interrupting the talk are super distracting.

  • @SearchingForSounds · 5 days ago

    Another place I think this is echoed is in image diffusion models. IP-Adapter has become so powerful in Stable Diffusion that we're able to basically create instant models using 3-4 reference images and normalizing/averaging the tokens they create. By conditioning a prompt with those tokens via IP-Adapter, fine-tuning base models is now pointless in all but the most niche cases.

  • @Tenebrisuk · 5 days ago

    It's a shame the host didn't actually let the guest answer the last question and instead proposed a different one; otherwise I found this very interesting.

  • @MegaGGWP · 5 days ago

    This paper (Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs) only does continued pre-training (i.e., next-word prediction), not supervised fine-tuning, so the comparison is not entirely accurate.

  • @hasana.3078 · 6 days ago

    I have a question regarding ShareGPT-style datasets for function calling (tool use), like the glaiveai/glaive-v2 dataset. I am trying to get that to work with Axolotl, but with the latest checkouts from GitHub it seems to be broken, despite having been added to main in an MR earlier this year. To my understanding it needs the specific type sharegpt.load_glaive, which has the problem that the tool role never actually gets added and recognized. Just a few days ago I saw that Wing Lian added a commit where exactly this issue was solved, by modifying the code in those functions and adding another config field, "tool_field", to allow setting the tool role correctly. I see that this commit has not been merged and has been deleted and closed for some reason. Why is that, and will the use of ShareGPT tool datasets be fixed somehow? P.S. I also created a bug report for this on GitHub. It would be nice if you could get back to me about this issue, as I desperately need it. Thanks.

  • @thegrumpydeveloper · 6 days ago

    I like the questions, but I really wish they were asked at the end of the presentation rather than breaking the flow, with the answer repeatedly turning out to be a few slides or talking points down the way.

    • @hamelhusain7140 · 6 days ago

      @thegrumpydeveloper We did it as an experiment but didn't do the same in other videos. I didn't like it either!

  • @solaxun · 6 days ago

    Thanks for sharing! One question I had about the `input_output` format discussed at the beginning, when specifying your own prompt templates: if you are fine-tuning a base model, I assume you need to come up with your own start/stop tokens (in the example, <s> and </s>) and explicitly include them in the training pairs? If so, how do you know whether the tokens you pick for that purpose are already in the model's vocabulary, or whether you should add them to the embedding table?
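
One way to answer the vocabulary part of this question in code, assuming a Hugging Face tokenizer (the model name and candidate tokens are illustrative): look the tokens up, and if any are missing, register them and resize the embedding table.

```python
# Check whether candidate template tokens are in the tokenizer vocabulary,
# and add any that are missing (sketch; model name and tokens illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

for tok in ["<s>", "</s>", "<|my_sep|>"]:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # Unknown tokens come back as None or the unk id, depending on the tokenizer.
    known = tok_id is not None and tok_id != tokenizer.unk_token_id
    print(f"{tok!r}: {'id ' + str(tok_id) if known else 'NOT in vocab'}")

# Register a missing token and grow the embedding table to match.
added = tokenizer.add_special_tokens({"additional_special_tokens": ["<|my_sep|>"]})
if added:
    model.resize_token_embeddings(len(tokenizer))
```
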

  • @rounakkundu7831 · 6 days ago

    Great course, thanks for putting this out. Are the slides shared anywhere? I didn't find them on the website.

    • @hamelhusain7140 · 14 hours ago

      @rounakkundu7831 parlance-labs.com/education/fine_tuning_course/workshop_1.html

  • @toreon1978 · 6 days ago

    46:53 The context-window line is incorrect. For the models you mentioned, we are around 100K on average.

    • @xspydazx · 5 days ago

      Yes, 128K, but some are unlimited; in fact they are all unlimited: it's up to you how you train your model. The problem is only the processing it takes to load and train the model at that size. Inside the tokenizer config files (Mistral, for instance) it says the max token length is way past a trillion! So it first needs to be trained at long context, then it can be used that way. This is what we expect from released models: that they have been trained on the largest possible contexts, leaving us able to choose a lower context according to our memory limits. The larger the context you set for a model, the slower the response time, as it is bound by your GPU, so even a small model can consume a large GPU stack. Small models with large contexts are the best combination for local execution, so these 3B and 4B models need training on long context; after that, the pretrained models they release will be worth downloading as a home base model, or locking in as a GGUF model. Currently you need a training regime for any model you download if you're doing serious work; hence RAG!

  • @toreon1978 · 6 days ago

    35:41 Sorry, but I don't get that. Isn't the hard part creating the fine-tuning dataset? The actual training run doesn't cost that much, does it?

  • @toreon1978 · 6 days ago

    11:16 I think I get the idea of where not to fine-tune. But for the 5% where it is needed, I always thought it's like carving out a specific part of the general knowledge and behavior LLMs have. So, to give concise, fitting financial advice, an LLM has to ignore a lot of (bad) knowledge and also suppress a lot of expansive responses. That's not something good prompting can achieve, right?

  • @JL-1735 · 6 days ago

    This guest has quite a condescending attitude; I don't buy what he's saying as a result of it. On top of that he came across as rude (his "This is not an Anthropic presentation", etc.; you can say that in a less aggressive way). Anyway, interesting topic, but it's too much founded on fluffy lines and reasoning that serves suppliers of big foundation models like Anthropic, as closely guarded models like Anthropic's are the opposite of easy to fine-tune. Meta will prove open-weights models are the future, and yes, fine-tuning has its place; not everything is an LLM problem built into the foundation model.

  • @tarikborogovac9614 · 6 days ago

    Could fine-tuning make the model less capable overall, i.e., forget general knowledge, reasoning, instruction following, and other abilities that help it answer questions even in the domain you are fine-tuning for? This type of pervasive capability loss may be hard to measure.

  • @Douchebagus · 6 days ago

    In the context of diffusion-based image models, fine-tuning is infinitely more important than prompting, so I don't think your talk applies to all machine-learning models.

  • @davidwright6839 · 7 days ago

    The conceptual analogy that I like to use comes from cartography. The LLM is a map of regions called "concepts" that are projected into the multidimensional tensor space of tokens. Fine-tuning is a conformal map projection of this tensor space to create a "view" appropriate to a user's domain. Prompts are tokens that adjust the zoom level of the conformal map to view greater detail and narrow the possible output responses from the tensor space. RAG is like "street-view" images or satellite data that adjust the temporal window of the map beyond its training cutoff date. Prompts can be optimized for either the LLM or the fine-tuned map. If the prompt tokens are optimized for the LLM, fine-tuning is superfluous. If the prompt tokens are domain-specific for a conformal "view," the fine-tuned map should perform somewhat better.

    • @xspydazx · 7 days ago

      Yes, I recently saw another fine-tuning technique to increase the probabilities of a particular series: by adding a list of entities to the content, you always have other keywords that will activate or extract the content!

    • @antonystringfellow5152 · 5 days ago

      Thanks, that's an excellent analogy. I didn't really have a good grasp of these points before I read this. It's easy for novices like me to get lost in the terminology.

    • @xspydazx · 5 days ago

      When I was first investigating building language models, I was really interested in what happens at each layer mathematically (forgetting the optimization function): what data is actually at each layer? At first ChatGPT told me that language models were a collection of n-gram language models, so I began with those. I did not quite get it, so I dug deeper, only to find that embeddings were the key, and after checking out skip-grams, GloVe, etc., I finally got somewhere. Then I discovered the transformer architecture and found that each layer is effectively a word-to-word matrix of probabilities for the next word, which is how n-gram models predict next words, except with a massive vocabulary and many layers. I created a model step by step with GPT (it could not produce a whole transformer back then), but we made the components, self-attention and the rest: these are the search function, enabling the later layers to refocus on selecting the next token. So at every layer the data sits in embedding matrices, and you can view the journey of the token prediction as it travels through the network, hence the layer stack. It was said that GPT had so many layers, but in truth 32 layers is a sweet spot for this type of transformer, and language models do not need more than that; it is also related to the vocabulary. Byte-pair encoding lets fewer tokens predict many possibilities (32,000 is a good size), but we can improve this with larger or even technical chunks by pre-tokenizing the text into valuable segments, e.g. code fragments; semantic chunking is a good way to begin the tokenization process. All this just to predict a token? No: we are predicting sequences by similarity, so the model is trained to output a sequence based on an input sequence. A language model can even be used to transform a picture or audio, since it is only tokenization and encoding: audio is converted to an image first, then tokenized to base64 (text) and encoded into the document. This tokenization, together with embeddings and self-attention, is the innovation of the transformer; get these under control and you can build networks in any programming language. I began in .NET and only switched to Python to do LLM work; in truth, after this year with these models, I could basically rewrite it in VB.NET quite easily (GPT already wrote the code long ago!). So by a step-by-step learning process I was able to master transformers, whereas CNNs, LSTMs, etc. really were boring and I did not partake!

  • @esantirulo721 · 7 days ago

    With LLMs, fine-tuning just makes the grounding problem more complicated: where does the output come from? The base model, the learned data, or nothing at all (hallucination)? That's why embedding-based search is great: you know what data you're generating your output from. In some industries (e.g., medical), being able to justify ("ground") an answer is mandatory. There are a few use cases for fine-tuning, if the cost of transforming the data into (prompt -> completion) pairs is not too high.

  • @mrpocock · 7 days ago

    I am fairly convinced that we are doing LLMs wrong. We should have language models that generate complete nonsense and puppet them with knowledge models. So if RAG is how you do this, you want your non-RAG model to score essentially zero on all benchmarks except language structure, and inject everything with RAG or some other knowledge or skill injection.

  • @chunheichau7947 · 8 days ago

    ONLY FOR LLM!!!

  • @zeryf4780 · 8 days ago

    Great and informative conversation! I wonder if there are more channels like yours!

  • @peterbizik224 · 8 days ago

    Nice session, thank you. I would love to see a reliable, stable base model that understands the languages. But the domain knowledge is always questionable in my opinion, as most (some?) of the books used for base-model training (technical books, advanced papers) are quite complex, and I am still not truly convinced that text + pictures + math were captured with very high precision.

    • @artifishially_stupid · 4 days ago

      Well put. Most LLMs contain way more knowledge than we need at the base-model level. Nevertheless, I've achieved excellent results by uploading my own documents and instructing the chatbot to search my documents first. The documents are very technical and complex in their formatting, so I had to do A LOT of cleaning and preprocessing, added some annotation, and converted everything to txt files. Having done all of that, I'm not sure fine-tuning would add much in my particular case.

    • @peterbizik224 · 3 days ago

      @artifishially_stupid Well, that's the way, I guess. But for realistic corporate operational needs, once managerial material with some level of competency gets involved, once it comes to "A LOT" of cleaning, it's a no-go :)

  • @rajeevsingh758 · 9 days ago

    Are you comparing fine-tuning with prompting????? Who made this chart?

  • @realtalkmotive836 · 9 days ago

    Hey man, kindly check your work email!! 😊

  • @explorer945 · 9 days ago

    Fabric AI summary:

    SUMMARY: Hamel, Dan, and guests discussed evaluation methods for large language models, including unit tests, using LLMs as judges, human evaluation, and various metrics.

    IDEAS:
    - Unit tests are a first line of defense for catching obvious failures.
    - Look at your data rigorously to find failure modes to test for.
    - LLM-as-judge can help scale evaluations but requires periodic human alignment.
    - Human evaluation is important but doesn't scale well for large datasets.
    - Metrics like recall, ranking, and the ability to return zero results are important.
    - Evaluations should evolve as you learn more about failure modes.
    - Code-based and LLM-based evaluations have different use cases.
    - Iterative grading of outputs can help refine evaluation criteria over time.
    - Evaluation criteria may drift as you see more outputs from the LLM.
    - Avoiding contamination of test data in base-model training is challenging.

    INSIGHTS:
    - Evaluations enable fast iteration and feedback for improving LLM applications.
    - Different evaluation methods suit different use cases and stages of development.
    - Evaluations are an iterative process of discovering and codifying desired behavior.
    - Human judgment is crucial for aligning evaluations with true goals.
    - Evaluation criteria and implementations should evolve with increased understanding.
    - Logging outputs and revisiting evaluations is important for production systems.
    - A combination of methods is often needed for comprehensive evaluation.
    - Evaluation frameworks can help, but the hard part is understanding requirements.

    QUOTES:
    - "If you don't have really dumb failure modes, like things that can trigger an assertion, oftentimes it's natural to think, hey, I can't write any unit tests for my AI because it's spitting out natural language and it's kind of fuzzy."
    - "We want to make sure that the more relevant ones are closer to the top. Personally, what I find to be quite important for RAG is this metric that I've never had to consider before."
    - "No evaluation assistant can just be a one-stop thing where you grade your examples, come up with evals, and then push it to your CI or your production workflow. No, you've got to always be looking."
    - "Grading has to be continual. You've always got to be looking at your production data; you've always got to be learning from that."

    HABITS:
    - Look at data rigorously to find failure modes to test for.
    - Use LLM-as-judge but periodically check human alignment.
    - Conduct human evaluation regularly, especially for evolving criteria.
    - Log outputs and revisit evaluations for production systems.
    - Iterate on evaluation criteria as understanding of requirements increases.
    - Grade outputs continually to refine evaluation criteria and implementations.
    - Check for contamination of test data in base-model training.
    - Use a combination of unit tests, LLM judges, and metrics for comprehensive evaluation.

    FACTS:
    - Unit tests are limited for open-ended language-model outputs.
    - LLM-as-judge can provide a directional signal but requires human alignment.
    - Human evaluation doesn't scale well for large datasets.
    - Metrics like recall, ranking, and zero-result ability are important for retrievers.
    - Evaluation criteria may drift as more outputs are seen.
    - Code-based and LLM-based evaluations suit different use cases.
    - Avoiding test-data contamination in base models is challenging.

    REFERENCES:
    - Hamel's blog post on the iteration cycle
    - SPADE paper on generating assertion criteria
    - Shreya Shankar's work on systematic LLM judging
    - LangSmith for logging, testing, datasets
    - Braintrust, Weights & Biases tools mentioned
    - Instructor library for the Honeycomb example
    - Eugene's writeups on LLM evals, hallucination detection, domain fine-tuning

    ONE-SENTENCE TAKEAWAY: Comprehensive evaluation of large language models requires an iterative process combining multiple methods (unit tests, LLM judges, metrics, and human evaluation) to continuously align with evolving goals.

    RECOMMENDATIONS:
    - Write unit tests to catch obvious failures as a first line of defense.
    - Look at data rigorously to find and test for different failure modes.
    - Use LLM-as-judge but periodically check alignment with human judgments.
    - Conduct regular human evaluation, especially when criteria are evolving.
    - Log outputs and revisit evaluations for production systems to refine criteria.
    - Iterate on evaluation criteria as understanding of requirements increases through grading.
    - Use a combination of methods like unit tests, LLM judges, and metrics.
    - Consider evaluation frameworks, but focus on understanding requirements first.
    - Check for contamination of test data in base-model training data.
    - Evaluate agents by breaking them down into steps and evaluating each component.
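
The "LLM as judge" idea from the summary above reduces to prompting a grader model with a rubric and parsing its verdict. A minimal sketch with the OpenAI SDK (rubric, judge model, and parsing are illustrative; per the summary, the judge should be periodically aligned against human grades):

```python
# LLM-as-judge (sketch): ask a model to grade an output against a rubric.
# Rubric, judge model, and parsing are illustrative assumptions; align the
# judge against periodic human grading, as the summary recommends.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": (
                "Grade the answer to the question as PASS or FAIL.\n"
                "PASS only if it is factually consistent and directly responsive.\n"
                f"Question: {question}\nAnswer: {answer}\nGrade:"
            ),
        }],
    )
    return "PASS" in resp.choices[0].message.content.upper()

print(judge("Who created Axolotl?", "Wing Lian created Axolotl."))
```
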

  • @explorer945 · 9 days ago

    Extracted Wisdom from Fabric Summary: This is a discussion about fine-tuning large language models using tools like Axolotl and Hugging Face Accelerate. The key points:

    Ideas:
    - Fine-tuning involves adapting a pre-trained language model to a specific task by further training on relevant data.
    - Choosing the right base model (size, family) and using LoRA vs. full fine-tuning are key decisions.
    - LoRA (Low-Rank Adaptation) is recommended over full fine-tuning as it requires less GPU memory.
    - Quantization like Q8BERT can further reduce memory requirements with some performance tradeoff.

    Insights:
    - Focus more on curating high-quality training data than obsessing over model details.
    - Write evaluations and assertions to validate outputs and filter bad training examples.
    - Use techniques like LLM-as-a-judge to encode human preferences into the fine-tuning process.
    - Iterate between training, evaluating, and improving the dataset; it's not a linear process.

    Habits:
    - Always sanity-check the fine-tuned model's outputs before deployment.
    - Inspect preprocessed data to catch tokenization issues early.
    - Use tools like Weights & Biases to track training metrics.
    - Start with working example configs and make incremental changes.

    Facts:
    - 7B-13B parameter models are a popular sweet spot for fine-tuning.
    - Larger models require model/data parallelism techniques like DeepSpeed and FSDP.
    - Mixed-precision training can reduce memory requirements with little accuracy loss.

    Recommendations:
    - Use Axolotl, as it bundles best practices and rapidly integrates new techniques.
    - Try Modal for cloud-based parallel hyperparameter tuning of Axolotl jobs.
    - Leverage tools like sample packing, offloading, and efficient memory loading in Accelerate.
    - Explore deployment-time techniques like greedy/top-k sampling for deterministic outputs.

    One-Sentence Takeaway: Focus on curating high-quality data and iterating between training and evaluation, using tools like Axolotl to rapidly leverage cutting-edge model-parallelism techniques.

  • @user-gj3kz7cm3x · 9 days ago

    LLM hype is causing severe brain rot.

  • @Ahmedelgebaly · 10 days ago

    Are there any details for this video like the first one? And chapters?

  • @yvettecrystal6075 · 10 days ago

    When fine-tuning an LLM with techniques like LoRA, what is the model actually doing? I know it is updating weights, but what does the model learn from it? Can anyone explain it in an intuitive way?

    • @fneful · 9 days ago

      The only thing LLMs are good at is predicting the next word given all previous words. Better weights mean better prediction (with more confidence) of the next word. You can think of it this way: if the model previously had doubt among 5 candidate words, after fine-tuning it is confused between, say, only 3.

    • @StarnikBayley · 8 days ago

      The original LLM's weights are frozen and left intact; nothing changes in the weights of the original model. However, LoRA adds new segments to the model, and those can be trained. A silly example: a person's vision may not be able to pick out camouflage in a jungle. LoRA is like adding a new segment, with its own weights, to the person's brain without modifying the existing brain, enabling the person to see green and brown with higher precision so that he can easily distinguish camouflage in the jungle.
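
Concretely, those trainable "new segments" are small low-rank matrices attached to selected layers while the base weights stay frozen. A minimal sketch with Hugging Face PEFT (model name and hyperparameters are illustrative):

```python
# LoRA with Hugging Face PEFT (sketch): the base model's weights stay frozen;
# small low-rank matrices are added to the targeted projections and only those
# are trained. Model name and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```
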

  • @tufcat722 · 11 days ago

    I think what is misleading here is conflating machine learning with LLMs. The scope of LLMs is not the same as machine learning overall. Fine-tuning of foundation models is not dead. Furthermore, aren't the big LLM companies like Anthropic already doing extensive fine-tuning on their own base models before releasing them to the public? How does that fit with this idea?

  • @riser9644 · 11 days ago

    great presentation

  • @hasaniqbal3180 · 12 days ago

    This was great, thank you.

  • @YoutubeThumbnailDesigner · 12 days ago

    I saw that you are looking for a thumbnail designer, plz let me know with yes or no.

  • @kaleemullahmalik · 12 days ago

    share your email please

  • @muhannadobeidat · 12 days ago

    Nice discussion, thanks for sharing. I am 70% into it and still haven't heard examples or justification for why fine-tuning should be avoided. There are lots of evaluation results, but that doesn't make sense if you are fine-tuning: you are mostly doing that to work on your own custom data, so generic evaluations may not apply to, nor portray, the real performance of the fine-tuned model. I fine-tune, for example, to better classify service requests into categories and potential solutions.

  • @officialchaitanyasharma · 13 days ago

    No description is given, with notes and all.

    • @khanate2750 · 13 days ago

      pajeet, this is a free video from a $500 course, don't expect extra stuff you don't deserve.

    • @hamelhusain7140 · 13 days ago

      If you want to contribute notes we are happy to provide them.

    • @GeniusGuy-pd5ys · 13 days ago

      @hamelhusain7140 Keep up the content! By the way, Hamel, I think you should get a thumbnail designer to boost views and CTR for your videos. And I am one of them 😄

  • @azogdevil · 13 days ago

    Thank You 😊

  • @Aditya_khedekar · 16 days ago

    Notes link is not working :)

    • @hamelhusain7140 · 16 days ago

      It's back up

    • @Aditya_khedekar · 16 days ago

      @hamelhusain7140 Hi, I really like your work! If possible, can you make a video on how to run OSS models on edge (iOS or Android) locally, like Ollama? I'm finding it difficult to bundle 2-3 models together in an Expo app.