Summarizing legal documents with Hugging Face and Amazon SageMaker

  • Published: 20 Oct 2024
  • Real-life generative AI! In this video, I show you how to fine-tune a Google FLAN-T5 model to summarize legal text.
    We first deploy the model straight from the Hugging Face Hub to Amazon SageMaker, and we evaluate it on legal data. Then, using GPU instances managed by SageMaker, we fine-tune the model with a Hugging Face script and we deploy it again.
    October 2023: follow-up video on using QLoRA to optimize cost-performance: • Parameter-efficient fi...
    ⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos ⭐️⭐️⭐️
    Code: gitlab.com/jul...
    Model: huggingface.co...
    Dataset: huggingface.co...
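
    For reference, deploying a FLAN-T5 checkpoint straight from the Hub with the SageMaker Python SDK looks roughly like the sketch below. This is not the exact notebook code; the model ID, container versions, and instance type are assumptions.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Pull the model straight from the Hugging Face Hub (model ID is an example).
hub_config = {
    "HF_MODEL_ID": "google/flan-t5-base",
    "HF_TASK": "summarization",
}

model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.26",  # assumed versions; pick a supported combination
    pytorch_version="1.13",
    py_version="py39",
)

# Deploy to a real-time endpoint on a GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

print(predictor.predict({"inputs": "Long legal text to summarize..."}))
```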

Comments • 28

  • @caiyu538
    @caiyu538 1 year ago +1

    Keep on learning from your great lectures.

  • @stephenielane783
    @stephenielane783 8 months ago

    Thank you for the video Julien. My summarisation task involves (1) taking verbal recordings, (2) keeping certain domain-specific English phrases, and (3) fixing any grammatical errors. I don't have many inputs but I have a lot of "output text". Do you think we can still train flan-t5?

  • @anuragbhatia1980
    @anuragbhatia1980 1 year ago +1

    Amazing tutorial. One minor issue: Video uses "title" column while the Gitlab notebook uses "summary" column.

    • @juliensimonfr
      @juliensimonfr  1 year ago +1

      Thank you, and good catch: 'title' it should be. I fixed the notebook.
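
      For illustration, the corrected preprocessing could look like the sketch below, using "title" as the target column. The tokenizer choice, column names, and length limits are assumptions, not the exact notebook code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

def preprocess(batch):
    # Tokenize the document text as the model input...
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    # ...and the "title" column (not "summary") as the target sequence.
    labels = tokenizer(batch["title"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Typically applied with: dataset.map(preprocess, batched=True)
```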

  • @giantdutchviking
    @giantdutchviking 11 months ago

    May I ask why your domain-specific training data has an imbalance in row counts between the texts, summaries, and titles? I assume every row contains the text and its corresponding summary and title. Does training just ignore the +/- 15k rows that don't have a corresponding summary?
    Thanks for making this video; I wanted to see the magic before learning the theory.

  • @caiyu538
    @caiyu538 1 year ago

    Great lectures. I used this model to summarize medical reports.

    • @juliensimonfr
      @juliensimonfr  1 year ago

      Great! If you can, please share the model on the Hugging Face hub :)
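
      For anyone who wants to do that, pushing a fine-tuned checkpoint to the Hub takes a few lines with transformers; the local path and repo name below are placeholders.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder paths and repo name; run `huggingface-cli login` first.
model = AutoModelForSeq2SeqLM.from_pretrained("./my-finetuned-flan-t5")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-flan-t5")

model.push_to_hub("my-username/flan-t5-medical-summarization")
tokenizer.push_to_hub("my-username/flan-t5-medical-summarization")
```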

  • @aru6575
    @aru6575 8 months ago

    Julien, can you do content on training with the summary label instead of the title? I'm concerned about the training capacity of Google Colab; I'm a free user.

    • @juliensimonfr
      @juliensimonfr  8 months ago

      Hi, you can restrict the number of training samples if needed, or use a smaller T5 model.
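
      For example, with the datasets library you can subsample the training set before fine-tuning; the dataset ID and sample count below are placeholders.

```python
from datasets import load_dataset

# Placeholder dataset ID; substitute the one used in the video.
dataset = load_dataset("my-org/legal-summarization", split="train")

# Keep only 1,000 shuffled examples to fit a smaller compute budget.
small_train = dataset.shuffle(seed=42).select(range(1000))

# Alternatively, fine-tune a smaller checkpoint such as google/flan-t5-small.
```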

  • @MAx-gi1pn
    @MAx-gi1pn 6 months ago

    What would it mean if the code in the training part runs for many hours but nothing happens? And when I stop it manually, it says: INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.

    • @juliensimonfr
      @juliensimonfr  6 months ago

      Something's wrong with the training container. You may need to update the Python version or the transformers version to a more recent release. See github.com/aws/deep-learning-containers/blob/master/available_images.md
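
      Concretely, the versions are pinned on the Hugging Face estimator; the entry point, hyperparameters, and version combination below are assumptions, so check them against the available_images list linked above.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",        # your Hugging Face training script
    source_dir="scripts",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",   # pick a combination listed in available_images.md
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"epochs": 1, "model_name": "google/flan-t5-base"},
)

# Placeholder S3 URI for the tokenized training data.
huggingface_estimator.fit({"train": "s3://my-bucket/train"})
```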

    • @MAx-gi1pn
      @MAx-gi1pn 6 months ago

      @@juliensimonfr Thank you for responding to me, I did not think I would get a response. I finally fixed the issue, but now the code has been running for 2 hours and I still see no output. Do you think I should just leave it running? I actually don't know if it's working, because it's not reporting any errors, but it has been running too long and I see no results yet.

    • @juliensimonfr
      @juliensimonfr  6 months ago

      @@MAx-gi1pn Check the CloudWatch monitoring information for the training job (logs and graphs).

  • @danielguns2019
    @danielguns2019 8 months ago

    Great video! One question: the token limit is 512 on my model. Can I increase it safely?

    • @juliensimonfr
      @juliensimonfr  8 months ago

      No, the sequence length is a built-in property of the model. You need to consider models that support a longer sequence length. Another popular option is to split long documents into chunks, summarize each chunk, and then summarize the summaries :)
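
      As an illustration of that chunking approach, here is a sketch (not code from the video; the chunk size and model are assumptions):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="google/flan-t5-base")

def summarize_long_document(text, chunk_size=2000):
    # Naive character-based chunking, sized to stay under the model's
    # sequence limit; a tokenizer-aware splitter is better in practice.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # First pass: summarize each chunk independently.
    partial = [summarizer(c, max_length=128)[0]["summary_text"] for c in chunks]
    # Second pass: summarize the concatenated partial summaries.
    return summarizer(" ".join(partial), max_length=128)[0]["summary_text"]
```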

    • @danielguns2019
      @danielguns2019 8 months ago

      Ok thanks for the response! @@juliensimonfr

  • @iqranaveed2660
    @iqranaveed2660 1 year ago

    Sir, I want to do abstractive summarization on the PubMed dataset, but it can't run on Colab. Please suggest a platform for it.

  • @meirgoldenberg5638
    @meirgoldenberg5638 1 year ago

    Is there no way to be charged only for the compute resources that you actually use, i.e. per second of usage? (That's how it works with AWS Lambda.)

    • @juliensimonfr
      @juliensimonfr  1 year ago

      You can try serverless inference on SageMaker.
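
      A minimal sketch with the SageMaker Python SDK; the memory size, container versions, and model ID are assumptions, and note that serverless endpoints are CPU-only:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    env={"HF_MODEL_ID": "google/flan-t5-base", "HF_TASK": "summarization"},
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Pay per request instead of keeping an instance running.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # maximum allowed memory
        max_concurrency=1,
    )
)
```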

  • @stayinthepursuit8427
    @stayinthepursuit8427 1 year ago

    Do we need a local GPU, or can everything be done through SageMaker? Why then do I see people complaining about GPUs all the time?

    • @juliensimonfr
      @juliensimonfr  1 year ago

      SageMaker is a cloud service, so it runs in the cloud ;)

  • @holydarknes
    @holydarknes 1 year ago

    Let's say I want to summarize any type of incoming document. Would I have to train a bunch of different models for different types of files, then determine the type of file before submitting it to be summarized? Is there a way to have a more general solution?

    • @juliensimonfr
      @juliensimonfr  1 year ago +3

      2 options IMHO:
      1) a large summarization model trained/fine-tuned on tons of different documents
      2) a text classification model (to figure out what the doc is about) + several small domain-specific summarizers
      #1 may feel simpler, but it can be difficult to get great results if you don't have a lot of data and if the domains are extremely different. #2 is also more flexible: you can add new domains without retraining a large model every time. See the sketch below.
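
      A sketch of option #2; the zero-shot classifier is a real checkpoint, but the domain summarizer model IDs are placeholders, not real models:

```python
from transformers import pipeline

# Zero-shot classifier routes each document to a domain-specific summarizer.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
summarizers = {
    "legal":   pipeline("summarization", model="my-org/flan-t5-legal"),
    "medical": pipeline("summarization", model="my-org/flan-t5-medical"),
}

def summarize(document):
    # Classify on a prefix to stay within the classifier's context window.
    result = classifier(document[:1000], candidate_labels=list(summarizers))
    domain = result["labels"][0]  # highest-scoring label comes first
    return summarizers[domain](document)[0]["summary_text"]
```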

  • @truthseeker318
    @truthseeker318 5 months ago

    Do you offer consultations for non-profits?

    • @juliensimonfr
      @juliensimonfr  5 months ago +1

      Hi, I'm afraid I can't find time for that. I would recommend posting a message at discuss.huggingface.co (maybe in the "community calls" forum?) and hopefully someone can help.

    • @truthseeker318
      @truthseeker318 5 months ago

      @@juliensimonfr Great, thanks! Do you have any other videos on training models, specifically to summarize legalese accurately?

  • @dstyle5120
    @dstyle5120 1 year ago

    My kernel crashes when I try to use flan-t5-large, while the small and base versions work fine. Does anybody know why? I can only select the conda_pytorch_p310 kernel, not the p39 one Julien is using, and I'm on the free tier of AWS services. Any help would be much appreciated; I've just gotten back to coding after 10 years and a lot has changed.