I'll keep on learning from your great lectures.
That's the plan!
Amazing tutorial. One minor issue: the video uses the "title" column while the GitLab notebook uses the "summary" column.
Thank you, and good catch: it should be 'title'. I fixed the notebook.
Thank you for the video, Julien. My summarisation task involves (1) taking verbal recordings and (2) keeping certain domain-specific English phrases while fixing any grammatical errors. I don't have lots of inputs but have a lot of "output text". Do you think we can still train flan-t5?
May I ask why your domain-specific training data has a mismatch in row counts between the texts, summaries, and titles? I assume every row contains the text and its corresponding summary and title. Does it just ignore the ~15k rows which don't have a corresponding summary?
Thanks for making this vid; I wanted to see the magic before learning the theory.
What would it mean if the code in the training part runs for many hours but nothing happens? And when I stop it manually it says: INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
Something's wrong with the training container. You may need to update the Python version or the transformers version to a more recent release. See github.com/aws/deep-learning-containers/blob/master/available_images.md
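For what it's worth, a minimal sketch of pinning newer container versions on the HuggingFace estimator; the versions, script name, and instance type below are assumptions, so check available_images.md for valid combinations:

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Sketch: pin the training container to a more recent release. The versions
# below are examples only -- available_images.md lists valid combinations.
huggingface_estimator = HuggingFace(
    entry_point="train.py",  # assumption: your training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.g5.xlarge",
    instance_count=1,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
)
```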
@juliensimonfr Thank you for responding; I did not think I would get a response. I finally fixed the issue, but now the code has been running for 2 hours and I still see no output. Do you think I should just leave it running? I actually do not know if it's working, because it's not reporting errors, but it has been too long and I see no results yet.
@MAx-gi1pn Check the CloudWatch monitoring information for the training job (logs and graphs).
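A quick sketch of checking on a running job from the notebook, assuming the SageMaker Python SDK and boto3; the job name below is a placeholder:

```python
import boto3

# Stream the training job's CloudWatch logs from the notebook.
# `huggingface_estimator` is the estimator whose fit() you launched.
huggingface_estimator.logs()

# Or check the job status directly; replace the placeholder job name
# with the one printed when fit() started.
sm = boto3.client("sagemaker")
desc = sm.describe_training_job(TrainingJobName="my-training-job")
print(desc["TrainingJobStatus"], desc.get("SecondaryStatus"))
```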
Sir, I want to do abstractive summarization on the PubMed dataset, but it can't run on Colab. Please tell me about some platform for it.
Julien, can you do content on training with the summary label instead of the title? I'm concerned about the training capacity of Google Colab; I'm a free user.
Hi, you can restrict the number of training samples if needed, or use a smaller T5 model.
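If it helps, a minimal sketch of capping the training set with the datasets library; the dataset name and sample count are placeholders:

```python
from datasets import load_dataset

# Sketch: keep only 5,000 shuffled training samples to fit a smaller budget.
# "your_dataset" is a placeholder -- substitute the dataset you're using.
dataset = load_dataset("your_dataset", split="train")
small_train = dataset.shuffle(seed=42).select(range(5000))
```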
Is there not a way to be charged only for the compute resources that you actually use, i.e. per second of usage? (That's how it works with AWS Lambda.)
You can try serverless inference on SageMaker.
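A minimal sketch with the SageMaker SDK's serverless inference config; the memory size and concurrency values are arbitrary examples, and `huggingface_model` stands in for the model object from the video:

```python
from sagemaker.serverless import ServerlessInferenceConfig

# Sketch: deploy with serverless inference so you pay per request rather
# than per instance-hour. Values below are arbitrary examples.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=5,
)
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config,
)
```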
Great lectures. I used this model to summarize medical reports.
Great! If you can, please share the model on the Hugging Face hub :)
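In case it's useful, a minimal sketch of pushing a fine-tuned model to the hub; it assumes you've logged in with `huggingface-cli login`, and the repo name is a placeholder:

```python
# Sketch: publish the fine-tuned model and tokenizer to the Hugging Face hub.
# "my-medical-summarizer" is a placeholder repo name.
model.push_to_hub("my-medical-summarizer")
tokenizer.push_to_hub("my-medical-summarizer")
```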
Let's say I want to summarize any type of incoming document. Would I have to train a bunch of different models for different types of files, then determine the type of file before submitting it to be summarized? Is there a way to have a more general solution?
2 options IMHO:
1) a large summarization model trained/fine-tuned on tons of different documents
2) a text classification model (to figure out what the doc is about) + several small domain-specific summarizers
#1 may feel simpler, but it can be difficult to get great results if you don't have a lot of data and the domains are extremely different. #2 is also more flexible: you can add new domains without retraining a larger model every time (rough sketch of #2 below).
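A minimal sketch of option #2, assuming Hugging Face pipelines; the domain labels and summarizer model names are placeholders, and facebook/bart-large-mnli is just one commonly used zero-shot classifier:

```python
from transformers import pipeline

# Sketch: classify the document's domain, then route it to a
# domain-specific summarizer. Summarizer model names are placeholders.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
summarizers = {
    "legal": pipeline("summarization", model="my-org/legal-summarizer"),
    "medical": pipeline("summarization", model="my-org/medical-summarizer"),
    "news": pipeline("summarization", model="my-org/news-summarizer"),
}

def summarize(document: str) -> str:
    result = classifier(document, candidate_labels=list(summarizers.keys()))
    best_domain = result["labels"][0]  # highest-scoring domain
    return summarizers[best_domain](document, truncation=True)[0]["summary_text"]
```

Adding a new domain is then just a matter of fine-tuning one more small summarizer and adding it to the dictionary.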
Do we need a local GPU, or can everything be done through SageMaker? Why, then, do I see people complaining about GPUs all the time?
SageMaker is a cloud service, so it runs in the cloud ;)
My kernel crashes when I try to use flan-t5-large, while the small and base versions work fine. Does anybody know why? I can only select conda_pytorch_p310, not the p39 kernel Julien is using, and I'm also on the free tier of AWS services. Any help would be much appreciated; I've just gotten back into coding after 10 years and a lot has changed.
Great video! One question: the token limit is 512 on my model. Can I increase this safely?
No, the sequence length is a built-in property. You need to consider models with a longer sequence. Another popular option is to split long documents into chunks, summarize each chunk, and then summarize the summaries (rough sketch below) :)
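A minimal sketch of that chunking approach, assuming a Hugging Face summarization pipeline; the model name and chunk size are arbitrary examples:

```python
from transformers import pipeline

# Sketch: split a long document into word chunks, summarize each chunk,
# then summarize the concatenated summaries. 350 words per chunk is an
# arbitrary choice meant to stay under a 512-token limit.
summarizer = pipeline("summarization", model="google/flan-t5-base")

def summarize_long(text: str, chunk_words: int = 350) -> str:
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    partials = [summarizer(c, truncation=True)[0]["summary_text"] for c in chunks]
    return summarizer(" ".join(partials), truncation=True)[0]["summary_text"]
```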
Ok, thanks for the response! @juliensimonfr
Do you offer consultations for non-profits?
Hi, I'm afraid I can't find time for that. I would recommend posting a message at discuss.huggingface.co (maybe in the "community calls" forum?) and hopefully someone can help.
@juliensimonfr Great, thanks! Do you have any other videos on training models, specifically to summarize legalese accurately?