Fine tuning Whisper for Speech Transcription

  • Published: 10 Dec 2024

Comments • 98

  • @SiD-hq2fo
    @SiD-hq2fo 10 months ago +5

    I can't thank you enough for the quality content you are providing.
    Please continue to upload such videos!!

  • @miblish5168
    @miblish5168 10 months ago +3

    This video really saved my @$$. I had Whisper & Colab running a few months ago, but it broke. Your video and notebooks showed me why, and taught me several new tricks! Keep it up please.

    • @scifithoughts3611
      @scifithoughts3611 8 months ago

      @Trellis have you considered, instead of fine-tuning, using an LLM to correct the spelling of the Whisper output? (Prompt it to fix “my strell” to “mistrell”, etc.)

    • @scifithoughts3611
      @scifithoughts3611 8 months ago

      Or another alternative is to prompt Whisper with the context and correct spelling of its common transcript mistakes?

  • @ahmeterdonmez9195
    @ahmeterdonmez9195 1 month ago +2

    A lifesaving channel that I found by chance.💪

  • @gautammandewalker8935
    @gautammandewalker8935 8 months ago +1

    Great video! You are one of the best teachers I have ever heard.

  • @MindfulMountaineer
    @MindfulMountaineer 2 months ago

    By far the best explanation of ASR systems that I've seen so far! Thank you!

  • @pypypy4228
    @pypypy4228 10 days ago +1

    Underrated video! Thanks!

  • @anasdavoodtk3160
    @anasdavoodtk3160 10 months ago +3

    Great explanation. The drum story! Good work.

  • @AbdennacerAyeb
    @AbdennacerAyeb 10 months ago +2

    Easy, simple, well organized. Thank you.

  • @MaximeDde
    @MaximeDde 3 months ago

    Superbly explained, damn! I'm still looking for equivalents in French, but this video will help me set it up! Would be glad to have the chance to discuss with you, as I'm learning more about fine-tuning these beautiful transcription models for specific use-cases; there's a hell of a lot of value to bring here!

    • @TrelisResearch
      @TrelisResearch  3 months ago

      Merci, maybe I’ll do a vid in French some time. I just don’t know the LLM words.

    • @MaximeDde
      @MaximeDde 3 months ago

      @@TrelisResearch Willing to help once I get there, if you let me in on some knowledge of yours haha! I need more AI aficionados in my network :D

    • @MaximeDde
      @MaximeDde 3 months ago

      I’m going to rewatch your video to train my model in French. If you want, we can be in touch so I can help if you want to go for the French language? It would be a pleasure 🤗

  • @michaelblodow7779
    @michaelblodow7779 8 days ago +1

    Great video, thanks, got that.

  • @waelabou946
    @waelabou946 3 months ago

    You are very good at explaining this, thank you, and please keep going. 👍👍👍👍👍

  • @LucasJustinoCostaAssis
    @LucasJustinoCostaAssis 9 months ago +1

    This video was very instructive, thanks!
    For my case, I need a model that recognizes items on a list, consisting mainly of medical vocabulary, so a plain Whisper model does not get them. I will record the terms and their pronunciations at a later point, but are they inserted in the "DatasetDict()" part of the code instead of Hugging Face's "common_voice"? Also, how is the trained model saved and used in a new project? Until now I've only used a simple
    model = whisper.load_model("small")
    line in my projects.

    • @TrelisResearch
      @TrelisResearch  9 months ago +1

      Your training data will need to be prepared and included in the huggingface dataset (like the new dataset I created).
      To re-use the model, it's easiest to push it to huggingface hub as I do here, and then you can load it back down by using the same loading code I used for the base model (sketch below).
      Technically I think it's possible to convert back to the openai format as well and then load it using a code snippet like you did. See here: github.com/openai/whisper/discussions/830#discussioncomment-4652413
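
      A minimal sketch of that push/re-load workflow (not the exact notebook code; the repo id is a hypothetical placeholder):

      from transformers import WhisperForConditionalGeneration, WhisperProcessor

      repo_id = "your-username/whisper-small-medical"  # hypothetical Hub repo

      # After fine-tuning, push the trained model and processor to the Hub:
      # model.push_to_hub(repo_id)
      # processor.push_to_hub(repo_id)

      # In a new project, load them back exactly like the base model:
      model = WhisperForConditionalGeneration.from_pretrained(repo_id)
      processor = WhisperProcessor.from_pretrained(repo_id)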

  • @seancarmody1506
    @seancarmody1506 6 months ago +1

    Loved the video. I'm wondering if it's possible to do something similar using a vision model, say for example a ResNet trained for a certain task. Do you think it would be possible to train an adapter to allow the LLM to understand the ResNet features? I watched your Llava training video but the concept seemed a little different than I expected.

    • @TrelisResearch
      @TrelisResearch  6 months ago

      I suppose the original ResNet didn't include attention, so that would probably be a disadvantage compared to the transformers used now.
      But yes, in principle you could attach a ResNet to the inputs of an LLM - but I think it would be done something like in my llava / idefics video.

  • @m_tron99
    @m_tron99 8 months ago +1

    Great video. Can you do one on using WhisperX for diarisation and timestamping?

  • @heski6847
    @heski6847 10 months ago +2

    great, thx! I needed it.

  • @RustemShaimagambetov
    @RustemShaimagambetov 10 months ago +1

    Great video! How much data (rows) do we need to train to get acceptable results? Are 5-6 rows enough?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Yes, even 5-6 can be enough to add knowledge of a few new words. I only had 6 rows. Probably 12 or 18 would have been better here.

  • @jetpro
    @jetpro 9 months ago +1

    Do you know how to export it to ONNX and correctly use it in deployment? Helpful video!

    • @TrelisResearch
      @TrelisResearch  9 months ago

      I haven't dug into that angle for ONNX, but here's the guide for getting back from huggingface to whisper, and probably you can go from there? github.com/openai/whisper/discussions/830#discussioncomment-4652413

  • @Rems766
    @Rems766 8 months ago +1

    I'm having trouble fine-tuning the large-v3 model. When I'm evaluating, the compute_metrics function does not call the tokenizer method properly and it does not work. Any idea why?

    • @TrelisResearch
      @TrelisResearch  8 months ago

      Hmm, that's odd. I haven't trained the large model myself. I assume you tried posting on the github repo? Any joy there? Feel free to share the link if you create an issue.

  • @LinkSF1
    @LinkSF1 10 months ago +1

    Do you know if there’s a way to downsample the frequencies? E.g. if I have a 24 kHz sample I want to downsample to 16 kHz, what would be the preferred way of doing this?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Howdy! Actually you can check in this vid - there's a part towards the middle where I show how to downsample (a common approach is sketched below).
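
      A minimal sketch of one common way to do this with the Hugging Face datasets library (not necessarily the video's exact code; "your_dataset" is a placeholder). Casting the audio column resamples each clip to 16 kHz when it is decoded, which is the rate Whisper's feature extractor expects.

      from datasets import load_dataset, Audio

      ds = load_dataset("your_dataset", split="train")
      ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # e.g. 24 kHz -> 16 kHz on access
      print(ds[0]["audio"]["sampling_rate"])  # 16000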

  • @AndrewBawitlung
    @AndrewBawitlung 9 months ago +2

    What should I do when my language is not in the Whisper tokenizer?

    • @TrelisResearch
      @TrelisResearch  9 months ago +1

      Probably imperfect, but maybe you could choose the closest language and then fine-tune from there.

  • @AbhijeetTamrakar-k4l
    @AbhijeetTamrakar-k4l 10 months ago +1

    Recently I faced a situation where I fine-tuned a model on a training set and it returns good results on training-set or validation-set examples, but when I give an input it has never seen, it tends to produce contextually irrelevant results.
    Could you suggest what one should do in such a case?
    One thing we can do is make our training dataset more extensive, but other than that, can we do something else?

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      Create a separate held-out set using data that is not from your training or validation data (it could just be wikitext) and measure the loss on that during training (sketch below).
      If it rises quickly, then you are overtraining and need to train for fewer epochs and/or use a lower learning rate.
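
      A minimal sketch of wiring that in (assumptions: model, train_ds, data_collator and a held-out ood_eval_ds are already prepared the same way as in the fine-tuning notebook):

      from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

      args = Seq2SeqTrainingArguments(
          output_dir="whisper-finetune",
          eval_strategy="steps",   # older transformers versions call this evaluation_strategy
          eval_steps=25,           # evaluate every 25 steps
          num_train_epochs=2,
          learning_rate=1e-5,
      )

      trainer = Seq2SeqTrainer(
          model=model,
          args=args,
          train_dataset=train_ds,
          eval_dataset=ood_eval_ds,   # data NOT drawn from the training distribution
          data_collator=data_collator,
      )
      trainer.train()                 # watch eval_loss in the logs for a quick rise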

  • @PierreDELOM
    @PierreDELOM 10 months ago +1

    Very instructive videos. Next one with diarization?

    • @TrelisResearch
      @TrelisResearch  10 months ago +1

      interesting idea, I'll add to my notes

  • @WaléSassi
    @WaléSassi 7 months ago +1

    Good job!! But I'm not finding the checkpoint folders.

    • @TrelisResearch
      @TrelisResearch  7 months ago

      They'll be generated when you run through the training. Also, you need to set the output_dir (save directory) to somewhere you want the files to be.

  • @PrernaChander
    @PrernaChander 1 month ago

    Once you have the fine-tuned Whisper model, how do you load it and pass an audio file to it to generate an output in the form of a VTT file?

    • @TrelisResearch
      @TrelisResearch  1 month ago

      You may need to convert it, but then see this video and the free inference notebook: Fine tune and Serve Faster Whisper Turbo
      ruclips.net/video/qXtPPgujufI/видео.html
      (A rough sketch of one way to do it is below.)
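
      A minimal sketch (not the notebook's code; the model id is a hypothetical fine-tuned repo): load the checkpoint with the transformers pipeline, transcribe with timestamps, and write a .vtt file.

      from transformers import pipeline

      asr = pipeline(
          "automatic-speech-recognition",
          model="your-username/whisper-small-finetuned",  # hypothetical fine-tuned repo
          return_timestamps=True,
          chunk_length_s=30,
      )
      result = asr("audio.mp3")

      def fmt(t):  # seconds -> HH:MM:SS.mmm
          h, rem = divmod(t, 3600)
          m, s = divmod(rem, 60)
          return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

      with open("audio.vtt", "w") as f:
          f.write("WEBVTT\n\n")
          for chunk in result["chunks"]:
              start, end = chunk["timestamp"]  # end can be None for the final chunk
              f.write(f"{fmt(start)} --> {fmt(end or start)}\n{chunk['text'].strip()}\n\n")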

  • @KrishnaNamasivayam
    @KrishnaNamasivayam 6 months ago

    Thanks for this amazing video. I am trying to tune Whisper to understand slurred speech (e.g. cerebral palsy). Would a small data sample work for that scenario too? Thanks!

  • @tariqyahia9039
    @tariqyahia9039 9 months ago +1

    Question: does the training file have to be in VTT format, or can it be in .txt?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      It has to have timestamps, so VTT (or SRT, which you can convert to VTT).

  • @imranullah3097
    @imranullah3097 10 months ago +1

    Kindly make a video on the following:
    HiFi-GAN with a transformer
    Multi-modal (text+image)

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Thanks, I'll add them to my list. I was already planning on multi-modal some time; it will take me a bit of time before getting to it.

  • @estherchantalamungalaba5295
    @estherchantalamungalaba5295 4 months ago

    Hi. I'm fine-tuning Whisper for transcription on a Mac, using Hugging Face Transformers. I can't seem to figure out how to get the model and the data both onto the CPU or both onto the GPU. Loved this tutorial on fine-tuning and was able to follow along well until I hit this snag. And there doesn't seem to be wide enough support on the internet for this specific problem. Can you point me to any communities where I might be able to find help on specifically using Apple machines to fine-tune models? I'd really appreciate the help.

    • @TrelisResearch
      @TrelisResearch  4 months ago +1

      Cheers for the comment!
      For data processing, you shouldn't need to do anything; it should work fine on a Mac.
      In principle, I think you could fine-tune the model using transformers on a Mac, but it requires some digging around how to use MPS (the Mac GPU) properly and would require an M1, M2 or M3 Mac (anything other than that will be very slow). A device-selection sketch is below.
      Can I suggest you just run in a free Colab or Kaggle notebook?
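
      A minimal sketch of picking the device explicitly so the model and each batch end up in the same place (Apple-silicon "mps", CUDA, or CPU):

      import torch
      from transformers import WhisperForConditionalGeneration

      device = (
          "mps" if torch.backends.mps.is_available()
          else "cuda" if torch.cuda.is_available()
          else "cpu"
      )
      model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
      # Any input tensors must be moved to the same device before calling the model, e.g.:
      # input_features = input_features.to(device)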

    • @estherchantalamungalaba5295
      @estherchantalamungalaba5295 4 months ago

      @@TrelisResearch I’m running an M2 Mac so I will look into mps a little more. Thank you! Also, I hadn’t considered Kaggle notebooks. That might be a better alternative to Colab as Colab pro is not yet available in my region and the free tier resource allocation is just not good enough. Thanks again!

    • @TrelisResearch
      @TrelisResearch  4 months ago

      @@estherchantalamungalaba5295 Great, best of luck

  • @onursarikaya1385
    @onursarikaya1385 8 months ago +1

    Thank you! It's a great investment :)

  • @mrsilver8151
    @mrsilver8151 6 months ago

    Thanks for your great work.
    Is there any way to convert the fine-tuned model to run with faster-whisper,
    or is there another way to fine-tune for faster-whisper?

    • @TrelisResearch
      @TrelisResearch  6 months ago

      Yup - see here: github.com/SYSTRAN/faster-whisper/issues/248
      opennmt.net/CTranslate2/python/ctranslate2.converters.TransformersConverter.html (conversion sketch below)
      If you try it, could you let me know if that really gives a 4x speed-up on a GPU?
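
      A minimal sketch of that conversion (assumes ctranslate2 and faster-whisper are installed; the fine-tuned repo id is a hypothetical placeholder):

      import ctranslate2

      converter = ctranslate2.converters.TransformersConverter(
          "your-username/whisper-small-finetuned",
          copy_files=["tokenizer.json", "preprocessor_config.json"],
      )
      converter.convert("whisper-finetuned-ct2", quantization="float16")

      # Then load the converted folder with faster-whisper:
      from faster_whisper import WhisperModel

      model = WhisperModel("whisper-finetuned-ct2", device="cuda", compute_type="float16")
      segments, info = model.transcribe("audio.mp3")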

  • @thomashuynh6263
    @thomashuynh6263 1 month ago

    Hi friend,
    As in your video, how much GPU memory does Whisper large-v3 fine-tuning require?
    Thank you.

    • @TrelisResearch
      @TrelisResearch  1 month ago

      Just a few GB I think

    • @thomashuynh6263
      @thomashuynh6263 1 month ago

      @@TrelisResearch Does your approach work on the large-v3 model? I tried fine-tuning the large-v3 model and the fine-tuned model gives worse transcriptions, especially when transcribing non-English languages such as Malay and Chinese. Have you tried it on large-v3?

    • @TrelisResearch
      @TrelisResearch  1 month ago

      @@thomashuynh6263 I haven't tried that, but turbo comes from large, so it should work on large too...
      I recommend trying a smaller model first and then going to the larger model when you have turbo (or small) working.

    • @thomashuynh6263
      @thomashuynh6263 1 month ago

      @@TrelisResearch Yes, I tried a small model, and it works well. But when I try the large-v3 model it does not work well; the resulting transcript is worse :(

  • @imranullah3097
    @imranullah3097 10 months ago +1

    For a low-resource language, how do I train a tokenizer, add it, and then fine-tune Whisper?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Oooh, yeah, low-resource is going to be tough. Probably the approach depends on the language and whether it has close languages.
      Ideally you want to start with a tokenizer and fine-tuned model for a close language.
      If you do need to train a tokenizer, you can check this out: huggingface.co/learn/nlp-course/chapter6/2?fw=pt (a rough sketch is below).
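
      A minimal sketch along the lines of the linked course chapter (not Whisper-specific notebook code): re-train a tokenizer of the same design on your own corpus. Note that wiring a new tokenizer back into Whisper's multilingual special tokens takes extra care.

      from transformers import AutoTokenizer

      old_tokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")

      def corpus_iterator(path="corpus.txt", batch_size=1000):
          # Yield batches of raw text lines in the low-resource language.
          with open(path, encoding="utf-8") as f:
              batch = []
              for line in f:
                  batch.append(line.strip())
                  if len(batch) == batch_size:
                      yield batch
                      batch = []
              if batch:
                  yield batch

      new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=32000)
      new_tokenizer.save_pretrained("whisper-tokenizer-new-lang")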

  • @simonsu-yz9vo
    @simonsu-yz9vo 9 months ago +1

    Is it possible to fine-tune for speech translation?

    • @TrelisResearch
      @TrelisResearch  9 months ago

      Yes, you just need to format the Q&A (the audio/target-text training pairs) for that. A rough sketch is below.
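
      A minimal sketch of one way to set that up (not necessarily the video's exact code): build the labels with Whisper's "translate" task token so the targets are translated text rather than a same-language transcript.

      from transformers import WhisperProcessor

      processor = WhisperProcessor.from_pretrained("openai/whisper-small")
      processor.tokenizer.set_prefix_tokens(language="french", task="translate")

      # Labels are the English translation of what is spoken in the (French) audio:
      labels = processor.tokenizer("The English translation of the French audio.").input_ids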

  • @ASphoton_energy
    @ASphoton_energy 5 months ago

    Thanks so much, great description, but I'm a little confused. At 10'58'' when discussing the breakdown of the frequencies, you point to the blue graph and say: "here you can see it's just an amplitude graph". I'm confused; I thought the red graph in front of 'Time' would have been the amplitude graph?

    • @TrelisResearch
      @TrelisResearch  5 months ago +1

      yeah, sorry, it's a bit unclear.
      Indeed in the 3D graph, the red is amplitude and blue is frequency.

    • @ASphoton_energy
      @ASphoton_energy 5 months ago

      @@TrelisResearch Thanks for that. I noticed you uploaded a single mp3 file for training/testing. I have 8 hours of mp3 files; will the repo allow for an entire folder of many mp3s for training/testing data?

    • @TrelisResearch
      @TrelisResearch  5 months ago

      @@ASphoton_energy you can use ChatGPT to tweak the code so that it loops through multiple mp3 files (a tiny sketch is below). If you have trouble, you can create a comment in the github repo after buying and I'll add that capability.
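
      A tiny sketch of such a loop (the per-file processing is a placeholder for whatever the notebook does with a single mp3):

      from pathlib import Path

      for mp3_path in sorted(Path("training_audio").glob("*.mp3")):
          print(mp3_path)  # replace with the notebook's per-file processing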

  • @sumitjana7794
    @sumitjana7794 9 months ago +1

    I have transcribed text in .srt format; can I train with it?

    • @TrelisResearch
      @TrelisResearch  9 months ago +1

      Yes! And for this script you can just convert SRT to VTT losslessly using an online tool (or with a few lines of Python, sketched below).
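
      A minimal sketch of a local SRT-to-VTT conversion: the main differences are the "WEBVTT" header and '.' instead of ',' in the timestamps.

      import re

      with open("captions.srt", encoding="utf-8") as f:
          srt = f.read()

      # 00:01:02,345 -> 00:01:02.345 (only inside timestamp lines)
      vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)

      with open("captions.vtt", "w", encoding="utf-8") as f:
          f.write("WEBVTT\n\n" + vtt_body)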

    • @sumitjana7794
      @sumitjana7794 9 months ago

      thanks a lot
      @@TrelisResearch

  • @bijoyashreedas4242
    @bijoyashreedas4242 4 months ago

    Great work! Can someone help me with any resources for using Whisper for a code-switching application? How do I fine-tune for this purpose?

    • @TrelisResearch
      @TrelisResearch  4 months ago

      Can you clarify (maybe with a link) as to what code switching is?

  • @_loong9906
    @_loong9906 6 months ago

    Great video! But in my checkpoints there's no 'added_tokens.json' or 'config.json' and so on. What's happening? What did I miss?

    • @TrelisResearch
      @TrelisResearch  6 months ago

      You mean you are running training but not finding those files in your saved checkpoints, whereas you see them in my video when I do the same?

  • @bryantgoh1888
    @bryantgoh1888 6 months ago

    May I ask why the PEFT example given is different from the one you used in the video?

    • @TrelisResearch
      @TrelisResearch  6 months ago +1

      It’s a more generic example that I then adapted/integrated for this video

  • @dachuandu6539
    @dachuandu6539 6 months ago

    best explanation ever

  • @kavins8054
    @kavins8054 5 months ago

    Cool video, man!! But during my training the WER goes up to 100 while both the training and validation loss decrease. Help me.

    • @TrelisResearch
      @TrelisResearch  5 months ago

      Hard to say without having more details but often it can be because of small errors in the formatting of the timestamps in the training data.

  • @johnkidsman
    @johnkidsman 1 month ago

    It's a good one!

  • @daviddad7388
    @daviddad7388 4 months ago

    Great video thanks

  • @gabrielfraga2303
    @gabrielfraga2303 4 months ago

    Can I deploy this model output in Azure?

    • @TrelisResearch
      @TrelisResearch  4 months ago

      You mean on a GPU you rent from Azure?
      Yes, if it’s an Nvidia GPU for example

    • @gabrielfraga2303
      @gabrielfraga2303 4 months ago

      @@TrelisResearch I believe so, but I don't know which service would be more practical. I also looked at Azure AI Services and Azure ML, but I couldn't find anything like uploading a model fine-tuned with the code you made available on your website. Since I have Microsoft partnerships today, I'd like to serve the model produced after fine-tuning within Azure through some service, so it consumes my credits.

  • @MuhammadAliAhson
    @MuhammadAliAhson 3 months ago

    Hello Sir, I'm a student and I want to work on ASR. My problem is to make my model understand the Urdu language with great ease, making no mistakes.
    I'm so confused about the data. Even if I get the audio data, I don't have the text of that audio.
    Can you please help me with this matter?

    • @TrelisResearch
      @TrelisResearch  3 months ago

      Ooooh, yeah, you'll have to fine-tune for Urdu. What you need is some videos that have captions on them in Urdu, with the video spoken in Urdu. If you can find a set of (open source/public) videos like that, then you can grab that data to do training.

  • @ivor1113
    @ivor1113 6 months ago

    Can you share the code in ADVANCED-transcription?

    • @TrelisResearch
      @TrelisResearch  6 months ago

      Howdy! Yes, the code is in the ADVANCED-transcription repo, which you can buy lifetime access to (incl. future updates). If you buy and something is missing, you can create an issue in the repo.

  • @peace2941
    @peace2941 2 months ago

    I can't find the github repo anymore

    • @TrelisResearch
      @TrelisResearch  2 months ago

      Howdy! Have a look in your email for the receipt. Respond to that and I'll help further if needed.

  • @thomashuynh6263
    @thomashuynh6263 1 month ago

    Can this tutorial be applied to Faster Whisper?

    • @TrelisResearch
      @TrelisResearch  1 month ago

      I believe yes, if you convert to the right format.
      New video out soon on Whisper and I'll dig into it.

    • @thomashuynh6263
      @thomashuynh6263 1 month ago

      @@TrelisResearch Can your approach work well on Whisper's large-v3 model? I tried fine-tuning the large-v3 model, but the fine-tuned model produced worse transcriptions. The fine-tuned model can only transcribe English; it cannot transcribe other languages such as Chinese. It auto-translates Chinese to English, even though I specify transcribing in Chinese on Chinese audio files. Have you faced this issue?

  • @javadasoodeh
    @javadasoodeh 6 months ago

    Thank you for your explanation. Imagine I'm going to start training Whisper on a low-resource language. I don't have the entire dataset at first to feed for training. If I do the same as you do, meaning correct the transcription, pair it with the audio, and finally give it to the model for fine-tuning, and I do this several times, won't the model forget the previous learning or overfit?
    By and large, I would like to create a pipeline to collect pairs of audio with their manual transcriptions, and then fine-tune the model each time. Could you guide me on what I need to do it this way?

    • @TrelisResearch
      @TrelisResearch  6 months ago

      To avoid forgetting and overfitting, you should blend about 5% of original/English-type voice data into your new dataset (sketch below).
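
      A minimal sketch of that mixing step (dataset names are placeholders):

      from datasets import load_dataset, interleave_datasets

      new_ds = load_dataset("your-username/new-language-data", split="train")
      orig_ds = load_dataset("your-username/original-style-data", split="train")

      mixed = interleave_datasets(
          [new_ds, orig_ds],
          probabilities=[0.95, 0.05],   # roughly 5% original-style data
          seed=42,
      )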

  • @boringblobking3783
    @boringblobking3783 3 months ago

    ty for the op video

  • @user-jk9zr3sc5h
    @user-jk9zr3sc5h 10 months ago +1

    Would a DPO method theoretically work for more effectively fine-tuning Whisper?

    • @TrelisResearch
      @TrelisResearch  10 months ago

      Yeah, DPO could be good for general performance improvement.
      For adding sounds/words, standard fine-tuning (SFT) is probably best.