Use OpenAI Whisper For FREE | Best Speech to Text Model

  • Published: 15 Oct 2024

Comments • 66

  • @engineerprompt
    @engineerprompt  10 months ago

    Want to connect?
    💼Consulting: calendly.com/engineerprompt/consulting-call
    🦾 Discord: discord.com/invite/t4eYQRUcXB
    ☕ Buy me a Coffee: ko-fi.com/promptengineering
    🔴 Join Patreon: Patreon.com/PromptEngineering

    • @milokornblum8672
      @milokornblum8672 8 months ago

      Can you tell me what the command is to output the results in subtitle (.srt) format?
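
      For reference, a likely answer assuming the standard openai-whisper CLI (the same !whisper command used in other comments here): the --output_format flag picks the writer, so something like the line below should produce an .srt file ("FILE NAME" is a placeholder).
      !whisper "FILE NAME" --model medium --output_format srt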

  • @Nihilvs
    @Nihilvs 11 months ago +6

    Thanks for the video! I've been using this model for a long while to do translation + transcription of lectures (an hour and a half each); mostly it works like a charm. I don't know about large-v3, but large-v2 would sometimes repeat and loop one sentence for about half of the transcription.
    So it needs optimization (some solutions clean the audio before Whisper).

    • @ACse-v7y
      @ACse-v7y 10 months ago +1

      Are the small and tiny models available in v3? If yes, please give me a link.

    • @marcin8432
      @marcin8432 10 months ago

      @@ACse-v7y Laziness destroys any kind of progress, bear that in mind.

  • @ekstrajohn
    @ekstrajohn 11 months ago +1

    I am using V2 on my Nvidia 1080 GPU. The performance difference between the base model and the large model is very small. I tried multiple sources, tones of voice, noise, etc. The base version is really fast, so I recommend that one. Even V2 is really perfect for transcribing speech to text.

    • @ACse-v7y
      @ACse-v7y 10 months ago

      Hey, can you please explain why the 10 GB model is stored as just 3.9 GB? I want to download the complete model because I want to host it on a server.

  • @thunderwh
    @thunderwh 11 months ago +1

    I like the idea of chatting with documents through speech

  • @RameshBaburbabu
    @RameshBaburbabu 11 months ago +4

    🎯 Key Takeaways for quick navigation:
    00:00 🎙️ *Overview of Whisper V3 Model*
    - Whisper V3 is OpenAI's latest speech-to-text model.
    - Five configurations available: tiny, base, small, medium, and large V3.
    - Memory requirements vary from 1 GB to 10 GB VRAM.
    01:25 🔄 *Comparison: Whisper V2 vs. V3*
    - V3 generally performs better with lower error rates than V2.
    - There are specific cases where V2 outperforms V3, demonstrated later.
    - Important to consider performance metrics when choosing between V2 and V3.
    03:02 ⚙️ *Setting Up Whisper V3 in Google Colab*
    - Installation of the necessary packages: Transformers, Accelerate, and Datasets.
    - GPU availability check and configuration for optimal performance.
    - Loading the Whisper V3 model, setting up the processor, and creating the pipeline (see the sketch after this summary).
    05:27 🎤 *Speech-to-Text Transcription Process*
    - Creating a pipeline for automatic speech recognition using the Whisper V3 model.
    - Uploading and transcribing an audio file in a Google Colab notebook.
    - Additional options such as specifying timestamps during transcription.
    07:45 🌐 *Language Recognition and Translation*
    - V2 may be preferable when language is unknown, as it can automatically recognize it.
    - Whisper supports direct translation from one language to another.
    - Highlighting the importance of specifying the language in V3 if known.
    09:22 ⚡ *Flash Attention and Distil-Whisper*
    - Enabling Flash Attention for improved performance if the GPU supports it.
    - Introduction to Distil-Whisper, a smaller, faster version of Whisper.
    - Demonstrating how to use the Distil-Whisper medium English model in code.
    11:54 🌐 *Future Applications and Closing*
    - Exploring potential applications, like enabling speech communication with documents.
    - Encouraging viewers to explore and experiment with the Whisper model.
    - Expressing the usefulness and versatility of the Whisper model in various applications.
    Made with HARPA AI
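
    For reference, a minimal sketch of the Colab setup summarized above (03:02-05:27), assuming the Hugging Face Transformers pipeline shown in the video; the audio file name is a placeholder:

    # Install the packages mentioned in the video (Colab)
    # !pip install -q transformers accelerate datasets

    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    model_id = "openai/whisper-large-v3"
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id)

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=15,
        batch_size=16,
        return_timestamps=True,   # include segment timestamps in the output
        torch_dtype=torch_dtype,
        device=device,
    )

    result = pipe("audio.mp3")   # placeholder file name
    print(result["text"])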

  • @ZaazZ-s8u
    @ZaazZ-s8u 8 months ago

    Hey. I have been trying to reduce the length of the subtitles, as the captions generated by Whisper can be overwhelming, ranging between 12 and 18 words in a single caption. I am using Google Colab and so far there's no success. Here are the commands I have used:
    !whisper "FILE NAME" --model medium --word_timestamps True --max_line_width 40
    !whisper "FILE NAME" --model medium --word_timestamps True --max_words_per_line 5
    It works completely fine with the following command, but with a large number of words per caption:
    !whisper "FILE NAME" --model medium
    Could you please help?
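
    A possible explanation, assuming the standard openai-whisper CLI: --max_line_width and --max_words_per_line only change the subtitle files that get written (srt/vtt), not the transcript printed in the Colab output, and --max_line_width is usually paired with --max_line_count. With a recent openai-whisper release, something like the following may behave as expected:
    !pip install -U openai-whisper
    !whisper "FILE NAME" --model medium --word_timestamps True --max_line_width 40 --max_line_count 2 --output_format srt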

  • @farahabdulahi474
    @farahabdulahi474 10 months ago

    Is it just me, or is this not a big jump in improvement? At least for me,
    I wanted:
    1. speaker recognition / diarization
    2. higher accuracy in Mandarin
    Hope they do the first ASAP, and the second will hopefully get better over time. Azure already has a speech-to-text service that includes speaker recognition and is quite good. I wonder if that could affect how they prioritise this important feature.

  • @ZaazZ-s8u
    @ZaazZ-s8u 8 months ago

    Hey. I have been trying to reduce the length of the subtitles but haven't been successful. I am using Google Colab. Here are the commands:
    !whisper "FILE NAME" --model medium --word_timestamps True --max_line_width 40 (didn't succeed)
    !whisper "FILE NAME" --model medium --word_timestamps True --max_words_per_line 5 (no success)
    Help needed, please.

  • @WillyFlowerz
    @WillyFlowerz 10 months ago

    Is there still NO practical application for real-time transcription (+/- translating the text) that is readily available on Android?
    I think I heard about one or two projects that wanted to do that, but still nothing concrete more than a full year after this incredible piece of technology appeared.
    Am I missing something? Is Whisper incompatible with Android? Is there no way to apply Whisper to a continuous live audio recording?
    Has nobody managed to do it??

  • @JuanGea-kr2li
    @JuanGea-kr2li 10 months ago +3

    VERY interesting. I would love to know how to run it locally, I mean with a UI on a local computer, not in a Google notebook. It would be very, VERY useful to transcribe a video, translate it later with another tool or model, and then generate subtitles or generate audio to translate the video, everything locally :)

    • @engineerprompt
      @engineerprompt  10 months ago +2

      Let me see if I can put together a Streamlit-based UI for it.
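
      For what it's worth, a minimal sketch of what such a Streamlit UI could look like, assuming the same Transformers pipeline used in the video; the file name app.py and the model choice are illustrative, not an announced implementation:

      # app.py -- run with: streamlit run app.py
      import streamlit as st
      import torch
      from transformers import pipeline

      @st.cache_resource
      def load_pipe():
          # Load the Whisper pipeline once and cache it across reruns
          device = "cuda:0" if torch.cuda.is_available() else "cpu"
          dtype = torch.float16 if torch.cuda.is_available() else torch.float32
          return pipeline(
              "automatic-speech-recognition",
              model="openai/whisper-large-v3",
              torch_dtype=dtype,
              device=device,
          )

      st.title("Local Whisper transcription")
      audio_file = st.file_uploader("Upload an audio file", type=["mp3", "wav", "m4a"])
      if audio_file is not None:
          with st.spinner("Transcribing..."):
              result = load_pipe()(audio_file.read(), return_timestamps=True)
          st.write(result["text"])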

    • @JuanGea-kr2li
      @JuanGea-kr2li 10 months ago

      @@engineerprompt awesome, thank you!

    • @TheScifichik
      @TheScifichik 1 month ago

      @@engineerprompt did this become a reality?

  • @binayaktv2646
    @binayaktv2646 2 months ago

    I want to use a prompt for a little personalization, how do I do it?
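
    In case it helps, the openai-whisper CLI (and transcribe() in Python) accepts an initial prompt that nudges the decoder towards particular spellings or vocabulary; the prompt text below is purely illustrative:
    !whisper "FILE NAME" --model medium --initial_prompt "Speaker names: Alice, Bob. Topic: Whisper speech-to-text."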

  • @bmqww223
    @bmqww223 5 months ago

    Greetings, can this be used with WhisperX too? I tried it with the v2 model and it used to work, I think.

  • @thevinnnslair
    @thevinnnslair 9 months ago +1

    Very helpful, thanks for this

  • @matangbaka
    @matangbaka 8 months ago

    Hi, can anyone help me? I'm having a problem following the tutorial. I encountered an error in:
    pipe = pipeline("automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    return_timestams=True,
    torch_dtype=torch_dtype,
    device=device)
    it says that:
    TypeError Traceback (most recent call last)
    in ()
    ----> 1 pipe = pipeline("automatic-speech-recognition",
    2 model=model,
    3 tokenizer=processor.tokenizer,
    4 feature_extractor=processor.feature_extractor,
    5 max_new_tokens=128,
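
    A likely cause, if the rest matches the video's notebook: return_timestams is a typo, and the Transformers pipeline expects return_timestamps, so the unknown keyword raises the TypeError. A corrected call (same arguments otherwise) would be:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=15,
        batch_size=16,
        return_timestamps=True,   # note the corrected spelling
        torch_dtype=torch_dtype,
        device=device,
    )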

  • @ericneeds1285
    @ericneeds1285 6 months ago

    I'm a "scopist" and I need to edit transcripts with different speakers, I take it this does not differentiate speakers?

    • @engineerprompt
      @engineerprompt  6 months ago

      I have a video on speaker identification on the channel.

  • @krstoevandrus5937
    @krstoevandrus5937 7 months ago

    Hi, I am a med student and I have some medical videos to transcribe. I tried the Whisper API vs. a local Colab run of large-v3, and I found the API MUCH better. The local large-v3 run is NOT acceptable unless there is human editing, while the API output is fully okay on its own. Even for a name called moblitz (correct), the API could recognize it, while the local run produced mobits (wrong). Would you mind commenting on this for me? Thanks.

  • @rccarsxdr
    @rccarsxdr 11 months ago +1

    Helpful video! Now I can run code locally. Thanks

  • @蘇矩賢
    @蘇矩賢 10 months ago

    Thank you for sharing. May I ask whether you provide Colab code for testing? Does the new version of Whisper still have a 25 MB file-size limitation? Previously, I was able to split files for batch processing and then merge them. However, when batch-processing SRT files, there seems to be a timing issue.

  • @WEKINBAD
    @WEKINBAD 23 days ago

    Sorry, but what about transcribing from a URL, like YouTube?

  • @ACse-v7y
    @ACse-v7y 10 months ago +2

    In this, are we downloading the model or using the API interface??? Actually, I am new to it and confused. If it is the model, then it will be best for me to host it on a server.

  • @benoitmialet9842
    @benoitmialet9842 11 months ago

    Whisper is an amazing model. I use the medium version, and with fine-tuning on only 40 minutes of audio it is able to adapt to a specific domain quite well (2 epochs are enough).
    I never tried to fine-tune large-v3, but I will. Large-v2 seems worse than medium for French.
    Did you have the opportunity to fine-tune large-v3 and compare its performance with medium?

    • @engineerprompt
      @engineerprompt  11 months ago +1

      I haven’t looked at fine tuning v3 yet. But seems like it will track with large v2.

    • @billcollins6894
      @billcollins6894 11 months ago

      Can fine-tuning help accuracy if the vocabulary is limited? I only want maybe 100 words to be recognized, but I want a high probability of a match. I have also tried to see how to get a return value for the probability of a word or phrase match, but I'm not clear on how to do that.

    • @benoitmialet9842
      @benoitmialet9842 11 months ago

      @@billcollins6894 If you fine-tune Whisper with audio containing these words, it should normally increase model performance on those specific words.
      If you set word timestamps to True as shown in the video, the returned dictionary normally gives you the probability for each word as well (as in the sketch below). I haven't tried it with the pipeline, but it works.
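
      For illustration, a minimal sketch using the openai-whisper package (not the Transformers pipeline), where word_timestamps=True exposes a per-word probability; the file name is a placeholder:

      import whisper

      model = whisper.load_model("medium")
      result = model.transcribe("audio.mp3", word_timestamps=True)

      for segment in result["segments"]:
          for word in segment["words"]:
              # Each word entry carries start/end times and a confidence-like probability
              print(word["word"], round(word["probability"], 3))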

  • @Vollpflock
    @Vollpflock 11 months ago

    So what's the difference to using Whisper through the API? Is this free and even better than using it through the API?

    • @engineerprompt
      @engineerprompt  11 months ago +1

      The main difference is that it's free if you can run it locally, compared to the API, which will cost you money. Performance seems to be the same.

  • @SonGoku-pc7jl
    @SonGoku-pc7jl 11 months ago

    Thanks :) Please, what is the config if the audio is in English and I want the text translated into Spanish? Does it only translate other languages into English, or is it possible to translate English to Spanish, for example? And with Distil-Whisper, whether it's the English-only .en model or not, is it possible to translate English audio to Spanish text? Thanks! :)
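
    For context, Whisper's built-in translate task only targets English, so English audio to Spanish text generally means transcribing first and then translating with a separate model. A hedged sketch, assuming the Transformers pipelines and the Helsinki-NLP/opus-mt-en-es translation model (illustrative choices, not something shown in the video):

    from transformers import pipeline

    # 1) Transcribe the English audio (Whisper itself cannot output Spanish)
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
    english_text = asr("audio.mp3")["text"]   # placeholder file name

    # 2) Translate English -> Spanish with a separate translation model
    #    (long transcripts may need to be translated in chunks)
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
    spanish_text = translator(english_text)[0]["translation_text"]
    print(spanish_text)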

  • @Nawaz-lb9eq
    @Nawaz-lb9eq 11 months ago +1

    Can you do a video on multi-speaker identification and transcription using whisper please.

    • @mbrochh82
      @mbrochh82 11 months ago

      don't think it is possible with whisper

  • @yuzual9506
    @yuzual9506 10 months ago

    You're a god of pedagogy, and I'm French!!! Thx!

  • @AustinStAubin
    @AustinStAubin 11 months ago

    Do you think you could show an example with diarization?

    • @engineerprompt
      @engineerprompt  11 months ago

      I haven't worked with it directly but I think I have seen some examples. Probably not directly in whisper but it can be used to augment. Here is something that seems interesting. tinyurl.com/yeysk4bz
      I will explore this further and see what I can come up with.

  • @contractorwolf
    @contractorwolf 9 months ago

    is there a colab for this?

  • @Hasi105
    @Hasi105 11 months ago

    Yesterday I was trying it out; nice that you explain it. Thanks! Can you tell me how to use a microphone for direct transcription? Maybe also a nice use case for TavernAI and/or MemGPT.

    • @engineerprompt
      @engineerprompt  11 months ago +2

      Sure, will create a video on it
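
      In the meantime, a minimal sketch of one way to feed microphone audio into the same pipeline, assuming the sounddevice package for recording; the duration and the pipe object from earlier are illustrative:

      import sounddevice as sd

      SAMPLE_RATE = 16000            # Whisper models expect 16 kHz audio
      SECONDS = 10                   # illustrative recording length

      # Record from the default microphone into a float32 NumPy array
      recording = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                         channels=1, dtype="float32")
      sd.wait()                      # block until the recording finishes

      # 'pipe' is the automatic-speech-recognition pipeline built earlier in the video
      result = pipe({"raw": recording.squeeze(), "sampling_rate": SAMPLE_RATE})
      print(result["text"])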

  • @trilogen
    @trilogen 9 months ago +1

    People want to run it locally for privacy, not route it through Google.

  • @DihelsonMendonca
    @DihelsonMendonca 10 months ago

    ⚠️ I need a good free text-to-speech. It doesn't help if you can talk to a model but it can't talk back. So, what to do? Any good, free text-to-speech?? 😮

    • @engineerprompt
      @engineerprompt  10 months ago

      You might want to check out github.com/suno-ai/bark

  • @ardavaneghtedari
    @ardavaneghtedari 10 months ago

    Thanks!

  • @easylife7775
    @easylife7775 9 months ago

    HELLO, can I use this to go from language A to language B, for example?

  • @techmoo5595
    @techmoo5595 10 months ago

    Just ran the Colab notebook and compared v3 with v2; I think v2 is better.

  • @Tyrone-Ward
    @Tyrone-Ward 9 months ago +2

    This is NOT running Whisper locally. Misleading title.

    • @listentomusic8160
      @listentomusic8160 8 months ago

      My PC has 128 MB of VRAM 😅. How am I supposed to run a 16 GB VRAM model on my local machine? 😂

  • @weslieful
    @weslieful 9 months ago

    What should I do, I wonder?