Wow, great experiment! At the end, I also wondered about the accuracy, so if it's an interesting topic for you, I'd be grateful if you shared your findings.
Thank you!
You can see the accuracy at 10:57 ruclips.net/video/pH07mng2jBU/видео.htmlsi=pGX2A9TTy_gcHFqc&t=657 .
The most common differences are punctuation and lowercase/uppercase.
However, I didn't test a real-life scenario.
The YouTube video has professional sound with a speech from a professional actor.
I don't know what the transcription quality would be in noisy spaces 😂
I'll let you know if I try it :)
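For anyone who wants to quantify the difference rather than eyeball it, here's a rough sketch of how the comparison could be scored. It assumes the third-party jiwer package (my choice, not something from the video) and strips exactly the punctuation/case differences mentioned above before computing the word error rate:

```python
import string

import jiwer  # assumption: pip install jiwer, not used in the video


def normalize(text: str) -> str:
    # Lowercase and drop punctuation so only real word differences count.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


reference = "Hello, World! This is a test."
hypothesis = "hello world this is a test"

# Word error rate after normalization; 0.0 means the only differences
# were punctuation and casing.
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # -> 0.0
```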
Is there any way to cut the chunking time of 10 seconds down?
OpenAI Whisper doesn't support streaming. There are some third-party libraries that add this feature, but I don't know about their performance.
You can find a link to one of them in the comments here.
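In the meantime, the workaround from the video is plain fixed-size chunking. Here's a minimal sketch of that loop, assuming the openai-whisper and sounddevice packages; the model size and recording setup are my own choices, not necessarily what the video used:

```python
import sounddevice as sd  # assumption: used here for mic capture
import whisper

SAMPLE_RATE = 16_000  # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 10

model = whisper.load_model("tiny")  # smaller models are the realistic choice on a Pi

while True:
    # Record one 10-second block (blocking call) as float32,
    # the format transcribe() accepts directly.
    chunk = sd.rec(CHUNK_SECONDS * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    result = model.transcribe(chunk.flatten(), fp16=False)  # fp16=False on CPU
    print(result["text"])
```

As discussed further down, fixed-size chunks can cut words in half, so treat this as a starting point, not a reliable implementation.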
Would it be easy to pass the transcription to Llama to summarize it, create a task list, etc.?
It shouldn’t be a problem.
There are a lot of technical issues with the transcription, as Whisper tries to transcribe sounds that aren't voices.
But I haven’t tried this.
@itkacher Yeah, but I meant on the RPi itself, with a Llama instance which might run on the Hailo?
I haven't tried Llama. I saw that people run it on a CPU, and it was very slow.
Sorry, I have no idea if it supports Hailo.
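If anyone wants to experiment with the summarization idea anyway, here's an untested sketch of wiring the transcript into a local Llama model. It assumes an Ollama server running on its default port with a model already pulled (e.g. "ollama pull llama3"); the model name and prompt are placeholders, this runs on the CPU (so expect it to be slow on a Pi), and whether any of it can use the Hailo is an open question:

```python
import requests

transcript = "...your Whisper transcription here..."  # placeholder

# Ollama's default local endpoint; "llama3" stands in for whatever
# model you actually have pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize this transcript and extract a task list:\n" + transcript,
        "stream": False,
    },
    timeout=300,
)
print(response.json()["response"])
```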
Nice work!! So if you put that text through translation and played it back as sound out... you would have live translation. If you do such a project, I'm highly interested to see the results :)
Thank you! To be honest, there are plenty of such solutions on the market.
Just Google "ai live translation". However, it's not so simple, and the devil is in the details.
The transcription worked perfectly fine on a speech from Netflix. In real life, sounds and noises will add some false words.
Additionally, the narrator's quality does matter.
Then, the translation works great, but it also produces a lot of false translations.
So it will work, but the quality wouldn't be that good.
And the process requires something more powerful, like an Nvidia Jetson, Xavier, etc.
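That said, the shortest path to trying it: Whisper can translate speech into English directly with task="translate" (English only as the target). Here's a bare-bones sketch of the translate-and-speak idea; the text-to-speech half uses the third-party pyttsx3 package as a stand-in, and "speech.wav" is a placeholder file. On real, noisy audio you'd hit exactly the false-word and false-translation issues described above:

```python
import pyttsx3  # assumption: any offline TTS engine would do here
import whisper

model = whisper.load_model("small")
tts = pyttsx3.init()

# task="translate" makes Whisper output English text directly,
# whatever language is spoken in the file.
result = model.transcribe("speech.wav", task="translate")  # placeholder file

tts.say(result["text"])  # "give that as sound out"
tts.runAndWait()
```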
How can I reach you? I have some questions.
I haven't received any requests on LinkedIn, so I assume you've figured out all the questions :)
Chunking in 10-second blocks is no good; you can cut in the middle of a word, and the model will just guess from context. It's better to use the wrapper project whisper_streaming, which handles streaming audio much more correctly.
You are right, the 10-second approach could cut in the middle of a word.
However, I didn't find a native solution from OpenAI
(docs: platform.openai.com/docs/guides/speech-to-text#improving-reliability )
The purpose of the video was to test performance, and I'm sure a wrapper doesn't improve it.
I guess they "feed" content a few times to cover the "cut" case, but that's an additional operation that will consume even more CPU.
So if someone is looking for a reliable implementation - yes, they should think about it.
Thank you for noticing it :)
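For the curious, here's roughly what that "feed content a few times" guess could look like: carry an overlap between chunks so a word cut at a boundary is heard in full in the next chunk. This is a naive illustration, not how whisper_streaming actually works internally, and the extra audio is exactly the extra CPU cost mentioned above:

```python
import numpy as np

SAMPLE_RATE = 16_000
OVERLAP_SECONDS = 2
OVERLAP_SAMPLES = OVERLAP_SECONDS * SAMPLE_RATE

tail = np.zeros(0, dtype=np.float32)  # audio carried over from the last chunk


def with_overlap(chunk: np.ndarray) -> np.ndarray:
    """Prepend the previous chunk's tail so boundary words are heard in full."""
    global tail
    padded = np.concatenate([tail, chunk])
    tail = chunk[-OVERLAP_SAMPLES:]  # remember our own tail for the next call
    return padded  # 12 s of audio instead of 10 -> more CPU per chunk
```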
I gave up on Whisper and Faster-Whisper for my voice assistant RPi; it's slow, inaccurate, and has some hallucinations. For some reason, Google speech recognition is much faster and more accurate lol
Hm… I haven’t tried it. Will do
Thanks for pointing it out!
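For anyone else who wants to try the Google route from the comment above, here's a minimal, untested sketch using the third-party SpeechRecognition package; note that, unlike Whisper, it sends the audio to Google's free web API, so it needs an internet connection:

```python
import speech_recognition as sr  # pip install SpeechRecognition

recognizer = sr.Recognizer()
with sr.AudioFile("speech.wav") as source:  # placeholder WAV file
    audio = recognizer.record(source)

# recognize_google() calls Google's free web speech API.
print(recognizer.recognize_google(audio))
```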