Hi, Thorsten, the community thrives because of people like you - thanks for your work!
Thank you for your very kind words 🥰
Thanks for your video. F5 TTS is absolutely stunning!
Let's hope they will include other languages (GERMAN) soon. ;)
Additional question: Does the model "re-learn" the voice everytime I want it to generate a sentence? Is there a way to learn the voice once and then use the trained model over and over again?
According to their community they are working on additional languages, including german 😊
huggingface "marduk-ra/F5-TTS-German"
OMG, you are life saver for me!! Awesome!!
Wow, thanks for your kind feedback 😊.
I tried this out on an RTX 3060 12GB card and it's fast. Quicker than speaking, maybe 2x faster to process than to listen to. Sounds really good to me.
Thanks for your helpful comment and the performance indicator on a 3060 👍🏻.
@ThorstenMueller I should have said it's paired with a Ryzen 2700. It's a pretty cheap rig now; I think you could buy both parts used for about 300 pounds on eBay, 30 pounds for the CPU and 270 for the GPU.
Or wait a year and pick up a 3090 24GB for the same price; they currently sit around 500. I did pick up a 24GB Tesla card from China for 300 (I forget the model number), which is good for really large LLMs.
Thank you for showing me this, I have a project I can purposefully upgrade now.
That was great!! Thanks for your content! I've got this running now and it is amazing!!
Thanks for your nice feedback 😊.
Hi Thorsten, thank you for another excellent tutorial. I have installed F5 on a Raspberry Pi 5 and it generates very good quality output, but, as expected, it is very slow. I am trying to understand how F5 works: does it take a standard model and modify it in some way using the ref_text & audio before generating the desired output (gen_text)? Is there an intermediate stage that could be executed separately? Thanks, Ernie
That whisper at the beginning really sounded like Stephan Molyneux?!!!
I enjoyed the intro, it made me laugh.
I'm happy you liked it 😊.
Haha the F5 joke😂.
The progress is amazing, right?
Still waiting for German support for F5...
Anyway, in English it is now already easy to create synthetic voice datasets, for Piper for example. Just an idea 😊
H(ei) 👋,
thanks for your nice comment 😊 and yes, progress is really impressive.
great stuff!
I tried it and it works, but it did not sound like me. Nothing close to what you did. Not a fan at this time; it really should have done better. Thanks for sharing, you got my thumbs up...
Thanks for your "thumbs up" and sorry to hear it didn't work for you as expected.
@ThorstenMueller Not your fault, you laid it out perfectly. It's probably the quality of my samples.
Thanks again
May I ask what gpu you are using, or if it is using a gpu?
When you start Gradio for the first time and the model is downloading, it shows PyTorch loading the models onto the CPU. I'll investigate that.
Correction: I'm running it on a 1080 Ti; it takes 16 seconds for 4 seconds of speech to synthesise. I don't know whether it's always re-analysing the reference as well.
Okay, further investigation: I kept the output text the same but uploaded a longer reference, and it then also took longer to synthesise. So the total time comprises reference processing as well as synthesis. It would be interesting to see how much time mere synthesis alone would take...
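A handy way to put timing reports like these in context is the real-time factor (RTF): synthesis time divided by the duration of the generated audio. A minimal sketch, using only the figures quoted in this thread:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio.

    RTF > 1 means slower than real time; RTF < 1 means faster than real time.
    """
    return synthesis_seconds / audio_seconds

# Figure reported above: 16 s of synthesis for 4 s of speech on a 1080 Ti.
rtf_1080ti = real_time_factor(16.0, 4.0)
print(f"1080 Ti RTF: {rtf_1080ti:.1f}")  # 4.0, i.e. 4x slower than real time

# The "2x faster to process than to listen to" comment implies roughly:
rtf_fast = real_time_factor(2.0, 4.0)
print(f"Faster-GPU RTF (approx.): {rtf_fast:.1f}")  # 0.5, i.e. 2x real time
```

Note that, as observed above, the measured time includes reference processing, so the RTF of synthesis alone would be somewhat lower.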
If you use F5 on Hugging Face, it will use a random GPU that is available at that moment. If you use it locally without CUDA (NVIDIA GPU), it will use the CPU.
Thanks! Is it feasible to do all of that through scripted Python code?
Good point 👍🏻. I took a quick look but did not see an obvious solution for native Python integration.
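One possible scripted route is to drive the project's command-line tool from Python. This is only a sketch: the command name `f5-tts_infer-cli` and the flag names below are assumptions about the pip package and should be verified against its `--help` output before relying on them.

```python
import subprocess

def build_infer_command(ref_audio: str, ref_text: str, gen_text: str) -> list[str]:
    """Assemble the F5-TTS inference command line.

    NOTE: the CLI name and flags are assumptions; check
    `f5-tts_infer-cli --help` on your installation.
    """
    return [
        "f5-tts_infer-cli",
        "--ref_audio", ref_audio,  # short clip of the target voice
        "--ref_text", ref_text,    # transcript of that clip
        "--gen_text", gen_text,    # text to synthesise in the cloned voice
    ]

cmd = build_infer_command("my_voice.wav", "Hello, this is my voice.", "Text to speak.")
# Uncomment to actually run synthesis (requires the model to be installed):
# subprocess.run(cmd, check=True)
print(cmd[0])
```

Wrapping the CLI this way avoids depending on the project's internal Python API, which may change between releases.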
What GPU do you have on your computer?
An NVIDIA 1050 Ti in this case.
Hello Thorsten, thanks for your great channel. I came across these videos, which show how one can train F5 with different languages: ruclips.net/video/UO4usaOojys/видео.html ruclips.net/video/RQXHKO5F9hg/видео.html As you are experienced with training speech models, I am wondering how many hours of material would be required to train a German-language model in good quality, and what should be considered with regard to the training data. In the referenced video the creator simply uses audiobooks. Can one expect to get a good quality model this way?
You made a reference to your computer speed. Care to elaborate on its GPU and CPU and ram?
You're absolutely right, I forgot to add it to the description. Thanks to your hint, my computer specs are now in the description 😊.
Great
Thank you 😊, I'm impressed by F5 too.
Is the online Hugging Face version better than running it locally?
The TTS model is the same; it's just a question of your locally available compute power. In my case, Hugging Face has been more performant.
Can this be deployed and hosted on a server?
Yes, absolutely 😊.
Can we use it for making YouTube videos and monetize them? I mean, is it legal?