Thanks to our sponsor Wondershare Filmora, a user-friendly video editor supercharged with AI features. bit.ly/4f60nmB
I have an AMD Radeon 6800 XT graphics card, so I don't have CUDA. What do I do? Please help me.
It's amazing! Thanks a lot. Do you know if there will be a Spanish option to generate speech? Thanks
Yes, me too, I have an RX 6800
I only have an AMD GPU, how do I install it?
I need this AI voice cloner
please help
Says this was posted 6 days ago, but when I go to the site it's got a different setup, so links are gone or changed, etc. Should we assume they've folded some of these features, like the requirements, into the main installation process?
Hey, there's a problem: after reinstalling Miniconda3 and checking the Scripts folder, I wasn't able to find conda.exe. I'd appreciate it if you could provide a solution.
I love how you don't assume that I know what you know, bothered to explain the basics, and made timestamps so the more knowledgeable can skip ahead. Excellent, man!!!
So we can't train it properly on a larger audio file? (You can't pack enough vocal range into a short clip for professional work...)
thanks!
This is wild! It’s crazy how little input audio it requires. Also I just wanted to say thanks. If it weren’t for you I would have never discovered my passion for creating AI voice models!
Are you making money from it? It would be very helpful if you could share some insights.
You're welcome! Glad you found your passion
That's definitely a passion no one could have claimed five years ago, I'll tell you that much.
@@amitnishad0777 It is for non-commercial use only.
@@amitnishad0777 No, I guess I could do commissions, but I haven't really thought much about it. I also want to improve before I do something like that, as I'm too amateurish at data cleaning atm.
I've followed your channel since the early days. I'm super happy about your growth, and also super happy when you make content like this, for non-tech people to be able to try and have fun with AI. A dedicated video everyone can follow. Keep up the good stuff!
Thank you so much!!
I watch lots of tutorials on youtube. This one is among the best. Keep up the good work and thanks for sharing your know-how!
Thanks!
Voice synthesis with emotions? That’s a next-level breakthrough for personalizing user experiences. Feels like we're inching closer to seamless AI-human conversations.
First we'd need to get closer to seamless human-to-human communication )
@@homuchoghoma6789 We already talk however we want. All the barriers are in our own heads.
curious if they moan
@@AlterRizz Do you often watch Japanese, Korean, Italian, British, French, German, and other channels with dubbing in a language you understand?
@@homuchoghoma6789 I watch in many languages because I know many languages. Not knowing 3+ languages in the 21st century is a skill issue.
That Chinese/English mixing is simply perfect. Any Chinese speaker, whether it's Mandarin or even Cantonese, just talks like that. The TTS shows no flaw in its voice, tone, or pronunciation; if I played it for my friends and family, they couldn't really spot the common AI tells in it.
thanks for sharing!
If you generate anything longer than 10 minutes, you'll notice that the voice model gets worse and worse until it becomes absolute gibberish and then static noise at around an hour
Yes, I tested it.
Do you know how ElevenLabs does this?
@@CodewithRiz I guess they split the given text into multiple parts, generate for each one, then merge them into one file.
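The split-generate-merge approach guessed at above can be sketched in a few lines of Python; the `synthesize` callback is a hypothetical stand-in for whatever TTS backend returns audio samples, so this is purely illustrative:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text on sentence boundaries into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long(text, synthesize, max_chars=200):
    """Generate each chunk separately, then concatenate the audio samples."""
    audio = []
    for chunk in chunk_text(text, max_chars):
        audio.extend(synthesize(chunk))  # each call returns a list of samples
    return audio
```

Real pipelines would also cross-fade the chunk boundaries so the seams are inaudible.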
@@CodewithRiz Do you think ElevenLabs is using this same model?
@@DLuzElAngelMusikal No, ElevenLabs supports 32 languages. F5-TTS is trained on, and supports, English and Chinese only.
The thumbnail man 🤣 man of culture! like and sub!
He changed it, what was it 😭?
yamete kudasai ...🌶🤣
I'll save it for later. Thank you so much for the detailed tutorial man! Your channel is excellent!
you're welcome!
I have subscribed just because of this video - man - what a find. Great work.
Thank you!
Man, your channel is the bomb 💣
And right, that "Spanish" reading was a little bit hilarious and awful at the same time. Hope they make more languages available soon.
3 of your videos in a row. New subscriber here!
After a break, I deleted all the downloaded files and started again, this time successfully. My first error was with downloading the programs: stick to the older, nominated versions! Don't think that installing a newer version will make things better; it won't! The program is brilliant and will save me a lot of money. Thank you! Where I went wrong was creating the virtual environment: you said to enter "conda activate f5", but you must run "conda init" first, hit Enter, and then run "conda activate f5". Once done, it went smoothly.
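For anyone following along, the working command order described here looks roughly like this (the environment name `f5` is from the video; the Python version is an assumption, so use whatever the tutorial specifies):

```shell
conda init            # one-time shell setup; close and reopen the terminal afterwards
conda create -n f5 python=3.10
conda activate f5     # only works once conda init has run
```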
thanks for sharing!
Thank you for that little note "init".
@@Vojec9 *laughs in Bri'ish*
I have the same issue, but after conda init, typing conda activate again tells me to run conda init first...
@leodark_animations2084 Sorry to hear that. Afraid I'm no expert and just stumbled my way through. I'd just shut the computer down and restart; see how you go?
Great but needs to support more languages.
Gotta love installing installers for installing installers in an installer that installs the installer needed for a virtual environment used for installing an installer for a tts program. 👍
this is great for npcs in video games
Truly appreciate the detailed installation procedure, made my life much easier. Thanks!
you're welcome!
WW thumbnail
😏
Sauce
+1
What the thumbnail
@@theAIsearch I don't remember if you mentioned your hardware. Can it do inference fast enough for realtime tts of text streams?
That thumbnail... He knew what he was doing
What about it? Can you guys hear waveforms by looking at a picture of them or something?
A computer could probably
@@jaredf6205 I think he had some A/B testing going on with the thumbnail. One is a normal waveform thumbnail, and the other also has a waveform pic paired with a... sus anime pic.
@@AimaruVee Oh it's actually coming up for me now
Ai girlfriends are becoming a reality we are doomed 😭😭😭
Ive been looking for this for a long time so thank you👍🏼
sounds good, but not good enough. I'll wait a bit longer for an upgrade
Right! Not good enough. I can tell it's ai
You're lucky you can tell it's AI. Wait until you get a phone call and you can't tell whether it's real or AI.
Surely the best among the free ones. If you want the absolute best and are willing to pay, try eleven labs.
@@InnerEagle Maybe that's the thing he tries to build?
@@hilmiterzi6369 Right, then he has to wait for it
finally my local voice AI companion will have emotions!
Why do i have a bad feeling about this
@@jahpistol3486 bro 💀
How do you use it on mobile phones?
@@captteemo9133 I built the bot from scratch. The basis of my bot is Ollama; for fast responses I used Llama 3.2 with 1B parameters. Speech recognition runs on Whisper. I used to work with VOSK, which is not inferior, by the way; only Whisper can insert punctuation marks into the recognized text. Speech synthesis is based on the Coqui TTS VITS multi-speaker model. Unfortunately, it will not work on a smartphone.
Sir, YOU ARE AMAZING. BELLED to GET notified of everything you make. Simply WOW.
Thanks!
This is absurdly crazy~!!! Many thanks for the installation walkthrough ~!!!
you're welcome!
This is amazing and so well explained, thanks !
I'm glad that this is being developed, even if it's still at a point where I wouldn't even enable it if it was as easy as a toggle, let alone dig into code to get it working.
Great tutorial. Added to our best of AI.
I hope one day someone makes an open-source AI that makes songs like Suno or Udio.
This AI is really good...at sounding like a bad audiobook narrator! 😂 It nails those over-the-top emotions, but they don't sound very human. Maybe the problem is that it's trained on audiobooks, where the emotions are often exaggerated.
What if we used this "fake emotion" data to our advantage? First, train an AI to recognize those audiobook patterns. Then, train a second AI to spot real emotions in everyday speech from YouTube, podcasts, etc. The second AI could learn to tell the difference between fake and genuine, and we'd get an AI that truly understands how we express emotions! What do you guys think?
Have you tried the ElevenLabs Reader for audiobooks? Not all the voices are great, but I found the Burt Reynolds voice works really well for audiobooks. It also works in different languages.
@@samuel_innerwinkler Lol Burt Reynolds was an actor.
I think that's what a lot of these AI models use. It's called a discriminator, and its job is to do just that: determine whether a piece (image, audio, etc.) is genuine or AI-generated. That's the extent of my knowledge; I don't know much beyond that, or whether they use one for this voice model.
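As a toy illustration of the discriminator idea (not the actual model's code), it's just a binary classifier trained to separate "real" from "generated" examples. Here is a minimal perceptron-style sketch on made-up one-dimensional features, where real speech is assumed to cluster at higher feature values:

```python
def train_discriminator(real, fake, epochs=100, lr=0.1):
    """Learn w, b so that w*x + b > 0 for real samples and < 0 for fake ones."""
    w, b = 0.0, 0.0
    data = [(x, 1) for x in real] + [(x, -1) for x in fake]
    for _ in range(epochs):
        for x, label in data:
            if label * (w * x + b) <= 0:   # misclassified: nudge the boundary
                w += lr * label * x
                b += lr * label
    return w, b

def is_real(x, w, b):
    """Classify a feature value as real (True) or generated (False)."""
    return w * x + b > 0

# Made-up "feature" values: real clips cluster high, generated clips low
real_samples = [0.8, 0.9, 1.0]
fake_samples = [0.1, 0.2, 0.3]
w, b = train_discriminator(real_samples, fake_samples)
```

In a GAN-style setup the generator is then trained to fool exactly this kind of classifier.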
I know@@jmg9509
I always wonder why the requirements are never listed first... xD (specs, VRAM/RAM requirements)
The Chinese is insane; it always ends up sounding even better than the original voice, lol
I don't know where to go without you. You don't know how important you are in my life. Saved for later as usual.
Crazy stuff! I'm glad i found this channel.
I'm really excited about this. I'd love to have a go !
This is very impressive.
😃
that was an awesome tutorial, very didactic, congratulations.
Awesome, it sounds good, thanks for the guide to set it up. Nvm, it sounds just alright; it has a lot of hallucinations.
Awesome, but please mention stuff like needing a CUDA supported GPU earlier in the video. Followed all steps up until I realized I couldn't use it :')
Same XD
same
@@neuron8950 same lol amd sucks
I couldn't stop laughing at the sudden switch from normal to sad and then to angry, LMAO.
Damn. I work extensively with Eleven Labs but this is actually showing some advances. Especially the emotional side of things.
There was a promise about updates with emotions, right? So far, nothing.
With ElevenLabs we need to try workarounds like:
(And she says with great sadness) or something like (She says with great anger)
then insert the text.
The context helps; this uses more characters, but in some tests it was worth it for me.
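The workaround above amounts to prepending a stage direction to the text so the model infers the delivery; as a trivial, purely illustrative sketch (not an ElevenLabs API, just string formatting):

```python
def with_emotion(text, emotion, speaker="she"):
    """Prepend a narrative cue so a TTS model infers the intended delivery."""
    return f"({speaker.capitalize()} says with great {emotion}) {text}"

line = with_emotion("I can't believe you're leaving.", "sadness")
```

The cue costs extra characters on metered services, which is the trade-off the comment mentions.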
really awesome tutorial!
thanks for the steps wise explanation great with complete info
you're welcome!
The best part is we can use the existing XTTS set of tools to modify our own voices and create the emotional samples, for the existing voices.
thanks for sharing!
How, though? I don't know coding. I'd be very interested if you made a YouTube video on this very topic.
I was also very surprised with how good this works... Thanks!
Yes, theft and fraud are a button click away, justified with a shrug of your smug shoulders.
It took quite a while for people to find this.
This is the best AI text-to-speech program I've ever seen. Thanks, AI Search... 😍😍😍😍😍
You are welcome!
Is it free, @@theAIsearch?
@@angelbeatsenpai_manhwa yes
Thanks, I got it working and I'm a smooth brain.
You're welcome!
This has got to be the best explanation and breaking down of an objectively nightmarishly complicated setup anywhere. Congrats!
You left NO stone unturned. "Oh no Python? Let me take you to the page where to get it, run the setup with you, and show you the gotchas and workarounds before we go on to the next step". Absolutely brilliant. Most other "step-by-step" guides pull out a black box and point at how some magic happens there and good luck figuring it out lol.
I'll also note that it must have taken you forever and a day to get ready for this, write the script/steps, collect all the links, files, test it, narrate the entire thing, edit it, and publish it. Your Wondershare Filmora sponsor got their money's worth, and then some.
Now.... why in the world hasn't someone taken all this stuff, and made a nice Windows app that installs with a single click? 😁
I approve of the Hitchhiker's Guide reference.
Looks great, but the only thing I wanted to know was the inference speed without processing the reference. What would the potential for real-time be if the reference voice were not processed as part of the inference?
inference is quite fast. there's a good chance someone might make a realtime variant of this
I haven't looked at it yet, but it shows a spectrogram of the clip¹, so it's possible/probable that it generates the entire clip in one go, i.e. it works on every part of the clip at the same time. If that's the case, it could probably create a 20-second clip in, e.g., 15 seconds, but you would still have to wait 15 seconds before you can hear any of it. I may be wrong, though.
¹ Some text-to-audio systems generate an image of the spectrogram and then convert the spectrogram to an audio file. The spectrogram is a representation of the audio where time is on the x-axis, frequency is on the y-axis, and amplitude is the intensity/color of each pixel.
Thumbnail-kun goin cray
🤪
XD waoo,list for my open free project 😅
This is very good at cloning voice wave files nice one.👍😁💯
OMG this is 🔥thank you!
No problem!!
This is outstanding
I will try this
COOL bro, please I want Spanish TTS and cloning
thanks for sharing your skill with us.
This is amazing, thanks
Thank you for the details
This version of the tool is astonishing! It is exactly what I have been looking for.Thank you!
There was some any-language-to-any-language AI voice tool too; does anyone remember? You could just feed it a voice in any language and it would learn from it, and after that it could generate speech in any language. I believe it was possible to make it sing too. It even creates a TTS voice file, I believe, so you can use that file with any text-to-speech engine.
Man , You're a legend 🙌
Thank you for your efforts ❤
Great video Very helpful.Thanks for sharing
You are welcome!
Now reading visual novels feels cinematic, thanks for the suggestion.
It's really cool, but I need it to be able to blend multiple voices together to create a new, original one. Just copying other people's voices is not really ethical when using them for commercial purposes.
RVC can do this. I downloaded it via this channel. If you go to this guy's videos and sort by most popular, it's one of the most-watched.
So what do you think?
What's the ETA for this to be added to Sillytavern?
should be very soon. open source community builds fast!
Crazy!
I'm interested in the cross-language options, and generally how it handles non-English languages. EDIT: just reached the end, so it's Chinese and English support at the moment.
All in all, thanks for the upload; definitely checking this out!
Thank you! The installation part of this tutorial is about 15 minutes long...? Is there a way regular people can install this software? :)
This is insane, bro. I used to train a model for hours to get something near this level.
This is dope 👏
This needs more languages
0:22 Why does "Bob" sound like Vedal 💀
Fr 💀
That's awesome!
This is impressive. No wonder the voice actors have problems with this software
I remember a product called Lyrebird that vanished from existence 😅 It did voice cloning almost a decade ago.
It did, and it was fun!!!
You can find absolutely hilarious examples over on The Lost Narrator's YouTube channel.
Yeah, they're My Little Pony voice examples from fan actresses, but I'd say they're some of the best clips I've found.
@@TrentonMatthews I remember someone showing me a website with My Little Pony voice clones so many years back; I completely forgot that existed 😅
@@TrentonMatthews The first video I randomly clicked on was titled "Applejack Tells the Truth" 💀 Was not expecting that.
Yep, and does anyone remember Adobe VoCo? It could do cloning as well as emotions, and it was very realistic for 2016. I bet big tech already has very advanced stuff in their labs.
Good video. Question: where did you get the original emotive voice samples (angry, sad, excited, etc.)?
The Chinese ones sound like native speakers. This is a really powerful tool.
Thank you so much as always your tutorials are very helpful and insightful. I hope to use this to translate and dub the new Dragon Ball series.
good luck!
voice actors exiting the chat
39:39 - for the podcast mode, i wonder if they will add in the feature for you to provide emphasis, i.e. the moods from before.
Your knowledge is awesome. What was your profession before starting this fabulous channel? 🤔
Yes, sometimes it hallucinates once the text to generate is in. But then one should adjust the speed and the cross-fade duration and synthesize again.
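For anyone curious what the cross-fade setting actually does when stitched chunks are joined, a minimal sketch on raw sample lists (illustrative only; the app's internals may differ):

```python
def crossfade(a, b, overlap):
    """Join two sample lists, linearly fading a out and b in over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    out = a[:len(a) - overlap]              # untouched head of the first clip
    for i in range(overlap):
        t = i / overlap                     # ramps 0 -> 1 across the overlap region
        out.append(a[len(a) - overlap + i] * (1 - t) + b[i] * t)
    out.extend(b[overlap:])                 # untouched tail of the second clip
    return out
```

A longer overlap hides seams better but eats more of each chunk, which is why the knob is worth tweaking when output glitches.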
Thank you for the tutorial. Just wanted to let you know I managed to get it working on a 2060 laptop (6 GB VRAM), and it works fast as well. Also, I wasn't sure if PyTorch had the latest CUDA, but it works with cu118.
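For anyone reproducing this, the CUDA 11.8 ("cu118") PyTorch build the comment mentions can be pulled from the official wheel index; the command follows pytorch.org's install selector, and the exact package list F5-TTS needs is an assumption:

```shell
# PyTorch built against CUDA 11.8; works even if your driver supports a newer CUDA
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
```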
Thank you for the effort in explaining this topic, but the video is too long, with a lot of unnecessary examples. The point was clear early on, so trimming the extras and making it more concise would really improve both the content and the viewing experience.
Hope you see this feedback ;)
Best thumbnail! Sauce?
At 38 min it feels like the movie Inside Out :D
So good sir 😊❤
The software seems great! Do you know whether or not it can handle subtitle formats, with timestamps declaring exactly when something is spoken, or have you stumbled upon any other text-to-speech tool that can do that in your research so far? A reply would be much appreciated :)
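As far as I know F5-TTS doesn't advertise subtitle support, but parsing SRT timestamps to drive any TTS tool yourself is straightforward. A sketch that handles single-line cues (multi-line subtitle blocks would need a slightly smarter regex):

```python
import re

def srt_time_to_seconds(stamp):
    """Convert an SRT timestamp like '00:01:02,500' to seconds."""
    h, m, s_ms = stamp.split(":")
    s, ms = s_ms.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return (start_seconds, end_seconds, line) tuples from SRT-formatted text."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n(.+)")
    return [(srt_time_to_seconds(a), srt_time_to_seconds(b), line.strip())
            for a, b, line in pattern.findall(text)]
```

With the cue times in hand, you could synthesize each line and place it on a timeline at its start offset.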
Help me, I have CUDA version 6.1. Is it compatible? My laptop uses an Nvidia MX150.
Feel like a 7/10.
6/10 maybe. Still more robotic than CoquiTTS
What GPU are you running? Your 30 seconds is around 5000 for me. I tried on huggingface with about the same results. Replicate was at about the speeds you were getting.
It doesn't sound like a human at all, but it really nailed those emotions, and I can see it overtaking ElevenLabs if they keep developing it.
Love your video! So cozy to listen to your voice :). I was wondering if you tried with your own voice? If yes did it work? 😊
thanks! i haven't actually tried it w my voice, but good idea!
Well, you can 100% make his voice read you books now lol
"Insane AI TTS with Emotions!"
*proceeds to play the most monotone TTS voice I heard* 🤣🤣
ngl i clicked because of the thumbnail, not the tutorial, i'm cooked
Good instructions!
Can F5-TTS possibly be run without CUDA, just on the CPU?
Yes, it can; it's just much, much slower. Skip the Torch GPU choice and install the generic Torch libraries instead.
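If it helps anyone, the CPU-only PyTorch wheels live on the official index; the command follows pytorch.org's install selector, and the package list is an assumption about what F5-TTS needs:

```shell
# CPU-only PyTorch builds; no CUDA toolkit required, inference just runs slower
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
```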
18:00 The dependency installation process has changed. Instead of entering 'pip install -r requirements.txt', you'll want to enter 'pip install -e .'
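For context, the newer instructions boil down to an editable package install from the repo root (the repo URL here is assumed; check the project's README for the current steps):

```shell
git clone https://github.com/SWivid/F5-TTS
cd F5-TTS
pip install -e .   # replaces the old 'pip install -r requirements.txt' step
```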
This is great! My all time favorite voices are Morgan Freeman, Peter Thomas (from Forensic Files), and Samuel L Jackson (see: Go the f**k to sleep)
great video
thanks!
This is great, but how many languages can we use with this? Or is it limited to just a few languages?
Great video thank you for the education
you're welcome!