Thanks. I have realized over the last few days that the thing we need (as a start) for many models is simple guides like this to get started.
Would love to see a video on conversation with local agents
What we need is a model that gives precise control over the emotion, intonation, cadence, pacing, volume, timing, and pitch of the voices, not more monotone models.
This would be good for people who want to run something like Alexa locally at home. I know some people have been putting together systems for Home Assistant. While the OpenAI integration might sound slightly better, I'd consider this more than good enough to replace it and not have to send your data to OpenAI.
Yeah, that is how I feel too. It's not the best, but it is damn good.
One potentially cool application of blending would be to blend between voice styles like laughing, crying, angry, etc, based on what's being said (maybe with a small llm) and other things.
hmm, Tiny TTS is definitely an interesting name
Took a bit… 🧿🧿
@@dinoscheidt 😎
I've been waiting for this for so long. Being able to turn any PDF/text file into an audiobook should have been possible long ago.
Jarod Mica’s audiobook maker is pretty good
Very helpful, thanks!
Any chance you could take a look at RealtimeSTT? And maybe put that and Kokoro into a single local conversational AI agent?
Thanks for putting this together 👍🏼👍🏼
interesting, the interpolation part shocked me, thanks
Hope it was useful. I have never seen anyone show that kind of thing, so I thought it would be cool to let people know.
@samwitteveenai I was thinking about writing an equation to create my voice by combining existing voices "Y=ax^2+bx+c" or even train a new model on weights to find the optimal values
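The idea in the comment above can be sketched quickly: treat an existing voice embedding x as input and apply the elementwise quadratic transform Y = a*x^2 + b*x + c, then search over the coefficients for a voice you like. This is a toy sketch only; the random 256-dimensional vector stands in for a real voicepack embedding, and the shape is an assumption, not Kokoro's documented format.

```python
import numpy as np

def polynomial_voice(x: np.ndarray, a: float, b: float, c: float) -> np.ndarray:
    """Elementwise quadratic transform of a voice embedding: Y = a*x^2 + b*x + c."""
    return a * x**2 + b * x + c

# Stand-in for a real style embedding (the 256-dim size is an assumption).
rng = np.random.default_rng(0)
x = rng.standard_normal(256)

# A mostly-linear tweak: small quadratic term, strong copy of the original.
y = polynomial_voice(x, 0.1, 0.9, 0.0)
```

You could then grid-search or gradient-search over (a, b, c), listening to (or scoring) the result for each candidate.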
Thanks.
You have given me another reason to buy a Mac mini M4 😉
As awesome as M4 Macs are at AI, this model is crazy lightweight, and even my laptop with a 10th gen intel processor is able to run this TTS in real time.
Not to discourage you from getting the Mac mini. For LLMs and image generation, it will serve you well. But this is a super low footprint model to run unless you're wanting it to process multiple simultaneous TTS outputs.
Great small fast model❤
Unfortunately there is no voice cloning function yet
thanks a lot for this wonderful video
Thanks for making this video.
Would love to see you host the whole project locally and use it.
Hey, thank you so much for the tutorial. I almost completely set myself on fire doing that, as I'm a complete noob at any sort of coding, and it was a real pain to go through, but eventually I made it work. I have a question: how do I change the voice?
Never mind, I see: available voices are af, af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis
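For anyone else wondering how to switch voices, a minimal sketch is below. The voice names come from the list above; the "voices/&lt;name&gt;.pt" path layout and the fallback behavior are assumptions for illustration, not a documented Kokoro API.

```python
# Hypothetical helper for picking a voicepack file by name.
AVAILABLE_VOICES = [
    "af", "af_bella", "af_nicole", "af_sarah", "af_sky",
    "am_adam", "am_michael",
    "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
]

def voicepack_path(name: str, default: str = "af") -> str:
    """Return the relative path of a voicepack file, falling back to the default
    voice when the requested name is unknown."""
    chosen = name if name in AVAILABLE_VOICES else default
    return f"voices/{chosen}.pt"
```

You would then load the returned file (e.g. with `torch.load`) and pass the embedding to the generation call, assuming that is how your local setup stores voicepacks.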
Were there any instructions on how to train voicepacks?
No I don’t think they have made any
You won't be able to for quite some time. The creator is not completely done creating the model yet. I think it will have that at some point, though.
Sky is back! Wooohooo!!! ❤❤❤❤
Interested to start hand editing a voice pack and see how it affects the results.. or applying simple transformation functions to the embeddings.. would even small changes turn the result into pure noise?
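One cheap way to probe the question above is to add small random noise to an embedding and check how far it moves. The sketch below uses a random 256-dimensional vector as a stand-in for a real voicepack embedding (the true shape and meaning of the embedding are assumptions here); in practice you would perturb the loaded voicepack tensor and listen to the generated audio at each scale.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a real style embedding loaded from a voicepack file.
embedding = rng.standard_normal(256).astype(np.float32)

def perturb(emb: np.ndarray, scale: float) -> np.ndarray:
    """Add Gaussian noise of a given scale to an embedding."""
    noise = rng.standard_normal(emb.shape).astype(np.float32)
    return emb + scale * noise

nudged = perturb(embedding, 0.01)
# Relative change: small scales should barely move the vector.
rel_change = float(np.linalg.norm(nudged - embedding) / np.linalg.norm(embedding))
```

Sweeping `scale` upward (0.01, 0.05, 0.1, ...) and listening at each step would show where the voice degrades into noise versus where it just shifts character.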
Sam is such a legend.
Is it possible to train your own model for a language other than US English from scratch?
Yes, or you could fine-tune this for another language, but you would need some training code as well, which currently isn't in the repo.
Wonderful video. The ipynb code on Hugging Face worked, but the linked Colab stops on a deprecation error.
Really waiting for Japanese!
How do you do cloning with any sample voice?
You can't with this model yet, and maybe not for quite some time.
I want to clone a specific voice, but it seems hard to do with this.
Installing it using the uv instructions is a nightmare. I am not that versed in coding. Please kindly do a video on how to install it on a Windows system. Thanks
I would like to see the ability to design a custom voice based on a prompt, something similar to what ElevenLabs has with Voice Design. That is the real breakthrough.
Very interesting. Can we use it as a PDF reader where it reads in real time, not after processing the whole text?
You would probably process a sentence or a line at a time (maybe even a paragraph to help it with prosody), but it should be possible.
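The sentence-at-a-time approach above can be sketched with plain text splitting: chunk the document into sentence-sized pieces and feed each chunk to the TTS call as it becomes ready, so playback can start before the whole file is processed. The regex-based splitter below is a simple assumption, not a proper sentence tokenizer, and the TTS call itself is left out.

```python
import re

def sentence_chunks(text: str, max_chars: int = 200):
    """Yield sentence-sized chunks of text so a TTS engine can start
    speaking before the whole document has been processed."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunk = ""
    for s in sentences:
        # Start a new chunk once adding the next sentence would exceed the budget.
        if chunk and len(chunk) + len(s) + 1 > max_chars:
            yield chunk
            chunk = s
        else:
            chunk = f"{chunk} {s}".strip()
    if chunk:
        yield chunk
```

In a real reader you would extract each page's text (e.g. with a PDF library), run `sentence_chunks` over it, and queue the generated audio per chunk.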
What I‘d use it for? Voice Chat, based on aya-expanse.
Thanks for the video! Is it possible to get another language into Kokoro, like Dutch?
You would most likely need to fine-tune the model with some Dutch audio recordings etc. I don't think they are supporting this yet.
Sam, I can't access the shortened URL links. I can't name this URL shortener in my comment, but you know which one you are using. It either times out or is unreachable. Is anyone else bothered by this issue?
Weird, in my stats I can see there are thousands of people opening them. Could it be your location?
I tried text to speech on my laptop; generation speed is slow, and I need to wait about 30s to hear the sound.
Please help, how can we deploy and run it on Windows?
Can anyone tell me how large it is?
Like, how much data do I need to download it?
And how much VRAM?
Interesting -- definitely is fast for the quality
As soon as you want real-time voice assistants, STT -> LLM -> TTS is outdated, and we need better multimodal (Omni-like) open-weights or open-source models.
7:10 Is blending voices something new? I hadn't come across that yet, but it is something I imagined and always wanted. We can't just use the voices of real people.
It is just interpolation between embeddings. It is used in GANs a fair bit. I haven't seen it used for voices like this, but I figured I would show people how to play with it and try it out.
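The interpolation mentioned above is just a weighted average of two style embeddings. A minimal sketch, where random 256-dimensional vectors stand in for two real voicepacks (the dimension is an assumption):

```python
import numpy as np

def blend(v1: np.ndarray, v2: np.ndarray, alpha: float) -> np.ndarray:
    """Linear interpolation between two voice embeddings:
    alpha=0.0 returns v1, alpha=1.0 returns v2, values in between mix them."""
    return (1.0 - alpha) * v1 + alpha * v2

rng = np.random.default_rng(7)
voice_a = rng.standard_normal(256)  # stand-in for one voicepack embedding
voice_b = rng.standard_normal(256)  # stand-in for another
half_half = blend(voice_a, voice_b, 0.5)
```

The blended vector is then passed to the generation call exactly as a single voicepack would be; this is the same trick as latent interpolation in GANs.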
Transformers js version coming soon from Xenova 👀
Is there a defined context length it can parse and process at a time? I want to test it out for large text sources.
I don't know, but I just generated a 25-minute-long audio file; it took 5-10 minutes to generate.
@@finbenton This is useful info, thanks for letting me know.
What are the minimum hardware requirements for real time? Raspberry Pi 4/5?
Good question. I'm not sure if this would actually work on a Raspberry Pi or not.
Is it possible to fade from one voice to another voice? Could help to find great voices. (With values in terminal)
Good question. Unfortunately, it's not really possible to fade between them, because you need to put the full embedding in at generation time and you can only put one in.
@@samwitteveenai OK, so I should iterate word by word from 0.0 to 1.0 for both of the values. 😆 Why not? At least generate the same sentence multiple times to compare.
It stopped working for no reason; now I can't use it ((( It just doesn't generate the WAV...
How do I load the Spanish voices?
How is it for long text?
Colab link is not working
Tried this code and... welp... nothing. Just my luck.
The Colab should work fine. For the local one, what are you trying to run it on?
NEVER GIVE UP
Hi Sam. I need help to build an AI model. Can you help me or suggest someone who can?
I have used Google's TTS APIs quite extensively, and I do not understand how Kokoro can match or even beat Google's best TTS models using 100 hours of training data. Google must have access to millions if not billions of hours of speech data along with their vast resources. What is going on here?!
It's pretty neat, isn't it? What a time to be alive
I think for Google it's not a technical issue; it's about the blowback from people using those voices to impersonate humans etc.
Good, thanks 🇧🇷
I did a double take when I first saw the title of this video 😅
Do you know how to add a new language, like Indonesian?
To get a good result, you would probably need to mix some real Bahasa audio into the training mix, or fine-tune it later. You might be able to do something with a phoneme dictionary, but you really need some example audio.
@@samwitteveenai Is there a step-by-step tutorial on this?
Yes, adding a new language is what I would be also interested in...
Please enlighten us if you have any clue. 😊
I would appreciate a fine tuning tutorial for a custom voice in any language
there is no tutorial that I know of currently
Nice, but XTTS-v2 is still my favorite; it has a lot of non-English language models.
af_nicole most ASMR voice
edge tts
Is it better than Piper TTS? Piper is sooooo fast and decent.
This is what I want to know!
@@devon9374 +
English only.
I played with the TTS space. I would say they all sound awful.
what do you use? what would you recommend? - I don't know any better.
@ I use OpenAI TTS. It sounds awesome to me. Haven't found a good local TTS yet.
@@tonkyboy8920 Do you use it for free or does it cost money?