Kokoro Local TTS + Custom Voices

  • Published: 31 Jan 2025

Comments • 97

  • @KrullMaestaren
    @KrullMaestaren 4 days ago +1

    Thanks. I have realized over the last few days that what we need (as a start) for many models is simple guides like this to get started.

  • @mageshyt2550
    @mageshyt2550 17 days ago +11

    Would love to see a video on conversations with local agents

  • @jmg9509
    @jmg9509 14 days ago +8

    What we need is a model that gives precise control over the emotion, intonation, cadence, pacing, volume, timing and pitch of the voices, not more monotone models.

  • @pin65371
    @pin65371 17 days ago +11

    This would be good for people who want to run something like Alexa locally at home. I know some people have been putting together systems for Home Assistant. While the OpenAI integration might sound slightly better, I'd consider this more than good enough to replace that and not have to send your data to OpenAI.

    • @samwitteveenai
      @samwitteveenai  17 days ago +2

      Yeah, that is how I feel too. It’s not the best, but it is damn good.

  • @CapsAdmin
    @CapsAdmin 16 days ago +3

    One potentially cool application of blending would be to blend between voice styles like laughing, crying, angry, etc, based on what's being said (maybe with a small llm) and other things.

  • @andherium
    @andherium 17 days ago +49

    hmm Tiny TTs is definitely an interesting name

    • @dinoscheidt
      @dinoscheidt 17 days ago +6

      Took a bit… 🧿🧿

    • @TeLocd
      @TeLocd 9 days ago

      @dinoscheidt 😎

  • @kevin.malone
    @kevin.malone 15 days ago +1

    I've been waiting for this for so long. Being able to turn any PDF/text file into an audio book should have been possible so long ago.

    • @cassusgames
      @cassusgames 15 days ago

      Jarod Mica’s audiobook maker is pretty good

  • @MojaveHigh
    @MojaveHigh 17 days ago +2

    Very helpful, thanks!
    Any chance you could take a look at RealtimeSTT? And maybe put that and Kokoro into a single local conversational AI agent?

  • @Xopher30
    @Xopher30 16 days ago

    Thanks for putting this together 👍🏼👍🏼

  • @sajjaddehghani8735
    @sajjaddehghani8735 16 days ago +1

    interesting, the interpolation part shocked me, thanks

    • @samwitteveenai
      @samwitteveenai  16 days ago

      Hope it was useful. I have never seen anyone show that kind of thing, so I thought it would be cool to let people know.

    • @sajjaddehghani8735
      @sajjaddehghani8735 15 days ago

      @samwitteveenai I was thinking about writing an equation to create my voice by combining existing voices, something like "Y=ax^2+bx+c", or even training a new model on the weights to find the optimal values

  • @khangvutien2538
    @khangvutien2538 16 days ago +1

    Thanks.
    You have given me another reason to buy a Mac mini M4 😉

    • @kevin.malone
      @kevin.malone 15 days ago +1

      As awesome as M4 Macs are at AI, this model is crazy lightweight, and even my laptop with a 10th gen Intel processor is able to run this TTS in real time.

    • @kevin.malone
      @kevin.malone 15 days ago +1

      Not to discourage you from getting the Mac mini. For LLMs and image generation, it will serve you well. But this is a super low footprint model to run unless you're wanting it to process multiple simultaneous TTS outputs.

  • @TheMiczu
    @TheMiczu 16 days ago

    Great small fast model❤

  • @nanowander
    @nanowander 11 days ago +3

    Unfortunately no voice cloning function yet

  • @user-nbfkxngjmyb
    @user-nbfkxngjmyb 14 days ago

    thanks a lot for this wonderful video

  • @lovol2
    @lovol2 17 days ago

    Thanks for making this video.

  • @Zyphorix7
    @Zyphorix7 12 days ago

    Would love to see you host the whole project locally and use it.

  • @d1sstr4ck
    @d1sstr4ck 12 days ago

    Hey, thank you so much for the tutorial. I almost set myself on fire doing that, as I'm a complete noob in any sort of coding, and it was a real pain to go through, but eventually I made it work. I have a question: how do I change the voice?

    • @d1sstr4ck
      @d1sstr4ck 12 days ago

      Never mind, I see. Available voices are af, af_bella, af_nicole, af_sarah, af_sky, am_adam, am_michael, bf_emma, bf_isabella, bm_george, bm_lewis

  • @djstraylight
    @djstraylight 17 days ago +8

    Were there any instructions on how to train voicepacks?

    • @samwitteveenai
      @samwitteveenai  17 days ago +2

      No, I don’t think they have made any

    • @AceOnlineMath
      @AceOnlineMath 8 days ago +1

      You won't be able to for quite some time. The creator is not completely done creating the model yet. I think it will have that at some point, though.

  • @MeinDeutschkurs
    @MeinDeutschkurs 17 days ago +2

    Sky is back! Wooohooo!!! ❤❤❤❤

  • @ugotworms
    @ugotworms 15 days ago

    Interested to start hand editing a voice pack and see how it affects the results.. or applying simple transformation functions to the embeddings.. would even small changes turn the result into pure noise?

  • @nexuslux
    @nexuslux 16 days ago +1

    Sam is such a legend.

  • @helloworld7796
    @helloworld7796 17 days ago +4

    Is it possible to train your own model from scratch for a language other than US English?

    • @samwitteveenai
      @samwitteveenai  17 days ago +1

      Yes, or you could fine-tune this to another language, but you would need some training code as well, which currently isn’t in the repo

  • @onoff5604
    @onoff5604 9 days ago

    Wonderful video. The ipynb code on Hugging Face worked, but the linked Colab stops on a deprecation error

  • @doppel33
    @doppel33 16 days ago +3

    Really waiting for Japanese!

  • @SaiTeja-go6lw
    @SaiTeja-go6lw 16 days ago +3

    How do I do cloning with any sample voice?

    • @AceOnlineMath
      @AceOnlineMath 8 days ago

      You can't with this model yet, and maybe not for quite some time

  • @so_annoying
    @so_annoying 4 days ago

    I want to clone a specific voice, but it seems to be hard to do with this.

  • @LordPerujoy
    @LordPerujoy 15 days ago

    Installing it using the uv instructions is a nightmare. I am not that well versed in coding. Please kindly do a video on how to install it on a Windows system. Thanks

  • @adityakaul8065
    @adityakaul8065 15 days ago

    I would like to see the ability to design a custom voice based on a prompt, something similar to what ElevenLabs has with voice design. That is the real breakthrough.

  • @moundercesar3102
    @moundercesar3102 17 days ago

    Very interesting. Can we use it as a PDF reader where it reads in real time and not after processing the whole text?

    • @samwitteveenai
      @samwitteveenai  17 days ago +2

      You would probably process a sentence or a line at a time (maybe even a paragraph, to help it with prosody), but it should be possible
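
The sentence-at-a-time idea in the reply above can be sketched in Python roughly as follows. This is only an illustration: `split_sentences` and `speak_incrementally` are made-up helper names, the splitting rule is a naive regex, and `synthesize` is a hypothetical stand-in for whatever TTS call you use, not part of Kokoro's API.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def speak_incrementally(
    text: str, synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield audio one sentence at a time so playback can start early.

    `synthesize` is a hypothetical placeholder for the real TTS call.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

Because the function is a generator, the first chunk of audio is available as soon as the first sentence is done. Passing whole paragraphs instead of sentences trades latency for more context, which may help prosody as the reply suggests.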

  • @MeinDeutschkurs
    @MeinDeutschkurs 17 days ago

    What I’d use it for? Voice Chat, based on aya-expanse.

  • @edwesterik717
    @edwesterik717 16 days ago

    Thanks for the video! Is it possible to get another language into Kokoro, like Dutch?

    • @samwitteveenai
      @samwitteveenai  16 days ago

      You would most likely need to fine-tune the model with some Dutch audio recordings etc. I don't think they are supporting this yet.

  • @altmediamedia9654
    @altmediamedia9654 17 days ago +1

    Sam, I can't access the shortened URL links. I can't name this website shortener in my comment, but you know which one you are using. It either times out or is unreachable. Anyone else bothered with this issue?

    • @samwitteveenai
      @samwitteveenai  16 days ago

      Weird. From my stats I know there are thousands of people opening them. Could it be your location?

  • @JoeyXie
    @JoeyXie 15 days ago

    I tried text to speech on my laptop. Generation is slow; I need to wait about 30s to hear the sound.

  • @XITIJTHOOL
    @XITIJTHOOL 8 days ago

    Please help. How can we deploy and run it on Windows?

  • @soulbreacher1410
    @soulbreacher1410 13 days ago

    Can anyone tell me how large it is?
    Like, how much data do I need to download?
    And how much VRAM does it need?

  • @Quantum_Nebula
    @Quantum_Nebula 17 days ago

    Interesting -- definitely is fast for the quality

  • @JanBadertscher
    @JanBadertscher 12 days ago +1

    As soon as you want to have real time voice assistants, STT->LLM->TTS is outdated and we need better Multi Modal (Omni like) open weights or open source models.

  • @NoidoDev
    @NoidoDev 16 days ago

    7:10 Is blending voices something new? Didn't come across that yet. But this is something I imagined and always wanted. We can't just use voices of real people.

    • @samwitteveenai
      @samwitteveenai  16 days ago +1

      It is just interpolation between embeddings. It is used in GANs a fair bit, I haven't seen it used in voices like this but I figured I would show people how to play with it and try it out
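
As a rough illustration of the interpolation between embeddings described in the reply above, here is a minimal Python sketch. Plain lists of floats stand in for the voice embeddings, and `blend_voices` is a made-up helper name, not a Kokoro function; with tensor embeddings the same weighted sum should apply elementwise.

```python
from typing import List, Sequence


def blend_voices(
    voice_a: Sequence[float], voice_b: Sequence[float], weight: float
) -> List[float]:
    """Linearly interpolate two equal-length voice embeddings.

    weight=0.0 returns voice_a unchanged, weight=1.0 returns voice_b.
    """
    if len(voice_a) != len(voice_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    return [(1.0 - weight) * a + weight * b for a, b in zip(voice_a, voice_b)]
```

For example, `blend_voices(emb_bella, emb_sarah, 0.3)` would give a voice that is 70% the first embedding and 30% the second, where `emb_bella` and `emb_sarah` are placeholders for embeddings you have already loaded.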

  • @figs3284
    @figs3284 17 days ago +1

    Transformers.js version coming soon from Xenova 👀

  • @VanillaGun
    @VanillaGun 17 days ago

    Is there a defined context length it can parse and process at a time? I want to test it out for large text sources.

    • @finbenton
      @finbenton 17 days ago

      Idk, but I just generated a 25-minute audio file and it took 5-10 minutes to generate.

    • @VanillaGun
      @VanillaGun 16 days ago

      @finbenton This is useful info, thanks for letting me know.
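
For a sense of scale, the numbers reported in this thread (25 minutes of audio in 5 to 10 minutes of generation) can be turned into a speed-up over real time. The helper name `speedup_over_realtime` is made up for this sketch, and the figures are just the ones quoted above, not a benchmark.

```python
def speedup_over_realtime(audio_minutes: float, generation_minutes: float) -> float:
    """Minutes of audio produced per minute of compute.

    Values above 1.0 mean generation is faster than playback.
    """
    if generation_minutes <= 0:
        raise ValueError("generation time must be positive")
    return audio_minutes / generation_minutes


# Figures quoted in the thread: 25 minutes of audio, 5 to 10 minutes to generate.
slowest = speedup_over_realtime(25, 10)  # 2.5
fastest = speedup_over_realtime(25, 5)   # 5.0
```

On that commenter's machine, then, generation ran at roughly 2.5 to 5 times real time; other hardware will differ.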

  • @NoidoDev
    @NoidoDev 16 days ago

    What are the minimum hardware requirements for real time? Raspi 4/5?!

    • @samwitteveenai
      @samwitteveenai  16 days ago

      Good question. I'm not sure if this would actually work on a Raspberry Pi or not.

  • @MeinDeutschkurs
    @MeinDeutschkurs 17 days ago

    Is it possible to fade from one voice to another voice? Could help to find great voices. (With values in terminal)

    • @samwitteveenai
      @samwitteveenai  17 days ago

      Good question. Unfortunately it’s not really possible to fade between them, because you need to put the full embedding in at generation time and you can only put one in.

    • @MeinDeutschkurs
      @MeinDeutschkurs 17 days ago

      @samwitteveenai, ok, so I should iterate word by word from 0.0 to 1.0 for both of the values. 😆 Why not? At least the same sentence multiple times to compare it.
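
The "same sentence multiple times" comparison suggested here could be sketched in Python like this. Since only one embedding can be supplied per generation, the sweep produces one blended embedding per step, each to be used for a separate run of the same sentence. `sweep_blends` is a made-up helper name and the embeddings are plain lists of floats for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def sweep_blends(
    voice_a: Sequence[float],
    voice_b: Sequence[float],
    steps: int = 5,
) -> Iterator[Tuple[float, List[float]]]:
    """Yield (weight, blended_embedding) pairs with weight going 0.0 -> 1.0."""
    if steps < 2:
        raise ValueError("need at least two steps")
    for i in range(steps):
        weight = i / (steps - 1)
        # Weighted sum: all of voice_a at weight 0.0, all of voice_b at 1.0.
        blended = [(1.0 - weight) * a + weight * b for a, b in zip(voice_a, voice_b)]
        yield weight, blended
```

Each yielded embedding would be passed to a separate generation call with identical text, and the resulting files compared by ear to find a weight you like.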

  • @d1sstr4ck
    @d1sstr4ck 9 days ago

    It stopped working for no reason, now I can't use it ((( it just doesn't generate the wav...

  • @usuarioaleatorio336
    @usuarioaleatorio336 12 days ago

    How do I load the Spanish voices?

  • @nirsarkar
    @nirsarkar 13 days ago

    How is it for long text?

  • @TheAweto
    @TheAweto 15 days ago

    Colab link is not working

  • @soulbreacher1410
    @soulbreacher1410 10 days ago +1

    Tried this code for code and....welp.....nothing...just my luck

    • @samwitteveenai
      @samwitteveenai  10 days ago

      The Colab should work fine. For the local one, what are you trying to run it on?

    • @devon9374
      @devon9374 3 hours ago

      NEVER GIVE UP

  • @MeUnboxing
    @MeUnboxing 16 days ago

    Hi Sam. I need help to build an AI model. Can you help me or suggest someone who can?

  • @finlay422
    @finlay422 16 days ago +2

    I have used Google's TTS APIs quite extensively and I do not understand how Kokoro can match or even beat Google's best TTS models using 100 hours of training data. Google must have access to millions if not billions of hours of speech data along with their vast resources. What is going on here?!

    • @devon9374
      @devon9374 16 days ago +2

      It's pretty neat, isn't it? What a time to be alive

    • @samwitteveenai
      @samwitteveenai  16 days ago +2

      I think for Google, it's not about a technical issue, it's about the blowback of people using those voices to impersonate humans etc.

  • @MaxJM74
    @MaxJM74 12 days ago

    God tks 🇧🇷

  • @TheRemarkableN
    @TheRemarkableN 15 days ago

    I did a double take when I first saw the title of this video 😅

  • @SyamsQbattar
    @SyamsQbattar 17 days ago +1

    Do you know how to add a new language, like Indonesian?

    • @samwitteveenai
      @samwitteveenai  17 days ago

      To get a good result you would probably need to mix some real Bahasa audio into the training mix, or fine-tune it later. You might be able to do something with a phoneme dictionary, but you really need some example audio

    • @SyamsQbattar
      @SyamsQbattar 17 days ago +1

      @samwitteveenai Is there a step-by-step tutorial on this?

    • @miklosprisznyak9102
      @miklosprisznyak9102 17 days ago +1

      Yes, adding a new language is what I would be also interested in...
      Please enlighten us if you have any clue. 😊

    • @Notifest
      @Notifest 17 days ago +3

      I would appreciate a fine tuning tutorial for a custom voice in any language

    • @samwitteveenai
      @samwitteveenai  10 days ago +1

      There is no tutorial that I know of currently.

  • @TABandiTA
    @TABandiTA 12 days ago

    Nice, but xttsv2 is still my favorite; it has a lot of non-English language models.

  • @clray123
    @clray123 7 days ago

    af_nicole most ASMR voice

  • @Otiyyy
    @Otiyyy 16 days ago

    edge tts

  • @concretec0w
    @concretec0w 17 days ago +3

    Is it better than piper-tts? Piper is sooooo fast and decent

    • @devon9374
      @devon9374 16 days ago +1

      This is what I want to know!

    • @d1sstr4ck
      @d1sstr4ck 12 days ago

      @devon9374 +

  • @maglat
    @maglat 14 days ago

    English only.

  • @tonkyboy8920
    @tonkyboy8920 14 days ago

    I played with it on TTS Space. I would say they all sound awful.

    • @d1sstr4ck
      @d1sstr4ck 12 days ago

      What do you use? What would you recommend? I don't know any better.

    • @tonkyboy8920
      @tonkyboy8920 12 days ago

      I use OpenAI TTS. They sound awesome to me. Didn’t find a good local TTS yet.

    • @KDFrosh9734
      @KDFrosh9734 12 days ago

      @tonkyboy8920 Do you use it for free or does it cost money?