Create your AI digital voice clone locally with Piper TTS | Tutorial

  • Published: 26 Sep 2024

Comments • 237

  • @synesthesiam
    @synesthesiam 1 year ago +6

    Guude, Thorsten! Thank you for making this video :) I've added it to the Piper training guide already.
    I had a previous comment, but I guess RUclips didn't like it containing a link.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Guude Mike 👋, you're very welcome, and thanks for adding the video link to the docs 🥰. Feel free to send me the link directly and I can add it to a comment or the video description.

  • @MyHeap
    @MyHeap 1 year ago +8

    Thank you for demonstrating this. Now I admit I will need to watch it a couple more times to get my head around it all, but I certainly appreciate your efforts here. Thank you for making the video and sharing your knowledge.
    Joe

  • @NoDebut
    @NoDebut 10 months ago +2

    Adding my +1 to all the feedback in the comments below - THANK YOU for creating a step-by-step guide for this. I work with computers and am no stranger to Python programming, but I still would have found setting this up and getting it working unwieldy and off-putting. I really appreciate you going through this step by step - I (and I'm sure others) need this kind of help! Also, can I just say that I wasn't even aware this was a *possibility* until I saw this vid? :)

    • @ThorstenMueller
      @ThorstenMueller  10 months ago +2

      Thank you so much for this really kind feedback 🥰. Comments like yours always keep me motivated to make new tutorial videos.

  • @OpinionatedReviewer
    @OpinionatedReviewer 9 months ago +2

    Great video! I'll need to rewatch it a few times to nail down the steps, but your effort in sharing this knowledge is much appreciated. Thanks! 👍

    • @ThorstenMueller
      @ThorstenMueller  9 months ago +1

      Thanks a lot for your nice feedback and good luck 😊.

  • @truszko91
    @truszko91 1 year ago +2

    Why have I never thought to hide the venv dir with a "." HA! Thanks for making this tutorial!

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      You're very welcome 😊. Enjoy your "clean" directory structure.
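      For anyone wondering: a "hidden" venv is just a venv created in a dot directory, e.g. (a minimal sketch, assuming Python 3 with the venv module available):

      python3 -m venv .venv
      source .venv/bin/activate

      ls won't show .venv by default, which keeps the project root tidy.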

  • @juntiamores2418
    @juntiamores2418 5 months ago

    Big thanks to you Thorsten, finally my personal home assistant AI, ToniaBot, has a wonderful voice now. ❤

    • @ThorstenMueller
      @ThorstenMueller  5 months ago

      I'm happy that my videos were helpful for you to finish your personal project 👏😊.

  • @QuokkaBff
    @QuokkaBff 1 year ago

    Hi Thorsten, thank you for all of these. The experience of hearing TTS in my own voice was awesome.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      You are very welcome 😊. And I know absolutely what you mean, and I feel the same way.

  • @pktechentertainment863
    @pktechentertainment863 16 days ago

    Nice, sir, I will try this. Thank you for your tutorial.

  • @whalemonstre
    @whalemonstre 11 months ago

    Thank you Thorsten - useful and enlightening. 🙂🖖

    • @ThorstenMueller
      @ThorstenMueller  11 months ago

      Thank you for your kind feedback :). Happy you find my content useful.

  • @renan7716
    @renan7716 5 months ago

    Great video! I'm learning a lot! Thanks!

  • @cloudsystem3740
    @cloudsystem3740 1 year ago

    thank you very much for the hidden env and tutorial!

  • @TillF
    @TillF 1 year ago +6

    How many hours of voice recordings would you recommend as a minimum for fine-tuning a small/large model to retrieve good results?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Hard to say. Maybe start with one or two hours, give finetuning a try, hear how close you already are, and add more recordings as needed. But in general, more is better. Just keep a high recording quality in mind.

    • @jantester2357
      @jantester2357 10 months ago

      Thanks 🙏 Services like Coqui only need a few seconds of voice to clone it. When I want to train my own model, I need hours of recordings and the result is not so good. What am I missing?

    • @godwinspeaks
      @godwinspeaks 5 months ago

      Piper's use case is mainly being real-time, not being the best in quality. The likes of Coqui, Tortoise etc. don't come close in that aspect. It's a different use case.

    • @RemoteAccessGG
      @RemoteAccessGG 5 months ago

      @jantester2357 Coqui has a really good pretrained model with lots of voices, but Piper's finetune-ready model is not that good, so you need to do almost everything by yourself.

  • @ei23de
    @ei23de 10 months ago +3

    As I'm currently trying this on my own, I realize how much work you must have put into this.
    Thanks a lot Thorsten!
    Also one question here.
    Are you running training on an Nvidia GPU, and is it a real Ubuntu machine or do you use some kind of WSL2 / VM?
    Currently having some CUDA-related problems with WSL2 on Windows 10, haha... CPU is working though.
    Edit:
    I could solve some problems with
    cd /usr/lib/wsl/lib/
    sudo rm -r libcuda.so.1
    sudo rm -r libcuda.so
    sudo ln -s libcuda.so.1.1 libcuda.so.1
    sudo ln -s libcuda.so.1.1 libcuda.so
    An Nvidia GTX 1060 is now working.
    But my RTX 4090 is not.
    I also tried the Docker solution (which is, as you said, not really easier) and that shows in the logs that the RTX 4090 is not supported yet.
    I get this error, which seems WSL-related:
    Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
    So I may set up Ubuntu 20.04, since python3.10-env is deprecated on Debian.
    Also another question comes up:
    If I want to train a whole new model, how many epochs are needed? Usually around 2000-3000? It also depends on the dataset quality, I guess?!
    But first I have to get the 4090 running. 2000 epochs would take way too long on the GTX 1060.

    • @ei23de
      @ei23de 9 months ago +1

      Update: the 4090 has some bugs with CUDA 11.7 - we found a workaround (see the piper issues).

    • @ThorstenMueller
      @ThorstenMueller  9 months ago +1

      You're welcome 😊. And yes, recording a voice dataset takes way more effort than you might think at first. I trained my model on an NVIDIA Jetson Xavier AGX device and training took around 6 weeks, 24/7. Thanks for sharing your CUDA solution 👏.

  • @JUKEBOXANIMATION
    @JUKEBOXANIMATION 1 year ago

    I guess this is great, and thank you for your effort, but I don't understand how you get to where you are at the beginning. Should I open some command prompt window and just type whatever you type the same way? It would be even better doing some "don't think, do this + this + this" for people who understand nothing about how it works :)

  • @SUBATOMICRAY
    @SUBATOMICRAY 14 days ago

    I am using the "WSL - Windows Only" instructions and I am stuck on training; it crashes with "ModuleNotFoundError: No module named 'piper_train.vits.monotonic_align.monotonic_align'".
    I can't figure out why I am missing the module or how to install it.

    • @ThorstenMueller
      @ThorstenMueller  8 days ago

      When you run "pip list" do you see a piper package?
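      For example, something like:

      pip list | grep -i piper

      If piper-train shows up but the module is still missing, the monotonic_align extension probably hasn't been built yet (see the build_monotonic_align.sh script in the piper repo).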

  • @rokifromhk
    @rokifromhk 26 days ago

    ERROR: Could not find a version that satisfies the requirement piper-phonemize~=1.1.0 (from piper-train) (from versions: none)
    Why did this error come up?

    • @ThorstenMueller
      @ThorstenMueller  22 days ago

      Does "pip install piper-phonemize==1.1.0" work?

  • @domesticatedviking
    @domesticatedviking 5 months ago

    This was great. It would be useful if you could recommend where to get an appropriate model for English speakers to use in place of the German model.

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      Thanks 😊. Do you know Piper's Hugging Face model location with lots of models for many languages?

  • @ComatoseMN
    @ComatoseMN 3 months ago

    I'm not shocked, because literally nothing ever just works for me. However, after trying to complete the step of installing the requirements.txt, I got this:
    "ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-python >=3.7,

    • @ThorstenMueller
      @ThorstenMueller  3 months ago

      First of all, thanks for your nice feedback 😊. Did you get it working already? Which Python version are you using? Are you using a Python venv?

  • @MicroblastGames
    @MicroblastGames 1 month ago

    I have a voice in .pth... how can I use it in Piper? How can I convert it to create the .onnx + .json files?

    • @ThorstenMueller
      @ThorstenMueller  1 month ago

      Is your .pth file from a Coqui TTS training? If so, you can't (IMHO) convert it to the Piper model structure (onnx).

  • @tesfa2586
    @tesfa2586 8 months ago

    Can't have enough of your superb explanation. Just one more step, if you don't mind: for every piper run, it is loading the model. Is there a way we can load the model in memory and generate audio via an API, like torch-serve does for PyTorch models?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      Thanks for your nice feedback 😊. I've added your question about one-time model loading and synthesizing multiple times to my questions for my interview with Michael.
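      Until then, one workaround is that a single piper process can synthesize several lines from stdin, so the model is only loaded once. A rough sketch, assuming the --output_dir option writes one wav per input line (please check "piper --help" for the exact flag name):

      printf '%s\n' 'First sentence.' 'Second sentence.' | \
        piper --model /path/to/model.onnx --output_dir /tmp/tts_out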

  • @razvanab
    @razvanab 2 months ago

    Hello, Sir. Thank you so much for all of these wonderful tutorials. Since Kaggle has recently improved its free tier, could you create a tutorial explaining how to make a notebook on Kaggle for Piper's voice training? Thank you again.

    • @ThorstenMueller
      @ThorstenMueller  2 months ago

      Thanks for your topic suggestion 😊. I've added a "Kaggle Piper TTS model training" tutorial on my TODO list.

    • @razvanab
      @razvanab 2 months ago

      @@ThorstenMueller Thank you, sir.

  • @012_siddhantprasad9
    @012_siddhantprasad9 18 days ago

    Which model would be best for adding emotions like anger/happiness/sadness etc. to the speech? Coqui, or does Piper support this functionality?

    • @ThorstenMueller
      @ThorstenMueller  17 days ago

      IMHO no. Piper has SSML on their roadmap to adjust output in some ways, but it's not really meant for adding emotions. You can record your own emotional dataset to train on, but this is probably not what you are looking for. Maybe try an emotional prompt on Parler TTS by Hugging Face.

    • @012_siddhantprasad9
      @012_siddhantprasad9 16 days ago

      @@ThorstenMueller I tried Parler TTS, no emotions in the voice 🥲

  • @UeujkmVeujkm
    @UeujkmVeujkm 2 months ago

    Cool video

  • @evertonlimaaleixo1084
    @evertonlimaaleixo1084 8 months ago

    Very cool!
    How many minutes of audio did you use to build the dataset?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago +1

      I used my Thorsten-Voice Dataset 2022.10 for this training. This dataset contains over 12k wave files with a total duration of over 11 hours.

  • @giovannisardisco4541
    @giovannisardisco4541 3 months ago

    The video is interesting, but it was hard to follow because the text and characters were too small to see clearly on a smartphone 😢

    • @ThorstenMueller
      @ThorstenMueller  2 months ago +1

      Thanks for your helpful feedback 😊. In later videos I increased the font size in the terminal. Hope it is easier to read in newer videos.

  • @RedDread_
    @RedDread_ 7 months ago

    Great video, how much audio is required for a decent training though? I have a few hours, is that enough?

    • @ThorstenMueller
      @ThorstenMueller  7 months ago +1

      Depends on what a few hours means, but in general I'd say this sounds like a good basis for model training 😊.

  • @serqetry
    @serqetry 4 months ago

    Thanks for this video. I was able to fine-tune a model and it sounded good with the test phrase generation, but when I exported it to onnx and used it with piper tts, it just made garbage noise and robotic gibberish. :(

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      That is strange. I did not encounter this problem. Did you already ask on the Piper GitHub community?

    • @serqetry
      @serqetry 4 months ago

      @ThorstenMueller I created an issue on the GitHub. I was able to get it to work though, because I figured out that only the binary version of piper has a problem with the exported models. I switched to using the Python version via pip and my model worked.

  • @BlazingFunk
    @BlazingFunk 9 months ago +1

    I'm missing the part where you show how to record your own voice. Do you simply have to replace the exported WAV files?

    • @ThorstenMueller
      @ThorstenMueller  9 months ago +1

      I created my own video on recording a voice dataset using "Piper-Recording-Studio" 😉. Have you seen it? ruclips.net/video/Z1pptxLT_3I/видео.html

    • @FleaRHCP97
      @FleaRHCP97 8 months ago

      @ThorstenMueller what if you already have a ton of recorded audio of yourself (lectures)?

  • @suhaybmir7773
    @suhaybmir7773 10 months ago +1

    Hello Thorsten. I've enjoyed this tutorial. I'm stuck on the exporting section, where I attempt to run: echo 'This is a test.' | piper -m /home/suhayb/Desktop/RealTraining/piper/out-train/lightning_logs/version_9/checkpoints/suhayb-real-voice_high.onnx --output_file /home/suhayb/Desktop/test.wav
    and I get the message "unknown option -m". Do you know a way to fix or test this? Thank you in advance!

    • @ThorstenMueller
      @ThorstenMueller  10 months ago +1

      Glad you enjoyed my tutorial 😊. What is "piper --help" showing?
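      If -m isn't recognized by your piper build, maybe try the long option name from the Piper README, e.g.:

      echo 'This is a test.' | piper --model /path/to/voice.onnx --output_file /tmp/test.wav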

    • @suhaybmir7773
      @suhaybmir7773 10 months ago

      @ThorstenMueller I managed to figure it out from another video of yours. Downloaded and used the Piper executable file! Thanks again!

  • @tegeztheguy2719
    @tegeztheguy2719 1 year ago

    Hi, I loved the video where you showed how to train your CoquiTTS model on Windows, could you maybe do a video about how to train a vocoder model for the trained model?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thanks for your topic suggestion 😊. I've added it to my (growing) TODO list.

  • @ruthfodor9632
    @ruthfodor9632 2 months ago

    Hi. I'm currently trying to retrain the Romanian model to correct the pronunciation of diacritics (ă,â,î,ș,ț), but keep bumping into a problem: in the generated jsonl dataset, ț is matched to 2 sounds, t and s, instead of one, and I'm not sure what to do. ț kind of sounds like the "ts" in English "cats", but t, s doesn't give the best results in piper. Maybe you can give me some advice? Also, thank you for the video, super helpful.

    • @ThorstenMueller
      @ThorstenMueller  2 months ago

      Thanks for your kind feedback 😊. Is that the "roa" language code? Does it work when using espeak-ng -vroa "some demo text"?

    • @ruthfodor9632
      @ruthfodor9632 2 months ago

      @ThorstenMueller Yes, it is roa, and espeak-ng -vroa works. What I am trying right now is training after replacing it with the correct phoneme "translation" in the dataset.jsonl. Hope it works.

  • @YKSGuy
    @YKSGuy 7 months ago

    Are there any tools for creating a structured training data set from existing voice clips vs recording new ones by reading specific text?

    • @ThorstenMueller
      @ThorstenMueller  7 months ago

      Not sure if there are tools for that, but the widely supported LJSpeech syntax is basically a CSV file with the filename and spoken text split by a pipe | character. This shouldn't be too complex to create.
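      A minimal metadata.csv sketch (file ids without the .wav extension, one clip per line):

      audio_0001|This is the text spoken in audio_0001.wav.
      audio_0002|And this is the text of the second recording.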

  • @scapolozza
    @scapolozza 8 months ago

    Hi Thorsten, thank you for your work. I learned a lot from your channel.
    I saw the video on increasing the quality of espeak-ng, and I was wondering if piper already uses the espeak-ng dictsources. If it doesn't use them, do you think it is possible to implement them in some way? Keep up your amazing work.

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      Thanks for your nice feedback 😊. Yes, Piper TTS uses espeak-ng, but brings its own version out of the box. But you can apply the same process and just replace the compiled dict file within Piper's own espeak-ng files. Does this help you?

    • @scapolozza
      @scapolozza 8 months ago

      @@ThorstenMueller Of course it helps me. Thank you Thorsten
      GUUDEEE!

  • @sickeningdreams
    @sickeningdreams 6 days ago

    Where do I find English voice datasets?

    • @ThorstenMueller
      @ThorstenMueller  4 days ago

      There are multiple places you can take a look: Zenodo, OpenSLR and Hugging Face datasets. For the English language there should be some datasets to choose from.

  • @yakov6292
    @yakov6292 20 days ago

    Thanks for this video! Should it work when fine-tuning a new language with a different accent, like Hebrew?

    • @ThorstenMueller
      @ThorstenMueller  18 days ago

      Yes, I even trained/finetuned my German model on an existing English checkpoint.

    • @yakov6292
      @yakov6292 18 days ago

      @ThorstenMueller
      Amazing!
      I think German has a more similar accent to English, compared to Hebrew, which is completely different.

    • @yakov6292
      @yakov6292 18 days ago

      @ThorstenMueller How long did it take you to fine-tune?

    • @ThorstenMueller
      @ThorstenMueller  10 days ago

      @@yakov6292 Around 6 weeks on an NVIDIA Jetson Xavier AGX device.

  • @paregis
    @paregis 10 months ago

    Great stuff! How long did it take to train? GPU? Which one?

    • @ThorstenMueller
      @ThorstenMueller  10 months ago +1

      You mean for my free German "Thorsten-Voice" (high) model? If so, I trained it on an NVIDIA Jetson Xavier AGX device and training took over 6 weeks.

  • @NacidoSKYRIM
    @NacidoSKYRIM 22 days ago

    Thanks! Very nice video!
    I have a question: how can I transform a .json file to onnx? Is it possible?
    Well, I have created some voice models using XTTS Mantella, creating .json files directly, and I want to try to use them in piper tts.

    • @ThorstenMueller
      @ThorstenMueller  18 days ago +1

      A trained piper model requires both files. The onnx is the actual trained model and the json is the config that "describes" your model. Did you run a full piper tts model training?
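      If you did, the export step from Piper's TRAINING.md should give you both, roughly like this (paths are placeholders):

      python3 -m piper_train.export_onnx /path/to/checkpoint.ckpt /path/to/model.onnx
      cp /path/to/training_dir/config.json /path/to/model.onnx.json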

    • @NacidoSKYRIM
      @NacidoSKYRIM 18 days ago

      @@ThorstenMueller Yes, thanks! I already checked that little bit of information so now I'm following your guides to train my own model locally ^^! thanks for everything! ♥

  • @NNokia-jz6jb
    @NNokia-jz6jb 1 year ago +1

    24:32 Here is a voice test.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thank you 😊. Do you think adding this timestamp to video description would make sense?

    • @NNokia-jz6jb
      @NNokia-jz6jb 1 year ago

      @@ThorstenMueller Not adding, correcting and replacing with the 19.** minute one.

  • @stevennovotny9568
    @stevennovotny9568 3 months ago

    Do you know if reducing the learning rate helps with the erratic behavior in the learning curve?

    • @ThorstenMueller
      @ThorstenMueller  3 months ago +1

      Honestly, I'm not sure. Maybe you can get better answers by asking their community on GitHub.

  • @CM-mo7mv
    @CM-mo7mv 1 year ago

    I am not sure if you ever made a tutorial on how to create the actual dataset. I think it would be nice to gather a sample from older loved ones before they continue to the next plane ....

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +5

      I've included it as part of an early tutorial (ruclips.net/video/4YT8WZT_x48/видео.html), but maybe I can create a special tutorial just on recording and dataset creation. Do you think this would be useful?

  • @420gramas7
    @420gramas7 3 months ago

    Hey, is it possible to make this custom TTS work on Android to read pdf/epub books?

    • @ThorstenMueller
      @ThorstenMueller  3 months ago

      IMHO piper tts does not work on Android. Not sure if this is on their roadmap.

  • @tannisroot
    @tannisroot 4 months ago

    Hi, do you have a guide, or can you maybe point to some documentation, on how to easily convert a public domain audiobook recording into a voice dataset? I found a few hours of single-speaker speech, and from what I've seen, piper recording studio is not suitable for this. Is there some sort of compilation of tools to split the audio apart into phrases, transcribe them using something like whisper, and compile it all into a dataset with the .csv file that can then be fed to the piper training setup?

    • @ThorstenMueller
      @ThorstenMueller  3 months ago

      That's a good point. I've seen this question a few times now, but I don't know a tool or pipeline doing this. It would be a nice feature, for sure 😊.
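      As a rough starting point, something like this could work (untested sketch; fixed-length splitting with ffmpeg is cruder than proper silence-based segmentation, and the openai-whisper CLI is assumed to be installed):

      mkdir -p wavs txt
      # split the recording into ~10 second mono 22.05 kHz chunks
      ffmpeg -i audiobook.mp3 -f segment -segment_time 10 -ar 22050 -ac 1 wavs/chunk_%04d.wav
      # transcribe every chunk to a .txt file
      for f in wavs/*.wav; do
        whisper "$f" --model base --language en --output_format txt --output_dir txt/
      done
      # assemble an LJSpeech-style metadata.csv (id|transcription)
      for f in wavs/*.wav; do
        base=$(basename "$f" .wav)
        echo "$base|$(tr -d '\n' < "txt/$base.txt")" >> metadata.csv
      done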

  • @flethacker
    @flethacker 1 month ago

    I have to record 12,000 wav samples of my voice??

    • @ThorstenMueller
      @ThorstenMueller  1 month ago

      No, I did many recordings. But if you train/finetune an existing model, you might be able to get good results with 500-1000 recordings already.

  • @feizhu-yf7cl
    @feizhu-yf7cl 6 months ago

    Great work. I tested piper-tts on my terminal device; the voice sounds good. But I hit a critical issue: CPU usage was nearly 100% when running piper-tts on one CPU core. Furthermore, I found that even when I only run espeak-ng and the phonemizer to convert input text to phoneme IDs, CPU usage is still nearly 100%. Have you met a similar issue before? I hope to get some suggestions to lower CPU usage. Thanks very much.

    • @ThorstenMueller
      @ThorstenMueller  6 months ago

      First of all, thanks for your nice feedback 😊. As I didn't take a closer look at CPU load while using piper, I can't tell anything about it. But I can keep an eye on it in my next tests. Maybe ask on the piper tts GitHub community.

  • @rommix0
    @rommix0 7 months ago

    0:05 Just discovered you, because your Piper voice sounds just like you.

    • @ThorstenMueller
      @ThorstenMueller  7 months ago +1

      I guess this is a compliment, right? Thank you 😊.

    • @rommix0
      @rommix0 7 months ago

      @@ThorstenMueller Of course it's a compliment lol. You have a nice sounding voice.

  •  4 months ago

    I'm stuck. All went well till I ran the "python3 -m piper_train..." command and got an error (_pickle.UnpicklingError: invalid load key, '

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      What size is your voice dataset (number of recordings)?

  • @PowerGlideSir
    @PowerGlideSir 4 months ago

    Great video! However, I keep getting the following error: RuntimeError: Error(s) in loading state_dict for VitsModel: Missing key(s) in state_dict when running without the "--quality high" flag. When I run with the flag enabled, I keep running out of VRAM on my RTX 3090. Any tips on how to solve this issue?

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      Thanks for your kind feedback 😊. There seems to be an active issue on Piper repo. Do you know this? github.com/rhasspy/piper/issues/108

  • @Mr-Coke
    @Mr-Coke 8 months ago

    In your previous video you do the voice training on Windows. Can this part here also be done on Windows, or do you have to use Linux for it?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      Good question - as far as I know, training currently doesn't work on Windows. But I haven't tried it myself (yet).

  • @Borszczuk
    @Borszczuk 4 months ago

    Hint: consider adding a pop filter to your mic setup.

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      Thanks. I have one, but as the mic has an integrated one, I thought it would not be required. I will give it a try.

  • @itsmekhadu2408
    @itsmekhadu2408 7 months ago

    ERROR: Could not find a version that satisfies the requirement piper-phonemize==1.1.0 (from versions: none)

    • @ThorstenMueller
      @ThorstenMueller  7 months ago

      Which operating system are you using? 32 or 64bit?

  • @zerthura8500
    @zerthura8500 1 year ago

    Hello. Do you think using this software instead of Coqui can be compatible with SadTalker?
    Coqui creates conflicts when used with SadTalker on Windows.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Hi, and thanks for your comment 😊. As I have no experience with SadTalker, this is hard for me to answer.

  • @anarmustafayev9145
    @anarmustafayev9145 9 months ago

    Very nice tutorial. A technical question: why isn't it possible to need only a few sentences instead of a huge dataset of sentences to create the clone? There is some AI software that can already read out written novels based on a short 30-second sample of your own voice. Is there development in that direction too, or is this limited to fixed datasets that you have to capture laboriously beforehand?
    Many thanks and continued success.

    • @ThorstenMueller
      @ThorstenMueller  9 months ago

      Many thanks for your nice comment 😊.
      In the free open-source world, high-quality voice clones based on little data are still somewhat difficult. XTTS by Coqui goes exactly in that direction (it's solid, but still has room for improvement). Do you know the video about it?
      ruclips.net/video/HJB17HW4M9o/видео.html

  • @ernieprevost6555
    @ernieprevost6555 1 year ago +2

    Hi Thorsten, great video, really useful.
    I hope you don't mind me asking a couple of questions. I am trying to create a standalone voice assistant project and have most of it working. I am using 'vosk' as an STT engine and 'piper' as a TTS engine. At this time I am using one of the available models, "vosk-model-small-en-us-0.15". My ultimate aim would be to create a model based on HAL 9000 (Douglas Rain) from 2001: A Space Odyssey. I decided to work through your video to understand and test how the model creation process works. I am running this on a Raspberry Pi model 4. I downloaded your voice dataset (it must have taken you a long time to create all the wav files) and successfully ran the pre-processing stage. The problem I have now is running the training stage; I get the following error: ModuleNotFoundError: No module named 'piper_train.vits.monotonic_align.monotonic_align'. I have not been able to find what the problem is, do you have any idea?
    I also notice that the accelerator parameter has a 'TPU' option; is that the Google USB TPU used with TensorFlow?
    Thanks again, Ernie

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Thanks for your nice feedback, and yes, recording took some months. Did you run the "build_monotonic_align.sh" script?
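      That would be something like this (path as in the piper repo):

      cd piper/src/python
      ./build_monotonic_align.sh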

  • @amitmaharjan3522
    @amitmaharjan3522 1 year ago

    Thank you

  • @ssamjh
    @ssamjh 1 year ago

    How do I create a dataset of my voice? I'm not finding any instructions in the piper repo.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Maybe this tutorial from me helps with that. It's part of another video, but at this timestamp I show how to record your voice and create a dataset in LJSpeech structure.
      ruclips.net/video/4YT8WZT_x48/видео.html

  • @sayednabeelakhtar3243
    @sayednabeelakhtar3243 1 year ago +1

    RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
    I am getting this error when I start to train.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Which CUDA and Torch version do you have?

    • @sayednabeelakhtar3243
      @sayednabeelakhtar3243 1 year ago

      @@ThorstenMueller '2.0.1+cu118' Torch version (Ubuntu 22)

    • @sayednabeelakhtar3243
      @sayednabeelakhtar3243 1 year ago

      | NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |

    • @sayednabeelakhtar3243
      @sayednabeelakhtar3243 1 year ago

      I have downgraded Ubuntu to 20, still the same error. I'm using an RTX 4090.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Hmm, have you seen this closed issue? Maybe this helps.
      github.com/pytorch/pytorch/issues/88038 @@sayednabeelakhtar3243

  • @sabanheric6897
    @sabanheric6897 10 months ago

    Hello,
    after running pip3 install -e .
    I'm getting this error:
    ERROR: Could not find a version that satisfies the requirement piper-phonemize~=1.1.0 (from piper-train) (from versions: none)
    ERROR: No matching distribution found for piper-phonemize~=1.1.0
    I was testing on Python 3.8, 3.9, 3.10 and 3.11, but always the same error. I suppose the problem is around piper-train.

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      Maybe try to install the package this way: "pip install piper-phonemize==1.1.0". Does this help?

    • @ThorstenMueller
      @ThorstenMueller  9 months ago

      @user-jn9vs1yi8b Which hardware architecture do you have?

  • @NNokia-jz6jb
    @NNokia-jz6jb 1 year ago

    So, where are the voice examples to listen to? What is the quality?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      According to your other comment, I guess you found it ;-).

  • @TNMPlayer
    @TNMPlayer 10 months ago

    I really wish TTS would come packaged with RVC.

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      Thanks for this "hope" and suggestion. Maybe it's worth starting a discussion on Piper TTS Github repo. github.com/rhasspy/piper/discussions

  • @LucidFirAI
    @LucidFirAI 1 year ago

    So far I've managed to get Tortoise TTS and ClipChamp TTS working. Tortoise is slow and unreliable, ClipChamp is surprisingly versatile and high quality for a totally free web based thing. I take RVC with datasets to make the final output sound the way I want.
    Do you have any recommendations of what I should be using though? I want free, local (I have an Nvidia 4090 to take advantage of), with unlimited generation, and a human sounding voice. Ideally it would also be trainable so I could run it on a model of the voice I want, which could be refined with RVC, whereas currently I have to take a vaguely correct voice and then refine it with RVC.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      If you have GPU power available as you said (NVIDIA 4090), you should take a closer look at Coqui TTS. Some models provide really great quality.
      I've created audio sample comparison videos if you would like to get an idea of their quality:
      ruclips.net/video/HojuVmW5LUI/видео.html and ruclips.net/video/Vnjv2L31eyQ/видео.html

  • @HoratioNegersky
    @HoratioNegersky 6 months ago

    Thorsten, I'm unclear on this after watching - is it possible to put in just a couple hundred new clips with a specific new voice and then use the checkpoints as a basis for cloning a new voice??

    • @HoratioNegersky
      @HoratioNegersky 6 months ago

      Never mind, I see your full dataset contains the typical ~10,000+.
      I'm working on one with a baked-in accent, so I'll be doing it from the ground up also. Thanks for the tutorial!

    • @ThorstenMueller
      @ThorstenMueller  6 months ago

      Glad you found your answer and best of luck for your project 😊.

  • @helloworld7796
    @helloworld7796 5 months ago

    Hey Thorsten, I started with a new model from scratch. I have a good 300 sentences for now and started training. It works fine so far, but I would like to keep adding more data. Do I have to start from scratch each time I have more data, or is there a way to add additional data and continue training?

    • @ThorstenMueller
      @ThorstenMueller  5 months ago

      In Piper you can use the argument "resume_from_checkpoint" for that. See more info here: github.com/rhasspy/piper/blob/master/TRAINING.md#training-a-model
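      A sketch based on that page (all paths and values are placeholders):

      python3 -m piper_train \
        --dataset-dir /path/to/training_dir \
        --accelerator gpu --devices 1 --batch-size 16 \
        --max_epochs 5000 \
        --resume_from_checkpoint /path/to/last.ckpt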

    • @helloworld7796
      @helloworld7796 5 months ago

      @ThorstenMueller Oh, I thought that option was only for when I stopped training and continued on the same dataset. So I could use this even if I add more data, and the model will recognize the new data inside the same folder where I had the old data? Also another question, since you know everything: I am using a CPU for training and waiting for days to get some result. I want to get a GPU from Nvidia. Which one would you recommend, 2080 Ti vs 3080? Or some other in that price range?

    • @ThorstenMueller
      @ThorstenMueller  4 months ago

      @helloworld7796 Good point, but as I finetuned my German Thorsten-Voice models with that option based on an English checkpoint, this should be possible.

  • @morgan_rockefeller_official
    @morgan_rockefeller_official 10 months ago

    Can the TTS also be used to have your books read aloud on your phone? Currently I have the TTS voices that Google provides by default when I listen to my books. And if that's possible, how do you do it?

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      Piper TTS itself doesn't run on mobile devices. However, you can generate the text as a wave audio file on a PC and then copy it to your phone and play it there. It's a workaround, but would that be an option?

  • @khajask8113
    @khajask8113 3 months ago

    Is the Hindi language supported..?

    • @ThorstenMueller
      @ThorstenMueller  3 months ago +1

      You can create a TTS model for every language in Piper. Existing languages and models can be found here: github.com/rhasspy/piper/blob/master/VOICES.md

  • @audreylin5424
    @audreylin5424 2 months ago

    Great stuff! Thanks a lot. By the way, I was wondering if piper can be trained to speak mixed languages, for example a mix of English and Chinese.

    • @ThorstenMueller
      @ThorstenMueller  2 months ago

      IMHO this might work in the near future when SSML is implemented. But I am not sure about their roadmap.

  • @syedmuqtasidalibscs0434
    @syedmuqtasidalibscs0434 5 months ago

    The metadata.csv file is not present in my dataset. Where can I get this?

    • @ThorstenMueller
      @ThorstenMueller  5 months ago +1

      Metadata.csv is part of a (ljspeech structured) voice dataset for TTS training. Do you have a voice dataset for your model training?

    • @syedmuqtasidalibscs0434
      @syedmuqtasidalibscs0434 5 months ago

      @ThorstenMueller Yes, I recorded 1150 voice samples and it generated a wave folder for the recordings and a metadata.csv. I am following your video, and in your video, when you download the dataset, you have 4 files, but I have 2. Why?

    • @ThorstenMueller
      @ThorstenMueller  5 months ago

      @syedmuqtasidalibscs0434 I've split the metadata up into an evaluate, train and test dataset, but this is optional. You can put all your 1150 wave files in one complete metadata.csv. This should work fine 😊.
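      The preprocess call would then look roughly like in TRAINING.md (placeholders again):

      python3 -m piper_train.preprocess \
        --language en-us \
        --input-dir /path/to/dataset_dir \
        --output-dir /path/to/training_dir \
        --dataset-format ljspeech \
        --single-speaker \
        --sample-rate 22050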

    • @syedmuqtasidalibscs0434
      @syedmuqtasidalibscs0434 5 months ago

      @ThorstenMueller It's not working fine; when I run the command I get this assertion error: missing metadata.csv.

    • @syedmuqtasidalibscs0434
      @syedmuqtasidalibscs0434 5 months ago

      @ThorstenMueller When running the long piper_train.preprocess command with input directory, output directory, language, etc.

  • @احمدصبيح-خ7و
    @احمدصبيح-خ7و 8 months ago

    When I start the final phase of voice training, this message appears: /piper/src/python/.venv/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:153: UserWarning: Total length of `CombinedLoader` across ranks is zero. Please make sure this was your intention.
    rank_zero_warn(
    `Trainer.fit` stopped: No training batches.

    • @ThorstenMueller
      @ThorstenMueller  8 months ago +1

      Oh, that's a very specific error I did not encounter myself. Maybe ask this question on the Piper TTS GitHub community for support.

    • @احمدصبيح-خ7و
      @احمدصبيح-خ7و 8 months ago

      It became clear to me that during voice training the language should be chosen without specifying the geographical region of the language. The problem was that I was selecting language + country, and that is what caused the problem. No problem now, but the result was weak, because I relied on a medium-quality file in training, and Piper does not have a high-quality file for my language that I could work from... Is it possible to create a voice from scratch in Piper? How, and how long does it take? Or is it possible to train a medium voice to high without vocal distortions?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      @احمدصبيح-خ7و I think finetuning a medium model to high will not work, but I'm not 100% sure. I finetuned my German high model based on an English high model. Maybe this is an option for you?

  • @MrZongwei
    @MrZongwei 1 year ago

    Can I train models for other languages by cloning my own Chinese voice for text to speech?

  • @Tourbillion9048
    @Tourbillion9048 1 year ago

    I'm not computer savvy. Would you be able to create an app that we could use to clone our voice, so that we could just dump a text document into it to convert into an audio file? That would be so cool.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      I agree that this would be really cool. At the moment, voice cloning requires some technical skills. Maybe you can take a look at Coqui Studio, as they are working on an easy voice cloning process in a nice user interface.

    • @Tourbillion9048
      @Tourbillion9048 1 year ago

      @ThorstenMueller Thanks for responding to my inquiry. I really appreciate all the hard work you put into making your videos and trying to help people. Could you make a video for dummies, with step-by-step instructions on how to install Coqui on Windows and how to clone voices?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      @@Tourbillion9048 You're very welcome. Here are two tutorials for Windows that might help you.
      * Installing Coqui TTS on Windows: ruclips.net/video/zRaDe08cUIk/видео.html
      * Voice cloning on Windows with Coqui TTS: ruclips.net/video/bJjzSo_fOS8/видео.html

  • @raiyanahmed1177
    @raiyanahmed1177 6 months ago

    Can I use any dataset here for my language, like Bengali?

    • @ThorstenMueller
      @ThorstenMueller  6 months ago

      You mean a text corpus in Bengali or a prerecorded voice dataset?

  • @dash8x
    @dash8x 8 months ago

    I am getting a "Segmentation fault (core dumped)" error when I run piper_train.

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      Strange - which operating system? Did you open an issue on GitHub about it?

    • @dash8x
      @dash8x 8 months ago

      @ThorstenMueller It's on Ubuntu. I got it working after reducing the batch size. Training has been going on for a couple of days. But the sound it generates is unintelligible, even though it sounds like spoken text.

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      @@dash8x Happy you got it working. Give training some time to produce human sounding results.

  • @tr1pod623
    @tr1pod623 9 months ago

    Does this work on Windows 11? Because I get an error when trying to train:
    nrvcerrror

    • @ThorstenMueller
      @ThorstenMueller  9 months ago +1

      As I do not have Windows 11, I can't really help you on that. Maybe it's a good idea to ask this question on the Piper TTS GitHub repository.

  • @yyyzzz-k3r
    @yyyzzz-k3r 1 year ago

    What model is behind it? Does anyone know?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      Technically these are VITS TTS models. Is this your question?

  • @truszko91
    @truszko91 1 year ago

    "ERROR: Could not find a version that satisfies the requirement piper-phonemize (from versions: none)"
    Have you run into this? I cannot find a way to install this at all...

    • @synesthesiam
      @synesthesiam 1 year ago

      Only Linux is supported for now :/

    • @truszko91
      @truszko91 1 year ago

      @synesthesiam But I am on Linux :/ What am I doing wrong then, haha. Do I need any specific chip architecture or something? Haha, that's so odd.

    • @synesthesiam
      @synesthesiam 1 year ago

      @@truszko91 What Python version? Only 3.9-3.11 are currently supported.

    • @truszko91
      @truszko91 1 year ago

      @synesthesiam Oh, where did you see that?? ;o I missed that memo... I'm on 3.8.10.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thanks for the clarification. I've added this info to the video description.

  • @lennart3626
    @lennart3626 7 months ago

    Thanks for this! Really excited to try this out for my university project. Unfortunately, after brute-forcing it for 2 days, I could not get the required libraries to install. I am on the Raspberry Pi 4, and after some initial struggles I upgraded my Python to 3.10. Nevertheless, I still get this error for piper-phonemize:
    ERROR: Could not find a version that satisfies the requirement piper-phonemize~=1.1.0 (from piper-train) (from versions: none)
    ERROR: No matching distribution found for piper-phonemize~=1.1.0
    Any advice on how I can break free?

    • @ThorstenMueller
      @ThorstenMueller  7 months ago

      I've heard about that issue a few times already. Maybe try installing it with a fixed version: "pip install piper-phonemize==1.1.0", this should work then.

    • @lennart3626
      @lennart3626 7 months ago

      @ThorstenMueller Thank you very much! I have been watching a lot of your great videos over the last couple of days to get the perfect TTS model for my needs. As a bloody beginner, I do still have some trouble keeping up, but practice makes perfect, right? For the problem above, I also went down a deep rabbit hole of forum posts and ended up having to reinstall the OS on my RPi. From school we got the 32-bit OS, but apparently piper-phonemize only works on 64-bit. Doing that and making sure my Python is at 3.10 finally got the job done. Unfortunately, piper turned out not to be the best model for my needs and I have now moved on to Coqui! Let's see where this adventure leads me!

    • @ThorstenMueller
      @ThorstenMueller  7 months ago

      @lennart3626 Thanks for your response and the tip that piper-phonemize is 64-bit only. Good luck with Coqui TTS.

  • @annlaosun6260
    @annlaosun6260 2 months ago

    Does it work on Mac?

    • @ThorstenMueller
      @ThorstenMueller  1 month ago

      I'm not sure if training works on Mac, but as there are piper downloads for Mac available, it should be possible to run pretrained voice models on Mac. github.com/rhasspy/piper/releases

  • @dash8x
    @dash8x 8 months ago

    Can I train for a language not supported by espeak-ng as well?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      Not sure if that's possible in Piper, but in Coqui TTS you can choose a "character based" training in case espeak-ng is not available for your language.

    • @tesfa2586
      @tesfa2586 8 months ago

      You can't, but you can add your language to espeak-ng first. I just did that and it worked. espeak-ng has good documentation on how to add a new language.

  • @katokira260
    @katokira260 1 year ago

    Thanks for making this tutorial... is it possible to train an Arabic model?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      You're welcome, and yes, training an Arabic TTS model should be possible 😊. Have you tried it and encountered any problems?

    • @katokira260
      @katokira260 1 year ago

      @ThorstenMueller Nice... no, I never tried... but I will... thanks again.

  • @FrankGlencairn
    @FrankGlencairn 1 year ago

    Now on Windows, without a command prompt but with a UI - that would be it.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      True, that would be it 😊. But I don't know whether anything in that direction is planned.

  • @_Sepherial
    @_Sepherial 8 months ago

    How do I use a cloned voice to read aloud a pdf file?

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      This is not possible by default. You might write a script extracting the text from the PDF file and feeding that text to Piper for TTS generation.
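      A tiny sketch using poppler's pdftotext (assuming piper and a voice model are already set up; "-" makes pdftotext write to stdout):

      pdftotext book.pdf - | piper --model /path/to/model.onnx --output_file book.wav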

    • @_Sepherial
      @_Sepherial 8 months ago

      @@ThorstenMueller If I was a programmer, I wouldn't need to ask this question.

    • @ThorstenMueller
      @ThorstenMueller  8 months ago

      @_Sepherial Right now there is no easy way to do this; I'm not sure if someone is working on that, because what you ask for is really a useful feature. I can ask Mike this question in my interview.

  • @ugn
    @ugn 5 months ago

    "I Gude, V?"

  • @manuelmao4700
    @manuelmao4700 1 year ago

    Hi! Great video! How can I do this on macOS?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +1

      Thanks for your feedback. As of now, Piper is not officially supported on Mac, just on Linux. But there are open issues with some tips for macOS. Maybe it's worth a look.
      github.com/rhasspy/piper/issues/27

    • @serge3595
      @serge3595 1 year ago

      @ThorstenMueller Can you update us with a guide when Piper is made to work more smoothly with macOS? Thanks :)

  • @darkjudic
    @darkjudic 1 year ago

    Is it better than Coqui TTS?

    • @synesthesiam
      @synesthesiam 1 year ago +2

      The answer isn't straightforward. Coqui TTS is much more flexible: it supports many different types of TTS models. Piper only has VITS, but it's been designed to be usable on lower-end hardware (RPi 4). So speed has been more important than quality. The binary releases are also only ~25MB and self-contained with voices being 50-100 MB apiece.

  • @Teletha
    @Teletha 2 months ago

    Why has no one just made a simple Windows app?

    • @ThorstenMueller
      @ThorstenMueller  1 month ago

      I didn't try it myself, but have you seen this project (github.com/natlamir/PiperUI)? Not sure if it works in general or on Windows, or whether it's safe to use. But maybe check it out.

  • @MrZongwei
    @MrZongwei 1 year ago

    Does it support Chinese voices?

    • @ThorstenMueller
      @ThorstenMueller  1 year ago

      According to their GitHub page github.com/rhasspy/piper, Chinese (zh_CN) is supported. So yes 😊. Check my tutorial on how to use Piper TTS with the pretrained Chinese model here: ruclips.net/video/rjq5eZoWWSo/видео.html
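      A quick sketch to try it (the voice file name comes from the VOICES.md list and may change):

      echo '你好，世界' | piper --model zh_CN-huayan-medium.onnx --output_file nihao.wav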

    • @MrZongwei
      @MrZongwei 1 year ago

      @ThorstenMueller It's great.

  • @cybertrike
    @cybertrike 9 months ago

    This whole video is why Nix exists.

  • @TheVisitorX
    @TheVisitorX 1 year ago +1

    I find it very confusing that your title and description are in German but your videos are in English. I fall for it every time. I have nothing against English, but then you should perhaps at least mark your videos so that you can see it at a glance, and not only once you've already clicked on the video. Very misleading.

    • @siemensohm
      @siemensohm 1 year ago

      For me, both the title and the description are in English. Or I'm too late and it was changed in the meantime. ;)

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      Hi, thanks for the pointer. Usually, when I don't forget, I put a US flag in the title to mark it. I've added that here too, so thanks again for the info 😊. If your RUclips language and/or browser language is set to English, the description also comes up in English right away.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      @siemensohm Hi, that depends on the language you use RUclips in (and/or the language setting in your browser). But after the pointer I marked it with a US flag in the title to show that it's an English video. Thank you for the additional info 😊.

    • @siemensohm
      @siemensohm 1 year ago

      @ThorstenMueller Ah, so that's how it works. :)

    • @TheVisitorX
      @TheVisitorX 1 year ago

      @ThorstenMueller Thank you very much for that! I find it easier to follow this way. It wasn't meant badly at all, because I appreciate and like your videos! :)

  • @teenudahiya01
    @teenudahiya01 10 months ago

    Hi bro, can I create an onnx model and onnx json file for a Hindi language voice, and can I use it in Piper TTS? Please reply.

    • @ThorstenMueller
      @ThorstenMueller  10 months ago

      IMHO there's no pretrained Hindi model available to use out of the box, or did you train a Hindi model yourself?

    • @teenudahiya01
      @teenudahiya01 10 months ago

      @ThorstenMueller Are a Hindi model and a Hindi onnx file different things? I have piper tts on WSL, can I train with this?

  • @MrZongwei
    @MrZongwei 1 year ago

    Is there a simple way to generate a dataset myself? Your video from two years ago using Mycroft Mimic Recording Studio was still a bit complicated. For example, placing a recording file in the wav directory and a content file in the CSV file, with the file names also starting from 000001.

    • @ThorstenMueller
      @ThorstenMueller  1 year ago +2

      I have a video about Piper-Recording-Studio on my TODO list. Maybe this will make the process a little easier and more up to date.

    • @MrZongwei
      @MrZongwei 1 year ago

      We need you, please move faster, my friend @ThorstenMueller

  • @broketechenthusiast2372
    @broketechenthusiast2372 8 months ago

    Thank you so much for your guide. This helped me get past some of the problems that I was having. I’m curious if you have a video or know of a video that goes over what to look for during training. Such as recognizing issues when looking at the generative loss or weights and what settings/dataset changes to make to counteract them. Thank you!

    • @ThorstenMueller
      @ThorstenMueller  8 months ago +1

      You're welcome, and I'm happy you like my videos. Do you mean how to keep an expert view on training parameters like losses, etc. (e.g. Tensorboard details)?

    • @broketechenthusiast2372
      @broketechenthusiast2372 8 months ago

      @ThorstenMueller Yes. For example, my gen_loss graph is erratic with no overall trend. I'd like to have some idea of the causes and solutions for that kind of thing, and how to identify similar issues.