Voice Cloning Tutorial with Coqui TTS and Google Colab | Fine Tune Your Own VITS Model for Free

  • Published: 26 Jul 2022
  • 12/22/22 - Follow-up video with different notebook and Linux install instructions, • Near-Automated Voice C... - Automated Dataset Creation with Whisper STT + VITS Training with Coqui TTS
    !!! IMPORTANT !!! 11/29/22: New notebook added, no need to manually upload training scripts anymore. Just set the variables and run the cells. The scripts will be output to files on your Google Drive:
    colab.research...
    A quick and dirty voice cloning tutorial: how to fine tune a VITS voice model using the Coqui TTS framework on Google Colab. (A minimal training-script sketch is included below the link list.)
    Follow along and see how to make a voice model like the Bill Gates one used in this: • Video (Bill Gates raps to Deltron 3030's Virus)
    pastebin.com/H... - save this as rnnoise.py and upload it to Google Drive
    NEW INFO 9/13/22. UPDATED COQUI 0.8.0 MODEL. FASTER TRAINING. BETTER QUALITY.
    pastebin.com/U... save as finetune_vits.ipynb and open/upload to Google Colab
    pastebin.com/U... - save as train_vits.py to your Google Drive
    OLD INFO:
    pastebin.com/6... - NOTE: 9/3/22 Colab script updated to fix restore fine tuning. Save this as Voice_Clone.ipynb and upload it to Google Colab or your Google Drive. Then, in Colab, select Open from Drive.
    pastebin.com/6... - save this as train-vits-bg-colab.py in your Google Drive folder
    @ThorstenMueller - for great Coqui TTS and Mycroft videos
    github.com/coq... - Coqui TTS site
    www.audacityte... - Audacity editor
    waveshop.source... - WaveShop editor
    www.sonicvisua... - Sonic Visualiser
    www.gyan.dev/f... - FFmpeg Windows Builds
    notepad-plus-p... - Notepad++
  • Science
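    For readers who just want to see the shape of the training script before downloading anything, here is a minimal sketch of a VITS fine-tuning script in the style of the Coqui TTS LJSpeech recipe. Paths, the dataset name, and the hyperparameter values are placeholders, and the exact API has shifted between Coqui releases (for example, BaseDatasetConfig used "name" before "formatter"), so treat this as an outline rather than a drop-in replacement for the linked train_vits.py:

    import os
    from trainer import Trainer, TrainerArgs
    from TTS.tts.configs.shared_configs import BaseDatasetConfig
    from TTS.tts.configs.vits_config import VitsConfig
    from TTS.tts.datasets import load_tts_samples
    from TTS.tts.models.vits import Vits
    from TTS.tts.utils.text.tokenizer import TTSTokenizer
    from TTS.utils.audio import AudioProcessor

    # Placeholder dataset layout: <dataset>/metadata.csv + <dataset>/wavs/*.wav
    dataset_path = "/content/drive/MyDrive/mydataset"
    output_path = os.path.join(dataset_path, "traineroutput")

    dataset_config = BaseDatasetConfig(
        formatter="ljspeech",            # "name" on older Coqui versions
        meta_file_train="metadata.csv",
        path=dataset_path,
    )
    config = VitsConfig(
        batch_size=16,
        eval_batch_size=16,
        num_loader_workers=2,
        run_eval=True,
        epochs=1000,
        text_cleaner="english_cleaners",
        use_phonemes=False,              # character-based training, as in the video
        output_path=output_path,
        datasets=[dataset_config],
    )

    ap = AudioProcessor.init_from_config(config)           # audio/feature settings from the config
    tokenizer, config = TTSTokenizer.init_from_config(config)
    train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

    model = Vits(config, ap, tokenizer, speaker_manager=None)
    trainer = Trainer(
        TrainerArgs(), config, output_path, model=model,
        train_samples=train_samples, eval_samples=eval_samples,
    )
    trainer.fit()

    To fine tune rather than train from scratch, launch the script with --restore_path pointing at the downloaded LJSpeech VITS checkpoint, the same flag used in the restore command quoted further down this page.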

Comments • 161

  • @jasonhardy7945
    @jasonhardy7945 1 year ago +1

    Thank you for your help, these days it's difficult to come by people like you. I want you to know that my heart will never stop appreciating your generosity.

  • @alanmarin431
    @alanmarin431 1 year ago +1

    You are literally the best, I've been looking for a tutorial for three days and yours works

  • @kalebcitoxd783
    @kalebcitoxd783 1 year ago

    I've been messing around on a friend's soft soft for years, finally bought it. I found your videos and instantly subscribed and have been

  • @wiltonmarqueswilton1470
    @wiltonmarqueswilton1470 1 year ago

    Dude, within 4 hours of watching this video, I bought soft soft, and made my first 56-second soft clip. You're an awesome teacher,

  • @freakycrap3
    @freakycrap3 1 year ago +4

    This was very helpful! The code you wrote is very clear and easy to follow. Thank you! Looking forward to more videos.

    • @nanonomad
      @nanonomad 1 year ago +2

      I just slap things together that people smarter than me figured out. It's like the old days when computers were still fun.
      The tutorial is already a bit out of date, but it still works, I hope. Coqui had some big updates, and VITS can now have emotional affectations somehow. Once I get it going I'll do a video.

    • @Sparkyh
      @Sparkyh 1 year ago

      @@nanonomad Please do. Also, the .py files gave back an error, I couldn't even upload them to Google Colab. Sighs, I'm just trying to experiment for funsies, but not even Tacotron 2 was working anymore, I'm beyond frustrated.

  • @victle
    @victle 1 year ago +1

    Seriously, thanks for keeping this up to date!

    • @nanonomad
      @nanonomad 1 year ago

      Let me know if you have any trouble making sense of the script or anything.
      I may have forgotten to put a note: if you get a warning about a bunch of layers not found when you first start, you will need to train a few steps until you get a best_model file, then stop, change the names in your cells, run them again, and then resume training. The model needs to be updated to save all the layers. It'll still train anyway, but you might get better quality with the quick stop and restart.
      Coqui posted a bunch of very impressive videos using their Coqui Studio app. I don't think the models are downloadable, but the voice quality is excellent if you wanted a more off-the-shelf approach to cloning.

  • @ThorstenMueller
    @ThorstenMueller 2 years ago +13

    Thanks for making this helpful video and mentioning my channel. I really appreciate it 😊. But always think about copyright and license topics when building your voice dataset - especially when public sources like YouTube, etc. are involved.

    • @NSPlayer
      @NSPlayer 2 years ago +3

      No one cares about copyright when you're having fun; good points to not get banned on YouTube, though.

  • @samadpassu
    @samadpassu 1 year ago +1

    Awesome tutorials, I was always interested in AI-based TTS systems. This made it really easy.

  • @Figma_timelapse
    @Figma_timelapse 1 year ago +2

    Yo, this helped so much! I always appreciate the content; ever since I found the channel and got the energy from you in the previous video, you've been nothing but real, and I can vouch for the amazing content and how down to earth you are with everything! All the most love, respect, and appreciation.

  • @conglechi9210
    @conglechi9210 1 year ago +1

    What a video bro! Can't thank you enough! Thank you so much you made life so much easier!

  • @alexyarde8841
    @alexyarde8841 1 year ago

    Super great video, makes me excited to learn more and get started! Love the "stay organized" motto!!

  • @MadsVoigtHingelberg
    @MadsVoigtHingelberg 1 year ago

    I hear it. The theme song from the Amiga 500 game "Lost Patrol". 👍 Epic! Also fantastic video, I learned a lot.

    • @nanonomad
      @nanonomad 1 year ago

      Haha, this made my day. I love the music in some of these old games. This one was a Lost Patrol remix .mod file from some BBS, but somewhere along the line I lost the original file.
      Just training some models today for a followup video on automating dataset creation with OpenAI Whisper and a bunch of Linux command line utils.
      It'll be a mess, because I'm not great with Colab, but the workflow is: Mp3 file / wav file -> ffmpeg convert -> split with sox -> rnnoise denoise -> OpenAI Whisper STT -> take .txt transcript, add filename, duplicate text with | delimiter -> output to metadata.csv
      Whisper is slow, but the transcript quality for the medium and even small English models has been pretty incredible from my experience. I've tried using the out-of-the-box Coqui STT models, but the transcript quality for all the voice samples I tried was too poor to use.
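      As a rough illustration of the transcription step in that workflow (paths, the model size, and the single-speaker ljspeech-style metadata format are assumptions; the actual notebook also handles the splitting and denoising steps):

      import os
      import whisper  # pip install openai-whisper

      model = whisper.load_model("medium")   # "small" is faster, "medium" is more accurate
      rows = []
      for name in sorted(os.listdir("wavs")):
          if not name.endswith(".wav"):
              continue
          # Transcribe one clip and build an id|transcript|transcript row
          text = model.transcribe(os.path.join("wavs", name))["text"].strip()
          rows.append(f"{os.path.splitext(name)[0]}|{text}|{text}")
      with open("metadata.csv", "w", encoding="utf-8") as f:
          f.write("\n".join(rows) + "\n")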

  • @actgiangaming7433
    @actgiangaming7433 1 year ago +3

    This tutorial is amazing and you are really good at teaching!! Great job, sir!

  • @arbozfermozshodan70
    @arbozfermozshodan70 1 year ago

    Thyöv yek behat khub mard sin video havat sāhārā e man hæst ber main na zānam main ketor dhanvād e havat karatruv erkhāzat
    Behat Dhānvat e Havat 🖖
    Dohaay yiez Hindesthān
    Love from India

  • @BillyMinxFashion
    @BillyMinxFashion 1 year ago

    I love when he tests it out for us.

  • @alazar6594
    @alazar6594 1 year ago

    I just wanna learn even more now- it looks so cool o.O

  • @user-nf6yn9zt8r
    @user-nf6yn9zt8r 1 year ago

    Thank you man for sharing this stuff, it's really amazing !

  • @leandrolagos2512
    @leandrolagos2512 1 year ago

    Thank you so much Sensei! You are a blessing!

  • @Unreal_StudiosVFX
    @Unreal_StudiosVFX 7 months ago

    I have recorded my 200 voice samples and I want to clone my voice. What changes do I have to make to accomplish this task? Please help.

  • @melroypereira8325
    @melroypereira8325 10 months ago

    I tried this for the Hindi language.
    I trained a VITS model from scratch and created a single-speaker TTS model, then fine-tuned it with a different voice. The accuracy of the fine-tuned model decreased a lot compared to the base TTS model. Why?
    Can I train Tortoise TTS for custom languages?

  • @saddamjamali8501
    @saddamjamali8501 1 year ago

    I am fine-tuning ljspeech-neuralhmm using my own dataset. I am getting avg_loss between -3 and -4.
    The learning rate is the default, i.e. 0.001.
    The GPU is an Nvidia A100 with 80 GB of memory.
    The dataset distribution is 80:10:10 for training, validation, and test.
    Other parameters are set to their defaults.
    How can I tune the hyperparameters to get the best results possible?
    I don't know any programs or tools.
    What should my learning rate, batch_size, etc. be?
    Are there any other parameters I should try adjusting?
    Any other suggestions?

  • @khushhirani9389
    @khushhirani9389 1 year ago

    Very nice... I started soft soft learning. Thank you so much...

  • @bleachedout805
    @bleachedout805 1 year ago +1

    The coquí is the little frog in Puerto Rico. I only clicked because of the name.

    • @nanonomad
      @nanonomad 1 year ago

      The group behind the project apparently named themselves after the 🐸 because of its sound

  • @netbin
    @netbin 1 year ago +1

    Awesome explanation

  • @linhle413
    @linhle413 1 year ago

    THANKS A LOT MA MAN, IT WORKED FOR ME. SURE TOOK A WHILE BUT IT'S GREAT

  • @loafandjug321
    @loafandjug321 1 year ago +3

    I cloned the voice from this video but it sounds like Bill Gates?

  • @ihassan1001
    @ihassan1001 1 year ago

    Why doesn't the Google Colab work for me? I tried on 2 laptops and my desktops and it fails at the first step I take... I just don't understand... any suggestions?

    • @nanonomad
      @nanonomad 1 year ago

      Where is it failing / what is the error? The computer you're using shouldn't make much difference as long as it's a modern browser; it's just a remote session into a Google server. But the Colab window can be hard on an older computer - my laptop struggles a lot.

  • @YurisHomebrewDIY
    @YurisHomebrewDIY 1 year ago +1

    Looks like BaseDatasetConfig() has replaced the "name" argument with one called "formatter" in the most recent update. Otherwise, thanks for the excellent intro to Coqui TTS! I fear my dataset is a bit too brief for good results, but now I know how to go about training with whatever dataset I desire!

    • @nanonomad
      @nanonomad 1 year ago +1

      I think there have been updates to the VITS model training that may improve the quality with smaller datasets. Once you start training, you'll see an error note about a bunch of layers not found. If you let it train a few steps up until it saves a best_model.pth, then resume training using that as the restore file rather than the downloaded model (as you would for resuming a fine tuning session), you will be able to train the missing layers. Subjectively, it seems to improve training speed and output quality.
      Thank you for reminding me to fix that error

    • @nanonomad
      @nanonomad 1 year ago +1

      colab.research.google.com/drive/1N_B_38MMRk1BUqwI_C829TyGpNmppUqK?usp=sharing
      This notebook may be a bit more tidy than the older one. You won't need to manually upload the training scripts or rnnoise.py anymore, I just put the code in a cell and have it output the files to Colab.

    • @Lowkeh
      @Lowkeh 1 year ago

      @@nanonomad What a Chad! Thank you very much for taking the time and keeping this up to date!

  • @albertoalfaro6491
    @albertoalfaro6491 1 year ago

    Thank you for this! Super cool video! A+++

  • @user-zp3xr5cv9u
    @user-zp3xr5cv9u 1 year ago

    I appreciate this amazing, informative video and your dedication to following updates about the TTS! I have a question about your recent update, 'no need to manually upload training scripts anymore'... Could you elaborate more, please?

    • @nanonomad
      @nanonomad 1 year ago

      It was a reference to some of the older videos, if I recall correctly. I hadn't integrated all the scripting into Colab yet. However, this video is also quite old now and I've posted some revisions since. There's an AI voice playlist on my channel that should have all of them listed.

  • @nolimits8973
    @nolimits8973 1 year ago +1

    Muito bom, vou atualizar. Very good. I will update.

  • @SinanAkkoyun
    @SinanAkkoyun 1 year ago +1

    Hey! Thank you so much :) How much does espeak influence the fine-tuning and in which way? I dislike some speaking rhythms it generates by itself; I believe some of that transfers to the model.

    • @nanonomad
      @nanonomad 1 year ago +1

      I don't think espeak is going to make too much of an impact on prosody/tempo/pause length (from punctuation), but I could be wrong. But training with phonemes seems easier for English because it's a non-phonemic language. It's not necessarily better, though. I find phoneme models less predictable. Maybe it's just my lack of familiarity with all the IPA phoneme sounds.
      With character-set trained models I find it easier to correct training issues by providing more samples of the problematic words, and it's also easier to just massage the input text to work around it. (If the model can't say dodo bird, type in d'oh d'oh bird.)
      If you're doing a VITS or YourTTS model, you may find the prosody shifting a lot between checkpoints. The shifts get smaller as it gets more stable. You'll get clearer speech faster with phonemes, but you'll need to eval more checkpoints to find one with good pacing.
      Hope that made sense. If not, let me know and I'll reply when I'm not on mobile and risking the comment box closing again.

    • @SinanAkkoyun
      @SinanAkkoyun 1 year ago

      @@nanonomad Thank you so so much for all the info, makes total sense, I really appreciate it! Also thanks for all the content you provide, real gems!

  • @satyajitroutray282
    @satyajitroutray282 1 year ago +1

    Can I use the model (.pth file and config.json) generated from the Colab notebook in our locally installed Coqui TTS server? What changes do we need to make?
    Thanks for your awesome work. I have been watching your videos for the last 5 days nonstop.

    • @nanonomad
      @nanonomad 1 year ago

      If it's a regular VITS model, you'll just need the model.pth and config.json, and specify them on the command line when launching the server.
      If it's the speaker encoder mode, you'll need the downloaded speaker encoder as well, and the speaker encoder json. And if it's multispeaker, you'll need the speakers.pth/speaker json, and the language ids json if it applies.
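      As an illustration, launching the built-in server for a plain single-speaker VITS model looks roughly like this (paths are placeholders; check tts-server --help on your install for the exact flags and the extra speaker/language arguments mentioned above):

      tts-server --model_path /path/to/checkpoint.pth --config_path /path/to/config.json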

  • @julian78W
    @julian78W 2 years ago +1

    Great video!

  • @melroypereira8325
    @melroypereira8325 1 year ago

    In Coqui TTS we can use Fairseq models too... is it possible to use that kind of model and fine-tune it for voice cloning?

    • @nanonomad
      @nanonomad 1 year ago +1

      I haven't tried. Coqui usually lists what is available for training on the main GitHub page, though. If they don't specify, generally it's just for inference. Try poking around in the recipes directory in the Coqui GitHub. If there are prepared training scripts, that's where they are.

  • @rayyanahmad1423
    @rayyanahmad1423 1 year ago

    Thank you, this helped me a lot... Very helpful.

  • @ticklecat6247
    @ticklecat6247 1 year ago +1

    Hello, thanks for this work, but I get this error: Traceback (most recent call last):
    File "/content/TTS/output/test-dataset/train_vits.py", line 72, in
    train_samples, eval_samples = load_tts_samples(
    File "/usr/local/lib/python3.8/dist-packages/TTS/tts/datasets/__init__.py", line 120, in load_tts_samples
    meta_data_train = formatter(root_path, meta_file_train, ignored_speakers=ignored_speakers)
    File "/usr/local/lib/python3.8/dist-packages/TTS/tts/datasets/formatters.py", line 162, in ljspeech
    text = cols[2]
    IndexError: list index out of range

    • @nanonomad
      @nanonomad 1 year ago +1

      Hi. Probably a typo in the dataset. One column may not be duplicated, or maybe was put in three times instead of two. I know manually making the datasets is absolute torture.
      I don't know when I can get it posted, but I'm working on a follow-up to this one that uses Whisper STT to autogenerate a dataset. The Colab script is a bit of a mess, but I can send it to you if you want. My email is on the About tab for the channel.
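      That IndexError usually means a line in metadata.csv doesn't have the expected three pipe-separated columns. A quick sanity-check sketch (the filename and column count are assumptions based on the ljspeech format used in this tutorial):

      with open("metadata.csv", encoding="utf-8") as f:
          for i, line in enumerate(f, 1):
              cols = line.rstrip("\n").split("|")
              if len(cols) != 3 or not all(cols):
                  # Flag rows missing a pipe, with an empty column, or with extra columns
                  print(f"line {i}: expected 3 non-empty columns, got {len(cols)}: {line!r}")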

    • @ticklecat6247
      @ticklecat6247 1 year ago +1

      @@nanonomad Thanks for the reply, I'll try to figure this out.

  • @ThugLife-is1yo
    @ThugLife-is1yo 1 year ago

    Yes, but it sounds very robotic. Why?

  • @tsunderes_were_a_mistake
    @tsunderes_were_a_mistake 1 year ago

    I would like to clone a Japanese voice, and going through Coqui's GitHub it looks like the Kokoro model has already been trained for it, so I assume in this case I have to fine-tune it. What value should I set for lr in this case, and roughly how many recordings would I need to get a good cloned voice? Is there any way to add emotions or make adjustments like Coqui Studio provides?

    • @nanonomad
      @nanonomad 1 year ago

      Coqui hasn't implemented any of the emotion stuff into their open source offerings last I checked, and they have been really guarded about how they're doing it. I'm not really following the project closely anymore due to the lack of progress on roadmap features and limited documentation.
      Re: lr, you can play around with it, but your data quality/amount will probably have more of an impact. The LR in the notebooks is probably too high for the batch size; 0.0001 or 0.00005 may be best with batch_size 16.
      Re: samples, there's no real way for me to give a definitive answer. It depends on the original model, phoneme vs. character training, batch size, and a whole mess of other things. I've had reasonable results with certain speakers/voices with 3-5 minutes, but some speakers just don't align with that little. Could be the result of anomalies and noise in the audio.
      Oddly, sometimes you can improve things by training another voice alongside the poor quality/limited sample voice and not using the weighted sampler.

  • @mir_intizam
    @mir_intizam 2 years ago +1

    Hello, I want to train a friend's voice and add it to my own voice assistant. How can I do that?

    • @nanonomad
      @nanonomad 2 years ago +2

      I'm not sure. I think it would primarily depend on what platform you're using for the voice assistant, because you'd be stuck providing the voice/model in whatever format it required or an alternate TTS server. Thorsten Mueller has a video on using Mycroft as a voice assistant and replacing the TTS service with Coqui TTS: ruclips.net/video/95kMhN4_LdA/видео.html
      Then you could use the custom trained model with Coqui to generate speech

    • @mir_intizam
      @mir_intizam 2 years ago +1

      @@nanonomad There is no TTS server for any platform in the Azerbaijani language; what I need is to speak in Azerbaijani, so I decided to make my own TTS model. Unfortunately I couldn't find any info on how to make my own TTS model.

  • @tautegu
    @tautegu 1 year ago

    I have some voiceover audio and I want to test whether I can clone and train using it. If I split the audio into sentences to create individual audio clips, will I also need to create individual caption files to accompany the clips?

    • @nanonomad
      @nanonomad 1 year ago +2

      Yeah, that's the workflow. Check my video list; there is a more up-to-date VITS video with a link to a better notebook. Whisper STT from OpenAI will make transcripts from the clips. There's a little cleanup code there, but you'll still want to proofread them for errors.

  • @vickyrajeev9821
    @vickyrajeev9821 1 year ago

    Thanks bro, how can we turn this tech into a SaaS product? Can you help me?

    • @nanonomad
      @nanonomad 1 year ago

      Unfortunately that's beyond me. You'd also need to check the licensing for the model tech, etc.; I wouldn't want anyone getting sued. Coqui is working on their Python API, so that might be a good place to check out. There's a link on their GitHub. They're also working on their own commercial offering with Coqui Studio, I think.

  • @disruptive_innovator
    @disruptive_innovator 1 year ago +1

    Audacity has a Spectrogram View.

  • @sweetapocalyps3268
    @sweetapocalyps3268 4 months ago

    Hi man, thanks for your useful tutorial. Not many out there. I would like to know how much time it took to fine tune the model. In my experience it requires almost an hour on a 4090 to run 400 epochs, so running 50K epochs as in your video would take a very long time 😅 Thanks in advance.

    • @nanonomad
      @nanonomad 4 months ago

      Yeah, this takes a long time. It's full parameter training. I have no idea how long this one took, but the multilingual tests ran for days to weeks. VITS was great at the time, but there's been a lot of development in TTS in the past 18 months or so.
      If you want a faster type of training, look at Tortoise. Or if you just want a quick voice clone, Bark will probably do a good job. I have videos for both on the channel

  • @aydabdioui6506
    @aydabdioui6506 1 year ago

    Thank you so much it was very nice

  • @JoutaiHenkou
    @JoutaiHenkou 1 year ago

    I've spent more than 5 hours on this now and it still doesn't work 🙁
    I have this error at the beginning of training:
    File "/usr/local/lib/python3.8/dist-packages/TTS/tts/models/vits.py", line 340, in collate_fn
    wav_padded[i, :, : wav.size(1)] = torch.FloatTensor(wav)
    RuntimeError: The expanded size of the tensor (1) must match the existing size (2) at non-singleton dimension 0. Target sizes: [1, 104496]. Tensor sizes: [2, 104496]
    Seems there's a mismatch between the dimensions of two tensors but I don't understand why...

    • @nanonomad
      @nanonomad 1 year ago

      Probably a broken audio file somewhere in the mix. Do you have extremely short (under 1 second) or long (over 9 seconds) audio clips? There are updated scripts linked in the newer videos, which may work better. I'll be posting the latest ones hopefully today.
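      A small sketch for checking a dataset against those limits (the 1-9 second range comes from the reply above; the mono check is included because the tensor sizes in the error look like a stereo file; the wavs/ path is a placeholder):

      import os
      import wave

      for name in sorted(os.listdir("wavs")):
          if not name.endswith(".wav"):
              continue
          with wave.open(os.path.join("wavs", name)) as w:
              seconds = w.getnframes() / w.getframerate()
              # Flag clips that are very short, very long, or not mono
              if seconds < 1 or seconds > 9 or w.getnchannels() != 1:
                  print(f"{name}: {seconds:.1f}s, {w.getnchannels()} channel(s)")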

    • @JoutaiHenkou
      @JoutaiHenkou 1 year ago

      @@nanonomad I tried the Colab from one of your latest videos and it's working. I had a couple of samples around 10 seconds, so that was probably the issue (I did all the preprocessing manually and then forgot about the length 😪).
      I still tried to do some of your automatic preprocessing (run_denoise: False, run_splits: False, use_audio_filter: True, normalize_audio: True) and it was not working. It worked with run_denoise: True, but the result was only 4 of my .wav files in the wavs folder instead of 29 🤔
      As I said, I've already done it manually so it doesn't really matter. So I skipped all the automated preprocessing and started a training run with all values at default except eval_split_size at 0.04 (I got an error with the default value, probably because I have only 29 files).
      After only 2000 steps the result seems encouraging. Though I'm not sure if the three batch size parameters at 16 are a good thing considering my small number of samples. I plan to try with many more samples later as long as I get a convincing result.
      Thanks for your help!

  • @CancelIFR
    @CancelIFR 1 year ago

    You would think there would be a repository of previously generated voices somewhere.

    • @nanonomad
      @nanonomad 1 year ago +1

      There may be a few on Huggingface, but I'm not sure if there's a central repo other than the official Coqui models. There's more sharing outside the English-speaking world, it seems. I've stumbled across a fair number of Korean and Japanese language sites sharing their models.
      After making a Duke Nukem model using samples from the games, I got a little uneasy about sharing models based on living people or properties. The model was too good, and Jon still makes money voicing Duke and doing Cameo. I'm sure his bank account is fine, but I'd feel bad if my typing words into a computer had any impact on his future.

  • @NathanKasper
    @NathanKasper 1 year ago

    Can you please include a copy of the Colab page? I'm sorry, I don't understand very much about code and I do not know what order these chunks of code go in.
    If there was a simple link to the correct updated voice clone .ipynb that you are using, that would simplify this. Can anyone tell me how to find or make a copy of the notebook he is using, without having to piece it together? Idk, help me, I'm very confused...

    • @nanonomad
      @nanonomad 1 year ago +1

      I'll see if I can figure out how to simplify it. If you want to try it as is: save those links as separate files named as indicated. On the main Colab page you can choose to upload a notebook - that's your .ipynb file. Upload the .py files to your main Google Drive folder.

  • @cliffordmcgraw8444
    @cliffordmcgraw8444 1 year ago

    Hey man, thanks for doing all this. I'm having trouble at the last step on the Colab link you posted in the description. I get FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/TTS/testingdataset/metadata.csv'. I changed the output path to output_path = "/content/drive/MyDrive/TTS/""" and dataset_name = "testingdataset"

    • @nanonomad
      @nanonomad 1 year ago

      If you move testingdataset to your Google Drive root, I think it'll work. In there, put the wavs folder and the metadata.csv; traineroutput will be created by the script.
      I think you have an extra subdirectory in there named TTS, and I don't have it set up to handle different directory structures.
      I have it cloning the GitHub repo to TTS just in case anyone wants to poke around the files, but that line is actually not necessary.
      The Coqui framework can just be installed using the pip install tts line.

  • @guillermoparracencio6902
    @guillermoparracencio6902 1 year ago +1

    Like it very much :)

  • @AndrewSolano
    @AndrewSolano 1 year ago

    Can Coqui handle question marks in the dataset? And if so, when recording your own dataset should you read those sentences as questions (with pitch changes) or maintain a neutral pitch/tone?

    • @nanonomad
      @nanonomad 1 year ago

      Coqui is the framework that ties everything together, so you can choose your own model; I was using VITS here, but I think most of the models handle inflection changes to some degree. I had good results training GlowTTS as well, but VITS is good at picking up tone and emphasis.
      If you have a really small dataset, I would say don't pitch up too often. It'll get over-expressed in your model. If you have a normal dataset, include some expressive talking, normal speech, even shouting if you feel like annoying your neighbors while recording. With a balanced dataset you won't end up with every sentence sounding like a California Valley girl with chronic uptalk, as long as things are properly transcribed.
      I tried training a model on a few clips from my videos, but I tend to have a pretty dry way of speaking, so the model picked it up too much and the whole thing was monotone.
      Check out the Coqui site, because there have been a bunch of new models and techniques in the past couple of months. I don't know if you can fine tune a "YourTTS" model yet, but the samples I heard a few months back were incredible.

    • @andrewsolano4957
      @andrewsolano4957 1 year ago

      @@nanonomad What would you consider a "normal" dataset? I'd really like the ability to inflect for questions, what would you suggest as a robust dataset size if I'm creating the recordings myself?

    • @nanonomad
      @nanonomad 1 year ago

      @@andrewsolano4957 Sorry it took me so long to get back to you - it's more about coverage than size overall, but I haven't been successful running any scripts to audit phoneme coverage. You can _probably_ get by with 10ish minutes of speech. I've had a MUCH easier time training male voices vs. female, but I'm not using ideal datasets.
      You want to cover all the sounds in the language - I've found reading something like The Great Dictator speech and the words to the song Modern Major-General are good scripts. The latter is fun to use as a test for the model. Though I should mention that you should use different train and test scripts, because the model will probably be good at parroting back the training data.
      You can browse the LJSpeech 1.1 dataset somewhere online and then get an idea of what samples and quality were used to train the original VITS model.
      The best example of reading I can think of would be Jon St John:
      ruclips.net/video/9SNy1LyrDjY/видео.html (nsfw)

  • @mariadanielasolisbarquero5375
    @mariadanielasolisbarquero5375 1 year ago

    Where to go to get the softs I'm thinking of?

  • @shkrystorm7316
    @shkrystorm7316 1 year ago +1

    Thanks bro, it worked well.

  • @phen-themoogle7651
    @phen-themoogle7651 1 year ago

    Hey, wondering if there's a more modern or easier way to do this now (for someone without a background in coding or without extreme patience), since technology has supposedly gotten insanely better with GPT-4 and everything in these last 9 months. So many sites charge money for voices...

    • @nanonomad
      @nanonomad 1 year ago +1

      Not that I know of. Coqui has their own paid and free online service options now, so I don't think much focus has gone into making things user friendly. I don't know if there's any sort of GUI out there; I looked a few months ago and didn't find anything.
      This is like taking up making sourdough bread as a hobby. You gotta be a special kind of weirdo to enjoy it.

    • @Shivam-nj9ly
      @Shivam-nj9ly 1 year ago

      @@nanonomad Now that I have the checkpoints from these steps saved in Google Drive as .pth files, how do I run inference from here, i.e. how do I get the cloned output voice?

  • @unpopularopinion1032
    @unpopularopinion1032 1 year ago

    Hey bro, great tutorial. I got most of it down, I'm just confused about where I need to upload the speech transcript that I edited with Notepad++.

    • @nanonomad
      @nanonomad 1 year ago +1

      Name it metadata.csv and copy it to the dataset dir. The wavs should be in a folder called wavs in the dataset dir. Once you start training, the script will make another folder called traineroutput where the model and json are stored
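      For reference, the layout described here looks roughly like this (directory names are examples; only metadata.csv and wavs/ are created by you):

      /content/drive/MyDrive/yourdataset/
          metadata.csv        (id|transcript|transcript lines, one per clip)
          wavs/               (the audio clips referenced by metadata.csv)
          traineroutput/      (created by the training script; checkpoints and config.json)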

    • @unpopularopinion1032
      @unpopularopinion1032 1 year ago

      @@nanonomad I thought so. Thanks so much bro.

    • @Shivam-nj9ly
      @Shivam-nj9ly 1 year ago

      @@nanonomad Can you give me a demo of the format used for the transcript file? I have used something like: "|transcript"

  • @Qwerty-cq1zw
    @Qwerty-cq1zw 2 years ago

    Could you please reply with the training scripts you used? They were cut off in the video.

    • @nanonomad
      @nanonomad 2 years ago +2

      Thank you for letting me know. Didn't intentionally want to force everyone to transcribe that. I've updated the description with the link: pastebin.com/6TBGzbQY which is the training script saved as train-vits-bg-colab.py

    • @Qwerty-cq1zw
      @Qwerty-cq1zw 2 years ago

      @@nanonomad Thanks!

  • @aiisnice1453
    @aiisnice1453 2 years ago +1

    Please release the Bill Gates models

  • @sullivanmagneron7504
    @sullivanmagneron7504 1 year ago

    Bro, you are a beast

  • @GodofNow
    @GodofNow 1 year ago

    Hey there, I love your project, it is a blast. Thanks for the information. But I have a question: I'm going to train this model in Google Colab just like yours, but can I save the model and run it on my local machine? Then how shall I do that? Please 🥺 help. Hope to hear from you soon and have a #GreatDay

    • @nanonomad
      @nanonomad 1 year ago +1

      Thanks for leaving a comment. I don't know of any simple point-and-click methods, because I've been doing this kind of stuff as a way to learn a little Python scripting. If you're on Windows, you can set up the command line version of Coqui TTS. It works for inference (generating voice samples), but I can't get training working. It works great if you use Linux or Linux within Windows through WSL, but setting that up is a bit more of a pain and a lot more downloading/disk space used if you're not going to be doing training.
      For Windows:
      This is off the top of my head, so I may be missing a step. Let me know if you try it and have any trouble.
      Download and install Anaconda or Miniconda from docs.anaconda.com/anaconda/install/windows/
      Install that, open a Conda session with launch link from the start menu
      Type conda create -n coqui pip git to create a new environment named coqui
      Activate it with conda activate coqui
      Type pip install tts
      Clone the Coqui TTS git repository with
      git clone github.com/coqui-ai/TTS.git
      cd TTS
      mkdir yourmodelfilesdirectory
      copy your model and config.json to yourmodelfilesdirectory
      tts --text "Generate some text" --config_path yourmodelfilesdirectory\config.json --model_path yourmodelfilesdirectory\checkpoint_whatever.pth --out_file blah.wav

    • @GodofNow
      @GodofNow 1 year ago

      @@nanonomad Thank you so much, you really made my day! 😌 Have a great day, friend

  • @manusantillan7099
    @manusantillan7099 1 year ago

    Bro, thanks so much!

  • @1hitkill973
    @1hitkill973 1 year ago

    Maybe it's too complicated for me. Maybe it's outdated. It's throwing me tons of script errors. I give up.

  • @deadshotgamess
    @deadshotgamess 1 year ago

    I was gonna ask, does the trained model get saved somewhere? So I can just load it without having to train it again in the future?

    • @nanonomad
      @nanonomad 1 year ago

      As long as your Google drive is connected, and the paths are set, the script is set up to put it in the dataset directory in a subdirectory called traineroutput if I recall correctly.

    • @deadshotgamess
      @deadshotgamess 1 year ago

      @@nanonomad Because what I'm trying to do is make a server with the model ready to be loaded with a trained file from Colab. Is that possible, do you think?

  • @DrVektor
    @DrVektor 1 year ago

    What kind of Windows 7 is this? Is it a special ISO? What is in there? Can you share?

    • @nanonomad
      @nanonomad 1 year ago

      It's just Win10 with one of those tweak packs to uninstall all the bloat and telemetry, if I recall; I probably used Privatezilla, but I don't think it works on new builds.

  • @hectordecastro1279
    @hectordecastro1279 1 year ago

    Hi, I cannot run cmd "yt-dp", can you help me?

    • @nanonomad
      @nanonomad 1 year ago

      I can't share links to it because YT considers it an 'ad blocker', but search for yt-dlp and you should be able to find their site and download links.

  • @brunofaleiros4150
    @brunofaleiros4150 1 year ago +1

    Anyone in Brazil studying this technology? Let's connect!

    • @michelangelo24
      @michelangelo24 1 year ago +2

      I'm looking to train a model from scratch in our language. Did you manage it?

    • @Nyx_z_
      @Nyx_z_ 1 year ago

      @@michelangelo24 How did it turn out? I've been wanting to learn how to make voices for a game for months.

  • @Sparkyh
    @Sparkyh 1 year ago

    Having TensorFlow errors when attempting to install. What do I do?

    • @Sparkyh
      @Sparkyh 1 year ago

      Total BS, after many edits TensorBoard didn't work - "Reusing TensorBoard on port 6006 (pid 5728), started 0:02:38 ago. (Use '!kill 5728' to kill it.)" It also says "No package metadata was found for TTS":
      Traceback (most recent call last):
      File "/content/drive/MyDrive/train_vits.py", line 5, in
      from TTS.tts.configs.shared_configs import BaseDatasetConfig
      ModuleNotFoundError: No module named 'TTS'

  • @bestaudiolibrary9374
    @bestaudiolibrary9374 1 year ago

    Can it clone in other languages like Greek or French?

    • @nanonomad
      @nanonomad 1 year ago

      Yes, but you'll have an easier job with languages close to English. I'd recommend looking at some of the work Thorsten Mueller has done training German: github.com/thorstenMueller/Thorsten-Voice and check out his RUclips videos too ruclips.net/user/ThorstenMuellervideos

  • @JaveGeddes
    @JaveGeddes 1 year ago

    You had me when you said you didn't want to use the cloud... then you did it anyway...

    • @nanonomad
      @nanonomad 1 year ago +1

      I was meaning more untethered, but quality TTS. Finally got the follow-up done, though.
      ruclips.net/video/e_DCb1XPWS0/видео.html
      Last half is on Linux. Copypasta command link at the bottom of the description.

  • @konstantinkoryashkin7757
    @konstantinkoryashkin7757 2 years ago +1

    Hello, I have a problem on the step "Fine Tune VITS model". What could be the reason?
    | > hop_length:256
    | > win_length:1024
    Traceback (most recent call last):
    File "/content/drive/MyDrive/ColabNotebooks/train-vits-bg-colab.py", line 98, in
    eval_split_size=config.eval_split_size,
    File "/usr/local/lib/python3.7/dist-packages/TTS/tts/datasets/__init__.py", line 111, in load_tts_samples
    meta_data_train = formatter(root_path, meta_file_train, ignored_speakers=ignored_speakers)
    File "/usr/local/lib/python3.7/dist-packages/TTS/tts/datasets/formatters.py", line 150, in ljspeech
    text = cols[2]
    IndexError: list index out of range

    • @nanonomad
      @nanonomad 2 years ago

      Probably a typo in the dataset CSV. Double check that it is:
      wavfilenamenoextension|transcript|transcript
      Missing a pipe character anywhere there, or maybe accidentally putting a transcript 3x, will throw that error.

    • @konstantinkoryashkin7757
      @konstantinkoryashkin7757 2 years ago +1

      @@nanonomad Started the training process. To follow the video quickly, you would need to add an example of a filled-in line (csv) - that turned out to be the most difficult thing; I tried many options and looked for examples on the internet, and there are different options everywhere.
      Thanks a lot!
      I have two questions:
      1. Can I later add audio files with (csv) for further training, and how do I do it?
      2. How can the model train the vocoder (MelGAN)?

    • @nanonomad
      @nanonomad 2 years ago +2

      @@konstantinkoryashkin7757 Writing the dataset transcript is very tedious. I had some code to audit the file and find any inconsistencies, but it was embarrassingly messy so I didn't want to make anyone try to use it. I'm not a coder, just a hacker, haha. When I do an update to this I'll try to get that fixed up so it's easier to error check.
      Re 1: Sort of. Things can get weird. It's like baking bread. You can add more to the dataset, and then 'restore' fine tuning like from the beginning, but substitute your own fine-tuned model. As far as I know, it'll end up retraining on the original data in addition to the new data. I think this is responsible for a lot of the feature loss and degradation in the output quality.
      I think the best results for me came from adding more to the dataset and then starting over from the beginning and restoring fine tuning from the original checkpoint.
      Re 2: All of the vocoders seem to have different training requirements. I'm not sure if MelGAN needed you to export mel spec files first or anything like that. I tried training a few before I got into fiddling with the VITS model, but my vocoder training results were all failures to some degree. If I recall, there are training scripts in the Coqui GitHub repo. VITS produced pretty good results without needing the vocoder trained, though. Glow was probably the second best so far.
      The Bill Gates voice was just a gag, but I've done models of a few celebrities using sound-booth-quality recordings and the results were a lot better. You may not need to do the vocoder if you can get enough clear audio.
      Let me know if you run into any trouble. I may not be able to help, but I'll try.

    • @angelachen8006
      @angelachen8006 1 year ago

      I also have the same problem. I double-checked that the metadata is not missing a pipe. Does each transcript have a word limit?

    • @konstantinkoryashkin7757
      @konstantinkoryashkin7757 1 year ago +1

      @@angelachen8006 metadata.csv format here:
      2121|And that's looking great.|And that's looking great.

  • @daylifedreaming
    @daylifedreaming 1 year ago

    So with this I can make my favorite celebrity sing whatever I want?

    • @nanonomad
      @nanonomad 1 year ago

      Sing, probably not, but you can probably make a decent enough sound-alike for speech.
      For singing, there are academic projects like VISinger and DiffSinger, but they're too complicated for me to implement anything.
      Another singing option is something like github.com/stakira/OpenUtau for working with Vocaloid-style voices

  • @goboulot
    @goboulot 1 year ago

    So gooooood crack

  • @user-hb3zm3nj8j
    @user-hb3zm3nj8j 1 year ago

    Can the audio be reproduced in Arabic?

    • @nanonomad
      @nanonomad 1 year ago

      The VITS model is really flexible, but I'm not sure if anyone has done a publicly available Arabic model yet. I don't think fine-tuning a pretrained model would work very well because the language differs a lot.
      There are a few posts on the Coqui GitHub discussion site from people trying to train Arabic models, so it might be worthwhile reaching out to them and seeing if they've made any progress.

  • @raxsgamer
    @raxsgamer 1 year ago

    How can I find my metadata.csv?

    • @nanonomad
      @nanonomad 1 year ago

      If you're using the script, it should be output to the dataset directory: /content/drive/MyDrive/whateveryourdatasetis/metadata.csv

  • @cypher4528
    @cypher4528 1 year ago

    Damn, just how in the world did ElevenLabs do what they did...

    • @nanonomad
      @nanonomad 1 year ago

      My guess would be something close to what TortoiseTTS is doing with, uh... I forgot what it's called, play.ht or something. There's a new commercial offering with Tortoise as the backend.

  • @ermander2241
    @ermander2241 1 year ago

    No program as .exe

  • @YourDrive26125
    @YourDrive26125 1 year ago

    This

  • @AshyerAndderiasDA
    @AshyerAndderiasDA 1 year ago

    this is tasty, good Crack

  • @parito5523
    @parito5523 1 year ago +2

    Nice video, however it would be much better if you avoided talking while also having other voices playing at the same volume in the background (for example at around 16 minutes, where you talk while also randomly playing the audio on your screen); it just creates an unintelligible voice mess.

  • @dabijaadrian8081
    @dabijaadrian8081 2 years ago

    But can this app speak other languages?

    • @nanonomad
      @nanonomad 2 years ago +1

      Yes, but the original models were trained in English, so it's a bit more complicated. You can train your own model from the start, but you'll need to assemble a large dataset and accurate transcriptions. Or you can do like we're doing here and fine tune the English model and try to force another language into it. This is going to work best on languages that closely resemble English, like French, German, Dutch, etc.
      If the phonemizer (Gruut or eSpeak) supports your language you'll be able to continue as normal, but if it doesn't, I'm not sure how to proceed. Check and see if the phonemizer knows how to work with your language before going too far.
      I've mentioned Thorsten's videos before (link in description), and he's been a big contributor to Coqui from what I understand. He did a German voice that you can use just like we download the other VITS model in the video.
      When the script runs that "tts --list" command in the video, that is the list of models that is included with the Coqui installation.
      There may be one that is related to/similar to the language you're interested in, and that could make fine tuning more successful.
      If you try to do it, let me know how it goes. I'd love to see your results.
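      If you want to check phonemizer coverage before committing to a dataset, eSpeak NG can list the languages and voices it supports from the command line, for example:

      espeak-ng --voices

      (then look for your language code in the list before deciding whether to train with phonemes)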

    • @dabijaadrian8081
      @dabijaadrian8081 2 years ago

      Ok, thanks for the help.

  • @aiisnice1453
    @aiisnice1453 2 years ago

    What do you think about "nuwave2"

    • @nanonomad
      @nanonomad 2 years ago +1

      This is the first I'm hearing about it, but the audio samples are really impressive. Would be cool to run the voice model output through it and see if it'll enrich it at all.
      If you mean the as-seen-on-TV induction cooktop - decent for what it is, I had to use a bunch of them for catering a while back. Good pan detect. A lot of induction burners are bad with that, and it's a problem when you're making 300 omelettes or whatever. 🥳

    • @aiisnice1453
      @aiisnice1453 2 years ago

      @@nanonomad I assume it will.
      That with a bit of denoising from Audacity seems great...
      It's not mainstream yet because, you guessed it,
      it's a 200 MB model, lol, so you can't run it in your browser or something, plus you need PyTorch.

    • @aiisnice1453
      @aiisnice1453 2 years ago

      I think it's PyTorch, idk.

    • @aiisnice1453
      @aiisnice1453 2 years ago

      @@nanonomad When are you releasing the trained models?

    • @FreeThinker0
      @FreeThinker0 2 years ago

      @@aiisnice1453 I think as long as the model size fits your Google Drive space it won't be a problem in Google Colab.

  • @iamrain388
    @iamrain388 1 year ago

    Julius Pringles thanksgiving ca early

  • @paulamontero3031
    @paulamontero3031 1 year ago

    god bless u xdd

  • @Iceberg_ice63
    @Iceberg_ice63 1 year ago

    +sub

  • @ZyronJeff_0915
    @ZyronJeff_0915 1 year ago

    Dave 84

  • @ismailmahboub4447
    @ismailmahboub4447 1 year ago

    again... yikes.

  • @mustafakharodawala7853
    @mustafakharodawala7853 1 year ago

    Donnelle Raeburn they have a free trial version

  • @alexkolstoe5551
    @alexkolstoe5551 1 year ago

    Hey, thanks for making this video! I tried other methods for days before I found it, but I am very glad I did! I was wondering, when continuing training, am I better off using the best_model_xxxxxxx.pth or the highest checkpoint_xxxxxxxx.pth? I suppose I don't know enough about the guts of machine learning to understand whether starting with the best model so far would get me better results faster, or if the further tuning of the parameters in the later-generation model provides advantages in training regardless of overall performance. Thanks again!

    • @nanonomad
      @nanonomad 1 year ago +1

      Best model will probably stop saving after a few cycles; it's mathematically doing its ML voodoo and thinks the output is as good as it gets. It usually happens early on with VITS, idk why, but I haven't really dug into it. Stick with the highest checkpoint number, but keep checking in and evaluating the quality of the output. You can overbake your model and the output quality will be lower. How long that takes depends on your dataset and training parameters, though.
      Hopefully I'll get it done today, but I'm working on a quick post about getting Coqui running on Windows using Conda. So far I can get inference running and it works as well as under Linux.

    • @alexkolstoe5551
      @alexkolstoe5551 1 year ago

      @@nanonomad Oh, that sounds great, I'll keep an eye out for it! I do have another question, though. When running the code block to continue training, I am getting a ValueError asserting that there is no checkpoint model at the path I am supplying, but I am looking at the file. Have you had this issue, or do you have any ideas? I feel like I am probably just being stupid, but I'm not sure what else to try.
      I genuinely apologize for my ignorance, and truly appreciate the help!

    • @chris7868
      @chris7868 1 year ago

      @@alexkolstoe5551 If YT comments will let you, can you copy and paste the few lines out of the big block it spits out with that error, or maybe post it on pastebin and then let me know the rAnDoM aLpHaNuMeRiC string that I can plug into the URL? It's probably the model path somehow. I'm guessing Gdrive is already mounted? Did I screw up the capitalization of the /content/drive/MyDrive/ bit? I'll be able to take a look at it in a couple hours.

    • @nanonomad
      @nanonomad 1 year ago +1

      I think it'll be fixed with this. Sorry for the trouble. Updated the link to the script for Colab. I gave the wrong command line for the script.
      For the restore run, when I mistakenly say continue_path, change it to:
      !CUDA_VISIBLE_DEVICES=\"0\" python /content/drive/MyDrive/train-vits-bg-colab.py --restore_path /content/drive/MyDrive/DATASET-DIRECTORY/traineroutput/TRAINING_RUN/CHECKPOINTFILE.pth --config_path /content/drive/MyDrive/DATASET-DIRECTORY/traineroutput/TRAINING_RUN/config.json
      Change DATASET-DIRECTORY to wherever you put your dataset, and TRAINING_RUN to that really long string of characters folder name that is located in the traineroutput folder (go to google drive, right click on that long folder name, hit rename, copy the filename to clipboard with ctrl+c, paste it in to the script command to save time - or use the file browse button inside Colab to do the same process). Do the same for the config.json path name. It will be in the same output folder for that training run.

    • @alexkolstoe5551
      @alexkolstoe5551 1 year ago

      @@nanonomad Thank you so much!!

  • @yusmabudiyanti4719
    @yusmabudiyanti4719 1 year ago

    Thank you so much it was very nice