another great video, Thorsten 👏 We have a happy update... you can now use unlimited audio for the 0-shot clone :D no longer are you limited to just 6 seconds. The HuggingFace space is still hard coded to max out at 30 seconds though... so we don't overload their servers 😆
You're very welcome and thanks for the update 😊.
This is great news! :D You probably should make another video comparing the quality differences between the 6 seconds and 30 seconds input audio! (or maybe more, if you can change that max value in the local installation) ^^ @@ThorstenMueller
@@juanjesusligero391 An audio samples comparison video with different audio input length is already in the making 😉.
Does the output sound better with longer audio? I tried the Japanese version on hugging face and output sounded robotic.
@@tsunderes_were_a_mistake In my German model I didn't notice a change depending on the text length, but I didn't check this specific aspect closely. If you think it would be helpful, I can give it a more targeted try (with a German model). I can't say anything about the Japanese model, though.
I was exactly like you, I also had too high expectations for Coqui XTTS, haha ^_^
While the outcome wasn't quite what I was expecting, the results are still quite impressive, especially considering they are based on just a 6-second sample. I was also really happy to read in the comments that the devs are working on improvements, like allowing for voice samples longer than 6 seconds.
I loved the video! Thanks a lot for your work, Thorsten! ^^
Thanks a lot for your nice feedback 🥰.
omg, the quality is so good compared to all the other voice-cloning TTS
Thanks a lot. I had been wanting to train a model for many days and was wrestling with various errors. This solved everything.
Great video! I'm really getting into TTS and it's so exciting to see what's possible now. It's incredible how something that needed hours of data a year ago can now be done in just 6 seconds. It's fascinating to watch this tech evolve
Thank you for your nice feedback 😊. I'm really curious to see where quality is going in the near future.
Great video, expectations after listening to the interview with Josh were high, but XTTS is still kinda new, so I am excited for the future improvements.
I'm excited too 😊.
Thank You Yet Again! P.S. In addition to "Scheiß Encoding" ... I am a fan of: "CAUTION I TEST IN PRODUCTION".
Nice one 😆
Nice tool, and big respect to the developer! I think the idea is great, although personally I couldn't do anything with this level of quality. But hey, for 6 seconds of input that's a great result, I'd say!
I can second that 😊.
Hi Thorsten, I am a computer engineer and an AI YouTuber myself (who isn't nowadays? haha :P). Just wanted to say that you make great tutorials on AI voice. I stumbled on this tutorial while exploring Coqui and it is the best tutorial I found. Thanks for taking the time to do these.
Also, a subscriber asked me for a resource on Coqui TTS tutorials on Reddit, and I shared your channel! Keep up the great work.
Hi 👋. Thanks for your kind feedback on my content 😊. You're right, we are not alone on AI content 😆.
Pretty much 95% of youtube and the working class are against AI lmfao but keep daydreaming
Thanks for the good explanation and clear example. I wish you prosperity and new opportunities. I apologize for my broken English.
Thank you for your nice comment. I wish you all the best, too 😊.
Using my local PC GPU: Cloned Voice WORKED WELL ... and ... sounded 'somewhat ' like me BUT actually BETTER than me ( bolder and stronger ) !!!!
this work for you actually??????
@@elplayeravefenix2280 Yes. Not very well, but it 'worked'. On other projects I have found that more voice samples work better, but that takes time. Ok.
Thanks for the informative video and interesting presentation.
Please make a guide on how to train a model on a custom dataset.
Thanks for your nice feedback 😊. This topic is already on my (growing) TODO list.
amazing video! I am wondering if it's possible to train a given voice and then just use that voice for future use. In the "clone your voice locally" section, the code requires the reference audio as an input. I'm thinking in terms of efficiency and that if you plan to use the same voice over and over, you shouldn't need to train the model each time.
Good question. I didn't think about that - up to now.
It has a clear English bias, but overall sounds pretty good.
Great! I was looking forward to this; I only got it running on Linux. Thank you for the tech support ;-)
😂 Maybe CUDA is exactly my problem on Windows 🤷♂
Thanks and you're welcome 😊. I'm happy if people find my videos helpful.
Sir, your explanation is very easy to understand.
Thank you, happy to hear that 😊.
Thank you very much. Very good video. German orderliness in everything!
Quite amazing that they can do this with such a short clip. I had the same results as you with English; it doesn't really sound like me even though I tried to speak my best English. :) - How would you compare it with Piper with regards to TTS performance? Of course Piper is quite difficult to train for new voices, but it's even free to use commercially. I wish there was some simpler way to clone voices with it; that would be golden. I have looked at your video on this, but preparing the training set seems like a chore.
Thanks for your comment 😊. I didn't compare the performance of XTTS and Piper TTS. If you want the best free voice clone, I'd go with Piper TTS right now, but the effort is higher - as you said.
Thank you for the super informative video! You're awesome!
Wow, thanks a lot for your nice feedback 😊.
Thanks a lot for your efforts. you are doing great work, keep it up.
Thank you a lot for your kind feedback - this keeps me motivated 😊
I really like your videos, even though many are unfortunately only in English. Could you imagine making a more general overview video on speech synthesis some time? Even after days of research, a layperson only partially understands it all. It would be great if a professional like you explained the following topics in some depth for those interested:
What exactly are/do Coqui,
XTTS, Tortoise, eSpeak / espeak-ng, and what is the difference to
MBROLA and its voices? (Can I use tts instead of MBROLA in scripts? Yes/no - how/why?)
Example questions about XTTS:
What is a multilingual voice as opposed to the Thorsten voice?
What exactly is voice cloning as opposed to voice transfer?
What are Coqui speakers and what do they do?
What is the difference between fine-tuning the XTTS model and simply
providing a speaker_wav reference?
Vielen Dank für deine tolle Rückmeldung und den Vorschlag 😊. Das Thema gefällt mir sehr gut. Wenn man sich so lange und intensiv mit einem Thema beschäftigt, dann werden diese "Grundlagen" irgendwie so normal, dass man gar nicht mehr drüber nachdenkt. Ich habe das Thema auf meine TODO Liste gesetzt. Besten Dank dafür 😊.
Could you help, please?
tts : The term 'tts' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1
+ tts --list_models
+ ~~~
+ CategoryInfo : ObjectNotFound: (tts:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException
Did you use a Python venv? Is it activated when you try to run the "tts" command? Does "pip list" show an installed TTS package?
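For anyone debugging this: a minimal check, run inside Python, to see whether the interpreter and the TTS package line up (a sketch, not something from the video):

import sys
import importlib.util

print(sys.executable)  # should point inside your venv directory
print(importlib.util.find_spec("TTS"))  # None means TTS is not installed for this interpreter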
I also experimented a bit with Coqui XTTS. I came to the conclusion that it's not worth it.
1. Coqui XTTS can't come close to the leading competitors in terms of clone quality.
2. In my opinion, Coqui XTTS isn't worth it for this quality at this price, if you also look at the competitors' quality and pricing!
Still, thanks again for your video, Thorsten!
What price? $1 per day for companies; otherwise it's free.
For anyone coming recently, the tts repo isn't maintained anymore according to an issue post on the github. It results in an error when running 'pip install tts'. This fork worked for me instead: 'pip install coqui-tts'
Thanks for that fork hint 👍🏻. Maybe it's an issue with a (too new) Python version.
That's exactly what we were looking for. Many thanks 👍
I'm very glad to hear that 😊.
Very well explained.
However, I had hoped the video would show not just how to create a single speech output, but how to save my own model so that it shows up under tts --list_models, or at least so I can pass it to --model_name.
Is that possible too?
Thank you 😊. The "--list_models" option shows information from the .models.json file in the repo. You could try adding your model to that file locally. So you've already trained your own model?
Hello, Thorsten! The title of the video doesn't really match its content. Unfortunately, I couldn't find in your video how to start the GUI for Coqui TTS. The title mentions XTTS, and I was hoping I could run the Gradio GUI shown at the beginning of your video. Too bad you don't have a tutorial on deploying that handy voice-generation GUI from the demo on a local machine.
Do you mean the Huggingface UI from the video?
@@ThorstenMueller Yes
A very, very good channel! 👍 I was wondering: what's the reason for the rather low sampling rate of 22,050 Hz in the ThorstenVoice dataset? Simply faster processing of the data?
Thank you for your great feedback 😃. In the tests there was hardly any audible difference in the audio output, but the computational cost at e.g. 44 kHz was noticeably higher.
@@ThorstenMueller Thanks for the info. Elevenlabs also asks for only 128 kbps MP3 for professional voice cloning and says no drawback is detectable. Very interesting how the AI processes that.
It's sad that they've discontinued the project.
Yes, and they didn't just discontinue the project: Coqui AI, the company behind XTTS, shut down entirely.
What is your Python version?
This seems very useful, but when I run "pip install tts", I get "Error compiling Cython file", and the operation breaks.
Strange, which Python version are you using?
I've been using Coqui for months and it's amazing that it simulates breathing at all, but breathing is typically the most distorted part of the generated audio, which can make it sound unnatural. I'm wondering whether removing the breathing from the source audio would improve the quality of the cloned voice, or whether the distorted breathing is just a symptom of the underlying model.
I've no idea whether this could work. Maybe it helps if you use audio tools to cut the breathing out of the recording you provide to XTTS. Or maybe there are audio filters in tools like sox or ffmpeg that can remove breathing sounds from the generated audio.
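One untested possibility along these lines: ffmpeg ships a noise gate filter (agate) that attenuates quiet passages such as breaths. A sketch calling it from Python; the threshold value is a guess you would need to tune:

import subprocess

# Attenuate quiet passages (e.g. breaths) below the threshold; tune to taste
subprocess.run(
    ["ffmpeg", "-i", "output.wav", "-af", "agate=threshold=0.05", "output_gated.wav"],
    check=True,
)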
Super explanation 👍 How can I clone my voice so that it reads whole texts to me, e.g. a PDF file or a Word document? Or is it limited to just 6 seconds?
Thank you for the praise - that makes me very happy 😊. As far as I know there's no ready-made solution for text/Word/PDF input, but in general you can generate longer output. You may have to split the input text, but certainly much more than 6 seconds is possible.
Nice video ;)
and ei gude, how's it going?
Is it better when the reference voice is longer than 6 seconds?
Or does it not matter, or is it even worse? 00:43
Ei subba, glad you like the video :)
According to my talk with Coqui AI co-founder Josh Meyer, the model is optimized for a 6-second audio input. Before trying longer audio input, try using other 6-second clips.
Did you make a tutorial on how to install and use CUDA?
No, not yet. But interesting idea. I've added it on my TODO list 😊.
I think it cloned his voice very well with the little data the AI had.
I'll be happy if it works well in Spanish with little input data.
This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
I'm getting this error - please, can someone help?
Hi, now that Coqui is shutting down, can we no longer use the model via the API? I'm having trouble using the model that way. For the import line "from TTS.api import TTS" I get "module not found".
Might be a problem with your local installation, too. Does "pip list" show a TTS package?
Hi Thorsten, it looks so easy when you do it. I installed and started Coqui via Pinokio, expecting to somehow reach this GUI locally. Pinokio says "running", but I can't find anything at the usual localhost addresses in the browser. There's also a "server" button; I pressed it and got the response: .........Connected! It all gives the impression that everything is running as it should... but for me the experience ends there, because I don't know where Coqui would show itself... a shame, really. Pinokio is normally a good entry point for non-coders.
Do you mean the GUI from Huggingface?
@@ThorstenMueller yes, I meant any GUI in general
Btw, how do I get the gpu parameter to work? I have a 3000-series GPU, but even if I set gpu=True it says CUDA is not available. I've also noticed that the voice cloned from my own speech sometimes comes out with a British accent and sometimes an American one (likely because my accent is neither). That makes it impossible to get consistent results. Is there some way to save a snapshot of whatever it decided "the voice" was and reuse that as input for subsequent generations? If not, it's quite useless, really just a fun demo.
Did you install CUDA and is it working? There are Python code snippets available to check whether CUDA is working.
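A minimal snippet of that kind:

import torch

print(torch.cuda.is_available())  # False means PyTorch cannot see a usable GPU
print(torch.version.cuda)         # CUDA version this torch build was compiled against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))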
Hello, good video! Do you know how to remove the character limit when writing?
Warning: The text length exceeds the character limit of 239 for language 'es', this might cause truncated audio.
Thanks for your nice feedback 😊. Hmm, not really. Earlier we sometimes ran into a "max_decoder_steps" limit that caused truncated audio, but I'm not sure whether that applies here too.
Can you please tell me what program you used to run the code at @15:28?
Sure, it's a code editor from Microsoft, called "Visual Studio Code".
What do you say about Applio TTS? Maybe the best open-source TTS?
I hadn't heard about Applio TTS. You're saying it's worth a try?
Is it possible to use AI even with texts in another language? I would really like to know because I want to dub a game with this tool.
I'm not sure about that. I'd recommend asking in the Coqui community, but as Coqui AI (the company) has shut down, I'm not sure how fast you'd get an answer.
Is there any way to push this trained model to Huggingface? Like, once we provide the audio sample and push the model to the Huggingface Hub, next time we'd only need to pass the text to generate audio in the respective voice?
Do you mean the actual model or a space to use the model out of the box?
What parameters do I need to include to make the audio output higher quality? It looks like only a 96 kbps bitrate.
Normally the generated output has the same sample rate as the voice dataset the model was trained on. Maybe you can use tools like ffmpeg to adjust the sample rate afterwards, but I doubt this will increase the quality.
@@ThorstenMueller I'd need to train my own model at 48 kHz so the output has higher quality
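For completeness, resampling existing output with ffmpeg looks roughly like this (a sketch; as noted above, it will not add detail the model never produced):

import subprocess

# Upsample generated audio from the model's native rate to 48 kHz
subprocess.run(["ffmpeg", "-i", "output.wav", "-ar", "48000", "output_48k.wav"], check=True)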
Can I hire you for a few hours? I need help with a project that’s deeply personal and I would like to go the local hosting route.
Feel free to contact me here (with some additional info). www.thorsten-voice.de/en/contact/
How can we get a faster response?... better hardware? RAM? processing? ... Thanks for the video!
First, you're welcome :). Do you use CPU or GPU? GPU (CUDA) gives a faster response.
@@ThorstenMueller Thanks for your response. Yes!... GPU, but my notebook is only for development... I need a better setup to process audio files for voice-cloning TTS
Can I use this voice to narrate a YouTube video?
Can we create a ready-to-use object instead of passing the "speaker_wav" list every time we generate "output.wav", to speed up the process?
I'm not sure, so I'd recommend asking in the Coqui community on GitHub. But as Coqui AI (the company) has shut down, I'm not sure how fast you'd get a reaction.
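If I remember the lower-level XTTS API correctly, the speaker conditioning can be computed once and reused, which is roughly what the question asks for. Treat this as a hedged sketch against the pre-shutdown Coqui code (paths are placeholders), not a tested recipe:

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts/config.json")  # placeholder path
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)

# Compute the speaker representation once and save it...
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["my_voice.wav"])
torch.save({"gpt": gpt_cond_latent, "spk": speaker_embedding}, "my_voice_latents.pt")

# ...then reuse it for later generations without the reference wav
out = model.inference("Hello again.", "en", gpt_cond_latent, speaker_embedding)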
Thank you for this video! I am running into problems. When I execute the script, it shows "AssertionError: CUDA is not availabe on this machine." But I have CUDA 12.3 and a compatible torch, and my other AI software runs fine. I have no idea what is happening. Please help!
Does it work if you use it with "use_cuda false" in general?
If I install XTTS on my computer, can I use unlimited characters? The demo version on Huggingface has a 200-character limitation.
Thanks.
This should be the case. The limitation is part of their Huggingface space and should not apply locally.
huggingface.co/spaces/coqui/xtts/blob/d3b67acd01a3f63524371ad7d35a044ac0e75f60/app.py#L200
@@ThorstenMueller Nice, I'm gonna try it. Thanks!
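For reference, running it locally through the high-level API looks roughly like this (a sketch, assuming the xtts_v2 model id from the Coqui docs):

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="A text well beyond the 200-character demo limit...",
    speaker_wav="my_voice.wav",  # your ~6 second reference clip
    language="en",
    file_path="output.wav",
)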
Has anyone made a comparison between XTTS and Piper training? I'm curious which gives better quality, @thorsten?
Personally I prefer Piper, but I trained my Piper models with way more input data than the 6 seconds XTTS gets.
Which is better, this or Tortoise TTS (Ecker voice clone)?
Hard to say, as I didn't give Tortoise TTS a closer look, but it's still on my todo list.
"ERROR: Failed building wheel for tts" - What version of python are you running?
This error often occurs when you use an older version of pip. Did you run "pip install pip setuptools wheel -U" before installing Coqui?
@@ThorstenMueller This may have been the issue. Played around with it a bit and got it working again, but can't recall exactly which thing I did differently. Thanks for the reply though!
If you're looking for content ideas, one thing I am struggling with is how this all fits together now, in June 2024. Specifically - when I start the server and hit the local webserver, I get a very different UI than what I see in other videos on XTTS. And I know there are all different UIs for XTTS - there's a fine tuning one, a web UI, RVC, etc. and some of them have bits that don't work, and it sounds like Coqui has abandoned the project now and... it's hard to catch up on it all when coming into it for the first time, and it changes so rapidly.
So I guess what I'm trying to figure out is - if I want to build an AI voice clone of me, today, what's the strategy/stack you recommend?
I have a question for you: if I wanted to pause for a number of seconds between sentences, how can I do that? Piper is really cool. Thanks.
Normally this is an aspect of SSML (Speech Synthesis Markup Language), which is not yet supported by Coqui or Piper. Maybe you can try a workaround and add multiple dots (....) to create a pause, but I didn't try it myself.
@@ThorstenMueller Thanks, I will try that.
@@ThorstenMueller Just tried it. I put dots where I wanted to pause, but it does not work. It only responds to one dot.
@@nomadhgnis9425 Okay, then a workaround might be to create multiple TTS wave files and merge them together with pauses in between. That's not optimal, but it could do the job.
@@ThorstenMueller I found a way. I am using Debian. I had to create a 3-second silent wav file, split the paragraphs into different wav files, and then merge them together with the silent wav where I need it. I did this with a bash script. So problem solved. Do you know where I can get more voice files other than the ones listed?
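The same approach in Python, for anyone who prefers it over bash (a sketch, assuming pydub and ffmpeg are installed; file names are placeholders):

from pydub import AudioSegment

pause = AudioSegment.silent(duration=3000)  # 3 seconds of silence
merged = (
    AudioSegment.from_wav("paragraph1.wav")
    + pause
    + AudioSegment.from_wav("paragraph2.wav")
)
merged.export("merged.wav", format="wav")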
Hello! I've been using this on hugging face for a few months, but today when I went to the page this error appears: Runtime error
Scheduling failure: not enough hardware capacity
Container logs:
Fetching error logs...
Any idea of what's happening? Thank you!
According to the error message, the XTTS container does not have enough compute capacity on the Huggingface platform. This might be a temporary problem, or it might relate to the shutdown of Coqui AI as a company.
@@ThorstenMueller Thanks for your reply! I hope it's not the latter; it's the only free online option that I knew of 😓
Hello, do you know why converting a text of about 500 words takes about 25 minutes?
I didn't try it with such long texts. Is it faster if you split the text into smaller pieces and join the chunks together after generation?
I have tried some voice cloning tools and provided my voice as reference audio, but none of the results sound anything like me... :( I have an Australian accent, but the generated voices come out with American accents; I'm not sure what I'm doing wrong.
I guess you're doing nothing wrong. The English model was probably trained on a voice dataset with hours of native English speakers, and one phrase doesn't have enough "power" to change the accent. Normally I'd recommend asking in the Coqui TTS community, but as Coqui is shutting down, it might take some time to get an answer because of other priorities.
Hi Thorsten,
I can't get it to run. I always receive "No module named 'TTS.api'; 'TTS' is not a package", even though the tts package is installed. pip lists it among the installed packages.
The few threads I found were no help. Maybe you have an idea?
This is strange. If "pip list" shows the tts package, then everything seems to be installed correctly. Are you really running your Python script in the right venv? Can you run "tts --help" successfully on the command line?
@@ThorstenMueller The tts command in the console works. tts --list_models too.
And yes, I am running the created venv.
@@ThorstenMueller I managed to get it running briefly using the setup from the git repo, but it only works in that terminal; after closing it, everything is gone. That's not a solution, because the setup takes too long.
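One common cause of "'TTS' is not a package" despite a correct install is a local file or folder named TTS shadowing the installed package. A quick check (a sketch):

import TTS

# If this prints a path inside your project instead of site-packages,
# rename the local TTS.py file or TTS folder that is shadowing the package.
print(TTS.__file__)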
I love your videos bro but you gotta speak a bit faster XD I have to play the video at 1.5x speed haha still love the videos!
Hehe, thanks for your suggestion. I'll keep it in mind for future videos. As a non-native English speaker I have to think a little while to find the right words 😆.
I got this error when I ran the wheel -U step: "ERROR: Could not build wheels for tts, which is required to install pyproject.toml-based projects". How do I fix that?
Did you update pip to the latest version first - "pip install pip setuptools wheel -U"?
Hey, I installed it via Pinokio because I couldn't get it running any other way. However, I don't know how to switch coqui-tts to GPU. Which file do I need to open? I'd also like to prevent the ghost voices. Do you know where I have to change something for that? I know it's possible, because I use a Telegram bot that works with Coqui and runs flawlessly, though with a strict character limit. Oh right, the character limit :D where can I change that, too? Thanks in advance.
The Coqui TTS models have a command-line parameter "--use_cuda"; with that, the GPU should be used. Regarding length, you can try opening the model's configuration file and increasing the value of "max_decoder_steps" (though I haven't tried that with XTTS myself). Good luck 😊.
@@ThorstenMueller Thanks, I'll try that tonight. Where exactly do I find the configuration file? Is it the configs.py in the TTS folder? Is there also a way to avoid the errors at the ends of sentences and in the gaps between sentences? A kind of ghost voice often appears there that sounds really strange xD
@@IngridUterus Did you find the config file?
@@ThorstenMueller Yes, I found a better option for coqui-tts that's much easier for beginners. I can only recommend it: Alltalk_tts
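For the "max_decoder_steps" tweak mentioned earlier in this thread, a heavily hedged sketch; the config path and whether XTTS honors this key at all are assumptions:

import json

path = "/path/to/model/config.json"  # placeholder: the downloaded model's config file
with open(path) as f:
    cfg = json.load(f)
cfg["max_decoder_steps"] = 100000  # hypothetical higher limit
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)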
I tried it on Huggingface with Japanese, but it sounded robotic. Can you make a tutorial on how to fine-tune XTTS locally?
Thanks for your topic suggestion. I've added it to my TODO list, but it might take some time.
I'm getting this issue where when I try to check for models this happens:
LLVM ERROR: Symbol not found: __svml_cosf8_ha
Anyone know what's going on here?
That's strange. Maybe recreate your Python venv and reinstall; there may be an error in your installation.
Can I still use this tutorial, since Coqui has shut down? Also, can I use it for cloning Urdu?
Honestly, I'm not sure about the future of XTTS (model, code and Huggingface space) because of the shutdown. But right now the code and space are still available, so it should still work as described; please let me know if you run into bigger problems.
Is this able to pull text from a text file? I have a Tortoise version that can do it, and it is helpful for long form text.
As far as I know this isn't supported yet. But finding a suitable solution for that is on my TODO list.
@@ThorstenMueller For some reason my reply keeps getting deleted. Anyhow, I run a local TTS that can pull from a text file. Maybe it will help you. It is by neonbjb on Github.
@john_blues You can totally read in one or multiple files via Python, transform the text as you like, and use XTTS to generate a synthetic speech audio file from it.
I'm currently using it to create a sort of audiobook from a fanfiction.
Removing the periods at the ends of sentences improved the result quite a lot.
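A compact sketch of that workflow (assuming the xtts_v2 model id; splitting on periods is just one naive way to keep each generation short):

from TTS.api import TTS

text = open("fanfiction.txt", encoding="utf-8").read()
# Naive sentence split; as noted above, dropping trailing periods helped
chunks = [s.strip() for s in text.split(".") if s.strip()]

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
for i, chunk in enumerate(chunks):
    tts.tts_to_file(
        text=chunk,
        speaker_wav="my_voice.wav",
        language="en",
        file_path=f"part_{i:04d}.wav",
    )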
How do I fix
"ERROR: Could not build wheels for tts, which is required to install pyproject.toml-based projects"?
ChatGPT cannot help me.
Is it necessary to downgrade Python?
Did you update the Python packaging tools in your environment, i.e. run "pip install setuptools wheel pip -U"?
It says CUDA is not available on this machine
I'm working on a video about CUDA. If you want, I can post an update here when it's online 😊.
@@ThorstenMueller I've solved the problem. The torch and CUDA versions have to be compatible with each other
@@orcunaicovers17 Happy you could solve it 😊.
Have you made a video tutorial on creating a voice model for Indonesian, or on how to add a voice model? I want to make an Indonesian voice model.
No. But as Coqui (the company) has shut down, I'm not sure about further development of their code. Maybe it's worth taking a look at Piper TTS for training an Indonesian TTS model. ruclips.net/video/b_we_jma220/видео.html
Can I use a big text dataset?
In which context? Fine-tuning?
Love this channel 😊😊😊
Thanks a lot 😊
Amazing result.
Can you also show how to fine-tune it, but locally? Thanks
Thanks for your topic suggestion 😊. I've put it on my TODO list.
@@ThorstenMueller There's now also a web UI for fine-tuning on GitHub 🙌 It works quite well. The only remaining problem is the settings (temperature and co); I experimented for hours and sentences always get skipped.
For some reason my terminal doesn't run in the venv.
Could you create the venv successfully and just can't activate it, or does creating it fail?
@@ThorstenMueller The venv was created just fine, but I couldn't open a terminal within it.
@@TNMPlayer That's strange. Do you use the .bat or the PowerShell (.ps1) file to activate the venv?
@@ThorstenMueller I used the .ps1
@@TNMPlayer Maybe try the .bat version; that could make a difference.
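For reference, the usual Windows activation commands (assuming the venv folder is named venv): "venv\Scripts\activate.bat" in cmd.exe and "venv\Scripts\Activate.ps1" in PowerShell; the latter can be blocked by the PowerShell execution policy, which may explain the difference.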
Can I write code to clone an Arabic voice and read Arabic text?
I've no experience using Arabic with XTTS. Did you already try it using their Huggingface space?
Coqui TTS is shutting down?
Sadly, yes. I've made a short about it. ruclips.net/user/shortsQMruRTlQu7I?si=JyDY8ziFJC8omAPY
Is there a TTS for Android?
As far as I know there's currently no support for Coqui or Piper TTS on Android. But it would be really cool 😎. Did you already ask in their communities?
Hello Thorsten,
great video tutorials, but XTTS is not for me. No support for Windows, and there never will be. No chance on older Macs with Nvidia cards because of missing drivers. No support on Linux without CUDA. I was really looking forward to this, but I simply don't have the time to fiddle around for days or weeks. Thank you.
--- Hello, please do the tutorial again in German, because that would really interest me, but I don't understand a word of English.
Hello, would the automatically translated German subtitles help you for a start?
@@ThorstenMueller I always have them turned off, because I can't follow the video while reading. So that doesn't really help me.
Not for commercial use. We need a truly open solution.
Yeah, it's a shame it's not 100% open. Fortunately, we'll always have Tortoise TTS :)
Who cares? It's not like they're going to sue you if you do.
@@chryseus1331 They could, though. If you have a company and want to use software commercially, I wouldn't recommend ignoring its license.
Cloning a voice from a sample of just 6 seconds, even if it's not 100% identical, is to me an AI that still really needs to improve. The AIs that need dozens of hours to clone a voice didn't interest me much. I ran several tests using samples longer than 30, 60, 80 seconds in various languages, and some were perfect. I also copied dozens of voices available on websites, and those results were very good too. I suggest saving each generated audio to a different file, because the generated audio will never be the same as the previous one.
Josh Meyer (co-founder of Coqui AI) mentioned in my XTTS interview that a 6-second audio input should be perfect for the XTTS model. ruclips.net/video/XsOM1WZ0k84/видео.html
Thank you!!
Is TTS available on Python 3.12?
According to their README, Python 3.11 is the maximum supported version. As Coqui AI has shut down, I'm not sure if or when this will be adjusted to a higher Python version.
"all you need is 6 second audio" is just nonsense. It is not enough and the result is miles away from anything close to the original.
I agree, at least on my personal tests with my foreign (german) pronunciation. The result has been far away from being a high class voice clone. Have you seen my interview with Josh (Coqui AI co-founder)? ruclips.net/video/XsOM1WZ0k84/видео.html)
Oh no, the video is out of sync.
Do you clap your hands when recording?
No, but thanks for the idea of optimizing video/audio sync by clapping 👍.
Is a GPU a must?
Generally (not sure about XTTS specifically), CPU might work, but way slower than using a CUDA-enabled GPU.
If I want to clone my own voice, do I need to train this? How? @@ThorstenMueller
@@רחלישדה-ה4מ I'd recommend taking a look at Piper TTS for that. ruclips.net/video/b_we_jma220/видео.html
Thanks! @@ThorstenMueller
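Related to the GPU question in this thread, a small defensive pattern (a sketch; gpu=True simply fails when CUDA is absent):

import torch
from TTS.api import TTS

# Fall back to CPU automatically instead of asserting on missing CUDA
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=torch.cuda.is_available())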
This is only interesting to developers and programmers. Regular hobbyists will find this video useless, because Coqui has no GUI or server.
If you run Coqui TTS locally, it does have a simple web UI where you can synthesize audio.
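For reference: the locally installed package shipped a "tts-server" command that serves this UI, e.g. "tts-server --model_name tts_models/de/thorsten/vits", reachable by default at localhost:5002 (details as I recall them from the pre-shutdown releases).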
Coqui is now dead
Sadly yes, at least the company. Let's see what happens with the code and community.
Wow!
Unfortunately, without a UI this is a damn nightmare for anyone who isn't a programmer.
Well, then just use the UI! xD
You can use Pinokio; with the automatic installation it has the web UI from Huggingface.
@@ratside9485 Unfortunately I always get an error message during the installation.