If you run into any issues or have any ideas, please open up a new issue here: github.com/JarodMica/audiobook_maker/issues
Try to make it as descriptive as possible if it's an issue and the same goes with improvements.
You need to add E2-F5-TTS imo.
Defo I actually watched hoping it was using E2-F5 TTS!
Mate, just want to say I have been following you for some time and really appreciate your tutorials on AI voice cloning/TTS. Probably the best out there for this niche.
Appreciate it :)!
I was just following your old tutorial when I checked your channel and saw this. Good luck in all your future endeavors. Funnily enough, one of the things I will be using your audiobook maker for is to turn the Re:Zero web novel into audiobooks.
Ayee I approve the choice :)!
Exactly what I wanted to do too 😅
When I try to install, the Git commands fail in CMD. Git is installed, I checked, but it says nothing and just continues. When I paste that path, it says it can't find the path in tortoise. I hit the problem at 9:38 here: the git submodules are not installing.
(venv) C:\Users\fylte\Desktop\audiobook_maker-master\modules\tortoise_tts_api> git submodule status
(venv) C:\Users\fylte\Desktop\audiobook_maker-master\modules\tortoise_tts_api
This is the response I got (the command returns nothing).
Thank you so much for your work. This project is just amazing. It would be cool to have the option to export it as an M4B file, instead of an mp3, or to have the option to export every chapter as a separate audio file.
I bought you a coffee for the audiobook maker! Thank you so much for this.
This is exciting stuff. I'm more than happy to in effect pay once for the project if it's then onwardly supported/developed. ;)
Sweet, thanks for the update. I cannot wait until there's some AI agent which can parse the different characters in books so we can feed it into this.
This is a great tool for the project I need it for, which is a short story. I cannot wait to use it. I just need a tool to help clone and create voice models to read the story as different characters.
I'm looking for some kind of immersive reader with a decent text-to-speech system that also highlights the words, so the text is there as support if you want to follow along. Any suggestions for this?
Probably speechify tbh?
Balabolka checks all of your needs. If you set it up correctly, you are good to go for 100 years. (Speaking from experience) 😂
(Just get a decent Natural voice, or see how to use the Edge Natural voices in Balabolka.)
Read Aloud using Edge web browser
Is that Kenjiro Tsuda speaking English? So cool, like, 99% clone 🤯
yup!
Hi Jarod, are any foreign languages available, like French? Thanks.
For each of the open source engines (tortoise, xtts, styletts, f5tts), whatever languages those support will be supported. This includes custom models that a user may have trained.
@@Jarods_Journey That means we can add text in Spanish (my interest) and that will do? I love your work, and for sure I will buy you a coffee! Thank you very much for this!
Thank you for creating this 🥳🎉 Is it only my impression, or is RVC functioning worse than in the cloner? Most of my models don't give the genuine voices they used to. Is there a way to adjust this?
Incredible project and amazing achievements tbh, congrats man. My only issue is that no matter what model I choose, my voices always end up super dark pitched (like Sauron lol). Any clues as to why? I've played around with pitch and pitch methods to no avail. Tried over 4 custom trained models. EDIT: This only happens with RVC enabled. EDIT2: I feel so stupid, it was the sample rate I had to change. Cheers!
Yeah, it's currently a small bug in the RVC library! I'll have to fix it, but SR can be lowered to 0 to resolve it for now.
Great, will buy later.
On the text files, it would make sense to allocate voices in there.
E.g., if generating from AI, ask it to use the format:
V1: audio text1
V2: audio text 2
V1: audio text3
Then these would auto-map to the selected voice indexes.
E.g., if the first voice is me, all lines with V1: will use this voice.
This would save a lot of time versus manually selecting each voice per line.
Even if the story has already been written, OpenAI could reformat the text.
Yeah, I'm thinking about how to incorporate it. I could support a custom speaker import option, but I have to think on how I want to make this option available in the audiobook maker
@@Jarods_Journey On the voice selection per line: loads of different colours look very confusing. Have a separate column simply with the speaker name and/or an image of the speaker.
PS For anyone else... pay the $14.99, it just worked. No spending hours setting up environments and pip installing for ages.
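The "V1:"/"V2:" prefix idea suggested above could be sketched roughly like this. This is a hypothetical illustration only: the prefix format, regex, and voice names are made up and are not the audiobook maker's actual behavior.

```python
import re

# Hypothetical mapping from script prefixes to voices chosen in the app.
VOICE_MAP = {"V1": "narrator", "V2": "alice"}

# Matches lines like "V1: some text" and captures the prefix and the text.
PREFIX = re.compile(r"^(V\d+):\s*(.*)$")

def assign_voices(lines, voice_map, default="narrator"):
    """Return (voice, text) pairs; unprefixed lines fall back to the default voice."""
    out = []
    for line in lines:
        m = PREFIX.match(line.strip())
        if m and m.group(1) in voice_map:
            out.append((voice_map[m.group(1)], m.group(2)))
        else:
            out.append((default, line.strip()))
    return out

script = [
    "V1: It was a dark and stormy night.",
    "V2: Who goes there?",
    "V1: Only the wind.",
]
print(assign_voices(script, VOICE_MAP))
```

Each resulting pair could then be routed to the corresponding voice index instead of selecting a voice per line by hand.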
I will be purchasing the package to practice using it soon. I would like it to have a language selection option, not only for the entire audiobook, but for some sentences as well. I am interested in Latin Spanish, and with variations of accents, for example Argentine. Would you add this functionality to this project? Thank you very much again for all this incredible work.
I have a use case where I have a database of around 1000 different lines or paragraphs. Is there a way to just jump to a specific line and play it (maybe through some sort of API or through the interface?) and specifically map the entries to specific labels? (Not necessarily 1-1000, but maybe with some numbers skipped, or even labels like A1, A2, etc.) Think choose-your-own-adventure; that's close to the use case I would try this for.
I'm not quite sure I understand the use case here, but there's only a scroll bar in the table right now that you can use to go up and down. Custom labeling other than speakers is not supported atm.
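For a use case like the one asked about above, a rough user-side sketch of mapping custom labels (A1, A2, B7, ...) to generated audio files might look like the following. Everything here is hypothetical: the labels, file paths, and helper names are illustrative, and nothing like this exists in the app itself.

```python
from pathlib import Path

def build_index(entries):
    """entries: iterable of (label, audio_path) pairs -> lookup dict.

    Labels don't have to be contiguous numbers; anything hashable works.
    """
    index = {}
    for label, path in entries:
        if label in index:
            raise ValueError(f"duplicate label: {label}")
        index[label] = Path(path)
    return index

def lookup(index, label):
    """Return the audio file for a label, or None if that label was skipped."""
    return index.get(label)

# Illustrative paths only; a real index would point at the generated files.
idx = build_index([
    ("A1", "out/line_0001.wav"),
    ("A2", "out/line_0002.wav"),
    ("B7", "out/line_0042.wav"),
])
print(lookup(idx, "B7"))
```

The returned path could then be handed to any audio player, which would give the jump-to-label playback described in the comment.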
So can I download this audio after it is done and upload it to my phone to listen?
Yup! It's all yours so do with it what you will
Great project!
Could you please explain how the RVC settings and the Tortoise settings are different? I put my RVC model in the settings and check "Use s2s Engine," but the result is still the random voice from Tortoise.
That's really impressive. I couldn't watch the full video yet, and maybe you talk about it inside, but did you have time to include the E2/F5 TTS voice cloning app you showed in one of your videos? Their "podcast" option needs the text formatted with the name of each speaker as the first word of their sentence, like "speaker1: ..., speaker2: ...", and then you give them a 10-second audio sample for each voice, and they do what you showed at the start of the video. Really impressive. But I've only tried English; as they say, it works with English and Chinese. I haven't yet tried Japanese or French, and I'm not sure it would work great for me. I also don't know how to train a voice with their tech.
F5 will be included in the audiobook maker; other people seem hard at work adding more languages for it right now.
Let's say I am not happy with how the Narrator is saying a sentence and its emotion. Can I use my own voice in combination with the Narrator's voice to improve the emotion with which it says a sentence? How can I do that?
Amazing Tool btw! Love your content
Hmmm, why would I need to purchase the install package when someone submitted a pull request with an open source installer on your GitHub?
I think if you run various software on a cloud machine like Google Colab, Lightning AI, Kaggle, etc., then everyone will have the opportunity to use it, because not everyone has a PC with a high-end configuration.
It's possible to outsource the generation to cloud compute, but unfortunately, I don't wanna play around with making an application compatible with cloud machines as I'd have to maintain it and I personally don't use much cloud myself. I'm a big fan of having things locally and as open source gets better, models also get more efficient.
I thought I recognized that first voice. So much more familiar speaking in japanese lol.
If you've watched any anime in the past 5 years, you'll have encountered him lol
@@Jarods_Journey Yeah, ever since he showed up in "Demon Lord Retry" I've been seeing him in literally every anime.
Getting a lot of errors trying to install the RVC files. I bought the packaged files and it seems I don't have something right. Please help.
Hey David, please open up a new issue on the GitHub issues tab and share the error that you're getting in the terminal so that we might be able to figure out what's going on: github.com/JarodMica/audiobook_maker/issues
When I restart this project it shows "Configuration file/tts_config.json not found"
But
Hey, why not use ChatGPT's advanced voice and then switch the voice later with ElevenLabs?
Expensive?
@@mucool328 like $20?
@@mucool328 it's only $20.
@@mucool328 To create an audiobook, I'd spend about $50 to get better quality.
@@mucool328 Are you serious?
Hey, is it possible for the program to automatically select different voices for a txt e-book, rather than just dialing them in manually?
Imagine writing a book then.
Yes, a proof of concept has been shown with ChatGPT's ability to label sentences. But I need a specific format, so I'm working through some ideas on that.
Still, can you make it easier to use?
Great! I wish I could use it in German. Future update for multilanguage maybe?
Possibly! XTTS would support that, I believe, but that one is the last engine I'll be adding in.
@@Jarods_Journey That would be brilliant. I write short stories for my nieces (3 and 6 years old) and have already recorded several of them. Unfortunately, I am increasingly lacking the time for this, which is why I have been watching your videos about the audio book maker for a long time.
I have an Nvidia GPU but how important is the 8gb of vram? I have a GTX 1660 super which has 6. Will this just not work?
That should be able to work, you might top out though if using tortoise TTS. If you're familiar with styletts2, when I release the engine for that, it should be able to inference on that without issues.
For some reason it's not finding tortoise when I try to load voices
My question is what's the limit per word, because with ElevenLabs it starts breaking down past 800 words.
Give or take 15-20 seconds max for a tortoise tts example, 20-30 seconds with styletts, and up to 30 seconds with f5tts. Not too sure on the breakdown when it comes to words though.
Where do I see the PC requirements for running this?
Nvm, I'll give it a try with an old GTX 1070 8GB. I don't need to generate that much anyway.
Question: I only have 6 GB of VRAM but 64 GB of DDR5 system RAM.
So will it work on my system, or does it only use VRAM? 🧐
1. It's a laptop, not a PC, so no GPU upgrade 😭
2. I run a 40B-parameter LLM model on it without any hiccups (70B at 40 wpm), and it works because when VRAM fills up it utilizes system RAM.
So will this work the same way, i.e. use system RAM after VRAM is completely utilized?
6 GB of VRAM should work; I think you'll just be topping out a bit with tortoise tts. I think all of the engines I'm planning on adding can inference with at most 4 GB of VRAM needed. It will overflow to RAM though if VRAM gets completely utilized, afaik.
@@Jarods_Journey Thanks man! I'm currently doing LLM training on the laptop, and all the TTS models I've been training are given input only in IPA, not typed text, so I'm getting better results in terms of pronunciation. But training a voice from an audio clip is not working due to the VRAM limitation, so I've upgraded my physical RAM to compensate for it. Hope it works.
Why is each line read in a different voice? There seem to be 2-3 voices and each line is read by one of them.
In the video? Well, I selected them. If you're running with random, it will change voices as well.
@@Jarods_Journey I don't understand how to not run it on random. With one narrator and nothing added, the tortoise panel doesn't allow any option other than random. How can I set it up so every sentence is read in a single voice?
@@zanshibumi did you find a solution? Having the same problem
@@akum4501 No. I assume it's an inevitable consequence of generating each sentence's audio independently.
So a 4gb Nvidia won't cut it, right?
Possibly, I don't think you'll be able to do it too well with tortoiseTTS, but when I finish the styleTTS2 engine, 4gb would be fine
Any chance this runs on AMD RX 6800 XT?
Unfortunately not, AMD support is limited on most of these engines and as well, I don't have AMD to test on either. Sorry!
Thinking of purchasing; two quick questions. Are you planning to implement E2/F5 TTS at some point? It's way more expressive! Also, will it work on Apple Silicon? (I'm on an M1 chip!) Thanks for a great project!
E2/F5 will be implemented soon, currently finishing up styletts then I'll work on that.
Unfortunately, no Mac support atm! It may work if you hack around, but I don't have a mac and haven't tested that use case.
I've accidentally paid for this on buy me a coffee page but I'm a pay monthly user. Can I be refunded the $14.99 please?
Which languages will it support?
If you're familiar with these open source engines, it supports whichever language your chosen engine supports. The parser is designed for English right now though, so English has the best compatibility.
I've been using this with the new F5TTS engine, and it captures the person so much better for some people. You basically have to enable "use duration prediction model?" to get the speaker to nail the sentence at a normal pace, but if the sentence is too long it starts skipping words... I never experienced this with the gradio demo they released. I also thought we would escape the issue of long wav files having to be converted for every new sentence. Luckily I was storing all the voices in that same folder, split into 30-second segments, so I just needed to rename the folder to F5TTS. One question I do have: how do I re-enable deepspeed for tortoise? Is it as simple as uninstalling 2.4 and installing pytorch 2.3? Is it even worth it?