Comparing 2 Data Curation Methods for Training AI Voice in RVC

Jarods Journey

9K views

Published: Aug 23, 2024

Comments • 58

@Samuel-wl4fw 8 months ago
I think the noise truncation is really useful! It might benefit from having a video about this itself or mentioned in the first data curation video! I am making my own Melina dataset to test with and the video I use as a reference has a lot of silence after applying UVCR on it.
@Codename0_rvc 4 months ago ⁺²
I think there's a little misconception here.
It is not a matter of splitting vs. no splitting, it is a matter of how those segments come out.
You see, when you leave it unsplit (let's exclude truncating for now),
you have practically no control over the gaps and so on.
HiFi-GAN likes samples to be as uniform, meaning as stable, as they can get. The more consistent they are, the better.
Currently, as long as one's hardware supports mixed precision training (AMP, so fp16), it'll be using samples that are 3 seconds in length
(it then internally slices them into mini-segments, but that's another story).
In fp32 mode, however, so full precision, it uses samples that are 3.7 sec in length.
Generally, if possible, you want to make your datasets fully supervised and either split them into 3 sec .wav files
OR
leave one big uni-sample (not split), just making sure you have 3 seconds of singing / speech content with a gap of 400+ ms between each
(that's how RVC's slicing knows it's another segment and not a part of the previous one).
tl;dr: RVC, or rather HiFi-GAN, likes consistency. Context matters, so an unsupervised dataset has a higher chance of having its context broken in various places.
Pretty sure you know this by now, Jarods, but I'm leaving it here for anyone who stumbles upon the video.
@GBudgiePH123 25 days ago
Should each segment in the "big uni-sample" be exactly 3 seconds or 3.7 seconds long?
@Codename0_rvc 24 days ago
@@GBudgiePH123 Just fuse all samples or whatever you have into 1 big / long sample and that's all. RVC will handle the rest
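The "one big uni-sample" layout described in this thread can be sketched as follows. This is only an illustration of the idea, not RVC's own code: the function name `fuse_with_gaps` and the plain-list audio representation are assumptions, and the 400 ms default simply follows the figure quoted in the comment above.

```python
def fuse_with_gaps(clips, sample_rate, gap_ms=400):
    """Concatenate mono clips into one long sample with silence between them.

    clips: list of lists of float samples (hypothetical in-memory format).
    gap_ms: length of the silent gap; 400 ms is the value quoted in the thread.
    """
    gap = [0.0] * int(sample_rate * gap_ms / 1000)
    fused = []
    for index, clip in enumerate(clips):
        if index > 0:
            fused.extend(gap)  # silence only between clips, not at the ends
        fused.extend(clip)
    return fused


# Two 3-second clips at 40 kHz joined by one 400 ms gap.
sr = 40000
three_seconds = [0.1] * (3 * sr)
fused = fuse_with_gaps([three_seconds, three_seconds], sr)
print(len(fused) / sr)  # total length in seconds
```

In practice the clips would be read from and written back to .wav files; that I/O is left out here to keep the sketch self-contained.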
@fizskip9136 11 months ago ⁺²
Great video as always, I was wondering if you know any way to separate voices, for example if there are two or more people talking. Would love to see a video on that!
@Jarods_Journey 11 months ago ⁺³
Thank you! That'll be speaker diarization, check out the video on my channel that talks about creating the perfect dataset. I have a script that uses this technique.
@xDiKsDe 11 months ago ⁺²
Thanks for your work and sharing your experience!
I am just a bit confused. At 2:45 you said "For me, I'm going to continue splitting the audio file and making datasets that way. However, I'm not going to shorten the clips to 10 seconds or less any more, I'm just going to let RVC split it itself and allow RVC to handle those longer files, since I now know it will not run out of memory."
So you are going to split the long audio files into shorter ones, but you are also letting RVC split the already split short ones again?
@Jarods_Journey 11 months ago ⁺²
Correct, RVC is going to forcibly split the audio samples regardless. However, before I was forcibly cutting audio samples into 10-second samples, which could result in splitting them mid-sentence. Now, the audio files just follow the transcription timing from WhisperX.
Anecdotal of course, but the split audio dataset sounds better than the long audio file, so that is why I am going with this method.
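Cutting along transcription timings instead of a fixed 10-second grid can be sketched roughly like this. The segment layout (a list of dicts with `start` and `end` in seconds) is assumed here for illustration, and `cut_by_segments` is a hypothetical helper, not the script from the video.

```python
def cut_by_segments(samples, sample_rate, segments):
    """Cut one long mono recording into clips that follow transcript timings.

    samples: list of audio samples.
    segments: list of dicts with "start" and "end" in seconds (assumed layout).
    """
    clips = []
    for segment in segments:
        start = round(segment["start"] * sample_rate)
        end = round(segment["end"] * sample_rate)
        if end > start:  # skip empty or inverted segments
            clips.append(samples[start:end])
    return clips


# Example with a fake 10-second recording at 1 kHz.
recording = list(range(10000))
timings = [{"start": 0.5, "end": 2.0}, {"start": 3.0, "end": 3.25}]
clips = cut_by_segments(recording, 1000, timings)
print([len(clip) for clip in clips])
```

Because the cut points come from sentence boundaries, no clip ends mid-sentence, which is the difference from the fixed-length cutting described above.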
@JR-bn3ev 8 months ago ⁺¹
But shouldn't the dataset consist of a vocal file in MONO???
@cleatersv 11 months ago ⁺¹
I think the separated version is much better, and will continue using it as well. For the whisperer tool you mentioned, is it the audio splitter you made (the one that removes silence and separates files that are over 10 sec)?
@Jarods_Journey 11 months ago ⁺²
Yep, that's correct! However, I wouldn't recommend using the audio shortener, the one that truncates audio to 10 seconds or less as that may produce undesired clips
@cleatersv 11 months ago
@@Jarods_Journey thanks! I'll try the one without the shortener. I was still trying some things using the initial instructions you made.
@miguelangel-nj8cq 11 months ago ⁺⁴
I have always wondered whether I should also normalize the sound. For example, video game characters have dialogue in which they scream, get excited and use a great variety of voice tones. That is very different from your dataset, which looks very drab. Besides trimming silences with Audacity, is it worth using the Normalize Audio option to avoid spikes caused by shouting or loud dialogue? Or should they stay natural? Should I do some other transformation?
@benman36 11 months ago ⁺²
I'm wondering that too.
@Jarods_Journey 11 months ago ⁺²
As long as your audio isn't clipping, I think it should be fine. I'm not 100% certain on this, but I believe it gets normalized in RVC. You might want to listen to the files in the 0 and 1 folders to see how they sound. (One of them is downsampled, so it'll sound lower in quality.)
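A quick way to check for clipping, and to peak-normalize if desired, is sketched below. It assumes float samples in the range -1.0 to 1.0; the function names, the 0.999 clipping threshold and the 0.9 target are illustrative choices, and this is not a description of what RVC does internally.

```python
def peak(samples):
    """Return the largest absolute sample value (0.0 for empty input)."""
    return max((abs(s) for s in samples), default=0.0)


def is_clipping(samples, threshold=0.999):
    """Treat audio as clipping when its peak reaches full scale (assumed threshold)."""
    return peak(samples) >= threshold


def peak_normalize(samples, target=0.9):
    """Scale the audio so that its peak equals target; silence is returned unchanged."""
    current = peak(samples)
    if current == 0.0:
        return list(samples)
    gain = target / current
    return [s * gain for s in samples]


shout = [0.2, -1.0, 0.4]   # reaches full scale, so it is flagged
quiet = [0.1, -0.5, 0.25]
print(is_clipping(shout), is_clipping(quiet))
print(peak(peak_normalize(quiet)))
```

Peak normalization only rescales the whole file; it does not even out loud and quiet passages within a clip the way a compressor would.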
@blackarrow3138 8 months ago
What if you mix both separated and unseparated? Also, how many epochs was this example trained on?
@user-ku2hc3mr3m 11 months ago
Hello! Thanks for the video. Could you tell me where to get well-prepared voice audio for training, please?
@denblindedjaligator5300 2 months ago
How can people train without v1 or v2? I have a model where it says none.
@denblindedjaligator5300 10 months ago
if I choose that a module should have no tone and I train it in the new version of RVC, I can still choose which tone algorithm to use. This means that it still uses RMVPE, i.e. the new version and the quality is not particularly good either. Hope it gets fixed. try to choose false in the old and in the new version.
@KenDoStudios 11 months ago
What do you have to say about the Google Colab crash? Many users cannot use Colab anymore, as Google is cracking down on deepfake code and banning IPs.
@user-km5ry2zn1n 10 months ago
Hi Jarod, thank you so much for your videos!
I still have a question though. I have 12-14 minutes of pure voice audio. I truncated silence and removed noise, reverb, echo and sibilance. What should I do? So you are telling us that simply dividing the audio file into 10-second pieces is not desirable, right? And I should clip the audio into meaningful bits with complete sentences, for which you use WhisperX?
If so, is WhisperX good for, let's say, non-English languages? For example, languages of Central Asia or other less common languages?
@Jarods_Journey 10 months ago
Yes, that would be correct. Let WhisperX handle the length of the sentences, as both training tools I've shown on my channel (Tortoise and RVC) will do their own splitting. WhisperX works well for many languages, and I believe you can find a chart for it somewhere online, but it depends on how rare the language is and whether it's been trained on it.
@user-km5ry2zn1n 10 months ago
@@Jarods_Journey Okay, so if RVC does its own splitting, what is the point of WhisperX? Why not just crudely divide the audio into 10-second bits?
I understand that WhisperX removes silence, noise and all the other junk, but what if my 15 minutes of audio is already more or less perfect? I removed all the undesirable stuff, so what difference will WhisperX make compared with just dividing it into 10-second bits? Does it make a difference?
Sorry for these weird questions, just trying to wrap my head around everything.
@benman36 11 months ago
The sound files I recorded are 44.1 kHz, but there is no such sample rate in RVC training. There are only 40k and 48k sample rates. Which one should we choose in this case? After UVR, I cleaned the silence in Audacity as you explained in the video, then I set the sample rate to 48000 in the settings and saved it. I did the training in two different ways with the same dataset (by selecting the 40k and 48k sample rates in RVC). In TensorBoard, the 48k run showed lower loss than the 40k run.
@Jarods_Journey 11 months ago ⁺¹
48k is supposed to produce better sounding models. As for the audio you're feeding in, the sample rate doesn't matter too much as it gets resampled.
@benman36 11 months ago
@@Jarods_Journey Which sample rate should I choose in RVC in this situation? Should I continue training with the 48k sample rate, since TensorBoard shows lower loss? In addition, I recorded sound with my cell phone at 48 kHz, but UVR converts the recording to 44.1 kHz. I set the sample rate back to 48000 after cleaning the silence in Audacity.
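To make "it gets resampled" concrete, here is a minimal sketch of converting 44.1 kHz audio to 48 kHz or 40 kHz with plain linear interpolation. This is an illustration only: real resamplers use band-limited filtering for better quality, `resample_linear` is a hypothetical helper, and nothing here is taken from RVC's code.

```python
from fractions import Fraction


def resample_linear(samples, src_rate, dst_rate):
    """Resample mono audio by linear interpolation (illustrative quality only)."""
    if not samples:
        return []
    out_len = int(len(samples) * Fraction(dst_rate, src_rate))
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate   # fractional position in the source
        left = int(pos)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


one_second = [0.0] * 44100
print(len(resample_linear(one_second, 44100, 48000)))  # samples in one second at 48 kHz
print(len(resample_linear(one_second, 44100, 40000)))  # samples in one second at 40 kHz
```

The duration stays the same in both cases; only the number of samples per second changes, which is why the input rate matters less than the target rate chosen for training.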
@bigdaveproduction168 11 months ago ⁺¹
And the ideal duration ? How much ? :/
@Jarods_Journey 11 months ago
Ideal anywhere between 10-60 minutes. Feel free to add more or less though, every voice is different and may need more/less data.
@ahmedsarosh578 10 months ago
Hi, RVC is no longer allowed on free Colab notebooks. What's the alternative?
@Lord_V20 11 months ago ⁺¹
Is an RTX 3080 12GB any good for AI?
@pilpinpin322 11 months ago
More than enough for RVC!
@Jarods_Journey 11 months ago
Yup!
@user-kp6ud7ht4z 11 months ago
My 3060 Ti works, so absolutely.
@__-mk8dv 11 months ago
How do I uninstall the AI voice-changer program? On the apps page where we uninstall things, there is no program named AI voice-changer. Or can we just delete the extracted files right away, since it runs from the command line?
@Jarods_Journey 11 months ago ⁺¹
You can just delete the folder containing everything
@__-mk8dv 11 months ago
@@Jarods_Journey I would like to ask one more question about the AI voice-changer program. I'm using it, but it runs on the CPU and refuses to use the GPU. I selected the GPU and tried forcing it in the graphics settings, but it still isn't using the GPU. The computer freezes, with the CPU at 80-100% and the GPU at 1-5%, on an i3 with a GTX 1050 Ti. Is there a way to fix it?
@SosyalMedyaArge-so5bs 11 months ago ⁺¹
Dude, couldn't you get a better quality result if the silences of the single piece file were left?
I mean, wouldn't you have gotten a better result if you didn't truncate?
How would you know?
@Jarods_Journey 11 months ago ⁺²
Empirically, models with silence cut out sound better than those with it. Silence is not a phoneme and would be considered noise. In my testing, it's best to remove it, but you're free to try it out.
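Truncating silence can be sketched as dropping quiet frames once a silent run gets too long. The frame size, amplitude threshold and allowed run length below are illustrative assumptions, and `truncate_silence` is a hypothetical helper rather than the Audacity effect or the script used in the video.

```python
def truncate_silence(samples, sample_rate, threshold=0.01,
                     frame_ms=20, max_silence_ms=100):
    """Shorten silent stretches to at most max_silence_ms.

    A frame counts as silent when its peak amplitude is below threshold
    (float samples in -1.0..1.0 are assumed).
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_silent_frames = max_silence_ms // frame_ms
    out = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if max(abs(s) for s in frame) < threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                continue  # drop silence beyond the allowed run
        else:
            silent_run = 0
        out.extend(frame)
    return out


# 0.1 s of speech, 1 s of silence, 0.1 s of speech at 1 kHz.
audio = [0.5] * 100 + [0.0] * 1000 + [0.5] * 100
print(len(truncate_silence(audio, 1000)))  # samples left after truncation
```

Keeping a short pause instead of removing silence entirely preserves the boundaries between phrases while still discarding most of the non-speech audio.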
@talkingside 11 months ago ⁺⁶
My model will still sound like a fish
@soupisyummy5533 11 months ago
clean your dataset
@seeyou2winyou 8 months ago
[W CUDAGuardImpl.h:124] Warning: CUDA warning: out of memory (function destroyEvent)
I got this error. Is running a 1-hour unsplit set too much?
Do I need to split it?
@aji9666 11 months ago
Please 🥺🙏 I have 500 tracks on my PC that I need to convert all at once 😭
@Jarods_Journey 11 months ago
I'm not sure what you mean by this, could you be a little more clear on what you want to do? If you meant to train a voice, you can go check out my RVC training videos. If you mean batch conversion, there's a batch conversion option at the bottom of the inference tab on the RVC webui
@aji9666 11 months ago
How do I get in touch with you?
@CiniVoice 11 months ago ⁺¹
How do I install RVC on Windows? Please explain.
@Jarods_Journey 11 months ago ⁺¹
RVC installation: ruclips.net/video/hB7zFyP99CY/видео.htmlsi=jIia3hd9oRUGlZyD
@CiniVoice 11 months ago ⁺¹
@@Jarods_Journey Thanks bro
@aji9666 11 months ago
So that I can tell you the cause of the problem I have.
@user-fl3fb1vv1s 8 months ago
That's why a split dataset is better :)
@user-fl3fb1vv1s 8 months ago
Does it give errors? Try again, that's all.
@aji9666 11 months ago
Do you have an Instagram where I can contact you?
@Dare2Dream.Official 11 months ago
Can I do all this with a mobile phone? Someone please answer
@Jarods_Journey 11 months ago ⁺¹
If you were training on Google Colab, you could technically train on a phone. But it's going to be hard and it's too involved at the moment.
@Dare2Dream.Official 11 months ago
@@Jarods_Journey Okay, thanks. What AI is best for lip sync? Is HeyGen a good choice?
@Dare2Dream.Official 11 months ago
Bro, how do you expect us to understand what you're talking about when you talk as if the people watching are professionals? Please explain in simple terms.
@crolix4220 11 months ago ⁺¹
Not quite. This video is more of a follow-up to his previous videos on related topics, and it's not that hard to click on his channel and look around before you decide it's necessary to sound rude like this. If anything, I would say his channel is more beginner-friendly than most other creators on similar topics, obviously provided you follow from the first few videos.
Now, to answer your other question in the same comment: I'm no expert, but I think you should forget your mobile phone idea, because AI stuff typically needs a powerful PC or an above-average build (especially a graphics card WITH more than 4 GB of video RAM) merely "to get started". Also, I hope you stick around and learn, as it encourages the creator to make more.
@luciovids9208 11 months ago ⁺¹
With all due respect, it might be better for you to look at Jarod's initial AI videos, as I did before following this one, which builds on the previous ones. His explanations can be quick, but with replaying and pausing they are easy to follow for non-computer-literate people like myself. Certainly one of my favourite channels and an excellent teacher.
@Jarods_Journey 11 months ago ⁺²
If you browse around the channel you'll find several tutorial videos, and I expect that you'd do your own searching to help yourself. I'm not going to explain everything again in each video. The playlist here has all of the RVC tutorials on my channel: RVC (Retrieval-based Voice Conversion): ruclips.net/p/PLknlHTKYxuNshtQQQ0uyfulwfWYRA6TGn
