Neural Network Learns to Generate Voice (RNN/LSTM)
- Published: 12 Sep 2024
- [VOLUME WARNING] This is what happens when you throw raw audio (which happens to be a cute voice) into a neural network and then tell it to spit out what it's learned.
This is a recurrent neural network (LSTM type) with 3 layers of 680 neurons each, trying to find patterns in audio and reproduce them as well as it can. It's not a particularly big network considering the complexity and size of the data, mostly due to computing constraints, which makes me even more impressed with what it managed to do.
The audio that the network was learning from is voice actress Kanematsu Yuka voicing Hinata from Pure Pure. I used 11025 Hz, 8-bit audio because sound files get big quickly, at least compared to text files - 10 minutes already runs to 6.29MB, while that much plain text would take weeks or months for a human to read.
UPDATE: By popular demand, I have uploaded a video where I did this with male English voice, too: • Neural Network Tries t...
I was using the program "torch-rnn" (github.com/jcj...), which is actually designed to learn from and generate plain text. I wrote a program that converts any data into UTF-8 text and vice versa, and to my excitement, torch-rnn happily processed that text as if there were nothing unusual. I did this because I don't know where to begin coding my own neural network program, but this workaround has some annoying restrictions. E.g. torch-rnn doesn't like to output more than about 300KB of data, hence all generated sounds being only ~27 seconds long.
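As a sanity check on those numbers (assuming headerless 8-bit mono PCM, i.e. one byte per sample, which is what the description implies), both figures fall straight out of the sample rate:

```python
SAMPLE_RATE = 11025                          # Hz; 8-bit mono = 1 byte/sample
ten_minutes_mib = SAMPLE_RATE * 600 / (1024 * 1024)
generated_seconds = 300_000 / SAMPLE_RATE    # the ~300KB output limit

print(f"{ten_minutes_mib:.2f} MiB")          # 6.31 MiB, near the 6.29MB quoted
print(f"{generated_seconds:.1f} seconds")    # 27.2 s, the "~27 seconds" above
```

The small gap between 6.31 MiB and the quoted 6.29MB is plausibly a file header or a clip slightly under ten minutes.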
It took roughly 29 hours to train the network to ~35 epochs (74,000 iterations) and over 12 hours to generate the samples (output audio). These times are quite approximate as the same server was both training and sampling (from past network "checkpoints") at the same time, which slowed it down. Huge thanks go to Melan for letting me use his server for this fun project! Let's try a bigger network next time, if you can stand waiting an hour for 27 seconds of potentially-useless audio. xD
I feel that my target audience couldn't possibly get any smaller than it is right now...
EDIT: I have put some graphs of the training and validation losses on my blog for those who have asked what the losses were!
robbi-985.homei...
EDIT 2: I have been asked several times about my binary-to-UTF-8 program. The program basically substitutes a valid UTF-8 encoding of a character for each raw byte value, so after conversion, there'll be a maximum of 256 unique UTF-8 characters. I threw the program together in VB6, so it will only run on Windows. However, I rewrote all the important code in a C++-like pseudocode:
robbi-985.homei...
Also, here is an English explanation of how my binary-to-UTF-8 program works:
robbi-985.homei...
EDIT 3: I have released my BinToUTF8 program to the public! Please have a look here:
robbi-985.homei...
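The exact character mapping BinToUTF8 uses isn't given here, but the property EDIT 2 describes (every byte value maps reversibly to one of 256 valid UTF-8 characters) can be reproduced in a few lines of Python via the Latin-1 trick; this is a sketch of the idea, not the author's actual mapping:

```python
def bytes_to_utf8_text(data: bytes) -> str:
    # Map each raw byte 0-255 to the Unicode code point with the same
    # value (Latin-1). The result encodes to valid UTF-8 and contains
    # at most 256 distinct characters, as described in EDIT 2.
    return data.decode("latin-1")

def utf8_text_to_bytes(text: str) -> bytes:
    # Inverse mapping; only valid if every character is U+0000..U+00FF.
    return text.encode("latin-1")

raw = bytes(range(256))                    # every possible byte value
text = bytes_to_utf8_text(raw)
assert utf8_text_to_bytes(text) == raw     # lossless round trip
```

torch-rnn can then train on `text` exactly as if it were ordinary prose, and converting its sampled output back with the inverse function yields raw 8-bit PCM again.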
Just to let people know, by popular demand, I've also uploaded a video where I do this with a male English voice! ruclips.net/video/NG-LATBZNBs/видео.html
+Johnny Mccrum: I'm afraid not. I don't know enough to program my own from scratch, so I was using the open-source software "torch-rnn" (github.com/jcjohnson/torch-rnn/) here.
Practical RNN applications don't use "homebrew" code; they always use some kind of GPU-accelerated library, such as Torch, TensorFlow, etc. There's no need to reinvent the wheel by coding the LSTM yourself (except for educational purposes, which is recommended, as it teaches the fundamentals of BPTT). Any implementation of an LSTM RNN will behave the same, apart from some differences in performance.
@SomethingUnreal You should try training the RNN with STFT (Short Time Fourier Transform) instead of raw audio data, it should perform much better at distinguishing words, as the NN won't need to care about generating the signal itself.
+postvideo975 If you can point me to an RNN that takes 2D input, then sure. Otherwise, I'm stuck with torch-rnn, which is 1D. BTW, I actually did experiment with feeding a spectrogram (FFT powers) to torch-rnn, "raster scan"-style (all of the first time slice, all of the second time slice, etc, end-to-end), and made a program that handles the fact that torch-rnn won't produce perfectly-sized slices some of the time. Amazingly, torch-rnn was able to output something that resembled the voice, but it couldn't make a stable sound at all (each generated slice didn't connect neatly to the next slice). I don't think I can get better than that while using torch-rnn.
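For reference, the "raster scan" flattening described in that reply (all FFT bins of the first time slice, then all bins of the second, end to end, so a 1D sequence model can consume a 2D spectrogram) could be sketched like this; the frame and hop sizes are made-up values, not the ones actually used:

```python
import numpy as np

def stft_rasterize(samples: np.ndarray, frame: int = 256, hop: int = 128):
    """Magnitude spectrogram flattened into a 1D stream for a 1D model."""
    window = np.hanning(frame)
    n_frames = 1 + (len(samples) - frame) // hop
    rows = [np.abs(np.fft.rfft(window * samples[i * hop : i * hop + frame]))
            for i in range(n_frames)]
    spec = np.stack(rows)            # shape: (n_frames, frame // 2 + 1)
    # Raster scan: first time slice, then second, etc., end-to-end.
    return spec.ravel(), spec.shape  # keep the shape so it can be undone

one_second = np.sin(2 * np.pi * 440 * np.arange(11025) / 11025)
flat, shape = stft_rasterize(one_second)
assert flat.size == shape[0] * shape[1]
```

Turning such a stream back into audio is the hard part: as noted above, each generated slice has to line up with the next, which is exactly where torch-rnn struggled.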
Wow, you've been feeding your GAN a lot of hentai
When your creation is screaming to be put out of its misery, maybe it's time to rethink what you're doing
ayy lmao
gafeht You took it to whole another level.
gafeht exactly my thought lol
Yah.....
This video somewhat reminds me of Nina Tucker from FMA.
If you don't know what I'm talking about, DON'T look it up..... It's honestly kinda disturbing.
No no, it's "learning."
well congrats
you made a computer waifu
Skaterboybob made my day :^)
Paddy made my day a day after your day
UP
what next? a motorized fleshlight?
When you talk to your waifu and she replies with 30 seconds of "iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii"
Funny how it learns to laugh and scream, far before it may form words. Quite reminiscent of infant humanity.
Corpus Crewman ikr
this neural network defined the evolution of humans in a few mins
*screams in Japanese*
Minodrey あああああああああああああああああああああああああああああああああああああああああああ
(A in hiragana)
What’s hanging boys? Underrated name and pic
あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ あああああああああああああああああああああああああああああああああああああああああああ
I am dying.
Uboa
AI learns to generate voice: First thing it does is scream.
Spooglecraft just like real people
right!
When humans are born, the first thing they do is cry, the same with this
i have no mouth. and i cannot scream,
It goes from screaming to laughing.
Why are you torturing this poor thing?
Thomas Galea why would it be laughing though
@@tbe7218 insanity
Nah, it's a baby that is born screaming but then learns to laugh
Yes. One step closer to robot waifu A.I.
Keep doing the good work soldier.
What a time to be alive!
Mateusz Bugaj No it is not. If you were, however, born in the future, you could probably fuck machines.
I'm thinking of kreiger from archer right now
ChocolateMilkMage I don't want a waifu, because then I have someone watching what I do and have to take care of her when she breaks down.
Lynx Rapid so a waifu
5:52 I saw "weird glitch" and immediately thought it was gonna say something like, "EVEN NOW, THE EVIL SEED OF WHAT YOU'VE DONE GERMINATES WITHIN YOU."
it's like listening to someone in pain while slowly going mad and accepting it.
In some ways the early iterations sounds somewhat like the noises babies make...
KuraIthys | mind = blown
KuraIthys That's what I thought
welp we have just found the future of what robot babies sound like before learning speech when robots take over
EEEEEEEEEEEEEEEEEEEEOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO XD
Super fast turbo charged baby talk.
Good god, the 2k training iteration sounds like pained screaming
( ͡° ͜つ ͡°)
yeah, it sometimes sounds like "HELP ME!" Good grief, it's nightmarish.
Like a newborn, perhaps?
Maxime Lebled 5k iterations are funnier
Given the input on this was Japanese, I listened for some Japanese. I heard one phrase:
7:45 - "Denki, hen ka?" meaning "Electricity... is it strange?"
When it's an AI saying that, you bet it's strange.
Then again, Japanese literally only needs it to accurately string 2 sounds together and you get words.
thechrisgrice she also says itai. (That means pain in Japanese)
It even laughed afterwards :D
Umitalia nyan yeah i think op knows that
アアアアアアアアアアアア
Where
This is what happens when you torture a vocaloid
this is honesty terrifying
Adam Brown so true
It's probably what we sound like to animals.
Garganzuul
lmao
She's trying her best
Our voices are pretty high pitched to other animals.
ANIME AI THROWS TEMPER TANTRUM, MORE AT 9
But isn't Anime Ai's bedtime 7?
Exactly, hopefully we can get Anime Ai in bed by 9.
Ah, I see.
They could actually hire your computer to produce new Pokémon cries for the next-gen Pokémon game
YES. This is EXACTLY WHAT IT IS.
40 seconds in and the AI is already screaming in pain.
same tbh fam.... same...
0:31 itai (pain)
1:00 shine (die)
I guess she really hated you for at least the first 5k iterations, if these were among her first words.
TheLaXandro did I hear "yame"? shit, just kill me
Lol true
At 15k iterations she started laughing, so... maybe she planned a revenge
This shit is creepy, sounds like it's in pain and screaming like hell
hOi
Darkethi.eXe hOI
It's mostly screaming from existential dread
Dodeca heavy doc
Ed... ward...
oh fuuuuuuck no it started giggling
that's creepy as hell
You think that's creepy? - There might be a skeleton inside you RIGHT NOW.
Garganzuul oh shit there is
I gotta find a surgeon NOW.
The Stitch me to he messed up anime voice
If you take the point of view that your brain is what you call "you" it gets even creepier! YOU MIGHT BE INSIDE OF A SKELETON RIGHT NOW.
More like eat, shit and be a nuisance 24/7.
You've successfully invented a very passable Kirby language. Use this newfound power wisely.
07actual XD
*using an anime voice
holy shit this is next level amirite guys
ポヨ
7:16 poyo
9:09 first time in history when a creation "intentionally" calls its creator a baka
It called its creator an "idiot" in Japanese.
and babies?
Neural network-chan
It cries just like a baby who doesn't know language yet either
An interesting case of neural network application, and an unintentional nightmare fuel, just to attempt to reenact voices in anime.
Imagine bastion but instead of cute beeps, chirps, and whistles, it just makes garbled anime lines. "Kon NIIIIIIII ch wAAAAAAAAAAAAAAAA"
+Gurren813 Consider yourself lucky that I don't have Overwatch, so I won't make that mod.
I'm already Tracer
Pls somebody make that a thing!
We need a machine that can perfectly replicate Morgan Freeman's voice, and we need it now.
Yes.
Please stop trying to act cool. You're just making yourself look even worse.
Andy Li you’re just jealous you didn’t have this million dollar idea Of having computer freeman
Andy Li bruh
@@andyli1890 ur mom
Kizuna A.I. Prototype
the worst part is that someone somewhere is making this program their waifu.
"... OOOOOOaAaAaA... AAAAAAA!"
"That's so hot."
Deltinum *Slow fap* *
TheBirchWoodTree
*slow nut*
7:30 Neural network anime girl learns to sing the 7 GRAND DAD/Flintstones theme
Next time you should use Morgan Freeman's voice
YAS
Qualex14 or david attenborough
*Gordon
I saw a gordon and i saw a freeman, so i have thus been summoned
Thanks, I had imagined what it would sound like. Now I have a pretty good idea. I wish somebody would use a virtual piano to reproduce piano recordings, train a lot, and then let it improvise.
+SweetHyunho: Check Google's WaveNet project at their blog - they did this, and there are several samples there showing what it's output. The piano ones are near the bottom =)
deepmind.com/blog/wavenet-generative-model-raw-audio/
Seen that already. That is sample-based. I'm talking about performing a virtual (or real) musical instrument. Perhaps we could simulate a set of virtual hands for extra human feeling!
What do you mean "sample-based"? It's trained on lots of speech the same as mine is. The fact that they had to fragment it is just because it's a CNN rather than an RNN (and because they wanted to label each phoneme)... The concatenative speech synth that they compare it to is just samples stitched together, but the CNN's output is a continuous stream based on what it learned.
Yes, what you said. Both WaveNet and your RNN directly output the wave without a virtual instrument. What I want to see is the network "hitting" the keys of the piano, or moving a virtual tongue and lips to speak, by controlling (outputting to) a separate simulator which synthesizes the sound itself. WaveNet contains the piano acoustics itself and cannot replace the piano with an organ or tweak it, but in my idea the network focuses on the structure of the music. That should enable looking much farther ahead (near-sightedness = boring music). I guess AI musicians will start being really competitive once the history+planning window exceeds one minute.
Right, I understand. So in the case of speech, outputting something like the pitch, volume and the formant frequencies of the voice, which can then be fed to something like Praat to synthesize the sound. Yes, that would be very cool.
Local robot tries to understand anime
local robot goes full weeaboo in under 10 minutes
Justin Y. I have a weird love hate relationship with you Justin I kinda like it...
XD
Justin Y. You're fake.
@@abacussssss for freddy's sake
0:32 hentai sound track
AAAAAAAAAAAAAAAAAAAA
3 months late, but this made me laugh out loud for real haha
Dank Meme Sir, what the fuck kind of hentai are you watching? O_o
I'll watch what he's watching. Thanks!
「 OKAY 」 holy fuck i'm dead 😂
8:01 did it say "senpai"???
yes
4:11 (eeeeeeh oniisan )
Zahlenteufel1 2:43 "yamete!" so creepy...
7:10 tomare tte
9:50 "motto, motto..."
So, give neural network Japanese anime girl, get gibberish. Perfect.
SnappGamez If you spoke Japanese, then it'd probably make more sense.
Magikarp Used Splash I tried learning once.
I understand why it is considered a Category V language.
SnappGamez it's even more difficult (allegedly) to native English speakers.
SnappGamez by the way, the voice sample is of a boy... the voice said *boku* [僕], which is the I/me used by boys... watashi [私] is for girls... so basically it's a trap...
Well, I thought it was a girl's voice too, though.
+白金圭 Have you never heard a tomboyish girl say "boku"? Please look up the game on VNDB (ぴゅあぴゅあ) if you don't believe me.
Hinata isn't really much of a tomboy, though. Maybe it's because she's a dog-eared girl.
By the way, when I googled 「犬耳っ子」 just now to check the spelling, Hinata was the sixth search result. That surprised me, lol
Please, do continue this and make more videos about it - it's incredibly intriguing, and I'd love to see what happens with different voice-actors. Male-japanese, and even some English ones would be awesome, despite not getting a single word out of them anyway - literally.
I'm glad you like it! I will eventually be uploading one trained on my voice (which happens to be male and English), which I trained with the specific goal of getting recognisable words out of it.
You think that'd be possible? That would be amazing! By the way, as far as I understood from the video, the learning eventually flattens out and only adjusts minimal features (which, however, seem to affect our perception of the voice the most). Would increasing the amount epochs taught make a difference at all?
Yes, the learning rate decreases over time to let things stabilise. I actually stopped it when I did because I wasn't noticing many changes (you can see that towards the end of the video, I'm skipping more results because there's nothing very different from previous results). Things would likely have continued to change a bit, but not much.
Also, although the training loss ("error" in what it has learned) decreases roughly logarithmically, it doesn't get better forever. It eventually stops decreasing and becomes closer and closer to a flat line if you look at it on a graph (please check the link at the end of the video description for some pretty graphs of the losses over time =P). In other words, there is a limit to how much the network can learn, even if you could give it hours' worth of really good data.
I think that the reason the results were sometimes still so different to each other at the end (even though the training loss had stopped decreasing) is because it was just tweaking a few detailed parameters in "random" ways (i.e. wasn't working towards a specific state) because it was not big enough to learn all details, compared to when it was learning the most important patterns. I could certainly be wrong, though. Another commenter did point out that I should sample from each checkpoint (iteration) more than once because they can produce wildly different results, but for technical reasons, I'm still not able to yet (I don't have access to the computer I trained it on, which trained using the GPU for speed; my computer can't train on the GPU, and the checkpoint files made by torch-rnn with GPU vs CPU training are different formats...).
Update: I actually _can_ use these checkpoints on my computer! Although it takes 85 minutes to make a single output file (~27 seconds of audio), assuming it's not training at the same time. So I must've been confusing torch-rnn with something else (maybe char-rnn).
Weird to see this before the great transformer boom that basically accelerated AI into just about everything
3:05 "nani? kore? nani?" XD
idk about you guys but the fact that the network would take a liking to random sounds in the beginning and use them all the time (example: eeeeeeeeeeeeeeeeeeee!!!!!) is super cute
idk man 5:10 was cuter
Someone please write some translation closed captions. Please.
Dan Pope no actual words were spoken, apart from random chance. It's speaking gibberish.
Filipe Amaral
I meant could someone with comedic talent have some fun with it.
"help, i am trapped inside the computer"
"AAAAAAAAAAHHHHHHHHHHH PLEASE KILL ME"
iiiiiiiiiiiiiiiiiiiiiiiiiiiiIIIIIIIIIIIiiiiiiiiiiiiiiiiiIIIIIIIIIiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
THE POWER OF CHRIST COMPELLS YOU
m4ti140 Christ and science are separate
I think he was just being funny, but yeah.
1:48 : *WAHH WEH*
*wa **_WAAAAAAAAAAAAH_*
*aHh*
*WAH! **_WAAAH???_*
*aH*
It's interesting, but the voice sounds a bit creepy
Is cringy*
Why? Because anime is cringy? That is an odd verdict.
Dodeca Totally underrated, and original.
It sounds horrifyingly uncanny.
1 have a cool project, 2 take the most annoying training data imaginable, 3 witness carnage
5:47 my longest "IIIII" ever
Mkay, a lot of people in the comments who know nothing about AI.
So what was the "training" algorithm used here? That's the most important piece of information. I'm assuming the input and output were frequency domain samplings.
+htomerif: The input and output were raw 8-bit PCM audio samples, each of which was fed into or out of the network as the activation of one of 256 nodes. The fact that it's in the time domain is the part that amazed me the most (the way it's able to find the repeating patterns over time). I'm not entirely sure what you mean by "training algorithm", but torch-rnn (the software I used here) uses backpropagation with the "Adam" (Adaptive Moment Estimation) optimizer. You can get more details on exactly how it works here by checking its project page, and especially the text files "train.lua" and "doc/flags.md", here: github.com/jcjohnson/torch-rnn/
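A minimal sketch of the I/O scheme described in that reply (each 8-bit PCM sample one-hot encoded across 256 nodes going in, and the next byte drawn from a softmax over 256 outputs coming out; the logits below are dummy values, not real network activations):

```python
import numpy as np

def one_hot(byte_value: int, size: int = 256) -> np.ndarray:
    # One 8-bit PCM sample in = exactly one of 256 input nodes active.
    v = np.zeros(size)
    v[byte_value] = 1.0
    return v

def sample_next(logits: np.ndarray, temperature: float = 1.0) -> int:
    # The 256 output activations become a probability distribution
    # (softmax), and the next audio byte is drawn from it.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))

x = one_hot(128)                 # 128 = silence in unsigned 8-bit PCM
assert x.sum() == 1.0 and x.argmax() == 128
assert 0 <= sample_next(np.zeros(256)) < 256
```

The `temperature` parameter mirrors the sampling temperature char-rnn-style tools expose: lower values make the draw more conservative, higher values more chaotic.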
SomethingUnreal Generally you have to have an enhance/suppress condition for connections or a live/die condition for individual nodes in a network. Like if you want a servomechanism and camera to follow a red ball, a training algorithm generally needs to suppress connections more severely the further it is from the ball and enhance connections the closer it gets to the ball. So by "training algorithm", I mean "the thing that analyzes the input and output and decides whether the current network state is doing better or worse than the last network state."
It looks like maybe the "criterion" in it is what I'm talking about. Reading other people's code is one of my least favorite activities (no offense), but my best (most likely incorrect) guess is that it's based solely on the cumulative numeric deviation from the original audio file?
If that's the case then yeah, I would kind of expect the output to be some snips of time-synchronized copies of the input data repeated a lot.
I know this is getting TL;DR, but it might be interesting to use frequency-domain data (obviously you already know that). I've used FFTW3 for that general kind of thing, and if Lua is your language of choice, I'm sure there's an FFTW library with Lua hooks. Possibly quite a bit slower if you were actually using CUDA, though.
+htomerif I was using CUDA (it improved speed by about 4x). I don't know the exact way the loss is calculated, but by my understanding, it's not calculated by comparing the network's predicted output to that of the main training set.
The original file is split into a large training set and 2 smaller sets ("test" and "validation"). It appears to regularly compare the predicted output against the "test" set, and whether it's getting better or worse here influences the weights, which is why it doesn't generate perfect copies of the training set - it's never "seen" the test set before. If the original data is a short loop repeating many times, so that the same loop is repeated over and over in the training, test and validation sets, then all it does is perfectly memorise as long a sequence as it's possible to store in the network and blindly spit that out over and over.
EDIT: I may have confused "test" and "validation". The usage of these according to torch-rnn's code and according to other posts I've seen seem to contradict each other, unless I've misunderstood something a lot...
SomethingUnreal I think I see what we're getting at here.
So the network is fed a small test sample of the input, and then its output is compared with *what should have come next*. That is the bit I was calling the "training algorithm".
I did notice that the code had 2 distinct states, a training state and a running state. So the training state is never fed the entire file, but the running state *is*, for purposes of the video.
But yeah, the terminology is basically an instant pitfall as there's huge variation in what means what across the field of AI programming.
Also, any or all of what I said up there could be wrong. I think I get the gist though.
Have you played it back more slowly? I think your algorithm was being too efficient. :)
This stuff scares me. It's adorable and terrifying at the same time.
Adam McKibben At 4:15 it gave up on its life
4:55 AWW that little "pop" noise was so cute
sounded like someone smacking their lips together or smth
0:21 killed
0:28 being tickled
1:19 boiling kettle
1:25 some assembly required, etc
1:49 riding roller coaster
3:53 karate screaming
1:58 So that's what a computer screaming sounds like.
@ 2:45 it says "Itai yo", which means "It hurts" in Japanese. This is scary...
1:49 It has learned to express its endless pain and suffering.
baah baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaiiiiiii
I FEEEEL FANTASTIC! HEY HEY HEEY!!
Pls no not that
i just shiver by trying to remember it
Wut?
+gremlinboii Thank you! I do aim to please.
andrew sauer
Don't go to my party next time.
"Alright, let's give a voice to this neural network and see what happens."
*_continuous screams of agony_*
I am in your target audience! Absolutely love this.
+Vincent Oostelbos I'm glad! And you even made it through my unreasonably long wall of text in the video description!
I sure did. I also noticed at the end in the video, you had written "I'd still like to try training a bigger network with longer training data". Is that something you have done or are still planning on doing, and if so, is it something that will find its way to this channel at some point?
+Vincent Oostelbos I've not done it with this voice yet. I recently tried training a 760x3 network on 27 minutes of audio but with a very different voice (often becomes very quiet), but I haven't got it to turn out as well as this yet. I've trained several (smaller ones) on my own voice, with the goal of having it output recognisable words, to varying degrees of success. I think they could be better if I recorded more training data, but it's very hard to keep the same way of speaking similar things for over 15 minutes (it's like my brain becomes numb and I can't even form the words anymore). I should make videos of the results anyway, though.
SomethingUnreal Have you tried just reading out a lengthy piece of text, like a book, as if you were creating an audiobook?
Anyway, I'm looking forward to seeing the results of some of those projects you mentioned. Good luck!
+Vincent Oostelbos I did that a few days ago, yes. Thank you!
This sounds terrifyingly adorable.
What is it learning?
Is it trying to copy the phrase? Or make its own sentences?
It's trying to learn how to make audio that sounds the same (without being able to simply store it). Or more technically, it's learning the probability of each of the 256 possible vertical waveform positions given all of the previous ones.
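That "probability of each of the 256 possible waveform positions given all of the previous ones" is a standard autoregressive loop. With a placeholder standing in for the trained LSTM (`predict_probs` below is a stand-in for illustration, not torch-rnn's API), generation looks like:

```python
import numpy as np

def predict_probs(history):
    # Placeholder for the trained LSTM: returns a probability
    # distribution over the 256 possible next waveform positions.
    # Here it's uniform, just to make the loop runnable.
    return np.full(256, 1.0 / 256)

def generate(n_samples: int, seed=None) -> bytes:
    rng = np.random.default_rng(seed)
    audio = []
    for _ in range(n_samples):
        probs = predict_probs(audio)         # condition on everything so far
        nxt = int(rng.choice(256, p=probs))  # draw the next 8-bit sample
        audio.append(nxt)                    # ...and feed it back in
    return bytes(audio)

clip = generate(11025)    # one "second" of (here: noise) at 11025 Hz
assert len(clip) == 11025
```

With a real trained model in place of the placeholder, the same loop produces the ~27-second clips heard in the video.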
oh? sounds interesting
SomethingUnreal oh it's 8 bit
2:17 is hilarious
basically hentai audio
wtf is this what hentai is?
. . .
What kinda hentai are you watching where the girl goes "sshshshhyyyaAAAAAAAAAAAAOAOAOIIIIIOOO9X
@@obnoxendroblox8603 she sounds like she’s getting electrocuted.
not my proudest fap
5:00 Kawaii
k.....wAH-YEEEEE
@@xtrashocking eHehhEHehehEHehehhEHHe
Hmm, I'm extremely curious to see how this would sound with a normal voice spoken in a decent range.
That is a good idea, excellent feedback, and a sick burn, all at the same time.
Well, I don't mean it to be a burn. It's just that this voice at that quality is physically painful to listen to (for me at least).
+Daniel T. Holtzclaw Look up WaveNet (if it doesn't find it, try "wavenet samples").
3:40 *Computer-Chan is laughing her ass off.*
8:54 MANGO PUPPY ASYYYYLUUUUUM~
bruh
To use a waifu voice is to play a dangerous game.
horrifying
Yoko Ono's new album, ladies and gents.
*kill me*
THIS is what killed the Beatles?
This is a mind-blowingly awesome outcome for this network. I had an idea similar to this a few years ago but never implemented it. This makes me wonder about how you could develop a set of learned words and string them together somehow. Not sure how to overcome how unnatural that would probably sound, though. Great stuff!
Thank you! I was thinking something similar to that, but I have no idea how to program it. Something like manually transliterating the training data, then feeding it both the text and the audio so it associates a sound with each word - in other words how a particular voice pronounces text. Then, ultimately being able to give it text and have it read it in that voice. I believe some people are using the reverse of this for speech recognition.
This would be much easier in a phonetic language like Japanese. Although that would make it all the more impressive if it learned the many rules of English pronunciation without the need for me to put some intermediate stage in where it converts text to/from something like IPA. I may be getting ahead of myself. =P
some of the more clear japanese i heard:
7:24 "ota~ku no terai"
7:27 "mina-sama"
7:28 "sono fuinki"
7:29 "sugoi wakuwaku"
7:33 "zehi koto wo ---"
7:44 "kiette tari, tenki"
7:52 "jitei(??) koto yo~~"
8:00 "--kouki(?) ni ike"
8:11 it almost says ""hitotsu tano_sase(te)kureta(de)shou" which would be nearly actual japanese
"Cute voice" - you mean annoying af
Still really cool, wish the sample audio hadn't been so squeaky
HamPuddle more like a screaming angry cat
HamPuddle weaboos man
You have yet to understand Japanese school girls.
Weeb voice
"cute voice"... yeah, if by cute you mean intensely annoying
Jesus fucking Christ this is horrifying
I half expected it to address me by name at the end
Let me introduce you to my girlfriend
"IIIIIIII HIHIIHII TOMEKAIKOTE OKOKAAAHAHHHHHHHIIIIIIIIII EEEEEEEEEEE HIHHIHIIHH HAAAAAAAAAAHAHHAHAHHIHIHIHIJIJIIIJIJI "
0:40 is what I hear whenever I try to watch anime.
Wow! This is really creepy! Imagine one day you tell your robo maid to bring some tea, but instead of silently obeying as always, she just stares at you and, trying to mimic a human voice, creaks:
"ihihi... waaaaaaaaa... na kiiil..."
"ihihi."
I'd just have to put her through some more "training".
(Speaking of creepy)
It's like the audio equivalent of the uncanny valley
so that's what it sounds like when you put a chibi into a blender. xD
*aaaaaaaaaaaaaaaaa*
Oh my...
0:26 when you realise you're dead and your memories have been uploaded into an AI program.
never thought skynet would be kawaii
ONE STEP CLOSER TO 3D ROBO WAIFUS
3D DansGame
2:49 sounds like “itai yo!” or “it hurts!”
They say a computer can’t feel emotions.
I’m pretty sure this is an exception.
i just kept hearing shine shine
freaky neural networkn man
Fzero Fz Heard it trying to become a Pikachu several times later on ("pika" @ 7:54)
6:15
*giggles in Skynet*
S O O N
But why anime voice? Nobody ever talks like that in real life.
Nobody.
7:07 "[…]pokemon[…]".
And it sounds like its singing at that time like
"Saito!
Pokemon du paintsu,
Hey!
HmhmrO!"
2:02 Waluigi goes "waa!"
2:07 HALT! HAAAALT!
3:53 HOOOLD UP!
5:54 The Deadly Screech of Four. Has. Returned.
but why Japanese? i can't watch this, my roommates are gonna think i'm watching hentai or something
If you use headphones and your roommates look at your screen, they might think you're watching something highly scientific. But then they might think you're a nerd instead, so take your pick. =/
SomethingUnreal Headphones aren't enough because the sound leaks through
+Hayden the douchebag: Get better headphones? No offense.
First thing the AI does is scream, really tells you something, eh?
Is this Miku before she got her voice
3:57 ORAAAAAAAaaAA orAAORAORAAAAAAAaaAA OOOOoooORRAaAaA
It says "Oh Sh*t" at 7:50 xD
Its like watching Skynet learning.
1:02 Sugoi. Ok, the test's over, it can talk now.
When all was said and done, the neural network told a story of sadness.
That was both amazing and disturbing...
6 years later, and I almost forgot about this video. This was the start on my neural network journey. It's fun to look back on it.
I'm almost certain it said "itai yo" (it hurts) at 12,000
59,000 ended with "masu" which is a way Japanese sentences actually end
Of course the machine said "it hurts". What would you say if you were plugged into 240 V AC like this machine was? :D LOL
Kzinssie (porygonlover322)
very educational
this must mean that in Puyo Puyo the "level start" sentence spoken by Arle must be saying "masu" and not "natsu" like I always thought-
+The Toontastic Toon: batan kyuuu!
SomethingUnreal xD
This weeb machine literally projects all my moods: pained screaming, creepy giggling, and quiet and unintelligible, but vaguely sad chatter.
Good grief, that original voice is about as cute as a steaming dump.
Love how its first reaction was basically screaming.