Mel-Frequency Cepstral Coefficients Explained Easily

Valerio Velardo - The Sound of AI

124K views


Published: 30 Jul 2024
MFCCs have traditionally been used in numerous speech and music processing problems. They are a somewhat elusive audio feature to grasp. In my new video, I introduce the concept of Cepstrum, illustrate its intuition, and discuss how we can extract MFCCs.
Slides:
github.com/musikalkemist/Audi...
Join The Sound Of AI Slack community:
valeriovelardo.com/the-sound-...
Interested in hiring me as a consultant/freelancer?
valeriovelardo.com/
Follow Valerio on Facebook:
/ thesoundofai
Connect with Valerio on Linkedin:
/ valeriove. .
Follow Valerio on Twitter:
/ musikalkemist
Science

Comments • 218

@thebigVLOG 3 years ago ⁺¹¹
This is one of the best lectures I've ever effin watched, thank you so much for making this series!
@ayishanayyar1283 3 years ago ⁺¹
Description in a pleasant manner, untiring, relaxing effect on nerves. Thank you Valerio Velardo
@ashwinalinkil7328 10 months ago
Straight up dude, you are an absolute beast! Every other sentence just blows my mind. You made it so easy to understand and gain an intuition on such abstruse concepts. Thank you so much!
@johnmin3821 3 years ago
This video is doing explanations that I couldn't find or understand from hundreds of websites. You're a legend
@jeevanreji7290 2 years ago ⁺³
I absolutely love the way you explain these concepts! Thank you !
@LauraSpinu 2 years ago ⁺⁸
This was so helpful, can't thank you enough for your time and effort. Simply amazing - and your enthusiasm makes it so easy to watch and enjoy through the end!
@emrecan9271 4 months ago
You are a perfect man. These videos are literally worth gold. I will watch them from the start. Thank you very much.
@BlackHermit 2 years ago ⁺⁴
This was really, exceptionally good. A rather lengthy video, but worth every second. Thank you so much!
@ValerioVelardoTheSoundofAI 2 years ago
Glad you liked it!
@MegaCadr 3 years ago
20 minutes in, my mind started melting. Amazing video!
@IngridKnoch 3 years ago ⁺¹
That was so clearly explained!! Thank you for this, Valerio
@Underscore_1234 2 months ago
That's AWESOME STUFF. I expected good stuff, but not this good. You did a really good job explaining cepstrums and the way to separate glottal pulses from the vocal tract. It really made sense.
@aberone_library 2 months ago
I cannot express how much I'm thankful to you for making this video! This is my favorite style of explanation that I myself have adopted over the years. You took an hour to explain a concept that could, in principle, have been explained in 15 mins or so, but you did it so clearly and thoroughly that by the end of the video I had a spotless, complete understanding not only of the process of extracting the MFCCs but also of the intuition and the meaning of it. Which is something that a lot of other explanatory videos lack these days. So thank you again for your effort!
@ValerioVelardoTheSoundofAI 2 months ago
Thanks a lot :)
@drillsargentadog 2 years ago ⁺²⁵
Nice explanation and great course! One comment: I'm pretty sure big X, E, and H at 27:19 should be functions of frequency, not time, and should be multiplied, not convolved.
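The correction in this comment can be written out as a short derivation. It assumes the usual source-filter model, in which the speech signal x(t) is the glottal excitation e(t) convolved with the vocal tract impulse response h(t); the lowercase symbols are notation assumed here, not taken from the slides.

```latex
\begin{align}
x(t) &= e(t) * h(t) \\
X(f) &= E(f) \cdot H(f) \\
\log |X(f)| &= \log |E(f)| + \log |H(f)|
\end{align}
```

Convolution in time becomes multiplication in frequency, and the logarithm turns that product into a sum, which is what lets the cepstrum pull the two parts apart.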
@Jamazon 2 months ago
your channel is a gold mine, thank you so much for what you do!
@klausjurgenfolz4323 3 years ago ⁺¹
I've learned more watching this video than in a whole semester at my university. Thank you!!!!
@Goriuable 2 years ago ⁺¹
Thank you so much.
I searched a lot on the topic of MFCCs and did not find very good explanations.
Your video is really a masterpiece and I now have a good knowledge of the concepts :)
For sure I will have a look at some of your other videos.
Keep up the amazing work!
@ValerioVelardoTheSoundofAI 2 years ago
Thank you - glad I could help!
@subrahmanyamkunapuli1860 3 years ago ⁺¹
👏Excellent way to explain intricate details!! Thanks for the video series.
@ValerioVelardoTheSoundofAI 3 years ago
Thank you!
@naveenfrancis444 2 years ago ⁺¹⁹
at 14:38, doesn't the IDFT map a signal on to the time domain? If so, shouldn't the axis be pseudo time instead of pseudo frequency?
@st0a 1 year ago ⁺⁸
That's exactly what I was wondering...
@katsiarynaruksha9381 11 months ago
Extremely useful series of lectures. Thanks a ton!
@zeyuyang2053 3 years ago ⁺²
Best MFCC explanation I've ever seen! Thank you!
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
Thank you!
@quincydelp9586 2 years ago
This is an incredibly helpful video that taught me how to implement an MFCC algorithm and intuition for why it is useful information. I can't recommend it enough.
@ValerioVelardoTheSoundofAI 2 years ago
Thank you Quincy!
@beincheekym8 3 years ago ⁺⁴
awesome course, so complete, and very clear visualization. really amazing. thank you!
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@DiogoSanti 2 years ago ⁺¹⁶
But is it OK to call the inverse Fourier transform a spectrum? I ask because the inverse Fourier transform brings the frequency domain back to the time domain, and in my head a spectrum is represented by slices of the frequency domain. Or am I missing the point?
@jdavibedoya 10 days ago
I'm not an expert, but I believe the conventional way to calculate the cepstrum uses the IDFT because of its scaling factor. Both the DFT and IDFT are quite similar and indeed produce results with the same shape.
@Bluephoton 3 years ago ⁺¹
Better than Speech Signal Processing Lecture in terms of explanation and ease of understanding !! Highly recommend to watch for speech related projects!
@ValerioVelardoTheSoundofAI 3 years ago
Thank you!
@thierrydesot1164 3 years ago ⁺¹
Thanks a lot for this brilliant explanation. I have read several papers to grasp the concepts of MFCCs, mel scaling, delta derivatives, etc. But after watching this YouTube tutorial, it is the first time I have the feeling I 'got' it. So I am on my way to watch your other tutorials.
@chacmool2581 2 years ago ⁺¹⁰
What I don't understand is why one takes an inverse FT instead of a FT to get to the quefrency domain. If it's indeed a spectrum of a spectrum shouldn't one take a FT of a FT?
@satyajeetprabhu 8 months ago
Same thought
@booky6149 7 months ago
We are taking the inverse Fourier transform to represent the log spectrum in the same way as the human ear hears (i.e., frequency domain to quefrency domain). The FT only takes the time-domain signal as input. An FT of an FT violates the rules.
@4abdoulaye 3 years ago ⁺¹
VERYVERYVERY CLEAR, Best video I've ever seen.
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@nataliakalashnikova1269 3 years ago ⁺¹⁷
I did my master's thesis in NLP on automatic emotion recognition comparing CNN and SVM performance using MFCC. I didn't really "get" the meaning of MFCC, how it works, why it is so popular, etc. Now I'm doing my PhD thesis also on emotion classification in speech and I was really struggling with the understanding of these basic concepts.
Thank you so much for your work, your clear and vivid explanations! You helped me a lot to move forward in my project.
P.S. Sorry for my English, if there are a lot of errors.
P.P.S. I am a linguist "I believe it's called" :)
@zahraamuhsen1310 2 years ago
Please, I am a student and my thesis is also about speech emotion recognition, using CNN and MFCC based on GA with entropy. Can you send me your thesis, or can you help me understand?
@abhi88mcet 3 years ago
I am more of a Reinforcement Learning guy with a bad squeaky voice trying to start a YouTube channel. I was researching the use of RL to create a realistic vocoder to substitute my voice, and stumbled upon this gem... awesome work, keep it up.
@ValerioVelardoTheSoundofAI 3 years ago
Thanks a lot and good luck with the YT channel -- you're on the verge of starting an amazing journey :)
@zhenxinghu4889 2 years ago ⁺¹²
My question is: why not apply the DFT rather than the IDFT again on Log(F(x(t)))?
@sasankkottapalli6822 4 months ago ⁺¹
Same question here
@tutatis96 3 months ago
@sasankkottapalli6822 I think it works because we're not considering the phase after the log
@vaitom6078 2 years ago
You're a genius of popularization, thank you for the effort
@seathru1232 2 years ago ⁺²⁰
Dear Valerio, there is a point I don't get. Shouldn't you get time on the x-axis if you apply an IDFT to a signal represented in the frequency domain? If I take a signal x(t) and take the FFT, and then the IDFT, don't I get back a reconstructed x(t)? Is the log of the FFT the reason behind what you explained?
@pedrobotsaris2036 1 year ago ⁺²
That is right. I think he is misusing the term inverse Fourier transform here. If you apply an IDFT you get back to the time domain.
@Walsh2571 1 year ago ⁺³
@pedrobotsaris2036 not if you change the scale before performing an IFFT
@user-gb4oo2to4w 4 months ago ⁺¹
@Walsh2571 Why? If you change the scale before performing the IFFT, you just get back to the time domain with a different scale, right?
@tutatis96 3 months ago ⁺¹
@user-gb4oo2to4w I think the point is that we got rid of the phase with the log, but I'm not sure
@AlBeebe 3 years ago ⁺⁶
Excellent video. 42:08 ended up making me wonder what happened to the slack message i thought i got. :)
@pratyushsaha8482 3 years ago
Very well explained. You are awesome man !
@user-yo4kd7zy9j 1 year ago
So well explained! Thanks a lot for your amazing work.
@user-ni2fo1uh2l 11 months ago
I was watching the video and at some point I stopped and started talking to ChatGPT to understand those concepts. I found myself learning about convolutions and cepstral coefficients and the intuition behind them. Once I got back to the lecture, the first thing Valerio started talking about was convolutions and the intuition behind cepstral coefficients. The moral of this story is that he is an amazing teacher: just finish the lecture first and then search for the stuff you did not get :)
@shaidhasan6895 3 years ago ⁺¹
Thanks a lot. Was waiting for this.
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
Glad you liked the video!
@RahulSharma2501 2 years ago
This is absolutely amazing.
@AsEnIxX-wtf 1 year ago
Excellent presentation & explanation
@rakhshandamujib2793 9 months ago
Absolutely loved it!
@dataista7717 2 years ago
Thanks for the series, man. You accelerated my speed jumping into this field a lot. Like A LOT. Really, u rock 🙌
@parismitasarma1572 3 years ago
Amazing, you are explaining the underlying concept in much easier way. Thank you so much Sir.
@sagarparmar6715 3 years ago
greatly admire this video. it's quite detailed. thanks a lot
@anupambhattarai8765 3 years ago ⁺¹
Great explanation.👍
@shereenelmetwally522 2 years ago
Thank you very much. It was really wonderful!
@advaithpillai 2 years ago
Mate you are a life-saver!
@MrHowdai 3 years ago
I never understood it this clearly until watching your videos!! Really appreciate it. :))
After watching this I have 2 little questions:
1. According to the Nyquist theorem, when extracting the MFCCs, do we need more Mel filter banks when processing audio signals at higher sampling rates?
Because I found the MFCCs of audio sampled at 44.1 kHz are NOT the same as those of the down-sampled version at 16 kHz.
2. Is it right to say that MFCCs are volume-independent audio features?
Thanks for the great videos again! And I hope someone can help with my questions, thanks in advance!!
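On question 2, a minimal sketch under stated assumptions: the band energies below are random stand-ins for mel-band energies, and `dct2` is a hand-written, unnormalised DCT-II rather than any library's MFCC routine. It shows that a pure gain change adds the same constant to every log band energy, so only the 0th coefficient moves.

```python
import numpy as np

def dct2(v):
    """Unnormalised DCT-II of a 1-D array, written out with NumPy only."""
    n = len(v)
    k = np.arange(n)[:, None]                      # output (coefficient) index
    i = np.arange(n)[None, :]                      # input (band) index
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n)) @ v

rng = np.random.default_rng(0)
band_energies = rng.uniform(0.1, 10.0, size=40)    # stand-in for 40 mel-band energies

gain = 3.0                                         # same frame, 3x the amplitude
quiet = dct2(np.log(band_energies))
loud = dct2(np.log(gain**2 * band_energies))       # energy scales with gain squared

only_c0_changed = np.allclose(quiet[1:], loud[1:])
c0_shift = loud[0] - quiet[0]                      # 40 * log(gain**2)
print(only_c0_changed, c0_shift)
```

So, in this idealised setting, every coefficient except the 0th is unaffected by a linear gain, which is one reason the 0th coefficient is sometimes dropped. Real recordings at different volumes can also differ in noise floor or clipping, which this sketch does not model.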
@brandonlincolnsnyder 2 years ago
this video is blowing my mind!
@surajpandey86 2 years ago
Great content and explanation.
@pohjanakka1 2 years ago
Thank you so much. This was clearly explained.
@davidkooi4349 11 months ago
Wonderful video, thank you!
@desikharkara9407 2 years ago
Very good and well explained, thanks 👍
@akanshmaurya1568 3 years ago ⁺³⁴
I am confused about why it is a spectrum of a spectrum. When we take the Fourier transform, we go from time to spectrum, so given the last step in calculating the cepstrum, shouldn't we call it the inverse of a spectrum?
@Erosis 3 years ago ⁺¹⁷
Yeah, the inverse is kinda confusing me. I thought we'd use another Fourier transform to get quefrency, not the inverse (which puts it back into the time domain). I read a post about this ( dsp.stackexchange.com/questions/5940/mfcc-process-confusion ) where they say that both are going to produce relatively the same thing, so it doesn't matter in the end.
@bijan8705 2 years ago ⁺²
He clearly doesn't know that the inverse FT is not the same as the FT at 14:00
@RudraSingh-pb5ls 1 year ago
@bijan8705 who doesn't know, Valerio or Akansh, the guy who asked this question here?
@user-gb4oo2to4w 4 months ago
@Erosis Thank you for this information. I read the post but I'm still confused... why are both going to produce the same thing? One is the inverse of the other
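A small numerical check may help with this recurring question. It is a sketch on a random test frame (the frame, the seed and the `1e-10` offset are choices made here, not anything from the video): the log-magnitude spectrum of a real signal is itself real and symmetric, and for such an input the forward and inverse DFT differ only by the factor 1/N.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
x = rng.standard_normal(512)             # any real-valued frame will do

# Log-magnitude spectrum: real-valued and symmetric, log_mag[k] == log_mag[N - k].
log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)   # small offset avoids log(0)

N = len(x)
via_idft = np.fft.ifft(log_mag)          # cepstrum as usually defined (inverse DFT)
via_dft = np.fft.fft(log_mag) / N        # forward DFT, rescaled by 1/N

same_up_to_scale = np.allclose(via_idft, via_dft)
imag_is_negligible = np.max(np.abs(via_idft.imag)) < 1e-9
print(same_up_to_scale, imag_is_negligible)
```

The phase was discarded when taking the magnitude, which is also why the inverse transform here does not return the original waveform: the result lives on a time-like axis (quefrency) but is not x(t).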
@stefanhopman9176 2 years ago
Great Video! Thank you.
@yatosaurio 2 years ago
fantastic explanation, very didactic, thank you very much
@ashokdhingra4 3 years ago ⁺³
Hi, the Fourier transform of a time-domain signal is a series of terms, not a single number. What then is the meaning of the log of the Fourier transform? Or is it the log of each term in the Fourier transform? Further, when we take the inverse Fourier transform, we should go back to the time domain. So it is not really a 'spectrum of a spectrum'.
@kxiong4021 3 years ago
Thank you for sharing this amazing content. Very informative and specific. Came for copper and found gold!
@harisbournas6600 3 years ago
Great explanation
@fabricejumel4630 2 years ago
Thanks a lot . Just perfect
@mohamadhanifomarsaifuddin4578 3 years ago
Good Explanation 5star
@yannickpezeu3419 3 years ago
Thanks, very interesting !
@mitsoskavelos 3 years ago
Awesome explanation and pleasant presentation. Well done and thank you !
@amoghshekharhiremath6627 2 years ago
Very Astounding!!!!!!!!!!!!!!!!
@bigpenguin8457 2 years ago
Thank you for the video. I wanted to ask if you have any documents or code related to extracting the "spectral detail", or for the entire procedure you described in the video (spectrum --> log amplitude spectrum --> spectral envelope --> spectral detail). I applied an amplitude envelope to the log power spectrum, which in theory gives the spectral envelope, but it gives me fewer values, so I cannot do element-wise subtraction with the log power spectrum to get the spectral detail. Please tell me if I am going wrong somewhere. Thank you.
@sharonm1261 1 year ago
this is really interesting, great explanation, thanks! now I just have to work out how to relate this to blossom bat squeaks 🤔 (their frequencies are a lot higher)
@selinm7775 2 years ago
Thank you so much
@tarekziedan9989 2 years ago
Very helpful
@6tyelement979 3 years ago ⁺⁵
4:32 When u cannot answer a question u got asked in front of whole class btw great vid
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
LOL
@VooDooSounding 3 years ago
brilliant!
@adrijachakraborty2316 2 years ago
Mind = Blown!
@JigarRajpopatOfficial 3 years ago ⁺²⁰
Very informative. Thank you!
@ValerioVelardoTheSoundofAI 3 years ago ⁺⁴
Thank you Jigar :)
@srikantachaitanya6561 2 years ago
Thank you so much
@HorsesandCo657 2 years ago
thank you very much so much!
@ValerioVelardoTheSoundofAI 2 years ago
You're welcome!
@MrOpossumx3 3 years ago ⁺¹
Another great vid! I would have appreciated a bit more intuition about the meaning of the MFCC coefficients / time matrix presented around 48:37. While a spectrogram is intuitive, I found the matrix of MFCCs over time harder to interpret. Do you have some intuition for MFCCs over time from a psychoacoustic perspective? In a spectrogram, the intensity of a given frequency in a frame links nicely to our perception of a sound as high or low in pitch. What would be a perceptual equivalent for MFCCs over time?
@user-tj4ut8ox9r 3 years ago
32:48 how do you choose the sine wave frequency?? I thought we use cepstrums to do that for us automatically?
@solomiyabranets8894 3 years ago
you are so good!
@ValerioVelardoTheSoundofAI 3 years ago
Thanks!
@goshick 3 years ago
wow! thanks
@TanupatBoon 3 years ago ⁺²
is DCT just another Fourier transform? Why is it the inverse one?
@keem.studios 3 years ago
this video just saved my engineering final project
@ValerioVelardoTheSoundofAI 3 years ago
Nice :)
@DOMINIK32110 2 years ago ⁺²
Great video as always!
Could you recommend books or other sources (it'd be great if they could be found on the Internet) to read more about MFCCs? Especially in the context of speech.
@harutyunyansaten 3 years ago ⁺²
I want to learn more deeply. Can you please provide the references where you took this info from?
@Waffano 1 year ago
@37:44 You mention that we get a mel spectrum. However, most of the resources I found don't mention any mel spectrum at that step; instead they mention a 1D mel vector of length M, where M is the number of mel bands and m is the band number. The m'th element of the mel vector then contains the sum of the products between the m'th mel filter and the power spectrum. Is this mel vector the same as a mel spectrum? And what are the pros and cons of using either, if they are different?
@ahmadkhadra992 3 years ago
Hi Valerio, I have a little question: when we apply the DFT to the signal, why do we get a power spectrum? Why not just a spectrum?
@sigitpriyohartanto2129 3 years ago ⁺¹
Wow..
Cool.
How can I compare one person's voice with another's?
@lingarajmishra8981 1 year ago
After applying the Fourier transform, how were the vocal tract response and the glottal pulse still in the time domain? Please explain.
@avidreader100 3 years ago ⁺³
I am still stalled at this video. I feel the founders of the concept have confused us by naming these unique parameters the way they did. Quefrency as a metric measured in seconds was quite a big factor in confusing me. I am gradually coming to terms with it. Let me share my thoughts so that others can correct me if I am off.
In the Fourier transform that gave us the spectrum, we say we convert a signal from the time domain to the frequency domain. We look at the time-domain signal as a sum of multiple uniform/steadier frequency components (all taken within a short time frame). The amplitude on the vertical axis is expressed in different units (dB etc.), but is conceptually the same: magnitude. The Fourier transform inverted the x axis. From time it went to the inverse of time, which is frequency.
The cepstrum is basically looking at the up and down shifts of the spectrum as we scan along with respect to frequency. These are the formants in speech. The amplitude is again not tampered with beyond expressing it as a log etc. The x axis is now flipped once again, from cycles per unit time to time. In both spectrum and cepstrum we flipped the x axis. The first time around it analyzed the signal and gave all the frequency components. The second time it gave all the formants. The amplitude of the spike in the cepstrum gave us the significant components, and the quefrency or time value at which the spikes occurred, when inverted, gives us the formant frequency corresponding to this spectrum. Does this sound right?
@Waffano 1 year ago ⁺²
The IDFT part is a typo if you ask me. For me it only makes sense that the cepstrum is a spectrum of a spectrum, meaning DFT applied to a spectrum. This is the only way we can collect the frequencies of the formants. If it was IDFT it would just result in a complex waveform with no information of frequencies. In the end Valerio also specifically uses discrete cosine transform and NOT inverse discrete cosine transform, to get the final MFCCs, which makes sense. So I strongly believe the IDFT in the beginning is just a mistake and should be DFT.
@parasharparikh9352 3 years ago
Can I use MFCCs for extracting features from the current signal?
@sharonm1261 1 year ago
could anyone perhaps tell me which is the next video to watch for how to use MFCCs from different speakers to tell the speakers apart....no worries if there's not one, I will also search and google, thank you :)
@annazaitseva6213 3 years ago ⁺¹
If the cepstrum is a spectrum of a spectrum, why is the inverse Fourier transform, and not the forward one, applied to the log spectrum of a signal?
@amitrege502 3 years ago
This is a good video. However the question is, in the section on 'Formalizing Speech' why are you using the (t) variable in the transform domain also. The domain should be frequency.
@bubblefoil 3 years ago
The time on the x-axis at 49:50 is actually in seconds, right?
@zweiteid3340 1 year ago
Hello,
We are currently doing a project on verification using the human voice (speaker recognition). Would mfcc be useful here at all, when it is actually about filtering out phonemes?
@SonGoku-rl9qf 5 months ago
great
@user-ih4ml7he1x 6 months ago
I am wondering whether the 1st rahmonic represents the envelope (formants) or the glottal pulse in the later part of this video. I am a little bit confused here at 16:12
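In the standard source-filter reading, the first rahmonic sits at the pitch period, so it reflects the glottal pulse train, while the envelope (formants) lives in the low-quefrency region. A minimal sketch on a synthetic signal illustrates this; the sample rate, the 80-sample period and the decaying pulse shape are all assumptions chosen for the example, not values from the video.

```python
import numpy as np

sr = 8000                                 # sample rate in Hz (assumed)
period = 80                               # samples per pulse, i.e. 100 Hz pitch
# One decaying pulse per period stands in for "glottal pulse * vocal tract response".
x = np.tile(0.9 ** np.arange(period), 20)

log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)   # small offset avoids log(0)
cepstrum = np.fft.ifft(log_mag).real

# Search only quefrencies that correspond to a plausible pitch (about 67-400 Hz).
lo, hi = sr // 400, sr // 67              # 20 and 119 samples
peak = lo + int(np.argmax(cepstrum[lo:hi]))
pitch_hz = sr / peak
print(peak, pitch_hz)
```

The peak lands at a quefrency of 80 samples (10 ms), and inverting that quefrency gives the 100 Hz pulse rate rather than a formant frequency.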
@AlexTuduran 2 years ago
I watched the suggested video for how to compute the envelope, but I find it unfit for this problem or I'm missing something. Basically, to compute the envelope, you take the max of a frame. This works well in general with audio, but in constructing the envelope of a spectrum, the data is rather short / scarce (e.g. FFT 1024 => 512 points) and breaking it down into frames increases the chances of computing a rather "false" envelope. How do you manage to avoid the local minima and account only for the actual peaks? And since we're talking about speech, we'll have a lot of local minima. Applying a low-pass filter kind of does it, but it obviously has the disadvantage of potentially shaving off important peaks. So how do you do it properly?
@virendrawadher8006 3 years ago
Any resources to learn more about MFCCs? And resources on what each coefficient corresponds to, like MFCC[1] -> energy, MFCC[2] -> spectral envelope, etc.
@ValerioVelardoTheSoundofAI 3 years ago
There isn't a direct mapping between each coefficient and a perceptual / acoustic attribute. Unfortunately, I haven't found many comprehensive resources on MFCCs.
@Jononor 3 years ago ⁺³
In "Computing Mel-Frequency Cepstral Coefficients" (approx time 38:00) you put Waveform->DFT->Log-Amp->Mel-filterbank->DCT. Is it not more conventional to apply the Mel filterbank to linear magnitude spectrogram, and then do the log transform? But maybe the order is not so important between those two steps?
@ValerioVelardoTheSoundofAI 3 years ago ⁺¹
It's really a matter of "preference". Both approaches work.
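For readers who want to see the steps in code, here is a minimal single-frame sketch of one of the two orderings discussed above (mel filterbank on the power spectrum first, then log, then DCT). Everything in it is an assumption made for illustration: the helper names, the 2595/700 mel formula variant, 40 bands, 13 coefficients and the test tone. It is not the video's code and makes no claim to match any library's output.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)          # FFT bin centres in Hz
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(freqs)))
    for m in range(n_mels):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct2(v):
    """Unnormalised DCT-II of a 1-D array."""
    n = len(v)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n)) @ v

def mfcc_frame(frame, sr, n_mels=40, n_mfcc=13):
    windowed = frame * np.hanning(len(frame))            # taper the frame edges
    power = np.abs(np.fft.rfft(windowed)) ** 2           # power spectrum
    mel_energies = mel_filterbank(n_mels, len(frame), sr) @ power
    log_mel = np.log(mel_energies + 1e-10)               # offset avoids log(0)
    return dct2(log_mel)[:n_mfcc]                        # keep the first coefficients

sr = 16000
t = np.arange(512) / sr
frame = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 880.0 * t)
coeffs = mfcc_frame(frame, sr)
print(coeffs.shape)
```

Each entry of `mel_energies` is the sum of products between one triangular filter and the power spectrum, so the result per frame is a vector with one value per mel band. Swapping the log and filterbank lines gives the other ordering mentioned in the thread.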
@mv7736 3 years ago
Hey, absolutely amazing, so informative.
I have a question: isn't the cepstrum just the log of the spectrum?
@ValerioVelardoTheSoundofAI 3 years ago
No, that is the log-spectrum. You can find details about cepstrum in the video.
@ritwickjha3954 2 years ago ⁺²
Maybe you should be a bit clearer: taking the IFFT of a frequency-domain signal gives us the time domain. Quefrency is in the time domain. I was a bit confused because you kept saying the IFFT will give something like a frequency domain. Also, I am not sure that taking the log of the signal in the time domain is correct: since it is a convolution of E and H, the log should be taken in the frequency domain, where it is a multiplication of E and H. Please correct me if I am wrong.
Great video
@Jononor 2 years ago
You should put (MFCC) in the title, I think. It should help people discover the video. Not everyone knows what the abbreviation stands for :)
