Mel-Frequency Cepstral Coefficients Explained Easily
- Published: 30 Jul 2024
- MFCCs have traditionally been used in numerous speech and music processing problems. They are a somewhat elusive audio feature to grasp. In my new video, I introduce the concept of Cepstrum, illustrate its intuition, and discuss how we can extract MFCCs.
Slides:
github.com/musikalkemist/Audi...
Join The Sound Of AI Slack community:
valeriovelardo.com/the-sound-...
Interested in hiring me as a consultant/freelancer?
valeriovelardo.com/
Follow Valerio on Facebook:
/ thesoundofai
Connect with Valerio on Linkedin:
/ valeriove. .
Follow Valerio on Twitter:
/ musikalkemist
Science
This is one of the best lectures I've ever effin watched, thank you so much for making this series!
Description in a pleasant manner, untiring, relaxing effect on nerves. Thank you Valerio Velardo
Straight up dude, you are an absolute beast! Every other sentence just blows my mind. You made it so easy to understand and gain an intuition on such abstruse concepts. Thank you so much!
This video is doing explanations that I couldn't find or understand from hundreds of websites. You're a legend
I absolutely love the way you explain these concepts! Thank you !
This was so helpful, can't thank you enough for your time and effort. Simply amazing - and your enthusiasm makes it so easy to watch and enjoy through the end!
You are a perfect man. These videos are literally worth gold. I will watch them from the start. Thank you very much.
This was really, exceptionally good. A rather lengthy video, but worth every second. Thank you so much!
Glad you liked it!
20 minutes in, my mind started melting. Amazing video!
That was so clearly explained!! Thank you for this, Valerio
That's AWESOME STUFF. I expected good stuff, but not stuff this good. You really did well explaining cepstrums and the way to separate glottal pulses from the vocal tract. It really made sense.
I cannot express how much I'm thankful to you for making this video! This is my favorite style of explanation that I myself have adopted over the years. You took an hour to explain a concept that could, in principle, have been explained in 15 mins or so, but you did it so clearly and thoroughly that by the end of the video I had a spotless, complete understanding not only of the process of extracting the MFCCs but also of the intuition and the meaning of it. Which is something that a lot of other explanatory videos lack these days. So thank you again for your effort!
Thanks a lot :)
Nice explanation and great course! One comment: I'm pretty sure big X, E, and H at 27:19 should be functions of frequency, not time, and should be multiplied, not convolved.
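A compact way to write the correction raised in this comment, as a sketch rather than the video's own notation. It assumes lowercase letters for time-domain signals and uppercase letters for their Fourier transforms, with $e$ the glottal excitation and $h$ the vocal tract response:

```latex
\[
x(t) = e(t) * h(t)
\;\Longrightarrow\;
X(f) = E(f)\,H(f)
\;\Longrightarrow\;
\log\lvert X(f)\rvert = \log\lvert E(f)\rvert + \log\lvert H(f)\rvert
\]
```

Convolution in time becomes multiplication in frequency, and the logarithm then turns that product into a sum, which is what lets the two components be separated.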
your channel is a gold mine, thank you so much for what you do!
I've learned more watching this video than in a whole semester at my university. Thank you!!!!
Thank you so much.
I searched a lot about the topic of MFCCs and did not find very good explanations.
Your video is really a masterpiece and I now have a good knowledge of the concepts :)
For sure I will have a look at some of your other videos.
Keep up the amazing work!
Thank you - glad I could help!
👏Excellent way to explain intricate details!! Thanks for the video series.
Thank you!
at 14:38, doesn't the IDFT map a signal on to the time domain? If so, shouldn't the axis be pseudo time instead of pseudo frequency?
That's exactly what I was wondering...
Extremely useful series of lectures. Thanks a ton!
Best MFCC explanation I've ever seen! Thank you!
Thank you!
This is an incredibly helpful video that taught me how to implement an MFCC algorithm and intuition for why it is useful information. I can't recommend it enough.
Thank you Quincy!
awesome course, so complete, and very clear visualization. really amazing. thank you!
Thanks!
But is it OK to call the inverse Fourier transform a spectrum? I ask because the inverse Fourier transform brings the frequency domain back to the time domain, and in my head a spectrum is represented by slices of the frequency domain. Or am I missing the point?
I'm not an expert, but I believe the conventional way to calculate the cepstrum uses the IDFT because of its scaling factor. Both the DFT and IDFT are quite similar and indeed produce results with the same shape.
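The claim in this reply can be checked on a toy case. The sketch below assumes the input is a real, even sequence, which is the case for the log-magnitude spectrum of a real signal. The `dft` helper and the sample values are made up for illustration and are not from the video.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(N^2) DFT. The inverse flips the exponent sign and divides by N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


# Hypothetical log-magnitude spectrum: real and even, i.e. S[k] == S[N - k].
log_spec = [3.0, 2.0, 0.5, -1.0, -2.0, -1.0, 0.5, 2.0]
n = len(log_spec)

forward = dft(log_spec)
inverse = dft(log_spec, inverse=True)

# For real, even input the two results differ only by the 1/N scaling factor.
max_diff = max(abs(f / n - i) for f, i in zip(forward, inverse))
# Both results are also (numerically) real.
max_imag = max(abs(i.imag) for i in inverse)
print(max_diff, max_imag)
```

Under that symmetry assumption the forward and inverse transforms give the same shape, so the choice between them mostly changes the scale. For a general complex spectrum with phase kept, the two would not coincide.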
Better than Speech Signal Processing Lecture in terms of explanation and ease of understanding !! Highly recommend to watch for speech related projects!
Thank you!
Thanks a lot for this brilliant explanation. I have read several papers to grasp the concepts of MFCCs, mel scaling, delta derivatives, etc. But after watching this YouTube tutorial, it is the first time I have the feeling I 'got' it. So I am on my way to watch your other tutorials.
What I don't understand is why one takes an inverse FT instead of a FT to get to the quefrency domain. If it's indeed a spectrum of a spectrum shouldn't one take a FT of a FT?
Same thought
We take the inverse Fourier transform to represent the log spectrum in the same way the human ear hears (i.e. frequency domain to quefrency domain). The FT only takes a time-domain signal as input. An FT of an FT violates the rules.
VERYVERYVERY CLEAR, Best video I've ever seen.
Thanks!
I did my master's thesis in NLP on automatic emotion recognition comparing CNN and SVM performance using MFCC. I didn't really "get" the meaning of MFCC, how it works, why it is so popular, etc. Now I'm doing my PhD thesis also on emotion classification in speech and I was really struggling to understand these basic concepts.
Thank you so much for your work, your clear and vivid explanations! You helped me a lot to move forward in my project.
P.S. Sorry for my English, if there are a lot of errors.
P.P.S. I am a linguist "I believe it's called" :)
Please, I am a student and my thesis is also about speech emotion recognition using CNN and MFCC, based on GA using entropy. Can you send me your thesis, or can you help me understand?
I am more of a Reinforcement Learning guy with a bad squeaky voice trying to start a YouTube channel. I was researching the use of RL to create a realistic vocoder to substitute my voice, and stumbled upon this gem... awesome work, keep up the good work.
Thanks a lot and good luck with the YT channel -- you're on the verge of starting an amazing journey :)
My question is: why not apply the DFT rather than the IDFT again on Log(F(x(t)))?
Same question here
@sasankkottapalli6822 I think it works because we're not considering the phase after the log
you're a genius of vulgarization, thank you for the effort
Dear Valerio, there is a point I don't get. Shouldn't you get time on the x-axis if you apply an IDFT to a signal represented in the frequency domain? If I take a signal x(t), take the FFT, and then the IDFT, don't I get back a reconstructed x(t)? Is the log of the FFT the reason behind what you explained?
That is right. I think he is misusing the term inverse Fourier transform here. If you apply an IDFT you get back to the time domain.
@pedrobotsaris2036 Not if you change the scale before performing an IFFT
@Walsh2571 Why? If you change the scale before performing the IFFT, you just get back to the time domain with a different scale, right?
@user-gb4oo2to4w I think the point is that we got rid of the phase with the log, but I'm not sure
Excellent video. 42:08 ended up making me wonder what happened to the slack message i thought i got. :)
Very well explained. You are awesome man !
So well explained! Thanks a lot for your amazing work.
I was watching the video and at some point I stopped and started talking to ChatGPT to understand those concepts. I found myself learning about convolutions and cepstral coefficients and the intuition behind them. Once I got back to the lecture, the first thing Valerio started talking about was convolutions and the intuition behind cepstral coefficients. The moral of this story is that he is an amazing teacher, and that you should finish the lecture first and then search for the stuff you did not get :)
Thanks a lot. Was waiting for this.
Glad you liked the video!
This is absolutely amazing.
Excellent presentation & explanation
Absolutely loved it!
Thanks for the series, man. You accelerated my speed jumping into this field a lot. Like A LOT. Really, u rock 🙌
Amazing, you are explaining the underlying concept in a much easier way. Thank you so much, Sir.
greatly admire this video. it's quite detailed. thanks a lot
Great explanation.👍
Thank you very much. It was really wonderful!
Mate you are a life-saver!
I never understood it this clearly until watching your videos!! Really appreciate it. :))
After watching this I have 2 little questions:
1. According to the Nyquist theorem, when extracting MFCCs, do we need more mel filter banks when processing audio signals at higher sampling rates?
I ask because I found that the MFCCs of audio sampled at 44.1 kHz are NOT the same as those of the down-sampled version at 16 kHz.
2. Is it right to say that MFCCs are volume-independent audio features?
Thanks again for the great videos! I hope someone can help with my questions. Thanks in advance!!
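On question 2, a small sketch of what a pure volume change does, under simplifying assumptions: the gain is a constant multiplier on the waveform, no flooring or normalisation is applied before the log, and the last step is an unnormalised DCT-II. The band energies and the `dct2` helper are hypothetical values and names used only for illustration.

```python
import math


def dct2(x):
    """Unnormalised DCT-II: c[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n)
    ]


# Hypothetical mel band energies for one frame (not measured data).
mel_energies = [4.0, 9.0, 2.5, 1.2, 0.8, 0.3]
gain = 3.0  # amplitude gain; the power in every band scales by gain ** 2

quiet = dct2([math.log(e) for e in mel_energies])
loud = dct2([math.log(gain ** 2 * e) for e in mel_energies])

# The log turns the gain into the same additive constant in every band,
# and the DCT of a constant is zero everywhere except coefficient 0.
shift0 = loud[0] - quiet[0]
max_other = max(abs(a - b) for a, b in zip(loud[1:], quiet[1:]))
print(shift0, max_other)
```

In this idealised setting only coefficient 0 moves with volume and the remaining coefficients stay put. Real pipelines that add noise floors, clipping, or per-utterance normalisation may behave differently.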
this video is blowing my mind!
Great content and explanation.
Thank you so much. This was clearly explained.
Wonderful video, thank you!
Very good and well explained, thanks 👍
I am confused about why it is a spectrum of a spectrum. When we take the Fourier transform, we go from time to spectrum. So, given the last step in calculating the cepstrum, shouldn't we call it the inverse of a spectrum?
Yeah, the inverse is kinda confusing me. I thought we'd use another Fourier Transform to get quefrency, not the inverse (which puts it back into time domain). I read a post about this ( dsp.stackexchange.com/questions/5940/mfcc-process-confusion ) where they say that both are going to produce relatively the same thing, so it doesn't matter in the end.
He clearly doesn't know that the inverse FT is not the same as the FT at 14:00
@bijan8705 Who doesn't know: Valerio, or Akansh, the guy who asked this question here?
@Erosis Thank you for this information. I read the post but I'm still confused... why are both going to produce the same thing? One is the inverse of the other
Great Video! Thank you.
fantastic explanation, very didactic, thank you very much
Hi, the Fourier transform of a time-domain signal is a series of terms, not a single number. What then is the meaning of the log of the Fourier transform? Or is it the log of each term in the Fourier transform? Further, when we take the inverse Fourier transform, we should go back to the time domain. So it is not really a 'spectrum of a spectrum'.
Thank you for sharing this amazing content. Very informative and specific. Came for copper and found gold!
Great explanation
Thanks a lot . Just perfect
Good Explanation 5star
Thanks, very interesting !
Awesome explanation and pleasant presentation. Well done and thank you !
Very Astounding!!!!!!!!!!!!!!!!
Thank you for the video. I wanted to ask if you have any documents or code related to extracting "spectral detail", or the entire procedure that you described in the video (spectrum --> log amplitude spectrum --> spectral envelope --> spectral detail). I have applied the amplitude envelope to the log power spectrum, which in theory gives a spectral envelope, but it gives me fewer values, so I cannot do an element-wise subtraction from the log power spectrum to get the spectral detail. Please tell me if I am wrong somewhere. Thank you.
this is really interesting, great explanation, thanks! now I just have to work out how to relate this to blossom bat squeaks 🤔 (their frequencies are a lot higher)
Thank you so much
Very helpful
4:32 When u cannot answer a question u got asked in front of whole class btw great vid
LOL
brilliant!
Mind = Blown!
Very informative. Thank you!
Thank you Jigar :)
Thanks so much, you saved my ass !
Thanks you soo much
thank you very much so much!
You're welcome!
Another great vid! I would have appreciated a bit more intuition about the meaning of the matrix of MFCC coefficients over time presented around 48:37. While a spectrogram is intuitive, I found a matrix of MFCC coefficients over time harder to interpret. Do you have some intuition about MFCC coefficients over time from a psychoacoustic perspective? In a spectrogram, the intensity of a given frequency at a frame links nicely to our perception of a sound as high or low pitched. What would be a perceptual equivalent for MFCC coefficients over time?
32:48 how do you choose the sine wave frequency?? I thought we use cepstrums to do that for us automatically?
you are so good!
Thanks!
wow! thanks
is DCT just another Fourier transform? Why is it the inverse one?
this video just saved my engineering final project
Nice :)
Great video as always!
Could you recommend books or other sources (it'd be great if it was possible to find them on the Internet) to read more about MFCCs? Especially in the context of speech.
I want to learn more deeply. Can you please provide references for where you took this info from?
@37:44 You mention that we get a mel spectrum. However, most of the resources I found don't mention any mel spectrum at that step; instead they mention a 1D mel vector of length M, where M is the number of mel bands and m is the band number. The m'th element of the mel vector then contains the sum of the products between the m'th mel filter and the power spectrum. Is this mel vector the same as a mel spectrum? And what are the pros and cons of using either, if they are different?
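The "sum of products" description in this comment can be sketched as follows. On one common reading, the resulting 1D vector is the mel spectrum of a single frame, and stacking these vectors over frames gives a mel spectrogram. The sketch assumes the HTK-style mel formula and triangular filters. The function names, band count, bin count, and sample rate are all hypothetical choices, not taken from the video or from any library.

```python
import math


def hz_to_mel(f):
    # HTK-style mel formula (one common convention; others exist).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_bands, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale.

    n_bins is the number of one-sided spectrum bins covering 0 .. sample_rate / 2.
    """
    max_mel = hz_to_mel(sample_rate / 2)
    # n_bands + 2 edges: filter m rises from edge m to edge m + 1, falls to edge m + 2.
    edges_hz = [mel_to_hz(max_mel * i / (n_bands + 1)) for i in range(n_bands + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(n_bands):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_vector(power_spectrum, bank):
    """Element m is the sum of products of filter m with the power spectrum."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


bank = mel_filterbank(n_bands=5, n_bins=33, sample_rate=16000)
frame_power = [1.0] * 33  # hypothetical flat power spectrum for one frame
vec = mel_vector(frame_power, bank)
print(vec)
```

Each call to `mel_vector` handles one frame, so a full recording would yield one such vector per frame.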
Hi Valerio, I have a little question, when we apply DFT on the signal, why we got a power spectrum ? Why not just a spectrum ?
Wow..
Cool.
How can I compare one person's voice with another's?
After applying the Fourier transform, how were the vocal tract response and glottal pulse still in the time domain? Please explain.
I am still stalled at this video. I feel the founders of the concept have confused us by naming these unique parameters the way they did. Quefrency as a metric with a measure of seconds was quite a big factor confusing me. I am gradually coming to terms with it. Let me share my thoughts so that others can correct me if I am off.
In the Fourier transform that gave us the spectrum, we say we convert a signal from the time domain to frequency domain. We look at the time domain signal as an additive value of multiple uniform/ steadier frequency components (all taken within a short time frame). The amplitude in vertical axis is expressed in different units (dB etc), but is conceptually the same - magnitude. The Fourier transform inverted the x axis. From time it went to inverse of time, which is frequency.
The cepstrum basically looks at the up and down shifts of the spectrum as we scan along the frequency axis. These are the formants in speech. The amplitude is again not tampered with beyond being expressed as a log, etc. The x axis is now flipped once again, from cycles per unit time to time. In both the spectrum and the cepstrum we flipped the x axis. The first time around it analyzed the signal and gave all the frequency components. The second time it gave all the formants. The amplitude of a spike in the cepstrum gives us the significant components, and the quefrency or time value at which the spike occurs, when inverted, gives us the formant frequency corresponding to this spectrum. Does this sound right?
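One way to test this reading is a toy example. The sketch below builds a Hann-windowed impulse train (a crude stand-in for glottal pulses with no vocal tract filtering), computes a cepstrum as the inverse DFT of the log power spectrum, and looks for the largest peak away from quefrency 0. The frame length, pulse spacing, sample rate, the small constant inside the log, and the `dft` helper are all assumptions made for illustration.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT. The inverse flips the exponent sign and divides by N."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


N = 256             # frame length in samples (hypothetical)
PERIOD = 32         # spacing of the pulses in samples (hypothetical)
SAMPLE_RATE = 8000  # Hz (hypothetical)

# Hann-windowed impulse train: one pulse every PERIOD samples.
frame = [
    (0.5 - 0.5 * math.cos(2 * math.pi * i / N)) * (1.0 if i % PERIOD == 0 else 0.0)
    for i in range(N)
]

spectrum = dft(frame)
# The small constant avoids log(0) in bins where the spectrum is empty.
log_power = [math.log(abs(c) ** 2 + 1e-6) for c in spectrum]
cepstrum = [c.real for c in dft(log_power, inverse=True)]

# Largest peak away from quefrency 0, searched over the first half of the frame.
peak_quefrency = max(range(1, N // 2 + 1), key=lambda q: cepstrum[q])
print(peak_quefrency, SAMPLE_RATE / peak_quefrency)
```

In this toy case the peak lands on the pulse spacing, so inverting the peak quefrency recovers the repetition rate of the excitation (the pitch) rather than a formant frequency. Formant information is usually associated with the low-quefrency part of the cepstrum.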
The IDFT part is a typo if you ask me. For me it only makes sense that the cepstrum is a spectrum of a spectrum, meaning DFT applied to a spectrum. This is the only way we can collect the frequencies of the formants. If it was IDFT it would just result in a complex waveform with no information of frequencies. In the end Valerio also specifically uses discrete cosine transform and NOT inverse discrete cosine transform, to get the final MFCCs, which makes sense. So I strongly believe the IDFT in the beginning is just a mistake and should be DFT.
Can I use MFCCs for extracting features from the current signal?
could anyone perhaps tell me which is the next video to watch for how to use MFCCs from different speakers to tell the speakers apart....no worries if there's not one, I will also search and google, thank you :)
If the cepstrum is a spectrum of a spectrum, why is the inverse Fourier transform applied to the log spectrum of a signal, and not the forward one?
This is a good video. However the question is, in the section on 'Formalizing Speech' why are you using the (t) variable in the transform domain also. The domain should be frequency.
The time on the x-axis at 49:50 is actually in seconds, right?
Hello,
We are currently doing a project on verification using the human voice (speaker recognition). Would mfcc be useful here at all, when it is actually about filtering out phonemes?
great
I am wondering whether the 1st rahmonic represents the envelope (formants) or the glottal pulse in the latter part of this video. I am a little bit confused here at 16:12.
I watched the suggested video on how to compute the envelope, but I find it unfit for this problem, or I'm missing something. Basically, to compute the envelope, you take the max of a frame. This works well in general with audio, but when constructing the envelope of a spectrum, the data is rather short / scarce (e.g. FFT 1024 => 512 points) and breaking it down into frames increases the chances of computing a rather "false" envelope. How do you manage to avoid the local minima and account only for the actual peaks? And since we're talking about speech, we'll have a lot of local minima. Applying a low-pass filter kind of does it, but it obviously has the disadvantage of potentially shaving off important peaks. So how to do it properly?
Any resources to learn more about MFCCs? And resources on what each coefficient corresponds to, like MFCC[1] -> energy, MFCC[2] -> spectral envelope, etc.?
There isn't a direct mapping between each coefficient and a perceptual / acoustic attribute. Unfortunately, I haven't found many comprehensive resources on MFCCs.
In "Computing Mel-Frequency Cepstral Coefficients" (approx time 38:00) you put Waveform->DFT->Log-Amp->Mel-filterbank->DCT. Is it not more conventional to apply the Mel filterbank to linear magnitude spectrogram, and then do the log transform? But maybe the order is not so important between those two steps?
It's really a matter of "preference". Both approaches work.
Hey, absolutely amazing,so informative.
I have a doubt,isn't cepstrum just log of the spectrum?
No, that is the log-spectrum. You can find details about cepstrum in the video.
Maybe you should be a bit clearer: taking the IFFT of a frequency-domain signal gives us the time domain. Quefrency is in the time domain. I was a bit confused because you kept saying the IFFT gives something like a frequency domain. Also, I am not sure that taking the log of the signal in the time domain is correct: since the signal is a convolution of E and H, the log should be taken in the frequency domain, where it is a multiplication of E and H. Please correct me if I am wrong.
great video
You should put (MFCC) in the title, I think. It should help people discover the video. Not everyone knows what the abbreviation stands for :)