Extracting Mel Spectrograms with Python

  • Published: 19 Oct 2024
  • Learn how to extract and visualise Mel spectrograms from an audio file with Python and Librosa. Learn to visualise Mel filter banks.
    Code:
    github.com/mus...
    Join The Sound Of AI Slack community:
    valeriovelardo...
    Interested in hiring me as a consultant/freelancer?
    valeriovelardo...
    Follow Valerio on Facebook:
    / thesoundofai
    Connect with Valerio on Linkedin:
    / valeriovelardo
    Follow Valerio on Twitter:
    / musikalkemist

Comments • 68

  • @jatinsharma7480
    @jatinsharma7480 9 months ago +3

    Anyone who wants to learn audio processing: he is the go-to man! Thanks so much, man! Respect+++

  • @KevinBacheM
    @KevinBacheM 4 years ago +26

    Your videos are the best audio processing explanations I've found on the internet. Thanks and keep up the good work!

  • @michelebernasconi375
    @michelebernasconi375 4 years ago +3

    Hi Valerio, great video! Funny enough I just got recently interested in audio post-processing with librosa, and repeating the basic concepts with your video series helps me to consolidate my know-how on the topic! Pls keep going!

  • @vladdarii6895
    @vladdarii6895 1 year ago

    Valerio you are the absolute best, thanks for the great work, time and effort you are putting in your videos!

  • @nmirza2013
    @nmirza2013 1 year ago

    great video on Mel Spectrograms

  • @louisleeboy
    @louisleeboy 2 years ago

    Good video; it helped me quickly understand how to use the library and the mel band concept. Thank you very much.

  • @nedzadhadziosmanovic3785
    @nedzadhadziosmanovic3785 3 years ago +2

    In this video, and the previous video called "Mel spectrograms explained easily", you explain what a mel band, the mel scale, a mel filter bank, etc. mean, but in my opinion there is a single step missing for understanding what is really done when mel filter banks are used to construct a mel spectrogram.
    The process you are referring to:
    1. Find the smallest and biggest frequency expressed in Hz, which we got from the output of STFT
    2. Convert these two values from Hz to mel scale
    3. Choose the number of mel bands we want to use
    4. According to the chosen number of mel bands, we construct a mel filter bank
    And now comes the part which is not clear to me: The use of mel filter banks on outputs of STFT to get output of some other kind, which will be used to construct a mel spectrogram.
    At this point let's just go back and look at a single output of the STFT (which is equivalent to performing the DFT on one frame of an audio wave). As a result we get a set of complex numbers, and by finding their magnitudes we are able to construct an amplitude-vs-frequency graph (also called a "frequency domain graph"), by simply plotting each magnitude as the amplitude for a certain frequency. In other words, each of the magnitudes of the complex numbers (to be clear, one magnitude per complex number) determines the height of one bin inside the amplitude-vs-frequency graph.
    Now we have this single amplitude-vs-frequency graph, and we want to use it in combination with the mel filter banks to construct an output of some kind. The first question is how to apply a mel filter bank to a single output of the STFT (i.e. to one amplitude-vs-frequency graph)? In other words, how do we combine these two to get an output? (I know it is basically a multiplication of two vectors, but how would you represent this visually, using a mel filter bank and a single amplitude-vs-frequency graph?) Secondly, what is this output representing: the amplitude for a single mel band? Lastly, I think it would be much clearer if we used mel bands, in mel units, on the y-axis (though I don't know whether this would be correct); in my opinion, putting frequency in Hz on the y-axis of a mel spectrogram is completely misleading (and is making me think I didn't understand anything).
    I wanted to ask: would you be so kind as to make a single graph which is the result of combining one amplitude-vs-frequency graph (which we got from the STFT) with a mel filter bank, also expressed visually as a graph (I suppose, but I am not sure, that it would again be an amplitude-vs-frequency graph, but this time with mel frequencies on the x-axis)? I think it could help both me and a lot of your viewers.

    • @charmz973
      @charmz973 3 years ago +1

      Thank you for the question; I was also asking myself the same question: why do we go through the hassle of calculating mel filter banks if they are apparently not used in constructing the mel spectrograms?

    • @laptopml
      @laptopml 1 year ago

      What I understand is that melspectrogram() creates both the vanilla spectrogram and the filter banks and applies the filters to obtain the resultant time/mel freq plot, all "under the hood", going by the video around 7:38. I think he used librosa.filters.mel() to illustrate what filter banks look like, but calling mel() is not necessary as melspectrogram() will do the same for you AND apply the generated filters to the vanilla spectrogram to give you the final output. The fact that both mel() and melspectrogram() accept the same set of parameters i.e. n_fft, sampling rate, n_mels seems to also confirm this. I removed the code for generating filter_banks i.e. the mel() call and called melspectrogram(). The mel spectrogram it generated looks the same regardless of whether mel() was called or not. Disclaimer: I used my own WAV file for this test. Hope this helps.
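The mechanics discussed in this thread reduce to a single matrix product: as the reply says, `librosa.feature.melspectrogram` builds the filter bank internally and multiplies it with the power spectrogram, so calling `librosa.filters.mel` separately is only for visualisation. A minimal numpy sketch with a toy, linearly spaced filter bank (real mel filters place their centers on the mel scale, and the sizes below are illustrative, not taken from the video):

```python
import numpy as np

def triangular_filter_bank(n_mels, n_freq_bins):
    """Toy bank of overlapping triangular filters.

    Real mel filter banks place the triangle centers on the mel scale;
    here they are spaced linearly just to keep the sketch short.
    """
    centers = np.linspace(0, n_freq_bins - 1, n_mels + 2)
    bins = np.arange(n_freq_bins)
    bank = np.zeros((n_mels, n_freq_bins))
    for m in range(n_mels):
        left, center, right = centers[m], centers[m + 1], centers[m + 2]
        rising = (bins - left) / (center - left)
        falling = (right - bins) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

rng = np.random.default_rng(0)
power_spectrogram = rng.random((1025, 342))      # (freq bins, frames), n_fft=2048
filter_bank = triangular_filter_bank(10, 1025)   # (mel bands, freq bins)

# One matrix product collapses 1025 frequency bins into 10 mel bands,
# frame by frame: (10, 1025) @ (1025, 342) -> (10, 342).
mel_spectrogram = filter_bank @ power_spectrogram
print(mel_spectrogram.shape)  # (10, 342)
```

Visually, each row of the filter bank weights the frequency bins of one frame and sums them into a single mel-band value: every triangle collapses the portion of the amplitude-vs-frequency graph under it into one number per frame, which is what the question above is asking about.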

  • @RildoDemarquiPereira
    @RildoDemarquiPereira 1 year ago

    Very well explained. Congrats!

  • @sharonm1261
    @sharonm1261 3 years ago

    Thanks, very useful video; I will go and try to make some spectrograms from my bat calls!
    I had subtitles turned on, which is pretty entertaining: YouTube couldn't decide between male, mail, mouth or mouse spectrograms, and apparently you were using 10 Melbournes.

  • @antikoo1
    @antikoo1 3 years ago

    you deserve more likes. Thanks!

  • @DíazRamírezManuel
    @DíazRamírezManuel 6 months ago

    Your work in AI for audio is great. I'm working on a music transcription project and this is very useful for me. Really, thanks so much. A question: do you have any video about the Constant-Q Transform, or can you give a recommendation for studying that topic? Thanks.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  6 months ago

      I haven't covered Constant Q Transform in the channel yet.
      If I remember correctly, I think you can find more info about it in Fundamentals of Music Processing by Meinard Muller.

  • @abdouazizdiop8279
    @abdouazizdiop8279 4 years ago +1

    Another great video, thanks for sharing.

  • @Moonwalkerrabhi
    @Moonwalkerrabhi 3 years ago

    Your videos are the best! One thing I want to suggest is to zoom in on the Jupyter notebook a little bit while coding 😅

  • @harperlewis_RC
    @harperlewis_RC 2 years ago +1

    Congratulations on the great content!
    Loading a large number of audio files with librosa can be quite demanding computationally, as I've experienced. Is it necessary to keep the resulting vector in float32? The values represent amplitudes, so I doubt that such precision is necessary in order to save the vector and extract a Mel spectrogram from it. Any suggestions on the smallest variable type to stick to in this case?

    • @muhammaddanish9843
      @muhammaddanish9843 2 years ago +1

      Yes, the precision is important, because the resulting vector is a quantized vector; reconstructing it after dropping the floats will degrade the quality.
      But I would say try it and see what effect it has on your task, then let me know too.
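Whether float32 is really needed can be checked directly. Audio loaded by librosa is normalised to [-1, 1], and in that range float16 rounds with an error below 1e-3, which halves memory at the cost of precision you can measure yourself, as the reply suggests. A quick numpy check (the signal here is random stand-in data, not a real recording):

```python
import numpy as np

rng = np.random.default_rng(0)
signal32 = rng.uniform(-1.0, 1.0, 22050).astype(np.float32)  # 1 s of stand-in audio

# Downcasting halves the memory footprint; measure what the rounding costs.
signal16 = signal32.astype(np.float16)
error = np.abs(signal32 - signal16.astype(np.float32)).max()

print(signal16.nbytes / signal32.nbytes)  # 0.5
print(error < 1e-3)  # True: worst-case rounding error stays below 1e-3 in [-1, 1]
```

Whether that error matters is task-dependent; measuring it on your own data, as suggested above, is the safest route.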

  • @НиколайНовичков-е1э

    Thank you for great video!

  • @abeeramir6006
    @abeeramir6006 9 months ago

    Correction: you have been saying frame size / 2 is the Nyquist frequency in this video (and the previous one as well), which I think should be sampling rate / 2, not frame size / 2.

  • @naufalrifqihabibie4492
    @naufalrifqihabibie4492 1 year ago +1

    Hi Valerio, I want to ask something stupid: are n_fft and hop_length the same as the window length and hop length? I'm trying to do the whole process all over again myself, from Fourier-transforming each frame to applying the mel filter bank to each frame, but when I passed the window length and hop length as the melspectrogram parameters, it seems like it's stretching the signal: the original audio is only 30 s long and the mel spectrogram became 45 s long. Thanks to everyone who replies to this comment!

    • @cavega2042
      @cavega2042 1 year ago

      Have you solved your problem? I have the same problem when extracting a mel spectrogram: the time is stretched. I have an audio file of 2:20 min and the x-axis is showing about 4:00 min.
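One plausible cause of the stretching in both comments (an assumption, since neither shows their exact code): `librosa.display.specshow` rebuilds the time axis from the `hop_length` and `sr` it is given, so if those differ from the values used when extracting the spectrogram, the same number of frames maps onto a longer axis. Note also that in librosa, `n_fft` is the FFT size and `win_length` defaults to `n_fft`, so they are related but not the same parameter as the hop. The arithmetic, with illustrative numbers:

```python
sr = 22050
hop_length = 512           # hop actually used when extracting the spectrogram
n_samples = 30 * sr        # a 30-second clip

# With librosa's default centered framing, every hop starts one frame.
n_frames = 1 + n_samples // hop_length
true_duration = n_frames * hop_length / sr
print(round(true_duration, 2))       # 30.0

# If the plot assumes a different hop, the same frame count maps to a
# stretched time axis: a 1.5x larger hop turns 30 s into 45 s.
assumed_hop = 768
stretched_duration = n_frames * assumed_hop / sr
print(round(stretched_duration, 2))  # 45.0
```

The fix, under that assumption, is to pass the same `hop_length` and `sr` to both the extraction call and `specshow`.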

  • @ΔΗΜΗΤΡΙΟΣΚΟΥΜΑΝΔΡΑΚΗΣ

    Why is the frequency in Hertz? Shouldn't it be in mels in the mel spectrograms?

  • @MichelHabib
    @MichelHabib 2 years ago

    Thank you 🥰

  • @debabratagogoi9038
    @debabratagogoi9038 1 year ago

    You have not shown the reconstruction of the audio waveform from a mel spectrogram, and also have not talked about the phase information, which is not captured when computing a mel spectrogram. Will proper reconstruction be possible, given that the mel spectrogram alone does not contain enough information for complete reconstruction of the raw audio?
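The commenter's suspicion is correct: a mel spectrogram is lossy twice over, since the filter bank collapses many frequency bins into a few bands, and the phase is discarded entirely. Librosa offers `librosa.feature.inverse.mel_to_audio` for an approximate reconstruction (inverting the filter bank and estimating phase with Griffin-Lim). The magnitude half of that can be sketched in plain numpy with stand-in random data and a random filter bank (illustrative shapes only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq_bins, n_mels, n_frames = 1025, 10, 50

# Stand-ins for a real filter bank and power spectrogram.
filter_bank = rng.random((n_mels, n_freq_bins))
power_spec = rng.random((n_freq_bins, n_frames))
mel_spec = filter_bank @ power_spec                 # forward pass: (10, 50)

# Best least-squares guess at the spectrogram the mel version came from.
recovered = np.linalg.pinv(filter_bank) @ mel_spec  # (1025, 50)

# 1025 bins were squeezed into 10 bands, so the recovery is approximate,
# and the phase needed for a waveform was never there to begin with.
residual = np.linalg.norm(power_spec - recovered) / np.linalg.norm(power_spec)
print(residual > 0.01)  # True: the mapping is strictly lossy
```

So reconstruction is possible but approximate: the fewer mel bands, the more spectral detail is irrecoverable, and the phase must be estimated rather than recovered.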

  • @MrDari88
    @MrDari88 4 years ago

    Thank you for another great video! I have noticed you have used the STFT but not the Hann window. Is this the standard procedure? By exploring the Librosa function feature.melspectrogram, I have seen there is the option of providing the window type and window length as arguments to the method. Could you please clarify this?

  • @SHADABALAM2002
    @SHADABALAM2002 3 years ago

    Hi Valerio, thanks for all the videos and knowledge you've shared; they simply have no match. My question: when you plot the mel filter banks at (5:05), why does the color bar on the right side show all values as +0 from bottom to top? I can't understand this.

    • @mitramir5182
      @mitramir5182 3 years ago +1

      These are the values in dB (decibels); they can be negative or positive depending on how much smaller or larger they are than our threshold of hearing.

  • @user-lp3fv2nv7t
    @user-lp3fv2nv7t 2 years ago

    Thanks for the video! I learned a lot from these and took so many notes. In this video at 9:00, mel_spectrogram.shape is (90, 342). Why doesn't the column count equal the frame size (2048) divided by 2 plus 1? 2048/2+1 = 1025. Why is the answer 342? Another question: does the hop_length always equal half the frame size? Why is it not 1024 here? Thanks for your response!

    • @idontevenknow3707
      @idontevenknow3707 2 years ago

      Hop length can be whatever you want it to be: half, a quarter, etc.
      His answer is 342 because of the length of his audio file. My answer was (90, 21360) because I loaded a full 8-min song into librosa; Valerio's answer was 342 because his audio file is short. You can figure out the length of Valerio's audio file by doing: 342 * 512 (hop length) / 22050 (sample rate) = 7.94 seconds.

    • @desrucca
      @desrucca 1 year ago +2

      You've got the formula wrong; check the previous video.

  • @estherdzitiro2039
    @estherdzitiro2039 4 years ago

    Your videos are extremely helpful. Could you please advise the best way to save the mel spectrograms for deep learning? I have a dataset with about 99,539 30-second audio clips.

    • @kosprov69
      @kosprov69 3 years ago

      Mel spectrograms are numpy arrays at the end of the day, so you can save them like any numpy array, via .npy files.
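A small sketch of that workflow (the filename is made up): `np.save`/`np.load` round-trip a spectrogram exactly, shape and dtype included.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
mel_spec = rng.random((90, 342)).astype(np.float32)  # one stand-in mel spectrogram

# Round-trip through .npy: shape and dtype survive exactly.
path = os.path.join(tempfile.mkdtemp(), "track_0001.npy")
np.save(path, mel_spec)
loaded = np.load(path)

print(loaded.shape, loaded.dtype)        # (90, 342) float32
print(np.array_equal(mel_spec, loaded))  # True
```

For a dataset of ~100k clips it may be worth bundling many arrays into a single `np.savez` archive instead of 100k small files; that choice depends on the training pipeline.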

  • @nadiamaarfavi4930
    @nadiamaarfavi4930 3 years ago

    Thanks for a great explanation. If we have a long audio file with a 3 min duration, and we want to generate a spectrogram for a CNN, do we cut the audio or generate the spectrogram for the full audio? Do the window size or hop size differ for long audio vs short audio?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      You can cut the audio. Typical lengths are 10 to 30 second segments.

    • @nadiamaarfavi4930
      @nadiamaarfavi4930 3 years ago

      @@ValerioVelardoTheSoundofAI Thanks for such a quick response. The thing is, I don't want to cut the audio; somehow I need all the information in the audio to be included. The analysis I'm doing depends on all the information in the full audio. Won't cutting it lose some information? What will happen if I generate one spectrogram for the full duration?

  • @adam051838
    @adam051838 2 years ago

    First of all, let me say thank you for the videos; they're helping me tremendously. I have run into a problem when I try to run the code myself, though. When I use y_axis="mel" in the librosa.display.specshow function, matplotlib complains that "__init__() got an unexpected keyword argument 'linthreshy'". My code is practically identical to your own; do you have any ideas on what a solution could be?

    • @adam051838
      @adam051838 2 years ago

      Turns out I solved my own problem: I was using matplotlib version 3.5.1, downgraded to 3.3.4, and it seems to work.
      I also spotted the warning "The 'linthreshy' parameter of __init__() has been renamed 'linthresh' since Matplotlib 3.3; support for the old name will be dropped two minor releases later." Since 3.5 is two minor versions past 3.3, I'm thinking this is a bug from either matplotlib or librosa not being updated properly.

  • @saishriram2064
    @saishriram2064 4 years ago

    Hey bro!
    Thanks for the video. I have a question: how can I use a mel spectrogram as input to a CNN by saving it as an image? Currently I can only save black and white. Is color better?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +2

      You can save the mel spectrogram as a grey-scale image, i.e., width = frames, height = mel bands, channels = 1.
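A hedged sketch of that conversion, using random stand-in data for the dB-scaled spectrogram: min-max scale to 0-255, cast to uint8, and append a channel axis so the array has the height x width x channels layout image-based CNN pipelines expect.

```python
import numpy as np

rng = np.random.default_rng(0)
mel_db = rng.uniform(-80.0, 0.0, (90, 342))  # stand-in dB-scaled mel spectrogram

# Min-max scale to 0..255 and cast: one grey-scale "image" per clip.
scaled = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())
image = (scaled * 255.0).astype(np.uint8)[..., np.newaxis]  # add channel axis

print(image.shape)  # (90, 342, 1): height = mel bands, width = frames, channels = 1
```

Colour adds nothing here: the spectrogram has one value per (band, frame), and a colour map only re-encodes that single channel into three.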

  • @kingeng2718
    @kingeng2718 3 years ago

    If I want to use the result of a mel spectrogram with a convolution, should I use .T on the result so the convolution happens over time, or not?

  • @abhijitjaiswal
    @abhijitjaiswal 4 years ago

    Hey Valerio, I am using the GTZAN dataset to extract mel spectrograms, but I get different sizes for the audio signals. Is this expected? If yes, how do I use them for training a CNN model? If no, could you help me understand where I may be going wrong?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      I'm assuming the mel spectrograms differ in the number of frames. The files may have slightly different durations from one another. If that's the case, I suggest you either ensure that all the files have the same number of samples, or zero-pad the shorter spectrograms.
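The zero-padding option can be sketched with `np.pad` (the shapes here are made up): pad every spectrogram on the right to the longest frame count, then stack into one batch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three clips of slightly different lengths -> different frame counts.
specs = [rng.random((90, n)).astype(np.float32) for n in (340, 342, 338)]

# Pad every spectrogram on the right with zeros up to the longest one.
target = max(s.shape[1] for s in specs)
padded = [np.pad(s, ((0, 0), (0, target - s.shape[1]))) for s in specs]
batch = np.stack(padded)  # now stackable into a single training batch

print(batch.shape)  # (3, 90, 342)
```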

  • @aomo5293
    @aomo5293 1 year ago

    What version of librosa are you using?

  • @soundaryashanthi2548
    @soundaryashanthi2548 2 years ago

    I am getting an error. I am doing this in Google Colab and getting AttributeError: module 'librosa.feature' has no attribute 'melspectogram'.
    Please tell me the solution.

  • @canpasa6695
    @canpasa6695 1 year ago

    Hello! Thank you very much for this info. When I run my Jupyter notebook I get an issue like:
    melspectrogram() takes 0 positional arguments but 1 positional argument (and 3 keyword-only arguments) were given
    I couldn't find anything. Can anyone help me?

    • @ankitanand2448
      @ankitanand2448 1 year ago +1

      Hi Can,
      While passing the signal argument in the librosa.feature.melspectrogram function, use y= rather than directly writing signal.
      In the context of this tutorial,
      try mel_spectrogram = librosa.feature.melspectrogram(y=scale, sr=sr, n_fft=2048, hop_length=512, n_mels=10)
      instead of
      mel_spectrogram = librosa.feature.melspectrogram(scale, sr=sr, n_fft=2048, hop_length=512, n_mels=10)
      I hope this resolves your issue

    • @Rjokich
      @Rjokich 1 year ago

      @@ankitanand2448 Thank you for help. I faced the same issue and this solution helped.

  • @tyhuffman5447
    @tyhuffman5447 4 years ago

    I have a question about lines 4 & 5 in the example code: on line 4 you say sr = librosa.load(scale_file), then on line 5 you say sr = 22050. Why not say sr = sr?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      You're right. I could have written sr = sr.

    • @tyhuffman5447
      @tyhuffman5447 4 years ago

      @@ValerioVelardoTheSoundofAI Not wanting to pick nits, I just wondered if there was a specific reason for using an explicit value rather than a reference. BTW, I have been cleaning up some of your code, which I will either post to my GitHub or submit as a pull request to your repo when I figure out how to do that. Me and git don't get along, yet.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      @@tyhuffman5447 sounds cool!

  • @علیرضااکبری-ه9ص
    @علیرضااکبری-ه9ص 2 years ago

    Hi, I want to extract the center frequencies of the filter banks. How can I do that?
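Librosa exposes these via `librosa.mel_frequencies`; they can also be computed by hand. The sketch below uses the HTK mel formula, while librosa defaults to the slightly different Slaney variant, so the numbers will not match librosa bin-for-bin: n_mels + 2 points are placed uniformly on the mel scale, and the inner n_mels points are the triangle centers.

```python
import numpy as np

# HTK-style mel conversions. librosa defaults to the slightly different
# "slaney" variant, so these numbers are illustrative only.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_mels, fmin, fmax):
    # n_mels + 2 points equally spaced in mel; the inner n_mels points are
    # the triangle centers (the outer two are the first/last triangle edges).
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    return mel_to_hz(mels)[1:-1]

centers = mel_center_frequencies(10, 0.0, 11025.0)
print(centers.round(1))  # 10 centers, crowded at low Hz, spreading out at high Hz
```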

  • @ethiotechnotube7982
    @ethiotechnotube7982 1 year ago

    How can I use windowing mechanisms?

  • @Sam-jk5dw
    @Sam-jk5dw 3 years ago

    3:45 - “Which is equal to the size.” The size of what?
    6:32 - Why does the weight never reach 1 for bands 5-10?

    • @romainpattyn4528
      @romainpattyn4528 3 years ago +2

      1) The size of the second dimension of the matrix.
      2) In practice, there is a normalisation factor; you can check it out here: www.researchgate.net/figure/Mel-Scale-Filter-Bank-for-Word-Tulas_fig4_270898832

    • @louisleeboy
      @louisleeboy 2 years ago

      I have the same question; thanks for Romain Pattyn's answer.

  • @muhammaddanish9843
    @muhammaddanish9843 2 years ago

    How can I use mel spectrograms for an audio denoising task?

  • @bArda26
    @bArda26 5 days ago

    o7