Understanding Time Domain Audio Features

  • Published: 22 Jul 2020
  • I introduce fundamental time-domain audio features, such as Amplitude Envelope, Root-Mean-Squared Energy, and Zero Crossing Rate. I explain the intuition and the math behind these temporal acoustic features, and mention a few sample applications.
    Slides:
    github.com/musikalkemist/Audi...
    Join The Sound Of AI Slack community:
    valeriovelardo.com/the-sound-...
    Interested in hiring me as a consultant/freelancer?
    valeriovelardo.com/
    Follow Valerio on Facebook:
    / thesoundofai
    Connect with Valerio on Linkedin:
    / valeriovelardo
    Follow Valerio on Twitter:
    / musikalkemist
  • Science
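The three features mentioned in the description can be sketched frame by frame with plain NumPy. This is an illustrative implementation, not the video's own code; the 440 Hz test sine, the amplitude of 0.5, and the frame size of 1024 are my choices.

```python
# Frame-by-frame sketch of amplitude envelope (AE), root-mean-squared
# energy (RMS), and zero-crossing rate (ZCR).
import numpy as np

FRAME_SIZE = 1024

def amplitude_envelope(signal, frame_size):
    # AE: maximum absolute amplitude within each frame.
    return np.array([np.max(np.abs(signal[i:i + frame_size]))
                     for i in range(0, len(signal), frame_size)])

def rms_energy(signal, frame_size):
    # RMS: square root of the mean of the squared amplitudes per frame.
    return np.array([np.sqrt(np.mean(signal[i:i + frame_size] ** 2))
                     for i in range(0, len(signal), frame_size)])

def zero_crossing_rate(signal, frame_size):
    # ZCR: half the summed |sgn(s(k)) - sgn(s(k+1))| per frame
    # (here compared only within the frame, not across its boundary).
    return np.array([0.5 * np.sum(np.abs(np.diff(np.sign(signal[i:i + frame_size]))))
                     for i in range(0, len(signal), frame_size)])

sr = 22050
t = np.arange(sr) / sr                      # 1 second of samples
signal = 0.5 * np.sin(2 * np.pi * 440 * t)  # 440 Hz sine, amplitude 0.5

ae = amplitude_envelope(signal, FRAME_SIZE)   # each value ~0.5
rms = rms_energy(signal, FRAME_SIZE)          # each value ~0.5/sqrt(2)
zcr = zero_crossing_rate(signal, FRAME_SIZE)  # ~41 sign changes per frame
```

For real recordings, `librosa.feature.rms` and `librosa.feature.zero_crossing_rate` compute the latter two with proper hop-length and windowing handling.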

Comments • 68

  • @wthrmn2490
    @wthrmn2490 3 years ago +24

    Valerio, rn you're single-handedly saving my bachelor's thesis. These lessons are beyond amazing and I'll be forever grateful for getting these for free, keep up your great work!

    • @rhwood1154
      @rhwood1154 2 years ago +2

      I am in the exact same situation!!! These videos are fantastic... What are you doing for your thesis?

    • @wthrmn2490
      @wthrmn2490 2 years ago +2

      @@rhwood1154 They are indeed! I'm going to design a chord detection algorithm for audio signals. How about you?

  • @mohamadsuleyman3562
    @mohamadsuleyman3562 1 year ago +1

    I start "liking" videos even before I watch them; more fun than a film series.

  • @tyhuffman5447
    @tyhuffman5447 3 years ago +6

    I really like how you take your time going through the data preparation part, which is the biggest part, since bad stuff in = bad stuff out.

  • @adrijachakraborty2316
    @adrijachakraborty2316 2 years ago

    Amazing explanation! Taking a class on music cognition and technology and didn't have much background knowledge on audio signals. Your videos are a saviour.

  • @TheSkeef79
    @TheSkeef79 3 years ago +13

    Man, you are amazing! Your lessons are easy to understand and they are so informative.

  • @Hiyori___
    @Hiyori___ 3 years ago +3

    Amazing lecture. I'm studying this subject for uni, but it was quite hard to really make sense of the formulas, especially the summation indexes. Now it's much clearer, THANK YOU

  • @yovisstar
    @yovisstar 3 years ago

    I'm a new-media artist and self-taught coder. Your videos are really, really helpful for a person like me. I hope much goodness returns to you. Thanks!

  • @juhinpavithran9339
    @juhinpavithran9339 1 year ago +1

    I am a music producer and new to audio ML. As you discussed Zero Crossing, I understand it could easily be used for differentiating between spoken words and a musical beat or rhythm (a high-transient sound with a fast attack). Great insights, thank you.

  • @BetinhoSM
    @BetinhoSM 4 years ago

    Very nice and simple explanation. Thanks!

  • @fardalakter4395
    @fardalakter4395 10 months ago

    Bro, thank you so much for your effort to teach. It's an amazing explanation.

  • @KingQuetzal
    @KingQuetzal 1 year ago

    Awesome video. Thank you so much

  • @user-yz7gh5dv8n
    @user-yz7gh5dv8n 1 year ago

    A wonderful video that I have sought for many years.

  • @SuperLucasGuns
    @SuperLucasGuns 3 years ago +3

    I love watching this man; can't wait to start coding this

  • @WahranRai
    @WahranRai 3 years ago

    The zero crossing rate formula is derived from the famous theorem on continuous functions (the intermediate value theorem): if f is continuous and f(a)*f(b) < 0, then there exists c with f(c) = 0 and a < c < b.

  • @Suman-zm7wx
    @Suman-zm7wx 3 years ago

    wow that was awesome !!!!!!

  • @ibrezmohd9448
    @ibrezmohd9448 2 years ago

    Amazing content

  • @sathyanarayananvittal7832
    @sathyanarayananvittal7832 7 months ago

    Excellent. You make the math piece sound so easy and fun. Wondering if RMSE is done over the mean of the sample amplitude -> (s - s(m))^2, where s(m) is the mean for that frame.

  • @yuvaramsingh3773
    @yuvaramsingh3773 4 years ago

    Liked your video even before finishing it

  • @Underscore_1234
    @Underscore_1234 2 months ago

    Hi, since you display knowledge in this field and give crystal-clear explanations, I decided to watch all the videos in this playlist. I'm relearning some things and learning new ones (I'm entirely new to data science applied to sound). I think it's also great that you give use cases for each feature you've discussed so far. I'm eager to test a few. Do you have any dataset that you'd recommend playing with? Also, would you know a library or any way to create sheet music via Python and have it played?
    Anyway, really nice videos!

  • @Heidi-fz2lg
    @Heidi-fz2lg 2 years ago

    Amazing, thanks a lot, sir

  • @notallama1868
    @notallama1868 1 year ago

    I notice that the zero crossing rate equation kind of breaks down if one of the samples has an amplitude of 0. In theory, that will probably never happen since 0 is just one out of infinite possible amplitudes for any given signal, but in practice we have limited values that we can actually represent and rely on rounding to the nearest one of those for each sample. If that nearest value happens to be 0 for one of the samples, then |sgn(s(k)) - sgn(s(k + 1))| evaluates to 1 rather than 2 or 0. If this only happens once, we end up with a ZCR of, for example, 4.5, but if it happens more than once, we won't be able to tell, as every two occurrences will just look like a single "normal" crossing. A ZCR of 4 could mean we have 4 crossing pairs that don't contain a 0 sample, or it could mean we have 8 crossing pairs where one sample has a value of 0.
    So do we just ignore this under the assumption that it won't happen frequently enough to be an issue? Do we remove 0 as a possible amplitude value that a sample could evaluate to during quantization? Or is there some other way of dealing with it?
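The edge case described above is easy to reproduce. A small sketch (the helper `zcr_contributions` is hypothetical, not from the video) showing how an exact-zero sample splits one crossing into two half-contributions, and how a +/0/+ run becomes indistinguishable from a real crossing:

```python
import numpy as np

def zcr_contributions(samples):
    # The per-pair terms |sgn(s(k)) - sgn(s(k+1))| of the ZCR sum.
    return np.abs(np.diff(np.sign(samples)))

print(zcr_contributions(np.array([0.3, -0.2])))       # [2.] one clean crossing
print(zcr_contributions(np.array([0.3, 0.0, -0.2])))  # [1. 1.] crossing split in two
print(zcr_contributions(np.array([0.3, 0.0, 0.2])))   # [1. 1.] no crossing, same sum!
```

Summed and halved, the last two cases both yield one "crossing", which is exactly the ambiguity the comment points out.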

  • @adityaprakash256
    @adityaprakash256 3 years ago +1

    Could you please recommend a current state-of-the-art technique or paper for separating the voiced part of a signal from the unvoiced part?

  • @tetlleyplus
    @tetlleyplus 7 months ago

    I think this analysis is usually carried out offline, but what if we need to perform it in real time? I guess computation can be sped up by taking advantage of the CPU's internal register flags, such as the sign and zero flags.

  • @lucasa.w.romeiro2136
    @lucasa.w.romeiro2136 2 years ago +1

    Hello, my name is Lucas. I'm Brazilian, and I'm trying to make an algorithm that differentiates one noise from another.
    For example: depending on the sound the rain makes as it hits the ground, determining how much water is falling. Things like that, always involving noise.
    Is anyone familiar with this who could help me with some directions?
    Thanks

  • @ImDino
    @ImDino 1 year ago

    The symbol Σ you referred to is the Greek capital letter S (sigma); maybe that clarifies something for you.

  • @MegaNargess
    @MegaNargess 3 years ago

    can you please also teach feature extraction using opensmile?

  • @larsneyerlin8200
    @larsneyerlin8200 2 years ago +1

    I love you

  • @StefaanHimpe
    @StefaanHimpe 4 years ago

    Explanation is very lucid. I was wondering though what are typical frame sizes for envelope detection? The ones used in the visualization seemed a little coarse (presumably for "educational" purposes).

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      Thank you! Yeah, you're absolutely correct. You should use typical frame sizes (e.g., 1024, 2048), but the decision obviously depends on sampling rate.
      If you haven't, you can check out my previous video on audio features extraction pipelines to get more details on frame sizes.

  • @rakshithv5073
    @rakshithv5073 3 years ago

    Why do we take RMS instead of something simpler like the mean, which can also handle outliers?
    Is there any particular reason?
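One way to see the answer: audio samples oscillate around zero, so the signed mean of a frame is near zero regardless of loudness, while RMS tracks energy. A small illustrative sketch (the signal parameters are arbitrary, not from the video):

```python
import numpy as np

t = np.arange(2048) / 22050
loud = 0.9 * np.sin(2 * np.pi * 440 * t)
quiet = 0.1 * np.sin(2 * np.pi * 440 * t)

print(np.mean(loud), np.mean(quiet))   # both ~0: the signed mean cancels out
print(np.sqrt(np.mean(loud ** 2)))     # ~0.64: RMS separates loud...
print(np.sqrt(np.mean(quiet ** 2)))    # ~0.07: ...from quiet
```

The mean of |s| would also work as a loudness proxy; RMS is conventional because it corresponds to signal power, which relates more closely to perceived loudness.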

  • @evrenbingol7785
    @evrenbingol7785 3 years ago

    Could you use time-series algorithms from trading as features? After all, they are all mathematical signatures. I can sort of see use cases for MACD or other simple/weighted moving-average calculations to detect bursts of energy.
    Like, you could extract frequencies and do an inverse FFT, as you would to denoise, and compare the time domains at different frequencies.
    You could even do covariance and/or correlation between these? Would it make sense?
    That beats me...

  • @danielcapel470
    @danielcapel470 2 years ago +1

    Valerio, this amplitude envelope method is not complete, as it gives you only the maximum value of the frame. In order to get the real amplitude envelope, you need to connect the maximum of the previous frame with the next one through a linear function: (y - y0) = m(x - x0), where m = (y1 - y0)/(x1 - x0). Am I correct, or did I miss something?
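The commenter's suggestion can be sketched as follows. This is an illustrative implementation, not the video's method: the function name is hypothetical, anchoring each maximum at the frame centre is my assumption, and `np.interp` applies the (y - y0) = m(x - x0) formula piecewise.

```python
import numpy as np

def envelope_interpolated(signal, frame_size):
    starts = np.arange(0, len(signal), frame_size)
    # Per-frame maxima of the absolute amplitude, as in the video.
    maxima = np.array([np.max(np.abs(signal[s:s + frame_size])) for s in starts])
    centers = starts + frame_size // 2
    # Piecewise-linear envelope, defined at every sample index.
    return np.interp(np.arange(len(signal)), centers, maxima)

sr = 22050
t = np.arange(sr) / sr
signal = np.exp(-3 * t) * np.sin(2 * np.pi * 440 * t)  # decaying sine

env = envelope_interpolated(signal, 1024)
# env now decays smoothly, roughly following exp(-3t).
```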

  • @imamuddin8042
    @imamuddin8042 4 years ago +2

    @Valerio, I have a question about zero crossing rate: if one of two consecutive samples has the value 0, then its sign is 0, so if the other sample is either + or -, the number of zero crossings counted in this case is 1/2. Is that actually true? I would recommend that if sample k+1 is 0, we look at sample k+2, meaning three samples at a time; then we can compare samples k and k+2. If k and k+2 have the same sign, the intermediate 0 is not an indication of a zero crossing; if k and k+2 have opposite signs, all three samples together mean a single zero crossing.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago

      Good catch! I think it's mainly a performance issue. In a 16-bit recording, for example, the chance that you actually get a 0 value for the amplitude is pretty low, given you have ca. 65K values available. For this reason, it's probably preferable to get an "imperfect" zero-crossing rate but perform fewer operations. This may not be an issue for offline applications, but for real-time ones it can make a difference.

  • @prabhavsingh4016
    @prabhavsingh4016 3 years ago

    @Valerio Velardo - The Sound of AI
    This was really helpful! I had a question: are time-domain features of any use in emotion recognition? As in, are spectrogram-based features enough for emotion recognition, or can time-domain features like RMS energy help with that?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago +2

      If you're using a Deep-Learning approach, you don't need to resort to time domain features. Just go with spectrograms / mel spectrograms. If you're using traditional ML, then I would consider including time-domain features.

    • @prabhavsingh4016
      @prabhavsingh4016 3 years ago

      @@ValerioVelardoTheSoundofAI Thanks for the helpful and prompt reply! Can I have your mail id? I have a few questions regarding emotion recognition in general and it would be great if you could help out! Your videos have really helped me in this domain.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago +1

      @@prabhavsingh4016 I suggest you join The Sound of AI Slack community to get feedback. If you need more help, I offer consulting services. My email address is velardovalerio@gmail.com.

    • @prabhavsingh4016
      @prabhavsingh4016 3 years ago

      @@ValerioVelardoTheSoundofAI Thanks a lot!

  • @pk-bb8cq
    @pk-bb8cq 3 years ago

    Which dataset will you be using going forward? Is it the GTZAN dataset or something else? Thank you so much for your amazing teaching, man...

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      I won't be using any dataset in this series. I'll mainly focus on audio features and techniques to extract them.

  • @venkatesanr9455
    @venkatesanr9455 4 years ago +1

    Interesting topics with good explanations; I wait weekly for your videos. How is the frame size chosen? Can these time-domain features be used for tracking speaking rate (i.e., whether speakers are speaking fast or slow)?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      With these features, we usually try typical values for the frame size / hop length (e.g., 512, 1024, 2048).

    • @venkatesanr9455
      @venkatesanr9455 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks for your response, Valerio. Any suggestions or links for determining speaking rate (fast, slow, or medium)? Would an ML-based approach be needed, or are time-domain features enough?

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      @@venkatesanr9455 unfortunately I don't know the literature on this problem. I'm sure a quick search on Google Scholar will help!

    • @venkatesanr9455
      @venkatesanr9455 4 years ago

      @@ValerioVelardoTheSoundofAI Thanks for your kind reply.

  • @097sleeper
    @097sleeper 2 years ago

    Is this independent of the MFCCs?

  • @lucasa.w.romeiro2136
    @lucasa.w.romeiro2136 2 years ago

    This video's language is set to Vietnamese. Could you fix it so correct subtitles are generated? Thanks!

  • @gregorysech7981
    @gregorysech7981 4 years ago

    Is there an error in the upper bound of the Zero Crossing Rate for a frame t? Shouldn't the sum's maximum iteration reach (t+1)K - 2 instead of (t+1)K - 1? We are comparing k with k+1, so if we only do -1, in the last iteration we compare with the first sample of the (t+1)-th frame, no? Anyway, great video; this series makes getting into audio signal processing much easier. I've binged this whole playlist today, and I'll probably do the same with the Deep Learning with Python one :D

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  4 years ago +1

      Glad you like the series! (t+1)K - 1 is correct because you want to check the crossing rate for all the samples in the current frame. To do that, as you correctly pointed out, you'll have to compare the last sample of frame t with the first of frame t+1.

    • @gregorysech7981
      @gregorysech7981 4 years ago +1

      @@ValerioVelardoTheSoundofAI ah that makes sense, didn't think about it that way

    • @abhishek-shrm
      @abhishek-shrm 3 years ago

      @@ValerioVelardoTheSoundofAI Great video! Just one doubt: what if frame t is the last frame and the (t+1)-th frame doesn't exist? What will the value of s(k+1) be then?
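For reference, the frame-level formula this thread is discussing, as I reconstruct it from the comments (K is the frame size, t the frame index):

```latex
\mathrm{ZCR}_t \;=\; \frac{1}{2} \sum_{k = tK}^{(t+1)K - 1}
\left| \operatorname{sgn}\!\big(s(k)\big) - \operatorname{sgn}\!\big(s(k+1)\big) \right|
```

The last term of the sum compares s((t+1)K - 1) with s((t+1)K), i.e. the first sample of frame t+1, which is why the upper bound is (t+1)K - 1 rather than (t+1)K - 2.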

  • @Bihari_Chaman
    @Bihari_Chaman 2 years ago

    Unable to sign up to the community. Some error occurred.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  2 years ago

      Do you get a "There's been a glitch…" page? If so, this seems to be a known error: status.slack.com/2021-03/164d083f66725086

    • @Bihari_Chaman
      @Bihari_Chaman 2 years ago +1

      @@ValerioVelardoTheSoundofAI Done

  • @1412-kaito
    @1412-kaito 1 year ago

    Code?

  • @ImDino
    @ImDino 1 year ago

    whenever I see AE I think of (Adobe) After Effects

  • @fridaynightfunkinmoni7600
    @fridaynightfunkinmoni7600 3 years ago +1

    Thank you very much, sir.
    Please, I need help.
    I need your email, please.

  • @ericchuhaochan2066
    @ericchuhaochan2066 3 years ago

    I don't understand why these features belong to the time domain. I was expecting something like tempo to appear in the video. AE and RMS look more like a "loudness domain" to me.

    • @ValerioVelardoTheSoundofAI
      @ValerioVelardoTheSoundofAI  3 years ago

      They are called time-domain features because they are a function of time (the independent variable). By contrast, frequency-domain features are a function of frequency.
      "Tempo" in musical terms doesn't translate to time; it's a measure of the "speed" of a piece.

  • @tyxiao8046
    @tyxiao8046 1 year ago

    Dear Valerio, could you change this video's language to English? The default language is Vietnamese. Thanks a lot!