Audio-visual self-supervised baby learning

  • Published: 19 Jun 2024
  • Andrew Zisserman (Oxford University)
    simons.berkeley.edu/talks/and...
    Understanding Lower-Level Intelligence from AI, Psychology, and Neuroscience Perspectives
    Lesson 1 from the classic paper "The Development of Embodied Cognition: Six Lessons from Babies" is "Be Multimodal". This talk explores how recent work in the computer vision literature on audio-visual self-supervised learning addresses this challenge. The aim is to learn audio and visual representations and capabilities directly from the audio-visual data stream of a video (without any manual supervision of the data), much as an infant could learn from the correspondence and synchronization between what they see and hear. It is shown that a neural network that simply learns to synchronize the audio and visual streams is able to localize the faces that are speaking (active speaker detection) and the objects that sound.
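
    The synchronization objective described above can be sketched as a simple contrastive setup: a video clip and the audio recorded at the same moment form a positive pair, while mismatched clips in the batch serve as negatives. The code below is an illustrative assumption rather than the implementation from the talk; the encoder architectures, sizes, and the InfoNCE-style loss are placeholders chosen only to show the shape of the idea.

    ```python
    # Minimal sketch of audio-visual synchronization learning (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualEncoder(nn.Module):
        """Toy encoder: a stack of video frames -> a unit-norm embedding."""
        def __init__(self, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, dim),
            )
        def forward(self, frames):            # frames: (B, 3, T, H, W)
            return F.normalize(self.net(frames), dim=-1)

    class AudioEncoder(nn.Module):
        """Toy encoder: a log-mel spectrogram -> a unit-norm embedding."""
        def __init__(self, dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
            )
        def forward(self, spec):              # spec: (B, 1, F, T)
            return F.normalize(self.net(spec), dim=-1)

    def sync_contrastive_loss(v_emb, a_emb, temperature=0.07):
        """Each clip's visual embedding should match its own (synchronized)
        audio embedding and mismatch every other audio in the batch."""
        logits = v_emb @ a_emb.t() / temperature       # (B, B) similarity matrix
        targets = torch.arange(v_emb.size(0), device=v_emb.device)
        # Symmetric loss: video -> audio and audio -> video.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        frames = torch.randn(4, 3, 8, 64, 64)   # batch of short video clips
        spec = torch.randn(4, 1, 40, 100)       # corresponding audio spectrograms
        loss = sync_contrastive_loss(VisualEncoder()(frames), AudioEncoder()(spec))
        print(loss.item())
    ```

    In the localization results mentioned in the abstract, the similarity is typically computed against spatial visual features rather than a single pooled embedding, so the regions that agree most with the audio pick out the speaking face or sounding object.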

Comments • 2

  • @aaqib.s · 22 days ago · +1

    It is so wonderful to listen to Prof. Andrew!

  • @ATH42069 · 21 days ago

    @11:56 when the mouth detection algorithm isn't sure if it is the mouth