OUTLINE:
0:00 - Intro
1:40 - Architecture Overview
6:30 - Comparison to regular VAEs
8:35 - Generative Mechanism Formulation
11:45 - Non-Gaussian Latent Space
17:30 - Topographic Product of Student-t
21:15 - Introducing Temporal Coherence
24:50 - Topographic VAE
27:50 - Experimental Results
31:15 - Conclusion & Comments
Thanks for covering our work Yannic! You nailed the description of generative models despite your humility. In terms of the experiment in Figure 3 (28:59), we included this to demonstrate the ability of the Topographic VAE to learn topographically structured latent spaces even without sequences. The figure shows the images from MNIST which maximally activate each neuron laid out in the 2D torus topography you described, and you can see that neurons which are closer to each other in this topography have similar 'maximum activating' images (i.e. they are organized by stroke thickness, slant, digit class). You can see similar figures in the original Topographic ICA papers, so in some sense this experiment was in homage to that prior work 😊. Excited to share the new stuff we are working on soon!
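For readers wondering how a figure like that is produced, here is a minimal sketch (not the authors' code; the `encoder` interface and the 8x8 grid size are assumptions) of picking each latent unit's maximally activating MNIST image and tiling it on the 2D grid of units:

```python
import torch

def max_activating_grid(encoder, images, grid_hw=(8, 8)):
    """For each latent unit, pick the dataset image that activates it most
    strongly, then lay those images out on the 2D (toroidal) grid of units.
    Assumes encoder(images) returns an (N, H*W) tensor of unit activations."""
    H, W = grid_hw
    with torch.no_grad():
        acts = encoder(images)                      # (N, H*W) activations
    best_idx = acts.argmax(dim=0)                   # max-activating image index per unit
    tiles = images[best_idx]                        # (H*W, C, 28, 28) selected images
    return tiles.reshape(H, W, *images.shape[1:])   # arranged on the torus grid
```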
Wow, nice work Andy! Looking forward to digging in more, and congrats on all the good press as well!!
Thanks Tristan! Shoot me a message, lets catch up soon! Interested in hearing about your recent work too 🙂
Very interesting work! Do you think that it is possible to apply these concepts to the music/audio domain in order to extract useful features/representations?
Thanks @WatchAndGame. It would certainly be interesting to see what types of transformations could be learned for music/audio, or really any natural time series data. As Yannic mentioned, this paper was mainly focused on artificial sequences we constructed from the MNIST and dSprites datasets, so we knew what transformations we expected to learn (i.e. rotation, scaling, translation). It would be interesting to see what transformations could be learned on natural data, although I expect you would need to increase the size of the network (# of capsules) significantly to account for the increased diversity of transformations present in such data.
Interesting work!!! I can see its potential to explain my problem.
Suggestion: "Multiplying Matrices Without Multiplying", which speeds up NNs nearly 100x on CPUs
I honestly thought it was a joke.
Yannic please do a video on this!
There is another interesting one:
"Random sketch learning for deep neural networks in edge computing" (180x speed up)
As well as MONGOOSE and SLIDE (a bit older).
They use variations of Locality Sensitive Hashing (LSH) like in the Reformer
I really want to see what those latent spaces look like for high-resolution images. It seems that such a system might capture the essence of an image and store it in a somewhat meaningful way.
I think the rolling operation you drew at 22:25 is faulty (i.e. clockwise should've been counterclockwise and vice versa). I think that way it would be consistent with Figure 1 at 6:00; the U's would be rolled "towards" the center rather than "further away" from it. But of course I haven't read the paper and am just basing this on Figure 1.
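For what it's worth, which way a roll moves things is easy to check numerically; here is a tiny illustration (plain NumPy, not the paper's code) of how the sign of the shift picks the direction around a capsule ring:

```python
import numpy as np

caps = np.array([0, 1, 2, 3])   # activations around one capsule "ring"
print(np.roll(caps, +1))        # [3 0 1 2] -> each entry moves one slot forward
print(np.roll(caps, -1))        # [1 2 3 0] -> each entry moves one slot the other way
```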
Maps a sequence of images onto a curve in latent space instead of a normal distribution.
Yay, a new research paper overview!
It must be liberating to understand things so fluently at this level. I hope I get there one day too 😅😮💨.
This channel is 🔥
Thanks for this work
This seems really cool to me!
I wonder how plausible it would be to apply this to GANs or Diffusion Nets. Seems like a lot here could possibly be translated over. Although in diffusion nets it might be kinda weird. Like, would you just transform the noise you apply each step according to the target transformation?
You could definitely apply the Gaussian -> TPoT transformation straightforwardly in a diffusion model, although I think it would make most sense to apply it after the final diffusion step, once you have Gaussian random variables.
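If it helps to make that concrete, here is a rough sketch of what the Gaussian -> TPoT map could look like, assuming the t_i = z_i / sqrt(Σ_j W_ij u_j²) construction; the ring neighborhood and variable names below are purely illustrative, not the paper's code:

```python
import torch

def gaussian_to_tpot(z, u, W, eps=1e-8):
    """Map two independent Gaussian vectors z, u to a Topographic Product of
    Student-t (TPoT) variable: t_i = z_i / sqrt(sum_j W_ij * u_j^2).
    W is a neighborhood-summing matrix (here a simple ring for illustration)."""
    return z / torch.sqrt(W @ (u ** 2) + eps)

# Toy usage: a 1D ring topography with neighborhood width 3 (an assumption).
D = 16
W = torch.zeros(D, D)
for i in range(D):
    for off in (-1, 0, 1):
        W[i, (i + off) % D] = 1.0

z, u = torch.randn(D), torch.randn(D)   # e.g. Gaussian samples after the final diffusion step
t = gaussian_to_tpot(z, u, W)
```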
@t_andy_keller yeah that makes sense
Very informative, I love your explanation. You literally saved a lot of my time.
It looks oddly similar to direction cosines. I wonder if it can recognize perspectives, because if so it can report at what angle a specific pattern is standing right now with respect to the past. With DeepMind's grid cells, they might even act like vector cells.
I wish I had taken Topography classes in uni to understand the math behind this. I wonder what would happen if we just rolled the latent space tensor n times and calculated the L2 loss with the nth image forward in the sequence.
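For anyone who wants to try that, here is a minimal sketch of the idea (assuming a trained `decoder`, a latent tensor `z`, and that the roll is taken along the last, capsule dimension; all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def rolled_prediction_loss(z, x_future, decoder, n):
    """Roll the latent tensor n steps within each capsule, decode it, and
    compare (L2/MSE) against the image n steps ahead in the sequence."""
    z_rolled = torch.roll(z, shifts=n, dims=-1)   # shift latents along the capsule dimension
    x_pred = decoder(z_rolled)                    # decode the rolled latents
    return F.mse_loss(x_pred, x_future)           # L2 distance to the n-th future frame
```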
uuh… this might be able to give anomaly detection systems more resilience to planned anomalies (like maintenance vs breaking) compared to traditional VAEs 😮
Didn't watch the vid yet but I'm hyped!! :)
Thank you!
Missed this video somehow because of the thumbnail; noticed the ML News, though!
Pretty neat
When you see "capsules", it is always MNIST
At first I was wondering if it would have applications in video compression, but it was just a big meh.
24:22 the crux of it
If this is a VAE with a forced latent interpretation that is human-understandable, does that then make this a neuro-symbolic VAE?
Equally, if the latent space can be forced, would that then "negate" the purpose of the latent space in the first place? (It at least seems counterproductive for conservation of space.)
Latent spaces aren't necessarily just for conservation of space. As a first-order description, they reflect our prior beliefs that the very high-dimensional world (think of the space of all possible 100x100 images) can be represented by lower-dimensional data. Imagining that those lower-dimensional data have structure we can model represents a more complex and potentially useful way to introduce prior beliefs about the data.
@insidedctm By your logic, is it fruitful to start thinking about expansive latent spaces? Would changing a 3D model to 4D arbitrarily introduce quality not previously seen? I would argue possibly, but it is no longer neuro-symbolic.
TPoT = teapot... scientists these days, always trying to word play like rappers in their ML papers lol
Watch the video, Vardhan, not the comments...
ok
Sorry for spamming, I hope you can answer if you remember:
I'm looking for a work/paper where humans labeled speech utterances with what they think the talking person's facial expression looks like.
Do you know how I can find it?