Thanks for the interesting video you made, really appreciate it. But what is happening in that image after the denoiser unit, the one after the decontextualizer? What are those 0s and 1s?
If you are referring to the two-tower diffusion LCM image, then those 0s and 1s depict the attention mask in the denoiser unit (in the cross-attention layer), which masks out unwanted elements (tokens, representations, etc.) so that the attention mechanism only attends to the relevant ones. The pink row shows that they drop the conditioning for one sample, i.e. that sample is trained unconditionally, so the training is not conditioned on every sample. They trained that two-tower model both conditionally and unconditionally. Thanks for watching :)
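If it helps to see the idea in code, here is a rough sketch of how such a 0/1 cross-attention mask could work, with one sample's context dropped for unconditional training. The shapes and variable names are assumptions for illustration, not the actual LCM implementation.

```python
import torch

# Sketch only (assumed names/shapes, not the paper's code): a 0/1 mask over the
# context representations, with one sample's context dropped entirely so that
# sample is trained unconditionally.

batch_size, num_ctx, dim = 4, 6, 8

# 1/True = the denoiser may attend to this context position, 0/False = masked out
attn_mask = torch.ones(batch_size, num_ctx, dtype=torch.bool)

# Hypothetical: drop all context for the first sample (unconditional training)
attn_mask[0, :] = False

# Apply the mask inside cross-attention: masked scores become -inf before softmax
q = torch.randn(batch_size, 1, dim)        # one noisy-latent query per sample
k = torch.randn(batch_size, num_ctx, dim)  # context keys from the other tower
scores = (q @ k.transpose(-1, -2)) / dim**0.5
scores = scores.masked_fill(~attn_mask[:, None, :], float("-inf"))

# A fully masked row would give NaN after softmax; real implementations usually
# attend to a learned "null" embedding instead. Here we just zero it out.
weights = torch.softmax(scores, dim=-1)
weights = torch.nan_to_num(weights, nan=0.0)
```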
Thanks for the summary!
Thanks for watching :)
@analyticsCamp You're very welcome. By the way, I just wrote a comment on your last video.
@analyticsCamp Thanks for the explanation about the attention mask. Your explanations in the video were also very clear. 😊