I think this paper lacks a detailed ablation analysis of the major source of improvement over wav2vec 2.0. Is it the multimodality, the objective function, the approach for updating teacher parameters (an exponentially decaying average of student parameters), or a combination of these factors?
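
For anyone unsure what the teacher update in that question refers to, here is a rough sketch of an exponential moving average (EMA) over student parameters. It is plain Python for illustration only, not the paper's code: the function name `ema_update`, the decay name `tau`, and the toy values are all assumptions, and a real setup would apply this to model tensors and typically schedule `tau` during training.

```python
def ema_update(teacher, student, tau):
    """Return updated teacher parameters: tau * teacher + (1 - tau) * student.

    `teacher` and `student` are equal-length lists of floats standing in
    for model weights (a simplification for illustration).
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


# Toy example: the teacher drifts toward a fixed student over a few steps.
teacher = [0.0, 0.0]
student = [1.0, 2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, tau=0.5)
```

With a larger `tau` the teacher changes more slowly, which is one of the knobs an ablation like the one asked about would need to vary.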
Thanks HF and VB for an amazing lecture! Keep 'em comin'! 🤗