OpenAI CLIP Embeddings: Walkthrough + Insights

  • Published: 8 Sep 2024

Comments • 6

  • @johntanchongmin
    @johntanchongmin  5 months ago

    At 58:22, the weights W_i and W_t are the projections from the image model output and the text model output into the shared embedding space (which allows a change in embedding dimension). This makes it possible to use more generic text and image models with different output dimensions, since both can be mapped to the same embedding dimension.
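
    A minimal sketch of these projections, assuming hypothetical output and embedding dimensions (the variable names and sizes below are illustrative, not taken from the paper):

    ```python
    import torch
    import torch.nn.functional as F

    # Hypothetical sizes: the image and text encoders may have different output dimensions.
    d_image, d_text, d_embed = 768, 512, 256

    # W_i and W_t project each encoder's output into the shared embedding space.
    W_i = torch.randn(d_image, d_embed)
    W_t = torch.randn(d_text, d_embed)

    image_features = torch.randn(8, d_image)  # batch of image encoder outputs
    text_features = torch.randn(8, d_text)    # batch of text encoder outputs

    # Project into the shared space and L2-normalise so dot products become cosine similarities.
    image_embed = F.normalize(image_features @ W_i, dim=-1)
    text_embed = F.normalize(text_features @ W_t, dim=-1)

    print(image_embed.shape, text_embed.shape)  # both (8, 256)
    ```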

  • @johntanchongmin
    @johntanchongmin  5 months ago

    For the loss function at 1:00:15, they use Cross Entropy Loss with unnormalised logits as the input: the cosine similarity matrix is scaled by an exponential of the learnable temperature term t. That is why the resultant cosine similarity matrix needs to be multiplied by this exponential term. Inside the Cross Entropy Loss function, each exponentiated logit is then divided by the sum of the exponentiated values of all the other input terms (i.e. normalised by a softmax). See pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for details. (A code sketch of this loss follows the reply below.)

    • @johntanchongmin
      @johntanchongmin  5 months ago

      CLIP's loss function has also been described as InfoNCE loss, a common loss term for contrastive learning.
      See builtin.com/machine-learning/contrastive-learning for details.
      It is essentially Cross Entropy over cosine similarity terms, which is what is done in CLIP.
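
      A minimal sketch of this symmetric cross-entropy / InfoNCE loss over temperature-scaled cosine similarities (the function name, shapes, and toy values are illustrative, not taken from the official CLIP code):

      ```python
      import torch
      import torch.nn.functional as F

      def clip_style_loss(image_embed, text_embed, log_temperature):
          # image_embed, text_embed: (N, d), assumed already L2-normalised,
          # so image_embed @ text_embed.t() is an N x N cosine similarity matrix.
          logits = torch.exp(log_temperature) * (image_embed @ text_embed.t())  # unnormalised logits
          labels = torch.arange(logits.size(0), device=logits.device)           # matching pairs lie on the diagonal
          loss_images = F.cross_entropy(logits, labels)      # classify each image against all texts
          loss_texts = F.cross_entropy(logits.t(), labels)   # classify each text against all images
          return (loss_images + loss_texts) / 2

      # Toy usage with random, normalised embeddings.
      img = F.normalize(torch.randn(8, 256), dim=-1)
      txt = F.normalize(torch.randn(8, 256), dim=-1)
      log_temp = torch.tensor(2.6593)  # log(1 / 0.07), the initial value reported in the CLIP paper
      print(clip_style_loss(img, txt, log_temp))
      ```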

  • @johntanchongmin
    @johntanchongmin  5 months ago

    1:07:31 This is a mistake on my end - this is not the ImageNet Supervised Learning model. Li et al. is actually the Visual N-gram model, which predicts n-grams (sequences of n words) for each picture. arxiv.org/pdf/1612.09161.pdf
    Here, I believe they did not re-implement that model (its performance is quite low, at 10+% accuracy on ImageNet); rather, they only borrowed its approach of using the class name text directly, and applied that to CLIP.
    Basically, the paper was misleading - they did not even need to refer to Li et al. for that chart, as the methodology is totally different. It is just CLIP with ImageNet class names and no added prompt engineering.
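
    A hedged sketch of that comparison using the open-source openai/CLIP package (the image path and class list are placeholders): one run scores the raw class names directly, the other wraps them in a simple prompt template, which is the added prompt engineering referred to above.

    ```python
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)  # placeholder image
    class_names = ["dog", "cat", "car"]                                # placeholder class names

    # Class names used directly (the Visual N-Grams comparison setting) ...
    text_plain = clip.tokenize(class_names).to(device)
    # ... versus a simple prompt template (CLIP-style prompt engineering).
    text_prompt = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

    with torch.no_grad():
        for text in (text_plain, text_prompt):
            logits_per_image, _ = model(image, text)
            print(logits_per_image.softmax(dim=-1))
    ```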

  • @Qzariuss
    @Qzariuss 5 months ago +1

    going to try this tomorrow

  • @johntanchongmin
    @johntanchongmin  5 months ago

    The Jupyter Notebook code can be found here if you want to run your own experiments too:
    github.com/tanchongmin/TensorFlow-Implementations/tree/main/Paper_Reviews/CLIP/CLIP%20Code