Adding Self-Attention to a Convolutional Neural Network! PyTorch Deep Learning Tutorial

  • Published: 11 Sep 2024

Comments • 12

  • @profmoek7813
    @profmoek7813 3 months ago +1

    Masterpiece. Thank you so much 💗

  • @aldonin21
    @aldonin21 1 day ago +1

    Hello. I was trying to introduce a self_attention layer between a fully connected layer (with 32 neurons) and an output layer to recreate the "Patt-lite" CNN model. I used the Attention function from the maximal library. The thing is, I get mixed results for the same parameters, even with the same seed. Sometimes I quickly reach 95% accuracy, and other times it doesn't learn at all and stays at 15-30%. Without the attention added, I get a constant ~75%. Do you know why this could be happening?

    • @LukeDitria
      @LukeDitria  5 hours ago +1

      Do you use any type of regularisation? That could be it.

    • @aldonin21
      @aldonin21 4 hours ago +1

      @@LukeDitria I eventually figured out that the issue was that the self_attention layer is very sensitive to weight initialization, so I used a constant seed=42 for the kernel initialization inside the 3 dense layers for the q, k and v weights (I modified by hand the Attention layer from the maximal library, which is posted on GitHub). After this modification I ran the 5-fold CV and got stable results of around 95% for each fold :) I repeated it a few times and it always learned perfectly, and I am very happy about it. (A PyTorch sketch of this fix follows below.)
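
      A minimal PyTorch sketch of the fix described above (illustrative only, not the maximal library's actual Attention layer): the q, k and v projections are initialised from a generator with a fixed seed, so every run starts from identical attention weights.

      import torch
      import torch.nn as nn

      class SeededSelfAttention(nn.Module):
          """Single-head self-attention with deterministically seeded q/k/v projections."""
          def __init__(self, d_model, seed=42):
              super().__init__()
              self.q = nn.Linear(d_model, d_model)
              self.k = nn.Linear(d_model, d_model)
              self.v = nn.Linear(d_model, d_model)
              gen = torch.Generator().manual_seed(seed)
              std = (2.0 / (d_model + d_model)) ** 0.5  # Xavier-style scale
              for proj in (self.q, self.k, self.v):
                  with torch.no_grad():
                      # Draw the initial weights from the seeded generator
                      proj.weight.copy_(torch.randn(proj.weight.shape, generator=gen) * std)
                      proj.bias.zero_()

          def forward(self, x):  # x: (batch, tokens, d_model)
              q, k, v = self.q(x), self.k(x), self.v(x)
              scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
              return torch.softmax(scores, dim=-1) @ v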

  • @thouys9069
    @thouys9069 3 months ago

    Very cool stuff. Any idea how this compares to Feature Pyramid Networks, which are typically used to enrich the high-res early convolutional layers?
    I would imagine that an FPN works well if the thing of interest is "compact", i.e. it can be captured well by a square crop, whereas attention would work even for non-compact things. Examples would be donuts with large holes and little dough, or long sticks, etc.

    • @LukeDitria
      @LukeDitria  3 months ago

      I believe Feature Pyramid Networks were primarily designed for object detection; they are a way of bringing fine-grained information from earlier layers deeper into the network via big residual connections, but they still rely on multiple conv layers to combine spatial information. What we're trying to do here is mix spatial information early in the network, and with attention the model can also choose exactly how to do that. (A sketch of this idea follows below.)
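
      A minimal sketch of the idea discussed in the video, assuming nn.MultiheadAttention as the attention layer: each spatial position of an early feature map is treated as a token, so positions can exchange information in one step instead of waiting for the receptive field to grow over many conv layers.

      import torch
      import torch.nn as nn

      class SpatialSelfAttention(nn.Module):
          """Self-attention over the spatial positions of a conv feature map."""
          def __init__(self, channels, heads=4):
              super().__init__()
              self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
              self.norm = nn.LayerNorm(channels)

          def forward(self, x):                      # x: (B, C, H, W)
              b, c, h, w = x.shape
              tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per position
              attended, _ = self.attn(tokens, tokens, tokens)
              tokens = self.norm(tokens + attended)  # residual connection + norm
              return tokens.transpose(1, 2).reshape(b, c, h, w)

      # Example: spatial mixing placed right after the first conv block.
      backbone = nn.Sequential(
          nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
          SpatialSelfAttention(64),                  # early attention
          nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
      )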

  • @yadavadvait
    @yadavadvait 3 months ago

    Good video! Do you think this experiment of adding the attention head so early on can extrapolate well to graph neural networks?

    • @LukeDitria
      @LukeDitria  3 months ago

      Hi, thanks for your comment! Yes, Graph Attention Networks do what you are describing! (A minimal sketch follows below.)
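
      For reference, a minimal Graph Attention Network sketch, assuming PyTorch Geometric's GATConv is available: each node attends over its neighbours, the graph analogue of pixels attending over other spatial positions.

      import torch
      from torch_geometric.nn import GATConv  # assumes torch_geometric is installed

      class TinyGAT(torch.nn.Module):
          """Two graph-attention layers: neighbours are weighted by learned attention."""
          def __init__(self, in_dim, hidden, num_classes, heads=4):
              super().__init__()
              self.gat1 = GATConv(in_dim, hidden, heads=heads)         # concatenates heads
              self.gat2 = GATConv(hidden * heads, num_classes, heads=1)

          def forward(self, x, edge_index):  # x: (nodes, in_dim), edge_index: (2, edges)
              x = torch.relu(self.gat1(x, edge_index))
              return self.gat2(x, edge_index)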

  • @esramuab1021
    @esramuab1021 3 months ago

    Thank you

  • @unknown-otter
    @unknown-otter 3 months ago

    I'm guessing that adding self-attention in deeper layers would have less of an impact because each value already has a greater receptive field?
    If not, then why not add it at the end, where it would be less expensive? That is, setting aside the fact that we could incorporate it into every conv block if we had infinite compute.

    • @LukeDitria
      @LukeDitria  3 months ago

      Thanks for your comment! Yes, you are correct: in terms of combining features spatially, it won't have as much of an impact if the features already have a large receptive field. The idea is to add it as early as possible, and yes, you could add it multiple times throughout your network, though you would probably stop once your feature map is around 4x4 or so. (A rough cost illustration follows below.)
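
      A rough back-of-the-envelope illustration of why placement matters (feature-map sizes are assumed for illustration): the number of attention scores grows with the square of the number of spatial positions, and by 4x4 there are only 16 positions left to mix.

      # One token per spatial position; attention compares every pair of tokens.
      for side in (64, 32, 16, 8, 4):
          tokens = side * side
          pairs = tokens ** 2
          print(f"{side:>2}x{side:<2} feature map -> {tokens:>4} tokens, {pairs:>10,} attention scores per head")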

    • @unknown-otter
      @unknown-otter 3 months ago

      Thanks for the clarification! Great video