I can admit that this is the best explanation for GAT and GNN one can find. Fantastic explanation with very simple English. The quality of sound and video is great as well. Many thanks.
Thank you for your kind words
This is the best and most detailed explanation of Graph CNN attention I've found. Great job!
This was simply a fantastic explanation video, I really do hope this video gets more coverage than it already has. It would be fantastic if you were to explain the concept of multi-head attention in another video. You've earned yourself a subscriber +1.
Thank you, I appreciate the feedback!
Sure, I note it down :)
Simple, clear. It makes a lot of sense to go through this video with a slate and chalk in hand. The mathematics is very well explained. Thank you
amazing!!! author well done!!!
Thank you very much! This was my introduction to GAT and helped me to immediately get a good grasp of the basic concept :) I like the graphical support you provide to the explanation, it's great!
This might be the best and simple explanation of GAT one can ever find! Thanks man
Thank you very much for the video. After watching many others, I can say that yours is the best and the easiest to understand. I am very grateful to you. Regards
Thank you! :)
A wonderful and succinct explanation with crisp visualisations about both the attention mechanism and the graph neural network. The way the learnable parameters are highlighted along with the intuition (such as a weighted adjacency matrix) and the corresponding matrix operations is very well done.
I especially love your background pics.
Explained in terms of basic Neural Network terminologies!! Great work 👍
This is pretty amazing content. The way you explain the concept is pretty great and I especially like the visual style and very neat looking visuals and animations you make. Thank you!
Thank you for your kind words :)
Your work has been an absolute game-changer for me! The way you break down complex concepts into understandable and actionable insights is truly commendable. Your dedication to providing in-depth tutorials and explanations has tremendously helped me grasp the intricacies of GNNs. Keep up the phenomenal work!
This is the MOST BEST video of GCN and GAT, very great, thank you!
very well explained, provides a very intuitive picture of the concept. Thanks a ton for this awesome lecture series!
I found it hard to follow initially but after understanding GCNN thoroughly, this video is a gem.
very good explanation! clear and crisp, even I, a beginner, feeling satisfied after watching this. Should get more recognition!
Thanks
It was the best explanation, and it gave me hope of understanding these mechanisms. Everything was so well explained and depicted, thank you!
Clear explanation and visualization on attention mechanism. Really helpful in studying GNN.
Your visual explanation is super great, help many people to learn some-hour stuff in minutes!
Please make more videos on specialized topics of GNNs!
Thanks in advance!
I will soon upload more GNN content :)
Extremely helpful. Very well explained in concrete and abstract terms.
Thanks
Amazingly easy to understand. Thank you.
Just for anyone confused: in accordance with the illustration in the summary, the weight matrix should have 5 rows instead of the 4 that are shown in the video.
Great video, and I admire the fact that your topics of choice are really into the latest hot stuff in ML!
such an easy-to-grasp explanation! such a visually nice video! amazing job!
Thanks, I appreciate it :)
Very well explained. Thank you very much!
Thank you so much for this beautiful video. Have been trying out too many videos on GNN and GAN but this video definitely tops. I finally understood the concept behind it. Keep up the good work :)
This is a very great explanation covering basic GNN and the GAT. Thank you so much
clearly clear explanation, super best video lecture about GNN ever seen.
I'd love it if you could explain multi-head attention as well. You really have such a good grasp of this very complex subject.
Hi! Thanks!
Multi-head attention simply means that several attention mechanisms are applied at the same time. It's like cloning the regular attention.
What exactly is unclear here? :)
@@DeepFindr The math and code are hard to fully grasp. If you could break down the linear algebra with the matrix diagrams as you have done for single head attention, I think people would find that very helpful.
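A minimal NumPy sketch of the "cloning" idea from the reply above, assuming the single-head mechanism described in the video (shared weight matrix, attention vector, LeakyReLU, softmax over neighbors). The names `gat_head` and `multi_head`, the sizes, and the random values are illustrative assumptions, not the video's or PyG's code. Each head gets its own parameters and the head outputs are concatenated:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(X, A, W, a):
    """One attention head. X: (N, F) features, A: (N, N) 0/1 adjacency
    with self-loops, W: (F, D) weights, a: (2*D,) attention vector."""
    H = X @ W                                   # (N, D) transformed features
    D = H.shape[1]
    # e[i, j] = LeakyReLU(a^T [h_i || h_j]), split into the h_i and h_j parts
    e = leaky_relu((H @ a[:D])[:, None] + (H @ a[D:])[None, :])
    e = np.where(A > 0, e, -np.inf)             # mask non-neighbors
    e = e - e.max(axis=1, keepdims=True)        # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over neighbors
    return alpha @ H                            # (N, D) aggregated embeddings

def multi_head(X, A, heads):
    """Run every (W, a) pair independently and concatenate the results."""
    return np.concatenate([gat_head(X, A, W, a) for W, a in heads], axis=1)

rng = np.random.default_rng(0)
N, F, D, K = 5, 4, 8, 3                         # nodes, features, embedding size, heads
X = rng.normal(size=(N, F))
A = np.eye(N)                                   # self-loops
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1       # a few undirected edges
heads = [(rng.normal(size=(F, D)), rng.normal(size=2 * D)) for _ in range(K)]
out = multi_head(X, A, heads)
print(out.shape)  # (5, 24): K heads of size D side by side
```

Concatenation multiplies the embedding size by the number of heads; averaging the head outputs instead keeps the size at D.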
Great! Thank you for explaining the math and the linear algebra with the simple tables.
I really salute you for this detailed video! that's very intriguing and clear! thank you again!
very helpful tutorial, clearly explained!
4:00 do you multiply "feature node matrix" with "adjacency matrix" before multiplying it with "learnable weight matrix" ?
Great video! your explanation was amazing. Thank you!!
Thanks :)
Great job mate, keep it up
Thank you for sharing this clear and well-designed explanation.
Very clear explanation. Thank you!
simple and informative! Thank you!
Very clear and helpful. Thank you so much!
most understandable explanation so far!
Very nice video. Thanks for your work~
Very Helpful Explanation! Thank you!
best video for learning GNN thank you so much!
Good explanation of the key idea. One question: what is the difference between GAT and self-attention constrained by an adjacency matrix (e.g. Softmax(Attn*Adj))? The memory used for GAT is D*N^2, which is D times that of the intermediate output of SA. The number of nodes in a graph used with GAT thus cannot be too large because of memory size. But it seems that they both implement dynamic weighting of neighborhood information constrained by an adjacency matrix.
Hi,
Did you have a look at the implementation in PyG? pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/nn/conv/gat_conv.html#GATConv
One of the key tricks in GNNs is usually to represent the adjacency matrix in COO format. Therefore you have adjacency lists and not an n x n matrix.
Using functions like gather or index_select you can then do a masked selection of the local nodes.
Hope this helps :)
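The COO idea from this reply can be sketched in NumPy, where fancy indexing (`H[src]`) plays the role of `gather` / `index_select`. The toy graph, the variable names, and the scatter-add via `np.add.at` are assumptions for illustration and are not taken from the PyG source:

```python
import numpy as np

# Directed edges in COO format: column k is the edge source[k] -> target[k].
edge_index = np.array([[0, 1, 1, 2, 3],    # source nodes
                       [1, 0, 2, 1, 1]])   # target nodes
H = np.arange(8, dtype=float).reshape(4, 2)  # 4 nodes, embedding size 2

src, dst = edge_index
messages = H[src]                 # gather: one row per edge, shape (E, 2)

# Sum the messages arriving at each target node (a scatter-add).
out = np.zeros_like(H)
np.add.at(out, dst, messages)

# Same result as the dense product, without storing an N x N matrix.
A = np.zeros((4, 4))
A[dst, src] = 1.0
print(np.allclose(out, A @ H))
```

Memory then scales with the number of edges E rather than with N^2, which is why large sparse graphs stay tractable.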
easy and best explanation
nice work
Thanks for the best explanation.
I need more Graph Neural Network related video!!
There will be some more in the future. Anything in particular you are interested in? :)
Thank you bro. Confused head now gets the idea about GNN.
Hehe
Simply exceptional!
Great explanation, really appreciated.
Could you please make a video explaining the loss calculation and backpropagation in GNNs?
Thanks for sharing the knowledge!
You're welcome :)
Thx for the awesome explanation!
A video with attention in CNN e.g. UNet would be great :)
I slightly capture that in my video on diffusion models. I've noted it down for the future though.
I learned so much from this video! Thanks a lot
That's great :)
Excellent job, mate 👍👍
Thx :)
Wonderful explanation! Thanks
thank you. what if you also wanted to have edge features?
Hi, I have a video on how to use edge features in GNNs :)
How is the learnable weight matrix formed? Do you have some material to understand it better?
This simply comes from dense (fully connected layers). There are lots of resources, for example here: analyticsindiamag.com/a-complete-understanding-of-dense-layers-in-neural-networks/#:~:text=The%20dense%20layer's%20neuron%20in,vector%20of%20the%20dense%20layer.
Very nice, thanks for effort!
Fantastic explanation.
A great explanation, many thanks
Great walkthrough.
At 11:30, should the denominator have k instead of j?
Also, this vector w_a, is it the same vector used for all edges, there isn't a different vector to learn for each node i, right? Thank you!
Ohh yeah you are right. Should be k...
Yes, it's a shared vector, used for all edges. Thank you for the finding!
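With the index corrected as agreed in this thread, the normalization can be written as follows. The notation is an assumption based on the standard GAT formulation: N(i) is the neighborhood of node i, W the shared weight matrix, and w_a the shared attention vector.

```latex
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},
\qquad
e_{ij} = \mathrm{LeakyReLU}\!\left(w_a^{\top}\,[\,W h_i \,\Vert\, W h_j\,]\right)
```

The sum in the denominator runs over k so that j stays fixed as the neighbor whose coefficient is being computed.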
Excellent explanation 👌 👏🏾
Thank you so much for this great video.
why would the attention adjacency matrix be symmetrical? If the weight vector is learnable, then it does matter which order the two input vectors are concatenated. It doesn't seem like there would be any reason to enforce symmetry.
Thank you for the great video. I have one question, what happens if weighted graphs are used with attention GNN? Do you think adding the attention-learned edge "weights" will improve the model compared to just having the input edge weights (e.g. training a GCNN with weighted graphs)?
Hi! Yes I think so. The fact that the attention weights are learnable makes them more powerful than just static weights.
The model might still want to put more attention on a node, because there is valuable information in the node features, independent of the weight.
A real-world example of this might be the data traffic between two network nodes. If less data is sent between two nodes, you probably assign a smaller weight to the edge. Still, it could be that the information coming from one node is very important and therefore the model pays more attention to it.
Perfect video to understand GATs. However, I guess, you forgot to add sigmoid function when you demonstrate h1' as a sum of multiplications of hi* and attention values, in the last seconds of the video: 13:51
Thanks for the great explanation! Just one thing that I do not really understand, may I ask how do you get the size of the learnable weight matrix [4,8]? I understood that there are 4 rows due to the number of features for each node. However, not sure where the 8 columns come from.
I think 8 is the arbitrarily chosen dimensionality of the embedding space.
Outstanding explanation
Great video, thank you!
I have come to understand attention as key, query, value multiplication/addition. Do you know why this wasn't used and if it's appropriate to call it attention?
Hi,
Query / Key / Value are just a design choice of the transformer model. Attention is another technique of the architecture.
There is also a GNN Transformer (look for Graphormer) that follows the query/key/value pattern. The attention mechanism is detached from this concept and is simply a way to learn importance between embeddings.
Great quality thank you !
Hi, Can you tell which tool you're using to make those amazing visualizations? All of your videos on GNNs are great btw :)
Thanks a lot! Haha I use active presenter (it's free for the basic version) but I guess there are better alternatives out there. Still experimenting :)
Thanks for the video! There's a question: at 13:03, I think the 'adjacency matrix' consists of {e_ij} could be symmetric, but after the softmax operation, the 'adjacency matrix' consists of {α_ij} should not be symmetric any more. Is that right?
Yes usually the attention weights do not have to be symmetric. Is that what you mean? :)
@@DeepFindr Yes. Thanks for your reply!
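A small numeric check of the point made in this thread, using made-up symmetric scores on a 3-node path graph (all values here are hypothetical). Because the softmax normalizes each row over that node's own neighborhood, symmetric raw scores generally give asymmetric attention weights:

```python
import numpy as np

# Hypothetical symmetric raw scores e_ij on a 3-node path graph 0 - 1 - 2
# (with self-loops); -inf marks node pairs that are not connected.
e = np.array([[1.0,     2.0, -np.inf],
              [2.0,     1.0,     3.0],
              [-np.inf, 3.0,     1.0]])

# Row-wise softmax: each node normalizes over its own neighborhood.
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha = alpha / alpha.sum(axis=1, keepdims=True)

print(np.allclose(e, e.T))          # True: the raw scores are symmetric
print(np.allclose(alpha, alpha.T))  # False: the normalized weights are not
print(alpha.round(3))
```

Node 0 has two neighbors (itself and node 1) while node 1 has three, so the same raw score of 2.0 is divided by different sums.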
why replacing dot product attn with concat proj + leaky relu?
That's a good point. I think the TransformerConv is the layer that uses dot product attention. I'm also not aware of any reason why it was implemented like that. Maybe it's because this considers the direction of information (so source and target nodes) better. The dot product is commutative, so i*j is the same as j*i, so it can't distinguish between the directions of information flow. Just an idea :)
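The commutativity argument from this reply can be checked with a few hand-picked numbers. The vectors and the attention vector `a` are hypothetical, and the sketch only illustrates the idea raised in the reply, not a confirmed design rationale:

```python
import numpy as np

h_i = np.array([1.0, 0.0, 2.0])
h_j = np.array([0.0, 3.0, 1.0])
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # attention vector for [h_i || h_j]

# Dot-product score: swapping the two nodes changes nothing.
dot_ij, dot_ji = h_i @ h_j, h_j @ h_i

# Concatenation score: the first half of `a` sees the first node,
# the second half sees the second node, so the order matters.
cat_ij = a @ np.concatenate([h_i, h_j])
cat_ji = a @ np.concatenate([h_j, h_i])

print(dot_ij, dot_ji)  # 2.0 2.0
print(cat_ij, cat_ji)  # 28.0 25.0
```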
Thank you for wonderful content
This is very helpful!
Does the weight vector depend on the number of nodes in the graph? If I have a large graph, will I get a weight vector with a bigger dimension?
No the weight vector has a fixed size. It is applied to each node feature vector. For example if you have 5 nodes and a feature size of 10, then the weight matrix with 128 neurons could be (10, 128). If you have more nodes, just the batch dimension is bigger.
Hope this answers the question :)
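The shapes from this reply, as a quick NumPy sketch. The random values and the node count of 5000 are illustrative assumptions; only the shapes matter:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 128))        # 10 input features -> 128 neurons

small = rng.normal(size=(5, 10))      # graph with 5 nodes
large = rng.normal(size=(5000, 10))   # graph with 5000 nodes

print((small @ W).shape)  # (5, 128)
print((large @ W).shape)  # (5000, 128)
print(W.shape)            # (10, 128) in both cases
```

Only the leading (batch-like) dimension follows the number of nodes; the number of learnable parameters stays at 10 * 128.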
@@DeepFindr thank you so much
@@DeepFindr Is the generic GNN weight matrix the same matrix for the entire graph, or is it a different matrix for each node but applied to all the neighbours? Also, how does it deal with heterogeneous data where the input feature vectors have different dimensions?
Great video! Thank you
Why does the newly calculated state have more features than the original state? I don't understand
It's because the output dimension (neurons) of the neural network is different than the input dimension.
You could also have fewer or the same number of features.
A Great explanation
Hi, hope you're doing well.
Is there any graph neural network architecture that receives a multivariate dataset instead of graph-structured data as an input?
I'll be very thankful if you answer me, I really need it.
Thanks in advance
Hi! As the name implies, graph neural networks expect graph structured input. Please see my latest videos on how to convert a dataset to a graph. It's not that difficult :)
@@DeepFindr thanks for prompt response
Sure, I'll watch it right now.
Would you please send its link?
ruclips.net/video/AQU3akndun4/видео.html
how do you think it will behave with complete graphs only ?
Well, it will simply calculate attention weights with all neighbor nodes. So every node attends to all other nodes. It's a bit like the transformer that attends to all words.
This paper might also be interesting:
arxiv.org/abs/2105.14491
Brilliant video 👍👍👍
please use brackets and multiplication signs between matrices so i can map the mathematical formula to the visualization
Thank you for the great video! I wanted to ask - how is training of this network performed when the instances (input graphs) have varying number of nodes and/or adjacency matrix? It seems that W would not depend on the number of nodes (as its shape is 4 node features x 8 node embeddings) but shape of attention weight matrix Wa would (as its shape is proportional to the number of edges connecting node 1 with its neighbors.)
Hi! The attention weight matrix always has the same shape. The input shape is twice the node embedding size because it always takes a pair of neighboring nodes and predicts the attention coefficient for them. Of course if you have more connected nodes, you will have more of these combinations, but you can think of it like the batch dimension increases, but not the input dimension.
Say you have node embeddings of size 3. Then the input for the fully connected network is, for example, [0.5, 1, 1, 0.6, 2, 1], so the concatenated node embeddings of two neighbors (size=3+3). It doesn't matter how many of these you input into the attention weight matrix.
If you have 3 neighbors for a node it would look like this:
[0.5, 1, 1, 0.6, 2, 1]
[0.5, 1, 1, 0.7, 3, 2]
[0.5, 1, 1, 0.8, 4, 3]
The output is then 3 attention coefficients, one for each of the neighbors.
Hope this makes sense :)
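The example from this reply, written out in NumPy. The embeddings are the ones quoted above; the attention vector `w_a` and the LeakyReLU slope of 0.2 are hypothetical values chosen only for illustration:

```python
import numpy as np

h_center = np.array([0.5, 1.0, 1.0])
h_neighbors = np.array([[0.6, 2.0, 1.0],
                        [0.7, 3.0, 2.0],
                        [0.8, 4.0, 3.0]])

# One row per (center, neighbor) pair: shape (3, 6).
pairs = np.hstack([np.tile(h_center, (3, 1)), h_neighbors])

w_a = np.array([0.1, -0.2, 0.3, 0.4, -0.1, 0.2])  # hypothetical, fixed size 6

scores = pairs @ w_a
scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()                           # softmax over the 3 neighbors

print(pairs.shape)   # (3, 6)
print(alpha.shape)   # (3,)
```

A node with ten neighbors would simply produce a `pairs` array of shape (10, 6); `w_a` keeps its size of 6.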
@@DeepFindr If graph sizes are already different, I mean if one has graph_1 with 2200 nodes (which results in a 2200 x 2200 adjacency matrix) and graph_2 with 3000 nodes (a 3000 x 3000 adjacency matrix), you can zero pad graph_1 to 3000. This way you'll have a fixed input size for graph_1 and graph_2. Zero padding will create dummy nodes with no connections. So the sum with the neighboring nodes will be 0. And with dummy features for the dummy nodes, you'll end up with fixed-size graphs.
Hi, yes that's true! But for the attention mechanism used here no fixed graph size is required. It also works for a different number of nodes.
But yes padding is a good idea to get the same shapes :)
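The zero-padding idea from this exchange, as a small sketch. The helper `pad_graph` and the tiny 2-node graph are hypothetical stand-ins for the 2200- and 3000-node graphs mentioned above:

```python
import numpy as np

def pad_graph(A, X, n_max):
    """Zero-pad adjacency A (n, n) and features X (n, f) to n_max nodes."""
    n = A.shape[0]
    A_pad = np.zeros((n_max, n_max))
    A_pad[:n, :n] = A
    X_pad = np.zeros((n_max, X.shape[1]))
    X_pad[:n] = X
    return A_pad, X_pad

A = np.array([[0., 1.], [1., 0.]])          # 2-node graph
X = np.array([[1., 2.], [3., 4.]])
A_pad, X_pad = pad_graph(A, X, n_max=4)

agg = A_pad @ X_pad
print(agg[:2])   # identical to A @ X for the real nodes
print(agg[2:])   # all zeros for the dummy nodes
```

With a masked softmax, a dummy node that has no neighbors at all would need special handling (for example a self-loop), since its row of attention scores would otherwise be empty.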
Awesome video! Quick question: do you have a video explaining Cluster-GCN? And if yes, do you know if similar clustering idea can be applied to other networks (like GAT) to be able to train the model on large graphs? Thanks!
Great Explanation! As you pointed out this is one way of attention mechanism. Can you also provide references to other attention mechanisms.
Hi! The video in the description from this other channel explains the general attention mechanism used in transformers quite well :) or do you look for other attention mechanisms in GNNs?
@@DeepFindr yes thanks for sharing that too in the video. I was curious about the attention mechanisms on gnn
OK :)
In my next video (of the current GNN series) I will also Quickly talk about Graph Transformers. There the attention coefficients are calculated with a dot product of keys and queries.
I hope to upload this video this or next week :)
Hi! Are what you explain in the "Basics" and the message-passing concept the same things?
Yes, they are the same thing :) passing messages is in the end nothing else but multiplying with the adjacency matrix. It's just a common term to better illustrate how the information is shared :)
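A tiny check of this equivalence, assuming sum aggregation and a toy path graph with self-loops (both are illustrative choices): looping over neighbors and summing their features gives the same result as one multiplication with the adjacency matrix.

```python
import numpy as np

# Path graph 0 - 1 - 2 with self-loops so each node keeps its own features.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

# Explicit message passing: every node sums what its neighbors send.
passed = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        if A[i, j]:
            passed[i] += X[j]

print(np.allclose(passed, A @ X))
print(A @ X)
```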
Thanks a lot. Your videos are really helpful. I have a few questions regarding the case of weighted graphs. Would attention still be useful if the edges are weighted? If so, how do I pass edge weights to the attention network? Can you suggest a paper doing that?
The GAT layer of PyG supports edge features but no edge weights. Therefore I would simply treat the weights as one-dimensional edge features.
The attention then additionally considers these weights.
Probably the learned attention weights and the edge weights are sort of correlated, but I think it won't harm to include them for the attention calculation. Maybe the attention mechanism can learn even better scores for the aggregation :) I would just give it a try and see what happens. For example compare RGCN + edge weights with GAT + edge features.
@@DeepFindr thanks a lot for the reply.
I am following your playlist on GNN and this is the best content I get as of now.
I have a CSV file and want to apply GNN on it but I don't understand how to find the edge features from the CSV file
Thanks! Did you see my latest 2 videos? They show how to convert a CSV file to a graph dataset. Maybe it helps you to get started :)
@@DeepFindr thanks, hope i will get my answer :-)
Great Video!
How is the adjacency matrix derived?
Hi, what exactly do you mean by derived? :)
@@DeepFindr What criteria decides what feature vector is zero'd out?
This depends on the input graph. For the molecule, it's simply the atoms that are not connected to a specific atom.
All nodes that are not connected to a specific node have a 0 in the adjacency matrix entries.
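A minimal sketch of how such an adjacency matrix comes out of a bond list. The four-atom "molecule" is made up purely for illustration:

```python
import numpy as np

# Hypothetical molecule: atoms are nodes, bonds are undirected edges.
atoms = ["C", "C", "O", "H"]
bonds = [(0, 1), (1, 2), (0, 3)]

A = np.zeros((len(atoms), len(atoms)), dtype=int)
for i, j in bonds:
    A[i, j] = A[j, i] = 1      # a bond connects both directions

print(A)
# Atoms 2 (O) and 3 (H) share no bond, so A[2, 3] stays 0 and neither
# contributes to the other's aggregation.
```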
Amazing thank you 🤩
great video, thanks
Thanks a lot for the excellent tutorial. Just a quick question: when training the single-layer attention network, what are the labels of the input? How is this single-layer network trained?
Thanks!
Typically you train it with your custom problem. So the embeddings will be specific to your use-case. For example if you want to classify molecules, then the loss of this classification problem is used to optimize the layer. The labels are then the classes.
It is however also possible to train universal embeddings. This can be done by using a distance metric such as cosine distance. The idea is that similar inputs should lead to similar embeddings and the labels would then be the distance between graphs.
With both options the weights in the attention layer can be optimized.
It is also possible to train GNNs in an unsupervised fashion, there exist different approaches in the literature.
Hope this answers the question :)
@@DeepFindr Thanks! Sorry, my question might be confusing. For the node classification task, if we use the distance metrics between nodes as labels to train the weights of attention layer, then I think the attention layer that computes attention coefficient is not needed. Because we can get the importance by computing the distance metrics. I wonder how we can train weights of the shared attentional mechanism. Thanks again!
Yes, you are right. The attention mechanism using the dot product will also lead to similar embeddings for nodes that share the same neighborhood.
However the difference is that the attention mechanism is local - it only calculates the attention coefficient for the neighboring nodes.
Using the distance as targets can however be applied to all nodes in the input graph.
But I agree, the various GNN layers might be differently useful depending on the application.
Got it! Thanks again!
Very understandable! Thank you.
Can you share your presentation?
Sure! Can you send me an email to deepfindr@gmail.com and I'll attach it :) thx
@@DeepFindr Hey I have also sent you an email, could you please attach the presentation?
Hi, sorry to bother you
I have a question
What's the difference between soft-attention and self-attention?
Hi! There is soft vs hard attention, you can search for it on Google.
For self attention there are great tutorials, such as this one peltarion.com/blog/data-science/self-attention-video
Hello, thanks for sharing. Could you please explain how you get the learnable matrix: is it randomly chosen or is there a method behind it, and is this equal to the Laplacian method?
One more question: your embedding is only on the node level, right?
Hi, the learnable weight matrix is randomly initialized and then updated through back propagation. It's just a classical fully-connected neural network layer.
Yes the embedding is on the node level :)
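A toy sketch of "randomly initialized, then updated" for a single linear layer. It makes several assumptions: the gradient of a mean squared error is written by hand instead of using an autograd framework, and the data, targets, learning rate, and step count are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))              # 20 nodes, 4 features (toy data)
W_true = rng.normal(size=(4, 2))
Y = X @ W_true                            # toy regression targets

W = rng.normal(size=(4, 2))               # random initialization
loss_start = np.mean((X @ W - Y) ** 2)

for _ in range(200):
    grad = 2 * X.T @ (X @ W - Y) / Y.size  # gradient of the mean squared error
    W -= 0.1 * grad                        # gradient-descent update

loss_end = np.mean((X @ W - Y) ** 2)
print(loss_start, loss_end)
```

In a real GNN the loss comes from the downstream task (for example classification), and a framework's automatic differentiation computes the gradients for all layers.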