2:33 Contrastive Learning - Dollar Drawings
3:09 Motivation of Self-Supervised Learning
4:48 Success with DeepMind Control Suite
6:00 MoCo Framework Overview
8:08 Dynamic Dictionary Look-up Problem
8:46 Data Augmentation
9:21 Key Dictionary should be Large and Consistent
10:42 Large Dictionaries
11:32 Dictionary Solutions in MoCo
13:37 Experiments
14:26 Ablations
16:10 MoCov2 with SimCLR extensions
18:24 Training with Dynamic Targets
The queue encoding is FIFO not LIFO, correct me if I'm wrong
You are not.
I was confused by it too...
6:15 The sum in the denominator is not over all _other_ keys, but over _ALL_ keys, including the positive one. From the paper, right under the equation: "The sum is over one positive and K negative samples."
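For reference, this is the contrastive (InfoNCE) loss as written in the paper, with temperature τ; the denominator runs over the one positive key k_+ and all K negatives:

```latex
% InfoNCE loss from the MoCo paper: the positive key k_+ appears both in the
% numerator and as one of the K+1 terms summed in the denominator.
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}
```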
thanks for taking the time Connor .. still couldn't figure out 2 mysteries from the paper
a) why maintain a dictionary when we are NOT sampling from it? From the pseudocode in the paper, the only time the queue is used is while calculating the -ve logits (which has an additional issue .. if I'm taking all KEYS from the current batch, there will definitely be +ve keys in the queue when I multiply the query into the queue, right? Most will be -ve, but at least the +ve pairs in the batch WILL result in +ve keys)
b) while calculating the loss, the paper uses an N-dim array of 0's .. I understand it specifies the 0th index as the target label, so I can assume the 0th index is 1 and the rest are 0's, BUT one would assume that only the positive logits would need to be closer to the 0th index .. why are they making even the -ve logits come closer to the 0th index? .. I'm quite confused
Hey Vikram, I will try to get around to this. Please feel free to join the Weaviate slack chat to ping me again about this in case I forget.
@@connor-shorten thanks much .. I just re-read the paper and realized that the dictionary is nothing but a big sampler of ALL the -ve keys .. so my understanding is that since the query encoder is being trained to learn the best possible representation of the images, it can only do so if it comes as close as possible to the +ve key and gets as far AWAY as possible from all the -ve keys in the dictionary .. so the more -ve keys it can "escape" from, the better and crisper the image representation gets, enabling the encoder to produce richer image embeddings that can be used on low-volume datasets via supervised learning (instead of using the small dataset to create an overfit model OR, theoretically, using ImageNet's supervised pre-trained models)
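To make question b) concrete, here is a minimal PyTorch-style sketch of the loss step, following the paper's Algorithm 1 (function and variable names are illustrative, not the released code). The positive logit is placed in column 0 of each row, so the all-zeros target just says "class 0 is the correct class": cross-entropy pushes the positive logit up and all K negative logits down, it does not pull the negatives towards index 0.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the MoCo loss, following Algorithm 1 in the paper.
# q: queries from the query encoder, shape (N, C)
# k: positive keys from the momentum (key) encoder, shape (N, C)
# queue: K past keys acting as negatives, shape (C, K)
def moco_loss(q, k, queue, tau=0.07):
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()                 # no gradient to the key encoder
    l_pos = torch.einsum("nc,nc->n", q, k)[:, None]    # (N, 1) positive logits
    l_neg = torch.einsum("nc,ck->nk", q, queue)        # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau    # positive sits at index 0
    # Targets are all zeros because the positive occupies column 0 of every row;
    # cross-entropy raises that logit and lowers the K negatives.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

On question a): in Algorithm 1 the current batch's keys are enqueued only after the loss and the momentum update, so at loss time the queue does not contain the positives for the current queries and serves purely as the pool of negatives.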
What is the problem with using the same encoder for both keys and queries? Why should they be different?
If I have understood the paper correctly, using the same encoder for keys and queries results in an oscillating loss, because the encoder changes too fast for the "older" keys. (See Section 3.2, momentum update, and Section 4.1, ablation: momentum, in the paper)
Speed.
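For reference, the momentum update from Section 3.2: only θ_q is updated by back-propagation, and the key encoder trails it slowly (the paper uses m = 0.999), which is what keeps the keys in the queue mutually consistent:

```latex
% Momentum update of the key encoder parameters (MoCo, Sec. 3.2).
\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q
```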
Thanks for the video.
Why are the weights computed for the query encoder useful at all for learning the key encoder?
We are aiming for one representation space as the product of this task. The query and key encoders can't be too disentangled from each other, because then the query encoder could learn a trivial solution to map queries to their positive keys.
Good question, it's challenging to answer well; please ask any follow-up questions or comments on this.
@Patrik Vacek Because then you'll either have a small dictionary due to memory constraints, or, if you store past mini-batches, your dictionary will be inconsistent because the stored keys come from outdated versions of the encoder.
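A minimal sketch of how MoCo sidesteps both problems (helper names are illustrative, not from the official repo): the key encoder is a slow momentum copy of the query encoder, and the dictionary is a FIFO queue, so each step the newest keys are enqueued and the oldest, least consistent ones are dropped:

```python
import torch

@torch.no_grad()
def momentum_update(f_q, f_k, m=0.999):
    # Key encoder trails the query encoder instead of being a literal copy.
    for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, keys):
    # queue: (C, K) dictionary of keys; keys: (N, C) keys from the current batch.
    # FIFO: drop the N oldest columns from the front, append the N newest at the end.
    return torch.cat([queue[:, keys.size(0):], keys.T], dim=1)
```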
Thank you !!!!!
Thank you for watching!
Great summary Connor!
I love that someone is breaking down actually complex topics in AI with this much care and consistency, but it goes completely unnoticed while Siraj + Medium collect views with clickbaity content xd
Thank you, It was really helpful.
Thank you!
Thank you
I did not look at the paper but that looks similar to Siamese neural networks.
Nice
Thank you!
@@connor-shorten How can I contribute?
Thank you!