thanks for all of your transformers videos :)
you are welcome! glad to hear you like them!
Quick question… the output of BERT is several token embeddings (one per input token), right? So does that mean they need to be concatenated/added/averaged to create the feature vector to be passed to the MLP classifier?
What is the most standard method these days?
I have been trying to understand how a sentence embedding (one embedding vector for the whole sentence) is generated by BERT. What I've understood so far is as follows:
- The [CLS] token that is added to the beginning of the sentence (assuming you give the model a sentence and want the embedding vector as the output) evolves through the network just like any other token, and its final hidden state is used as a representative embedding for the whole sentence.
- Alternatively, the max/mean along each dimension of the token embedding vectors is used as the sentence embedding.
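The pooling strategies in the list above can be sketched with NumPy on a dummy hidden-state matrix. This is a stand-in, not actual model output: the array values are random, and only the shape mirrors what BERT-base returns as its last hidden state, i.e. (seq_len, 768).

```python
import numpy as np

# Stand-in for BERT's last hidden state: shape (seq_len, hidden_size).
# Position 0 holds the [CLS] token; BERT-base uses hidden_size = 768.
seq_len, hidden_size = 6, 768
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((seq_len, hidden_size))

# Option 1: [CLS] pooling -- take the first token's final hidden state.
cls_embedding = hidden_states[0]

# Option 2: mean pooling -- average over tokens, per dimension.
mean_embedding = hidden_states.mean(axis=0)

# Option 3: max pooling -- elementwise max over tokens, per dimension.
max_embedding = hidden_states.max(axis=0)

# All three collapse (seq_len, hidden_size) down to one vector.
assert cls_embedding.shape == (hidden_size,)
assert mean_embedding.shape == (hidden_size,)
assert max_embedding.shape == (hidden_size,)
```

In practice, with real inputs you would also mask out padding tokens before mean/max pooling so they don't distort the average.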
17:55, 2 years ago, 300M parameters were "quite large". Look at us now: Llama 3.1 with 405B. Over 1000x growth. What about the next 2 years...