Thank you for watching! What do you think about ModernBERT?
Amazing video!!
Thank you!!
How do BERT and ModernBERT differ from embedding models such as OpenAI's text-embedding-3-small and text-embedding-3-large?
@clay1958 BERT and ModernBERT are general-purpose NLP models. They were trained with masked language modeling (MLM), which involves predicting masked tokens in a sequence. Through MLM, the models learn contextual representations and pick up broad knowledge from their large pre-training corpora, but you usually want to fine-tune them on your downstream task to get the best results. For example, if you want to use ModernBERT for semantic search, you might want a version fine-tuned on MS MARCO, a popular retrieval dataset with contrastive examples. (This is what I did in my video.)
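To make the MLM objective concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `answerdotai/ModernBERT-base` checkpoint from the Hub) of asking the pre-trained model to fill in a masked token from context:

```python
# Sketch: the masked-language-modeling task BERT-style models are
# pre-trained on. The model predicts the [MASK] token from context.
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# The model returns candidate fillers ranked by probability.
preds = fill("Paris is the [MASK] of France.")
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```

This is the raw pre-trained behavior; for embeddings or retrieval you would fine-tune on top of it, as described above.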
OpenAI's embedding models are already fine-tuned on some internal datasets. They're designed to work well for tasks like semantic search, clustering, and recommendations right out of the box, so you don't need additional training to get started.
ModernBERT Embed (huggingface.co/nomic-ai/modernbert-embed-base) is probably a good alternative to OpenAI's embed models at the moment. It's trained on the Nomic Embed datasets and performs on par with OpenAI's text-embedding-3 models on benchmarks like MTEB. Plus, you will also get all the benefits of using open-source models.
@botsknowbest Thank you for this reply! And fantastic video by the way
Thank you!!