Text Embeddings, Classification, and Semantic Search (w/ Python Code)

  • Published: 30 Nov 2024

Comments • 68

  • @ShawhinTalebi
    @ShawhinTalebi  8 months ago +2

    👉More on LLMs: ruclips.net/p/PLz-ep5RbHosU2hnz5ejezwaYpdMutMVB0
    --
    References
    [1] ruclips.net/video/A8HEPBdKVMA/видео.htmlsi=PA4kCnfgd3nx24LR
    [2] R. Patil, S. Boit, V. Gudivada and J. Nandigam, “A Survey of Text Representation and Embedding Techniques in NLP,” in IEEE Access, vol. 11, pp. 36120-36146, 2023, doi: 10.1109/ACCESS.2023.3266377.
    [3] owasp.org/www-project-top-10-for-large-language-model-applications/

  • @ccapp3389
    @ccapp3389 8 months ago +30

    Love that you’re bringing real knowledge, insights and code here! So many AI YouTubers are just clickbaiting their way through the hype cycle by reading the same SHOCKING news as everyone else.

    • @tylerpoore97
      @tylerpoore97 8 months ago

      I mean, the guy clickbaited the thumbnail. Also, this is insanely old news at this point (if considered news at all).
      Video content was on point, but we shouldn't be promoting clickbait methods.

    • @ccapp3389
      @ccapp3389 8 months ago +3

      I clicked this video for technical explanations and code, not news. There are plenty of dudes reading off the same SHOCKING news across AI YouTube. I got exactly what I wanted from this video and feel like the title was clear.

  • @krishnavamsiyerrapatruni5385
    @krishnavamsiyerrapatruni5385 6 months ago +6

    I have learnt so much by watching the entire series. Thank you so much Shaw! I think this is one of the best playlists out there for anyone looking to get into the field of LLMs and GenAI.

    • @ShawhinTalebi
      @ShawhinTalebi  6 months ago

      Great to hear! Feel free to share any suggestions for future content :)

  • @youngzproduction7498
    @youngzproduction7498 3 months ago +1

    I love how you give a low-level lesson. It helps me understand more about the topic and also see more potential for applying it in other areas. Long story short, you got a new subscriber. I will consume all your knowledge and make the best out of it.

    • @ShawhinTalebi
      @ShawhinTalebi  3 months ago

      Thanks for subscribing :) Glad it was helpful!

  • @BrandonFoltz
    @BrandonFoltz 7 months ago +4

    Great video. The practical use cases for embeddings themselves are undervalued IMHO and this video is fantastic for showing ways to use embeddings. Even if you use OpenAI embeddings, they are dirt cheap, and can provide fantastic vectors for further analysis, manipulation, and comparison.

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      Thanks Brandon! I completely agree. Agents are great, but they seem to overshadow all the relatively simple text embedding-based applications.

  • @aldotanca9430
    @aldotanca9430 7 months ago +1

    Exceptionally clear as always!

  • @pramodkumarsola
    @pramodkumarsola 8 months ago +2

    You are the real guy to subscribe and learn

  • @ethanlazuk
    @ethanlazuk 7 months ago

    SEO here, enjoyed your examples of semantic search and explanation of hybrid search. Great vid and easy to follow. Will explore your channel. Cheers!

  • @obaydmir8353
    @obaydmir8353 7 months ago +1

    Clear and understandable explanation of these concepts. Thanks and really enjoyed!

  • @PRColacino
    @PRColacino 7 months ago +1

    Congrats man! Keep going with more real examples with code sharing

  • @superfreiheit1
    @superfreiheit1 3 months ago

    Awesome teaching quality. A simple start into text embeddings for beginners. But it would be better to use an open-source LLM to create the embeddings. The OpenAI API is paid.

    • @ShawhinTalebi
      @ShawhinTalebi  3 months ago

      Thanks :)
      I use open source embeddings in my latest video: ruclips.net/video/3JsgtpX_rpU/видео.htmlsi=ricuwaoSJSYnSAQM&t=843

  • @banoffanimations5704
    @banoffanimations5704 3 months ago

    Hi Shaw!!! Really great stuff... I am loving this series!!! I echo with everyone else agreeing that your videos are super informative and hands on!!! Very Very Useful!!! Many thanks man!

  • @enmutlu-c4j
    @enmutlu-c4j 3 months ago

    great video! super clear and on point. Thanks Shaw!

  • @blackswann9555
    @blackswann9555 7 months ago

    Excellent work sir! ❤

  • @greatwall2003
    @greatwall2003 5 months ago

    Thanks, useful material 👍

  • @eliskucevic340
    @eliskucevic340 7 months ago

    I've been using embeddings for a while, but I find that agents can call specialized tools that can be very useful depending on the application.

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      Thanks for sharing your insight! Indeed agents and embeddings solve different problems. However, some agent use cases could be reconfigured to be solved with text embeddings + human in the loop.

  • @ifycadeau
    @ifycadeau 8 months ago +1

    Wow! Thank you for breaking this down, been trying to figure it out!

  • @cinematicsounds
    @cinematicsounds 8 months ago

    Thank you, very good information. I will try to make a database for audio sound effects using vector databases, text to audio.

  • @KrisTC
    @KrisTC 8 months ago +3

    I have watched most of the videos in this series and found them really helpful. Something I am looking for that I haven't seen you cover yet is some more guidance on preparing data for either RAG or fine-tuning. I am sure you have practical tips you can give. I have a large old codebase, and we have loads of documentation, tutorials, etc., but it is a lot for someone to pick up. This new world of GPTs seems perfect for building an assistant. I will be able to work through it OK, but I suspect there will be a load of learnt best practices or pitfalls to avoid that are a bit more subtle. For example, I am looking through our support emails / tickets; lots of them start with "please send logs" :) and after a load of back and forth we have the info. This is much like a conversation with ChatGPT. For fine-tuning, is it best to fine-tune on a whole thread? Or on each chunk of the conversation?

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago +3

      Great suggestion! I plan to do a series on data engineering and this would be a great thing to incorporate into it.
      For your use case, the best choice would depend on what you want the assistant to do. For instance, if you want the assistant to mimic the support rep, then you'd likely want to use each message in the thread with its appropriate context (i.e. preceding messages).
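
A minimal sketch of the "each message with its preceding context" idea from the reply above. The thread contents, the role names, and the `build_examples` helper are all hypothetical and only meant to illustrate one way the data prep could look.

```python
# Hypothetical support thread; roles and text are made up for illustration.
thread = [
    {"role": "customer", "text": "The app crashes on startup."},
    {"role": "support", "text": "Please send logs."},
    {"role": "customer", "text": "Logs attached."},
    {"role": "support", "text": "Thanks, the config file is corrupted. Please reinstall."},
]


def build_examples(thread, target_role="support"):
    """Return one (context, response) pair per message written by target_role."""
    examples = []
    for i, message in enumerate(thread):
        if message["role"] != target_role:
            continue
        # The context is every message that came before the target message.
        context = "\n".join(f'{m["role"]}: {m["text"]}' for m in thread[:i])
        examples.append({"context": context, "response": message["text"]})
    return examples


examples = build_examples(thread)
```

Each support reply becomes its own training example, so a four-message thread with two support replies yields two examples, the second one carrying the longer context.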

    • @KrisTC
      @KrisTC 7 months ago

      @@ShawhinTalebi thanks for the tip. That’s what I ended up doing. I haven't tried actually fine-tuning yet. Just finished my data prep. Looking forward to your next series 😊

  • @uzairmalik7084
    @uzairmalik7084 4 months ago

    Hey Shaw, thanks for this wonderful series. I have completed it and learned so many new things, but one thing I felt is that the code is very high level, and it feels to me like I have to remember most of the things while coding and practicing with those Hugging Face models. Do you have any suggestions for that?

    • @ShawhinTalebi
      @ShawhinTalebi  4 months ago

      I think the best way to solidify your understanding is to apply it to real-world use cases.

  • @jamespeters1617
    @jamespeters1617 2 months ago

    Great info

  • @hoseinsalahshoor635
    @hoseinsalahshoor635 1 month ago

    Thank you for your useful video. I have a question regarding openai embedding model. Does openai fine-tune its model if we use these (embedding) models? ... My data is private and I don't want to expose it. Thanks

    • @ShawhinTalebi
      @ShawhinTalebi  1 month ago

      OpenAI's privacy policy says they do not train on API data: openai.com/enterprise-privacy/

  • @databasemadness
    @databasemadness 7 months ago

    Love you shaw!

  • @avi7278
    @avi7278 8 months ago

    Great format subd

  • @tamilinfomite
    @tamilinfomite 7 months ago

    Hi Shawhin, thanks. I ran into a problem. I tried to use a Sentence_transformers model by installing it. It always gives an error: no file found 'config_sentence_transformers.json' in the .cache/huggingface/... folder. Your help is appreciated.

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      Not sure what the issue could be. Did you install all the requirements on the GitHub?
      github.com/ShawhinT/RUclips-Blog/tree/main/LLMs/text-embeddings

  • @dr.aravindacvnmamit3770
    @dr.aravindacvnmamit3770 8 months ago

    Excellent!

  • @pepeballesteros9488
    @pepeballesteros9488 7 months ago

    Many thanks for the video Shaw, great content!
    One simple question: when using OpenAI's embedding model, each resume is represented by an embedding vector. Is this embedding computed as the average of all word vectors?

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      Great question! Embedding models do not operate on specific words, but rather on the text as a whole. This is valuable because the meaning of specific words is driven by the context they appear in.
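
A toy illustration of why a plain average of per-word vectors would lose context. This is not how any particular embedding model works; the word vectors and the `average_embedding` helper are hypothetical. Averaging ignores word order, so two sentences with opposite meanings end up with the same vector.

```python
# Toy static word vectors (hypothetical values, not from any real model).
word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def average_embedding(text):
    """Average the static vectors of the words in the text."""
    vectors = [word_vectors[word] for word in text.lower().split()]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


a = average_embedding("dog bites man")
b = average_embedding("man bites dog")

# The two sentences mean different things, yet their averages are identical.
same_vector = a == b
```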

  • @avi7278
    @avi7278 8 months ago

    finally someone who speaks with their hands more than I do, lol...

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      😂😂.. 👋 👍

    • @toddai2721
      @toddai2721 6 months ago

      I call him the hand whisperer.... but really loud.

  • @AlexandreMarr-uq8pw
    @AlexandreMarr-uq8pw 6 months ago

    Can only two kinds of classification be made? If I have lots of types, for example, product classification, can it be applied?

    • @ShawhinTalebi
      @ShawhinTalebi  6 months ago

      You can have several target classes. Here's a nice write-up about doing that with sklearn: scikit-learn.org/stable/modules/multiclass.html
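
A rough sketch of classification with more than two target classes on top of embeddings. Note the swap: instead of a scikit-learn classifier, this uses a simple nearest-centroid rule in plain Python so it runs without extra packages. The 2-D "embeddings", the category labels, and the helper names are all hypothetical.

```python
import math

# Hypothetical 2-D "embeddings" with product-category labels.
training = [
    ([0.9, 0.1], "electronics"),
    ([0.8, 0.2], "electronics"),
    ([0.1, 0.9], "clothing"),
    ([0.2, 0.8], "clothing"),
    ([0.7, 0.7], "toys"),
    ([0.6, 0.6], "toys"),
]


def centroids(data):
    """Compute the mean embedding of each class."""
    sums, counts = {}, {}
    for vec, label in data:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(vec, class_centroids):
    """Return the class whose centroid is most similar to vec."""
    return max(class_centroids, key=lambda label: cosine(vec, class_centroids[label]))


class_centroids = centroids(training)
prediction = predict([0.95, 0.05], class_centroids)
```

The same `predict` call works for any number of classes, since it only compares the new vector against one centroid per label.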

  • @sherpya
    @sherpya 8 months ago

    Is it possible to extract software names from the query with a text classifier and apply keyword search only to those (e.g. Apache Airflow)? Also, what DB do you suggest? Is Postgres with a vector extension good?

    • @ShawhinTalebi
      @ShawhinTalebi  8 months ago

      Good question. While I haven't seen a text classifier used for KW search, that could be a clever way to implement it.
      There are several DBs to choose from these days. I'd say go with what makes sense with the existing data infrastructure. If starting from scratch, Elasticsearch or Pinecone might be good jumping-off points.
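
One way the idea in this thread could be sketched: pull known software names out of the query, use them as a keyword filter, then rank what is left by embedding similarity. A plain dictionary lookup stands in for the text classifier here, and the tool list, documents, toy 2-D embeddings, and function names are all hypothetical.

```python
import math

# Hypothetical list of software names to look for in a query.
KNOWN_TOOLS = {"apache airflow", "postgres", "pinecone"}

# Hypothetical documents with toy 2-D embeddings.
documents = [
    {"text": "Built data pipelines with Apache Airflow", "embedding": [0.9, 0.1]},
    {"text": "Managed Postgres databases", "embedding": [0.8, 0.3]},
    {"text": "Scheduled ETL jobs in Apache Airflow and Python", "embedding": [0.6, 0.6]},
]


def extract_tools(query):
    """Return the known software names that appear in the query."""
    lowered = query.lower()
    return [tool for tool in sorted(KNOWN_TOOLS) if tool in lowered]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hybrid_search(query, query_embedding, docs):
    """Keyword-filter on extracted tool names, then rank by cosine similarity."""
    tools = extract_tools(query)
    # Keep only documents that mention every extracted tool (all docs if none found).
    candidates = [d for d in docs if all(t in d["text"].lower() for t in tools)]
    return sorted(
        candidates,
        key=lambda d: cosine(query_embedding, d["embedding"]),
        reverse=True,
    )


results = hybrid_search(
    "data engineer with apache airflow experience", [1.0, 0.0], documents
)
```

In a real system the query embedding would come from the same embedding model used for the documents; here it is hard-coded to keep the sketch self-contained.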

    • @aldotanca9430
      @aldotanca9430 7 months ago

      lanceDB is also quite good.

  • @Whysicist
    @Whysicist 8 months ago

    LDA - Latent Dirichlet Allocation is kinda trivial these days… Matlab text analytics toolbox works great on pdf’s with bi-grams… a la bag-of-N-Grams. Cool… thanks…

  • @alroygama6166
    @alroygama6166 6 months ago

    Can I use these embeddings with BERT-based models instead?

    • @ShawhinTalebi
      @ShawhinTalebi  6 months ago

      Yes! In fact, sentence transformers has a few BERT-based embedding models: sbert.net/docs/pretrained_models.html

  • @enestemel9490
    @enestemel9490 10 days ago

    It's not very good practice to compare AI assistants and text embeddings, since AI assistants also work with tokens (which are the embeddings for each text chunk).

  • @skarloti
    @skarloti 7 months ago

    This is not always a good solution if we have multilingual text. I see that LLMs now have context windows of 1M tokens/characters. They offer other solutions with functions and external API calls.

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      I'm curious about this. I've seen embedding models that can handle multiple languages, so I'd expect them to work pretty well. Can you shed any more light on this?

  • @chamaljayasinghe4210
    @chamaljayasinghe4210 6 months ago

    ✌✌🧑‍💻🧑‍💻

  • @tylerpoore97
    @tylerpoore97 8 months ago

    Soo, unlike your thumbnail, this has nothing to do with agents...
    Why mention them?

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago

      Thumbnail is "Forget AI agents... use this instead". I explain this a bit @3:15.

  • @cirtey29
    @cirtey29 7 months ago

    By end of next year all the drawbacks of LLMs will be erased.

  • @onjajaboy
    @onjajaboy 7 months ago

    are you persian

  • @AndresSolar-y3g
    @AndresSolar-y3g 6 months ago

    cool...

  • @bentobin9606
    @bentobin9606 7 months ago

    Is text embedding the same as the text tokenization done in training?

    • @ShawhinTalebi
      @ShawhinTalebi  7 months ago +2

      Good question! These are different things.
      Tokenization is the process of taking some text and deriving a vocabulary from which the original text can be generated, where each element in the vocabulary is assigned a unique integer value.
      Text embeddings, on the other hand, take tokens and translate them into meaningful (numerical) representations.
      I talk a little more about tokenization here: ruclips.net/video/czvVibB2lRA/видео.htmlsi=FwqmkB9Ltyq45n0w&t=348
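
A small sketch of the distinction described above. It splits on whitespace for simplicity (real tokenizers usually work on subword units), and the embedding table values are placeholders; in a real model they would be learned. All names here are hypothetical.

```python
text = "the cat sat on the mat"

# Tokenization: build a vocabulary and map each token to a unique integer ID.
vocab = {}
for word in text.split():
    if word not in vocab:
        vocab[word] = len(vocab)
token_ids = [vocab[word] for word in text.split()]

# Embedding: look up a numerical vector for each token ID.
# Placeholder values only; a trained model would hold learned vectors here.
embedding_table = [[float(i), float(i) * 0.5] for i in range(len(vocab))]
token_vectors = [embedding_table[token_id] for token_id in token_ids]

# The original text can be regenerated from the IDs and the vocabulary.
id_to_word = {token_id: word for word, token_id in vocab.items()}
reconstructed = " ".join(id_to_word[token_id] for token_id in token_ids)
```

The repeated word "the" gets the same ID both times, and the IDs alone carry no meaning; the vectors looked up from them are where a model stores meaning.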