Introducing KeyLLM - Keyword Extraction with Mistral 7B and KeyBERT

  • Published: 20 Sep 2024
  • Science

Comments • 36

  • @jeroenadamdevenijn4067
    @jeroenadamdevenijn4067 10 months ago +8

    Things are moving fast. 6 months ago I was contemplating BERTopic or Neural Search with Elasticsearch using all-MiniLM-L6-v2 and ANN algorithm.
    Now we have Mistral, KeyBERT and RAG. Just playing around a bit, my use case is topic extraction from relatively small documents.

  • @dixon1e
    @dixon1e 6 months ago

    This is an amazing display of how to make these important new technologies easy to understand and execute. Well done, thank you.

  • @jevy_yang
    @jevy_yang 1 month ago

    Thank you very much for sharing

  • @LaHoraMaker
    @LaHoraMaker 11 months ago +2

    Hi Maarten, the diagrams make the process stages quite clear. Thanks for sharing!
    BTW the link to the O'Reilly book is broken.

  • @roniepaolo6218
    @roniepaolo6218 7 months ago

    Thank you, Maarten! This is amazing content. I'm having some problems including RAG in KeyBERT, so a video about it would be nice.

  • @PerFeldvoss
    @PerFeldvoss 11 months ago +2

    Nice video, but I guess I'm missing the problem you want to solve... When do you need to extract keywords? Say you want to cluster documents: your approach would work, but isn't what you suggest the exact opposite?

    • @spicytuna08
      @spicytuna08 9 months ago +2

      I am not clear on the business use case for this.

  • @fanchuankang1228
    @fanchuankang1228 10 months ago +2

    What are the appropriate application scenarios for KeyBERT/KeyLLM?

  • @ramp2011
    @ramp2011 11 months ago +2

    Thank you for the video. What is the definition of a keyword here? Is it picking all the words present (except the stop words)? Say I want to return words of one type (like locations or food ingredients): how does one modify this example to get those words? Thank you

    • @MaartenGrootendorst
      @MaartenGrootendorst  11 months ago +1

      It depends on your implementation, but the LLM chooses the keywords (the words that best represent the document) itself based on the instructions in the prompt. If you want to modify those words, like extracting certain types, then you would need to update the prompt itself to account for that. You can find an example of that in the documentation here: maartengr.github.io/KeyBERT/guides/keyllm.html#2-extract-keywords-with-keyllm
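To illustrate the answer above, here is a minimal sketch of a custom prompt that restricts extracted keywords to one type (food ingredients, as the question suggested). The `[DOCUMENT]` placeholder and the wording of the prompt follow the pattern in the linked KeyBERT guide; the `render_prompt` helper is a hypothetical stand-in for the substitution that KeyBERT's LLM wrappers perform internally before calling the model.

```python
# Hedged sketch: a KeyLLM-style prompt that asks only for a certain
# keyword type. In KeyBERT, [DOCUMENT] is replaced with each document
# before the prompt is sent to the LLM.
PROMPT = """
I have the following document:
[DOCUMENT]

Extract only keywords that are food ingredients mentioned in this document.
Return the keywords as a comma-separated list and nothing else.
"""

def render_prompt(template: str, document: str) -> str:
    """Mimic the substitution step: drop the document into the template."""
    return template.replace("[DOCUMENT]", document)

doc = "This pasta is made with tomatoes, basil, and fresh garlic."
print(render_prompt(PROMPT, doc))
```

In an actual KeyLLM setup you would pass a prompt like this to the LLM wrapper (e.g. `TextGeneration(..., prompt=PROMPT)`) rather than rendering it yourself.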

  • @araldjean-charles3924
    @araldjean-charles3924 7 months ago

    Very nice! Next step: key phrases?

  • @KeThienNguyen
    @KeThienNguyen 5 months ago

    Thanks for the vids! What if I have thousands of documents that are preprocessed web pages? Will it be super slow? What can I do to make it more performant?

    • @MaartenGrootendorst
      @MaartenGrootendorst  5 months ago +1

      Thousands is not that much and is unlikely to be slow. If you have millions of documents, you can follow this guide for running BERTopic on large data: github.com/MaartenGr/BERTopic?tab=readme-ov-file#getting-started

  • @LeoLogics
    @LeoLogics 11 months ago

    thank you for the video.

  • @MyKlent
    @MyKlent 21 days ago

    Thanks Maarten. Can this approach extract keywords that are not mentioned in the document but are relevant to it? Keywords not mentioned, but very important for helping identify the document?

    • @MaartenGrootendorst
      @MaartenGrootendorst  21 days ago +1

      @@MyKlent I believe it should if you instruct the LLM to do so. It might require a bit of prompt engineering but I think this is already default behaviour. If not, feel free to post an issue on the repository as it should be straightforward to implement if it's missing.

    • @MyKlent
      @MyKlent 18 days ago

      @@MaartenGrootendorst Great thanks! I'll try this one out.

    • @MyKlent
      @MyKlent 7 days ago

      Hi @@MaartenGrootendorst, just wanted to share and ask for your input. This is an issue I encountered using Ubuntu. Following every detail of your steps, I get an error:
      "Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU." with keyword results of [['']].
      I started on Windows and it runs fine there, but when I made a similar setup in Ubuntu, I got that error. Any ideas?

    • @MaartenGrootendorst
      @MaartenGrootendorst  7 days ago

      @@MyKlent It might be worthwhile to use llama-cpp-python for now instead, as that tends to be a bit easier to work with. Other than that, it's difficult to say. The repo would be the best place for these kinds of issues.

  • @xspydazx
    @xspydazx 4 months ago

    Thanks... I learned!
    It seems like we can use KeyLLM like a chain?
    So we can also do entity recognition?
    Can KeyLLM be compiled to GGUF? And can it be trained, like an LLM?
    Can the LLM be detached or is it fixed? Can we connect the model to KeyLLM, train it, and detach the KeyLLM wrapper? Then attach it only when I need it, like a LoRA!??
    Can we combine Mistral with the Nomic AI embeddings model?

  • @vesaalexandru6853
    @vesaalexandru6853 8 months ago

    Great content. What values do you have by default for temperature and top_p?

  • @dibu28
    @dibu28 5 months ago

    Is it possible to extract document metadata the same way with Mistral?

  • @PhucLe93
    @PhucLe93 11 months ago

    Hi Maarten, I don't see where you put the candidate keywords in example 3; you only put the document embedding.

    • @MaartenGrootendorst
      @MaartenGrootendorst  11 months ago +1

      You can use the [CANDIDATE] tag for that in the prompt. You can find more about it here: maartengr.github.io/KeyBERT/guides/keyllm.html#3-fine-tune-candidate-keywords
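To make the candidate-keyword answer concrete, here is a minimal sketch of a prompt that combines a document with a list of candidate keywords. The exact tag name should be checked against the linked KeyBERT guide (the docs use a candidates placeholder alongside `[DOCUMENT]`); the `render_prompt` helper below is a hypothetical stand-in for the substitution KeyBERT performs internally.

```python
# Hedged sketch: fine-tuning candidate keywords with a KeyLLM-style prompt.
# The wrapper replaces [DOCUMENT] with the text and [CANDIDATES] with
# candidate keywords (e.g. from KeyBERT's embedding step), so the LLM
# only refines an existing list instead of extracting from scratch.
PROMPT = """
I have the following document:
[DOCUMENT]

With the following candidate keywords:
[CANDIDATES]

Improve the candidate keywords so that they best describe the document.
Return them as a comma-separated list and nothing else.
"""

def render_prompt(template: str, document: str, candidates: list[str]) -> str:
    """Mimic the substitution step for both placeholders."""
    filled = template.replace("[DOCUMENT]", document)
    return filled.replace("[CANDIDATES]", ", ".join(candidates))

doc = "KeyLLM extracts keywords from documents using a large language model."
print(render_prompt(PROMPT, doc, ["keyword extraction", "language model"]))
```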

  • @GeandersonLenz
    @GeandersonLenz 10 months ago

    Hey Maarten, congrats on this amazing work! If I want to use input in Portuguese and get output in Portuguese, what do I need to change? Just the prompt? What is the input format if I have .docx files? Thanks.

  • @streamocu2929
    @streamocu2929 10 months ago

    really good

  • @aranyapatra65
    @aranyapatra65 11 months ago

    Hi Maarten, thanks for the informative video. I was trying it out; can you suggest what I have to do, as I am getting the warning "WARNING:ctransformers:Number of tokens (513) exceeded maximum context length (512)."

    • @aranyapatra65
      @aranyapatra65 11 months ago

      I have already tried splitting the document into

  • @hemanth8195
    @hemanth8195 11 months ago

    I think the Mistral LLM is a good choice 👍🏾

  • @VivekGohel-d9k
    @VivekGohel-d9k 11 months ago

    What if my document is very large, like millions of records? Won't it throw an out-of-tokens error by exceeding the model's token limit?
    Can you please suggest a solution for that?

    • @MaartenGrootendorst
      @MaartenGrootendorst  11 months ago +1

      You can split up a document into paragraphs and feed those into KeyLLM, but there are many other solutions: building a RAG pipeline, summarizing the documents, using the tokenizer to split up the text, using an LLM with a larger context, etc.
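The first workaround above (splitting a document into paragraphs before feeding it to KeyLLM) can be sketched as follows. This is a minimal illustration, not KeyBERT code: token counts are approximated by whitespace word counts, whereas a real setup would use the model's own tokenizer, and the chunk budget of 400 is an arbitrary placeholder below the 512-token context mentioned elsewhere in the thread.

```python
# Hedged sketch: greedily pack paragraphs into chunks that stay under a
# rough token budget, so each chunk fits the LLM's context window.
def chunk_paragraphs(text: str, max_tokens: int = 400) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in text.split("\n\n"):
        n = len(para.split())  # crude proxy for the real token count
        if current and used + n > max_tokens:
            # Current chunk is full: flush it and start a new one.
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)  # note: a single oversized paragraph stays whole
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each resulting chunk could then be passed to KeyLLM's `extract_keywords`, with the per-chunk keyword lists merged afterwards.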

  • @AshwinChandarR
    @AshwinChandarR 11 months ago +1

    Very informative @Maarten. Will you mentor me?

    • @ksrajavel
      @ksrajavel 11 months ago +1

      What a wild ask! 😅
      Hit or miss.

    • @MaartenGrootendorst
      @MaartenGrootendorst  11 months ago +2

      Although I am currently too busy for extensive mentoring, feel free to reach out on LinkedIn! I cannot make any promises, but if you have anything you need my perspective on, or are struggling with something, perhaps I can help.

  • @programming1784
    @programming1784 11 months ago +1

    God what a Dutch accent, coming from a Dutchie 😂