Things are moving fast. Six months ago I was contemplating BERTopic or neural search with Elasticsearch using all-MiniLM-L6-v2 and an ANN algorithm.
Now we have Mistral, KeyBERT, and RAG. I'm just playing around a bit; my use case is topic extraction from relatively small documents.
This is an amazing display of how to make these important new technologies easy to understand and execute. Well done, thank you.
Thank you very much for sharing
Hi Maarten, the diagrams make the process stages quite clear. Thanks for sharing!
BTW the link to the O'Reilly book is broken.
Thank you, Maarten! This is amazing content. I'm having some problems while including RAG in KeyBERT, so a video about it would be nice.
Nice video, but I guess I'm missing what problem you want to solve... When do you need to extract keywords? If you want to cluster documents, your approach would work, but isn't what you suggest the exact opposite?
I am not clear on what the business use case for this is.
What are the appropriate application scenarios for KeyBERT/KeyLLM?
Thank you for the video. What is the definition of a keyword here? Is it picking all the words present (except the stop words)? Say I want to return words of one type (like locations or food ingredients); how does one modify this example to get those words? Thank you.
It depends on your implementation, but the LLM chooses the keywords (the words that best represent the document) itself based on the instructions in the prompt. If you want to modify those words, for example extracting only certain types, then you would need to update the prompt itself to account for that. You can find an example of that in the documentation here: maartengr.github.io/KeyBERT/guides/keyllm.html#2-extract-keywords-with-keyllm
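To make that concrete, here is a minimal sketch of steering KeyLLM toward one keyword type purely through the prompt. The `[DOCUMENT]` tag is the placeholder KeyLLM fills in with each document; the ingredient-extraction wording and the example document are my own illustration, and the commented KeyLLM calls assume you already have a loaded Hugging Face `generator` pipeline.

```python
# Sketch of a type-specific KeyLLM prompt; [DOCUMENT] is filled in per document.
prompt = """
I have the following document:
[DOCUMENT]

Extract only the food ingredients mentioned in this document.
Return them as a comma-separated list and nothing else.
"""

# With KeyLLM it would look roughly like this (requires keybert + a loaded LLM):
# from keybert import KeyLLM
# from keybert.llm import TextGeneration
# llm = TextGeneration(generator, prompt=prompt)   # generator = HF text-gen pipeline
# kw_model = KeyLLM(llm)
# keywords = kw_model.extract_keywords(["Mix flour, eggs and butter ..."])

# Standalone check of what the filled-in prompt looks like:
doc = "Mix flour, eggs and butter, then bake for 30 minutes."
filled = prompt.replace("[DOCUMENT]", doc)
print(filled)
```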
Very nice! Next step: key phrases?
Thanks for the vids! What if I have thousands of documents that are preprocessed web pages? Will it be super slow? What can I do to make it more performant?
Thousands is not that much and is unlikely to be slow. If you have millions of documents, you can follow this guide for running BERTopic on large data: github.com/MaartenGr/BERTopic?tab=readme-ov-file#getting-started
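One common speed-up described in the KeyLLM guides is to embed the documents first and only send one representative of each group of near-identical pages to the LLM. The keybert and sentence-transformers calls below are left as comments since they need downloaded models; the standalone part just demonstrates the cosine-similarity grouping idea with toy 2-d vectors and a threshold I picked for illustration.

```python
import math

# With KeyLLM this is roughly (requires keybert + sentence-transformers):
# from sentence_transformers import SentenceTransformer
# from keybert import KeyLLM
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = model.encode(docs)
# keywords = kw_model.extract_keywords(docs, embeddings=embeddings, threshold=0.75)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: docs 0 and 1 are near-duplicates, doc 2 is different.
embs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
threshold = 0.75

groups = []  # each group of similar docs would share one LLM call
for i, e in enumerate(embs):
    for g in groups:
        if cosine(e, embs[g[0]]) >= threshold:
            g.append(i)
            break
    else:
        groups.append([i])

print(groups)  # → [[0, 1], [2]]
```

With millions of pages this kind of grouping can cut the number of LLM calls dramatically, since the expensive step runs once per group instead of once per document.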
thank you for the video.
Thanks Maarten. Can this approach extract keywords that are not mentioned in the document but are relevant to it? That is, keywords that don't appear in the text but are very important for identifying the document?
@@MyKlent I believe it should if you instruct the LLM to do so. It might require a bit of prompt engineering, but I think this is already default behaviour. If not, feel free to post an issue on the repository, as it should be straightforward to implement if it's missing.
@@MaartenGrootendorst Great thanks! I'll try this one out.
Hi @@MaartenGrootendorst, I just wanted to share and ask for your input. This is an issue I encountered on Ubuntu. Following every detail of your steps, I get an error:
"Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU." with keyword results of [['']].
I started on Windows and it's running fine there, but when I made a similar setup on Ubuntu, I got that error. Any ideas?
@@MyKlent It might be worthwhile to use llama-cpp-python for now instead, as that tends to be a bit easier to work with. Other than that, it's difficult to say. The repo would be the best place for these kinds of issues.
Thanks... I learned!
It seems like we can use KeyLLM like a chain?
So we can also do entity recognition?
Can KeyLLM be compiled to GGUF? And can it be trained, like an LLM?
Can the LLM be detached, or is it fixed? Can we connect the model to KeyLLM, train it, detach the KeyLLM wrapper, and then attach it only when needed, like a LoRA?
Can we combine Mistral with the Nomic AI embeddings model?
Great content. What values do you use by default for temperature and top_p?
Is it possible to extract document metadata the same way with Mistral?
Hi Maarten, I don't see where you put the candidate keywords in example 3; you only pass the document embeddings.
You can use the [CANDIDATE] tag for that in the prompt. You can find more about it here: maartengr.github.io/KeyBERT/guides/keyllm.html#3-fine-tune-candidate-keywords
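For reference, a minimal sketch of how candidate keywords enter the prompt. Per the linked guide, KeyLLM substitutes a candidates tag in the prompt with the list you pass to `extract_keywords`; I spell it `[CANDIDATES]` here, but verify the exact tag name against the docs. The candidate list and document are my own example, and the commented calls assume a loaded generator.

```python
# Sketch of a KeyLLM prompt with candidate keywords to refine.
prompt = """
I have the following document:
[DOCUMENT]

With the following candidate keywords:
[CANDIDATES]

Improve the candidate keywords so they best describe the document.
Return a comma-separated list and nothing else.
"""

# With KeyLLM it would look roughly like (requires keybert + a loaded LLM):
# llm = TextGeneration(generator, prompt=prompt)
# kw_model = KeyLLM(llm)
# keywords = kw_model.extract_keywords(docs, candidate_keywords=[candidates])

# Standalone check of what KeyLLM substitutes for the tags:
candidates = ["large language models", "keyword extraction"]
doc = "KeyLLM uses an LLM such as Mistral to extract keywords from text."
filled = (prompt
          .replace("[DOCUMENT]", doc)
          .replace("[CANDIDATES]", ", ".join(candidates)))
print(filled)
```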
Hey Maarten, congrats on this amazing work! If I want Portuguese input and Portuguese output, what do I need to change? Just the prompt? And what input format do I need if I have .docx files? Thanks.
really good
Hi Maarten, thanks for the informative video. I was trying it out; can you suggest what I have to do, as I am getting "WARNING:ctransformers:Number of tokens (513) exceeded maximum context length (512)"?
I have already tried splitting the document into
I think the Mistral LLM is a good choice 👍🏾
What if my document is very large, or I have millions of records? Then it will throw an out-of-token error because it exceeds the model's token limit.
Can you please suggest a solution for that?
You can split a document into paragraphs and feed those into KeyLLM, but there are many other solutions: building a RAG pipeline, summarizing the documents, using the tokenizer to split up the text, using an LLM with a larger context, etc.
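A minimal sketch of the splitting option: chunk the text so each piece stays under the model's context length, then feed each chunk to KeyLLM separately. The whitespace split below is a stand-in for simplicity; in practice you would count tokens with the model's own tokenizer (e.g. `AutoTokenizer`) so the limit matches the actual 512-token window.

```python
def chunk_text(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

chunks = chunk_text("one two three four five", max_tokens=2)
print(chunks)  # → ['one two', 'three four', 'five']

# Each chunk can then be passed to kw_model.extract_keywords(...) on its own,
# and the per-chunk keywords merged and deduplicated afterwards.
```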
Very informative @Maarten. Will you mentor me?
What a wild ask? 😅
Hit or miss
Although I am currently too busy for extensive mentoring, feel free to reach out on LinkedIn! I cannot make any promises but if you have anything you need my perspective on or are struggling with perhaps I can help.
God what a Dutch accent, coming from a Dutchie 😂