Q: How to put 1000 PDFs into my LLM?

  • Published: 22 Aug 2024
  • A subscriber asks: How to put 1000s of PDFs into a Large Language Model (LLM) and perform Q&A? How to integrate huge amounts of corporate and private data into an open-source LLM?
    #aieducation
    #chatgpt
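The question above is typically answered with a Retrieval Augmented Generation (RAG) pipeline, as the video explains: extract text from each PDF, split it into chunks, embed the chunks into a vector database, and retrieve the most relevant chunks into the prompt at question time. A minimal sketch of the chunking step (the chunk size and overlap values are illustrative assumptions, not tuned recommendations):

```python
# Minimal sketch of the first RAG step: split extracted PDF text into
# overlapping chunks that can later be embedded and stored in a vector DB.
# Chunk size and overlap are illustrative values, not recommendations.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk; real pipelines often split on sentence or paragraph boundaries instead of raw character counts.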

Comments • 39

  • @matthewmcc2237
    @matthewmcc2237 1 year ago +10

    Wow! Ask a question and get a whole great video explanation! Could not ask for more. Was not even sure my question was going to get answered, because it was asked sometime after the comments on that video seemed to have stopped. I was thinking of asking the creators on their site, but I am sure the best explanation I could get is the above video. I now finally have a much, much better understanding of what is going on with this type of set-up. I was really in the dark and a little frustrated trying to understand just how this was working and how it related to fine-tuning and other such things I am learning about, but your explanation and graphics made it very clear and easy to understand for someone like myself who is not an expert in this area. I can now use this type of set-up much more intelligently now that I know how it is working. Thank you very much! It probably would have taken a lot of research and reading before I ever came close to figuring this out on my own. I have been learning a lot from your videos. They are very informative and are all both presented very well and explained very well. Look forward to future videos.

    • @code4AI
      @code4AI  11 months ago +1

      You had a great question. But it took me some days to prepare the video ... Glad it was informative and answered your questions.

    • @blacksage81
      @blacksage81 11 months ago

      Thanks for your question.

    • @skoppisetti
      @skoppisetti 11 months ago

      Thanks for asking a good question.

  • @zakariaabderrahmanesadelao3048
    @zakariaabderrahmanesadelao3048 5 months ago

    I always worry about prior knowledge before getting into a new topic as big as creating custom LLMs, however, this channel makes it feel like the best place to start. Thank you and I hope this channel grows exponentially.

  • @kurtiswithakayy
    @kurtiswithakayy 6 months ago +1

    I've been trying to learn about this, but high-quality LLM content like this video is hard to come by - ty!

  • @Dattatray.S.Shinde
    @Dattatray.S.Shinde 1 year ago

    Thanks for explaining the RAG approach in a very simple manner!

    • @code4AI
      @code4AI  1 year ago

      Retrieval Augmented Generation in a prompt flow .... smile.

  • @42svb58
    @42svb58 1 year ago

    It's like you're reading my mind on what I need to learn next

  • @lance3301
    @lance3301 1 year ago

    Great info. Thanks for breaking it down in a way that was easy to understand.

    • @code4AI
      @code4AI  1 year ago

      Glad it was helpful!

  • @tsilikitrikis
    @tsilikitrikis 6 months ago

    Oh man this is treasure thank you!

  • @mowsie2k492
    @mowsie2k492 5 months ago

    thank you for this video - enlightening!

  • @brettmiddleton5013
    @brettmiddleton5013 1 month ago

    Boggles me how everyone hasn’t done the ram bypass yet. Maybe I should sell it as a function/button. call it the creativity tax

  • @mohsenghafari7652
    @mohsenghafari7652 4 months ago

    Hi dear friend,
    thank you for your efforts.
    How can I use this tutorial with PDFs in another language (for example, Persian)? What would the approach be?
    I have made many efforts and tested different models, but the results when asking questions about the PDFs are not good or accurate!
    Thank you for the explanation

  • @Nick_With_A_Stick
    @Nick_With_A_Stick 1 year ago +2

    A few months ago I had tried some open source vector databases with llama 1 and I couldn’t for the life of me get the model to predictably gather data from the vector database. What could’ve caused this? And do the newer models tend to gather information better? Or the newer vector databases? Or could I just not have been prompting the models right?
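One frequent cause of the unpredictable behavior described above lies in the prompt-assembly step rather than in the vector database itself: early models often ignored retrieved context unless the prompt clearly separated context from question. A toy sketch of how retrieved chunks are commonly stuffed into a grounded prompt (the template wording is an illustrative assumption, not a recommended standard):

```python
# Sketch of the prompt-assembly step that follows retrieval. The template
# explicitly separates the retrieved context from the question and tells
# the model to refuse when the answer is absent; the exact wording here
# is an illustrative assumption.

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Join retrieved chunks into a single grounded prompt."""
    context = "\n---\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The resulting string is what actually gets sent to the LLM; if this step dumps chunks without clear delimiters or instructions, even a good retriever can look like it "failed to gather data".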

  • @andreyseas
    @andreyseas 11 months ago +1

    Nice video! What have you discovered to be the best file type for encoding, since you don't prefer PDFs? Is it JSON?

  • @teatea5528
    @teatea5528 11 months ago

    I need this video. Thank u❤

    • @code4AI
      @code4AI  11 months ago

      You're welcome 😊

  • @darkmatter9583
    @darkmatter9583 2 months ago

    What about math formulas written in LaTeX inside the PDFs? And what if the PDF is scanned and has to go through OCR first?

  • @Elektron1c97
    @Elektron1c97 8 months ago

    Is there a point where you think LoRA should be done and not only Embedding/RAG as explained in this video?
    The background for this: let's say you have 1 million documents, 10k of which are relevant to your search query, but you won't be able to fit all of them into the prompt. Would fine-tuning your model make sense then?
    And also: Can you recommend a channel that will implement this with an example?

    • @code4AI
      @code4AI  8 months ago

      ruclips.net/video/oPS-8nKGu8U/видео.html

  • @wdonno
    @wdonno 1 year ago

    Would it ever be worthwhile to label the sentences first to identify multi-word ‘spans’ with special meaning or to capture entities that have specific numerical values attached to them?

    • @code4AI
      @code4AI  1 year ago

      Sure, you could add bigrams and trigrams to your vocabulary (see optimal tokenization) if they show up with significant frequency.
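The reply above can be sketched in a few lines of standard-library Python: count word n-grams over the corpus and keep those frequent enough to merit their own vocabulary entry (the frequency threshold is an illustrative assumption):

```python
# Sketch of finding multi-word spans frequent enough to add to a vocabulary,
# as suggested in the reply above. The min_count threshold is an
# illustrative assumption; a real tokenizer would use a principled cutoff.
from collections import Counter

def frequent_ngrams(sentences: list[str], n: int, min_count: int) -> list[tuple[str, ...]]:
    """Count word n-grams across sentences and keep the frequent ones."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return [gram for gram, c in counts.items() if c >= min_count]
```

Spans that survive the cutoff (e.g. domain terms like "net present value") can then be merged into single tokens before embedding, so their special meaning is not split across word pieces.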

  • @Azariven
    @Azariven 1 year ago

    Hi code_your_own_AI, what if my task is to summarize the 1000 PDFs rather than search them? There seem to be many search capabilities out there, but I just want to know what all of them are about, which is also a common task, yet I haven't found many summarization methods that do it easily. There is no good benchmark for evaluating summarization; it almost all boils down to human evaluation, which is not scalable. One simple way that gets 80% of the job done is sentence embedding followed by hierarchical clustering, but the PDFs may contain multiple topics and need to be in different clusters. We can use an LLM to summarize each PDF, or extract main topics, but then we need a way to benchmark and make sure the summary is good. Would love to hear your comments, or if you can point me to the right sources! Thanks!

    • @code4AI
      @code4AI  1 year ago +2

      First idea within the first second: 1000 documents with 1K sentences each equals 1M sentences. LLM summarization of 1M sentences is not performant enough. I would use a topological dimensionality reduction like UMAP on the 1M embedded vectors and then cluster with HDBSCAN. Code examples (including parametric UMAP) for exploring hundreds of topological clusters (each sentence is represented in exactly one cluster) and for condensing their main semantic content based on the sentences in each cluster (here you might apply an LLM for a semantic summarization of all sentences in each single cluster and cascade the cluster's semantic content upwards) can be found in my videos from two years ago, like:
      ruclips.net/video/6nMceLIVwBo/видео.htmlfeature=shared
      ruclips.net/video/rujdyFHOIG0/видео.htmlfeature=shared (for a KERAS implementation)
      ruclips.net/video/yYzN0vSRlaQ/видео.htmlfeature=shared

    • @Azariven
      @Azariven 1 year ago +1

      @@code4AI Yooooo! Didn't expect an answer so quickly, thank you so much Code4AI! I tried using HDBSCAN directly on all the sentence embeddings but didn't get great results. Sentences with a lot of noise or a lot of similarities tend to either be sparse or all clump up. I will go through the videos on UMAP first; I may have some follow-up questions after testing. TYVM again!

  • @ashoksamrat8486
    @ashoksamrat8486 6 months ago

    I want to design an ATS system so that I can filter resumes based on a job description.
    Suppose there are 10,000 candidate resumes and I want to get the top 50 or 100 resumes best suited to the job description.
    Input: 10,000 resumes in PDF format.
    Output: the top 50/100 resumes best suited to the job description.
    How can I achieve this using an LLM and Streamlit for the UI?

    • @user-fp3tm1jp2f
      @user-fp3tm1jp2f 5 months ago

      Take a look at the video again from the start, and repeat until you understand; he clearly explained how to achieve this 😊
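For reference, the core of the ranking asked about above is a similarity search: embed the job description and all resumes in the same vector space and take the top-k by cosine similarity. A toy sketch with TF-IDF vectors standing in for LLM embeddings (the texts are hypothetical; a Streamlit UI would simply display the returned indices):

```python
# Toy sketch of ranking resumes against a job description. TF-IDF cosine
# similarity stands in for LLM embeddings so the ranking mechanics are
# visible; a production ATS would embed extracted PDF text with a real
# embedding model and possibly rerank the top hits with an LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_resumes(job_description: str, resumes: list[str], top_k: int) -> list[int]:
    """Return indices of the top_k resumes most similar to the description."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([job_description] + resumes)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return scores.argsort()[::-1][:top_k].tolist()
```

Scaling this to 10,000 PDFs just means extracting text from each PDF first and raising top_k to 50 or 100.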

  • @ahsaniftikhar1037
    @ahsaniftikhar1037 1 year ago

    Hey,
    can I somehow improve GPT-3.5's ability to write code? If yes, what would be the cheapest method?

    • @code4AI
      @code4AI  1 year ago

      Dedicated Code LLMs are fine-tuned on a particular code domain and some are open source. To see a complete benchmark assessment of about 20 to 30 LLMs for their coding ability, watch my video here: ruclips.net/video/uXBJRBqHnNo/видео.htmlsi=ekh7DtY-iTaZa7w-&t=256
      And yes, you can fine-tune ChatGPT on a dedicated instruction fine-tuned code (python) dataset, but the costs can be significant, since you pay OpenAI for each token.

  • @gileneusz
    @gileneusz 1 year ago

    0:21 I don't need a computer from them, I'm a humble person 😇, just let me download the gpt-4 model, I'll run it myself 🤣I'll not let anyone copy it from me, I promise 😇

  • @romeo72899
    @romeo72899 11 months ago

    Do you have code for all this?

    • @code4AI
      @code4AI  11 months ago +1

      Yes, and a complete YouTube playlist on this topic.

    • @darkmatter9583
      @darkmatter9583 2 months ago

      Send the playlist and you will gain a subscriber forever, please @@code4AI