GPT-4 Tutorial: How to Chat With Multiple PDF Files (~1000 pages of Tesla's 10-K Annual Reports)

  • Published: 5 Oct 2024
  • In this video we'll learn how to use OpenAI's new GPT-4 API to 'chat' with and analyze multiple PDF files. In this case, I use three 10-K annual reports for Tesla (~1000 PDF pages).
    OpenAI recently announced GPT-4 (its most powerful AI model), which can process up to 25,000 words - about eight times as many as GPT-3 - process images, and handle much more nuanced instructions than GPT-3.5.
    You'll learn how to use LangChain (a framework that makes it easier to assemble the components to build a chatbot) and Pinecone, a 'vectorstore' that stores your documents as numerical 'vectors'. You'll also learn how to create a frontend chat interface to display the results alongside source documents.
    A similar process can be applied to other use cases you want to build a chatbot for: PDFs, websites, Excel files, or other file formats.
    Visuals & Code:
    🖼 Visual guide download + github repo (this is the base template used for this demo):
    github.com/may...
    Twitter: / mayowaoshin
    Website: www.mayooshin....
    Send a tip to support the channel: ko-fi.com/mayo...
    Timestamps:
    01:02 PDF demo (analysis of 1000-pages of annual reports)
    06:01 Visual overview of the multiple pdf chatbot architecture
    17:40 Code walkthrough pt.1
    25:15 Pinecone dashboard
    28:30 Code walkthrough pt.2
    #gpt4 #investing #finance #stockmarket #stocks #trading #openai #langchain #chatgpt #langchainjavascript #langchaintypescript #langchaintutorial
  • Science

Comments • 369

  • @chatwithdata
    @chatwithdata  1 year ago +38

    Timestamps:
    01:02 PDF demo (analysis of 1000-pages of annual reports)
    06:01 Visual overview of the multiple pdf chatbot architecture
    17:40 Code walkthrough pt.1
    25:15 Pinecone dashboard
    28:30 Code walkthrough pt.2

    • @johnpaulpayopay1074
      @johnpaulpayopay1074 1 year ago +1

      Instead of using the OpenAI API, could you use local LLMs such as LLaMA, Alpaca, Vicuna, or Koala?

    • @MidoBroski
      @MidoBroski 1 year ago

      The GitHub repository in the description does not include this multi-PDF bot. Could we get the source code?

    • @chineseoutlet
      @chineseoutlet 1 year ago

      @@johnpaulpayopay1074 Agreed. I tried using Vicuna-7B to replace OpenAI (my machine can't run 13B). It works OK, except it can't query across the entire PDF - when I ask it to summarise the whole book, it just says "I don't know". I'm trying to find a way to fix this; if anyone here knows how, please let me know.

    • @tangdavid3317
      @tangdavid3317 1 year ago

      Do you know how to fix this?
      - error Error: Cannot find module 'next/dist/server/future/route-modules/route-module.js'

  • @TheRonellCross
    @TheRonellCross 1 year ago +97

    This tutorial is legit the best. Custom Chatbots are going to be a huge biz op.

  • @byte_easel
    @byte_easel 1 year ago +21

    The Excalidraw diagrams are invaluable to the presentation and really help visualize how data flows between the Pinecone database and how LangChain is used to make it all work. Thanks a lot for those diagrams - your efforts don't go unnoticed. The possibilities of this are literally endless.

  • @clockworkOMNI
    @clockworkOMNI 1 year ago +7

    Thank you for this contribution to the common good. You are the man.

  • @tslg1988
    @tslg1988 1 year ago +4

    Thank you for your great videos, I really like them.
    One UX improvement comment: try serving the PDFs as static content with your web application; then, when you append #page=42 to the PDF's link, a user can go directly to that page in their browser. Very easy to implement, and your users will save a copy-paste and three clicks.
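    For reference, a minimal sketch of that deep-link idea (the folder, file name and metadata shape below are hypothetical, not from the video's repo):

```typescript
// Hypothetical helper: build a deep link into a statically served PDF.
// Assumes the PDFs sit in Next.js's `public/docs` folder (so they are reachable
// at /docs/<file>.pdf) and that each source's metadata carries a page number.
interface SourceDoc {
  fileName: string;   // e.g. "tesla-10k-2022.pdf"
  pageNumber: number; // 1-based page number from the loader's metadata
}

export function pdfDeepLink(source: SourceDoc): string {
  // Browser PDF viewers honour the #page= fragment when the PDF is served
  // as static content, jumping straight to that page.
  return `/docs/${encodeURIComponent(source.fileName)}#page=${source.pageNumber}`;
}

// Usage in the sources list: <a href={pdfDeepLink(doc)}>p. {doc.pageNumber}</a>
```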

  • @harischsood5479
    @harischsood5479 1 year ago +2

    So interesting. As a non-techie, trying to implement this is wild but fascinating.

  • @JasonMelanconEsq
    @JasonMelanconEsq 1 year ago +4

    Can you please show how the front end connects and how to build it? This is definitely the missing piece of the puzzle for me. Thanks for the videos. They're great!

  • @Grahfx
    @Grahfx 1 year ago +150

    I don't think this is the right approach. It doesn't have enough context on the whole document. When you send a prompt, it will match chunks to your prompt, but what if the answer is spread across the context of, say, 200 pages? You could do something cleverer: aggregate chunks together by asking the LLM to summarize and group them into "meta chunks", and repeat the process until all the years fit into a single batch under the max token limit. Then, with that metadata, you'll be able to perform a much more powerful search over the corpus, providing much more context to your LLM at different levels of aggregation.
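    A rough, purely illustrative sketch of the "meta chunk" idea described above - the `summarize` callback stands in for any LLM call that condenses passages, and none of this is from the video's actual code:

```typescript
// Illustrative sketch of hierarchical ("meta chunk") summarization.
type Summarizer = (texts: string[]) => Promise<string>;

export async function buildMetaChunks(
  chunks: string[],
  summarize: Summarizer,
  groupSize = 5,   // how many chunks to merge per pass
  maxChunks = 10   // stop once the corpus fits into one prompt-sized batch
): Promise<string[]> {
  let level = chunks;
  while (level.length > maxChunks) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += groupSize) {
      // Each pass collapses a group of chunks into one higher-level summary.
      next.push(await summarize(level.slice(i, i + groupSize)));
    }
    level = next;
  }
  // The surviving "meta chunks" can be embedded and searched alongside raw chunks.
  return level;
}
```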

    • @user-jc3ys1yc2n
      @user-jc3ys1yc2n 1 year ago +46

      Something like that already exists and is called a “recursive summarizer”.
      The hard part is figuring out how to actually perform search over the recursively summarized documents without invoking the GPT api an unreasonable amount of times except during preprocessing.
      I am currently doing research in the area of passage retrieval and would love to know your thoughts on this and possibly experiment with some of your ideas.

    • @Grahfx
      @Grahfx 1 year ago +10

      ​@@user-jc3ys1yc2n Well, I had this idea while having my morning coffee, but I haven't conducted any deep research to support it. However, as you mentioned, there may be potential issues that arise. Have you considered using GPT4 to help solve this problem? Personally, I anticipate having to address this issue myself because I plan to summarize big HR data.

    • @JanBadertscher
      @JanBadertscher 1 year ago +22

      You lose more and more information with recursive summarization. Better to use embeddings, so the LLM can draw on the whole context of your documents, no matter how many...

    • @Grahfx
      @Grahfx 1 year ago +5

      @@JanBadertscher No, of course it doesn't have complete knowledge - that's the main issue. It only has partial knowledge based on a vector match. This is why it provides inadequate responses when the prompt is not very specific. It is incapable of extracting critical information that comes from reading the entire context. A human expert is currently better at this.

    • @chatwithdata
      @chatwithdata  1 year ago +24

      You can control `k`, the number of returned source documents, up to 1000 if you want via Pinecone, and GPT-4's context is 8k tokens, so you can cover a lot of context across the doc.
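      For reference, a minimal sketch of raising `k` with the LangChain JS + Pinecone setup of that era (the namespace name is hypothetical and the exact API may differ between versions):

```typescript
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

// Assumes `pineconeIndex` was created with the Pinecone client elsewhere.
export async function topKChunks(pineconeIndex: any, question: string, k = 20) {
  const store = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex,
    namespace: 'tesla-2022', // hypothetical namespace name
    textKey: 'text',
  });
  // `k` controls how many source chunks Pinecone returns; the practical limit
  // is how many of them still fit into the model's context window afterwards.
  return store.similaritySearch(question, k);
}
```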

  • @LimabeanStudios
    @LimabeanStudios 1 year ago +8

    I work in an academic research setting researching an extremely niche subject in materials science, and these kinds of tools are what I have been waiting for. I can't wait for us to be able to upload the papers and books that contain all currently available knowledge on the topic and be able to interact with that knowledge pool.

    • @PizzaLord
      @PizzaLord 1 year ago +1

      I already built this tool. If you want to be a tester, let me know.

    • @bendrybrough4362
      @bendrybrough4362 1 year ago

      @@PizzaLord I would like to.

    • @LimabeanStudios
      @LimabeanStudios 1 year ago

      @@PizzaLord I would certainly like to hear more

    • @ahmeda8042
      @ahmeda8042 1 year ago

      @@PizzaLord I would like to.

    • @tylerd6962
      @tylerd6962 1 year ago

      ​@@PizzaLord how do I become a tester

  • @astroid-ws4py
    @astroid-ws4py 1 year ago +8

    Cool stuff. It's just unbelievable how much depth and breadth the computing field has taken on - too stretched out and too spread, too many subjects to explore. Chatting with books is a cool idea that could really help us access and organize scientific human knowledge in all fields.

  • @JohnnyLonghorn
    @JohnnyLonghorn 1 year ago +11

    Thanks for this. Your vids are some of the best on YT now for this type of thing; right now in this domain there are only two words for all the content out there - information overload!!! I am still going through your previous video on PDF ingestion and processing. Thank you for sharing your invaluable information in a style that makes it understandable.

  • @rhythmgaidhani2149
    @rhythmgaidhani2149 1 year ago

    Best video I've found on the internet for my personal project.

  • @MehdiAllahyari
    @MehdiAllahyari 1 year ago +6

    Great video. But the problem with your solution is that you hit the OpenAI API at least 5 times, which makes it costly and not scalable. Other than that, a good project.

    • @chatwithdata
      @chatwithdata  1 year ago +3

      Hmm where are you getting 5 from? The retrieval is from Pinecone, which is very cheap.

  • @duhai1836
    @duhai1836 1 year ago +4

    You are great at teaching the entire process! Please continue this series :)
    Thank you!

  • @satyaschannel4391
    @satyaschannel4391 8 months ago +1

    Hey, this tutorial is one of the best of its kind in this segment. Can you please share the repo link for this multi-PDF version? It would be very useful for others to work on.

  • @CowCoder
    @CowCoder 1 year ago +3

    Just found this channel - 16k subs now, but not for long. You are making great tutorials that, I'll admit, I should have made months ago. Keep making videos; you could hit 2M subs in less than a year if you make the content more relevant and useful to a wider audience and more entertaining.

  • @ADHDOCD
    @ADHDOCD 1 year ago +1

    Woah! 😱 These diagrams make it so much easier to understand the concepts than other YouTube videos do! Thanks for spending so much time and effort!

    • @Jingizz
      @Jingizz 1 year ago

      yes its so helpful to understand these concepts

  • @yajatgulati
    @yajatgulati 1 year ago +2

    This is the best tutorial I've seen on embeddings search, the applications of this are endless, really excited to start building. Thank you so much for the work that you're doing :)

  • @labsanta
    @labsanta 1 year ago +4

    1. What is the tutorial about?
    - The tutorial is about how to chat with multiple massive PDF documents across multiple files, specifically Tesla's 10-K Annual Reports.
    2. How many pages of PDFs are involved in this tutorial?
    - The tutorial involves around a thousand pages of PDFs from the 2020, 2021, and 2022 annual reports of Tesla.
    3. How does the chatbot work when searching for specific information in the PDFs?
    - The chatbot is able to cross-check and provide a reference page when answering a specific question about the risk factors or financial performance of Tesla over the past three years.
    4. Can the chatbot analyze multiple PDF files simultaneously?
    - Yes, the chatbot is able to analyze and answer questions about multiple PDF files from different years of Tesla's annual reports.
    5. Is the tutorial code available for public release?
    - The code for the tutorial is not currently available for public release as it is still being tested and may be buggy.
    1. What is the process of converting PDF documents into number representations? (See the ingestion sketch after this list.)
    - The PDF docs are converted into text, as PDF is binary and needs to be in text format.
    - The text is split into chunks to fit into OpenAI's context window.
    - OpenAI creates embeddings, which are number representations of the text.
    2. What is a vector store?
    - A vector store is a database that contains number representations of documents in different categories or spaces.
    - The vector store can also store the text of the documents and relevant metadata.
    3. How does GPT-4 retrieve information from the vector store?
    - The question is converted into numbers and a specific namespace or box is specified to retrieve relevant documents.
    - The relevant documents are combined with the question and GPT-4 looks at the context to provide a prompt response.
    4. What is the challenge of analyzing information across multiple namespaces or years?
    - Extracting the namespace from the question is required to search for information in the relevant namespaces.
    - The model needs to dynamically determine which year or namespace the user is referring to.
    5. How can GPT-4 assist in extracting the namespace from a question?
    - GPT-4 can be used to extract the namespace from the question and dynamically determine which year or namespace the user is referring to.
    - This allows for analyzing information across multiple years and namespaces.
    1. What is the purpose of the script called "ingest data"?
    - The script called "ingest data" is used to load each PDF report in the reports folder into text.
    2. How does the dynamic context work in the system being described?
    - The dynamic context specifies the namespaces to look at for relevant documents.
    - When there is a question, the system reverts the question to embeddings and checks the specified namespaces for relevant documents.
    - It retrieves the relevant documents for each namespace and then proceeds with the usual procedure.
    3. What is the website "Seeking Alpha" used for?
    - The website "Seeking Alpha" is used by many investors.
    - It provides information on revenue growth.
    4. What happened when the speaker asked the system about Tesla's estimated revenue growth for 2022?
    - The system searched for and found the relevant documents from the 2022 namespace.
    - The system calculated the estimated revenue growth for Tesla based on the consolidated statement of operations.
    - The result was surprising and unexpected.
    5. What is the output of the "ingest data" script?
    - The output of the "ingest data" script is a JSON file containing information on each PDF report in the reports folder.
    - The file includes information on the year of the report and the name of the company.
    1. What is the purpose of page numbers and references in the UI of the project?
    - They allow users to easily locate and reference specific pages and original sources.
    2. How are the PDFs translated into text in the project?
    - The PDFs are ingested and split into different categories and chunks of 1000-1200 characters, with each chunk assigned a namespace.
    3. What is the purpose of Pinecone in the project?
    - Pinecone is used as a database to store the chunks of text and metadata, converted into embeddings (vectors) for similarity calculations.
    4. What are some limitations of Pinecone, and how are they overcome in the project?
    - Pinecone has limits on the number of vectors that can be inserted at once, so chunks are split into smaller groups (e.g. 50). The API keys, environment variable, index name, and number of dimensions must also be specified correctly.
    5. What is the purpose of the dimensions in the embeddings?
    - The dimensions represent different spots in an array of vectors, with each dimension containing a number representing a specific aspect of the text or metadata being analyzed. In this project, OpenAI creates 1536 dimensions for each embedding.
    1. What has been trending on GitHub for the past couple of days?
    - Answer: The transcript does not provide a clear answer to this question.
    2. What are the indexes and namespaces in the code?
    - Answer: The indexes and namespaces are components used in the code to retrieve information from different sources.
    3. How can one explore what the vectors look like in the code?
    - Answer: One can click "fetch" to see what the vectors look like, which includes the namespace, ID, and decimal number representations of the text.
    4. What should one do to ensure successful ingestion in the code?
    - Answer: One should ensure that their config has matching namespaces as in the dashboard, set environment variables correctly, and avoid tampering with versions to avoid breaking changes.
    5. What is the purpose of the second phase in the code?
    - Answer: The purpose of the second phase is to chat and retrieve information dynamically by specifying the namespace to retrieve information from.
    1. What is the custom QA chain and what does it do?
    - The custom QA chain is a tool created by the speaker.
    - It takes the model, index, and namespace to set the stage for the standalone question, then retrieves the specified namespaces and relevant documents to provide a response.
    - The custom QA chain is one of three chains available in LangChain, alongside ChatVectorDBQAChain and VectorDBQAChain.
    2. What is the purpose of the chat history in the code?
    - The chat history is set to nothing and is not directly relevant to the custom QA chain.
    - It is used in the API implementation of the same logic seen in main.ts.
    3. How many files were tested with the custom QA chain, and what were the results?
    - The speaker was able to read three different files totaling close to a thousand pages of in-depth financial analysis.
    - GPT-4 was able to analyze all three years and provide decent analysis.
    4. What is the purpose of the front end in the repo?
    - The front end in the repo is already available for use.
    - It is an adaptation of the custom QA chain tool.
    - The speaker is experimenting with talking across different PDFs.
    5. What is the compound annual growth rate for Revenue over the past three years?
    - The transcript does not provide an answer to this question.
    2. What kind of questions were asked about Tesla's revenue?
    - The questions asked were about growth potential, profitability, and risk factors management.
    3. What is `k`?
    - `k` is the number of reference documents that are returned per PDF when using GPT to analyze data.
    4. What kind of future changes will the repository make?
    - The repository will add features for analyzing multiple PDF files in the future.
    5. Where can you find more information about GPT and its applications?
    - The speaker suggests checking out their workshops and signing up for the waitlist in the description section for more in-depth step-by-step details.
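    To make the ingestion phase summarized above concrete, here is a minimal sketch assuming the LangChain JS and Pinecone client APIs of that era; the file path, index and namespace names are made up, and the video's own ingest script may differ:

```typescript
import { PineconeClient } from '@pinecone-database/pinecone';
import { PDFLoader } from 'langchain/document_loaders/fs/pdf';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

export async function ingest(pdfPath: string, namespace: string) {
  // 1. PDF -> text documents (the repo uses its own pdf-parse based loader).
  const docs = await new PDFLoader(pdfPath).load();

  // 2. Split into ~1000-character chunks so each fits the embedding/context window.
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
  const chunks = await splitter.splitDocuments(docs);

  // 3. Connect to the Pinecone index (1536 dimensions for OpenAI's ada-002 embeddings).
  const client = new PineconeClient();
  await client.init({
    apiKey: process.env.PINECONE_API_KEY!,
    environment: process.env.PINECONE_ENVIRONMENT!,
  });
  const pineconeIndex = client.Index(process.env.PINECONE_INDEX_NAME!);

  // 4. Embed and upsert in small batches (e.g. 50) to stay under Pinecone's
  //    per-request limits, using one namespace per report year.
  const store = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex,
    namespace,
    textKey: 'text',
  });
  for (let i = 0; i < chunks.length; i += 50) {
    await store.addDocuments(chunks.slice(i, i + 50));
  }
}

// e.g. await ingest('reports/tesla-10k-2022.pdf', 'tesla-2022');
```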

  • @yudhaesap
    @yudhaesap 1 year ago

    Thank you so much. This is exactly what I need. Being able to talk with a document is kinda insane.

  • @hidroman1993
    @hidroman1993 1 year ago +13

    If you use GPT-4 to send all those tokens, the cost must be insane. Can you show the summary of costs in your OpenAI account?

    • @EradicateLoL
      @EradicateLoL 1 year ago

      You don't - the embeddings handle retrieving that crucial context information, and then you pipe just the relevant info to ChatGPT to help inform the answer. It's still not cheap by any means, but much cheaper than sending all of that data. We're doing something similar at work, and it's still super cheap.
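      As a rough back-of-the-envelope illustration of why that stays cheap (assuming the GPT-4 8K launch list prices of $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens - check current pricing):

```typescript
// Illustrative cost per question when only the retrieved chunks are sent.
const PROMPT_PRICE_PER_1K = 0.03;     // assumption: GPT-4 8K launch pricing
const COMPLETION_PRICE_PER_1K = 0.06; // assumption: GPT-4 8K launch pricing

const chunksReturned = 4;            // k source chunks from the vector store
const tokensPerChunk = 250;          // ~1000 characters is roughly 250 tokens
const questionAndPromptTokens = 200; // question plus instruction template
const completionTokens = 300;        // typical answer length

const promptTokens = chunksReturned * tokensPerChunk + questionAndPromptTokens;
const cost =
  (promptTokens / 1000) * PROMPT_PRICE_PER_1K +
  (completionTokens / 1000) * COMPLETION_PRICE_PER_1K;

console.log(cost.toFixed(3)); // ≈ $0.054 per question, versus far more if the whole PDF were sent
```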

    • @trendkillsp
      @trendkillsp 1 year ago

      The prompt sent is a normal-sized one; the massive data is stored in the database.

  • @rugerdie4054
    @rugerdie4054 1 year ago

    Brother...! This is what I am talking about!
    My sellers are looking for this kind of capability for research as well as for searching our internal content management system.

  • @cigir2023
    @cigir2023 1 year ago

    Thank you bro... You are a f*** genius!!! This is one of those things that will radically change the world in a very short time.

  • @BlaziNTrades
    @BlaziNTrades 1 year ago +1

    This is amazing man. Thank you! It will take a while to fully understand how to implement this for myself but I appreciate the knowledge.

  • @georgesanchez8051
    @georgesanchez8051 1 year ago +26

    Your Excalidraw diagrams are invaluable too. I've been trying to get into the habit of creating similar diagrams. Do you usually come up with them before you code, make them manually as you go, or put them together once you're done?

    • @vinosamari
      @vinosamari 1 year ago +1

      I picked up the habit of doing it before and it helped tremendously. You might get sucked in for a while but it’ll help streamline the process

    • @chatwithdata
      @chatwithdata  1 year ago +10

      Yeah usually before because it helps me think through how to solve the problem

    • @picklenickil
      @picklenickil 1 year ago

      Anything fancier than party tricks requires System design and engineering. Logic is almost always beautiful on paper

    • @gambaweb
      @gambaweb 1 year ago

      This is like having the architecture of your system. IMO it is a must-have. Good job, meanwhile.

    • @PizzaLord
      @PizzaLord 1 year ago +1

      @@chatwithdata what tool do you use for the diags?

  • @suniyamokipallo2571
    @suniyamokipallo2571 1 year ago +4

    Would be great if you could build this full tutorial in just Python!

  • @calebperkins8776
    @calebperkins8776 1 year ago +2

    Why not use embeddings of the document titles, so that instead of using GPT to extract the exact namespace, it could do a semantic search for the namespaces most similar to the initial prompt, finding relevant material more effectively? Thoughts?
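    A sketch of that routing idea - embed a short description of each namespace once, then pick the namespace whose description is most similar to the question (the names are hypothetical, and the embeddings usage assumes the LangChain JS API of that era):

```typescript
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';

const embeddings = new OpenAIEmbeddings();

// One short description per namespace, embedded alongside the question.
const namespaces = [
  { name: 'tesla-2020', description: 'Tesla 10-K annual report for fiscal year 2020' },
  { name: 'tesla-2021', description: 'Tesla 10-K annual report for fiscal year 2021' },
  { name: 'tesla-2022', description: 'Tesla 10-K annual report for fiscal year 2022' },
];

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

export async function routeNamespace(question: string): Promise<string> {
  const [qVec, ...nsVecs] = await embeddings.embedDocuments([
    question,
    ...namespaces.map((n) => n.description),
  ]);
  let best = 0;
  nsVecs.forEach((vec, i) => {
    if (cosine(qVec, vec) > cosine(qVec, nsVecs[best])) best = i;
  });
  // Namespace whose description is most semantically similar to the question.
  return namespaces[best].name;
}
```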

  • @MidoBroski
    @MidoBroski 1 year ago +3

    The GitHub repository in the description does not include this multi-PDF bot. Could we get the source code?

  • @Lutherbaer
    @Lutherbaer 1 year ago +2

    Thanks for the detailed presentation and explanation of your concept! This is really exciting to learn - subscription is set, I'm looking forward to more videos. Thanks mate!

  • @DaveShap
    @DaveShap 1 year ago +2

    Great work! Now whenever people ask me how to do this I will just point at this video :D Cheers, thanks for sharing.

  • @thedude9270
    @thedude9270 1 year ago +1

    This tutorial is mind-blowing. How much does it cost to feed ChatGPT all of the tokens from the context document, though?

  • @emmanuelkolawole6720
    @emmanuelkolawole6720 1 year ago +1

    Please use ChromaDB, not Pinecone, so we can reduce cost significantly. Also, can you use the Vicuna model instead of GPT-4?

  • @Kryptikoo
    @Kryptikoo 1 year ago

    Thank you for sharing, please do more, your explanation is so easy to understand !

  • @allthenewsthatsfittomock6578
    @allthenewsthatsfittomock6578 1 year ago +3

    First off, very well explained, great detail in the video - thank you for taking the time to put this together. I was wondering, for use cases where the documents you are working with should remain private... is there any way to create the embeddings with some locally run model, rather than sending the raw document text in chunks to OpenAI?

  • @davidaktary
    @davidaktary 1 year ago +2

    When will you release the code from this walk through (not just the "base template", but the actual code here that uses namespaces)? Also, do you know of any way to prioritize what source it uses? For example, if we have 3 PDFs for 2021, 2022, and 2023, and each declares a new color of the year, and I simply ask "what is the color of the year?" how can I ensure it gives me the most recent answer?

  • @IStMl
    @IStMl 1 year ago +3

    Would building an app that allows doing all this through a simple UI (and with other features) infringe the ToS?

  • @neerajkulkarni6506
    @neerajkulkarni6506 1 year ago

    This is an amazing tutorial! More videos like this please.

  • @ChrisAllenMusic
    @ChrisAllenMusic 1 year ago

    Thanks! Terrific walk-through, really appreciate this tutorial!

  • @marcofikkers
    @marcofikkers 1 year ago

    Am I the only one who thinks this needs an animated Clippy from the old Word versions as an avatar? This is how we wanted it to work!

  • @Phrog64
    @Phrog64 1 year ago +4

    I'm extremely interested in AI and Language Models, but I have a limited understanding of where to start. I was wondering if you'd be able to recommend any programs, courses, or tutorials for those in a similar position to me. Thank you, and what an interesting video this was!

    • @henry7434
      @henry7434 1 year ago

      Maybe read the Transformer, BERT and GPT papers first?

    • @1986xuan
      @1986xuan 1 year ago

      He has the course open now. You will find the link in the description below this video.

    • @rajns8643
      @rajns8643 1 year ago

      If you have a basic understanding of multivariable calculus, prob/stats and linear algebra, watch a few videos on ANNs, then RNNs, then transformers and BERT, and then finally LLMs.
      It's a long way, but this way you will be able to understand it more deeply and follow the recent developments in LLMs more clearly.

  • @sathithyayogi99
    @sathithyayogi99 7 months ago +1

    You are a legend, brother.

  • @leonh-kd3cm
    @leonh-kd3cm 1 year ago +1

    So when you do embeddings - maybe I missed it - I assume that you do not generate an embedding of the entire PDF page (that you converted to text before that). So how do you chunk the text on the page? By paragraphs? By sentences? (The former would be a bit of a problem if you removed all "\n" characters while extracting the PDF text.)

  • @picklenickil
    @picklenickil 1 year ago +2

    Just as a suggestion, put a disclaimer. I tried running something similar (TensorFlow Universal Sentence Encoder 4 running locally + some custom RNN + SINDy) with Kongsberg Group's annual statement, which covers many smaller companies such as Rolls Maritime and Security. Although it's publicly available, I swear to everything holy, all my files vanished. No updates or crash, nothing. I went to prepare some food, came back, and it was like I never started the project. Hell... even my bills for OpenAI were cleared. Like I never wrote a word. I'm still doubting my sanity, but the paperwork is there, so I couldn't have "tripped".

  • @BlackStudies
    @BlackStudies 1 year ago +1

    Sorry, I must have missed something. I've watched both videos, and I still don't know how to take a pdf and insert it into Pinecone. Do I upload it somewhere? Do I share a pdf link with Pinecone? How did you get Pinecone to interact with the Tesla pdf? How does Pinecone know that this pdf exists?

  • @resistance_tn
    @resistance_tn 1 year ago +1

    As always, you're delivering awesome content !!

  • @xoxtoree
    @xoxtoree 1 year ago +1

    how do you get the page numbers in the source?

  • @anujcb
    @anujcb 1 year ago +2

    @Chat with data, do you mind sharing the metadata version of the code, where you chunk and tag the documents by year and page number? I am really interested in what you showed in the demo. Impressive work!!!

  • @matte3333
    @matte3333 1 year ago

    Thanks for this. It would be helpful to see your CustomQAChain Class. How do I tell Pinecone to only search through specific namespaces?

  • @shashankaadimulam4504
    @shashankaadimulam4504 1 year ago +2

    Could you please let me know when the workshop is going to be?

    • @chatwithdata
      @chatwithdata  1 year ago +1

      I'm currently putting together the outline, probably next week.

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 1 year ago +2

    Can you consider Python videos as well?

  • @merozemory
    @merozemory 1 year ago +1

    Great video :) What do you think about using metadata filtering instead of separating data by namespace? By giving users the ability to select metadata filters, you can provide a personalized search experience with precise filters for two or more metadata types, without the need for feature extraction from text.
    Keep up the good work!
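    A sketch of the metadata-filter variant, assuming each chunk was ingested into a single namespace with metadata such as a `year` field (the filter syntax is Pinecone's, passed through LangChain JS):

```typescript
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';

// Everything lives in one namespace; each chunk is assumed to have been
// ingested with metadata like { year: 2022, company: 'Tesla' }.
export async function searchByYear(pineconeIndex: any, question: string, year: number) {
  const store = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex,
    textKey: 'text',
  });
  // The third argument is a Pinecone metadata filter, so only chunks
  // tagged with the requested year are considered.
  return store.similaritySearch(question, 10, { year: { $eq: year } });
}
```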

    • @jazzyj2899
      @jazzyj2899 1 year ago

      Have you tried this? I have tried it in Python - at least I've tried metadata filtering with the new SelfQueryRetriever, and it did not work. There is a structured-query component using comparators that doesn't allow it to work without some extension and modification of the class. Let me know if you're interested in hashing this out together.

  • @justinc2114
    @justinc2114 1 year ago +2

    Can this be wrapped into a web app package for ease of use?

  • @hamzariazuddin424
    @hamzariazuddin424 1 year ago

    Sorry, I'm new to this channel - it looks amazing.
    How are you getting that chatbot? Where can I find it?
    Amazing channel and information you are providing, though.

  • @togo7022
    @togo7022 1 year ago +1

    Bro just casually creating millions of dollars of value in a YouTube video.

  • @timokaya296
    @timokaya296 1 year ago +1

    I want to use a tool where I can upload 2-3 PDF files so I can ask it questions - during an exam, for example - and it answers according to the PDF documents I have uploaded. How can I arrange that?

  • @prednosttrake
    @prednosttrake 1 year ago +1

    Question: instead of PDF, could you point it to the database to do the same? How would one accomplish that? Thinking of a BI tool or an ERP.
    I think digitizing PDFs will be great for organizations that have technical machinery - for example, field sales crews - if they need to answer a technical question, it can be input (via voice) into GPT and reference company information. One would need to make sure it is secure (i.e. ChatGPT cannot store what it reads).

  • @IvoSchmidt-w9z
    @IvoSchmidt-w9z 1 year ago +1

    Are there any ready-to-use (free or commercial) applications where I can drop a bunch of PDFs (e.g. 100) and chat about them?

  • @argniests5357
    @argniests5357 1 year ago

    Hi there, what an awesome video. I will be trying this out this week, asap. I am wondering how long you have been coding, and more specifically how long you have been working with the kind of technology in this video?

  • @nicolofranceschi
    @nicolofranceschi 1 year ago +1

    Hi, try the Adobe PDF API - you can improve your embeddings a lot with it. The API automatically detects paragraphs, titles, tables and images, so you can supercharge your setup. The chunks are then no longer meaningless text but meaningful elements (a <p>, an <h1>, etc.), and you can also give weight to the data - for example, an h1 heading is more important than a small paragraph at the end of a page.

  • @klammer75
    @klammer75 1 year ago +3

    Where do you get the reference/source in the code? I didn’t seem to see that part?

    • @chatwithdata
      @chatwithdata  1 year ago +2

      github.com/mayooear/gpt4-pdf-chatbot-langchain/blob/main/utils/customPDFLoader.ts

  • @ReflectionOcean
    @ReflectionOcean 9 months ago

    - Convert large PDFs into text for processing (timestamp: 6:36).
    - Split text into chunks compatible with OpenAI's context window (timestamp: 6:59).
    - Create embeddings from chunks and store them in a vector database (timestamp: 7:14).
    - Query the database dynamically using namespaces to extract information from specific years (timestamp: 11:03).
    - Use GPT-4 to parse and understand complex financial questions across multiple documents (timestamp: 17:19).
    - Leverage Langchain and Pinecone services to structure and query data efficiently (timestamp: 18:24).
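    A minimal sketch of that query phase, assuming the LangChain JS APIs of that era (the namespace value and `k` are illustrative, not the video's exact code):

```typescript
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { ConversationalRetrievalQAChain } from 'langchain/chains';

// Query phase: embed the question, pull the top-k chunks from one namespace,
// and let GPT-4 answer from that context.
export async function askReport(pineconeIndex: any, namespace: string, question: string) {
  const store = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex,
    namespace,
    textKey: 'text',
  });
  const model = new ChatOpenAI({ modelName: 'gpt-4', temperature: 0 });
  const chain = ConversationalRetrievalQAChain.fromLLM(model, store.asRetriever(6), {
    returnSourceDocuments: true, // so the UI can show page references
  });
  return chain.call({ question, chat_history: '' });
}

// e.g. await askReport(index, 'tesla-2022', 'What were the main risk factors?');
```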

  • @LukeTownsend-s2t
    @LukeTownsend-s2t 1 year ago +1

    This video and channel are awesome! I'm working on a version of this demo in python, would you mind adding an Apache or MIT license to the gpt4-pdf-chatbot-langchain repo so I can share without worrying about any weird copyright issues with github? (Accidentally posted this on the wrong video earlier...)

  • @SwingingInTheHood
    @SwingingInTheHood 1 year ago +1

    Thank you for this excellent contribution! It has really simplified this process for me. One question: the first step (well, second step) in Phase II of your flowchart is to combine the chat history with the new question. The next step is to submit this to an LLM to get a new standalone question. Why do you add this step instead of immediately vectorizing the chat history with the new question? What does this "new standalone question" look like? If this is explained in this video, or another, could you please include the URL? I'm kind of new to this. Thanks!
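    For context, the "standalone question" step is LangChain's question-condensing pattern: the chat history plus the follow-up are first rewritten by the LLM into one self-contained question, and only that rewritten question is embedded and searched - otherwise a follow-up like "and what about 2021?" would embed poorly on its own. A sketch of the prompt involved (the wording is illustrative, not the video's exact template):

```typescript
import { OpenAI } from 'langchain/llms/openai';
import { LLMChain } from 'langchain/chains';
import { PromptTemplate } from 'langchain/prompts';

// Rewrite (chat history + follow-up) into one self-contained question.
const condensePrompt = PromptTemplate.fromTemplate(
  `Given the following conversation and a follow up question, rephrase the
follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:`
);

export async function makeStandaloneQuestion(chatHistory: string, question: string) {
  const chain = new LLMChain({
    llm: new OpenAI({ temperature: 0 }),
    prompt: condensePrompt,
  });
  const { text } = await chain.call({ chat_history: chatHistory, question });
  return text.trim(); // this rewritten question is what gets embedded and searched
}
```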

  • @vetonrushiti19
    @vetonrushiti19 2 months ago

    Hi, thank you for this great architecture, but I wanted to ask if there is a tokenization phase that happens here?

  • @sf0101
    @sf0101 1 year ago +1

    This is golden 💛

  • @RedCloudServices
    @RedCloudServices 1 year ago +1

    Ugh, TypeScript seems so verbose compared to Python. It's hard to follow JS and TS setups. But… have you considered prompts with calculations derived from the corpus? "How much did Tesla spend on batteries in the last 3 years?"

  • @markoguru
    @markoguru 9 months ago

    Really nice work, thanks! How about loading files which are located remotely, like S3 bucket?

  • @tapos999
    @tapos999 1 year ago +1

    The Discord link is invalid - could you reshare it? I am also looking for the latest diagram you were showing in the video; in the repo I only found the single-PDF scenario. Is it possible to re-upload that image? Thanks for the great content.

  • @aryanphilip1527
    @aryanphilip1527 1 year ago +1

    Will you be releasing the code for this one? Loved the earlier videos.

  • @GeekFromPH
    @GeekFromPH 1 year ago

    Looking forward to LangChain Agents.

  • @abdullaansari7163
    @abdullaansari7163 6 months ago

    Very nice information.
    Can you please create a Python-based video for the same?
    It would be helpful.
    Thanks in advance 😍😍😍

  • @TheSshahrukh
    @TheSshahrukh 1 year ago

    @chatwithdata thanks for sharing this demo. In my line of work we sometimes deal with sensitive and non-public data. What risk factors are associated with analyzing sensitive data using GPT apis?

  • @jennyxu3747
    @jennyxu3747 1 year ago +1

    Great video! Can we see the source code for this? I would love to see how extractYearFromQuestion is implemented.

  • @rathishmenon1612
    @rathishmenon1612 4 months ago

    Very nice video. Can we use a metadata filter instead of namespaces - will it work? Also, could you please share your main.ts and few-shot prompt template?

  • @Steve-js7bp
    @Steve-js7bp 1 year ago +1

    Are you by any chance running a function that strips out the pdf: object [] from the metadata? For some reason when I try to upload my PDF, there is a pdf: object [] that is causing a Pinecone error. I noticed that you have pdf-parse as a dependency, but I can't quite see how this is working. Keep up the great work by the way!

  • @alexsov
    @alexsov 1 year ago +1

    Thank you! It's not clear how you select namespaces (PDFs) - just by the year in the request? Is the code in the video not the same as on GitHub?

  • @NishNannapaneni
    @NishNannapaneni 1 year ago

    Thank you for the walkthrough!

  • @Chuukwudi
    @Chuukwudi 1 year ago

    Python developer here. So I am going to have to learn typescript now?

  • @vladbph
    @vladbph 1 year ago +1

    What if the query is not related to a year but to anything else - how would you approach that? Your namespaces are essentially hardcoded... In the general case you want them to be dynamic.

    • @chatwithdata
      @chatwithdata  1 year ago

      In this demo, the namespace is hardcoded for ingestion purposes, but dynamic afterwards at the point of query
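      A hypothetical sketch of what that dynamic step could look like - asking the model for the year(s) in the question and mapping them to namespaces; this is not the video's actual extractYearFromQuestion, which isn't in the public repo:

```typescript
import { OpenAI } from 'langchain/llms/openai';

// Map report years to the Pinecone namespaces used at ingestion (hypothetical names).
const NAMESPACES: Record<string, string> = {
  '2020': 'tesla-2020',
  '2021': 'tesla-2021',
  '2022': 'tesla-2022',
};

export async function namespacesForQuestion(question: string): Promise<string[]> {
  const llm = new OpenAI({ temperature: 0 });
  const raw = await llm.call(
    `Which fiscal years does this question refer to? Answer with a comma-separated
list of years only, or "all" if no specific year is mentioned.
Question: ${question}`
  );
  const years = raw.toLowerCase().includes('all')
    ? Object.keys(NAMESPACES)
    : raw.match(/\d{4}/g) ?? Object.keys(NAMESPACES);
  // Query each matching namespace and merge the retrieved chunks afterwards.
  return years.filter((y) => NAMESPACES[y]).map((y) => NAMESPACES[y]);
}
```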

  • @angelocortez345
    @angelocortez345 9 months ago

    So my understanding is that it's one namespace per PDF? If I had 10 real estate books, should I create one namespace, or one per book?

  • @이주석-g7g
    @이주석-g7g 1 year ago +1

    Thanks for the great video, I'm using it very well. When I use the code you posted on GitHub with Pinecone, is there any limit on the number of PDF files / total size / total number of pages, etc.? It worked fine when the total size of multiple PDF files was around 84MB, but when it went up to 128MB I got a SocketError: other side closed error in npm run ingest, so adding a PDF file to an already existing vector store doesn't seem to work well. Do you think I need to adjust the timeout for undici to be a few times longer?

  • @zain1045
    @zain1045 1 year ago +1

    Quite often each year I stumble into these parts of the net where I don't understand a single thing, as such.

  • @salman0ansari
    @salman0ansari 1 year ago +1

    Can you create a tutorial on converting LangChain code into an API, and maybe use that API to create a Discord bot?

  • @raybenchen
    @raybenchen 1 year ago +2

    Just wondering if the same results would be obtained for the same prompts if all three docs were put into one namespace in the vector DB and GPT were left to figure out the across-the-years answers?

    • @jazzyj2899
      @jazzyj2899 1 year ago

      No, it wouldn't work out. If you'd like to get together and try to solve that part, comment back!

  • @arsalanriaz3382
    @arsalanriaz3382 1 year ago +1

    Is there a python implementation ?

  • @damnojs
    @damnojs 1 year ago

    How many tokens does it use when you ask a question? Can you give a general calculation?

  • @nicholasbishop2695
    @nicholasbishop2695 1 year ago

    Interesting that there is a grammatical error in both the question and resulting answers. Does this mean GPT-4 is simply parroting back to the questioner OR is there an inherent grammatical error bias in the model? I am referring to the use of the plural for the management of a company and the company itself. Whereas in fact, a company (Tesla) is a single legal personality and should be referred to in the singular. 'Its' not 'their', 'it' not 'they', 'has' not 'have' etc...

  • @despo13
    @despo13 1 year ago

    bro which keyboard do you use, and thanks for this detailed video

  • @KleiAliaj
    @KleiAliaj 1 year ago

    Hi,
    great video and app.
    I am trying to deploy it live just for testing, but it doesn't work: "An error occurred while fetching the data. Please try again."
    It works in local development, though. How can we fix this?
    Do you have a version deployed online?
    Thanks

  • @senju2024
    @senju2024 1 year ago +1

    About privacy concerns: is there a way for GPT-4 to NOT look at certain data? What if a PDF has sensitive data that should not be scanned by GPT? Or maybe it can scan it, but due to an "INSTRUCT-POLICY" it is NOT allowed to output the related info? I feel security policies need to be added to the architecture. My 2 cents.

    • @Jingizz
      @Jingizz 1 year ago

      OpenAI themselves should add some opt-out so the data doesn't get saved by them. The EU might force them to add something.

  • @moz658
    @moz658 1 year ago +2

    Awesome. Question: while it is searching the document for the answer, is it possible to add an extra condition in the base code so that it also uses its own reasoning and generates ideas based on both the PDF and its default model knowledge of the question?

    • @hayx210
      @hayx210 1 year ago

      did u find an answer?

    • @moz658
      @moz658 1 year ago

      @@hayx210 Nope. So for now I decided to use FAISS as local vector db. If Chroma repo owner fixes the problem, I will check it again.

  • @todaydailyeveryday
    @todaydailyeveryday 1 year ago

    Hi, great tutorial! I was wondering if it is able to read financial figures such as net income, debt, etc. within the balance sheet or income statement?

  • @svgtdnn6149
    @svgtdnn6149 1 year ago

    Hi, this is amazing! Thanks for the content.
    Wondering if you have any idea of the scalability and cost of running such a system?

  • @rajpdus
    @rajpdus 1 year ago

    I like how you diagram and explain things. Keep going. Any chance of getting your diagram template for our personal use? :)

  • @JesusSalazar-kv7mn
    @JesusSalazar-kv7mn 10 months ago

    🎯 Key Takeaways for quick navigation:
    00:00 📑 *This video discusses how to analyze multiple massive PDF documents, like Tesla's annual reports for different years, using AI.*
    02:36 🧠 *The video demonstrates asking analytical questions about multiple PDFs simultaneously and extracting insights from them.*
    03:21 💼 *It showcases asking technical questions and comparing data from different years in PDF reports.*
    06:36 📊 *The video explains the architecture, including converting PDFs to text, splitting into chunks, and using Pinecone for storage.*
    11:03 📈 *It discusses how to dynamically extract namespaces from questions to query specific PDFs and retrieve relevant information across multiple years.*
    35:52 🤖 *The video demonstrates the process of extracting namespaces and mapping them to specific years in PDF documents, enabling targeted queries across multiple files.*
    37:58 📊 *The presenter showcases the ability to analyze compound annual growth rates across three years of financial reports using GPT-4.*
    39:49 💰 *GPT-4 accurately responds to questions about growth potential and profitability based on the past three years' annual reports.*
    41:03 📈 *The video outlines the architecture for using GPT-4 in applications involving multiple PDF files and suggests adjusting parameters for improved accuracy.*
    42:36 🧰 *The presenter encourages viewers to re-watch the video for a better understanding and mentions upcoming workshops for more detailed guidance.*
    Made with HARPA AI

  • @RyanDewar-p8i
    @RyanDewar-p8i 1 year ago

    What is unclear to me is how ChatGPT gets called in the backend. When you say "Tell me the risk factors for 2022", how does your code know which chunks to fetch and then query ChatGPT?

  • @infocentrousmajac
    @infocentrousmajac 1 year ago

    Awesome stuff... Excellent video!

  • @ColinTimmins
    @ColinTimmins 1 year ago

    Fantastic stuff my friend.

  • @Antonio-cn3ji
    @Antonio-cn3ji 1 year ago

    Amazing job! When will you cover a database as the source? Thanks

  • @yazirturku
    @yazirturku 1 year ago

    Excuse me, I didn't fully understand. I have a PDF and I want to integrate it with GPT-4 and chat over the PDF. Is there a ready-made plugin for this, or is that what this video covers?