Extracting Structured Data From PDFs | Full Python AI project for beginners (ft Docker)

  • Published: Nov 24, 2024

Comments • 86

  • @MS-vv2sd
    @MS-vv2sd 1 day ago

    This tutorial has been the foundation for launching my education in Python, RAG, and AI development. I started with almost no knowledge of any of these, coming from more of a business background. I replicated every single step in here and took time to learn about each aspect of it. It's taken several months, but I was able to re-create a basic version of this app and I'm using it as a work 'agent' to help with my job. Halfway through this journey, I signed up for Thu's Python labs class, which I wish I had taken at the start. It gave me a structured approach to learning the basics that I had been stumbling through initially. Thu's a great teacher and a motivator.

  • @awesomeowwww
    @awesomeowwww 2 months ago +39

    I started my Data Science journey two years ago, and now I'm building projects like AI assistants and tkinter desktop apps with ollama integration that can summarize the content of different files (PDF, DOCX, images). You are a big part of my development, since your passion and love for this rubbed off on me :))

    • @Thuvu5
      @Thuvu5  2 months ago +6

      Oh congratulations on your projects! So glad you found inspiration from my vids 🤗☺️

  • @rayzorr
    @rayzorr 2 months ago +13

    Wow, that was probably the best tutorial I have watched ... and I have watched a lot! Perfectly pitched and well thought out and delivered. Congrats on a great job!

    • @Thuvu5
      @Thuvu5  1 month ago

      Aw you’re so kind! I’m so glad to hear that 🙌

  • @aireescreates
    @aireescreates 2 months ago +5

    Thanks for this, Thu Vu. I have followed you since your first video, when I was just starting my DS journey. Your videos helped me a lot along the way. I kinda missed you and I'm just glad that you posted again. This is super helpful and you explained all the concepts very clearly. I am currently building a web app that extracts sales data from PDF files and uses an LLM to generate insights, analysis, recommendations, and data viz. Your explanation of Docker is a treasure as I'm building my app! Thank you so much!

    • @Thuvu5
      @Thuvu5  2 months ago +1

      I'm so glad to hear! 🙌

  • @thirumalaraojuvvisetti
    @thirumalaraojuvvisetti 3 days ago +1

    You are speaking with clarity and confidence. Thank you

  • @luisalbertocodes
    @luisalbertocodes 16 days ago

    Just started my data science studies at university and this is awesome, I see my initial linear algebra classes paying off

  • @oksanastrelnikova6970
    @oksanastrelnikova6970 1 month ago +2

    Absolutely amazing content. I am only a beginner, and I do not think I will be able to do it by myself (too frightened), but I could understand every single step you were doing!!! (Also considering that English is not my first language.) Thanks a lot for all your work!!! Your tutorials are super professional and extremely useful!!!

    • @Thuvu5
      @Thuvu5  5 days ago

      Aw you're too kind! I'm really happy to hear it was helpful!

  • @marktahu2932
    @marktahu2932 2 months ago +1

    Thank you, Thu Vu, for a very straightforward, step-by-step guide to creating a RAG project. I have needed something like this for a while to understand how to implement one. Many thanks!!

  • @cerealport2726
    @cerealport2726 2 months ago +1

    This is super interesting. Just like your other projects, you make it easy to see how the general process could be adapted for other purposes. Thanks very much!

  • @sk3ffingtonai
    @sk3ffingtonai 1 month ago +1

    Thank you so much for creating this comprehensive tutorial. I have been and am working hard on my AI Certification and this content is gold.

  • @ZakinAbdul
    @ZakinAbdul 1 month ago

    Thank you for the video, Thu Vu. I recently completed a project using LLMs to interact with PDF data as a chatbot. Your code has been invaluable in helping me handle errors with ChromaDB and create a well-structured project directory. I was curious about potential improvements or alternative approaches that could enhance my project. Converting unstructured PDF data into a structured format with LLMs was a new concept for me, as my project focused solely on chatbot interactions with the data. Your approach has opened my eyes to new possibilities, and I'm eager to explore similar techniques in my future work.

  • @jemiranhunter
    @jemiranhunter 2 months ago +4

    Great content. Very informative. Thanks for sharing.

  • @georgejetson9801
    @georgejetson9801 2 months ago +5

    this would have been amazing for my phd studies

    • @Thuvu5
      @Thuvu5  2 months ago +5

      Maybe for a second PhD? 🤣

  • @cybetica
    @cybetica 2 months ago +4

    You might want to renew your API key, as you showed it in plain text at 10:09 while scrolling. Nice vid!

    • @Thuvu5
      @Thuvu5  2 months ago +5

      Oh thanks, good eyes! 😄 Yep I've revoked the key :)

  • @kenchang3456
    @kenchang3456 2 months ago +2

    Excellent tutorial. Thank you very much.

  • @whatsbetter8457
    @whatsbetter8457 2 months ago +5

    Instead of only being able to use OpenAI, you could use the “instructor” or “ollama-instructor” library in Python to get structured and validated outputs from an LLM (Ollama, OpenAI, Gemini, Groq, etc.). It was already around before OpenAI came up with its feature :-)

    • @Thuvu5
      @Thuvu5  2 months ago

      Thanks for sharing this! Yeah indeed, instructor seems to be more flexible if we want to try different LLMs in the same project

  • @rickrandall3174
    @rickrandall3174 2 months ago +1

    Thu Vu, you are wonderful. 🙂

  • @SumithRajagopalan
    @SumithRajagopalan 1 month ago +1

    Amazing explanation and video 👍

  • @RohithS-ig4hl
    @RohithS-ig4hl 12 days ago

    Thank you so much for this! You explained it really, really well. Kindly post many more videos like this one, on this or other topics.

  • @nguyenhai.truongan
    @nguyenhai.truongan 1 month ago +1

    Hi Thu. I started following you a few years ago, and your videos are very well made. Recently I've seen you posting videos on data analysis using AI. As an AI application developer, I'd like to understand what a data analyst's processes, tasks, and needs look like, so that I can build a complete application for the data analysis field. I hope you'll have a few suggestions for me. Thank you, Thu.

  • @CapybaraLifeStyle
    @CapybaraLifeStyle 1 month ago

    Absolutely fantastic! ❤

  • @perpl1618
    @perpl1618 1 month ago

    This was an amazing video. Thank you, Thu San. Would you consider making a video for advanced users, with all of the small details and edge options?

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 2 months ago

    I prefer to use Cohere's command-r instead of OpenAI for RAG tasks. The API response can pinpoint the exact sentences the information was retrieved from, given the chunks that we feed in. Good for retrieving answers with citations.

  • @MrGbruges
    @MrGbruges 2 months ago +1

    THANX THU VU, VERY INTERESTING!!!!

  • @gviacava
    @gviacava 2 months ago

    What a great tutorial!!! Thank you!

  • @istifanusbulus1214
    @istifanusbulus1214 23 days ago

    Wow, one of the best tutorials. I want to learn how to extract info from sales invoices and vendor invoices and convert it into a dataframe to match it against the general ledger. Could you please do a video about it? Thanks in advance.

  • @Aaron-it5il
    @Aaron-it5il 2 months ago +1

    Thanks for sharing!

  • @robertbutscher6824
    @robertbutscher6824 2 months ago +1

    Great video, thank you so much for the valuable inspiration

  • @VinhNguyen-zg7lu
    @VinhNguyen-zg7lu 1 month ago

    So good, sis ❤❤❤

  • @ravikumarsingh9766
    @ravikumarsingh9766 1 month ago

    Very nicely explained... Really love the content. Way to go!!! I wanted to ask: if I have multiple PDF files (say, 10), how can I create embeddings for all of them and then run the rest of the query? Whenever you have time, please do suggest. I'll wait for your reply!!!
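
    On the multiple-PDF question, one possible approach (a sketch, not the method from the video) is to loop over a folder, split each file, and tag every chunk with its source file before embedding everything into one collection. The helper names `list_pdfs` and `tag_chunks` are made up for illustration, and `split_pdf` in the commented usage stands in for whatever loader and splitter you already use.

```python
from pathlib import Path


def list_pdfs(folder: str) -> list[Path]:
    """Return every PDF in `folder`, sorted for a stable processing order."""
    return sorted(Path(folder).glob("*.pdf"))


def tag_chunks(chunks: list[str], source: Path) -> list[dict]:
    """Attach the source file name so answers can be traced back to a PDF."""
    return [{"text": text, "source": source.name} for text in chunks]


# Hypothetical usage; `split_pdf` is a placeholder for your own loader/splitter.
# all_chunks = []
# for pdf in list_pdfs("data/"):
#     all_chunks.extend(tag_chunks(split_pdf(pdf), pdf))

example = tag_chunks(["first chunk", "second chunk"], Path("data/paper1.pdf"))
```

    Keeping the file name alongside each chunk also makes it possible to filter a query to a single document later, if your vector store supports metadata filters.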

  • @quangvu20780
    @quangvu20780 2 months ago

    Wonderful, great video..

  • @iantotan4229
    @iantotan4229 2 months ago +2

    New video! Finally!

  • @ruanvieira9082
    @ruanvieira9082 4 days ago +1

    thank you friend

  • @agape13
    @agape13 1 month ago

    With that said, there are going to be big waves of layoffs.
    One can already see translator positions being significantly reduced.
    The need for analysts will change in the future as well.

  • @dannyrene
    @dannyrene 26 days ago

    Ngl you’re one smart cookie

    • @dannyrene
      @dannyrene 26 days ago

      I’m not finished watching, but doesn’t each embedding vector need to have the same number of dimensions to calculate their Euclidean distance? Which would imply that all vectors have the same number of dimensions, right? If that’s the case, what is the limiting variable on the number of dimensions? Processing power? Wouldn’t more dimensions also give you a smarter model?
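
      On the dimension question: yes, the Euclidean distance between two embeddings is only defined when both vectors have the same length, and a given embedding model is normally configured to output vectors of one fixed size. A minimal sketch in plain Python follows; the helper name `euclidean_distance` and the three-dimensional toy vectors are made up for illustration, and real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Return the Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        # Vectors from different embedding models (or sizes) cannot be compared.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0.0, 3.0, 4.0]
chunk = [0.0, 0.0, 0.0]
distance = euclidean_distance(query, chunk)  # sqrt(0 + 9 + 16) = 5.0
```

      Each comparison touches every dimension once, so larger vectors cost more storage and compute per lookup; whether they also improve retrieval quality depends on the model.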

  • @coldbelowfroze
    @coldbelowfroze 2 months ago +2

    I missed you so much!

    • @Thuvu5
      @Thuvu5  2 months ago +2

      Aww, thank you 🥹

  • @petersheldrick1851
    @petersheldrick1851 2 months ago

    Great content, so well explained. I am doing an AI course at the moment and I am stuck on my project task, so I'd like to see if you can guide me! The requirement is to use AI, or even deep learning, to predict a person's shoe size based on a photo of the sole of their foot, without shoes and socks. We are not allowed to use other items in the photo as a reference point, for example a centimetre ruler or something of a known size. We have to learn from known images and their respective shoe sizes. I am struggling with where to start!

  • @eulerthegreatestofall147
    @eulerthegreatestofall147 1 month ago

    Great video as always!!! Quick question: how did you create the requirements.txt file??
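
    The thread does not say how the file in the video was produced. The usual command-line route is `pip freeze > requirements.txt`, run inside the project's virtual environment. As a rough Python equivalent, assuming you only need `name==version` lines for what is installed in the current environment (the helper name `freeze_lines` is made up):

```python
from importlib.metadata import distributions


def freeze_lines() -> list[str]:
    """Return 'name==version' lines for packages installed in this environment."""
    lines = {
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in distributions()
        if dist.metadata["Name"]  # skip entries with broken metadata
    }
    return sorted(lines, key=str.lower)


requirements = "\n".join(freeze_lines())
# To save it: pathlib.Path("requirements.txt").write_text(requirements + "\n")
```

    Either way, the output lists everything installed, so it is worth trimming it down to the packages the project actually imports.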

  • @SkySesshomaru
    @SkySesshomaru 2 months ago +1

    incredible

  • @freedman1405
    @freedman1405 2 months ago +1

    Hi Thu Vu, what's your take on privacy issues with ChatGPT? Wouldn't companies risk their confidential data if they implement this system and use their APIs?

    • @Thuvu5
      @Thuvu5  2 months ago

      Good question! In my experience, companies typically use an enterprise subscription to a cloud service like Microsoft Azure that integrates access to these LLMs. Here’s an example: learn.microsoft.com/en-us/azure/ai-services/openai/

  • @readas1
    @readas1 1 month ago

    Hello, I found your video very informative since I am working on a similar project. Question for you: what would you do if the program was not returning good chunks? By that I mean, I uploaded a 90-page pricing document and asked for the title of the document, and none of the retrieved chunks included the first page, so the LLM could not correctly answer the question.
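
    One possible mitigation for title-style questions (an assumption, not something shown in the video) is to always add the first-page chunks to the context alongside whatever the similarity search returns. The sketch below uses plain dictionaries with made-up `id`, `page`, and `text` keys and toy contents; your own chunks will carry whatever metadata your loader provides.

```python
def with_first_page(retrieved: list[dict], all_chunks: list[dict]) -> list[dict]:
    """Prepend first-page chunks to the retrieved ones, without duplicates."""
    first_page = [c for c in all_chunks if c["page"] == 0]
    seen = {c["id"] for c in first_page}
    rest = [c for c in retrieved if c["id"] not in seen]
    return first_page + rest


# Toy data, invented for this example.
all_chunks = [
    {"id": "c0", "page": 0, "text": "Title page text"},
    {"id": "c1", "page": 1, "text": "Section 1 text"},
    {"id": "c2", "page": 57, "text": "Pricing table text"},
]
retrieved = [all_chunks[2], all_chunks[1]]  # what a similarity search might return
context = with_first_page(retrieved, all_chunks)
```

    This costs a few extra tokens per query, which is the trade-off for making sure the title and author block are always visible to the model.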

  • @GoogleUser-tk3mb
    @GoogleUser-tk3mb 1 month ago

    You're really taking my interest in data to the next level! It popped up in my YouTube recommendations, and this is truly a hidden gem. Keep it up, sis.
    +1 Subscribe! I'm sure this channel will blow up soon 🎉
    Anyway, I was wondering how you do that code thing in VS Code without having to type everything? It's amazing!
    And now I'm totally lost!
    FullStack? FrontEnd? BackEnd? Data Analytics?...
    FOMO is killing me! 🔥😭
    But, the worldwide jobs market is stable for data roles, right? 🤔

  • @MichealAngeloArts
    @MichealAngeloArts 2 months ago

    Thanks for the awesome project. What is the amount of code change required if I'll be using a Gemini LLM via Vertex AI on GCP instead of GPT4 / OpenAI (in particular, the LangChain-related code) to replicate this project?

  • @dushimiyimanathaulin7930
    @dushimiyimanathaulin7930 2 months ago +2

    Very informative

  • @Rationalview4915
    @Rationalview4915 2 months ago +1

    It was really helpful
    Thank you for this video❤

  • @kamilherbik
    @kamilherbik 1 month ago

    Thanks

  • @nyanlynn-450
    @nyanlynn-450 2 months ago

    Cool👍💯

  • @s3m3sta
    @s3m3sta 2 months ago +1

    thanks a bunch Thu Vu

  • @rodeondurotan6142
    @rodeondurotan6142 2 months ago

    I hope you can make a video on unstructured pdf data.

  • @nnamdiodozi7713
    @nnamdiodozi7713 2 months ago

    Did you use a Linux environment for this video? I’m asking cos I keep seeing bin in the file paths.

  • @sayfasayfa3500
    @sayfasayfa3500 1 month ago

    Pls can you tell me which IDE you use? I am a complete beginner and I want to do this for my main project.

  • @junaidamin
    @junaidamin 1 month ago

    To get structured data in our answer, can we also use metadata?

  • @ahmadzaimhilmi
    @ahmadzaimhilmi 2 months ago

    I have a question about the structured output. I've been trying to find a workaround for dynamic attributes. The ones that you showed as examples are hardcoded. I want to pass in a dictionary of field names and their explanations and get a resulting dictionary back in return. So far I couldn't think of a way.
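
    One possible workaround for dynamic attributes is to generate the schema at run time from the dictionary instead of hardcoding a class. The sketch below builds a JSON Schema object with only the standard library; the function name `build_schema`, the schema title, and the example fields are made up, and every field is assumed to be a string. Whether your structured-output call accepts a raw JSON Schema dictionary depends on the library and version, so check that before relying on it. Pydantic's `create_model` is another route to the same idea.

```python
import json


def build_schema(fields: dict[str, str]) -> dict:
    """Turn {field_name: explanation} into a JSON Schema object definition."""
    return {
        "title": "ExtractedFields",
        "type": "object",
        "properties": {
            name: {"type": "string", "description": explanation}
            for name, explanation in fields.items()
        },
        "required": list(fields),
    }


# Example field dictionary, invented for illustration.
fields = {
    "paper_title": "Title of the paper",
    "publication_year": "Year the paper was published",
}
schema = build_schema(fields)
print(json.dumps(schema, indent=2))
```

    The model's reply can then be parsed back into an ordinary dictionary keyed by the same field names.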

  • @_Around_The_Globe_
    @_Around_The_Globe_ 1 month ago

    I get a NotImplementedError when using with_structured_output with gpt-4o-mini. Can someone help, please?

  • @CyberHorrorHunter
    @CyberHorrorHunter 2 months ago +1

    I am new to this journey, but how did you get your VS Code to output so many lines? I have been tinkering with notebook settings and can't seem to get it to output the larger amount of data without it running far off the right of the screen.

    • @Thuvu5
      @Thuvu5  2 months ago

      Good question, I believe it’s a notebook setting. Check it out: stackoverflow.com/questions/67855498/how-to-display-all-output-in-jupyter-notebook-within-visual-studio-code

  • @hongmeixie409
    @hongmeixie409 1 month ago

    Can you show what it looks like in Docker?

  • @trungvan2154
    @trungvan2154 1 month ago

    Does this code scenario work well for other languages such as Vietnamese, with a lang parameter vn for example? Thanks

  • @joseduarte1240
    @joseduarte1240 1 month ago

    Can we create a local environment that can read all the files in one folder, whether they're Excel files, PDFs, or anything else?

  • @datagus
    @datagus 2 months ago

    Is the text extracted from the PDF any good? Often the extraction process produces text that is all messed up, which can have negative consequences for the chunking step.

    • @heritage1834
      @heritage1834 2 months ago +1

      I believe it depends on the formatting of the PDF files and also on the extraction method used. A project article I read suggested that image-to-text (OCR) usually produces better results than parsing PDF documents, especially when the PDF is badly formatted

  • @FauziFayyad
    @FauziFayyad 2 months ago +1

    Yay Thu vuu !

  • @CyberHorrorHunter
    @CyberHorrorHunter 1 month ago

    Additionally, I have found this does not output tables correctly (any idea how to remedy that?). Also, the results seem to depend on whether the PDF contains real text or PNG/JPG images of the original text embedded in a PDF.

  • @jeffkidder5282
    @jeffkidder5282 15 days ago

    Anything that even looks/feels too good to be true usually is. All this wonderful advancement screams of disaster just waiting to happen.

  • @d.d.z.
    @d.d.z. 2 months ago

    Very complete

  • @RipulKumar-g2d
    @RipulKumar-g2d 1 month ago

    Hi, not sure if you reply or not. I tried to follow your video, but I am stuck at 22:22 and not able to move further. I'm getting an error when I execute the same code

  • @supertab365
    @supertab365 1 month ago

    Damn that's beginner level? I am f--d

  • @readas1
    @readas1 1 month ago

    Have you refined this project at all? I built your version with 0 edits, and it gets everything wrong every time I test a paper of any length. The program works, but it does not actually interpret the documents well at all. Of the 10 or so I have tested, I don't think it has gotten a title correct once, and it usually gets 0/4 correct.

  • @DrB934
    @DrB934 2 months ago

    You may have just killed QSR NVivo...

  • @ANAND02120
    @ANAND02120 10 days ago

    Hey, I tried to play with your code, but I can see the model is hallucinating. The answers it gives are not correct. How can we fix it?
              | paper_title                                       | paper_summary                                     | publication_year                                  | paper_authors
    answer    | Title of the Research Paper                       | This research paper discusses the impact of cl... | 2023                                              | John Doe, Jane Smith
    source    | The title is clearly stated at the beginning o... | The summary of the paper outlines the key find... | The publication date is mentioned in the heade... | The authors' names are listed right below the ...
    reasoning | The title of a research paper is typically fou... | A summary usually encapsulates the main points... | Publication years are typically indicated in t... | Authors are usually prominently displayed alon...

  • @hoangsang2471
    @hoangsang2471 2 months ago

    Are you Vietnamese? Your name seems like a Vietnamese name

  • @sifhatshams-s1j
    @sifhatshams-s1j 2 months ago

    If you were my sister, I wouldn't have to worry about any problem :))
    Why weren't you born as my sister :((