How to convert PDF DOCX to Structured TXT Formats for RAG! (UNSTRUCTURED Tutorial)

  • Published: 30 Sep 2024

Comments • 54

  • @IdPreferNot1
    @IdPreferNot1 5 months ago +9

    Going over libraries useful for AI dev is a great video series idea!

    • @1littlecoder
      @1littlecoder 5 months ago

      Thank you. If you have any interesting choices in mind feel free to let me know :)

    • @eugenmalatov5470
      @eugenmalatov5470 5 months ago +1

      100%

  • @MrLyonliang
    @MrLyonliang 1 month ago

    Thanks. Looking forward to an advanced tutorial on using unstructured for chunking, RAG...

  • @OP-yr6jb
    @OP-yr6jb 4 months ago

    Yes I am looking at unstructured - have you used it? How good is it for tables?

  • @mandalorian1992
    @mandalorian1992 1 month ago

    The problem with this library is that it is very generic. There are much better libraries currently available for each of the formats, which do a better job than the underlying libraries used here: Tesseract is easily beaten by EasyOCR, pdfminer is beaten by PyMuPDF, and so on. It's very good with that. Also, what if docs or PDFs have images on some of the pages?

  • @nithishkrish3442
    @nithishkrish3442 4 months ago

    After extracting the text, how do I extract some information and write it to an Excel file?

  • @ilianos
    @ilianos 5 months ago

    Oh, you briefly mentioned it uses pdfminer under the hood? I hope not! From personal experience with testing different Python libs, I found the results of pyPDF and PyMuPDF much better.

  • @BiMoba
    @BiMoba 5 months ago +4

    An idea for a better video structure would be to have a demo at the beginning. I had some idea, but had to watch until the end to understand what the library can do.

    • @1littlecoder
      @1littlecoder 5 months ago +2

      Thanks for the tip. Do you mean like showing the final output?

    • @BiMoba
      @BiMoba 5 months ago +1

      @1littlecoder yes, something like input and output. It acts as a hook.

    • @1littlecoder
      @1littlecoder 5 months ago +1

      @BiMoba Thank you. I'll try to make sure!

    • @MrKellvalami
      @MrKellvalami 5 months ago +1

      I always find out if I'm interested in a particular video by reading the transcript summary.

    • @1littlecoder
      @1littlecoder 5 months ago

      That's a clever way!

  • @MrPierreSab
    @MrPierreSab 5 months ago +1

    Do you know how this differs from pandoc?

    • @1littlecoder
      @1littlecoder 5 months ago

      Afaik pandoc helps you generate PDFs.

    • @eugenmalatov5470
      @eugenmalatov5470 5 months ago

      @1littlecoder And what is the difference between the unstructured HTML parser and the html2text library? And why are there pages in HTML documents in the first place?

    • @MrPierreSab
      @MrPierreSab 5 months ago

      @1littlecoder I see, thanks. pdfminer is an alternative, as you mentioned.

  • @nirmaldesai4504
    @nirmaldesai4504 5 months ago

    If it is implemented, is it on-premise, or does it call the Unstructured API, which would use our ingested data?

    • @1littlecoder
      @1littlecoder 5 months ago

      Whatever we did in this video is on-prem because we aren't calling any API.

  • @captainoddessy
    @captainoddessy 5 months ago +1

    wow you are back after a week. You should take some breaks like this. AI is going crazy. You won't miss anything

    • @1littlecoder
      @1littlecoder 5 months ago +3

      I saw a lot of models being launched. In fact, I've been thinking of doing a weekly summary like AI news this time.

    • @captainoddessy
      @captainoddessy 5 months ago

      @1littlecoder Yeah, I miss your weekly AI news. You should start it again. Not all the AI stuff that happened that week, but just the crazy groundbreaking inventions or papers, or whatever impresses you. That way it won't be 20-30 min long; you can make it 10-12 min. There's a YouTube channel, "The Friday Checkout"; you can follow his format.

  • @DeepakRavi93
    @DeepakRavi93 5 months ago

    PDFs will take longer to process than a text file. This creates a need to use Unstructured Commercial SaaS API. For other formats, it is okay to use.

  • @drramasubramaniam6724
    @drramasubramaniam6724 5 months ago

    Wow, thanks Majeed, that’s something I desperately need. I was facing a lot of issues with text conversion in my RAG system. It would also be helpful if you could run a tutorial on sentence window retrieval + rerank for RAG.

  • @faisalIqbal_AI
    @faisalIqbal_AI 5 months ago +1

    Informative Thanks

  • @Raphy_Afk
    @Raphy_Afk 5 months ago +1

    This channel is completely underrated! Thanks for this video

    • @1littlecoder
      @1littlecoder 5 months ago +1

      Glad you think so! Thank you :)

  • @yusufersayyem7242
    @yusufersayyem7242 5 months ago +2

    Honestly, we are lucky to know you..... Many thanks and appreciation to you, Mr. Abdul ❤

    • @1littlecoder
      @1littlecoder 5 months ago

      I'm glad you found it useful :)

  • @Nick_With_A_Stick
    @Nick_With_A_Stick 5 months ago

    I was looking for something like this to make a raw text version of the Hugging Face documentation, since no LLMs are trained on it, as it’s only available in a very weird website format. This is awesome :)

  • @sharanbabu2001
    @sharanbabu2001 5 months ago +1

    Thanks for sharing!!

  • @maizizhamdo
    @maizizhamdo 5 months ago

    Great video boss, it supports multiple languages

  • @rounaksen1683
    @rounaksen1683 5 months ago

    are you also doing vectara advanced rag hackathon ?

  • @drmetroyt
    @drmetroyt 3 months ago

    Sir, how do I install and use this on Docker? There is no video on the internet.

    • @1littlecoder
      @1littlecoder 3 months ago

      I think LlamaIndex has its own Docker version

    • @drmetroyt
      @drmetroyt 3 months ago

      @1littlecoder There is a Docker image of unstructured.io, and they also give the option to install it as a Docker container, but there are no instructions on how to proceed. A video on it would be very helpful.

  • @jmirodg7094
    @jmirodg7094 5 months ago

    Great tool Thanks!🤩

  • @Saranlisto
    @Saranlisto 5 months ago +1

    👏👏👏👏👏

  • @eugenmalatov5470
    @eugenmalatov5470 5 months ago +1

    Great video!

  • @adarmawan1977
    @adarmawan1977 5 months ago

    I like this !

  • @Macorelppa
    @Macorelppa 5 months ago +1

    Stop shtposting please 🙏

    • @1littlecoder
      @1littlecoder 5 months ago

      Means

    • @kalilinux8682
      @kalilinux8682 5 months ago +1

      @1littlecoder He is implying this video is shit, which I disagree with. Although the video could have been shorter.

    • @1littlecoder
      @1littlecoder 5 months ago +1

      @kalilinux8682 I actually asked the question to make sure it's not a bot

    • @Macorelppa
      @Macorelppa 5 months ago +1

      @1littlecoder I am not a bot. LMAO.

    • @1littlecoder
      @1littlecoder 5 months ago +2

      @Macorelppa Glad to know. Dealing with a lot of bots, I'm happy to see humans

  • @etahydri
    @etahydri 5 months ago

    Exactly what I needed. 🥌