Going over libraries useful for AI dev is a great video series idea!
Thank you. If you have any interesting choices in mind feel free to let me know :)
100%
Thanks. Looking forward to an advanced tutorial covering how to use unstructured for chunking, RAG...
Yes I am looking at unstructured - have you used it? How good is it for tables?
The problem with this library is that it is very generic. There are much better libraries currently available for each of the formats, and they do a better job than the underlying libraries used here: Tesseract is easily beaten by easyocr, pdfminer is beaten by pymupdf, and so on. It's very good with that. Also, what if docs or PDFs have images on some of the pages?
After extracting the text, how do I extract some information and write it to an Excel file?
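One possible way to approach the question above, as a minimal sketch only: assume the extracted text contains simple "Key: Value" lines, pull those into a dictionary, and write them to CSV, which Excel can open. The sample text and the helper names `extract_fields` and `write_rows_csv` are hypothetical and not part of unstructured; only the Python standard library is used.

```python
import csv
import io

# Hypothetical sample standing in for text returned by an extraction step.
sample_text = """Invoice Number: INV-1001
Date: 2024-01-15
Total: 250.00
"""


def extract_fields(text):
    """Collect 'Key: Value' lines into a dict (assumed input layout)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def write_rows_csv(rows, handle):
    """Write a list of dicts as CSV, one column per distinct key."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


fields = extract_fields(sample_text)

# Written to an in-memory buffer here; a real file would work the same way.
buffer = io.StringIO()
write_rows_csv([fields], buffer)
csv_output = buffer.getvalue()
```

To produce a file, the same function can be called with `open("output.csv", "w", newline="")` as the handle. A native .xlsx file would need a third-party package such as openpyxl or pandas, which this sketch deliberately avoids.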
Oh, you briefly mentioned it uses pdfminer under the hood? I hope not! From personal experience testing different Python libs, I found the results of pyPDF and PyMuPDF much better.
An idea for a better video structure would be to put a demo at the beginning. I had some idea, but I had to watch until the end to understand what the library can do.
Thanks for the tip. Do you mean like showing the final output?
@@1littlecoder yes, something like input and output. It acts as a hook.
@@BiMoba Thank you. I'll try to make sure!
I always find out if I'm interested in a particular video by reading the transcript summary.
That's a clever way!
Do you know how it differs from pandoc?
Afaik pandoc helps you generate PDFs.
@@1littlecoder And what is the difference between the unstructured HTML parser and the html2text library? And why are there pages in HTML documents in the first place?
@@1littlecoder I see, thanks. pdfminer is an alternative, as you mentioned.
If this is implemented, does it run on-premise, or does it call the Unstructured API, which would ingest our data?
Whatever we did in this video is on-prem because we aren't calling any API.
wow you are back after a week. You should take some breaks like this. AI is going crazy. You won't miss anything
I saw a lot of models being launched. In fact, I've been thinking of doing a weekly summary like AI news this time.
@@1littlecoder Yeah, I miss your weekly AI news. You should start it again. Not all the AI stuff that happened that week, but the crazy groundbreaking inventions or papers, or whatever impresses you. That way it won't be 20-30 min long; you can make it 10-12 min. There's a YouTube channel, "the friday checkout", whose format you can follow.
PDFs will take longer to process than a text file. This creates a need to use Unstructured Commercial SaaS API. For other formats, it is okay to use.
Wow, thanks Majeed, that's something I desperately need. I was facing a lot of issues with text conversion in my RAG system. It would also be helpful if you could run a tutorial on sentence window retrieval + rerank for RAG.
Informative Thanks
This channel is completely underrated! Thanks for this video
Glad you think so! Thank you :)
Honestly, we are lucky to know you..... Many thanks and appreciation to you, Mr. Abdul ❤
I'm glad you found it useful :)
I was looking for something like this to make a raw text version of the Hugging Face documentation, since no LLMs are trained on it because it's only available in a very weird website format. This is awesome :)
Thanks for sharing!!
Great video boss, it supports multiple languages
Are you also doing the Vectara advanced RAG hackathon?
Sir, how do I install and use this on Docker? There is no video about it on the internet.
I think LlamaIndex has its own Docker version
@@1littlecoder There is a Docker image of unstructured io, and they also give the option to install it as a Docker container, but there are no instructions on how to proceed. A video on it would be very helpful.
Great tool Thanks!🤩
👏👏👏👏👏
Look who's here 😁
Great video!
Glad you enjoyed it
I like this !
Stop shtposting please 🙏
Means
@@1littlecoder he is implying this video is shit. Which I disagree with. Although the video could have been shorter.
@@kalilinux8682 i actually asked the question to make sure it's not a bot
@@1littlecoder I am not a bot. LMAO.
@@Macorelppa Glad to know. Dealing with a lot of bots, I'm happy to see humans
Exactly what I needed. 🥌