Going over libraries useful for AI dev is a great video series idea!
Thank you. If you have any interesting choices in mind feel free to let me know :)
100%
Honestly, we are lucky to know you..... Many thanks and appreciation to you, Mr. Abdul ❤
I'm glad you found it useful :)
This channel is completely underrated! Thanks for this video
Glad you think so! Thank you :)
I was looking for something like this to make a raw-text version of the Hugging Face documentation, since no LLMs are trained on it because it's only available in a very weird website format. This is awesome :)
Thanks for sharing!!
An idea for a better video structure: put a demo at the beginning. I had some idea of what the library does, but I had to watch until the end to understand what it can do.
Thanks for the tip. Do you mean like showing the final output?
@1littlecoder yes, something like input and output. It acts as a hook.
@BiMoba Thank you. I'll try to make sure!
I always find out if I'm interested in a particular video by reading the transcript summary.
That's a clever way!
Wow, thanks Majeed, that's something I desperately need. I was facing a lot of issues with text conversion in my RAG system. It would also be helpful if you could do a tutorial on sentence window retrieval + rerank for RAG.
Informative Thanks
wow you are back after a week. You should take some breaks like this. AI is going crazy. You won't miss anything
I saw a lot of models being launched. In fact, I've been thinking of doing a weekly summary like AI news this time.
@1littlecoder yeah, I miss your weekly AI news. You should start it again. Not all the AI stuff that happened that week, but just the crazy groundbreaking inventions or papers, or whatever impresses you. That way it won't be 20-30 min long; you can make it 10-12 min. There's a YouTube channel, "the friday checkout", whose format you can follow.
Great tool Thanks!🤩
Thanks. Looking forward to an advanced tutorial covering how to use unstructured for chunking, RAG....
Great video!
Glad you enjoyed it
👏👏👏👏👏
Look who's here 😁
Yes I am looking at unstructured - have you used it? How good is it for tables?
Do you know how it differs from pandoc?
Afaik pandoc helps you generate PDFs.
@1littlecoder and what is the difference between the unstructured HTML parser and the html2text library? And why are there pages in HTML documents in the first place?
@1littlecoder I see, thanks. pdfminer is an alternative, as you mentioned.
Are you also doing the Vectara advanced RAG hackathon?
I like this !
After extracting the text, how do I extract some information and write it to an Excel file?
Great video boss, it supports multiple languages
Sir, how do I install and use this on Docker? There is no video about it on the internet.
I think LlamaIndex has its own Docker version
@1littlecoder there is a Docker image of unstructured io, and they also give the option to install it as a Docker container, but there are no instructions on how to proceed. A video on it would be very helpful.
PDFs will take longer to process than a text file. This creates a need to use the Unstructured commercial SaaS API. For other formats, it is okay to use.
If it is implemented, is it on-premise, or does it call the Unstructured API, which would use our ingested data?
Whatever we did in this video is on-prem because we aren't calling any API
The problem with this library is that it is very generic... the main problem is that there are much better libraries currently available for each of the formats, which do a better job than the underlying libraries used here... Tesseract is easily beaten by EasyOCR, pdfminer is beaten by PyMuPDF, and so on. It's very good with that. Also, what if docs or PDFs have images in some of the pages?
Stop shtposting please 🙏
Means
@1littlecoder he is implying this video is shit. Which I disagree with. Although the video could have been shorter.
@kalilinux8682 I actually asked the question to make sure it's not a bot
@1littlecoder I am not a bot. LMAO.
@Macorelppa Glad to know. Dealing with a lot of bots, I'm happy to see humans
Oh, you briefly mentioned it uses pdfminer under the hood? I hope not! From personal experience testing different Python libs, I found the results of pyPDF and PyMuPDF much better.
Exactly what I needed. 🥌