Sir, can you please make a follow-up video on the complete flow of data ingestion into the Qdrant vector DB without using an ipynb notebook? I have tried many times without success due to issues like SSL certificate errors and being unable to download NLTK data.
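Until a video covers it, here is a minimal sketch of what a plain-Python ingestion script could look like, assuming LangChain's Qdrant integration, local HuggingFace embeddings, and a local Qdrant instance; the SSL lines are a common (unofficial) workaround for the NLTK download errors, and the file and collection names are placeholders.

```python
# ingest.py - runs as a plain script, no notebook needed
import ssl
import nltk

# Common workaround for SSL certificate errors when downloading NLTK data;
# use only if you cannot fix your certificates properly.
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download("punkt")

from unstructured.partition.pdf import partition_pdf
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant

# Parse the PDF locally and wrap each element as a LangChain document.
elements = partition_pdf(filename="report.pdf")  # "report.pdf" is a placeholder
docs = [Document(page_content=el.text) for el in elements if el.text]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Qdrant.from_documents(
    docs,
    embeddings,
    url="http://localhost:6333",   # assumes a local Qdrant instance
    collection_name="pdf_chunks",  # hypothetical collection name
)
```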
Thanks! All I ask for is perfect table extraction with all the formatting and accuracy. What's my best bet?
You are welcome. You can give LlamaParse a try:
ruclips.net/video/S_F4RUhKaV4/видео.htmlsi=XHE98g6xAuh0u8jb
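For reference, a minimal LlamaParse sketch, assuming the llama-parse package is installed and LLAMA_CLOUD_API_KEY is set in the environment; the filename is a placeholder, and result_type="markdown" tends to preserve table formatting.

```python
from llama_parse import LlamaParse

# Assumes LLAMA_CLOUD_API_KEY is set in the environment.
# result_type="markdown" returns tables as markdown tables.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("tables.pdf")  # "tables.pdf" is a placeholder
print(documents[0].text)
```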
Great video! Just one thing: if there are any columns in the PDF that contain only URLs, the URLs just show up as NaN, and they are not read during inference from the PDF (after the data structuring). Have you also encountered or tried this? Can you try it out in one of the upcoming videos?
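One possible workaround (not from the video): read the PDF's link annotations directly. A minimal sketch using PyMuPDF, with a placeholder filename.

```python
import fitz  # PyMuPDF

doc = fitz.open("document.pdf")  # placeholder filename
for page_number, page in enumerate(doc, start=1):
    for link in page.get_links():  # link annotations on the page
        uri = link.get("uri")
        if uri:
            print(page_number, uri)
```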
It was a fruitful video. I wonder about PDFs with complex layouts, e.g. made of rectangles of different dimensions that contain information. For that case, YOLO or cv2 is used to detect edges, and then OCR is applied to extract the tables and the information in them.
My question is: is there a way to extract the layouts and information and then visualize them in Jupyter?
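A rough sketch of that idea, assuming one PDF page rendered to an image, OpenCV for the rectangle detection, pytesseract for the OCR, and matplotlib for the inline Jupyter visualization; the size filter is a hypothetical threshold.

```python
import cv2
import pytesseract
from matplotlib import pyplot as plt

img = cv2.imread("page.png")  # placeholder: one PDF page rendered to an image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Find rectangular regions (candidate boxes/tables) via contours.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w > 50 and h > 50:  # hypothetical size filter to skip noise
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        roi = gray[y:y + h, x:x + w]
        print(pytesseract.image_to_string(roi))  # OCR the detected region

# Show the detected layout inline in a Jupyter cell.
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()
```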
I have a question: if I have a PDF file in another language, will it work?
It would be really interesting if you made a video on a multimodal RAG using Unstructured, Groq, Qdrant, LangChain, and Chainlit (even better to make a Streamlit app out of it).
Will add it to my to-do list ✅
Here is one example you can try:
RAG With LlamaParse from LlamaIndex & LangChain 🚀
ruclips.net/video/f9hvrqVvZl0/видео.html
Very nice video. I have a question: is it private? If I have sensitive documents, do they stay private?
Yes, it is, as you used the LLM via Ollama, which is downloaded on your machine. And with Unstructured, if you used the pip install, that's private; but with the API, make sure to check how they process the data.
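To illustrate the fully local path, a minimal sketch assuming the pip-installed unstructured library and an Ollama model already pulled on the machine; the filename and model name are placeholders, and nothing leaves your machine.

```python
from unstructured.partition.pdf import partition_pdf
from langchain_community.llms import Ollama

# Everything below runs locally: unstructured parses the PDF on-device,
# and Ollama serves the model from localhost.
elements = partition_pdf(filename="sensitive.pdf")  # placeholder filename
context = "\n".join(el.text for el in elements if el.text)

llm = Ollama(model="llama3")  # assumes `ollama pull llama3` was run beforehand
print(llm.invoke(f"Summarize this document:\n{context}"))
```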
Which app do you use for Python coding?
Hi, it's VS Code.
How difficult is it to bypass the paywall and build your own instance that serves the PDF extraction instead of using their API?
It's not that difficult, but installing packages might be challenging, as it needs different packages to do the task.
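For reference, a rough sketch of the self-hosted, open-source route; the pip extra and system dependencies are from the unstructured docs, but exact versions may vary, and the filename is a placeholder.

```python
# Self-hosted extraction with the open-source library instead of the hosted API.
# System dependencies (installed outside Python): poppler-utils, tesseract-ocr.
# Python install:  pip install "unstructured[pdf]"

from unstructured.partition.pdf import partition_pdf

# strategy="hi_res" uses a local layout-detection model; no API key needed.
elements = partition_pdf(filename="report.pdf", strategy="hi_res")  # placeholder file
for el in elements:
    print(el.category, el.text[:80])
```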
How much accuracy does it provide when extracting tables and text from scanned and handwritten PDFs?
Haven't tried that, you can give it a try!
Sir, could you please make a video on extracting images from PDFs using open-source models?
Will add it to my to-do list!
@datasciencebasics 😂
Have you tried LlamaParse?
Here it is :)
Super Easy Way To Parse PDF | LlamaParse From LlamaIndex | LlamaCloud
ruclips.net/video/wRMnHbiz5ck/видео.html
Extracting Japanese tables has problems with garbled characters. If Unstructured can get the characters, why does the OCR re-read them incorrectly?
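One thing worth checking: the OCR engine may be defaulting to English. A minimal sketch that passes a language hint, assuming a recent unstructured version that supports the languages parameter and Tesseract's Japanese language data (tesseract-ocr-jpn) installed on the system; the filename is a placeholder.

```python
from unstructured.partition.pdf import partition_pdf

# Tell the OCR engine (Tesseract) to use its Japanese model; requires the
# tesseract-ocr-jpn language data to be installed on the system.
elements = partition_pdf(
    filename="japanese_tables.pdf",  # placeholder filename
    strategy="hi_res",
    languages=["jpn"],
)
for el in elements:
    print(el.text)
```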
which browser do you use?
Arc browser!!
Does it cover scanned PDFs?
Haven't tried that yet. You can give it a try!
I have implemented the code in Colab on my own custom data. I am facing an issue where it omits zeros: for example, an Amount value of 43220.00 shows up as only 4322. Please suggest a way to fix this issue.
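One common cause is the amount column being coerced to a number somewhere in the pipeline. A minimal workaround sketch, assuming the extracted table ends up in a pandas DataFrame; the file and column names are hypothetical.

```python
import pandas as pd
from decimal import Decimal

# Keep the Amount column as text so no digits or trailing zeros are lost,
# then convert to Decimal only when arithmetic is actually needed.
df = pd.read_csv("extracted_table.csv", dtype={"Amount": str})  # placeholder names
df["Amount_decimal"] = df["Amount"].map(Decimal)
print(df.head())
```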
Thank you!
You are welcome!!
Anyone getting an error while importing unstructured?
yes