Great! I process a lot of medical publications in PDF form (I created my own PDF extractor in Python). I will test this new tool. Thanks for sharing :)
Interesting to know! I am using a Gemini 2B-parameter model to convert PDFs to a structured format without any issues. Will try this tool.
*updated insight after testing*: Upstage can parse vector-based graphics because they're essentially mathematical descriptions (paths, coordinates, shapes) in the PDF, not pixel-based images.
However, it is NOT able to parse/describe/extract "raster images": pixel-based data, embedded binary data, fixed-resolution photos, scanned documents.
This is a major downside, as LlamaParse, Azure Document Intelligence & Unstructured all handle both raster and vector graphics. Without that, developers would need to use another service or host one themselves. So while we are rooting for Upstage, it seems to be an incomplete solution. I do love the speed! Thoughts appreciated.
However, Upstage is 5-10x faster for text & table extraction, so it might be worth combining it with a different service (image extraction/description for multi-modal RAG).
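For anyone who wants to try the "combine two services" idea, here is a rough sketch of the routing step in plain Python. It only checks the raw PDF bytes for image XObject markers (`/Subtype /Image`), which is a crude heuristic and my own assumption about how to detect raster content: it can miss images (for example inline images inside compressed content streams) and is no substitute for a real PDF library. The names `count_image_markers` and `choose_route` and the route labels are made up for illustration.

```python
import re

# Image XObjects declare "/Subtype /Image" in their dictionary.
# The \b keeps "/ImageMask"-style keys from matching.
_IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def count_image_markers(pdf_bytes: bytes) -> int:
    """Count image XObject markers in the raw bytes (heuristic only)."""
    return len(_IMAGE_MARKER.findall(pdf_bytes))


def choose_route(pdf_bytes: bytes) -> str:
    """Pick a hypothetical pipeline label for a document.

    "text-only":  send to the fast text/table parser alone.
    "text+image": also send to a second service for image description.
    """
    if count_image_markers(pdf_bytes) > 0:
        return "text+image"
    return "text-only"


if __name__ == "__main__":
    sample = b"%PDF-1.4\n2 0 obj\n<< /Type /XObject /Subtype /Image >>\nendobj\n"
    print(choose_route(sample))
```

In practice you would read the bytes with `open(path, "rb").read()` and only pay for the slower image service on documents that get the `"text+image"` label.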
@awakenwithoutcoffee Hey, you seem to know a lot about this stuff. I use Python code with regex to extract the specific info I need from PDF files. I face multiple PDF layouts and designs (some of them are scanned, which is painful to deal with). I have a collection of Python scripts, one for each layout I have faced so far, but I need something more efficient using LLMs to make my work faster. The extracted PDF data is generally compared with data extracted from Excel files. Both sets of data are then written to an Excel sheet for comparison purposes (audit). Each time I have to enter the PDF and Excel file paths, the valuation date, and the project number. The PDFs sometimes have to be merged together, and the Excel files have to be filtered by date, project number, etc. I am a total noob in programming, but I can get by. I want to create some sort of tool that any colleague on my team who does not know programming could use, for example by only entering the file paths, client name and date, and then the code inside the tool, using machine learning, would create and fill in the Excel sheets for comparison in the audit process. Can you help me and tell me what the best approach is in my case? This needs to be done locally due to private data and work secrets.
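Not a full answer, but the compare step described above can be sketched locally with nothing beyond the standard library. Everything here is an assumption for illustration: the field names, the regex patterns (written for one imaginary layout), and the helper names `extract_fields` and `compare_fields`. It takes text that has already been pulled out of a PDF and one row that has already been read from Excel as a dict.

```python
import re

# Hypothetical patterns for ONE layout; each real layout needs its own set.
FIELD_PATTERNS = {
    "project_number": re.compile(r"Project\s+No\.?\s*:\s*([\w-]+)", re.I),
    "amount": re.compile(r"Total\s*:\s*([\d.,]+)", re.I),
}


def extract_fields(text: str) -> dict:
    """Return the first match for each field, or None if it is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def compare_fields(pdf_fields: dict, excel_row: dict) -> list:
    """Build one comparison row per field for the audit sheet."""
    rows = []
    for name, pdf_value in pdf_fields.items():
        excel_value = excel_row.get(name)
        rows.append({
            "field": name,
            "pdf": pdf_value,
            "excel": excel_value,
            "match": pdf_value == excel_value,
        })
    return rows


if __name__ == "__main__":
    pdf_text = "Project No: P-1042\nTotal: 1,250.00"
    excel_row = {"project_number": "P-1042", "amount": "1,200.00"}
    for row in compare_fields(extract_fields(pdf_text), excel_row):
        print(row)
```

The resulting rows can be written out with the `csv` module and opened in Excel, and a locally hosted model could later replace `extract_fields` without touching the comparison part.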
Mervin, can you share a video on how you set up your environment for projects?
Any recommendations on how to do this locally, without using cloud services?
How could I apply this parsing mechanism to a rag system? Is there any video about this. Would like to apply it to one of the most recent rag systems to extract specific fields from documents
Is it multilingual? What languages can it parse?
Does it support UTF-8 and right-to-left text?
Does it work with PDFs made from scanned documents?
This is amazing
This tool looks excellent
Hope this gets open-sourced at some point. When is their Solar model releasing?
Are there any fees for using Upstage if I integrate this parser into my app? Are there limits on the number of requests?
Hello, tell me if I'm wrong, but it seems to me that you can do the same thing, if not better, in pure Python, right?
Pure Python is limited in extracting complex information.
Just a question about sensitive private documents: would this method be a serious compromise of privacy? Is the data stored in any way in the LLM?
Nope, it rarely ever is, although most services keep data cached for a few hours for practical reasons (read: an upload cache).
OK, my question is: can it handle Word documents? If not, I can convert them to PDF if I have to, but it's a lot of PDFs. I'm in the process of creating a tabletop RPG adventure novel. I know, I know, it's nothing special, but I have so much data and information created from custom characters, storylines, game mechanics and everything, and it takes way too much time to compile it all into a single file with everything broken down, because there are so many different files. I need to keep things separate, but I still have to consolidate. Would this help me do that? I'm only just starting to watch the video, but I had to make the comment. I'm trying to find something that can take all of my PDFs, Word documents and anything else and compile all the data into one place.
Just needed this 😂
Is there a local version of this technology, for privacy, without external calls? Or does it only work by calling Upstage?
yeah, I'm looking for a local version too
Is it open source?
Not open source
Unexpected to see Unstructured being such a poor performer in all aspects... 😮
Agreed, this goes for most major players on the list.