Is it possible to do OCR with different languages by injecting traineddata files?
Hi,
You can use the TessDataPath property of the OCR Processor class to specify the folder that contains the trained data for other languages. Trained tessdata files for additional languages are available in the GitHub repository linked below.
github.com/tesseract-ocr/tessdata
Please refer to our user guide (UG) documentation and knowledge base article below:
help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#tesseractbinaries-paths-and-tesseract-language-data
support.syncfusion.com/kb/article/4219/how-to-support-german-and-other-languages-in-the-ocr-processor
help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#ocr-with-multiple-languages
We have also attached a sample and its output document for your reference.
Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/OCR-with-multiple-langauages357136719
Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output221679158
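As a rough illustration of the approach described above, the sketch below points TessDataPath at a local folder of downloaded traineddata files and requests two languages. The file names, the folder path, the "eng+deu" language string, and the parameterless OCRProcessor constructor are assumptions for illustration and may differ between package versions, so please check the UG links above for the exact API.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class MultiLanguageOcrSketch
{
    static void Main()
    {
        // Assumption: a recent package version where OCRProcessor has a parameterless constructor.
        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        {
            // Load the scanned PDF from a stream.
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Hypothetical folder holding eng.traineddata and deu.traineddata
            // downloaded from github.com/tesseract-ocr/tessdata.
            processor.TessDataPath = "tessdata/";

            // Assumption: multiple Tesseract language codes are joined with '+'.
            processor.Settings.Language = "eng+deu";

            // Run OCR on the loaded document.
            processor.PerformOCR(document);

            // Save the searchable PDF to a hypothetical output file.
            using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                document.Save(output);
            }
            document.Close(true);
        }
    }
}
```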
Is it also possible to extract text for a PDF with mixed content? Simple Text AND scanned images?
Hi,
Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, into searchable and editable data. The Syncfusion OCR processor library supports OCR on scanned PDF documents and images with the help of Google’s Tesseract Optical Character Recognition engine. Internally, we extract the images from the PDF document page by page and send them to the OCR processor, which recognizes the text in those images. As a result, content that is already searchable and editable text is not processed by OCR.
We have attached our documentation, a sample, and the output document below for your reference.
Documentation: help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/working-with-ocr
Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/.NET-114870683
Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output-1171496377
Hi, thanks. Is it possible to overlay text on top of the characters in a PDF that contains images as well?
Hi,
Yes, we support overlaying text on PDF documents that contain images. Please refer to the code snippet below for more information.
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
// Create a PDF document.
PdfDocument doc = new PdfDocument();
// Add a page to the document.
PdfPage page = doc.Pages.Add();
// Get the PDF graphics for the page.
PdfGraphics graphics = page.Graphics;
// Create a PDF font.
PdfFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 12, PdfFontStyle.Regular);
// Set the transparency (50% for both pen and brush) with the Overlay blend mode.
graphics.SetTransparency(0.5f, 0.5f, PdfBlendMode.Overlay);
// Draw the string at the top-left corner of the page.
graphics.DrawString("Hello world!", font, PdfPens.Black, PdfBrushes.Red, 0, 0);
// Save the document.
doc.Save("Output.pdf");
// Close the document and release its resources.
doc.Close(true);
Please find the documentation below.
help.syncfusion.com/cr/file-formats/Syncfusion.Pdf.Graphics.PdfBlendMode.html
Thanks. How do I specify the region to be converted?
Hi,
Yes, you can specify the region on which to perform OCR. Please find the documentation and GitHub sample below.
Documentation: help.syncfusion.com/file-formats/pdf/working-with-ocr/features?#performing-ocr-for-a-region-of-the-document
GitHub Sample: github.com/SyncfusionExamples/PDF-Examples/tree/master/OCR/.NET/Perform-OCR-on-particular-region-of-PDF-document?
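For orientation, here is a minimal sketch of region-based OCR along the lines of the linked documentation. The rectangle coordinates, page index, file names, and the use of Syncfusion.Drawing.RectangleF (from the cross-platform package; Windows-specific packages may use System.Drawing instead) are assumptions, so please treat the linked GitHub sample as the reference.

```csharp
using System.Collections.Generic;
using System.IO;
using Syncfusion.Drawing; // Assumption: cross-platform package; may be System.Drawing on Windows-specific packages.
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class RegionOcrSketch
{
    static void Main()
    {
        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Hypothetical rectangle (x, y, width, height) covering the area to recognize.
            RectangleF area = new RectangleF(0, 100, 950, 150);

            // Associate the rectangle with a page. The index value here is illustrative;
            // check the documentation for how pages are numbered.
            PageRegion region = new PageRegion();
            region.PageIndex = 0;
            region.PageRegions = new RectangleF[] { area };

            // Restrict OCR to the listed regions only.
            processor.Settings.Regions = new List<PageRegion> { region };
            processor.Settings.Language = Languages.English;
            processor.PerformOCR(document);

            using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                document.Save(output);
            }
            document.Close(true);
        }
    }
}
```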
Thanks
Is it possible to extract the text from a PDF form with different field values?
Hi,
We support extracting and modifying the form fields in a PDF document.
Please find the UG link below:
pulse.ly/bsqgk1yrj6
If your forms are in image format, you can use Azure Form Recognizer to extract the information from them. Please describe your actual requirements in detail so that we can analyze them and assist you further.
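For a fillable (non-image) PDF form, reading field values might look roughly like the sketch below. The input file name is hypothetical, and only text box fields are read in detail here; other field types expose their values through their own classes, as described in the UG.

```csharp
using System;
using System.IO;
using Syncfusion.Pdf.Interactive;
using Syncfusion.Pdf.Parsing;

class FormFieldSketch
{
    static void Main()
    {
        // Hypothetical input file containing an interactive form.
        using (FileStream input = new FileStream("Form.pdf", FileMode.Open, FileAccess.Read))
        {
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Form is null when the document has no interactive form.
            PdfLoadedForm form = document.Form;
            if (form != null)
            {
                foreach (PdfField field in form.Fields)
                {
                    if (field is PdfLoadedTextBoxField textBox)
                    {
                        // Print the field name together with its current text value.
                        Console.WriteLine(textBox.Name + ": " + textBox.Text);
                    }
                    else
                    {
                        // Other field types are only listed by name in this sketch.
                        Console.WriteLine(field.Name);
                    }
                }
            }
            document.Close(true);
        }
    }
}
```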
Why do you use FileStream when all you need is to pass the full file path to PdfLoadedDocument?
Hi,
In .NET Core, the platform design favors portability and modularity, so some file-handling behaviors differ from those in the Windows-only .NET Framework. Specifically, the .NET Core version of our library expects a `FileStream` when loading files with `PdfLoadedDocument`, for the following reasons:
1. Cross-Platform Compatibility: Unlike the Windows-based .NET Framework, .NET Core is designed to run on Windows, macOS, and Linux. Because platforms differ in how they handle file paths and file access, a `FileStream` provides a more uniform approach and makes file access explicit, so your code behaves consistently across environments.
2. Direct Path Access Restrictions: In .NET Framework on Windows, you can pass the file path directly because the underlying libraries support it on that OS. The .NET Core version, however, does not accept a file path passed directly to the `PdfLoadedDocument` constructor. Instead, it takes a `FileStream`, which gives more controlled file handling across the variety of file systems on different platforms.
Using a `FileStream` keeps your file handling reliable and compatible across operating systems. It also gives you explicit control over file permissions and cleanup, which can help improve application stability.
Please refer to the link below for more information:
help.syncfusion.com/document-processing/pdf/pdf-library/net/open-and-save-pdf-file-in-c-sharp-vb-net?cs-save-lang=1&cs-lang=csharp
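To make the pattern concrete, here is a minimal sketch of loading and saving a PDF through streams. The file names are hypothetical, and the `using` blocks are one way to get the explicit cleanup mentioned above.

```csharp
using System;
using System.IO;
using Syncfusion.Pdf.Parsing;

class LoadWithFileStreamSketch
{
    static void Main()
    {
        // Open the file explicitly with read-only access; the stream is disposed when the block ends.
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        {
            // Load the document from the stream rather than from a path string.
            PdfLoadedDocument document = new PdfLoadedDocument(input);
            Console.WriteLine("Pages: " + document.Pages.Count);

            // Save to a second stream (hypothetical output file).
            using (FileStream output = new FileStream("Copy.pdf", FileMode.Create, FileAccess.ReadWrite))
            {
                document.Save(output);
            }

            // Close the document and release its resources.
            document.Close(true);
        }
    }
}
```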
@SyncfusionInc Makes sense and that's fair. Just note this video is titled .NET and not .NET CORE or cross-compatible. Thanks for the explanation!
@JustinEmlay And console apps don't only use Core. I thought this was odd too. I don't see myself purposely making my app underperform for the sake of maybe being cross-platform, which I know for a fact will never happen with a small OCR tool I'm whipping up real quick.
Hi,
Thanks for pointing that out! We appreciate your feedback, and we’ll update the title accordingly to ensure it’s more accurate and aligned with the content of the video.
@isohaven758 Upon further analysis, both the .NET Core and .NET Framework OCR packages can be used in console applications to perform OCR. To perform OCR across platforms such as Windows, Linux, and macOS, you can use the below-mentioned package
To perform OCR on the Windows platform only, you can use the following package.
We have attached samples for both .NET Core and .NET Framework below for your reference:
.NET Core: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETCore8.0-683205044
.NET Framework: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETFramework4.81666649514