Effortlessly Extract Text from Scanned PDFs Using .NET Core OCR Library

Syncfusion, Inc

3.6K views

Published: Dec 2, 2024

Comments • 16

@zafioondemand3839 1 month ago
Is it possible to do OCR with different languages by injecting traineddata files?
@SyncfusionInc 1 month ago
Hi,
You can use the TessDataPath property of the OCR Processor class to specify the folder that contains the trained data for other languages. Trained tessdata files for additional languages are available in the GitHub repository linked below.
github.com/tesseract-ocr/tessdata
Please see our UG documentation for reference:
help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#tesseractbinaries-paths-and-tesseract-language-data
support.syncfusion.com/kb/article/4219/how-to-support-german-and-other-languages-in-the-ocr-processor
help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#ocr-with-multiple-languages
We have also attached a sample and its output document for your reference.
Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/OCR-with-multiple-langauages357136719
Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output221679158
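For readers who want a starting point, here is a minimal sketch of what pointing the processor at a custom tessdata folder might look like. It is not taken from the attached sample. The file names, the "Tessdata/" folder, and the "eng+deu" language string (Tesseract's "+"-joined convention) are assumptions for illustration, and the exact type of Settings.Language can differ between package versions, so please check the linked documentation for your version.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

class MultiLanguageOcrSketch
{
    static void Main()
    {
        // "Input.pdf", "Output.pdf", and "Tessdata/" are placeholder names (assumptions).
        using (OCRProcessor processor = new OCRProcessor())
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        {
            // Load the scanned PDF from a stream.
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Folder holding the downloaded *.traineddata files
            // (for example eng.traineddata and deu.traineddata).
            processor.TessDataPath = "Tessdata/";

            // Assumption: languages are combined with "+", as in Tesseract itself.
            processor.Settings.Language = "eng+deu";

            // Run OCR; recognized text is added to the document as a searchable layer.
            processor.PerformOCR(document);

            // Save the result to a new file and release the document.
            using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.Write))
            {
                document.Save(output);
            }
            document.Close(true);
        }
    }
}
```

Each language code in the string needs a matching traineddata file in the folder; otherwise recognition for that language is likely to fail.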
@zafioondemand3839 1 month ago
Is it also possible to extract text for a PDF with mixed content? Simple Text AND scanned images?
@SyncfusionInc 1 month ago
Hi,
Optical character recognition (OCR) is a technology that converts scanned paper documents, in the form of PDF files or images, into searchable and editable data. The Syncfusion OCR processor library supports OCR on scanned PDF documents and images with the help of Google's Tesseract Optical Character Recognition engine. Internally, we extract the images from the PDF document page by page and send them to the OCR processor, which recognizes the text in those images. As a result, OCR does not process content that is already searchable and editable.
We have attached our documentation, a sample, and its output document for your reference.
Documentation: help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/working-with-ocr
Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/.NET-114870683
Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output-1171496377
@mohdnasir7023 6 months ago
Hi, thanks. Is it possible to overlay text on top of the characters in a PDF that contains images as well?
@SyncfusionInc 6 months ago
Hi,
Yes, we support overlaying text on PDF documents that contain images. Please refer to the code snippet below for more information.
using Syncfusion.Pdf;
using Syncfusion.Pdf.Graphics;
// Create a PDF document.
PdfDocument doc = new PdfDocument();
// Add a page to the document.
PdfPage page = doc.Pages.Add();
// Get the PDF graphics for the page.
PdfGraphics graphics = page.Graphics;
// Create a PDF font.
PdfFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 12, PdfFontStyle.Regular);
// Set the transparency (50% for both stroke and fill) with the Overlay blend mode.
graphics.SetTransparency(0.5f, 0.5f, PdfBlendMode.Overlay);
// Draw the string with a black outline and red fill at the top-left corner.
graphics.DrawString("Hello world!", font, PdfPens.Black, PdfBrushes.Red, 0, 0);
// Save the document.
doc.Save("Output.pdf");
// Close the document and release its resources.
doc.Close(true);
Please find the documentation below.
help.syncfusion.com/cr/file-formats/Syncfusion.Pdf.Graphics.PdfBlendMode.html
@hoho-san1629 6 months ago
Thanks. How do I specify the region to be converted?
@SyncfusionInc 6 months ago
Hi,
Yes, you can specify the region on which to perform OCR. Please find the documentation and GitHub sample below.
Documentation: help.syncfusion.com/file-formats/pdf/working-with-ocr/features?#performing-ocr-for-a-region-of-the-document
GitHub Sample: github.com/SyncfusionExamples/PDF-Examples/tree/master/OCR/.NET/Perform-OCR-on-particular-region-of-PDF-document?
@cissemy 1 year ago
Thanks
Is it possible to extract the text from a PDF form with different field values?
@SyncfusionInc 1 year ago
Hi,
We support extracting and modifying the form fields of a PDF document.
Please find the UG link below.
pulse.ly/bsqgk1yrj6
If your forms are in image format, you can use Azure Form Recognizer to extract the information from them. Please describe your actual requirements in detail so that we can analyze them and assist you further.
@JustinEmlay 1 month ago
Why do you use FileStream when all you need is to pass the full file path to PdfLoadedDocument?
@SyncfusionInc 1 month ago
Hi,
In .NET Core, the platform design favors portability and modularity, which means that certain file handling behaviors differ from those in the Windows-only .NET Framework. Specifically, .NET Core requires using a `FileStream` to load files with `PdfLoadedDocument` for the following reasons:
1. Cross-Platform Compatibility: Unlike the Windows-based .NET Framework, .NET Core is designed to be cross-platform, running on Windows, macOS, and Linux. Since not all platforms handle file paths and file access in the same way, `FileStream` is used to create a more universal approach. This makes your code work consistently across different environments by providing an explicit way to handle file access.
2. Direct Path Access Restrictions: In .NET Framework on Windows, you can pass the file path directly because the underlying libraries support it on this specific OS. However, .NET Core enforces stricter access patterns and doesn’t allow the direct passing of a file path to the `PdfLoadedDocument` constructor. Instead, it requires a `FileStream` to ensure more reliable and controlled file handling, especially given the variety of file systems across platforms.
Using `FileStream` is the .NET Core-compliant way to ensure that your file handling is efficient, reliable, and compatible across different operating systems. This approach also provides additional control over file permissions and cleanup, which can help improve application stability.
Please see the link below for more information:
help.syncfusion.com/document-processing/pdf/pdf-library/net/open-and-save-pdf-file-in-c-sharp-vb-net?cs-save-lang=1&cs-lang=csharp
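As a rough illustration of the stream-based pattern described in this reply, the sketch below opens a PDF through a FileStream, hands the stream to PdfLoadedDocument, and saves the result through a second stream. The file names are placeholders (assumptions), and this is only one way to arrange the calls, not the code from the video.

```csharp
using System.IO;
using Syncfusion.Pdf.Parsing;

class LoadPdfFromStreamSketch
{
    static void Main()
    {
        // "Input.pdf" and "Output.pdf" are placeholder file names (assumptions).
        using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
        {
            // Load the document from the stream instead of passing a file path.
            PdfLoadedDocument document = new PdfLoadedDocument(input);

            // Write the document back out through a second stream.
            using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.Write))
            {
                document.Save(output);
            }

            // Close the document and release its resources.
            document.Close(true);
        }
    }
}
```

The using blocks close both streams even if loading or saving throws, which is the "control over cleanup" the reply refers to.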
@JustinEmlay 1 month ago
@SyncfusionInc Makes sense and that's fair. Just note this video is titled .NET and not .NET Core or cross-compatible. Thanks for the explanation!
@isohaven758 1 month ago
@JustinEmlay And console apps don't only use Core. I thought this was odd too. I don't see myself purposely making my app underperform for the sake of maybe being cross-platform, which I know for a fact will never happen with a small OCR tool I'm whipping up real quick.
@SyncfusionInc 28 days ago
Hi,
Thanks for pointing that out! We appreciate your feedback, and we’ll update the title accordingly to ensure it’s more accurate and aligned with the content of the video.
@SyncfusionInc 28 days ago
@isohaven758 On further analysis, both the .NET Core and .NET Framework OCR packages can be used in console applications to perform OCR. To perform OCR across platforms such as Windows, Linux, and macOS, you can use the below-mentioned package.

To perform OCR on the Windows platform only, you can use the following package.
We have attached samples for both .NET Core and .NET Framework below for your reference:
.NET Core: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETCore8.0-683205044
.NET Framework: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETFramework4.81666649514
