Effortlessly Extract Text from Scanned PDFs Using .NET Core OCR Library

  • Published: Dec 2, 2024

Comments • 16

  • @zafioondemand3839 1 month ago

    Is it possible to do OCR with different languages by injecting traineddata files?

    • @SyncfusionInc 1 month ago

      Hi,
      You can use the TessDataPath property of the OCRProcessor class to specify the folder that contains the trained data for other languages. The trained tessdata files for additional languages are available in the GitHub repository linked below.
      github.com/tesseract-ocr/tessdata
      Please refer to our UG documentation and knowledge base article:
      help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#tesseractbinaries-paths-and-tesseract-language-data
      support.syncfusion.com/kb/article/4219/how-to-support-german-and-other-languages-in-the-ocr-processor
      help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/features#ocr-with-multiple-languages
      We have also attached a sample and its output document for your reference.
      Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/OCR-with-multiple-langauages357136719
      Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output221679158
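
      A minimal sketch of the multi-language setup described in this reply, not the attached sample. It assumes the Syncfusion OCR NuGet package for .NET Core is referenced, that a local "tessdata/" folder holds the downloaded eng.traineddata and deu.traineddata files, and that Input.pdf and Output.pdf are hypothetical file names. The "eng+deu" string follows Tesseract's convention for combining languages; verify it against the linked documentation for your package version.

```csharp
using System.IO;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

// Assumption: "tessdata/" contains eng.traineddata and deu.traineddata.
using (OCRProcessor processor = new OCRProcessor())
using (FileStream input = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    // Load the scanned PDF from a stream.
    PdfLoadedDocument document = new PdfLoadedDocument(input);

    // Point the processor at the folder with the language data files.
    processor.TessDataPath = "tessdata/";

    // Recognize English and German text in the same pass.
    processor.Settings.Language = "eng+deu";

    // Run OCR; recognized text is added to the loaded document.
    processor.PerformOCR(document);

    // Write the searchable PDF to disk.
    using (FileStream output = new FileStream("Output.pdf", FileMode.Create, FileAccess.Write))
    {
        document.Save(output);
    }
    document.Close(true);
}
```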

  • @zafioondemand3839 1 month ago

    Is it also possible to extract text for a PDF with mixed content? Simple Text AND scanned images?

    • @SyncfusionInc 1 month ago

      Hi,
      Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, into searchable and editable data. The Syncfusion OCR processor library supports OCR on scanned PDF documents and images with the help of Google’s Tesseract Optical Character Recognition engine. Internally, we extract the images from the PDF document page by page and send them to the OCR processor, which recognizes the text in those images. As a result, OCR applies only to the image content; text that is already searchable and editable is not recognized again.
      We have attached our documentation, a sample, and its output document below for your reference.
      Documentation: help.syncfusion.com/document-processing/pdf/pdf-library/net/working-with-ocr/working-with-ocr
      Sample: www.syncfusion.com/downloads/support/directtrac/general/ze/.NET-114870683
      Output: www.syncfusion.com/downloads/support/directtrac/general/pd/Output-1171496377

  • @mohdnasir7023 6 months ago

    Hi, thanks. Is it possible to do an overlay on top of the characters in a PDF that contains images as well?

    • @SyncfusionInc 6 months ago

      Hi,
      Yes, we support overlaying text on PDF documents that contain images. Please refer to the code snippet below for more information.
      using Syncfusion.Pdf;
      using Syncfusion.Pdf.Graphics;
      // Create a PDF document.
      PdfDocument doc = new PdfDocument();
      // Add a page to the document.
      PdfPage page = doc.Pages.Add();
      // Get the PDF graphics for the page.
      PdfGraphics graphics = page.Graphics;
      // Create a PDF font.
      PdfFont font = new PdfStandardFont(PdfFontFamily.Helvetica, 12, PdfFontStyle.Regular);
      // Set 50% transparency for pen and brush, using the Overlay blend mode.
      graphics.SetTransparency(0.5f, 0.5f, PdfBlendMode.Overlay);
      // Draw the string at the top-left corner of the page.
      graphics.DrawString("Hello world!", font, PdfPens.Black, PdfBrushes.Red, 0, 0);
      // Save the document.
      doc.Save("Output.pdf");
      // Close the document and release its resources.
      doc.Close(true);
      Please find the PdfBlendMode documentation below.
      help.syncfusion.com/cr/file-formats/Syncfusion.Pdf.Graphics.PdfBlendMode.html

  • @hoho-san1629 6 months ago

    Thanks. How do I specify the region to be converted?

    • @SyncfusionInc 6 months ago

      Hi,
      Yes, you can specify the region on which to perform OCR. Please find the documentation and GitHub sample below.
      Documentation: help.syncfusion.com/file-formats/pdf/working-with-ocr/features?#performing-ocr-for-a-region-of-the-document
      GitHub Sample: github.com/SyncfusionExamples/PDF-Examples/tree/master/OCR/.NET/Perform-OCR-on-particular-region-of-PDF-document?
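
      A rough sketch of region-based OCR as described in this reply, not a copy of the linked sample. It assumes the Syncfusion OCR NuGet package for .NET Core, hypothetical file names, and an arbitrary rectangle on the first page; the PageRegion type and the Settings.Regions property are written from memory of the linked documentation and should be checked against it.

```csharp
using System.Collections.Generic;
using System.IO;
using Syncfusion.Drawing;
using Syncfusion.OCRProcessor;
using Syncfusion.Pdf.Parsing;

using (OCRProcessor processor = new OCRProcessor())
using (FileStream input = File.OpenRead("Input.pdf"))
{
    PdfLoadedDocument document = new PdfLoadedDocument(input);
    processor.Settings.Language = Languages.English;

    // Assumption: restrict OCR to one rectangle on the first page (index 0).
    // The coordinates are illustrative only.
    PageRegion region = new PageRegion();
    region.PageIndex = 0;
    region.PageRegions = new RectangleF[] { new RectangleF(0, 100, 950, 150) };
    processor.Settings.Regions = new List<PageRegion> { region };

    // Only the configured region is processed.
    processor.PerformOCR(document);

    using (FileStream output = File.Create("Output.pdf"))
    {
        document.Save(output);
    }
    document.Close(true);
}
```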

  • @cissemy 1 year ago

    Thanks.
    Is it possible to extract the text from a PDF form with different field values?

    • @SyncfusionInc 1 year ago

      Hi,
      We support extracting and modifying the form fields of a PDF document.
      Please find the UG link below.
      pulse.ly/bsqgk1yrj6
      If your forms are in image format, you can use Azure Form Recognizer to extract the information from them. Please describe your actual requirements in detail so that we can analyze them and assist you further.

  • @JustinEmlay 1 month ago

    Why do you use FileStream when all you need is to pass the full file path to PdfLoadedDocument?

    • @SyncfusionInc 1 month ago

      Hi,
      In .NET Core, the platform design favors portability and modularity, which means that certain file handling behaviors differ from those in the Windows-only .NET Framework. Specifically, .NET Core requires using a `FileStream` to load files with `PdfLoadedDocument` for the following reasons:
      1. Cross-Platform Compatibility: Unlike the Windows-based .NET Framework, .NET Core is designed to be cross-platform, running on Windows, macOS, and Linux. Since not all platforms handle file paths and file access in the same way, `FileStream` is used to create a more universal approach. This makes your code work consistently across different environments by providing an explicit way to handle file access.
      2. Direct Path Access Restrictions: In .NET Framework on Windows, you can pass the file path directly because the underlying libraries support it on this specific OS. However, .NET Core enforces stricter access patterns and doesn’t allow the direct passing of a file path to the `PdfLoadedDocument` constructor. Instead, it requires a `FileStream` to ensure more reliable and controlled file handling, especially given the variety of file systems across platforms.
      Using `FileStream` is the .NET Core-compliant way to ensure that your file handling is efficient, reliable, and compatible across different operating systems. This approach also provides additional control over file permissions and cleanup, which can help improve application stability.
      Please see the link below for more information:
      help.syncfusion.com/document-processing/pdf/pdf-library/net/open-and-save-pdf-file-in-c-sharp-vb-net?cs-save-lang=1&cs-lang=csharp
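
      For illustration, the stream-based loading pattern discussed in this reply can be as short as the sketch below. It assumes the Syncfusion PDF package for .NET Core and a hypothetical Input.pdf; File.OpenRead is a standard .NET helper that returns a read-only FileStream for a path.

```csharp
using System;
using System.IO;
using Syncfusion.Pdf.Parsing;

// Hypothetical input file name.
using (FileStream stream = File.OpenRead("Input.pdf"))
{
    // Load the document from the stream rather than from a path.
    PdfLoadedDocument document = new PdfLoadedDocument(stream);
    Console.WriteLine($"Pages: {document.Pages.Count}");

    // Release the document's resources.
    document.Close(true);
}
```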

    • @JustinEmlay 1 month ago

      @SyncfusionInc Makes sense, and that's fair. Just note that this video is titled .NET, not .NET Core or cross-platform. Thanks for the explanation!

    • @isohaven758 1 month ago

      @JustinEmlay And console apps don't only use Core. I thought this was odd too. I don't see myself purposely making my app underperform for the sake of maybe being cross-platform, which I know for a fact will never happen with a small OCR tool I'm whipping up real quick.

    • @SyncfusionInc 28 days ago

      Hi,
      Thanks for pointing that out! We appreciate your feedback, and we’ll update the title accordingly to ensure it’s more accurate and aligned with the content of the video.

    • @SyncfusionInc 28 days ago

      @isohaven758 Upon further analysis, both the .NET Core and .NET Framework OCR packages can be used in console applications to perform OCR. To perform OCR across platforms such as Windows, Linux, and Mac, you can use the following package:

      To perform OCR on the Windows platform only, you can use the following package:
      We have attached samples for both .NET Core and .NET Framework below for your reference:
      .NET Core: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETCore8.0-683205044
      .NET Framework: www.syncfusion.com/downloads/support/directtrac/general/ze/Perform-OCR-.NETFramework4.81666649514