Extract Text from any PDF File in Python 3.10 Tutorial

  • Published: Dec 17, 2024

Comments • 40

  • @tobiwie
    @tobiwie 1 year ago +18

    In some of the latest updates to PyPDF2 the class "PdfFileReader" got replaced with "PdfReader". Code still works fine with "PdfReader". :)

  • @frapsg2
    @frapsg2 9 months ago +1

    Awesome, so helpful! That's much simpler and more ready-to-use than all the other approaches found online. Is there a way to export the extracted text to a CSV or XLSX file?
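One possible way to handle the CSV part of this question, as a minimal sketch: it assumes you already have a list of page strings (such as the one an extraction function returns) and writes one row per page with the standard-library `csv` module. The function name `save_pages_to_csv`, the column names, and the output file name are made up for illustration.

```python
import csv


def save_pages_to_csv(pages: list[str], csv_path: str) -> None:
    """Write one row per page: the page number and that page's text."""
    # newline='' lets the csv module control line endings itself
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['page', 'text'])  # hypothetical column names
        for number, text in enumerate(pages, start=1):
            writer.writerow([number, text])


# Stand-in data; in practice this would be the list of extracted page texts
sample_pages = ['First page text', 'Second page,\nwith a comma and a newline']
save_pages_to_csv(sample_pages, 'extracted_text.csv')
```

The `csv` module quotes commas and line breaks inside the text, so each page stays in a single cell. Writing XLSX would need a third-party library, which is not shown here.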

  • @vitaliibaglaiev4147
    @vitaliibaglaiev4147 8 months ago

    Just an amazing explanation, short and sweet!

  • @akashnath7999
    @akashnath7999 2 years ago +3

    It's so helpful...loved it ❤

    • @Indently
      @Indently 2 years ago +2

      Glad it helped! :)

  • @davet4335
    @davet4335 1 year ago +9

    The code did not work for me on a Windows 11 PC. I kept having ChatGPT analyze the code and the error messages, and after many tries it fixed it:

    import PyPDF2

    def extract_text_from_pdf(pdf_file: str) -> list[str]:
        # Open the PDF file of your choice in binary mode
        with open(pdf_file, 'rb') as pdf:
            reader = PyPDF2.PdfReader(pdf)
            pdf_text = []
            # Collect the text of every page
            for page in reader.pages:
                content = page.extract_text()
                pdf_text.append(content)
        return pdf_text

    def main():
        extracted_text = extract_text_from_pdf('sample.pdf')
        for text in extracted_text:
            print(text)

    if __name__ == '__main__':
        main()

  • @Mike_elGreco
    @Mike_elGreco 9 months ago

    It worked! Thank you!!

  • @Miyazaki97
    @Miyazaki97 2 years ago

    Thank you for the awesome tutorial. I have a question about extracting articles, and I hope you can help me. When extracting articles and reports, there are many references, table legends, and titles that are not required. Would it be possible to remove all those references and table contents, including legends and titles, when extracting from the PDF file?

  • @vishnumuralidhar5659
    @vishnumuralidhar5659 1 year ago

    Thanks for the awesome tutorial. Please do a video on two-sided PDFs; there isn't one on YouTube 🙃

  • @MedoHamdani
    @MedoHamdani 7 months ago

    Will it work with Arabic, and will it be able to extract text from handwritten manuscripts?

  • @albeeshi
    @albeeshi 1 year ago +3

    How can I extract data from more than one PDF file and put it in a table?
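A rough sketch of one way to do this: loop over every PDF in a folder and build one table row per page. The helper name `build_table` and the row keys are hypothetical, and the `extract` argument is assumed to be any function that takes a file path and returns a list of page strings.

```python
from pathlib import Path
from typing import Callable


def build_table(folder: str, extract: Callable[[str], list[str]]) -> list[dict]:
    """Return one row (a dict) per page of every PDF found in `folder`."""
    rows = []
    # sorted() keeps the output order stable between runs
    for pdf_path in sorted(Path(folder).glob('*.pdf')):
        for number, text in enumerate(extract(str(pdf_path)), start=1):
            rows.append({'file': pdf_path.name, 'page': number, 'text': text})
    return rows
```

The resulting list of dicts can be written out with `csv.DictWriter` or loaded into a spreadsheet tool of your choice.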

  • @rishikeshchava6895
    @rishikeshchava6895 8 months ago

    Hey, I have some 600 files with a large volume of data, and text extraction using PyPDF2 is taking a lot of time. Is there any other way to do this?

  • @kevinmakumbe
    @kevinmakumbe 10 months ago

    Nice tutorial. How can I get the coordinates of the text in my PDF file?
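The video's approach returns plain text only. One option, sketched here with pdfplumber (a different library from the one in the video), is `page.extract_words()`, which returns one dict per word with its bounding box. The helper names `word_positions` and `print_positions` are made up for this example.

```python
def word_positions(page) -> list[tuple[str, float, float]]:
    """Return (text, x0, top) for each word on a pdfplumber page."""
    # extract_words() returns one dict per word, including its bounding box
    return [(w['text'], w['x0'], w['top']) for w in page.extract_words()]


def print_positions(pdf_path: str) -> None:
    # Imported here so word_positions() can be exercised without pdfplumber
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for text, x0, top in word_positions(page):
                print(f'{text!r} at x={x0:.1f}, top={top:.1f}')
```

In pdfplumber, `x0` is the distance from the left edge of the page and `top` is the distance from the top edge.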

  • @mehdismaeili3743
    @mehdismaeili3743 2 years ago

    great as always.

  • @zainsaqib3702
    @zainsaqib3702 1 year ago

    I keep getting SyntaxError: unmatched ')' on line 4. I'm running Python 3.9; could that be the cause?

  • @boukefmohamed3191
    @boukefmohamed3191 8 months ago

    Excellent

  • @オタヴィオルイス
    @オタヴィオルイス 1 year ago

    Helped me a lot. Thanks!

  • @atharkhalid3275
    @atharkhalid3275 1 year ago

    What if we want to extract text from one particular page?
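A small sketch of how a single page could be read, assuming a PyPDF2 `PdfReader`, whose `pages` sequence is indexed from 0. The helper names `extract_page_text` and `read_page` are hypothetical.

```python
def extract_page_text(reader, page_number: int) -> str:
    """Return the text of one page; page_number starts at 1."""
    if not 1 <= page_number <= len(reader.pages):
        raise ValueError(f'page_number must be between 1 and {len(reader.pages)}')
    # reader.pages is indexed from 0, so page 1 is pages[0]
    return reader.pages[page_number - 1].extract_text()


def read_page(pdf_file: str, page_number: int) -> str:
    # Imported here so the helper above has no hard dependency on PyPDF2
    from PyPDF2 import PdfReader
    with open(pdf_file, 'rb') as pdf:
        return extract_page_text(PdfReader(pdf), page_number)
```

For example, `read_page('sample.pdf', 3)` would return the text of the third page, provided the file has at least three pages.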

  • @jvwee
    @jvwee 1 year ago

    I am pretty sure there are over a thousand instances of the word "coffee" in the PDF. However, this seems to have counted only the number of pages on which the word appeared.
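A count that matches the number of pages can happen when the code tests whether the word is present on each page once, instead of counting every match. A minimal sketch of counting all occurrences, assuming `pages` is a list of extracted page strings; the function name `count_word` is made up here.

```python
import re


def count_word(pages: list[str], word: str) -> int:
    """Count every occurrence of `word` across all pages, ignoring case."""
    # \b matches word boundaries, so 'coffee' does not match 'coffeehouse'
    pattern = re.compile(rf'\b{re.escape(word)}\b', re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in pages)


pages = ['Coffee first. More coffee later.', 'No tea, only coffee.']
print(count_word(pages, 'coffee'))  # 3
```

Dropping the `\b` markers would also count the word inside longer words, which may or may not be what you want.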

  • @gulfamhussain9674
    @gulfamhussain9674 4 months ago

    Do you have any solution for PDFs with characters? When I try to apply this solution to those PDFs, it prints gibberish characters.

  • @rs-nm7hp
    @rs-nm7hp 2 years ago +1

    U r awesome 👏

  • @Sathishedutech
    @Sathishedutech 1 year ago

    Hi sir, does it work with local languages like Telugu?

  • @mohammedasimsameer1220
    @mohammedasimsameer1220 11 months ago

    Thank you bro

  • @valmirrastelyjunior9400
    @valmirrastelyjunior9400 11 months ago

    Great

  • @gvenagas
    @gvenagas 6 months ago

    I found that by opening a PDF file in Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the browser has converted it to HTML, and maybe save it for further processing with some programming language.

  • @gianlucagiannetto5146
    @gianlucagiannetto5146 5 months ago

    I wrote the code line by line, word for word, but it continues to give me File not found. How is that possible?
    P.S. I managed to extract the text; the only problem is the layout of the output: I get one string that is miles long.

    • @enkvadrat_
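      @enkvadrat_ 5 months ago +1

      import pdfplumber

      def convert_pdf_to_text(pdf_path):
          # Collect the text of every page, not just the last one
          pages = []
          with pdfplumber.open(pdf_path) as pdf:
              for page in pdf.pages:
                  # layout=True tries to preserve the visual layout of the page
                  pages.append(page.extract_text(layout=True) or '')
          return '\n'.join(pages)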

  • @louis19449
    @louis19449 11 months ago

    How do you add the PDF file to the project?
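A bare file name such as 'sample.pdf' is looked up in the current working directory, which is not always the folder that contains the script; that mismatch is a common cause of File not found errors. One possible sketch that resolves the name against an explicit folder and reports where it looked; the helper name `resolve_pdf` is hypothetical.

```python
from pathlib import Path


def resolve_pdf(name: str, base_dir: Path) -> Path:
    """Return the absolute path to `name` inside `base_dir`, or raise with a clear message."""
    path = (base_dir / name).resolve()
    if not path.is_file():
        # Include the working directory to make the mismatch easy to spot
        raise FileNotFoundError(
            f'{path} does not exist; working directory is {Path.cwd()}'
        )
    return path
```

If the PDF is placed next to the script, calling `resolve_pdf('sample.pdf', Path(__file__).parent)` from that script should find it regardless of where the script is launched from.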

  • @MedoHamdani
    @MedoHamdani 7 months ago

    So this is not OCR.

  • @as8243
    @as8243 2 months ago

    This only extracted text from the first page of my PDF. Does anyone else have this issue?
    Thanks for the video!

  • @raniarasmy6489
    @raniarasmy6489 2 years ago

    Please, the resolution of your screen is not clear.

    • @Indently
      @Indently 2 years ago

      Just change the resolution on YouTube from 144p to 720p

  • @Baka_Oppai
    @Baka_Oppai 1 year ago

    No idea how this is set up, kind of pointless. Where is pypdf, do I get it from inside my bum bum? And what is this program?

    • @enkvadrat_
      @enkvadrat_ 5 months ago

      pip install pypdf