Extracting data from PDF files using Python

  • Published on 5 Jul 2024
  • 【Online Courses】
    ⚡Getting Started with Stata: (24 lectures + 4 assignments = 5.5 hours content): available on Udemy: www.udemy.com/course/getting-...
    ⚡Applied Time Series using Stata (29 lectures + 4 assignments = 6.5 hours content): available on Udemy: www.udemy.com/course/applied-...
    This is a detailed step-by-step guide that develops Python code to extract information from PDF files. This is very useful if you have to handle a large number of files. The code returns the total number of occurrences of a search term in the document and identifies the pages on which the term appears. All material, including the code, is on GitHub: github.com/GerhardKling/DataW...
    I introduce the PyPDF2 package, which we need to install.
    Installation on Anaconda:
    conda install -c conda-forge pypdf2
    Installation using the pip installer:
    pip install PyPDF2
    I show you how to create and activate a virtual environment (which is optional - but useful to do). Then we develop the code step-by-step. This will enable you to learn how to modify the code to suit your specific requirements. Please leave a comment if you have any questions.
    Finally, we will refactor the code. We define a function that takes a search term and filename and returns a tuple containing the total number of occurrences and the number of pages that contain the search term at least once.
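
    A minimal sketch of what such a refactored function might look like. It is not the code from the video: the names `count_term` and `search_pdf` are hypothetical, matching is assumed to be case-insensitive, and the PyPDF2 3.x API (`PdfReader`, `extract_text`) is assumed rather than the 1.x API used in the recording.

```python
def count_term(pages_text, term):
    """Return (total occurrences, number of pages containing the term).

    pages_text is a list with one string of extracted text per page.
    Matching is case-insensitive, which is an assumption of this sketch.
    """
    needle = term.lower()
    total = 0
    pages_with_term = 0
    for text in pages_text:
        hits = text.lower().count(needle)
        total += hits
        if hits > 0:
            pages_with_term += 1
    return total, pages_with_term


def search_pdf(term, filename):
    """Read a PDF with PyPDF2 3.x and count the term on every page."""
    # Imported lazily so that count_term can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(filename)
    # extract_text() may return an empty string for scanned or unusual pages.
    pages_text = [page.extract_text() or "" for page in reader.pages]
    return count_term(pages_text, term)
```

    Keeping the counting logic separate from the file reading makes it possible to test the search on plain strings before pointing it at a real PDF.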
    Chapters
    0:00 Welcome
    0:15 Return all occurrences & page numbers
    0:44 Example PDF
    2:23 Python setup
    3:55 Virtual environment
    6:16 Coding fun
    28:05 Refactoring
    The channel
    YUNIKARN focuses on publishing educational content in applied statistics, mathematics, and data science. In these fields, programming skills have become essential. Hence, we cover various programming languages including Python, Stata, and C++ to tackle problems and for fun.
    Stay in touch
    Please leave comments or follow us on Twitter (/gerhardklings). DMs are open.
    Hashtags
    #datascience #python #PDF

Comments • 72

  • @agustincsn
    @agustincsn 1 year ago

    Fantastic tutorial, thanks. I wonder how we could search for multiple search terms and, at the end, make a table (CSV) out of the results? Thanks

    • @YUNIKARN
      @YUNIKARN 1 year ago

      I am glad that you enjoyed the tutorial. I have another video, which shows how to find the most common words in PDF files (see link). Yes, you can modify my code to look for several words and store the results in lists or other data structures. These lists can be exported to CSV or Excel files (or many other formats). We can guide you if you need support. You can find our email address on the channel pages. May the Power be with you! th-cam.com/video/3s0-TGLbB4M/w-d-xo.html
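
      One way the idea in this reply could be sketched, using only the standard library. The function names and the CSV column names are hypothetical, and the input is assumed to be a list with one string of already extracted text per page.

```python
import csv
import io


def count_terms(pages_text, terms):
    """Return one result row per search term (case-insensitive matching)."""
    rows = []
    for term in terms:
        needle = term.lower()
        per_page = [text.lower().count(needle) for text in pages_text]
        # Page numbers are 1-based, as a reader of the PDF would expect.
        pages = [i + 1 for i, hits in enumerate(per_page) if hits > 0]
        rows.append({
            "term": term,
            "occurrences": sum(per_page),
            "pages": " ".join(str(p) for p in pages),
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text; write the string to a file if needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["term", "occurrences", "pages"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

      The returned string can be written to disk with an ordinary `open(..., "w", newline="")` call, or the rows can be handed to another tool such as pandas.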

  • @seungholee8552
    @seungholee8552 2 years ago +1

    Very useful video, thank you!

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Thanks, Seungho. Python is the way!

  • @seanredmond9212
    @seanredmond9212 1 year ago

    This is a helpful video. Thank you :)

    • @YUNIKARN
      @YUNIKARN 1 year ago

      Glad it was helpful!

  • @mirof1169
    @mirof1169 2 years ago +1

    Hi there, thanks for the great video. Is there any way we can pick up the words/terms that occur the most? Instead of searching for a word, could we ask Python to show us, say, the top 10 or 20 words that repeat the most?

    • @YUNIKARN
      @YUNIKARN 2 years ago +1

      I am glad that you find this video useful. Finding the most common words is a nice problem. One has to remove stopwords (e.g., and, the, in) to get meaningful results. I am working on a video to address your question properly. I plan to upload this video on 4th July 2022 at 10am GMT. You can get updates on my channel and my Facebook page (link on the channel). Python is the way!
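
      A rough sketch of the approach described in this reply, not the code from the follow-up video. The stopword list below is a tiny illustrative sample, and the simple regular expression used to split words is an assumption that will not suit every language.

```python
import re
from collections import Counter

# Tiny illustrative list; a real analysis would use a fuller stopword list.
STOPWORDS = {"and", "the", "in", "of", "a", "to", "is"}


def most_common_words(text, n=10, stopwords=STOPWORDS):
    """Return the n most frequent words in text, ignoring stopwords."""
    # Lower-case everything and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)
```

      The text passed in would typically be the extracted text of all pages joined together.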

    • @mirof1169
      @mirof1169 2 years ago

      @YUNIKARN Thank you, sir. I appreciate your work.

  • @alvin3428
    @alvin3428 2 years ago

    Hey! Thank you so much for such a wonderful video. I have a question: what if we have different purchase orders in different formats? How can we get specific information out of them using Python? I am doing a college year project and am unable to proceed.

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Hi Alvin, thanks for your comment! I need more details to answer your question: (1) what do you mean by different formats (file type or text/tables)? (2) I need a minimum working example to understand the structure of the files. Email or DM on Twitter/Facebook might be easier. Python is the Way!

    • @alvin3428
      @alvin3428 2 years ago

      @YUNIKARN Hey! Thank you for responding. So, the purchase orders are of type: PDF.
      Different formats: the incoming purchase orders use different templates, which makes it difficult to extract certain data each time and load it into Excel. I am looking for something that could extract the PO number, quantity, price, etc. from these PDF files (the data could be located anywhere, since we have varying templates and not a standard one).
      Please help, I really want to pull off this project and make something useful.

    • @YUNIKARN
      @YUNIKARN 2 years ago

      @alvin3428 Hi Alvin, can you email a sample PDF file? Details are on the channel page. If the data is unstructured (e.g., not in a table), it might be hard to do. Best wishes, Gerhard

  • @ktmt100
    @ktmt100 2 years ago +1

    Fantastic! My boss is a youtuber.

    • @YUNIKARN
      @YUNIKARN 2 years ago

      YouTube is very hard work! I am trying to get better :-). We are always looking for presenters on the channel :-)

  • @michaelobrist4716
    @michaelobrist4716 2 years ago +1

    Hi y'all! Thank you very much for this video. I've tried for hours to write a script that does exactly what you explain here. I had almost given up, but then my YouTube algorithm brought me here, to the most comprehensive pypdf search string tutorial I've seen so far. However, I keep running into this freaking "TypeError: a bytes-like object is required, not 'dict'", which seems to be a thing with pypdf2 and python3. I've already researched this topic for quite a while and just couldn't solve it. Since this video is relatively new, maybe there's hope that you or somebody else in here knows what to do? Thanks anyway, great tutorials!

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Hi Michael, thanks for your comment. This is much appreciated! The issue you encounter can occur for many reasons. (1) I suggest working with virtual environments to ensure version control. (2) Different characters (e.g., Chinese) need to be replaced with their HTML counterparts. I had such issues as I tend to work on China. (3) If nothing works, you might need to move to pdfminer. I hope that helps. I am working on another video focused on PDF files. Best wishes, Gerhard

    • @YUNIKARN
      @YUNIKARN 1 year ago +1

      I did a new video on PyPDF2 changes and how to address them using virtual environments. Thanks for your comment! th-cam.com/video/35p2-74bXNQ/w-d-xo.html

    • @michaelobrist4716
      @michaelobrist4716 1 year ago

      @YUNIKARN Thanks for the video! Highly appreciated!

  • @yck3810
    @yck3810 2 years ago

    Hi, may I know which Python version you are using in this video? I am using version 3.8; however, I am not sure why, but the extractText() function seems to be obsolete.

    • @YUNIKARN
      @YUNIKARN 2 years ago +1

      Thanks for your comment! Based on the documentation (pypi.org/project/PyPDF2/) the latest version of PyPDF2 should work fine with Python 3.8 and higher. For this video, I used Python 3.7.9 in my virtual environment and PyPDF2 version 1.27.1. One has to note that the extractText method has its limitations depending on the type of PDF file. I should do another video on it. Best wishes, Gerhard

    • @yck3810
      @yck3810 2 years ago

      @YUNIKARN Hi Gerhard, first of all, thank you for your prompt response. Yes, I should have corrected my statement. The extractText() function is not obsolete. However, it doesn't work well with all types of PDF. In my case, some of the PDF files work well but some don't (I still have no idea how to tell which types of PDF work and which do not). Anyway, thanks again for the documentation link provided. Keep up the good work. 👍

    • @YUNIKARN
      @YUNIKARN 2 years ago +1

      @yck3810 Yes, sadly the extractText() method has limitations. I will do a few more videos on fun with PDFs using Python. Best wishes, Gerhard
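
      One common reason for text extraction returning little or nothing is that a page is a scanned image without a text layer. The helper below is a hedged sketch of a quick check for that case: the function name and the 20-character threshold are arbitrary assumptions, and it does not cover other failure modes such as unusual font encodings.

```python
def pages_without_text(pages_text, min_chars=20):
    """Return 1-based page numbers whose extracted text is (almost) empty.

    pages_text holds one entry per page: the extracted string, or None.
    Pages flagged here are candidates for OCR rather than text extraction.
    """
    return [
        number
        for number, text in enumerate(pages_text, start=1)
        if len((text or "").strip()) < min_chars
    ]
```

      If every page of a file is flagged, the file is probably a scan, and an OCR tool is more likely to help than PyPDF2.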

  • @michaelmraz2707
    @michaelmraz2707 1 year ago

    Then how do you put that result (Director, 31 times) into an output table? I am trying to extract specific data from PDFs; for example, extracting all rent expenses from a financial statement and tabulating the numbers in an output table. Any ideas?

    • @YUNIKARN
      @YUNIKARN 1 year ago

      I have done another video th-cam.com/video/3s0-TGLbB4M/w-d-xo.html on PDF files, where the search result is organised in a list. From Python lists (or other types), it is easy to construct tables (e.g., convert to Pandas DataFrame and export as csv or Excel file/table). However, if your PDF input refers to tables, you will need to modify your approach. The camelot library might be a useful starting point. Please get in touch if you want to discuss this problem in more detail. You can book consultations online www.yunikarn.com or drop us an email (see channel pages). May the Power be with you!

  • @feliciak3483
    @feliciak3483 1 year ago

    Hi, this video is super helpful for understanding the process, thank you! However, when I run the code, I keep getting this exception: "PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead." So I changed PdfFileReader to PdfReader in the code and then it said: "PyPDF2.errors.DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead." I'm a little confused on how to change the code from here or what exactly to change to len(reader.pages) because substituting it into the existing code didn't work. Do you have any suggestions? Did PyPDF2 change?

    • @YUNIKARN
      @YUNIKARN 1 year ago

      PyPDF2 and Python packages in general keep changing. In addition, their dependencies might change. This is the main reason why I use virtual environments (version control). There are two approaches: 1. Install an older version of PyPDF2 using pip (ideally use a virtual environment). 2. Read the documentation and update your code. Visit us on www.yunikarn.com or drop us an email if you need help. May the Power be with you!
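
      For the second approach, the renames quoted in the error messages above translate roughly as follows. This is a sketch against the PyPDF2 3.x API, not the code from the video, and the helper names `page_texts` and `read_pdf` are hypothetical. For the first approach, pinning the release the author mentions elsewhere on this page (PyPDF2 1.27.1) inside a virtual environment should keep the original code working.

```python
def page_texts(reader):
    """Return the extracted text of every page of an open PyPDF2 3.x reader.

    Old API (PyPDF2 1.x/2.x)         New API (PyPDF2 3.x)
    PdfFileReader(f)                 PdfReader(f)
    reader.getNumPages()             len(reader.pages)
    reader.getPage(i)                reader.pages[i]
    page.extractText()               page.extract_text()
    """
    texts = []
    for i in range(len(reader.pages)):
        # Fall back to an empty string if a page yields no text.
        texts.append(reader.pages[i].extract_text() or "")
    return texts


def read_pdf(filename):
    """Open a PDF with the renamed reader class and return its page texts."""
    from PyPDF2 import PdfReader  # replaces PdfFileReader in PyPDF2 3.x

    return page_texts(PdfReader(filename))
```

      In other words, `len(reader.pages)` replaces only the call to `getNumPages()`; the loop over page indices stays the same.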

  • @gard8995
    @gard8995 1 year ago

    Hi. Thanks for a very helpful tutorial.
    Would it be possible to search for several strings at the same time and get an output something along these lines:
    Word A was found X times on pages x, y, z
    Word B was found X times on pages x, y, z
    And so on?
    Also, on top of that, could one run this script on several PDF files at the same time to get an output along these lines:
    Word A was found X times on pages x, y, z in document1
    Word A was found X times on pages x, y,z in document2
    Word B was found X times on pages x, y, z in document10
    I'm a Python newbie, so apologies in advance if my questions are stupid.

    • @YUNIKARN
      @YUNIKARN 1 year ago

      Thanks! Yes, this can be achieved using loops (over a list of strings), and you can also loop through all PDF files in a folder. This would be a nice exercise.
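
      A possible shape for that exercise, as a sketch only. The function names are hypothetical, matching is assumed to be case-insensitive, the output wording follows the format requested in the question, and `load_folder` assumes the PyPDF2 3.x API.

```python
from pathlib import Path


def find_pages(pages_text, term):
    """Return (total occurrences, list of 1-based pages containing term)."""
    needle = term.lower()
    per_page = [text.lower().count(needle) for text in pages_text]
    pages = [i + 1 for i, hits in enumerate(per_page) if hits > 0]
    return sum(per_page), pages


def report(documents, terms):
    """documents maps a document name to its list of page texts."""
    lines = []
    for name, pages_text in documents.items():
        for term in terms:
            total, pages = find_pages(pages_text, term)
            if total:  # skip terms that do not occur in this document
                page_list = ", ".join(str(p) for p in pages)
                lines.append(
                    f"{term} was found {total} times on pages {page_list} in {name}"
                )
    return lines


def load_folder(folder):
    """Extract the page texts of every *.pdf file in a folder (PyPDF2 3.x)."""
    from PyPDF2 import PdfReader  # lazy import: only needed for real files

    documents = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        reader = PdfReader(str(path))
        documents[path.stem] = [page.extract_text() or "" for page in reader.pages]
    return documents
```

      A call such as `report(load_folder("pdfs"), ["risk", "audit"])` would then produce one line per term and document, where `pdfs` is a placeholder folder name.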

  • @catesconsultinggroupllc937
    @catesconsultinggroupllc937 9 months ago

    Greetings. Great video tutorial. I have a question: I was able to search for a string of words using this code without any modifications. What I would like to do is return something based on the search words. For example, if I'm searching for the date something occurred, there is typically a preceding string: "Date of Service" should be followed by a date, e.g. "Date of Service" 01/05/2019, and I want to return the date 01/05/2019. Two things would need to change: how do we return the date, given that it is not the search term itself, and, since the date is not a string, would we need to change str anywhere in the code?

    • @YUNIKARN
      @YUNIKARN 9 months ago

      I worked on a somewhat related problem. The task was to explore words in their context. The challenge is to ensure that all dates are captured even if the date formats change. A two step approach is usually best. 1. Get the whole sentence that contains your search term (careful with page breaks). 2. Use an algorithm to filter dates. Drop us a line (see channel pages) if you want a chat. May the Force be with you!
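
      For the simple case in the question, a regular expression can return the date that directly follows a label. This is a sketch under strong assumptions: dates are numeric (such as 01/05/2019 or 3-12-20), at most five non-word characters separate label and date, and the name `dates_after` is hypothetical. Dates written out in words would need the two-step approach described above.

```python
import re

# Numeric dates with /, . or - as separator, e.g. 01/05/2019 or 3-12-20.
DATE_PATTERN = r"(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})"


def dates_after(label, text):
    """Return every date that directly follows label in text."""
    # \W{0,5} tolerates quotes, colons, spaces or a line break after the label.
    pattern = re.compile(
        re.escape(label) + r"\W{0,5}" + DATE_PATTERN, re.IGNORECASE
    )
    return pattern.findall(text)
```

      The extracted text is already a string, so the dates come back as strings too; they can be converted with `datetime.strptime` once the format is known.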

  • @SuperPaulofeitosa
    @SuperPaulofeitosa 1 year ago

    Excellent video, congratulations.
    Is it possible to search for several words in the same line?
    Example:
    From: Paulo Feitosa
    Sent: quinta-feira, 1 de dezembro de 2022 17:48
    I have a PDF with many occurrences of the words From and Sent. I want to search for them and also return the matching line of the PDF document.

    • @YUNIKARN
      @YUNIKARN 1 year ago

      That is possible. I have done another video on PDF files, which looks at related problems: th-cam.com/video/3s0-TGLbB4M/w-d-xo.html - just get in touch if you need help (email on channel pages or www.yunikarn.com). May the Power be with you!

  • @rivaltersilva9216
    @rivaltersilva9216 2 years ago

    Excellent class. But how could I find words and select the entire sentence containing them? Walter from Brazil

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Thanks for your comment! In principle, one could use the split method in Python and use list comprehension. For instance: page = "Hallo World yet again. I can see you. Find the word." then use: result = [sentence + '.' for sentence in page.split('.') if 'word' in sentence]. This might be a nice problem for another video. Best wishes, Gerhard
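
      A slightly extended variant of the idea in this reply, as a sketch. It splits on whitespace after '.', '!' or '?' instead of on every full stop, and it matches case-insensitively; both choices, and the name `sentences_with`, are assumptions. Abbreviations such as "e.g." will still be split incorrectly.

```python
import re


def sentences_with(term, text):
    """Return the sentences of text that contain term (case-insensitive)."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    needle = term.lower()
    return [s for s in sentences if needle in s.lower()]


page = "Hallo World yet again. I can see you. Find the word."
result = sentences_with("word", page)
```

      With the example page from the reply, only the last sentence is returned, because "World" does not contain the letters "word" in sequence.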

  • @kibtiachowdhury6011
    @kibtiachowdhury6011 1 year ago +1

    Hi. I want to extract only paragraphs and titles, without any tables or figures, from multiple PDF files. How can I solve this?

    • @YUNIKARN
      @YUNIKARN 1 year ago

      If your PDFs are academic papers, the easiest approach is to use Google Scholar (API) and obtain titles and abstracts. Then you don't need to handle PDFs, which is faster. Otherwise, you have to think about how to identify titles in PDF files, which is harder. You can get in touch (see channel pages) if you need help. We do projects and bespoke training.

  • @saeedewu129
    @saeedewu129 1 year ago

    Hi. Thnx for your video. Is it possible to extract multiple search terms from multiple pdf files at a time?

    • @YUNIKARN
      @YUNIKARN 1 year ago +1

      Multiple search terms could be arranged in a list and you can loop through it. You might prefer your output arranged differently (e.g., dictionary, Excel file etc.). I have done another video that reads PDFs and outputs the most common words. You might find that helpful. Finally, Python can go through several PDF files. There are many ways to do it. An easy option is to store all files in the same folder and then go through the folder in a loop. May the Power be with you!

    • @saeedewu129
      @saeedewu129 1 year ago

      @YUNIKARN Many thnx for ur reply. Will work on that. Is there any way to contact you for tips or advice if I run into problems when I try to do it myself?

    • @YUNIKARN
      @YUNIKARN 1 year ago +1

      @saeedewu129 You can find our email on the channel page or visit www.yunikarn.com

    • @saeedewu129
      @saeedewu129 1 year ago

      @YUNIKARN okay. many thnx

  • @YUNIKARN
    @YUNIKARN 1 year ago

    My new Company Valuation course is out! Limited offer for USD 9.99 (expires in four days): www.udemy.com/course/company-valuation-a-guide-for-analysts-investors-and-ceos/?couponCode=FEA4E8F50C8E011B61F2

  • @tedmac8984
    @tedmac8984 1 year ago

    Sir, thanks for the great service. Can you help me? I want to extract the data for each word from a PDF into Excel.

    • @YUNIKARN
      @YUNIKARN 1 year ago

      I am glad you find it useful. I have done a related video, which identifies the most common words in a PDF document and returns a list. Lists can be exported as Excel files. (Link: th-cam.com/video/3s0-TGLbB4M/w-d-xo.html). In short, what you are looking for can be done. If you need further help, we do consulting projects and develop bespoke training. Our contact details are on the channel page.

  • @juhaszat
    @juhaszat 1 year ago

    Superb content Michael! Could you please remove the ")" from github-repo link?

    • @YUNIKARN
      @YUNIKARN 1 year ago

      Updated - hope it works

  • @academysolution8074
    @academysolution8074 1 year ago

    Is it possible to extract only the text that is in a red font from a PDF, using the font properties?

    • @YUNIKARN
      @YUNIKARN 1 year ago

      That is not implemented in PyPDF2 as far as I know. pdfplumber is able to extract font colour.
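
      A hedged sketch of how the pdfplumber route might look. pdfplumber exposes each character on a page as a dictionary with a `non_stroking_color` entry (the fill colour); how "red" is encoded depends on the PDF's colour space, so the RGB value `(1, 0, 0)` below is an assumption that needs checking against the actual file. The function names are hypothetical.

```python
def red_text(chars, red=(1, 0, 0)):
    """Join the characters whose fill colour equals the given RGB value.

    chars is a list of character dictionaries as found in page.chars.
    """
    selected = []
    for char in chars:
        color = char.get("non_stroking_color")
        # Colours may be stored as lists or tuples; other types are skipped.
        if isinstance(color, (list, tuple)) and tuple(color) == tuple(red):
            selected.append(char["text"])
    return "".join(selected)


def red_text_in_pdf(filename):
    """Return the red text of every page in a PDF, one string per page."""
    import pdfplumber  # lazy import: only needed when reading a real file

    with pdfplumber.open(filename) as pdf:
        return [red_text(page.chars) for page in pdf.pages]
```

      Printing a few entries of `page.chars` for a known red word is a quick way to find out which colour value a particular PDF actually uses.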

  • @umamaheswararaom7909
    @umamaheswararaom7909 2 years ago +1

    How can I convert the data from different tables in a scanned-image PDF into an Excel/CSV file?

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Thanks for your question. Converting tables in PDF files into Excel files can be tricky. This requires another video. You can get updates on my channel and my Facebook page (link on the channel). Python is the way!

    • @walkwithus6536
      @walkwithus6536 1 year ago

      @YUNIKARN Yeah, please make a video as soon as possible

    • @YUNIKARN
      @YUNIKARN 1 year ago

      @walkwithus6536 You can always drop us a line (for the email address, see the channel pages) if you need a tailor-made solution

  • @hariprasad-ch2qc
    @hariprasad-ch2qc 7 months ago

    Can we identify a table in the PDF and represent the same in a tabular format?

    • @YUNIKARN
      @YUNIKARN 7 months ago

      If your PDF input refers to tables, you will need to modify your approach. The camelot library might be a useful starting point.

  • @harishbollineni2588
    @harishbollineni2588 2 years ago

    How do I install pip for a virtual environment?

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Hi Harish, if you use Anaconda, you need to work with the conda environment for updates. If you run Python directly, you can install the pip installer as follows: on Windows, download get-pip.py (do a Google search). This needs to be on the same path as your Python installation. Then change the directory into the folder. Use cmd (command prompt) and type python get-pip.py. Finally, check the installation using pip -V. Python is the Way!

  • @umamaheswararaom7909
    @umamaheswararaom7909 2 years ago

    How can I convert tables in a scanned-image PDF into an Excel/CSV file?

    • @YUNIKARN
      @YUNIKARN 2 years ago

      Thanks for your question. Converting tables in PDF files into Excel files can be tricky. This requires another video. You can get updates on my channel and my Facebook page (link on the channel). Python is the way!

    • @umamaheswararaom7909
      @umamaheswararaom7909 2 years ago

      @YUNIKARN A scanned-image PDF needs OCR extraction, which isn't required for a normal PDF.
      Or is the approach the same for both?

    • @YUNIKARN
      @YUNIKARN 2 years ago +1

      @umamaheswararaom7909 For scanned images, OCR is the way to go. If the table is part of a PDF file, other methods might work as well. I will cover these aspects in future videos.

  • @walkwithus6536
    @walkwithus6536 1 year ago

    How can I extract tables from PDF files into Excel?

    • @YUNIKARN
      @YUNIKARN 1 year ago +1

      I have found a few videos by other creators that cover this topic (mostly for financial accounting). I might do a video on it in future - but my production pipeline is full for the next 4-5 weeks

  • @walkwithus6536
    @walkwithus6536 1 year ago

    The GitHub link is not working.

    • @YUNIKARN
      @YUNIKARN 1 year ago +1

      I tested the link github.com/GerhardKling/DataWrangling/tree/main/DataExtractionPDF in the description. It seems to work fine for me. Drop me a line (see Channel page for email) if you are having trouble, and I can send you the files by email. May the Power be with you!

  • @picklenickil
    @picklenickil 1 year ago

    TLDR : langchain

    • @YUNIKARN
      @YUNIKARN 1 year ago

      Langchain rules

  • @valmirrastelyjunior9400
    @valmirrastelyjunior9400 6 months ago

    OK

    • @YUNIKARN
      @YUNIKARN 6 months ago

      Thik hai - have a great 2024!

  • @Baka_Oppai
    @Baka_Oppai 1 year ago

    PyPDF2 is just a mess of errors

    • @YUNIKARN
      @YUNIKARN 1 year ago

      Yes, it is messy ... 🫠