Extract and Visualize Data from PDF Tables with PDFplumber in Python

  • Published on Feb 8, 2025
  • Howdy all! I recently published a story that was based on some data analysis I did of a report I obtained from the Department of Behavioral Health and Developmental Services in VA. I wanted to share a quick walkthrough of how I extracted the data from tables in a PDF using a Python module called PDFplumber. Here's a link to the text version with the code - github.com/gam...
    By using PDFplumber, I was able to create a graph which shows the trend at the center of my article. I hope some of you can take something away from this walkthrough that will help you supplement your own reporting, especially if you're interested in data journalism.
    I'm by no means an expert coder, very much a beginner, so if there are things I could have done better let me know. That being said, I hope this walkthrough proves that any journalist can use programming to enhance their work, so you should try it if you haven't already!
    PDFplumber docs - github.com/jsv...
    Python tutorials - @socratica
    jwcaterine.com
    #python #walkthrough #journalism
  • Science & Technology
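
The workflow the description mentions (pull a table out of a PDF with PDFplumber, then graph it) might look roughly like the sketch below. This is not the author's code, which lives in the linked repository. The function names, the `Year`/`Count` columns, and the sample rows are all made up for illustration, and it assumes the table's first row is a header. `pdfplumber.open`, `pdf.pages`, and `page.extract_table()` are the library's own calls; matplotlib is one plotting choice among several.

```python
def extract_first_table(pdf_path, page_index=0):
    """Return the first table on a page as a list of rows (lists of str/None)."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table()  # None when no table is detected


def table_to_records(table):
    """Treat the first row as the header and map each later row to a dict."""
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):  # skip rows that are entirely empty
            records.append(dict(zip(header, cells)))
    return records


def plot_trend(records, x_key, y_key, out_path="trend.png"):
    """Plot one numeric column against another and save the figure."""
    import matplotlib.pyplot as plt  # third-party: pip install matplotlib

    xs = [record[x_key] for record in records]
    # Assumes numbers may carry thousands separators such as "1,204".
    ys = [float(record[y_key].replace(",", "")) for record in records]
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    ax.set_xlabel(x_key)
    ax.set_ylabel(y_key)
    fig.savefig(out_path)


# Hypothetical rows shaped like what extract_table() returns.
example_table = [["Year", "Count"], ["2019", "1,204"], [None, None], ["2020", "1,350"]]
records = table_to_records(example_table)
```

With a real file, the pieces would chain as `plot_trend(table_to_records(extract_first_table("report.pdf")), "Year", "Count")`, where `report.pdf` is a placeholder path.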

Comments • 21

  • @CoachMagic1
    @CoachMagic1 3 months ago +1

    Mr Caterine, you are gifted at sharing technical knowledge. I'm watching this video from a southwestern city not far from Shangri-la. Thank you for your articulate, step-by-step demonstration. Learn and practice a lot. 😄

    • @JWCat757
      @JWCat757 2 months ago

      Thanks a lot! Glad it was helpful.

  • @ramarisonandry8571
    @ramarisonandry8571 1 year ago +4

    I'm watching your video from Madagascar. Great job, thank you!

    • @JWCat757
      @JWCat757 1 year ago

      Wow! Very cool. Thanks for watching!

  • @virajmoghe2012
    @virajmoghe2012 1 year ago +1

    This is amazing stuff. God bless you. Keep up the good work

    • @JWCat757
      @JWCat757 1 year ago

      Thank you!

  • @maheshvdy
    @maheshvdy 3 months ago +1

    Simple and super, easy to understand

    • @JWCat757
      @JWCat757 2 months ago

      Appreciate it! Hope it helped.

  • @cken27
    @cken27 1 year ago +3

    If you are interested in PDF table extraction, give the "camelot" library a try. I found it superior to PDFplumber in terms of automatic table identification. It could detect bank statement tables without explicit lines and with empty cells. Also, the resulting object is already a pandas DataFrame, so you can select and clean the data in the usual pandas way.

    • @JWCat757
      @JWCat757 1 year ago +3

      Thank you for sharing, I will definitely give it a try!

    • @ajarivas72
      @ajarivas72 1 year ago +1

      @JWCat757
      Do both libraries work on tables built as images or as vectorized (selectable) images?

    • @JWCat757
      @JWCat757 1 year ago

      @ajarivas72 PDFplumber works with images, but it takes work to get it to read the table. See the "Visual Debugging" section of the README for more info - github.com/jsvine/pdfplumber#visual-debugging
      As for camelot, I'm not as familiar with it, but from what I can tell it doesn't seem to work with images.
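
For anyone curious what the camelot suggestion in this thread looks like in practice, here is a minimal sketch. It assumes the `camelot-py` package is installed and that the PDF is text-based. The function name and default arguments are made up for illustration. `flavor="stream"` is camelot's mode for tables separated by whitespace rather than ruled lines, and each table's `.df` attribute is the pandas DataFrame the commenter mentions.

```python
def read_tables_as_dataframes(pdf_path, pages="1", flavor="stream"):
    """Return every table camelot finds on the given pages as a pandas DataFrame."""
    import camelot  # third-party: pip install camelot-py

    # flavor="stream" guesses columns from whitespace; flavor="lattice"
    # relies on ruled lines between cells.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [tables[i].df for i in range(tables.n)]
```

Calling `read_tables_as_dataframes("statement.pdf")` (a placeholder path) would return a list of DataFrames that can then be cleaned with ordinary pandas operations.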

  • @YashsCodeCamp
    @YashsCodeCamp 10 months ago +2

    Thanks!

  • @YalavarthiRahul
    @YalavarthiRahul 11 months ago +1

    Thanks a lottttt !!!!!!!!!!!!!!!!!

    • @JWCat757
      @JWCat757 11 months ago

      You’re welllllcommeeeeee!!!!!!!!!!

  • @gvenagas
    @gvenagas 8 months ago +1

    I found that by opening a PDF file in Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the browser has converted it to HTML, and maybe save it for further processing in some programming language.

    • @JWCat757
      @JWCat757 6 months ago +1

      Interesting!

  • @bxroberts
    @bxroberts 1 year ago +1

    Great video! Do you know if the extract tables functionality needs the tables to be ruled?

    • @JWCat757
      @JWCat757 1 year ago +1

      Thank you! According to the PDFplumber docs, it can find lines that are explicitly defined as well as lines implied by the alignment of words on the page, so my guess is that tables don't need to be ruled.
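
A hedged sketch of how that plays out in code: as far as the PDFplumber README describes it, `extract_table()` looks for drawn lines by default, and the `"text"` strategy asks it to infer cell boundaries from word alignment instead. The constant and function names below are made up for illustration. The `table_settings` keys `vertical_strategy` and `horizontal_strategy` are the library's own.

```python
# Infer column and row boundaries from how words line up on the page,
# rather than from ruled lines drawn in the PDF.
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_unruled_table(pdf_path, page_index=0, table_settings=None):
    """Extract a table that has no ruled lines, using text alignment."""
    import pdfplumber  # third-party: pip install pdfplumber

    settings = TEXT_ALIGNED_SETTINGS if table_settings is None else table_settings
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(table_settings=settings)
```

Results with the text strategy tend to depend on the layout, so it may take some tuning per document.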

  • @bennguyen1313
    @bennguyen1313 11 months ago

    Not sure how to choose among the many Python packages for extracting data from a PDF: PyMuPDF, PyPDF2, PDFplumber, tabula-py, etc.
    For example, what if the PDF is a scan of a paper document, i.e. it's crooked and the quality is bad? Is there one that handles that best? Or maybe I should use AI (ChatGPT + GPT4Vision/Ai PDF) to do OCR, then have it extract the data?
    Also, any suggestions on how to get the values from specific columns in a text file? For example, I have a text file with data like this:
    #Time (HHH:MM:SS): 002:34:02
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    #Time (HHH:MM:SS): 002:34:03
    # T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
    # ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
    056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
    057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
    How can I get just the data from DT00 through DT07 into an array, without doing lots of preprocessing to scrub out the repeating #Time headers that appear throughout the file?

    • @JWCat757
      @JWCat757 10 months ago

      I don't have an exact answer to your question, but I will say that when I posted a specific problem like this to the Discussions section of the PDFplumber GitHub, I got a pretty quick and thorough response - github.com/jsvine/pdfplumber/discussions
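
For the text-file part of the question above, one possible approach in plain Python is to skip every line that starts with `#` and keep the last eight whitespace-separated values of each remaining row. This is only a sketch. It assumes, as in the sample rows, that DT00 through DT07 are always the final eight fields. Because blank columns such as CMD2 and ERROR collapse when splitting on whitespace, counting from the left would be unreliable. The function name is made up for illustration.

```python
def extract_data_words(lines, n_words=8):
    """Collect the trailing n_words fields of every non-comment row."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # '#Time ...', the column header, and the '=====' ruler all start with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        tokens = stripped.split()
        if len(tokens) < n_words:
            continue  # too short to hold a full set of data words
        rows.append(tokens[-n_words:])
    return rows


sample_log = """\
#Time (HHH:MM:SS): 002:34:02
# T(ms) BUS CMD1 CMD2 FROM SA TO SA WC TXST RXST ERROR DT00 DT01 DT02 DT03 DT04 DT05 DT06 DT07
# ===== === ==== ==== ==== == ==== == == ==== ==== ====== ==== ==== ==== ==== ==== ==== ==== ====
816 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
817 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
#Time (HHH:MM:SS): 002:34:03
056 B0 D84E BC RT27 2 14 D800 2100 0316 0000 0000 0000 0000 CCCD 0000
057 A0 DC50 RT27 2 BC 16 D800 2120 0000 4080 3000 0000 3000 0000 0000
"""

data_words = extract_data_words(sample_log.splitlines())
```

For a real file, pass the open file object instead of `sample_log.splitlines()`. If the values are hexadecimal, `int(word, 16)` would convert each one. If the original file is fixed-width, slicing by character position would be more robust than taking the last eight fields.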