Combine and Extract multiple PDF tables to clean Excel Data using Tabula library of python

The Data Corner

7,279 views


Published on 28 Dec 2024

Comments • 26

@TheCopperMystic 3 months ago ⁺¹
When I typed pdf_files or pdf_files[1] in the editor I didn't get any results. When I typed pdf_files[0] in the terminal I got an error saying the term is not recognized as the name of a cmdlet.
@TheCopperMystic 3 months ago
01:40 How did you set up the VS Code editor to have separate cells like this? Please, someone let me know.
@theDataCorner 3 months ago
Hello.
Within VS Code, make sure you have the extensions named *Jupyter* and *Jupyter Cell Tags* installed. You can read more at the link below; they are basically Jupyter code cells.
code.visualstudio.com/docs/python/jupyter-support-py
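For readers asking how the cells are created: with the Jupyter extension installed, VS Code treats a `# %%` comment in a regular `.py` file as the start of a code cell that can be run on its own in the interactive window. The sketch below is only an illustration of that marker; the cell contents are an assumption based on the `pdf_files` variable mentioned in this thread, not the exact code from the video.

```python
# %% Cell 1: list the PDF files in the current folder
import os

pdf_files = [name for name in os.listdir(".") if name.lower().endswith(".pdf")]

# %% Cell 2: inspect the result (run this cell separately from the one above)
print(pdf_files)
```

Each `# %%` line is an ordinary Python comment, so the same file still runs top to bottom as a normal script.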
@AEARArg 11 months ago
Great walkthrough
@theDataCorner 11 months ago
Thank you, I appreciate it.
If you find extracting data from PDF to excel interesting, do check out my latest video where I extract PDF data using R script, Python libraries and Microsoft Power Query.
@prakharjain8716 9 months ago
What was the formatting you did at 1:44?
@theDataCorner 9 months ago
Hello.
These are Jupyter code cells inside VS Code, using the interactive window. They are really helpful when I need to run code blocks one by one instead of running everything together.
You can read more about them at the link below:
code.visualstudio.com/docs/python/jupyter-support-py
@laalbujhakkar 4 months ago
@theDataCorner It would be nice to know which extensions you are using. Any pointers? Thanks!
@theDataCorner 4 months ago
@laalbujhakkar Hello, other than a bunch of themes, I have:
DataWrangler - very helpful for data analysis folks.
IntelliCode - don't use it much.
Jupyter
Jupyter Cell Tags
R - don't write much R script.
Codiumate - have used it a bit, but it was causing a lot of memory usage.
I jumped from PyCharm to VS Code for speed, so I don't want to slow it down.
@mpfiesty 1 year ago
Thank you! Love this content! The only problem for me is that I have a monthly report with 61 different PDFs, each with three table types (Deposits, Fees, and Discounts). The PDFs vary from 2 to 11 pages, and each table can be longer or shorter than the others in each PDF, so I can't create consistent rules like you did in this video.
Is there a way I could filter through the tables, make lists of the ones with the same headers, and then append and process them?
Thank you in advance! This video already helped me out a ton!
@theDataCorner 1 year ago
Thank you Matt, I am glad to hear the video helped you out.
One thing you can try is to check the header columns using an if/else inside a for loop.
Another way is to check the first row of a specific column to see whether it matches a specific value, and then go from there.
The line below assumes df is a DataFrame and 'column_to_check' is a column name.
df['column_to_check'].iloc[0]
Hope this helps.
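The header check suggested above could be sketched roughly as follows. This is a minimal illustration, assuming the extracted tables are already pandas DataFrames whose column names identify the table type; the helper name `group_tables_by_header` and the sample data are hypothetical and not from the video.

```python
from collections import defaultdict

import pandas as pd


def group_tables_by_header(tables):
    """Concatenate DataFrames that share the same (whitespace-stripped) column names."""
    groups = defaultdict(list)
    for table in tables:
        # Normalise the header so "Date " and "Date" land in the same group
        key = tuple(str(col).strip() for col in table.columns)
        cleaned = table.copy()
        cleaned.columns = list(key)
        groups[key].append(cleaned)
    return {key: pd.concat(frames, ignore_index=True) for key, frames in groups.items()}


# Hypothetical stand-ins for tables extracted from several PDFs
deposits_1 = pd.DataFrame({"Date": ["01/02"], "Deposit": [100.0]})
fees = pd.DataFrame({"Date": ["01/03"], "Fee": [2.5]})
deposits_2 = pd.DataFrame({"Date ": ["01/09", "01/10"], "Deposit": [50.0, 75.0]})

combined = group_tables_by_header([deposits_1, fees, deposits_2])
for header, frame in combined.items():
    print(header, len(frame))
```

Grouping on the full header tuple avoids hard-coding table positions, which is the part that breaks when table lengths and page counts vary between PDFs.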
@mustaqimjohari4276 4 months ago
Hello, I cannot get pdf_files[0]; there is an error saying the term 'pdf_files[0]' is not recognized.
@theDataCorner 4 months ago
Hello Mustaqim,
That's strange, can you share the code you are working on? Maybe you skipped the line where *pdf_files* was defined?
The error means that a variable named *pdf_files* doesn't exist, which is why it cannot be recognised.
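A hedged observation on the error text quoted in this thread: "the term ... is not recognized as the name of a cmdlet" is wording that PowerShell uses, which suggests the line may have been typed into the VS Code terminal rather than run as Python. When a name is genuinely undefined in Python, the result is a NameError instead. A minimal sketch of that case, using a deliberately undefined, hypothetical variable name:

```python
# 'undefined_pdf_files' is intentionally never defined, to show Python's own error
try:
    print(undefined_pdf_files[0])
except NameError as exc:
    message = str(exc)
    print("Python reports:", message)
```

If the message you see mentions a cmdlet rather than a NameError, running the line in a Python cell or the interactive window (after the cell that defines the variable) is worth trying first.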
@AIWorld-1104 1 year ago
Thank you, this video is very helpful :) In my case there is a large PDF with more than 100 pages, and the column headers appear only on the first page, so this extracts data from the first page only. I want to extract from all pages. Can you provide some guidance to solve this? Thank you
@theDataCorner 1 year ago
Hello, thank you for watching the video.
The line of code below should load all pages from your PDF.
tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True)
Have you checked what len(tables) returns? How many tables does it say your PDF has?
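For the case where the header appears only on the first page, one possible approach is to read every table without a header and then apply the first page's header row to all of them. The sketch below shows only the stitching step on plain DataFrames; it assumes the tables were read with tabula-py's `pandas_options={'header': None}` (shown in a comment), and the helper name `stitch_headerless_tables` and the sample data are hypothetical.

```python
import pandas as pd

# Assumed extraction step (requires tabula-py and Java), shown for context only:
# tables = tabula.read_pdf(pdf_file, pages='all', multiple_tables=True,
#                          pandas_options={'header': None})


def stitch_headerless_tables(tables):
    """Use the first row of the first table as the header for every table."""
    if not tables:
        return pd.DataFrame()
    header = tables[0].iloc[0].tolist()
    # Drop the header row from the first table, keep the rest as data
    frames = [tables[0].iloc[1:]] + list(tables[1:])
    # Skip tables whose column count does not match the header
    frames = [frame for frame in frames if frame.shape[1] == len(header)]
    stitched = pd.concat(frames, ignore_index=True)
    stitched.columns = header
    return stitched


# Hypothetical stand-ins for two pages read without a header
page_1 = pd.DataFrame([["Name", "Amount"], ["A", "10"]])
page_2 = pd.DataFrame([["B", "20"], ["C", "30"]])

result = stitch_headerless_tables([page_1, page_2])
print(result.shape)
```

Tables with a different column count are skipped rather than forced into the header, so it is worth comparing the number of stitched rows against len(tables) to see whether anything was dropped.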
@AIWorld-1104 1 year ago
Hello @theDataCorner, thanks for this suggestion :)
@theDataCorner 1 year ago
Happy to help :)
@sarayumallam9507 5 months ago
Please send the source code. By the way, I am getting an error like "java not found", so please help me resolve it. I appreciate your work.
@theDataCorner 4 months ago
Hello,
If you already have Java installed and are still getting the error, please try the steps below. The Java setup is a bit tricky, but hopefully it is a one-time setup.
From the Windows Start menu, search for *Environment Variables*, open *Edit environment variables*, then follow the steps below:
Under System Variables, click Path and then press Edit... instead of New. On the next screen (Edit environment variable for the Path variable), click New and add the address, e.g. C:\Program Files (x86)\Java\jre1.8.0_201\bin. Press OK and the Path variable will be appended/updated.
Answer taken from:
stackoverflow.com/questions/54817211/java-command-is-not-found-from-this-python-process-please-ensure-java-is-inst
Source code:
codepad.site/edit/q9aig7rj
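After editing the Path variable as described in the Java steps above, a quick way to confirm that Python can see Java is to look it up on PATH from the same environment that runs the script. This is a small diagnostic sketch using only the standard library; it is an assumption about what is useful to check, not part of the video's code.

```python
import shutil
import subprocess

# Look for a 'java' executable on the PATH seen by this Python process
java_path = shutil.which("java")

if java_path is None:
    print("java was not found on PATH; check the environment variable steps above")
else:
    # 'java -version' usually writes its output to stderr rather than stdout
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    print(java_path)
    print((result.stderr or result.stdout).strip())
```

A terminal or VS Code window that was already open before the Path change may still hold the old value, so restarting it before re-running the check is worth trying.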
@sarayumallam9507 4 months ago
@theDataCorner Thanks for your time bro, keep it up.
@theDataCorner 4 months ago
Happy to help!
@smithndongla5514 4 months ago
Hello, I get this message: "org.apache.fontbox.ttf.CmapSubtable processSubtype14
WARNING: Format 14 cmap table is not supported and will be ignored"
@theDataCorner 4 months ago
Hello Smith,
Is your PDF scanned or computer-generated?
The message seems to be related to a font; I have not seen it before.
You can try using camelot, which works in a similar way to the tabula library. If that doesn't work, you can try using Microsoft Power Query to do the same task.
th-cam.com/video/b8VTa3gYOBo/w-d-xo.html
Camelot Code:
import os
import camelot
import pandas as pd
# List all PDF files in the current directory
pdf_files = [x for x in os.listdir('.') if x.endswith('.pdf')]
print(pdf_files)
# Initialize an empty DataFrame to store the combined tables
combined_df = pd.DataFrame()
# Loop through each PDF file
for pdf_file in pdf_files:
    # Extract tables from every page of the PDF
    tables = camelot.read_pdf(pdf_file, pages='all')
    print(f"This pdf has {len(tables)} tables")
    # Select the required tables; the indexes 2 and 6 are specific to these PDFs.
    # tables[6] needs at least 7 tables, so check for that before using it.
    if len(tables) > 6:
        required_table = pd.concat([tables[2].df, tables[6].df], ignore_index=True, sort=False)
    else:
        required_table = tables[2].df
    # Add the PDF source column
    required_table['pdf_source'] = pdf_file
    # Append the required table to the combined DataFrame
    combined_df = pd.concat([combined_df, required_table], ignore_index=True, sort=False)
# Create a copy of the combined DataFrame
df_new = combined_df.copy()
