How to Extract Tables from PDF using Python

Misha Sv

71,736 views

Published on 2 Feb 2025

Comments • 88

@paulsmithson4941 3 years ago ⁺³¹
Wow, fantastic tutorial! I work as an accountant, and Linda from HR, who, and this is between us, is thick as a brick, keeps sending us the payroll tables as PDFs. As an accountant, I need my tables in the Excel software so that I can generate the macros for the supervisors' meetings every second Thursday. Thanks to your brilliant, amazing tutorial, what used to take 4 hours (not counting lunch time) now takes 15 minutes tops! I have been able to use my remaining 3 hours 45 minutes to clean up my Desktop folders, entertain myself with some sudoku, and n0sc0pe h8ters on the LoL game. Thank you again Mr. Sv, very much appreciated!
@davidpalomeque4770 2 years ago ⁺³
Super clever tutorial Misha, in 10 minutes you gave me what I was looking for. Keep up the good work!
@MishaSv 2 years ago ⁺¹
Thank you!
@ajarivas72 1 year ago ⁺¹
@MishaSv Does it work when the tables are pictures instead of vectorized data?
@MishaSv 1 year ago
@ajarivas72 I am not sure about that. Please feel free to explore using the code provided in the tutorial and update me in the comments section here. I'd be curious to know if it works for both vectorized data and images of tables!
@RC-ql5lp 1 year ago ⁺¹
Very concise but detailed explanation, even for a new Python user like me. The video is also very easy to follow and is organized logically. A very valuable 14 minutes I spent watching this. Thank you.
@MishaSv 1 year ago
Thank you!
@higiniofuentes2551 8 months ago
Thank you for this very useful video!
@higiniofuentes2551 8 months ago
Does it also work well with tables without "lines"?
Thank you!
@nothing_to_love 1 month ago ⁺¹
OMG, life saver.....
@MishaSv 1 month ago ⁺¹
I'm glad this tutorial helped you!
@path2ds863 10 months ago ⁺²
You helped me a lot. Thx!
@MishaSv 5 months ago
Amazing!
@DwaraknathKeerthi 2 years ago ⁺¹
Well explained in a short time, thanks, Misha!
@MishaSv 2 years ago
Thank you!
@chethanchintumj4162 1 year ago ⁺¹
Thanks a lot for all your efforts to help us understand PDF table extraction. 😇🥰 I'm now able to fetch tables from unstructured PDFs. Once again, thanks a lot.
@carloschire5777 2 years ago ⁺¹
Thanks a lot, it helps so much. Greetings from Peru!
@MishaSv 2 years ago
Thank you!
@marcobaquero6867 2 years ago ⁺¹
Thank you Misha... Your video is very clear and useful!! TKS!!
@simplelearn25 2 years ago ⁺¹
Thank you Mr.
@gregNFL 2 years ago ⁺¹
I’m familiar with the Tabula Windows app (which works pretty well) but this is next level. Thank you so much!
@MishaSv 2 years ago ⁺¹
Glad it was helpful!
@vladimirdiadichev6140 2 years ago ⁺¹
Good tutorial, thanks.
@MishaSv 2 years ago
Thank you!
@duaaanis 1 year ago ⁺²
Great 👍
@MishaSv 1 year ago
Thank you!
@artemkovalenko7257 2 years ago ⁺¹
Thanks, a great video!
@mariamalmutairi3044 1 year ago ⁺¹
Thank you, that's very helpful. I just have a question: what if I have the same table repeated in multiple PDFs and I need to append them to one CSV file?
@MishaSv 11 months ago
If the PDF files are placed in the same folder, then you can iterate over the files, extract the tables from each one, and then append them together.
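A minimal sketch of that loop, assuming every PDF holds tables with the same columns. The function names `list_pdfs` and `append_tables_to_csv` are hypothetical and not from the video; `tabula.read_pdf` and `pandas.concat` are the real library calls.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by name for a stable order."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def append_tables_to_csv(folder, output_csv):
    """Extract every table from every PDF in `folder` and write them to one CSV."""
    # Third-party imports are kept local so list_pdfs works without them.
    import pandas as pd
    import tabula  # provided by the tabula-py package; needs Java at call time

    frames = []
    for pdf_path in list_pdfs(folder):
        # read_pdf returns a list of DataFrames, one per detected table.
        frames.extend(tabula.read_pdf(str(pdf_path), pages="all"))
    if not frames:
        raise ValueError(f"no tables found in {folder}")

    # Assumption: all tables share the same columns, so rows simply stack.
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(output_csv, index=False)
    return combined
```

Calling `append_tables_to_csv("pdfs", "all_tables.csv")` (both names are placeholders) would then produce a single CSV. If the columns differ between files, `pd.concat` still runs but fills the gaps with NaN, so it is worth checking the result.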
@approvedtrash 3 years ago ⁺¹
Finally, a tutorial where I can get a kitchen table out of my computer...
Wait, did I miss something...
@yo5175 1 year ago ⁺¹
After print(len(dfs)) I got "SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape".
Could you tell me what the problem is?
Solved it: just put `r` before your normal string literal. That turns it into a raw string, in which backslashes are not treated as escape characters.
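A short illustration of the raw-string fix, assuming the file path in question was a Windows path that starts with something like `C:\Users` (the path below is hypothetical; the original one is not shown in the comment).

```python
# A normal string literal treats "\U" as the start of an 8-digit Unicode
# escape, so a literal such as "C:\Users\me\report.pdf" fails to compile.
# Prefixing the literal with r keeps every backslash as a plain character.
raw_path = r"C:\Users\me\report.pdf"  # hypothetical path

# Two equivalent alternatives: escaped backslashes, or forward slashes.
escaped_path = "C:\\Users\\me\\report.pdf"
forward_path = "C:/Users/me/report.pdf"

# The raw and the escaped spellings produce the same string.
print(raw_path == escaped_path)
```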
@meixinyap5560 2 years ago ⁺²
Hi, your work is fantastic and I am amazed by it! I am just wondering whether Python can help if I need to extract specific tables that are located on different pages in different files.
I have more than 200 PDF files, and each one has a different number of pages: some have only 5, others have 10. I need the table with the words “statement total” so that I can extract the data under “quantity” and “amount” in each of those tables.
Currently, my workflow is: open the PDF, search for the page that has “statement total”, look for the values under “quantity” and “amount”, copy and paste them into my Excel file, then close the PDF.
I hope to get some advice from you, thanks.
@MishaSv 2 years ago
This would probably require multiple steps. You will have to first find the relevant page with the text that says "statement total", then get the page number, and then extract the table from that page.
You will have to do it for every PDF file. It can be difficult since tables can span multiple pages as well. It's definitely doable, but it requires some custom code, including finding the correct text and then extracting the tables.
@kennethgomes4727 2 years ago
@MishaSv How do I find specific text in the PDF and then take the table below it? Can you let us know?
@BconeBot 1 year ago
@kennethgomes4727 Did you get an answer to this? If yes, please let me know, I'm facing the same issue.
@StefanoVerugi 1 year ago
@kennethgomes4727 I would try the library fitz (PyMuPDF). It reads the text in a PDF, and you can store it in a dictionary using the page number as key and the text as value. From there you can search for your text and get the page number where your table is.
Hope it helps.
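The approach described in this thread could be sketched roughly as follows. The helper names are hypothetical; `fitz.open`, `page.get_text()` and `tabula.read_pdf(..., pages=[...])` are the library calls assumed here, and the sketch only finds tables on the same page as the phrase, not tables that continue onto the next page.

```python
def read_page_texts(pdf_path):
    """Map 1-based page numbers to the text PyMuPDF extracts from each page."""
    import fitz  # PyMuPDF, imported locally so the search helper has no dependencies

    with fitz.open(pdf_path) as doc:
        return {index + 1: page.get_text() for index, page in enumerate(doc)}


def pages_containing(page_texts, phrase):
    """Return the page numbers whose text contains `phrase`, ignoring case."""
    needle = phrase.lower()
    return [
        number
        for number, text in sorted(page_texts.items())
        if needle in text.lower()
    ]


def tables_on_pages_with(pdf_path, phrase):
    """Extract tables only from the pages that mention `phrase`."""
    import tabula  # tabula-py; page numbers are 1-based here as well

    pages = pages_containing(read_page_texts(pdf_path), phrase)
    if not pages:
        return []
    return tabula.read_pdf(pdf_path, pages=pages)
```

For the 200-file case above, `tables_on_pages_with(path, "statement total")` would be called once per file, and the "quantity" and "amount" columns selected from the returned DataFrames, assuming tabula detects those headers.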
@andriuslopes6377 3 years ago ⁺¹
Wonderful. Thanks a lot!!
@pinkpython5548 2 years ago ⁺³
Did nobody else get this error?
AttributeError: module 'tabula' has no attribute 'read_pdf'
@ousmantouray3315 2 years ago ⁺¹
Install tabula-py in addition to tabula. Otherwise, it won't work.
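A small diagnostic sketch for this error. It rests on an assumption drawn from the tabula-py FAQ as I understand it: a separate PyPI package named `tabula` provides a module with the same name and no `read_pdf`, so having it installed next to `tabula-py` can cause this AttributeError, and removing it may be needed. The function names and messages are hypothetical.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def diagnose(tabula_version, tabula_py_version):
    """Suggest a next step based on which of the two distributions is installed."""
    if tabula_py_version is None:
        return "tabula-py is not installed: pip install tabula-py"
    if tabula_version is not None:
        # Assumption: the unrelated 'tabula' package shadows tabula-py's module.
        return "both are installed: pip uninstall tabula, then reinstall tabula-py"
    return "only tabula-py is installed: check for a local file named tabula.py"


print(diagnose(installed_version("tabula"), installed_version("tabula-py")))
```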
@saviodemirandapereira4924 10 months ago ⁺¹
Hey, how can I solve this?
No JVM shared library file (jvm.dll) found. Try setting up the JAVA_HOME environment variable properly.
@Vartos 5 months ago
I have the same problem. Did you manage to find a solution?
@jayzeen 2 years ago
Hello. Great tutorial. A quick question: if I wanted to use this in my application and host it, would it still work after hosting?
@italo.buitron 2 years ago ⁺¹
I LOVE YOU!
@jonelatendido9836 1 year ago ⁺¹
Do I need to install Visual Studio Code? I already installed Python and Java. Please answer immediately. Thank you.
@MishaSv 1 year ago
No, you don't. You can run the code as a .py file from any editor or from the terminal.
@defypark4595 1 year ago ⁺¹
This is my error: "JVMNotFoundException: No JVM shared library file (libjli.dylib) found. Try setting up the JAVA_HOME environment variable properly." Can anyone help, please? I've downloaded Java and installed tabula and tabula-py.
@MishaSv 1 year ago
You need to configure the Java path (JAVA_HOME) in your environment variables.
Here are the instructions:
confluence.atlassian.com/doc/setting-the-java_home-variable-in-windows-8895.html
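Before changing any settings, it can help to see what Python itself can find. The sketch below only reports the current state using the standard library; `describe_java_setup` is a hypothetical helper, and it does not fix the configuration.

```python
import os
import shutil
from pathlib import Path


def describe_java_setup(environ=None, which=shutil.which):
    """Report JAVA_HOME and whether a `java` executable is on PATH."""
    env = os.environ if environ is None else environ
    java_home = env.get("JAVA_HOME")
    return {
        "JAVA_HOME": java_home,
        # True only when the variable is set and points at an existing folder.
        "JAVA_HOME exists": bool(java_home) and Path(java_home).is_dir(),
        # Full path of the java executable found on PATH, or None.
        "java on PATH": which("java"),
    }


for key, value in describe_java_setup().items():
    print(f"{key}: {value}")
```

If `JAVA_HOME` prints as `None` or the folder does not exist, that matches the error messages quoted in these comments.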
@gregorydunks 7 months ago
Hi, I have one big table that carries on through each page, but each page is technically its own table with new headers. Is there any way to append all of these tables into one file and remove the repeated headers, so that it becomes one long CSV file with only one set of headers?
@ousmantouray3315 2 years ago ⁺³
What do you do if the table continues on multiple pages?
@StefanoVerugi 1 year ago ⁺¹
Setting pages to 'all' creates a list of DataFrames.
@SeekingUltimateSynthesis 3 years ago ⁺²
What if a table is split across multiple pages and the headers have multiple rows that are split into "columns" differently?
@MishaSv 3 years ago ⁺¹
That would be a more complex operation for the standard functionality to handle. I suggest looking at their full documentation here: tabula-py.readthedocs.io/en/latest/
@SeekingUltimateSynthesis 3 years ago ⁺¹
@MishaSv Ah, thanks!! I'll go see if I can figure that out.
@GururajSapkal 2 years ago
In the video above, the table data is extracted from the PDF as a list. What should I do to convert this list into a DataFrame?
@Rocklee46v 2 years ago ⁺¹
CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar' returned non-zero exit status 1.
I'm getting the above error even after installing the latest version of the JVM. Any help would be very much appreciated.
@MishaSv 2 years ago
stackoverflow.com/questions/53880574/calledprocesserror-when-i-am-trying-to-read-the-pdf-tables
@bushramodi671 2 years ago
The code runs without any error, but I am still not getting the Excel file. Can you help, please?
@srinathk3254 2 years ago ⁺¹
Bro, while trying to extract the whole PDF, it only gives me the last page and excludes all the other pages... Can you help with this?
@MishaSv 2 years ago
It depends on how the original PDF was created. If it has images of tables inserted, then the script might not get them from the PDF. It will only work if the tables were originally created as tables in the PDF.
@jonelatendido9836 1 year ago
@MishaSv Is this going to work if the PDF is first scanned using OCR, and then all the tables are extracted at once?
Really great tutorial, love this ❤❤
@tanmaychaturvedi8191 6 months ago
The thing is, whether it is tabula or camelot, they don't read all the tables. I want to extract tables from research papers, but my RAG pipeline, in which I have used tabula or camelot, fails to cover all the cases. Is there any other solution?
@ammariskandar9939 2 months ago
Have you found a solution?
@tanmaychaturvedi8191 2 months ago
@ammariskandar9939 Yup, I got the solution. We needed complete table extraction for our advanced RAG project as part of our internship.
So we used PaddleOCR and PPStructure for it. It extracts the complete table while maintaining its structure too.
@gvenagas 8 months ago
I found that by opening a PDF file with Mozilla Firefox and inspecting it with the developer tools, you can collect its text (with the help of JavaScript) after the web browser has converted it to HTML, and maybe save it for further processing with some programming language.
@subbu2810 2 years ago ⁺¹
Really good video... How can we write the data into a single file with multiple tabs?
@MishaSv 2 years ago
You will have to write out each table as a separate .csv file after it's extracted from the PDF.
@san2sreshta 10 months ago
How do I handle a single table that spans 2 pages?
@kaseox5436 2 years ago ⁺¹
What if I want only the first line of the table?
@MishaSv 2 years ago
You will have to extract the whole table, read it as a DataFrame, and then select the first row from it using pandas.
@ramkumarkumar9305 3 years ago ⁺¹
I need to learn how to code print replication from PDF to HTML.
@phild5339 1 year ago ⁺¹
How would you change this code so that you only extract a specific column from a table?
@MishaSv 1 year ago
You can extract the whole table and then just select the column you need using pandas.
@JM-fr9bc 3 years ago
Hi, what if you have a table that spans multiple pages?
@MishaSv 3 years ago
You should watch the section from 7:00 in the video. If you run the tabula.convert_into() function as shown in the tutorial and set pages="all" (or the specific page numbers), it will write all tables into a single CSV file, and you would then need to separate the tables manually.
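For reference, a minimal sketch of that call. The wrapper names are hypothetical and the file name in the usage note is a placeholder; `tabula.convert_into` with `output_format="csv"` and `pages="all"` is the tabula-py call being described.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Place the CSV next to the PDF, with the same base name."""
    return Path(pdf_path).with_suffix(".csv")


def pdf_tables_to_csv(pdf_path, pages="all"):
    """Write every table tabula finds on `pages` into one CSV file."""
    import tabula  # tabula-py; imported locally because it needs Java to run

    output_path = csv_path_for(pdf_path)
    # All detected tables are written one after another into the same file.
    tabula.convert_into(str(pdf_path), str(output_path), output_format="csv", pages=pages)
    return output_path
```

Calling `pdf_tables_to_csv("report.pdf")` would write `report.csv`. Because the tables are written back to back, a table that spans pages may appear as consecutive blocks that still need tidying by hand.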
@Actanonverba01 2 years ago ⁺¹
Good videos, but the text is very small. You GOTSTA try zooming in. ;))
@taneryilmaz6171 3 years ago ⁺¹
Can I use a scanned PDF???
@MishaSv 3 years ago
I haven't tried using it on scanned PDF files. Feel free to try the same code, and let me know in the comments section if it worked!
@glenn8781 11 months ago
Getting JavaNotFoundError :(
@parranoic 2 years ago ⁺¹
It would be nice if my company didn't have 586 pages with 3 or 4 different tables on each page :))
@MishaSv 2 years ago ⁺¹
Yes, this implementation is for some simple PDF files!
@parranoic 2 years ago ⁺¹
@MishaSv Good luck trying to explain that to them. They wanted to stop users from uploading confidential files to random conversion sites, and I tried Power Automate, AI models, and Python.
@parranoic 2 years ago ⁺¹
@MishaSv Great tutorial btw, thanks a lot :)
@MishaSv 2 years ago
@parranoic Thank you!
@NoToBusinessCasual 4 months ago ⁺¹
Good video. Nothing against the video, but the library is not perfect. I was quite excited that I would get to extract data from my E*TRADE statement files (PDF), but whether I run it for all pages or page by page, it skips certain tables or parts of tables. I suspect that when a table continues from one page to the next, the logic hits a glitch and becomes very unpredictable for that page.
@MishaSv 3 months ago
Yes, the library has its limitations, and this tutorial just showcases some of the functionality. It works best when a table is well defined and sits on one page. The moment it is split across multiple pages, you need to customize the code to retrieve the table as a whole by concatenating several DataFrames.
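One way such a concatenation could look, as a sketch only. It assumes the simple case where every page repeats the same header row, so each per-page DataFrame ends up with identical columns; `share_header` and `stitch_pages` are hypothetical helpers, and a page without a header would need extra handling that is not shown here.

```python
def share_header(column_lists):
    """True when every table has the same column names in the same order."""
    headers = [[str(name) for name in columns] for columns in column_lists]
    return all(header == headers[0] for header in headers)


def stitch_pages(frames):
    """Concatenate per-page DataFrames of one logical table into a single one."""
    import pandas as pd  # imported locally so share_header has no dependencies

    frames = list(frames)
    if not frames:
        raise ValueError("no tables to concatenate")
    if not share_header(frame.columns for frame in frames):
        # Assumption: mismatched headers mean the pages need manual alignment.
        raise ValueError("tables have different headers; align them first")
    # Matching columns stack cleanly and leave a single header row.
    return pd.concat(frames, ignore_index=True)
```

With tabula-py this would be used as `stitch_pages(tabula.read_pdf("statement.pdf", pages="all"))`, where the file name is a placeholder. Writing the result with `to_csv(..., index=False)` then gives one long CSV with one set of headers.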
@abdulwajid6725 2 years ago
I had hoped it would work on scanned images too. 🥲
