Extract tabular data from PDF with Camelot Using Python

Frank Du

49,415 views

Published on 1 Aug 2024
Ever encountered the pain of extracting tabular data from PDF files?
Look no further! Luckily, the Python module Camelot makes this easy.
Camelot documentation: camelot-py.readthedocs.io/en/...
The text-based version of this tutorial: www.frankdu.co/tutorial/extra...
Also, check out my whole channel here for some other interesting tutorials as well: frankdu.co/youtube

Comments • 56

@frankdu7364 5 years ago ⁺⁶
Hi guys,
It seems this video is gaining some traction. If you'd like to support this channel, please consider watching my other tutorials as well: frankdu.co/youtube. Thank you so much.
@artdoneus 4 years ago ⁺²
By far the most useful and clear video that I've seen on this topic. Thank you for your efforts!
@Airsoftcan737 5 years ago ⁺⁴
Would it be possible to extract only specific tables? For example, you have several PDFs and you want to extract the one table that has the information you want. Thanks.
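One possible approach, sketched under assumptions: each table Camelot returns exposes a pandas DataFrame as `.df`, so the table of interest can be picked by searching its cells for a known word. The helper name `find_table` and the keyword are hypothetical, not part of Camelot.

```python
import pandas as pd


def find_table(frames, keyword):
    """Return the first DataFrame whose cells contain `keyword`, else None.

    `frames` is any iterable of DataFrames, for example
    (t.df for t in tables) where `tables` came from camelot.read_pdf(...).
    """
    needle = keyword.lower()
    for df in frames:
        # Cast to str so the substring search works on every column.
        hits = df.astype(str).apply(
            lambda col: col.str.lower().str.contains(needle, regex=False)
        )
        if hits.to_numpy().any():
            return df
    return None
```

If the table always sits at the same position on the page, Camelot's `table_areas` argument to `read_pdf` may be a better fit; the documentation describes the coordinate format.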
@torrentinocom 4 years ago
Hi! How can I also get the titles of tables, which actually lie outside the table (at the top left of the table)?
@AmitKumar-dt7sz 2 years ago
Extremely helpful video. Thanks for sharing
@alaue 3 years ago
Thank you, this video helped me a lot.
@satyamgupta1105 5 years ago ⁺¹
It only parses PDFs whose tables have separation lines. Is there any other library that can parse tables in PDFs with no separation lines?
@akshayakmahanand3632 4 years ago
I have a PDF with multiple tables in it. I am using the `for table in tables` syntax but getting the error "IndexError: list index out of range".
@asishraz6173 4 years ago ⁺²
Very helpful video, I must say. Thank you for sharing with us. But I just wanted to ask: does the Camelot package not work with scanned images or scanned PDFs?
Please let me know if you know a solution. I have tried many approaches, but I am not able to extract the table data from a scanned image or PDF.
@mmgwengi 4 years ago
Can you extract a specific table from a page that has multiple tables?
@tlrlutz 4 years ago
I am following the instructions provided by Camelot, and when I check the version of Ghostscript (gswin64c.exe -version) on my command line, my PC says "This app can't run on your PC. To find a version for your PC, check with the software publisher", then the command prompt says "Access is denied". Any solutions?
@jonathanfriz4410 3 years ago
Hi, very good video. I don't remember if you mentioned this:
Camelot won't work with image-based PDFs, only with text-based PDFs (so a PDF that comes from scanned paper won't work).
It will only take out the tables, not the text.
On OSX, with a text-based PDF you can very likely use Quick Look and just copy and paste. It will work in a bunch of cases.
For image-based PDFs I try easyocr and pdf2image.
@khanabbas4608 3 years ago
Sir, for Ghostscript, do I need to download both GNU and Artifex, or just one? Many thanks!
@DRocksRecords 4 years ago
Thank you very much
@lidory98 2 years ago
How do I get rid of the first row of the indexes?
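If the unwanted row is the line of integer column labels (0, 1, 2, ...) that pandas shows above the data, one option is to promote the table's first data row to be the header. This is a minimal sketch assuming the DataFrame comes from a Camelot table's `.df`; `promote_header` is a hypothetical helper name.

```python
import pandas as pd


def promote_header(df):
    """Use the first row as column names and drop it from the data."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    return out
```

When exporting, `out.to_csv("table.csv", index=False)` additionally leaves out the row-index column.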
@hayathbasha4519 3 years ago
Hi,
I have a large PDF that Camelot takes a lot of time to read.
Is it possible to read one page at a time?
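Reading page by page can be sketched with a generator that passes a single page number to `camelot.read_pdf` on each step. The page count `n_pages` is assumed to be known already (for example from another PDF library), and both `iter_page_tables` and its `reader` parameter are hypothetical names used for illustration.

```python
def iter_page_tables(pdf_path, n_pages, reader=None):
    """Yield (page_number, tables) for one page at a time."""
    if reader is None:
        import camelot  # lazy import; requires camelot-py to be installed

        reader = camelot.read_pdf
    for page in range(1, n_pages + 1):
        # Camelot expects 1-indexed page numbers passed as a string.
        yield page, reader(pdf_path, pages=str(page))
```

Because it is a generator, each page is parsed only when the loop asks for it, so results can be saved or discarded before the next page is read.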
@madhurisree1687 4 years ago ⁺⁴
Hi, I want to extract an invoice PDF file to CSV or Excel. How can I do that? Please reply. Thank you.
@Htyagi1998 2 years ago
You can use LayoutLM.
@ashu60071 3 years ago
I am trying to extract a table from a PDF as you showed, but the contents are not coming through. Only the structure of the table is extracted, not its contents.
@monkey4hire69 6 months ago
Frank! Excellent video. Quick question: if I have many, many tables in many PDFs, how could I append them to the same workbook but on new sheets? Thanks in advance!
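One way this could be done, as a sketch rather than a tested recipe: collect each table's DataFrame under a sheet name and write them all through a single `pandas.ExcelWriter`. The helper names `sheet_name` and `write_workbook` are hypothetical; the 31-character limit and the forbidden characters are Excel's own sheet-name rules.

```python
import re


def sheet_name(pdf_stem, table_index):
    """Build an Excel-safe sheet name (max 31 chars, none of []:*?/ or backslash)."""
    suffix = f"_t{table_index}"
    clean = re.sub(r"[\[\]:*?/\\]", "_", pdf_stem)
    return clean[: 31 - len(suffix)] + suffix


def write_workbook(frames_by_name, out_path):
    """Write each DataFrame in the dict to its own sheet of one workbook."""
    import pandas as pd  # lazy import; an Excel engine such as openpyxl is also needed

    with pd.ExcelWriter(out_path) as writer:
        for name, df in frames_by_name.items():
            df.to_excel(writer, sheet_name=name, index=False, header=False)
```

A caller would loop over the PDFs, fill a dict such as `{sheet_name("report", i): t.df for i, t in enumerate(tables)}`, and pass it to `write_workbook`.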
@hayathbasha4519 3 years ago
Hi,
I have a table that starts on page 1 and ends on page 2.
Page 1 includes the header and rows.
Page 2 contains only rows.
In such a case, how do I extract the page 2 data using Camelot?
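A possible way to handle a table split across pages is to extract each page separately and then join the pieces in pandas, taking the column names from the first page. The sketch below assumes every piece has the same number of columns and that the first row on page 1 is the header; `stitch_pages` is a hypothetical helper name.

```python
import pandas as pd


def stitch_pages(first_page_df, continuation_dfs):
    """Join a table split across pages, using the first page's first row as header."""
    header = first_page_df.iloc[0].tolist()
    # Skip the header row on page 1, then append the rows from later pages.
    parts = [first_page_df.iloc[1:]] + list(continuation_dfs)
    body = pd.concat(parts, ignore_index=True)
    body.columns = header
    return body
```

With Camelot this might be called as `stitch_pages(tables[0].df, [tables[1].df])` after reading both pages, assuming each page yields exactly one table.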
@sathwikameenabad9789 4 years ago ⁺²
read_pdf() is not working for me. Can you please help me with that?
The error is: "Please make sure that Ghostscript is installed".
I installed Ghostscript and also added it to the path.
Please help me with this.
@jorgemayorga7600 3 years ago
I'm having the exact same issue. Did you find a solution?
@sreigurushyam 5 years ago
Hi, can I get the table title as well? If yes, what should I do to get it?
@frankdu7364 5 years ago ⁺¹
Hi,
Thanks for your question! It seems Camelot won't be very handy for such a job. Camelot is a master at extracting pure tabular data. It looks like you want to extract text content. Maybe the Python module PyPDF2 is something you're looking for? Let me know.
Thanks.
Frank
@sadeksaci1247 1 year ago
How do I process a PDF file with multiple pages, please?
@MikeAkinyemi 5 years ago ⁺¹
Hi, when I run the program, I get a RuntimeError('Please make sure that Ghostscript is installed') error. I am sure Ghostscript is installed. I use Windows 10.
@mikequest4620 5 years ago
Set the path of Ghostscript.
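A quick way to check whether a Ghostscript executable is reachable through PATH is sketched below using only the standard library. The executable names are the usual ones on Windows (`gswin64c`, `gswin32c`) and Unix-like systems (`gs`); note that this only inspects PATH and may not be the exact lookup Camelot itself performs. `find_ghostscript` is a hypothetical helper name.

```python
import shutil


def find_ghostscript(candidates=("gswin64c", "gswin32c", "gs")):
    """Return the path of the first Ghostscript executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If this returns `None` in the same Python session that runs Camelot, the Ghostscript `bin` folder is probably missing from PATH, or the terminal or IDE needs a restart to pick up the change.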
@artoke84 4 years ago
Hi, is it really necessary to install the pandas library, or is Camelot enough?
@frankdu7364 4 years ago
Hi David, pandas should be installed as a dependency when you install Camelot.
@ayushi896 5 years ago ⁺⁴
Hi, how can we read tables that have no borders or lines defined? Any ideas?
@AltafKhan-pm3lk 4 years ago
Did you get any answers or solutions for this?
@ananthsireesh 4 years ago ⁺¹
There are two flavours of Camelot. By default it uses "lattice", which works for tables whose cells are separated by lines, but you can also use the "stream" flavour, which works for tables that have whitespace between cells. You can refer to the documentation.
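The two flavours can be tried in turn, as in this minimal sketch: it calls `camelot.read_pdf` with `flavor="lattice"` first and falls back to `flavor="stream"` when no tables are found. The function name `read_with_fallback` and its `reader` parameter are hypothetical and exist only to keep the example easy to test.

```python
def read_with_fallback(pdf_path, pages="1", reader=None):
    """Try the lattice flavour first, then stream; return (flavor, tables)."""
    if reader is None:
        import camelot  # lazy import; requires camelot-py to be installed

        reader = camelot.read_pdf
    for flavor in ("lattice", "stream"):
        tables = reader(pdf_path, pages=pages, flavor=flavor)
        if len(tables) > 0:
            return flavor, tables
    return None, []
```

Stream results depend heavily on how the text is laid out, so it is worth inspecting each returned table's `.df` before trusting it.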
@AyushSharma-bn2js 5 years ago
It's only reading the first page of the PDF. What should I do?
@saurabhrawat5999 5 years ago
Yes, I am also facing the same problem. It's just reading the first page of the PDF. Any suggestions?
@saurabhrawat5999 5 years ago
Try pages='1,2' or pages='all'.
That worked for me.
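As a small illustration of that suggestion: `camelot.read_pdf` reads only page 1 unless its `pages` argument says otherwise, and that argument is a string such as `'1,2'`, `'1,4-end'` or `'all'`. The helper names `pages_arg` and `read_tables` below are hypothetical conveniences, not part of Camelot.

```python
def pages_arg(pages):
    """Build the string Camelot expects for `pages` from a string or a list of ints."""
    if isinstance(pages, str):
        return pages  # already in Camelot's format, e.g. "all" or "1,4-end"
    return ",".join(str(p) for p in pages)


def read_tables(pdf_path, pages="all"):
    """Read tables from the given pages (all pages by default)."""
    import camelot  # lazy import; requires camelot-py to be installed

    return camelot.read_pdf(pdf_path, pages=pages_arg(pages))
```

For example, `read_tables("report.pdf", [1, 2])` would pass `pages="1,2"`; the file name is a placeholder.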
@HemantKumar-iy7dn 5 years ago
When we export all tables, it makes multiple CSV files. I want one file with merged indexes. Any suggestions?
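One way to get a single file is to stack the DataFrames from every table with `pandas.concat` and write the result once. This is a sketch that assumes the tables share the same column layout; `merge_tables` is a hypothetical helper name.

```python
import pandas as pd


def merge_tables(frames):
    """Stack DataFrames vertically with one continuous 0..n-1 index."""
    return pd.concat(list(frames), ignore_index=True)
```

With Camelot this might look like `merge_tables(t.df for t in tables).to_csv("all_tables.csv", index=False, header=False)`, where the output file name is a placeholder.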
@jessicalee5175 4 years ago ⁺²
Hi, would you have a recommendation if I'm trying to extract a PDF file like a bank statement to CSV or Excel?
@frankdu7364 4 years ago ⁺¹
Hi Jessica,
Thanks for your comment. So Camelot didn't work out for you?
A general approach could be:
1. Use another PDF parser, such as PyPDF2, to extract the raw text.
2. If your text follows a certain pattern, you might be able to parse the raw text line by line (you can do some filtering as well, of course).
3. Write the parsed text to Excel or CSV. There is a plethora of tools you can use: the Python csv module, pandas, openpyxl, etc.
But the challenge here is the PDF parsing part.
If you don't mind sharing the file, I can have a look and try to release a new tutorial based on your case.
Let me know.
Frank
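The line-by-line parsing and CSV export described in this reply could look roughly like the sketch below. The line layout (date, description, amount) and the regular expression are purely hypothetical; a real statement would need its own pattern, and the raw text is assumed to have been extracted already by a PDF parser.

```python
import csv
import io
import re

# Hypothetical line layout: "2024-01-05  Coffee shop  -3.50"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?\d+(?:\.\d{2})?)$")


def parse_statement(raw_text):
    """Keep only lines matching the pattern; return (date, description, amount) tuples."""
    rows = []
    for line in raw_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append(match.groups())
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "description", "amount"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Lines that do not match, such as page headers and footers, are simply skipped, which is the filtering step mentioned above.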
@jessicalee5175 4 years ago
@frankdu7364 Hi Frank! Thanks so much for replying. The files are mostly clients' files. I can try to create my own PDF that is similar. Would you have an email I can send it to?
@frankdu7364 4 years ago ⁺²
@jessicalee5175 Yes, Jessica. Just send it to robot80053906@gmail.com. I will have a look and create a tutorial about it. Let me know here when you've sent it.
Best
@berlusconitripurba2475 4 years ago ⁺¹
@jessicalee5175 Hello Jes. Thank you for asking about this. I have a similar case. Would you mind brainstorming about it? #BankStatement
@DRocksRecords 4 years ago ⁺¹
@frankdu7364 This is a hilarious email address, I love it.
@luckysunda9623 4 years ago
Hi, Thanks for the video. I am getting no tables for the pdfs I want :(
@billbarron8666 3 years ago
Same here, have you been able to fix this?
@luckysunda9623 3 years ago
@billbarron8666 No. The tables were really complicated in my case; even ABBYY was not able to do a good job there.
@billbarron8666 3 years ago
@luckysunda9623 You need camelotpro.
@genieur8188 1 year ago
Generally OK. But if you typed a bit slower, you wouldn't have to correct so much of your typing.
@engineerbaaniya4846 2 years ago
Disliked, as it says Ghostscript is not installed. Please provide complete information.
@cientificodedatos3292 2 years ago
buuuuuuuu
