[19] Convert a multi-page PDF file into CSV / Excel with Python

  • Published on 27 Dec 2024

Comments • 141

  • @mampiisaotaku
    @mampiisaotaku 2 years ago +1

    Aahh! I am so happy to find a fellow accountant doing Python!! Greetings, mate!

  • @sebastianpadilla8109
    @sebastianpadilla8109 4 years ago +11

    Wow great, I'm just getting started with Python and realizing things like that can be done, it's awesome, thanks for sharing!

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      Thanks for the note! Glad you find these helpful!

  • @SUNILKUMAR-sj5dp
    @SUNILKUMAR-sj5dp 1 year ago +1

    Clear, concise. Best wishes and continued success!!

  • @travisyin884
    @travisyin884 1 year ago +1

    Found this piece of gold today. Thank you for sharing your skills and for the clear explanation ~

  • @baratin91
    @baratin91 2 years ago +1

    This is some serious stuff, man. Thanks a lot! I have a similar issue: some clients send a hell of a lot of income statements and ledgers in PDF format, which I currently convert to XLS tables manually, and that drives me mad. What can I say, the client is always right. I don't know much Python so far, but I intend to eviscerate your brilliant example and adapt it to my needs...

  • @gusestrella
    @gusestrella 3 years ago +4

    WOW - what a very useful and simple to follow example. If not there already, you have a great future as a teacher for sure :)

  • @ED85
    @ED85 3 years ago +1

    I love that you sum check all of the data... you know what I mean...

  • @datalyticsbootcamp
    @datalyticsbootcamp 3 years ago +2

    Great video! Clear, concise, and just what I was looking for.

  • @JuanPerez-iu9vk
    @JuanPerez-iu9vk 6 months ago +1

    Wonderfully explained, thank you so much.

  • @danbates2760
    @danbates2760 2 years ago

    Thank you very much. I have a report from Hades that is not far off from what you so clearly laid out.

  • @sergeishakhov5193
    @sergeishakhov5193 8 months ago +1

    Respect! Great video, super explanation.

  • @Shivam_Manswalia
    @Shivam_Manswalia 4 years ago +5

    That's what I was looking for.

  • @barath961
    @barath961 3 years ago +2

    Bravo! Bravo! Literally Bravo!!!

  • @clear_vision_
    @clear_vision_ 5 months ago +1

    Thank you for this video!

  • @ChallengeFishing
    @ChallengeFishing 4 years ago +1

    Super useful, needed this for reconciling investment statements.

  • @unknowntech7
    @unknowntech7 3 years ago +1

    Woah, great work here! Trying to learn and accomplish something similar myself. Thanks!

  • @SK-jv2ro
    @SK-jv2ro 3 years ago +2

    Thank you. Could we have one standard program that can read receipts, e.g. from Whole Foods, Walmart, CVS, etc.? For these receipts only certain information is different, but the items and description fields (except the description names) are the same.

  • @anjelninja8952
    @anjelninja8952 2 years ago +1

    Is there a way to do the same thing with a JPG instead of a PDF?

  • @mariordz76
    @mariordz76 2 years ago +1

    Great video, thanks.

  • @stephenpereira7306
    @stephenpereira7306 2 years ago

    Great work, mate.

  • @israelgonzalez677
    @israelgonzalez677 3 years ago +1

    Awesome video!

  • @SamEdwardes
    @SamEdwardes 4 years ago +1

    Great tutorial! Thank you for creating.

  • @awesh1986
    @awesh1986 11 months ago +1

    Awesome stuff

  • @mellismellis-c5n
    @mellismellis-c5n 5 months ago +1

    Very good

  • @alvin3428
    @alvin3428 2 years ago +3

    Hey, can this work for PDFs with different formats? The differences are small; for example, an invoice can come in several formats. Can we use the same logic there as well? Please help, I am trying to do this for my final year project. Also, thank you for explaining it so well.

    • @mShaheerKhan-vf3ro
      @mShaheerKhan-vf3ro 2 days ago

      Did you finish your FYP? Also, did you get around to handling different semi-structured PDFs?

  • @rkeenan85
    @rkeenan85 4 years ago +1

    This is fantastic. Exactly what I need.

  • @wirechair
    @wirechair 3 years ago

    You are the coolest ever

  • @billlathrop3986
    @billlathrop3986 4 years ago +1

    Hi - just discovered your videos and appreciate the introduction to reading PDFs with Python. I've been working with a larger PDF with a big section that is rotated horizontally. That is the section that I want to capture. I've been able to load the PDF and read it, but the orientation is messing with the interpreter. The lines and words are loaded as if it were reading down the columns, not across the page. I can see that there is a rotation feature, but when I modify the value the results do not change. Any advice? Thanks in advance - nice work on your side.

    • @billlathrop3986
      @billlathrop3986 4 years ago +1

      So - if you have an answer, I would love to hear it. But I did solve the problem by using PyPDF2 to extract and rotate the pages I needed to analyze, and then ran them through pdfplumber. While I haven't had a chance to parse the text lines yet, I do have a series of lines that looks appropriate. Thanks, Bill

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      I’d try Bill’s suggestion. Basically, you want to rotate the page using a method that permanently rotates it to the correct position, rather than just rotating the view.

  • @MahaCollegesafar
    @MahaCollegesafar 2 years ago

    Hey, can we connect? I need some help extracting data tables from PDFs.

  • @amithshambu7181
    @amithshambu7181 3 years ago +1

    This man is a god! Thanks a ton, brother!!!

  • @JonathanCrescini
    @JonathanCrescini 4 years ago +1

    Exactly what I needed! Thanks for sharing!

  • @jgwang7968
    @jgwang7968 3 years ago

    I am trying to extract specific data, e.g. only Date, Gross and VAT. I found another video that uses 're.compile' and 'finditer' to locate the words, but when I tried them after 'for line in text.split('\n'):' it doesn't return the short answers I'm looking for; it still returns all of the text. Could you give me some advice?

  • @enzodaniellunacarabajal3196
    @enzodaniellunacarabajal3196 3 years ago +1

    Thanks for sharing. Excellent!

  • @missing1person
    @missing1person 3 years ago

    My variables inside this line, lines.append(Line(vend_no, vend_name, doctype, *items)), are coming back as unidentified. What is the problem? I'm doing a project very similar to this.

  • @sharadaprasad
    @sharadaprasad 3 years ago

    Thank you so much for what you do!

  • @azharalam16
    @azharalam16 3 years ago +1

    Amazing tutorial! Quick question - How would you tackle this problem if all your data didn't fall so nicely under the overarching column headings? I.e., what if there was an additional column for the country and the country name had two words e.g., 'United States', 'United Kingdom' etc.? Thanks again!

    • @PythonicAccountant
      @PythonicAccountant  3 years ago +1

      Each document has to be taken case by case. In that scenario it would depend on where that column fell. If there was a clear pattern before or after that column (e.g. a specific length of digits before and a $ after), I could use regex to identify what’s before and after, with everything in the middle belonging to that country column.
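
A minimal sketch of the approach in the reply above, assuming a hypothetical row layout: a six-digit part number, then the country (one or more words), then an amount that starts with `$`. The sample line and the variable names are made up for illustration and do not come from the video.

```python
import re

# Assumed layout: 6-digit part number, country (any number of words), "$" amount.
# The lazy (.+?) takes whatever sits between the two fixed patterns.
row_re = re.compile(r"^(\d{6}) (.+?) \$([\d,]+\.\d{2})$")

line = "104233 United Kingdom $1,250.00"  # hypothetical sample row
match = row_re.match(line)
if match:
    part_no, country, amount = match.groups()
    print(part_no, country, amount)
```

Because the digits and the `$` pin down both ends, the country can be one word or several without changing the pattern.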

  • @MilkmanBro
    @MilkmanBro 3 years ago

    Hi, my re.compile function doesn't seem to light up like yours. Is this an issue?

  • @mowburnt
    @mowburnt 3 years ago +1

    Awesome video. One question: rather than me using the CSV to create a pivot table etc., could you automate a graphical plot of sales by company and/or by part number over a given time frame to help quickly spot trends? Could this be extended to plot sales of multiple customers in the same chart? Kind of new to all this. Can send some example data if it helps.

    • @PythonicAccountant
      @PythonicAccountant  3 years ago +1

      Sure, that would be easily doable if you have the data. You would just need to add a field for the report date and use that for the x-axis.

  • @acmccutcheon
    @acmccutcheon 3 years ago

    Amazing video - concise

  • @datalyticsbootcamp
    @datalyticsbootcamp 3 years ago +2

    I learned so much and have automated a task thanks to this video - watched the video a good 30 times. Any recommendations on how to learn to loop to the next file? Preferably would like to automate the processing of multiple files at once.

    • @PythonicAccountant
      @PythonicAccountant  3 years ago +2

      Sure, that’s easy! If the files are the same format, you can create a function that takes a file name as input and, inside the function, runs all the steps needed to read the file, parse it, and write the output. Then you can create a list of file names and iterate through them, calling the function on each one. You could either create the file name list manually or build it with pathlib or the os module.
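
A rough sketch of that loop, assuming all the PDFs sit in one folder. `process_file` is a hypothetical stand-in for the read, parse, and output steps from the video; here it only returns the name of the CSV it would write, so the example runs without any PDF library.

```python
from pathlib import Path


def process_file(pdf_path):
    """Placeholder for the per-file steps: read, parse, and write the output.

    Hypothetical stub: it only returns the name of the CSV it would create.
    """
    return pdf_path.with_suffix(".csv").name


def process_folder(folder):
    """Call process_file on every PDF in the folder, in file-name order."""
    return [process_file(path) for path in sorted(Path(folder).glob("*.pdf"))]


# "." is the current working directory; swap in the folder that holds the PDFs.
print(process_folder("."))
```

Building the list with `Path.glob` avoids typing file names by hand; a manually written list of names would work with the same loop.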

  • @mpk2583
    @mpk2583 3 years ago

    I'm using pdfplumber, but with some invoices I'm reading, I get (cid: xx) instead of text (where xx is some number). Any idea how to decode these cid values? I've had no luck searching for a solution myself.

  • @webdev723
    @webdev723 4 years ago +1

    Great job.

  • @timkong5149
    @timkong5149 4 years ago +1

    Hi, I have a couple of questions. What do (.*) and (*items) mean/do?

    • @PythonicAccountant
      @PythonicAccountant  4 years ago +1

      The first pattern, .*, is used in the “re” (regular expression) context, which is for pattern matching. The “.” means any single character, and the “*” means zero or more of the previous pattern. So “.*” matches any run of characters (on one line, by default), and it’s usually used to catch everything between other patterns defined before and after it. For more info on regular expressions I suggest checking out Al Sweigart’s fantastic content: automatetheboringstuff.com/chapter7/
      For your second question about *items: in this context I am using Python’s argument unpacking, where a “*” in a function call unpacks an iterable into separate arguments. If I didn’t use the “*”, the list would have been passed as a single item rather than each item individually, which would have thrown an error because Line would not have received enough arguments. Trey Hunner has an awesome article on the use of asterisks in Python: treyhunner.com/2018/10/asterisks-in-python-what-they-are-and-how-to-use-them/

    • @timkong5149
      @timkong5149 4 years ago +1

      Thank you so much for your detailed reply!
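
To make the two answers in this thread concrete, here is a small sketch. The sample line, the regex, and the field names `doc_type` and `amount` are invented for illustration; only `Line` and `*items` echo names used in the discussion.

```python
import re
from collections import namedtuple

Line = namedtuple("Line", "vend_no vend_name doc_type amount")

# (.*) captures everything between the leading digits and the trailing amount.
row_re = re.compile(r"^(\d+) (.*) (\d+\.\d{2})$")

match = row_re.match("1001 Acme Supply Co 250.00")  # hypothetical sample row
vend_no, vend_name, amount = match.groups()

items = ["INV", amount]
# *items unpacks the list, so Line receives four separate arguments.
record = Line(vend_no, vend_name, *items)
print(record)
```

Calling `Line(vend_no, vend_name, items)` without the asterisk would pass three arguments to a four-field tuple and raise a `TypeError`.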

  • @Ndofi
    @Ndofi 3 years ago +1

    great one

  • @nanairo2672
    @nanairo2672 4 years ago +4

    Thanks dude, my boss will give me more tasks from now on.

    • @mowburnt
      @mowburnt 3 years ago

      Not if you don't tell them ;-)

  • @marc10uae
    @marc10uae 4 years ago

    Thanks for this - how come you chose pdfplumber as opposed to PyPDF2 or PyPDF4?

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      I don’t recall exactly, but I think I found pdfplumber to be either more Pythonic or to have more functionality.

  • @shawnlee8135
    @shawnlee8135 4 years ago

    Hi, may I know what packages are required? I am using PyCharm with Anaconda, but it seems I am missing a few packages here.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      In general you can tell what packages are needed by looking at the import statements in the code. You can also tell from the error message you get in the traceback. In this specific case you would need to install pdfplumber, and the rest should already be included in the Anaconda distro.

  • @hari-codes
    @hari-codes 4 years ago

    What should I do if one cell in the row is just 3 words on a single horizontal line, but another cell in the same row has multiple lines distributed vertically? (When I tried splitting by "\n", it treated the lengthy cell as multiple individual lines.)

    • @PythonicAccountant
      @PythonicAccountant  4 years ago +2

      Yeah, that can cause some challenges. If you don’t need the full text, you can just ignore those extra lines. If you do want the full text, you’ll need some way to tell whether you have reached the next row. Until you have, keep appending each line’s content to the string for that cell; once you’ve passed the last line of additional cell text, add the full record to your list of records. I’ll usually use a Boolean flag for that, like new_row=True: flip it to False when you reach the first line of a new row, then check the flag on each following line. If you are not at a new row, keep appending; otherwise flip it back to True and add the record to your list.
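
One way this could look in code, under a stated assumption: a new table row starts with a numeric ID, and any other line continues the previous row's description cell. Instead of a separate Boolean flag, this sketch keeps the record being built in `current` and treats "a line matches the row pattern" as the new-row signal. The sample text is made up.

```python
import re

# Assumption: a new table row starts with a numeric ID; any other line
# continues the previous row's description cell.
new_row_re = re.compile(r"^(\d+) (.*)$")

text = """101 Widget with a long
description that wraps
102 Gadget
103 Another item that
also wraps"""

records = []
current = None  # the record being built, or None before the first row
for line in text.split("\n"):
    match = new_row_re.match(line)
    if match:
        # Reached a new row: store the finished record, then start the next.
        if current is not None:
            records.append(current)
        current = [match.group(1), match.group(2)]
    elif current is not None:
        # Continuation line: append it to the description cell.
        current[1] += " " + line.strip()
if current is not None:
    records.append(current)  # do not forget the last record

print(records)
```

A continuation line that happens to start with digits and a space would be misread as a new row, so the row pattern needs to be as specific as the real document allows.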

    • @walkwithus6536
      @walkwithus6536 2 years ago

      @PythonicAccountant Hi, if we have multiple tables, how can we extract them? Suppose we have 3k tables in 20 PDF files.

  • @serigamel
    @serigamel 3 years ago +1

    Will this work for scanned documents in PDF?

    • @PythonicAccountant
      @PythonicAccountant  3 years ago +1

      This method will not work for scanned PDFs as is, but there are a few other Python options that can work decently well depending on the quality of the scan.

  • @SergejShishkin
    @SergejShishkin 3 years ago +1

    Terrific!

  • @bhaumiksoni2009
    @bhaumiksoni2009 3 years ago

    Can you help me with my project? I have a PDF whose pages differ a little from one another, but can you still help me?

  • @007vipere
    @007vipere 3 years ago

    I am using Jupyter Notebook and I get this error: ImportError: cannot import name 'namedtuple' from 'collection'

  • @riti_chrea
    @riti_chrea 4 years ago +1

    Do you do freelance work? I am looking for someone to create a Python script to parse PDF invoice data into CSV or JSON.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago +1

      No, but I’m sure you can find lots of freelancers on Fiverr or other similar sites.

    • @riti_chrea
      @riti_chrea 4 years ago

      @PythonicAccountant Thanks for responding and recommending Fiverr.
      Keep up the good work.

  • @tinoengel363
    @tinoengel363 2 years ago +1

    nice!

  • @adebolarahman9885
    @adebolarahman9885 4 years ago

    Thank you very much for this video @Pythonic Accountant. What about a table in txt format with no delimiter? Can I convert it to Excel or pandas?

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      How is it formatted? By character location? If so, you can just specify the start and end positions of each column in pandas, I believe.
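
A standard-library sketch of the "start and end position" idea, assuming each column occupies a fixed character range. The column names, ranges, and sample lines are hypothetical. In pandas, the same ranges could be passed to `read_fwf` through its `colspecs` parameter.

```python
# Assumption: each column occupies a fixed character range (start, end),
# with the end position exclusive.
colspecs = {"vendor": (0, 12), "doc_no": (12, 20), "amount": (20, 30)}

# Hypothetical fixed-width lines, 30 characters each.
lines = [
    "Acme Supply INV0001     250.00",
    "Bolt & Nut  INV0002   1,010.50",
]

# Slice every line at the same positions and trim the padding.
rows = [
    {name: line[start:end].strip() for name, (start, end) in colspecs.items()}
    for line in lines
]
print(rows)
```

The list of dictionaries can then go to `csv.DictWriter` or into a pandas DataFrame.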

  • @aramsalvanera3698
    @aramsalvanera3698 4 years ago

    Do you have a tutorial on how to split a large PDF of invoices into a small PDF for each invoice?

  • @georgealex162
    @georgealex162 4 years ago +1

    Please teach us how to compare a PDF with an Excel file.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      Any specific use cases or examples you’re looking at?

  • @breid98
    @breid98 4 years ago

    Does this work with multiple documents? Like, will it just keep adding to the same Excel sheet?

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      That’s easy to do, but the code would be a little different. You’d want to create a separate data frame for each file, then concat the data frames together once you have standardized the columns, if necessary.

  • @10straws59
    @10straws59 4 years ago

    Thank you for the tutorial! However, (probably because of the format of the pdf file I am working with), I always get rows of (cid:num)(cid:num) instead of the actual text. Do you know how I can fix this?

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      Try with a completely different PDF file. Perhaps it’s an issue with the format of that PDF

    • @luizvaz
      @luizvaz 4 years ago

      @PythonicAccountant No, it's really an issue: github.com/euske/pdfminer/issues/122

  • @MuhammadUsman-ix6jo
    @MuhammadUsman-ix6jo 1 year ago

    Can we do something like this using OpenAI/ChatGPT?

    • @PythonicAccountant
      @PythonicAccountant  1 year ago

      I love it. I think it can, but I would need to experiment with it!

  • @vivekkaranath7706
    @vivekkaranath7706 4 years ago

    Thanks. I have done it, but the only issue is that it's reading only the last page.

  • @denizalbayrak6357
    @denizalbayrak6357 3 years ago

    Super great what you did! Thanks. I just get an error NameError: name 'pdfplumber' is not defined. Any idea?

    • @PythonicAccountant
      @PythonicAccountant  3 years ago

      You probably need to import pdfplumber, and if it’s not installed, pip install it.

    • @denizalbayrak6357
      @denizalbayrak6357 3 years ago +1

      @PythonicAccountant OK, got it, the file had been renamed with .pdf.pdf

  • @nomadicshaman
    @nomadicshaman 1 year ago

    This channel is like mine: the more I dig, the more skills I get. I appreciate your videos.
    I converted a multi-page (143 pages) bank statement PDF to a CSV file of debits and credits.
    The data frame is 5 columns x 26,800 rows, and the balance is not valid.
    My question: is 26,800 the maximum row index? How can I store more data in a CSV?

  • @GuilhermeSantos-gu3ef
    @GuilhermeSantos-gu3ef 3 years ago

    Great videos!! Thanks for sharing!
    I'm having trouble creating a function that finds and prints a page based on a typed name in pdfplumber. My intent is to find a name in the page with pdfplumber and print it with PyPDF2, but the first part is not working. If you can help me, I would appreciate it very much!!

    • @PythonicAccountant
      @PythonicAccountant  3 years ago +1

      You’ll want to make sure that the case matches. You could just make everything lowercase. Iterate through each page and look for the string in that page’s text, and if it’s there, print the whole page.

    • @GuilhermeSantos-gu3ef
      @GuilhermeSantos-gu3ef 3 years ago

      @PythonicAccountant Understood... good tip!! Thanks!!
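
A small sketch of the case-insensitive search suggested in this thread. To stay runnable without a PDF, it works on a plain list of page texts; `pages_containing` and the sample texts are hypothetical. With pdfplumber, the list would be built by calling `page.extract_text()` on each page in `pdf.pages`.

```python
def pages_containing(page_texts, name):
    """Return 1-based page numbers whose text contains name, ignoring case."""
    needle = name.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # A page without extractable text may yield None or an empty string.
        if needle in (text or "").lower()
    ]


# Stand-in for the extracted text of each page.
page_texts = ["Statement for JOHN SMITH", None, "Statement for Jane Doe"]
print(pages_containing(page_texts, "john smith"))
```

The returned page numbers can then be handed to whichever library does the printing or page extraction.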

  • @scanapproved562
    @scanapproved562 4 years ago +1

    Hi. Can anyone help? It states FileNotFoundError. I've tried changing file = 'Sample Report Pythonic.pdf' to 'c:\test\Sample Report Pythonic.pdf', but it won't work. Any help appreciated. PS: This is amazing, can't wait to play with it properly.

    • @barath961
      @barath961 3 years ago

      Please check the directory you are working in and where the file is saved.
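
A quick way to run that check, plus one likely culprit in the path from the question: in a normal Python string, the `\t` in `'c:\test\...'` is read as a tab character. The short strings below are shortened stand-ins for the full path.

```python
from pathlib import Path

file = Path("Sample Report Pythonic.pdf")
print(Path.cwd())     # folder that relative paths are resolved against
print(file.exists())  # False means the file is not in that folder

# In a normal string, a backslash followed by "t" is a tab character.
plain = "c:\test"   # 6 characters: the "\t" became a single tab
raw = r"c:\test"    # 7 characters: the raw string keeps the backslash
print(plain == raw)
```

Writing the Windows path as a raw string, `r'c:\test\Sample Report Pythonic.pdf'`, or with forward slashes avoids the escape problem.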

  • @vivekkaranath7706
    @vivekkaranath7706 4 years ago +1

    Yes, it's working. I found the mistakes. Anyway, thanks :)

  • @vivekkaranath7706
    @vivekkaranath7706 4 years ago

    I am getting the error No module named 'pdfplumber' when I try to run the code. Please advise.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      That means that the pdfplumber module hasn’t been installed on the same environment you are running your code in. Make sure to pip install pdfplumber then try it again.

    • @vivekkaranath7706
      @vivekkaranath7706 4 years ago

      @PythonicAccountant Thanks for your reply. I have run pip install pdfplumber several times, but the same error keeps coming up. I'm using Python 3.8. Please advise, as this is an important program that helps all accountants with analysis.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      Vivek Karanath, type pip freeze in the environment you are using, and see if pdfplumber is included in that list.

    • @vivekkaranath7706
      @vivekkaranath7706 4 years ago

      I typed pip freeze in the command prompt; it's not showing anything.

    • @PythonicAccountant
      @PythonicAccountant  4 years ago

      Vivek Karanath, it sounds like you might not have pip installed. Are you using Miniconda or Anaconda?
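
For "No module named" errors like the one in this thread, a short check run from the same place as the failing code can show which interpreter is in use and whether it can see the package. `is_installed` is a hypothetical helper name; `importlib.util.find_spec` and `sys.executable` are standard library.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Report whether module_name can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


print(sys.executable)              # the interpreter actually running the code
print(is_installed("pdfplumber"))  # False: not installed for this interpreter
```

If the second line prints False, running `python -m pip install pdfplumber` with that same interpreter installs the package into the environment the code actually uses.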

  • @nilekarmayur
    @nilekarmayur 4 years ago

    Hi,
    I have a PDF file that contains a lot of data.
    I only want to extract a table and its data from the PDF, and nothing else.
    Conditions:
    1) I want to write code that takes any PDF and returns only the table (so I don't know the page number).
    2) The table can be spread across multiple pages (e.g. it starts on page 370 and ends on page 380).
    Also, I am using the latest Python, 3.8.1, and PyCharm.
    Can you please help me? Or can you give me an email address so I can send you all the data?

    • @hari-codes
      @hari-codes 4 years ago

      I'm looking for the same. Please let me know if you got it.

    • @nilekarmayur
      @nilekarmayur 4 years ago

      @hari-codes I got the answer, bro. I used tabula to convert the PDF to CSV and then read that CSV data. The data comes in the form of a 2D list like [['1.1',chapter1],['1.2',chapter1]]; now iterate over it with a for loop to access the data.

    • @srikantpadhy9476
      @srikantpadhy9476 4 years ago

      @nilekarmayur If the file is a scanned PDF, what can I do in that case?

    • @geoffreyschaeffer7694
      @geoffreyschaeffer7694 4 years ago

      @srikantpadhy9476 So you'd have to run text recognition on it. The text recognition isn't great on scanned PDFs. Just my experience, though.

  • @jacekw80
    @jacekw80 2 years ago +1

    Great video and tutorial overall!! I have a lot of cases with multiline data. In this case, how would you grab the data between the vendor name and the Supplier total, e.g. KITTLINGGAAAAAA BBOO.....TETERY PPONZEM? Thanks

  • @vissivarrel9721
    @vissivarrel9721 5 months ago +1

    I passed out while learning regex💀

  • @roberthuang3465
    @roberthuang3465 3 years ago

    That's amazing! I have a similar PDF and need to do the same thing. Could you help me write it in Python? I will absolutely pay for the work.