Vinayak Mehta - Extracting tabular data from PDFs with Camelot & Excalibur - PyCon 2019

  • Published on 15 Jan 2025

Comments • 18

  • @EricPalmer_DaddyOh
    @EricPalmer_DaddyOh 5 years ago +6

    Awesome. I'm going to try this out soon on open data PDF files. Looks like just what I need.

  • @christianlira1259
    @christianlira1259 4 years ago

    Thank you, Vinayak Mehta, for the great presentation and the tons of work you put in. This is a great tool, and I look forward to reading about your OCR capabilities.

  • @Yelonek1986
    @Yelonek1986 3 years ago

    Awesome, thanks for this library! It works like a charm.

  • @venkateswaraotella6581
    @venkateswaraotella6581 1 year ago

    I need to extract the document exactly as it is. Where do I need to change the code?

  • @muhammadahsam1346
    @muhammadahsam1346 5 years ago

    Awesome library, but how do we swap the columns after converting the output to Excel or CSV format?
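
One way to approach the column-swap question is to reorder the exported CSV after the fact. The sketch below uses only Python's standard `csv` module; the function name `reorder_columns` and the sample column names are hypothetical, and it assumes the first CSV row holds unique column names. If you still have the Camelot table in memory, its `.df` attribute is a pandas DataFrame, so selecting columns in a new order (for example `df[[1, 0]]`) before exporting should achieve the same thing.

```python
import csv
import io


def reorder_columns(csv_text, new_order):
    """Return csv_text with its columns rearranged to match new_order.

    Hypothetical helper: assumes the first row of csv_text is a header
    and that every name in new_order appears in that header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=new_order, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep only the requested columns, in the requested order.
        writer.writerow({name: row[name] for name in new_order})
    return out.getvalue()


if __name__ == "__main__":
    sample = "date,amount\n2019-05-01,10\n2019-05-02,20\n"
    print(reorder_columns(sample, ["amount", "date"]))
```

Columns left out of `new_order` are dropped, so the same helper can also be used to trim the table.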

  • @csdevendrajain9114
    @csdevendrajain9114 4 years ago

    Ghostscript does not work on my PC. I have done everything, including adding it to the path and environment variables, but every time I get an error saying this app can't run on my PC, along with "access denied", on Windows 8.1.

  • @hayathbasha4519
    @hayathbasha4519 3 years ago

    Hi,
    I have a table that starts on page 1 and ends on page 2.
    Page 1 includes the header and rows.
    Page 2 contains only rows.
    In such a case, how do I extract the page 2 data using Camelot?
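
A possible approach to a table that spans two pages is to extract each page separately and join the rows afterwards. The sketch below is not an official Camelot recipe: `stitch_tables` and `read_spanning_table` are hypothetical helper names, and the second one assumes Camelot is installed and finds exactly one table per page. `camelot.read_pdf` does accept a `pages` string such as `"1,2"`, and each returned table exposes its data as a pandas DataFrame via `.df`.

```python
def stitch_tables(page_tables):
    """Join tables from consecutive pages into one list of rows.

    page_tables: list of tables, each a list of rows (lists of cell strings).
    The first row of the first table is treated as the header; if a later
    page repeats that header, the repeat is dropped.
    """
    if not page_tables or not page_tables[0]:
        return []
    header = list(page_tables[0][0])
    stitched = [list(row) for row in page_tables[0]]
    for table in page_tables[1:]:
        for row in table:
            if list(row) == header:
                continue  # skip a header repeated on a continuation page
            stitched.append(list(row))
    return stitched


def read_spanning_table(pdf_path, pages="1,2"):
    """Hypothetical helper: assumes one table per page, in page order."""
    import camelot  # imported lazily so stitch_tables works without it

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return stitch_tables([table.df.values.tolist() for table in tables])
```

If the column boundaries detected on page 2 differ from those on page 1, the rows will not line up, so it is worth checking the column count of each page before stitching.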

  • @amitkumdixit
    @amitkumdixit 5 years ago +3

    Not working, failed miserably. It only showed the first row of the tables. Tabula gave me perfect results. I wanted to extract tables from a bank account statement.

    • @quantumcd1045
      @quantumcd1045 5 years ago

      Anywhere I can find more information on this? I'm trying to do the same thing.

    • @srichandana602
      @srichandana602 4 years ago

      @quantumcd1045 Hi, can Camelot work on non-editable PDFs? I tested it, but it doesn't give me any results.

    • @amankr1993
      @amankr1993 4 years ago

      sri chandana It doesn't. It only works with editable and searchable PDFs. However, Tesseract has functionality that can convert a PDF into an editable version. Try that and then pass the result to Camelot. Should work fine. :)
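
The OCR-first workflow suggested above could look roughly like the following. This is a sketch under assumptions: the Tesseract command-line tool is installed and on the path, the scanned page has already been saved as an image (Tesseract reads image files), and the file names and helper names are hypothetical. Running `tesseract <image> <outputbase> pdf` writes `<outputbase>.pdf` with a searchable text layer, which can then be passed to `camelot.read_pdf`.

```python
import subprocess


def build_ocr_command(image_path, output_base, lang="eng"):
    """Build the Tesseract call that writes <output_base>.pdf with a text layer."""
    return ["tesseract", image_path, output_base, "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the searchable PDF it produces.

    Assumes the `tesseract` executable is installed and on the PATH.
    """
    subprocess.run(build_ocr_command(image_path, output_base, lang), check=True)
    return output_base + ".pdf"


def extract_tables_from_scan(image_path, output_base):
    """Hypothetical end-to-end helper: OCR a page image, then try Camelot on it."""
    import camelot  # imported lazily; only needed for this last step

    pdf_path = ocr_to_searchable_pdf(image_path, output_base)
    # "stream" relies on text positions, so it is a reasonable first try here.
    return camelot.read_pdf(pdf_path, flavor="stream")
```

How well this works depends heavily on the quality of the scan, since Camelot can only place the text that OCR managed to recognise.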

    • @srichandana602
      @srichandana602 4 years ago +1

      @amankr1993 Hi, thanks for your reply. I tried all these things, but they all depend on image quality and did not give me good results. In the end I built my own tool to extract the tabular data to Excel :)

    • @amankr1993
      @amankr1993 4 years ago

      sri chandana Yes, it does depend a lot on the quality of the image.
      And it's great that you built your own. Would you mind sharing it? Only if that's okay with you.

  • @Mach7RadioIntercepts
    @Mach7RadioIntercepts 4 years ago

    Nice talk! Monty Python LOL. Dude, I knew I was going to be a big MP fan when I was punished in grade school for acting out the stoning scene in "The Life of Brian".
    Hehe, to grow up and write lots of code in Python.

  • @srikantpadhy9476
    @srikantpadhy9476 4 years ago

    Do Camelot and Excalibur work for scanned PDFs?

    • @torrentinocom
      @torrentinocom 4 years ago

      I can only guess: Camelot recognises the table's contours by converting the page to an image, and then places each text element into the closest cell of the recognised table.
      Finding the text elements is pdfminer's responsibility. So if pdfminer can't recognise the text that lies in a cell, Camelot simply has no text to put in that cell.
      But that's just my guess.
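
The guess above can be illustrated with a toy model. This is not Camelot's actual code: the function name, the cell rectangles, and the text positions are all made up for illustration. It only shows the idea that cells come from the detected grid while text comes from a separate text layer, so a scanned page with no text layer yields an empty table.

```python
def assign_text_to_cells(cells, text_boxes):
    """Place each text fragment into the grid cell that contains its centre.

    cells: dict mapping (row, col) -> (x0, y0, x1, y1) rectangle.
    text_boxes: list of (text, x, y) tuples, where (x, y) is the centre
    of the text fragment. Fragments outside every cell are ignored.
    """
    grid = {key: [] for key in cells}
    for text, x, y in text_boxes:
        for key, (x0, y0, x1, y1) in cells.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                grid[key].append(text)
                break  # a fragment belongs to at most one cell
    return {key: " ".join(parts) for key, parts in grid.items()}


if __name__ == "__main__":
    cells = {(0, 0): (0, 0, 50, 20), (0, 1): (50, 0, 100, 20)}
    # Text layer present: both cells get filled.
    print(assign_text_to_cells(cells, [("Date", 25, 10), ("Amount", 75, 10)]))
    # Scanned page with no text layer: the grid exists but every cell is empty.
    print(assign_text_to_cells(cells, []))
```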

  • @ShiquanWang
    @ShiquanWang 5 years ago

    Regarding the first question, which said there is no good tool to convert a PDF file to HTML while keeping its original layout/look:
    Please check this project: github.com/coolwanglu/pdf2htmlEX
    It converts a PDF file to HTML while keeping exactly the same look.
    It's a pity this project is no longer maintained.