Automated research: Extracting companies from PDF and "google" them using PyPDF2 and Regex (Python)

  • Published on 29 Mar 2020
  • Python programming tutorial for beginners: Learn how to extract text from a given PDF file using regex matching, and automate your research task using the webbrowser module in Python.

Comments • 29

  • @SamuelChan
    @SamuelChan  2 years ago +7

    Hey, I hope you find this content helpful! I’m starting to make full series on Python / data science / automation / blockchain (smart contracts), sorted into episodes. All of them are around an hour long, HD, and have no ads. Please feel free to check out the channel or consider subscribing ❤️ Let me know if you have topics you’d like to see me cover!

  • @RiecVC
    @RiecVC 3 years ago

    It blew my mind :o ! Thank you so much for sharing !!

  • @gupta_udit
    @gupta_udit 2 years ago

    Great learning 🙏

  • @dubjoyce
    @dubjoyce 2 years ago

    I like it because it’s a real life example. So useful man. Would’ve been great to add the click on the “News” tab of Google search. Great anyway 👍

    • @SamuelChan
      @SamuelChan  2 years ago

      That's a great suggestion! I have lots more planned in the pipeline. Here’s another real-life Python app (it’s available on PyPI): m.th-cam.com/play/PLXsFtK46HZxXIVE4tRjwMjwKFVaQSdT5W.html
      Thanks for checking it out and hope to see you around!
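
For anyone who wants to try the "News" tab idea from this thread, here is a minimal sketch. It assumes Google still accepts the `tbm=nws` query parameter to select the News tab, which is not documented in the video and may change. The function names `google_search_url` and `open_searches` are made up for illustration.

```python
import webbrowser
from urllib.parse import urlencode


def google_search_url(query, news=False):
    """Build a Google search URL for `query`.

    Assumption: adding tbm=nws switches the results to the News tab.
    """
    params = {"q": query}
    if news:
        params["tbm"] = "nws"
    return "https://www.google.com/search?" + urlencode(params)


def open_searches(companies, news=False):
    """Open one browser tab per company name."""
    for name in companies:
        webbrowser.open_new_tab(google_search_url(name, news=news))


# Usage (opens real browser tabs, so it is left commented out):
# open_searches(["DBS Group Holdings"], news=True)
```

`urlencode` takes care of escaping spaces and punctuation in company names, so the names extracted from the PDF can be passed in unchanged.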

  • @wirechair
    @wirechair 2 years ago

    Hello, thanks for this video! Amazing stuff. Does the pdf need to be a pdf with "native" text in it? Or can it be something like an image of a page with text?

    • @SamuelChan
      @SamuelChan  2 years ago

      I don't think an image of a page with text would work. One workaround is to use Tesseract, easyocr, or another OCR package. I have short clips of that on my FB / Instagram, but I'm planning to release a video walkthrough in the future.

  • @robotdream8355
    @robotdream8355 1 year ago +1

    Thanks for the amazing stuff. If multiline text appears in a table, can it still be extracted?

    • @SamuelChan
      @SamuelChan  1 year ago

      Yes, since we’re matching string patterns with regex. A multi-line sentence is just a line with \s for spaces and \n for newlines.
      I have other videos that show the use of regex on larger chunks of text. If you can match it with regex, you can extract it with PyPDF2.
      Related: Writing Regex for Humans
      th-cam.com/video/-QuH0jiVddY/w-d-xo.html
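
A small sketch of the multi-line idea above. The sample text and the "Company:" / "Cost:" labels are invented for illustration; real extracted text will look different.

```python
import re

# Hypothetical text, shaped like what a PDF text extractor might return
# for a table cell whose content wraps onto a second line.
text = "Company: DBS Group\nHoldings Ltd\nCost: 24,500\n"

# re.DOTALL lets "." match newlines, so the capture can span both lines.
# \s+ matches any run of whitespace, including the newline before "Cost:".
match = re.search(r"Company:\s+(.+?)\s+Cost:", text, re.DOTALL)

# Collapse the internal newline into a single space for a one-line name.
company = re.sub(r"\s+", " ", match.group(1)) if match else None
```

Here `company` ends up as a single-line string even though the name was split across two lines in the extracted text.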

  • @KinqNick
    @KinqNick 1 year ago

    Good video, sir. I have a question: is it possible to loop through multiple PDFs and rename them based on what is found in their text? For example, I need to loop through multiple PDFs, search for a specific number after a certain name, and rename the PDFs accordingly. Is there a way to do it?
    Do I need os and glob?

    • @SamuelChan
      @SamuelChan  1 year ago

      You will need os.rename(current_path, new_path). This new_path is a variable constructed from what you found from the body of the pdf. That’s how I would go about it! :)

    • @KinqNick
      @KinqNick 1 year ago

      And how can I loop through multiple files?

    • @SamuelChan
      @SamuelChan  1 year ago

      for filename in os.listdir(current_dir):
          with open(…)  # same drill as what you see in the vid
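
Putting the two replies together, a loop-and-rename helper might look roughly like this. It is a sketch, not the code from the video: `rename_pdfs` is a made-up name, and `extract_text` is a hypothetical callable (path in, text out) that you would implement with the PyPDF2 reading code shown in the video.

```python
import os
import re


def rename_pdfs(directory, extract_text, pattern):
    """Rename each PDF in `directory` after the first regex group found in its text.

    `extract_text` is a hypothetical callable (path -> str), for example a
    wrapper around the PyPDF2 code from the video.
    Returns a list of (old_name, new_name) pairs for the files it renamed.
    """
    renamed = []
    # os.listdir is evaluated once, so renaming inside the loop is safe.
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".pdf"):
            continue
        current_path = os.path.join(directory, filename)
        match = re.search(pattern, extract_text(current_path))
        if match is None:
            continue  # nothing found, leave the file alone
        new_name = match.group(1) + ".pdf"
        new_path = os.path.join(directory, new_name)
        if os.path.exists(new_path):
            continue  # avoid overwriting an existing file
        os.rename(current_path, new_path)
        renamed.append((filename, new_name))
    return renamed
```

`os.listdir` is enough here; `glob.glob("*.pdf")` would also work and would remove the need for the extension check.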

  • @inesaoues5464
    @inesaoues5464 1 year ago +1

    Hi, can you please help me? How can I get the TOC (table of contents) of a PDF into a list with Python?

    • @SamuelChan
      @SamuelChan  1 year ago +1

      Hey Ines, I would try: (1) re.match, since a TOC usually follows a very defined, very easily identifiable structure. If for some reason the PDF isn't properly encoded (so the text was scanned in as images, for example), I would try (2) OCR like Tesseract. The first approach is what I show in this video. Hope it helps!

    • @inesaoues5464
      @inesaoues5464 1 year ago

      But the table of contents starts with the word "content", and in my file it runs from page 2 to page 5. How can I get only the titles, without the page numbers?

    • @SamuelChan
      @SamuelChan  1 year ago

      You could just do the extraction first and then clean up the page numbers and drop the word “content” with pattern matching? :)
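
As a rough illustration of that clean-up step: the sketch below assumes the TOC text has already been extracted from pages 2 to 5, and that each entry ends with its page number. The `toc_titles` name and the sample text are invented; real TOC layouts vary a lot.

```python
import re


def toc_titles(toc_text):
    """Return TOC titles with the heading word, dot leaders, and page numbers removed."""
    titles = []
    for line in toc_text.splitlines():
        line = line.strip()
        # Drop blank lines and the heading itself ("Content", "Contents", ...).
        if not line or re.fullmatch(r"(table of )?contents?", line, re.IGNORECASE):
            continue
        # Strip a trailing page number plus any dot leaders or spaces before it.
        title = re.sub(r"[\s.]*\d+$", "", line)
        if title:
            titles.append(title)
    return titles


# Hypothetical extracted text.
sample = "Contents\nIntroduction ........ 2\n1. Methods   5\nResults 12\n"
titles = toc_titles(sample)
```

Note that only a number at the very end of a line is removed, so section numbers at the start of a title (like "1. Methods") are kept.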

  • @aparnamukka1956
    @aparnamukka1956 2 years ago +1

    How do I find the cost of DBS Group Holdings Pvt Ltd?

    • @SamuelChan
      @SamuelChan  2 years ago

      It's basically just separating by comma and then taking the value corresponding to the cost. Pause at 5:38, that's your text. Combining that with `text.split(',')[0]` gives you the first value, which is the cost of DBS Group Holdings. You'll want to watch out for numbers like 24,500 if you use this method, though. It's good mental practice to think about how one can come up with a more robust solution.
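
One possible "more robust" variant, as a sketch only. The sample line is hypothetical (the real text is whatever the video shows at 5:38), and `first_number` is a made-up helper. It treats a comma followed by exactly three digits as a thousands separator, which can still misfire if two numeric fields sit next to each other with no space.

```python
import re

# Hypothetical line; the cost comes first, as in the reply above.
text = "24,500, DBS Group Holdings"

# Naive approach: the thousands separator is mistaken for a field separator.
naive = text.split(",")[0]

# Either a number with thousands groups (24,500 or 1,234,567.89)
# or a plain number without separators (500 or 12.5).
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def first_number(s):
    """Return the first number in `s` as a float, or None if there is none."""
    m = NUMBER.search(s)
    if m is None:
        return None
    return float(m.group(0).replace(",", ""))


cost = first_number(text)
```

With the sample line, `naive` is the truncated string "24", while `cost` is the full value 24500.0.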

  • @afifnugr3399
    @afifnugr3399 4 years ago

    It's a good tutorial, dude. I just need a script to run all my repetitive tasks.

  • @davidkaradine5093
    @davidkaradine5093 1 year ago +1

    I would like to extract financial data from the PDF financial reports of some companies listed on Bursa Malaysia into an Excel file for my PhD thesis. But I don't know how to do it!
    Please, if you can help me, you'll save my life.

    • @SamuelChan
      @SamuelChan  1 year ago +1

      Hey, where do you need help? On the use of PyPDF2? On regex? :) Happy to hear more about where you’re struggling and provide some pointers.

    • @davidkaradine5093
      @davidkaradine5093 1 year ago

      @SamuelChan Actually, I'm not familiar with Python; I just stumbled on your interesting video. I will download the financial reports of 100 companies listed on Bursa Malaysia for a period of 6 years and file each company's reports in a separate folder. I will set up an Excel sheet with these headers: the ticker, the sector, the years, total assets, total equity, revenue, receivables, profit before tax, net profit, tax expense, tax paid, net cash flow, deferred tax assets, deferred tax liabilities, outstanding number of shares, market closing price, and property, plant and equipment. The years are: 2018, 2019, 2020, 2021, 2022.
      I want code to extract these data and paste them into the Excel sheet, because doing it manually is time-consuming. Many thanks for your response.

    • @davidkaradine5093
      @davidkaradine5093 1 year ago

      @SamuelChan Can you help me??

    • @SamuelChan
      @SamuelChan  1 year ago

      @davidkaradine5093 I think if you're not keen on learning Python (or can't, due to time constraints), then doing it manually will be less effort since it's only 6 years' worth of data.
      Alternatively, consider just buying the data for these companies in xlsx format, either from the bourse or from 3rd-party providers. Then you don't have to extract data into xlsx, and you still have historical data for the companies you're researching.
      Last idea: perhaps you may also consider outsourcing it on Fiverr? What you require isn't complicated at all, and most programmers on freelancer sites will be happy to build an automation script / pipeline for you. I run Algoritma (algorit.ma) and Supertype (supertype.ai) and we provide these kinds of services for companies all the time, so depending on your budget that might work too.

    • @davidkaradine5093
      @davidkaradine5093 1 year ago

      @SamuelChan How much would this cost??
      It's for 6 years, but also for hundreds of companies.

  • @hoangnhatquang4743
    @hoangnhatquang4743 2 years ago

    Can you share the code file on GitHub?

    • @SamuelChan
      @SamuelChan  2 years ago

      Yes anh. It’s here on my GitHub: github.com/onlyphantom/automatetheboringstuff/blob/master/pdf_00.py