[77] Use Selenium and Pandas on Google Colab to access rendered HTML tables!

  • Published on 25 Nov 2024

Comments

  • @roons2424
    @roons2424 6 months ago +1

    Question from an aspiring data scientist:
    How did you go about finding and fixing the problem that occurred in the last video?
    When it didn't work for me I felt hopeless, like my project wouldn't continue for a long time, just like the last time I ran into a problem.
    How did you manage to find an alternative, working way?
    I am very impressed by your skills, not only in finding and fixing the problem but also in putting in the effort to make it available for everyone and even making an instructional video explaining it. Today I'll celebrate you, my friend.
    Sorry for my bad English,
    Greetings from the Netherlands

    • @PythonicAccountant
      @PythonicAccountant 6 months ago +1

      Thank you for your message and for watching my videos! The short answer is that previously I would’ve just googled it, but now I typically go to ChatGPT. The first thing I do is copy the error message and paste it in to see if it has any suggestions. I realized quickly that the problem was likely just the parameters getting passed in, so I tried removing the first parameter, as one of the suggestions from ChatGPT had indicated.
      As for feeling hopeless when something goes wrong, that’s totally normal! I am pretty stubborn and have a bit of a hacker mindset, so I usually assume there is another way to do something and will work very hard and try many different things before I eventually give up. That level of commitment, trying over and over again, usually ends up working out well for me, and I also usually learn many things along the way. Try it out! Next time you run into a problem, try solving it in a bunch of different ways before you give up :-)

  • @ryzvonusef
    @ryzvonusef 6 months ago +1

    thank you!

  • @joepropertykey3612
    @joepropertykey3612 5 months ago +1

    The reality is, most pages that have 'in demand' data in tables use heavy JavaScript on the page. 'Selenium-Wire' (different from plain 'Selenium') will catch the important parts though.
    pandas will read the HTML tables too. Rips right through it.
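A minimal sketch of the render-then-parse idea from the comment above, assuming plain Selenium with headless Chrome (as is typical on Colab) and `pandas.read_html` with an HTML parser such as lxml installed. The function names are hypothetical, and the Chrome flags are assumptions about the environment, not the video's exact setup.

```python
from io import StringIO


def tables_from_html(html):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    import pandas as pd  # imported lazily; read_html also needs lxml or bs4 + html5lib

    # Wrapping the string in StringIO avoids the deprecation warning that
    # recent pandas versions emit for literal HTML passed to read_html.
    return pd.read_html(StringIO(html))


def rendered_tables(url):
    """Load a page in headless Chrome so JavaScript runs, then parse its tables."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: no display available
    options.add_argument("--no-sandbox")    # assumption: containerized host such as Colab
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # page_source holds the DOM after scripts have run, not the raw response.
        # Slow pages may need an explicit wait before this line.
        return tables_from_html(driver.page_source)
    finally:
        driver.quit()
```

`tables_from_html` can be tried on any saved HTML string first; `rendered_tables` only adds the browser step for pages that build their tables with JavaScript.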

    • @PythonicAccountant
      @PythonicAccountant 5 months ago

      Thanks, I haven’t heard of Selenium-Wire! I’ll check it out.

    • @joepropertykey3612
      @joepropertykey3612 5 months ago

      @PythonicAccountant When you use Selenium-Wire, it catches all of the network responses in the background. If you google 'selenium-wire network scrape data' you'll see how to find the data in a specific response URL (usually stored neat and tidy as JSON, but other times it can be HTML in the response). But for those dynamic tables and pages? Yessir. Selenium-Wire.
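A rough sketch of what the comment describes, assuming the selenium-wire package: its driver records each request in `driver.requests`, and each entry exposes `url` and `response` (with `headers` and `body`). The helper names and the `url_fragment` filter are hypothetical, and on Colab the driver would also need headless options.

```python
import json


def json_payloads(requests, url_fragment, decode=None):
    """Return parsed JSON bodies from captured requests whose URL contains url_fragment."""
    payloads = []
    for request in requests:
        response = request.response
        # Skip requests that never completed or that point elsewhere.
        if response is None or url_fragment not in request.url:
            continue
        content_type = response.headers.get("Content-Type", "") or ""
        if "json" not in content_type:
            continue
        body = response.body
        if decode is not None:
            # Bodies may be gzip/brotli compressed; decode() undoes that.
            body = decode(body, response.headers.get("Content-Encoding", "identity"))
        payloads.append(json.loads(body))
    return payloads


def capture_json(url, url_fragment):
    """Open a page with Selenium-Wire and collect matching JSON responses."""
    from seleniumwire import webdriver  # drop-in wrapper around selenium's webdriver
    from seleniumwire.utils import decode

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Requests fired after page load may need a wait before reading driver.requests.
        return json_payloads(driver.requests, url_fragment, decode)
    finally:
        driver.quit()
```

A list of records returned this way can usually go straight into `pandas.DataFrame(...)` or `pandas.json_normalize(...)`, depending on how the JSON is nested.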
      I used to follow you a lot 'before AI', with your pdfplumber videos.
      I've been noticing that if you can get pymupdf to extract text and also preserve the line spacing from a PDF, it's pretty easy to use pandas to go through the text results and use the line spacing to 'data map': mapping rows, and line positions on those rows, to columns and rows in another temp df.
      When you look at most PDFs, it's almost as if there are 3 'columns' of data going down the middle of the page, with no overflow from one column to the next. This is where you can get it into a form to go to town with pandas and data mapping.
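One way the 'data mapping' idea could look, as a sketch rather than the commenter's actual code. It assumes PyMuPDF's `page.get_text("words")`, which yields tuples starting with `(x0, y0, x1, y1, text, ...)`. The `column_edges` x-positions and the `y_tolerance` value are hypothetical and would have to be read off the real document.

```python
def extract_words(pdf_path, page_number=0):
    """Return PyMuPDF word tuples for one page of a PDF."""
    import fitz  # PyMuPDF; imported lazily so words_to_rows works without it

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


def words_to_rows(words, column_edges, y_tolerance=3.0):
    """Group words into rows by vertical position and into columns by x0."""
    n_cols = len(column_edges) + 1
    rows = []
    last_y = None
    # Walk the words top to bottom; a vertical jump starts a new row.
    for x0, y0, _x1, _y1, text, *_ in sorted(words, key=lambda w: (w[1], w[0])):
        if last_y is None or y0 - last_y > y_tolerance:
            rows.append([[] for _ in range(n_cols)])
        last_y = y0
        # The column index is the number of edges to the left of the word.
        col = sum(x0 >= edge for edge in column_edges)
        rows[-1][col].append((x0, text))
    # Join the words inside each cell in left-to-right order.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in rows]
```

The resulting list of rows can be handed to `pandas.DataFrame(rows, columns=[...])` for the cleanup step. Fixed column edges only work when, as the comment says, text does not overflow from one column into the next.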

    • @joepropertykey3612
      @joepropertykey3612 5 months ago

      @PythonicAccountant Have you ah... saved a PDF to an HTML file and tried to parse that with pandas yet? It's kind of an interesting workaround when you have unstructured data, and possibly a PDF that was created with any indexing. You can drop the HTML file into ChatGPT too, tell it what selectors you want, and it will scrape the tables out and give you the pandas code to parse things out.

    • @PythonicAccountant
      @PythonicAccountant 5 months ago

      @joepropertykey3612 That’s interesting, does it actually work?

    • @joepropertykey3612
      @joepropertykey3612 5 months ago

      @PythonicAccountant pandas reading HTML from a PDF? Yessir. It's just another tool in the PDF arsenal... if I can't do it simply with pdfplumber, I look at pymupdf and how I can convert the data to something simpler and more structured to parse.