Web Scraping Wikipedia tables using Python

  • Published on 19 Jan 2025

Comments • 38

  • @jiejenn
    @jiejenn  4 years ago +9

    Forgot to mention that the output of the read_html method is a list. To convert the list object to a DataFrame object, simply extract the first element from the output. For example, df = df[0].
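
A minimal sketch of the point above, assuming pandas and an HTML parser such as lxml are installed. The HTML string and its values are made up for illustration and stand in for a downloaded Wikipedia page.

```python
from io import StringIO

import pandas as pd

# Made-up HTML standing in for a downloaded page (illustration only).
HTML = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>1</td></tr>
  <tr><td>Beta</td><td>2</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per table it finds.
tables = pd.read_html(StringIO(HTML))
print(type(tables), len(tables))

# Take the first element to get the DataFrame itself.
df = tables[0]
print(df)
```

Keeping the list under a separate name (tables here) makes it easier to go back and pick a different table later.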

    • @mumin9436
      @mumin9436 3 years ago +1

      Dude, you're awesome 😀. I did my first Wikipedia scraping by following this video. Thanks for that. Now the only problem I have is that there are multiple tables on the page, and the output I'm getting is the table above the one I intended to scrape. Trying to figure it out.

    • @mumin9436
      @mumin9436 3 years ago

      I figured it out. When multiple tables have the same attributes, we just need to find the index of the table we want and specify it.
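
One possible way to find that index, sketched with made-up HTML and assuming pandas plus an HTML parser such as lxml are installed. The match argument of read_html is shown as an alternative that filters tables by the text they contain.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables that share the same class (illustration only).
HTML = """
<table class="wikitable">
  <tr><th>Year</th><th>Host</th></tr>
  <tr><td>1</td><td>A</td></tr>
</table>
<table class="wikitable">
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>B</td><td>3</td></tr>
</table>
"""

tables = pd.read_html(StringIO(HTML))

# Print each table's index and column names to spot the one you want.
for index, table in enumerate(tables):
    print(index, list(table.columns))

# Select the wanted table by its index in the list.
df = tables[1]

# Alternative: keep only tables containing text that matches "Wins".
matching = pd.read_html(StringIO(HTML), match="Wins")
print(len(matching))
```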

  • @sammcintyre26
    @sammcintyre26 3 years ago +26

    For those having trouble finding table_id:
    You can use the table's class name instead of the table_id.
    In that case, I made a change to these 2 lines of code:
    table_name = 'wikitable sortable'
    soup_table = soup.find('table', {'class':table_name})
    Hope this helps
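
A self-contained sketch of this class-based lookup, assuming Beautiful Soup (bs4) is installed. The HTML is made up, and the CSS-selector line is an optional alternative that is not part of the original two-line change.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (illustration only).
HTML = """
<table class="infobox"><tr><th>Box</th></tr></table>
<table class="wikitable sortable">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alpha</td><td>1</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The two lines from the comment: look the table up by its class string.
table_name = "wikitable sortable"
soup_table = soup.find("table", {"class": table_name})

headers = [th.get_text(strip=True) for th in soup_table.find_all("th")]
print(headers)

# Alternative: a CSS selector that matches both classes in any order,
# even if the table carries extra classes.
same_table = soup.select_one("table.wikitable.sortable")
print(same_table == soup_table)
```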

    • @erixyz
      @erixyz 3 years ago +1

      this helped out a lot. thanks for sharing

    • @miloyang5893
      @miloyang5893 3 years ago +2

      I tried this, but for wiki pages with several tables with class_name = 'wikitable sortable', the program only returns the first one it finds... How do I get the other ones? Thanks

    • @mumin9436
      @mumin9436 3 years ago

      Thanks a lot. This helped.

    • @chrispapadakis3965
      @chrispapadakis3965 3 years ago

      thanks man!

    • @ideastoelectrons156
      @ideastoelectrons156 3 years ago +1

      @miloyang5893 You can try the soup.find_all() method instead of soup.find(). It will return a list of all the matching tables.
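
A short sketch of the find_all() approach, assuming Beautiful Soup (bs4) is installed. The HTML, captions, and cell values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables sharing the same class (illustration only).
HTML = """
<table class="wikitable sortable">
  <caption>First</caption>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>P</td><td>7</td></tr>
</table>
<table class="wikitable sortable">
  <caption>Second</caption>
  <tr><th>City</th><th>Area</th></tr>
  <tr><td>Q</td><td>9</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() stops at the first match; find_all() returns every match.
all_tables = soup.find_all("table", {"class": "wikitable sortable"})

# List each table's index and caption to decide which one to keep.
for index, table in enumerate(all_tables):
    caption = table.find("caption")
    label = caption.get_text(strip=True) if caption else "(no caption)"
    print(index, label)

# Pull the rows of the second table as lists of cell text.
second = all_tables[1]
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in second.find_all("tr")
]
print(rows)
```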

  • @suomynona7261
    @suomynona7261 2 years ago +1

    Why would you want to scrape a table instead of text? What would a table be used for?

  • @princek4935
    @princek4935 4 years ago +5

    I can't find a table ID on the wiki page

  • @chrispapadakis3965
    @chrispapadakis3965 3 years ago +1

    Nice and simple, thanks man!

  • @michaeltillcock3864
    @michaeltillcock3864 2 years ago +1

    Thanks, I am so nearly there! One question. I get to 5 mins 48 secs with the same results as Jie. But when I try to print(df), the terminal says: "Traceback (most recent call last):
    ///File "", line 1, in ///NameError: name 'df' is not defined".
    From my understanding I have defined df in line 12, so I can't work out why it's not working. I am a newbie, so answers for dummies appreciated.

    • @michaeltillcock3864
      @michaeltillcock3864 2 years ago

      Dumb mistake: I needed to write print(df) at the end of the program, select all the lines of code, and run that. It looked like you typed it into the terminal, which didn't work for me.

    • @jiejenn
      @jiejenn  2 years ago +1

      Glad you were able to solve your issue. Apologies for the late reply; I'm currently moving back to the U.S. from Asia, and there's too much going on.

  • @christopherwells7295
    @christopherwells7295 4 years ago +1

    Thanks for the video. I see you also forgot to mention that creating df with read_html requires lxml; thankfully I can read the errors, so I installed it.
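
A small standard-library sketch for checking this before running the script. It assumes pandas' default behaviour of trying lxml first and falling back to bs4 with html5lib. The helper name available_html_parsers is hypothetical.

```python
from importlib.util import find_spec


def available_html_parsers():
    """Report which parser packages can be imported (hypothetical helper)."""
    names = ("lxml", "bs4", "html5lib")
    return {name: find_spec(name) is not None for name in names}


status = available_html_parsers()
print(status)

# Assumption: read_html needs lxml, or bs4 together with html5lib.
can_parse = status["lxml"] or (status["bs4"] and status["html5lib"])
if not can_parse:
    print("No HTML parser found. Try: pip install lxml")
```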

  • @otaviodzb1
    @otaviodzb1 4 years ago

    Very good! It worked perfectly! Thank you!

  • @callvengeance5486
    @callvengeance5486 4 years ago +4

    Hello, I am using Chrome but I can't see the table ID, only the class. Do I need to do something else to get the table ID?

    • @jiejenn
      @jiejenn  4 years ago

      You should be able to. What steps did you take to view the source code?

    • @blacklabelmansociety
      @blacklabelmansociety 4 years ago

      Same problem over here. Were you able to find any solution?

    • @gGBb27
      @gGBb27 4 years ago

      same thing

    • @jessemetzger6709
      @jessemetzger6709 4 years ago

      I went to the 'debugger' part on Firefox and under debugger the class had a slightly different name. I used that class name and everything worked

    • @miloyang5893
      @miloyang5893 3 years ago +1

      Wikipedia tables don't always have table IDs; just use the class name.
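
One way to handle both cases is to try the ID first and fall back to the class, sketched here with Beautiful Soup (bs4) assumed installed. The helper find_table, the id "results", and the HTML snippets are hypothetical.

```python
from bs4 import BeautifulSoup


def find_table(soup, table_id=None, class_name="wikitable"):
    """Return the table with the given id, else the first table with the class."""
    if table_id:
        table = soup.find("table", id=table_id)
        if table is not None:
            return table
    # class_ matches any table whose class list includes class_name.
    return soup.find("table", class_=class_name)


# Made-up snippets: one table has an id, the other only has classes.
WITH_ID = '<table id="results" class="wikitable"><tr><td>by id</td></tr></table>'
NO_ID = '<table class="wikitable sortable"><tr><td>by class</td></tr></table>'

soup_with_id = BeautifulSoup(WITH_ID, "html.parser")
soup_no_id = BeautifulSoup(NO_ID, "html.parser")

found_by_id = find_table(soup_with_id, table_id="results")
found_by_class = find_table(soup_no_id, table_id="results")

print(found_by_id.get_text(strip=True))
print(found_by_class.get_text(strip=True))
```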

  • @farhangony952
    @farhangony952 4 years ago +2

    I cannot use pandas. Why is this happening?

    • @jiejenn
      @jiejenn  4 years ago +3

      Did you install the pandas library?

    • @farhangony952
      @farhangony952 4 years ago +1

      @jiejenn Oh! Thank you so much. Can you kindly tell me how to install that library? Actually, I am a new learner and don't know most of these things.

    • @blacklabelmansociety
      @blacklabelmansociety 4 years ago

      @farhangony952 Try typing pip install pandas in the conda prompt.

  • @tonypendletoniii3209
    @tonypendletoniii3209 4 years ago +1

    Thanks for the vid, man! Do you happen to live in Alabama btw?

  • @princek4935
    @princek4935 4 years ago

    I can't find a table ID on the wiki page.

  • @akshatjain3938
    @akshatjain3938 4 years ago

    Can you recommend any good extensions for Python in VS Code?

  • @thomascooney4078
    @thomascooney4078 2 years ago

    What Python client are you using?
    It looks a lot simpler than PyCharm.

    • @jiejenn
      @jiejenn  2 years ago

      VS Code. The configuration takes a bit to set up, but I like the flexibility much better than PyCharm.

  • @mohamedhachaichi2680
    @mohamedhachaichi2680 4 years ago

    How to turn the output of this into a DataFrame?

    • @jiejenn
      @jiejenn  4 years ago +1

      This is something I failed to mention in the video. To convert df (which is still a list) to a DataFrame object, extract the first element. For example, df = df[0].