The Rvest & RSelenium Tutorial - Web Scrape Dynamic Tables in R

  • Published on 15 Jan 2025

Comments • 51

  • @imfm
    @imfm 1 year ago +2

    I need to automate pulling data from several websites with atrocious autogenerated spaghetti code. I was trying with Rvest alone and httr and other solutions. I was getting nowhere fast. Then I found this video and boom, I'm in. I can't thank you enough Samer.

  •  1 year ago

    Very well explained. I didn't know about {RSelenium}, looks really powerful. Thanks!

  • @RolandŐzse
    @RolandŐzse 1 year ago +1

    Hi,
    Thank you so much for this. I am not that big on coding and this solution is really easy to follow. Excuse me if I am being too dumb. I ran into a problem when you refer to the pagination command at 5:25 using the aria label. I am trying to scrape a transfermarkt table and that field is looking pretty different for me:
      
    As you can see, it's an href and not an aria label. There is a link to the next page on every page and I do not know how to iterate over this. It works fine if I want to do the first two pages, but then it's obviously not working. Could you maybe help me out with what I should copy and paste into the findElement function? Or is this a whole different situation and I have to do something new? Thank you for your help in advance :)

  • @delabungsu6817
    @delabungsu6817 2 years ago

    Thank you Samer.

  • @馮庭萱
    @馮庭萱 2 years ago

    Many thanks. Great explanation, super clear!

  • @AngelFelizF
    @AngelFelizF 2 years ago

    Great video, thanks for sharing

  • @arunrajesh5137
    @arunrajesh5137 2 years ago +1

    Watching this tutorial immediately after your Introduction to RSelenium. Really enjoyed learning it from you, Samer. How do we navigate to a webpage that requires a username and password from RSelenium?

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      Thank you, Arun! You can do so by identifying the username and password input boxes and sending the username and password to those boxes using the sendKeysToElement function from RSelenium
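
A minimal sketch of the approach described above, assuming `remDr` is an already-open RSelenium client. The helper name `log_in` and both CSS selectors are hypothetical placeholders, so inspect the actual login page to find the right ones.

```r
# Hypothetical helper: fills in a login form and submits it.
# `remDr` is assumed to be an open RSelenium remote driver client.
# The default CSS selectors are placeholders, not taken from any real site.
log_in <- function(remDr, username, password,
                   user_css = "input[name='username']",
                   pass_css = "input[name='password']") {
  # Locate the username box and type into it
  user_box <- remDr$findElement(using = "css selector", value = user_css)
  user_box$sendKeysToElement(list(username))

  # Locate the password box, type the password, then press Enter to submit
  pass_box <- remDr$findElement(using = "css selector", value = pass_css)
  pass_box$sendKeysToElement(list(password, key = "enter"))

  invisible(TRUE)
}
```

Pressing Enter in the password box submits many forms, but some sites need an explicit click on the login button instead.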

    • @arunrajesh5137
      @arunrajesh5137 2 years ago

      @SamerHijjazi thank you so much...

  • @huongheidinguyen337
    @huongheidinguyen337 2 years ago +1

    Thank you for the tutorial. I'm practicing scraping Sephora product reviews and ran into a problem. On my last page, there is still a Next page button (it is just disabled), so there was no error and my Next-page loop didn't end. Do you have any suggestions on how to end the loop in this case?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      If there is a way for you to determine how many pages there are, you can set that as your limit in the loop so that it does not go over that number.
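
A rough sketch of a loop with a page limit, assuming `remDr` is an open RSelenium client. The helper name `scrape_all_pages` and its arguments are hypothetical. As an extra safeguard that goes beyond the suggestion above, it also stops when the Next button reports itself as disabled.

```r
# Hypothetical helper: collects the page source of each page, clicking "Next"
# until either `max_pages` is reached or the button is no longer enabled.
scrape_all_pages <- function(remDr, next_xpath, max_pages = 50, pause = 2) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    # Store the HTML of the current page
    pages[[i]] <- remDr$getPageSource()[[1]]

    # Find the Next button and stop if the browser says it is disabled
    next_btn <- remDr$findElement(using = "xpath", value = next_xpath)
    if (!isTRUE(next_btn$isElementEnabled()[[1]])) break

    next_btn$clickElement()
    Sys.sleep(pause)  # give the next page time to load
  }
  pages
}
```

Some sites only grey the button out with a CSS class or `aria-disabled`, in which case `isElementEnabled()` may still return TRUE and the page limit is what ends the loop.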

  • @shoakromyusupov7297
    @shoakromyusupov7297 1 year ago

    Really helpful video. I would like to ask if you can make a similar video on scraping data from social media sites like Instagram or LinkedIn, or another site of your choice?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Thank you! I don't think I will. LinkedIn is very difficult to scrape (plus they can close your account for it), and Instagram has its own API.

  • @eleonoras.2878
    @eleonoras.2878 1 year ago

    Thank you very much for providing such a great explanation! I've encountered an issue in that I'm only seeing a limited selection of chromedriver versions. Unfortunately, none of these versions seem to be compatible with my current Google Chrome version. Would you by any chance have any suggestions on how I might go about resolving this problem? Your insights would be greatly appreciated. :)

    • @SamerHijjazi
      @SamerHijjazi 1 year ago +1

      Thank you for the great feedback! I would suggest running the wdman::selenium function, which will download the latest drivers. Then when you run rsDriver, refer to the chromedriver version that corresponds to yours.

    • @eleonoras.2878
      @eleonoras.2878 1 year ago

      @SamerHijjazi I appreciate your response and assistance. Thank you very much. :)

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      @eleonoras.2878 my latest Selenium video might actually be able to solve your issue. th-cam.com/video/BnY4PZyL9cg/w-d-xo.htmlsi=RP74unOe8SvxWvPV

  • @sarahsuzz
    @sarahsuzz 5 months ago

    I keep getting an error "element not found" when using xpath to locate my "nextpage" button - it is an aria-label and it's located in the div section of the DOM - not sure what I am doing wrong. I have checked my code for typos, very carefully. Can you help?

    • @sarahsuzz
      @sarahsuzz 5 months ago

      I found my issue: my aria-label was not on an "a" tag, it was on a button.
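
For anyone hitting the same error, here is a small sketch of building the XPath so that the tag name is explicit or ignored altogether. The helper `aria_label_xpath` and the label "Next page" are hypothetical.

```r
# Hypothetical helper: builds an XPath that matches an element by its
# aria-label. tag = "*" matches any tag (a, button, div, ...).
aria_label_xpath <- function(label, tag = "*") {
  sprintf("//%s[@aria-label='%s']", tag, label)
}

aria_label_xpath("Next page")            # "//*[@aria-label='Next page']"
aria_label_xpath("Next page", "button")  # "//button[@aria-label='Next page']"

# Possible use with an open RSelenium client `remDr`:
# remDr$findElement(using = "xpath", value = aria_label_xpath("Next page"))
```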

  • @MrNachtduiker
    @MrNachtduiker 2 years ago

    awesome, thanks

  • @respanol1970
    @respanol1970 2 years ago

    Amazing!!!

  • @НикитаБабарыкин-р1ь
    @НикитаБабарыкин-р1ь 1 year ago

    Can you help please? Error in checkError(res) :
    Undefined error in httr call. httr output: Failed to connect to localhost port 4567 after 2254 ms: Connection refused
    What could be the problem?

  • @KaraniKeith
    @KaraniKeith 1 year ago

    How do I set up the server in the Firefox browser?

  • @jeysunez
    @jeysunez 2 years ago

    Would it be possible to hop on a zoom for help with a scraping project? I would really appreciate it

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      I'm currently not offering that. But I might be in the future :)

  • @yehitzmedapirc
    @yehitzmedapirc 1 year ago

    Hi! What can I do if my "Next button" is different every time?
    I do not have a "next" button, I have to click on the 1, then 2 etc. on the page.
    Thanks!

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Try to see if the different next buttons have a similar attribute that you can use.

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Neat. I was struggling with some dataset (tiny one) that has commas.

  • @devypratiwi8103
    @devypratiwi8103 1 year ago

    Hello, thanks for sharing the video!
    I've already watched and followed all the steps, but I got an error saying
    Error in java_check() :
    PATH to JAVA not found. Please check JAVA is installed.
    What confuses me is that I've already installed Java and the installation completed, but the error keeps saying that Java is not found. Do you know how to solve this issue? Thank you

  • @cameronl1434
    @cameronl1434 2 years ago

    I am very much a beginner with all this, so sorry if this is a stupid question. I have a data table which I want to extract the information from, but when I inspect the code it doesn't have an ID. How can I go about selecting the data table without an ID? Thank you in advance

    • @zahrarahmati8612
      @zahrarahmati8612 2 years ago

      Hello Samer, I have exactly the same problem. Would you please help with this?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Not a stupid question at all! Try using a different attribute to identify your table by.
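
As an illustration of selecting a table without an ID, here is a small self-contained sketch using {rvest} on made-up HTML. The class name `results` and the table contents are invented for the example. On a real site you would parse the page source first and use whatever attribute the table actually has.

```r
library(rvest)

# Made-up page with two tables, neither of which has an id
page <- minimal_html('
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table class="results">
    <tr><th>name</th><th>time</th></tr>
    <tr><td>A</td><td>9.58</td></tr>
    <tr><td>B</td><td>9.69</td></tr>
  </table>
')

# Option 1: identify the table by another attribute, here its class
by_class <- html_table(html_element(page, "table.results"))

# Option 2: identify the table by its position among all tables on the page
by_position <- html_table(html_elements(page, "table")[[2]])
```

Selecting by position is the more fragile of the two, since it breaks if the site adds or removes a table.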

  • @tarasst6887
    @tarasst6887 2 years ago

    Great!!!!

  • @haraldurkarlsson1147
    @haraldurkarlsson1147 2 years ago

    Don't we have to check whether the site allows scraping first?

    • @SamerHijjazi
      @SamerHijjazi 1 year ago

      Sure! This is only for demonstration purposes. But it's good practice to check first.

  • @CreativeOutput_L_Rist
    @CreativeOutput_L_Rist 2 years ago +1

    Hey Samer, love the tutorial but ran into an issue I couldn't resolve yet. I am using RSelenium to click on a tab that contains the data I want, which works fine if I run the lines of code one after the other, but not in a for loop. I have a list of links the loop should iterate through and some tries it didn't even click the tab for the first list item, other times it stopped after just a couple.. after just adding a bunch of clickElement() commands it worked for a bit longer (but not directly related to the number of commands added) and then stopped again. Any idea how to make it run more stable? My R memory usage is kinda high, could it be due to that? Am a total noob at R, but confusing that it works manually but not in the loop
    Edit: Also, the netstat free_port function always gives me an 'Error in strsplit(local, ":") : non-character argument'. I wrote it exactly as you have, so no idea why it doesn't work. If I define a port manually (e.g. 14415 or '14415') it says 'Error: port should be an integer value'. My knowledge of maths might be limited, but last time I checked 14415 was an integer lol

    • @SamerHijjazi
      @SamerHijjazi 2 years ago +1

      Thank you for the great feedback. I'd have to look at your code to be able to see what's going wrong

    • @CreativeOutput_L_Rist
      @CreativeOutput_L_Rist 2 years ago

      @SamerHijjazi Thanks for the quick response. Thought it might be a common or known issue. I posted the code in a reddit thread titled "Impossible to run RSelenium's clickElement() in a loop??" 6 days ago
      Only if you have time and interest tho, don't wanna force you to look at my spaghetti code haha

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      @CreativeOutput_L_Rist can't find it. Looks like the post was removed. Can you reply to this comment with your loop?

    • @CreativeOutput_L_Rist
      @CreativeOutput_L_Rist 2 years ago

      @SamerHijjazi yeah, just saw it did get removed. The loop looks like this:
      for (link in links) {
      remDr$navigate(link)
      object = remDr$findElement(...)
      results_object$clickElement()  # issue here (?)
      table i need = remDr$findElement(...)
      same table html = (...)$getPageSource()
      and so on, exactly like you did in the video. It worked line by line, which means the css selectors should be fine, just that the click command doesn't reliably execute. Since the code above probably doesn't help much: the site is (google) 'iaaf 100m times men', then for every athlete I want to go to their profile, click the results tab (this is where it fails randomly) where all the 100m times from the current season are listed, and then extract these values via html table (or similar). The links seem to be correct too, just something about the dynamic nature of the specific site confuses the clickElement()

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      @CreativeOutput_L_Rist My guess is your loop is running too quickly, so when it gets to the clickElement part, it can't locate the element because the web page is still loading. I would suggest you include a small break in your loop to create a pause long enough for the site to load. You can do so by using the Sys.sleep function
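
A sketch of one way to add that pause, assuming `remDr` is an open RSelenium client. The helper `click_with_retry` is hypothetical and is a variation on the plain `Sys.sleep` suggestion: rather than one fixed pause, it sleeps and retries until the element can be found.

```r
# Hypothetical helper: tries to find and click an element, sleeping between
# attempts so a slow-loading page has time to render it.
click_with_retry <- function(remDr, using, value, tries = 5, pause = 1) {
  for (attempt in seq_len(tries)) {
    # findElement() throws an error if the element is not there yet
    elem <- tryCatch(
      remDr$findElement(using = using, value = value),
      error = function(e) NULL
    )
    if (!is.null(elem)) {
      elem$clickElement()
      return(invisible(TRUE))
    }
    Sys.sleep(pause)  # wait before the next attempt
  }
  stop("Element not found after ", tries, " attempts: ", value)
}
```

Inside a loop over links this could replace the separate findElement and clickElement lines. The simplest alternative is a single `Sys.sleep(2)` right after `remDr$navigate(link)`.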

  • @ahmed007Jaber
    @ahmed007Jaber 2 years ago

    Thank you for this.
    I'm getting the error below
    Error in java_check() :
    PATH to JAVA not found. Please check JAVA is installed.
    whenever I run
    rs_driver_object

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      You need to make sure the JDK is properly installed on your machine. If you're on a Windows machine, this tutorial is useful: th-cam.com/video/IJ-PJbvJBGs/w-d-xo.html

  • @glanegons
    @glanegons 2 years ago

    Too good mate, is it possible to share your code? Thanks

    • @SamerHijjazi
      @SamerHijjazi 2 years ago

      Thank you for your feedback. I've added the link to the code in the description. :)

  • @MohammadMohammad-mj6pc
    @MohammadMohammad-mj6pc 2 years ago

    👌👌👌 Can you create a video tutorial for the chromote package?

    • @SamerHijjazi
      @SamerHijjazi 2 years ago +1

      This is a good idea! I'd like to explore the package

  • @celmywall
    @celmywall 1 year ago

    Thank you for your extraordinary tutorial. I'd like to have your opinion on this error: Error in rbindlist(list(all_data, df)) :
    Column 1 of item 2 is length 3 inconsistent with column 2 which is length 4. Only length-1 columns are recycled.
    > Thank you so much.
    Hey, I solved the error easily. Thanks anyways.