Automated Web Scraping in R Part 1 | Writing your Script using rvest

  • Published 19 Jan 2025

Comments • 42

  • @ukuk9162
    @ukuk9162 5 years ago +24

    your voice makes me feel like I'm on board an airplane, listening to the hostess

    • @11hamma
      @11hamma 4 years ago

      honestly man

  • @victorsingam3238
    @victorsingam3238 9 months ago +1

    Thank you this was a really good video, easy to follow and well paced.

    • @Datasciencedojo
      @Datasciencedojo  9 months ago

      Thank you for your feedback!

  • @neguinerezaii3221
    @neguinerezaii3221 2 years ago +1

    This is a great video. I now know how to get data from one wikipedia page. Is there a way to extract all text from all wikipedia pages?
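
One way to generalize beyond a single page (an editorial sketch, not from the video — for *all* of Wikipedia you would normally use the official database dumps or the MediaWiki API rather than scraping): collect the internal links on a starting page and scrape each linked article in turn.

```r
# Sketch: follow a few internal links from one Wikipedia page and grab
# each linked article's paragraph text with rvest. The starting URL and
# the head(5) limit are illustrative choices, not the author's code.
library(rvest)

start <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

links <- start %>%
  html_nodes("div#bodyContent a[href^='/wiki/']") %>%  # internal article links
  html_attr("href") %>%
  unique() %>%
  head(5)  # keep the example small and polite to the server

texts <- sapply(links, function(path) {
  page <- read_html(paste0("https://en.wikipedia.org", path))
  page %>% html_nodes("p") %>% html_text() %>% paste(collapse = "\n")
})
```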

  • @ayaabdelghany4404
    @ayaabdelghany4404 2 years ago +1

    You make it look very easy 😅

  • @agustinblacker1324
    @agustinblacker1324 5 years ago +2

    Is there a video about automated scraping in Python? The first video about scraping was about Python and was really useful and awesome. Thanks for being so clear and informative. Keep rocking!

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Thanks! That is something we might put together in the future! Our free Python web scraping tutorial is here if you need it: th-cam.com/video/XQgXKtPSzUI/w-d-xo.html
      Rebecca

  • @moeshyassin
    @moeshyassin 5 years ago +1

    Thank you very much for the nice video. Is there a package that can beautify the email contents so that they appear in a formatted structure?
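
One option, as an editorial sketch (not shown in the video): the mailR package's `send.mail()` accepts `html = TRUE`, so the body can be an HTML fragment, and a data frame can be rendered into an HTML table with `knitr::kable()`. Addresses and the SMTP host below are placeholders.

```r
# Sketch: send a data frame as a formatted HTML table via mailR
library(mailR)
library(knitr)

body_html <- paste0(
  "<h3>Latest articles</h3>",
  kable(head(mtcars), format = "html")  # any data frame, rendered as an HTML table
)

send.mail(
  from = "me@example.com", to = "you@example.com",      # placeholders
  subject = "Formatted report",
  body = body_html,
  html = TRUE,                                          # render body as HTML
  smtp = list(host.name = "smtp.example.com", port = 25),
  send = TRUE
)
```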

  • @kolawolekushimo
    @kolawolekushimo 3 years ago

    If you are joining the datetimes, say when not all are visible, what are you supposed to join on?

  • @AbhijeetSinghs
    @AbhijeetSinghs 3 years ago

    Please make a video on clicking a button programmatically on a website using R for data extraction/scraping purposes.
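
For what it's worth, an editorial sketch of how this is usually done (not covered in the video): rvest alone cannot click buttons, since it never executes the page's JavaScript; RSelenium drives a real browser that can. The URL and CSS selector below are hypothetical.

```r
# Sketch: click a button with RSelenium, then hand the resulting HTML to rvest
library(RSelenium)

driver <- rsDriver(browser = "firefox", port = 4545L)
remote <- driver$client

remote$navigate("https://example.com")                 # hypothetical page
button <- remote$findElement(using = "css selector",
                             value = "button.load-more")  # hypothetical selector
button$clickElement()                                  # programmatic click

html <- remote$getPageSource()[[1]]  # page source after the click, for rvest
remote$close()
```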

  • @giuliko
    @giuliko 6 years ago +2

    What an awesome video! Congrats and keep up the hard work. Hope to see more web scraping videos from you. Great, great video. Thanks a lot.

    • @rebeccamerrett6536
      @rebeccamerrett6536 6 years ago +2

      Thanks, Giuliko! Glad you found it useful. Part 2 is yet to come! Soon!

    • @giuliko
      @giuliko 6 years ago +2

      @@rebeccamerrett6536 I'm looking forward to watching it. You are by far my favorite R channel on TH-cam. Thanks a lot once again.

    • @rebeccamerrett6536
      @rebeccamerrett6536 6 years ago +1

      @@giuliko Thank you! It means a lot, and encourages us to keep going :)

  • @shilpasuresh641
    @shilpasuresh641 4 years ago

    Hi, I have 52,000 URLs and I need to create a search engine so that when users search for their question they get an answer. How do I do that? I even have a JSON file. This should be done using R. If yes, I can be in touch with you about this.

  • @vitordeholandajo156
    @vitordeholandajo156 5 years ago +2

    Amazing job.

  • @svaughn8891
    @svaughn8891 4 years ago

    Hi, I like your video.
    I copied the code from your code repository, but I get an error at:
    > # Create a dataframe containing the urls of the web
    > # pages and their converted datetimes
    > marketwatch_webpgs_datetimes

    • @svaughn8891
      @svaughn8891 4 years ago

      I went back through your video and at 5:01 there are some lines that create the urls on screen:
      urls <- … %>%
        html_nodes("div.searchresult a") %>% # See HTML source code for data within this tag
        html_attr("href")
      However, these are not in the current version of r_web_scraping_coded_example_share.R in your code repository.
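
For readers following along: the snippet above lost its assignment target when the page was exported, so here is a self-contained sketch of the same pattern. The search URL and the `marketwatch_wbpg` object name are assumptions for illustration, not the author's exact code.

```r
# Sketch: collect article URLs from a search-results page with rvest
library(rvest)

marketwatch_wbpg <- read_html(
  "https://www.marketwatch.com/search?q=bitcoin"  # hypothetical search URL
)

urls <- marketwatch_wbpg %>%
  html_nodes("div.searchresult a") %>%  # anchor tags inside each search result
  html_attr("href")                     # keep only the link targets
```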

  • @Austin-wh4yi
    @Austin-wh4yi 5 years ago +1

    Hi so when I run this marketwatch_webpgs_datetimes

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Hey there! This could likely be due to datetimes being tagged under "div.deemphasized span.invisible" during certain times of the day. I briefly went over this in the video, but to help simplify this, it is in the full script linked below the video (see code.datasciencedojo.com):
      # Grab all datetimes on the page
      datetime <- … %>%
        html_nodes("div.deemphasized span") %>%
        html_text()
      datetime
      # Filter datetimes that do not follow a consistent format
      datetime2 <- …
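
The reply's snippet also lost its assignment targets in export; a minimal sketch of the two steps it describes, assuming the page has been read into `marketwatch_wbpg` (the filtering regex here is an illustrative assumption, not the author's code):

```r
# Sketch: grab all datetime strings on the page, then keep only the
# consistently formatted ones
library(rvest)

datetime <- marketwatch_wbpg %>%
  html_nodes("div.deemphasized span") %>%  # spans holding the timestamps
  html_text()

# Keep only entries that look like "a.m."/"p.m." style times
datetime2 <- datetime[grepl("[ap]\\.m\\.", datetime)]
```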

    • @Austin-wh4yi
      @Austin-wh4yi 5 years ago +1

      @@Datasciencedojo thanks for the prompt and detailed answer.

  • @alisaja11
    @alisaja11 4 years ago

    Hi, thank you so much for the nice video. I am new to this field and this video is absolutely helpful for a beginner like me. However, when I run your code in the part that loops over titles and bodies, I get an error message saying that the article doesn't exist. Can you help me figure out what could be the cause?

    • @주해람-d1b
      @주해람-d1b 4 years ago

      I'm having the same problem :( Have you solved it by any chance? Thanks in advance.

  • @winnie_the_poohh
    @winnie_the_poohh 5 years ago

    When I run the code below, the new columns named Title and Body are not added to marketwatch_latest_data. Even when I copy your code and run it, it still does not work. What could be the problem?
    marketwatch_latest_data$Title <- …

    • @michellelai6529
      @michellelai6529 5 years ago

      Thanks for such a clear step-by-step tutorial. I've gotten quite far in, but have faced the same issue as Mickey, where
      names(marketwatch_latest_data) results in [1] "webPg" "DateTime" "DiffHours" only.
      Would you be able to help? Thank you in advance.

    • @Datasciencedojo
      @Datasciencedojo  5 years ago

      Hey folks! Glad you are following along :)
      Here's what could be happening with regard to your problem.
      It could be that it is not able to collect data that was published within an hour of whatever timeframe you have specified here:
      # Filter rows of the dataframe that contain
      # DiffHours of less than an hour
      marketwatch_latest_data <- …
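
The filter line above was also truncated in export; a sketch of what such a step typically looks like. The data frame and column names follow the video's output, but the `subset()` call itself is an assumption, not the author's exact code:

```r
# Sketch: keep only rows published within the past hour
marketwatch_latest_data <- subset(
  marketwatch_webpgs_datetimes,
  DiffHours < 1  # DiffHours = hours elapsed since the article's datetime
)
```

If nothing was published within that window, this filter yields zero rows, and any later step that adds Title and Body columns has nothing to attach them to — consistent with the symptom described above.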

  • @pratyushak4921
    @pratyushak4921 5 years ago

    I have tried to send the mail but it shows an authentication error. Any help?

    • @rebeccamerrett6536
      @rebeccamerrett6536 5 years ago

      Mind sharing the error message? Just checking, are you using Gmail?
      Sometimes Gmail blocks sign-ins from less secure apps. Enable 'Allow less secure apps' in your Gmail account. You might want to set up a separate email account for this so you don't compromise security on your personal Gmail account.
      Or you could try setting up SMTP in your Gmail account settings.
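
An editorial sketch of the sending step the thread is discussing, using the mailR package over Gmail's SMTP server (addresses and the password are placeholders; whether the video uses mailR specifically is an assumption):

```r
# Sketch: send the scraped summary over Gmail SMTP with mailR
library(mailR)

send.mail(
  from = "scraper.bot@gmail.com",     # hypothetical sender account
  to = "you@example.com",
  subject = "Latest MarketWatch articles",
  body = "See the scraped headlines attached below...",
  smtp = list(
    host.name = "smtp.gmail.com",
    port = 465,
    user.name = "scraper.bot@gmail.com",
    passwd = "app-password-here",     # use an app password, never your real one
    ssl = TRUE
  ),
  authenticate = TRUE,                # omitting this is a common cause of auth errors
  send = TRUE
)
```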

  • @maktech3936
    @maktech3936 6 years ago +3

    Her voice is soooooooooooooooooooooooooooooooooo pleasing..
    **cough cough
    I meant nice tutorial ❤️

  • @paulh1720
    @paulh1720 5 years ago +1

    thanks !!!!!!!

  • @ssisteluguharish1305
    @ssisteluguharish1305 4 years ago

    awesome

  • @samb.6425
    @samb.6425 4 years ago

    your way of speaking is very stressful