Thanks so much for this, Andrew!!! Cheers!!
Damn, I love how you choose your Tidy Tuesday content. Can't compliment it enough.
Thanks for the intro to rvest! The code as shown doesn't quite work correctly, though, since the get_text and get_link functions assign the same hardcoded link right at the beginning. I was able to get it to work just by deleting those lines - I got 6603 unique "staff members" this way compared to the 33 from this code. Thanks again for the video!
Good catch, I'll make sure to change it!
-Andrew
Hi Andrew, thank you so much for sharing the amazing content! I have a question with regard to identifying the total pages. In your tutorial, you went through a manual process, I wonder if there is any means to have R identify the total pages available? Because as the number of articles grows, you will have more pages than current available. Thanks again!
I think it depends on the webpage you are scraping. For example, appending page=all to the URL can sometimes retrieve all of the links on a single page. Another way would be entering a large page number and iterating through the pages with the safely function: the pages that have no content will return an error, but the mapped function will still iterate through them.
library(tidyverse)

tibble(page_num = 1:100) %>%
  mutate(page = paste0("fivethirtyeight.com/tag/slack-chat/", "page/", page_num, "/")) %>%
  mutate(links = map(page, safely(get_links))) %>% # safely() captures errors instead of stopping the iteration
  mutate(links = map(links, "result")) # keep each result element, drop the error element
If you are planning on scraping data that will keep being added to the website under new links, I recommend saving the links that have already been scraped and using an anti-join against the full link set when re-running the script, something like the sketch below. I know this isn't the most efficient way of web scraping, but I hope this helps!
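A rough sketch of that re-run workflow (scraped_links.csv, the link column name, and all_links are just example names; all_links stands for the unnested link results from the pipeline above):

library(tidyverse)

# Hypothetical file holding the links scraped on previous runs, one link per row
old_links <- read_csv("scraped_links.csv")

# Keep only the links that haven't been scraped before
new_links <- all_links %>%
  anti_join(old_links, by = "link")

# Save the combined link set so the next run only scrapes what's new
bind_rows(old_links, new_links) %>%
  write_csv("scraped_links.csv")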
-Andrew
@AndrewCouch Thank you so much for your quick reply, Andrew! I will check it out...
It works with this example, but with other examples the output shows {xml_nodeset (0)}.
How do I export this to CSV?
write.csv(data_slack_pages, "data_test.csv")
doesn't work.
Is anything in data_slack_pages nested? You may need to unnest a column first.
Example (replace nested_column with the name of the nested list column):
data_slack_pages %>%
  unnest(nested_column) %>%
  write.csv("data_test.csv")
@AndrewCouch Sorry for the slow reply. Worked a treat, thank you, great tutorial. It might help to slow down for newbies just a bit!
Don't you have to check whether they allow scraping first? There may be no need if there is an API.
Yes, in general you should look for a robots.txt file on the website or check whether there is an API. I advocate scraping what you need for personal projects, but for professional/work projects I do not scrape and instead purchase data from vendors.
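If you want to check from R, the robotstxt package (not covered in the video, just one option) can test whether a path is allowed for crawling:

library(robotstxt)

# TRUE means the default user agent ("*") is allowed to crawl this path
paths_allowed(
  paths = "/tag/slack-chat/",
  domain = "fivethirtyeight.com"
)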