Thank you John! 🙏
Informative and it got me a couple of new ideas I want to try now! 💡😀
thanks, that's great!
Could you also make one tutorial for your code editor setup and the terminal? It looks really cool.
Yes, working on a setup video and neovim video!
@@JohnWatsonRooney thanks so much.
@@JohnWatsonRooney how's that video coming along?? 😊
@@bakerssebandeke6764 haha yeah... soon :)
Learned a lot in your video. Hope you come out with a neovim editor tutorial, thank you sir!
I am working on a neovim video - and thanks for watching!
@@JohnWatsonRooney Thanks, have a nice life!
Great video John, thank you. Very informative.
Thanks a lot, always informative. How would you then run the two scrapers concurrently? And how would you pattern match when scraping a lot of products (i.e. scrape all products on both sites, then create a product_dataframe with a price comparison, for example)?
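A minimal sketch of one way to do what this comment asks: run both scrapers at once with asyncio.gather, then merge the results by product name. The two scraper functions and prices below are made-up stand-ins; a real version would fetch and parse pages (e.g. with httpx.AsyncClient) instead of sleeping.

```python
import asyncio

# Hypothetical stand-ins for the two site scrapers - real ones would
# request and parse product pages instead of returning canned prices.
async def scrape_site_a():
    await asyncio.sleep(0.01)  # simulate network latency
    return {"strat": 599.0, "tele": 649.0}

async def scrape_site_b():
    await asyncio.sleep(0.01)
    return {"strat": 579.0, "tele": 699.0}

async def main():
    # gather runs both coroutines concurrently and waits for both results
    a, b = await asyncio.gather(scrape_site_a(), scrape_site_b())
    # pattern match on a shared key (the product name) and compare prices
    return [
        {"product": name, "site_a": a[name], "site_b": b[name],
         "cheapest": "site_a" if a[name] <= b[name] else "site_b"}
        for name in sorted(a.keys() & b.keys())
    ]

rows = asyncio.run(main())
for row in rows:
    print(row)
```

From here, `pandas.DataFrame(rows)` would give the comparison dataframe the comment describes.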
Hi,
I've asked this question on another video of yours as well, but asking here again in case you missed the other one:
@John I've been following you for a long time and watching all your scraping videos with Python. I have started to create a scraper, but the website is not allowing me access as it considers my script a bot. Though I have changed the user agent to the latest Chrome, the website still recognizes me as a bot. My question is: which combo should I use for scraping slightly complex JS/AJAX/bot-aware websites? People say that Selenium is good for that purpose, but you say that Selenium is not a good option nowadays as it is slow. So what do you suggest - which combo should I use that can fit many scenarios, if not all?
Looking forward!
Thanks.
Hi - it depends on the site, but generally I suggest trying: a) adding more headers as well as the user agent; b) trying Playwright/Selenium with the undetected driver; c) using proxies; d) a combination of all three. Beating some anti-bot protection can be tricky - it takes time to figure out what you need to do to comply
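For points a) and c), a sketch of what a fuller, browser-like header set plus a proxy might look like. The user agent string, proxy host, and credentials below are placeholders, not working values:

```python
# A more complete set of browser-like headers - many sites check for more
# than just the user agent (Accept, Accept-Language, Referer, etc.)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Referer": "https://www.google.com/",
}

# Placeholder proxy config - swap in a real proxy endpoint and credentials
proxies = {
    "http": "http://user:pass@proxy.example:8080",
    "https": "http://user:pass@proxy.example:8080",
}

# With requests, these would be used as:
#   requests.get(url, headers=headers, proxies=proxies)
print(sorted(headers))
```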
@@JohnWatsonRooney Normally it's Cloudflare that is the only hindrance. Where can I find detailed documentation for selectolax? I'm right now writing a scraper using cloudscraper (found in a comment answered by you) and it has bypassed Cloudflare. But I'm having trouble with selectolax right now, unable to find proper documentation. Is there any other fast alternative to selectolax that has a bigger community?
@@yawarvoice selectolax is just an HTML parser - the main one in the Python community is BeautifulSoup, you could give that a go
@@JohnWatsonRooney Got it. One last thing: which one would you prefer: 1) Selenium + BS, 2) Playwright + BS, or 3) Cloudscraper + BS?
Hello, I tried this but keep getting the AttributeError: 'NoneType' object has no attribute 'text'. I printed the text this response receives and it doesn't have the tag which shows up while inspecting the page
Thank you for this video! Works wonderfully with a particular item. But what if I want to get multiple items - say, news stories from a website? html.css_first(selector).text().strip() - css_first gets only the latest one. css_all doesn't work, and just html.css(selector) won't work either. Please help.
Thanks. html.css(selector) will return a list of all matching elements for the given selector, so we can loop through this and call .text() on each iteration to get the data
@@JohnWatsonRooney Thank you! Waiting for more videos! Take care!
Oops. Misspelled Thomann; better remake the video! 😉
Haha yeah- I have actually done that before!
Very good video as usual! Thank you! When is the ChatGPT video coming 🤔?
thanks! hmm, not a fan of ChatGPT, not sure I'll cover it
Nice work as always! Can you please make a video about how to scrape email addresses from a domain?
Great video as always! But how do you manage not to get banned by Amazon? I tried scraping a couple of years ago - it always detected my script as a robot and didn't give data.
Thanks! I’ve never had an issue with Amazon - I found that I usually just need a user agent and occasionally the language header and I’m good
@@JohnWatsonRooney thank you! I gotta give it a try!
Hi John, I have a question: can you guide me on how to scroll down a scrollable ul list in a section of the HTML with Playwright?
The vids are great.
When getting info from a site using Python, is the IP the same as this computer's, or does the script have its own different IP address? And the same with Scrapy - if I use Scrapy, is that IP address the same as this computer's?
Because some sites have blocks set up to prevent this kind of thing, and I don't want my IP banned forever.
Is there any way to bypass this so you don't get banned?
I have written the code but it will not print any results
One question: what is your IDE?
Neovim - it's a slightly modified version of chrisatmachine's basic IDE if you Google it
Thanks for the zz shortcut
It’s a good one I didn’t even know about until recently
How do you automate the captcha in Python?
Can you please post the code from your videos to a link below, or on GitHub, etc.? It would be so helpful
github.com/jhnwr/youtube - I am reorganizing my github but here it is
7:05 Yeah, but show me the ugly-as-sin CSS selectors/HTML. Those are the ones that give me the hardest time. Great vids! Thanks!
haha, yeah I understand. I'll include some more wonky stuff going forward
Hey man, 10x for your tuts.
I'm doing a lot of scraping. Lately I needed to get the logos of 20k e-commerce stores.
Imho it was an interesting task. Unfortunately only about 1/3 could be automated - I went with finding divs, class names, and image sources having 'logo' in them.
Maybe you did something like that before and have an interesting strategy?
hey, thanks. Interesting task, as you say. I would probably save the HTML for each into a document database like Mongo, and then test different patterns against each - it saves having to make loads of requests over and over. This way you could try different approaches and see which works, updating the database with the logo as you go. It's a theoretical approach, it would probably need revising as you go though
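A rough sketch of that cache-then-test workflow, using stdlib sqlite3 here as a stand-in for a document store like MongoDB. The stored HTML snippet and the logo regex are illustrative only:

```python
import re
import sqlite3

# One cached page per store - saved once, so patterns can be retried
# without making the requests again
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (store TEXT PRIMARY KEY, html TEXT, logo TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, NULL)",
    ("store-a", '<img class="site-logo" src="/img/logo.png">'),
)

# Try a candidate pattern against every cached page, updating the db
# with any logo found - swap the pattern and re-run to test other ideas
pattern = re.compile(r'<img[^>]+src="([^"]*logo[^"]*)"')
for store, page_html in conn.execute("SELECT store, html FROM pages").fetchall():
    match = pattern.search(page_html)
    if match:
        conn.execute("UPDATE pages SET logo = ? WHERE store = ?",
                     (match.group(1), store))

logo = conn.execute("SELECT logo FROM pages WHERE store = 'store-a'").fetchone()[0]
print(logo)
```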
@@JohnWatsonRooney yep, I skipped the db part and just used saved pages (played with filenames to get a correlation to the store identifier). Picking a strategy is the tricky part - every site chooses its own way to keep the logo, even on platforms like Shopify or WP :)
I'm trying to scrape addresses - zip code, city, state, etc. - from thousands of websites. How would you recommend I do this? I'm trying regular-expression stuff, but even then it pulls in other info.
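One way to cut down on the stray matches the comment above describes, for US-format addresses at least: anchor the regex on the city/state-abbreviation/ZIP shape, so lone five-digit numbers (order ids, prices) don't match. A rough sketch - the pattern and sample text are illustrative, not production-ready:

```python
import re

# Require "City, ST 12345" together rather than matching digits alone;
# handles US-style addresses only, and will still miss messier layouts
ADDRESS_RE = re.compile(
    r"(?P<city>[A-Z][A-Za-z .]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)"
)

text = "Visit us at 12 Main St, Springfield, IL 62704. Order #55512 ships free."
match = ADDRESS_RE.search(text)
print(match.groupdict())
```

Note that "55512" in the sample is ignored because it lacks the surrounding city/state context.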
Nice! Is this Neovim? Can you tell me how to get this editor with syntax highlighting, tabs, etc.? Thank you!
Yes it is! I am going to do a video on it, but if you Google "chrisatmachine basic IDE neovim" it's basically that
@@JohnWatsonRooney thx!
so coooool
Thanks !
The comment section be like:
Video: "How I survived dying"
Comments: the shirt looks good.
What I mean is everyone is asking about the IDE 😂
Haha yeah, I didn't think people would be that interested in it