Excellent as always! I believe my web scraping performance has gotten much better since learning JavaScript. Understanding the async/await concept through JS promises was crucial for me.
John, I am using ScrapingBee synchronously to scrape 1,000 URLs (and growing), and it takes forever.
ScrapingBee and other proxy services allow for concurrent requests, and I also know you can do things async. A video on the difference would be great: why you would do one or the other, and how you would do both. Here are some questions:
1. Are concurrent processes just for the requests, or for the parsing as well? Does this impact writing to a CSV if you have multiple processes running at once?
Appreciate your content. I feel like my scraper is almost there in terms of scalability and efficiency, and I'm really excited.
(Although I probably need to implement a dataclass at some point)
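(For anyone with the same CSV question: below is a minimal sketch, assuming httpx and with placeholder URLs and a placeholder parser, of one common pattern. The requests run concurrently, but every parsed row is gathered first and the CSV is written once by a single writer at the end, which sidesteps concurrent-write problems entirely.)

```python
# A minimal sketch (not the video's code): fetch URLs concurrently with
# httpx + asyncio, parse each response, and write the CSV once at the end.
# URLS and parse_row() are placeholders for illustration.
import asyncio
import csv

import httpx

URLS = [f"https://example.com/page/{i}" for i in range(100)]  # placeholder

def parse_row(html: str) -> dict:
    # Placeholder parser -- swap in BeautifulSoup, selectolax, etc.
    return {"length": len(html)}

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    resp = await client.get(url)
    resp.raise_for_status()
    # Parsing is synchronous CPU work; for small pages doing it inline
    # inside the coroutine is fine.
    return parse_row(resp.text)

async def main() -> None:
    async with httpx.AsyncClient(timeout=30) as client:
        rows = await asyncio.gather(*(fetch(client, u) for u in URLS))
    # Single writer after all tasks finish -- no concurrent-write problem.
    with open("out.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["length"])
        writer.writeheader()
        writer.writerows(rows)

asyncio.run(main())
```

If the parsing itself gets heavy, it can be moved to a process pool while the requests stay async; for most scraping jobs the network wait dominates, so concurrency on the requests alone gives most of the speedup.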
What do you think of scraping the Google cache? It might speed things up too, since you don't have the JS stuff to download.
That's not something I've tried, actually. Interesting idea though!
Hey John, thanks for this video. I see you recommend httpx over requests for async: what about the AsyncHTMLSession from requests-html?
It went unmaintained for a while, so I moved away from it. It's got new maintainers now, so hopefully it gets a few issues fixed and comes back.
I would love a video showing async and threading when scraping with Playwright!
Can I use async too if the website has a rate limit? For example: 429 Too Many Requests.
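(A hedged sketch of one answer: yes, async still works against a rate-limited site if you cap how many requests are in flight and back off when a 429 comes back. The semaphore size, retry count, and URLs below are illustrative guesses, not recommendations.)

```python
# Sketch: cap concurrency with a semaphore and back off on HTTP 429.
import asyncio

import httpx

semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight (a guess)

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    async with semaphore:
        for attempt in range(5):
            resp = await client.get(url)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.text
            # Honour Retry-After if the server sends it, else back off
            # exponentially: 1s, 2s, 4s, ...
            delay = float(resp.headers.get("Retry-After", 2 ** attempt))
            await asyncio.sleep(delay)
        raise RuntimeError(f"still rate-limited after retries: {url}")

async def main() -> None:
    urls = ["https://example.com"] * 3  # placeholder URLs
    async with httpx.AsyncClient(timeout=30) as client:
        pages = await asyncio.gather(*(fetch(client, u) for u in urls))
    print(len(pages), "pages fetched")

asyncio.run(main())
```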
great video
Awesome!
Async code makes things messy. I love to keep my code class-based, and async is hard to handle that way. For speed I use threading, which works fine. If you have any video on async in a class structure, I would love to check it out.
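(No dedicated video in this thread, but as a rough illustration: async can live inside a class by holding the client on the instance and making the class an async context manager. The `Scraper` name and its methods below are invented for the sketch.)

```python
# Sketch: one way to keep async code in a class-based structure.
import asyncio

import httpx

class Scraper:
    async def __aenter__(self) -> "Scraper":
        # The client is created here so its lifetime matches the class.
        self.client = httpx.AsyncClient(timeout=30)
        return self

    async def __aexit__(self, *exc) -> None:
        await self.client.aclose()

    async def fetch(self, url: str) -> str:
        resp = await self.client.get(url)
        resp.raise_for_status()
        return resp.text

    async def scrape(self, urls: list[str]) -> list[str]:
        # Concurrency stays an implementation detail of the class.
        return await asyncio.gather(*(self.fetch(u) for u in urls))

async def main() -> None:
    async with Scraper() as s:
        pages = await s.scrape(["https://example.com"])  # placeholder URL
        print(len(pages[0]))

asyncio.run(main())
```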
I ran into an issue with aiohttp while requesting a bunch of URLs at the same time; I don't know if it's a problem on my end or if the server is just not happy with me. Putting a limit on how many TCP connections are made seems to have solved the issue. Anyway, I'm beginning to consider httpx as an alternative.
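(The fix described above is typically spelled like this in aiohttp: `TCPConnector` accepts a `limit` on total simultaneous connections and a `limit_per_host` for any single host. The numbers here are illustrative only.)

```python
# Sketch: cap aiohttp's simultaneous TCP connections via TCPConnector.
import asyncio

import aiohttp

async def main() -> None:
    # limit: total connections; limit_per_host: per single host.
    connector = aiohttp.TCPConnector(limit=20, limit_per_host=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://example.com") as resp:  # placeholder
            print(resp.status)

asyncio.run(main())
```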
I have a video coming soon that will help. I like aiohttp; I think it's unlikely that's the issue. HTTPX is good because you get a requests-like API for easy use, as well as the async capabilities when you want them.
Well, lucky me, excited for the video. aiohttp is working fine after the fix; maybe it was a server limit.
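(To illustrate the HTTPX point from the reply above: its top-level functions mirror the requests API for synchronous code, and `AsyncClient` exposes the same method names for async code. The URL is a placeholder.)

```python
# Sketch: the same library covers both styles.
import asyncio

import httpx

# Synchronous, drop-in requests style:
r = httpx.get("https://example.com")
print(r.status_code)

# Async, same method names on AsyncClient:
async def main() -> None:
    async with httpx.AsyncClient() as client:
        r = await client.get("https://example.com")
        print(r.status_code)

asyncio.run(main())
```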
Is it legal to scrape data from foreign countries? Making thousands of requests might crash their website 😅
Hhhhhhh
If it's a problem, they'll block you. If they don't block you, then do as you want; there's no law against collecting data at scale.