Missed the intro!! :) You did it again, mate... solved a problem in such an easy way!
I had that issue [one of the scripts I sent in the email](code review)... After your video my script just got more efficient!! hehe
Glad it helped!
@@JohnWatsonRooney Very helpful indeed !!
Your two lines of code saved me 20 minutes :)
Man, I watched a few videos and read a section of a scraping book trying to figure this out; with this short video, it finally clicked, thank you! 1000 URLs went from a 5-6 minute job to about a minute tops. (My internet is not great, plus I am grabbing quite a bit of data.)
This is exactly what I was looking for. I am a Python newbie. You are awesome!!! Thank you so much!
Thank you!
Just completed some unit tests, you have increased my speeds by 2/3!!!!! Thanks for that!
Another great video and tips shared. Many thanks John !
Thanks Amine 👍
YouTube recommended this video to me just when I needed it, thanks, it's a really good video
That’s great I’m glad you found it useful!
I am amazed at how fast it is. I crawled 60,000 links in just 30 mins; before, it estimated 10 days. OMG, thanks!
Tested it on my Selenium scripts as well. Works like a charm!!! Kudos, John, and thank you :) ... Just came across this content and subscribed asap. FYI ;)
That’s great glad it helped!
Today I used it, with awesome results.
John, thanks a lot. It's a really, really excellent module for processing large numbers of records. Keep making videos of this kind. When can I see you live?
Thanks Sayyad - soon I hope! I will schedule it on YT as early as I can to give as many people the chance to watch
Hi John, thank you for the video. One question: do you know how to output the titles in the original order of the URLs list? Thank you
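A minimal sketch of one way to do that, assuming a get_title() function and a URL list similar to the video's (not the exact code): executor.map() yields results in the same order as its input, even though the underlying requests finish out of order.

```python
import concurrent.futures
import requests
from bs4 import BeautifulSoup

# hypothetical URL list; any iterable of page URLs works
urls = ["https://example.com/page1", "https://example.com/page2"]

def get_title(url):
    # fetch the page and pull out the <title> text
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return soup.title.text

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    titles = list(executor.map(get_title, urls))  # same order as urls

for url, title in zip(urls, titles):
    print(url, "->", title)
```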
Hey thanks, you helped me a lot. It went from 21 seconds to 1.5 seconds.
I have used this before, but it doesn't work well with async programming, and there is also the ProcessPoolExecutor class, which I didn't understand.
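For what ProcessPoolExecutor does, a hedged sketch (my own example, not from the video): it runs the function in separate processes rather than threads, which helps with CPU-bound work such as heavy parsing or number crunching, instead of the I/O-bound requests a ThreadPoolExecutor suits.

```python
import concurrent.futures

def count_primes(limit):
    # hypothetical CPU-heavy task that benefits from extra processes
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":  # guard required for process pools on Windows/macOS
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(count_primes, [50_000, 60_000, 70_000]))
    print(results)
```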
Hi John,
Thanks for the tutorial.
Can concurrent.futures be used to optimize a "while True" loop with an if-then-break at the end?
I saw your tutorial and also did some googling and couldn't find any example.
Most of the examples are 'for' loops or 'while' loops with a predefined range.
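One possible pattern, sketched under the assumption of a paginated endpoint and a fetch() helper (both hypothetical, not from the video): keep the "while True" loop for discovering work, submit each unit to the executor as you go, break on your condition, and collect the futures afterwards.

```python
import concurrent.futures
import requests

BASE = "https://example.com/items?page={}"  # hypothetical paginated endpoint

def fetch(url):
    return requests.get(url).status_code

futures = []
page = 1
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    while True:
        url = BASE.format(page)
        futures.append(executor.submit(fetch, url))  # queue work as it's found
        page += 1
        if page > 20:          # stand-in for whatever your real break condition is
            break
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
```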
I really really love your content John..
It has helped me a lot
Thank you!
IIRC the system measurement uses 100% to mean 1 cpu core, so that means it used a little bit more than one core on average during the run.
John thanks a lot. this looks really cool
Thanks, this helped me understand 😌. This is exactly what I needed.
Hi John, another excellent video man! I have a use for this, but one huge doubt. I made a list of links and I make a request in a loop over the complete list (8,000-12,000 links from the same server). It works one by one, but I need to keep the computer on for like 12 hours, since I sleep a random amount of time between requests so I don't overload the server while I'm getting the data. With this, is it possible to make all the requests in minutes? How is that possible? Doesn't the server block you or anything like that? I only get 4 lines of text from those links. --- Edited: sorry, I didn't see your answer below. It works for different sites. Thank you!
Very helpful content and to the point.. great, it speeds up my code.. thanks ❤️
thanks. understood the concept
Wow, that is an amazing tip. Thank you very much
You're very welcome!
Thank a lot for sharing.
Great video. What's happening under the hood? Is the speed dependent on the number of processor cores the machine has?
It uses a pool of threads rather than extra cores, so the speed isn't really tied to core count. The time spent waiting for one server response is used to send more requests.
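Roughly, a sketch of why core count isn't the bottleneck (using httpbin's delay endpoint purely as a slow test URL, not the video's code): ten workers can all be waiting on the network at once, so ten one-second responses take about one second in total rather than ten.

```python
import concurrent.futures
import requests

# ten copies of a URL that delays its reply by roughly one second
urls = ["https://httpbin.org/delay/1"] * 10

def fetch(url):
    return requests.get(url).status_code

# 10 workers: the waits overlap, so the whole batch takes about a second
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    print(list(executor.map(fetch, urls)))
```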
BROOOOOOOOOOOOOOOOOOOOOO!!!!!! YOOOOOOUUUUUUU DAAAAAAA MANNNNNNNNN!
Thanks John!
Thanks for the great content. However, I've tried this with a function that scrapes a lot of data and it doesn't work. Any explanation?
that was really helpful, thanks a lot.
Wouldn't this get you rate limited? My scraper for a site I like collecting data from is slow because I deliberately put a time buffer between requests to keep from getting rate limited.
Yes it absolutely can. You would want to combine this with some good proxies to avoid being blocked
Great Great and Awesome
Is concurrent.futures compatible with Scrapy or Selenium? If not, would BeautifulSoup be faster with this module?
It won't work with Selenium, no, as you need to load up the browser instance. Scrapy is inherently async anyway (Twisted reactor), and although I've never tested them side by side I'd expect it to be equally fast.
@@JohnWatsonRooney Alright John, thank you so much for all the help you give us!
Hi, how would you store the printed results in a list when the executor is run?
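A minimal sketch of one way, assuming a scrape() function standing in for the video's: keep the Future objects from executor.submit() and append each .result() as it completes. Note this collects in completion order, whereas executor.map() preserves input order.

```python
import concurrent.futures
import requests

def scrape(url):
    # stand-in for the video's scraping function
    return requests.get(url).status_code

urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(scrape, url) for url in urls]
    for future in concurrent.futures.as_completed(futures):
        results.append(future.result())  # completion order, not input order

print(results)
```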
This is great!! Thanks:)
I have one question.
For my scraping job, I need to scrape a day's worth of data from an API. The API accepts start_time_epoch and end_time_epoch as request inputs.
I have the script ready, but it is taking ~2 hours to complete the job. Now I'm thinking of parallelizing this job. Please note that the API rate-limits requests coming from a single IP.
So I'm planning to distribute over a cluster of 24 nodes, each scraping data for one hour. So basically I'll change the input requests:
1 -> start_time_hour1_epoch, end_time_hour1_epoch
2 -> start_time_hour2_epoch, end_time_hour2_epoch
.
.
.
24 -> start_time_hour24_epoch, end_time_hour24_epoch
What would be the most cost-effective way to accomplish this using any AWS service? These jobs are not super critical, so in case of failure I can just rerun them.
Any help appreciated.
Hello John, thanks a lot for your content. I want to ask how we can add tqdm to show the progress of the concurrent run. Can you help me with that?
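A hedged sketch of one common way to do it (scrape() and the URL list are stand-ins, not the video's code): wrap as_completed() in tqdm and pass the total, so the bar advances as each future finishes.

```python
import concurrent.futures
import requests
from tqdm import tqdm

def scrape(url):
    return requests.get(url).status_code

urls = ["https://example.com/page{}".format(i) for i in range(50)]  # hypothetical

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(scrape, url) for url in urls]
    # the bar ticks once per completed future
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        future.result()
```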
Is it possible to do much the same with Selenium, using find_element(By.XPATH, "").get_attribute("href")?
But a little question, John:
wouldn't this expose us to server refusals, "too many requests" errors, or problems like that?
Yes, unfortunately it does. But in some cases, like the one I showed, if we are scraping multiple URLs from different servers we can scrape much quicker. It would also work well with rotating proxies.
@@JohnWatsonRooney can you make a video regarding rotating proxies?
@@irfankalam509 I can (working on it now)
Nice this is very useful 👌🏻👌🏻👌🏻🙏🏻🙏🏻
Would be nice if you could make a video on requests-html with the async feature and compare the speed.
that's in the works!
Hey hi sir, just to make sure: it won't work if we are using pagination to get the URLs? I tried it with pagination, which first scrapes the URLs and then scrapes each one, and it's not working in that scenario. I just want to make sure whether I am doing something wrong or it is not supposed to work. Thanks.
Great video! I have a doubt. You have not passed the url argument while calling the transform function, yet the code still works. How?
Hi. I have a question.
On Windows, I can't append to the list...
So how can I append to a list on Windows?
Hiii sir, in my code one result is printed multiple times. How do I stop it from reprinting? Please reply.
Can you also explain the use of append(row[0]) instead of just append(row)?
So I get a list, not a list of lists
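If this is about the CSV-reading part, a small illustration (assuming a urls.csv with the URL in the first column): each row from csv.reader is itself a list, so appending row[0] builds a flat list of strings, while appending row would give a list of lists.

```python
import csv

urls = []
with open("urls.csv", newline="") as f:
    for row in csv.reader(f):
        urls.append(row[0])   # just the first column -> flat list of strings
        # urls.append(row)    # the whole row -> list of lists
```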
Hi, I have used Playwright to create a scraper, but it is taking too much time. I used Chromium to keep it lightweight, but my code is still taking too much time.
I created a script to track GPUs on Best Buy (personal use) and your video sped up my process 20x! I used a for loop and holy, it was slow.
How do you exit the script with concurrent.futures? I tried Ctrl-C; it stops, but it doesn't quite stop.
That's great. Not sure what you mean by it doesn't quite stop? That shortcut should terminate the program there and then.
Also check out my async videos too I think you’ll find them useful!
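One possible way to make Ctrl-C exit more cleanly (my own sketch, not the video's code; cancel_futures needs Python 3.9+): catch KeyboardInterrupt and shut the executor down without waiting, cancelling the futures that haven't started yet.

```python
import concurrent.futures
import requests

def fetch(url):
    return requests.get(url).status_code

urls = ["https://example.com/p{}".format(i) for i in range(100)]  # hypothetical

executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)
futures = [executor.submit(fetch, url) for url in urls]
try:
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
except KeyboardInterrupt:
    # drop queued work; threads already mid-request still finish their call
    executor.shutdown(wait=False, cancel_futures=True)
    raise
else:
    executor.shutdown()
```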
Is there something similar in Windows?
Thanks man, is it similar to multithreading??
Yes it uses threads, new video later today covers more like this!
Wondering if there is a way to also use a GPU alongside the CPU to improve performance.
Is it possible to use this with Scrapy as well?
Hey John! Thanks for your content. Really helpful stuff! I'm trying to download about 4 million images using threading and requests sessions. From my estimates, it should take about 30 hours. I was wondering if there is a way to create checkpoints for moments of bad connection, or for when the server temporarily blocks my session, and things like that, just so I don't have to start all over from scratch. Are there any recommendations in this sense? I appreciate any opinions on that.
Put the whole thing in a while loop and use try and except for each error
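A rough checkpointing sketch along those lines (my own approach, with hypothetical file names and URLs, not the video's code): log each successfully downloaded URL to a file and skip those entries on restart, so a crash or temporary block only costs the in-flight work.

```python
import concurrent.futures
import os
import requests

DONE_FILE = "done_urls.txt"  # hypothetical checkpoint file

def download(url):
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    filename = url.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(r.content)
    return url

urls = ["https://example.com/img/{}.jpg".format(i) for i in range(1000)]  # stand-in

# load already-completed URLs from the previous run, if any
done = set()
if os.path.exists(DONE_FILE):
    with open(DONE_FILE) as f:
        done = set(line.strip() for line in f)

with open(DONE_FILE, "a") as log, \
     concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(download, u) for u in urls if u not in done]
    for future in concurrent.futures.as_completed(futures):
        try:
            log.write(future.result() + "\n")  # checkpoint each success
        except requests.RequestException:
            pass  # failed ones are simply retried on the next run
```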
Go with scrapy and proxy middleware
cool work
Hello John, I hope you're very well.
John, I need some orientation. I have a service with FastAPI, and I'm using Selenium with Python too.
My code visits one page at a time, for each request.
How can I integrate concurrent.futures into my service? Remember, this executes for each request.
Do you have any examples?
I hope for your answer.
Regards, Nelson
Bro, can you make a tutorial like this video but for the case where we want to shut down the process, because maybe something errors?
Ctrl+C stops the process at any point.
@@mikesoertsz222 No bro, because in this case we must kill the child process, but since I'm really a newbie at Python, I don't know what I can do.
thanks man
Will this work with a headless browser?
Yes, I don't see why not, although I'd be wary of the amount of RAM several instances of a headless browser would use.
Is there any other solution?
Hi John and everyone..
I have around 27k URLs to visit on 3 websites (around 9k each), so while using the requests library would I need a dynamic IP? Will the websites block so many requests at once?
Please comment your opinions/experiences.
Yes, I'd recommend proxies; it's likely a single IP will get blocked. But 9k on each is perfectly doable.
@@JohnWatsonRooney Thank you john for your prompt reply
You forget that using concurrent.futures does not free memory...
and with large data or long-running tasks you'll run out of RAM 😮 and the process crashes.... as_completed or shutdown... has no effect 😢
You can test this simply with one task vs a for loop;
the result is that concurrent.futures will consume more RAM,
and that is the problem with concurrent.futures.
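For what it's worth, one workaround for that (a sketch under my own assumptions, not something the video covers): submit the work in bounded chunks so only a limited number of Future objects and results are alive at any time, which keeps memory roughly flat on very large jobs.

```python
import concurrent.futures
import itertools
import requests

def fetch(url):
    return requests.get(url).status_code

def url_stream():
    # hypothetical generator of a very large number of URLs
    for i in range(1_000_000):
        yield "https://example.com/item/{}".format(i)

CHUNK = 1000
urls = url_stream()
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    while True:
        chunk = list(itertools.islice(urls, CHUNK))
        if not chunk:
            break
        for status in executor.map(fetch, chunk):
            pass  # process/write out each result, then let it be freed
```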
what is the reason for having str() in requests.get(str(url))? Shouldn't requests.get(url) work fine?
Making sure the url is the right data type; if it were an integer, it would not work.
Web scraping is such a huge part of my job and the one time I tried this, it didn’t work 😪
bro scrape with http client
What's that? Elaborate a bit, thanks.
Thank you! It was quite helpful.