PARALLEL and CONCURRENCY in Python for FAST Web Scraping

  • Published on 31 Dec 2024

Comments • 93

  • @hardik12361
    @hardik12361 3 years ago +7

    Missed the intro!! :) You did it again, mate... solved a problem in such an easy way!
    I had that issue [one of the scripts I sent in an email] (code review)... After your video my script just got more efficient!! hehe

  • @BringMe_Back
    @BringMe_Back 2 years ago +5

    Your two lines of code saved me 20 minutes :)

  • @ugurdev
    @ugurdev 3 years ago +6

    Man, I watched a few videos and read a section of a scraping book trying to figure this out; with this short video it finally clicked, thank you! 1000 URLs went from a 5-6 minute job to about a minute tops. (My internet is not great, plus I am grabbing quite a bit of data.)

  • @camel4717
    @camel4717 4 years ago +12

    This is exactly what I was looking for. I am a Python newbie. You are awesome!!! Thank you so much!

  • @evanfonseka9068
    @evanfonseka9068 3 years ago +2

    Just completed some unit tests; you have increased my speed by 2/3!!!!! Thanks for that!

  • @amineboutaghou4714
    @amineboutaghou4714 4 years ago +4

    Another great video and more tips shared. Many thanks, John!

  • @buxA57
    @buxA57 2 years ago +1

    YouTube recommended this video just when I needed it. Thanks, it's a really good video.

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +2

      That's great, I'm glad you found it useful!

  • @srikanthkoltur6911
    @srikanthkoltur6911 1 year ago +1

    I am amazed at how fast it is. I crawled 60,000 links in just 30 minutes; before, the estimate it showed me was 10 days. OMG, thanks!

  • @texodus_et6313
    @texodus_et6313 3 years ago +1

    Tested it on my Selenium scripts as well. Works like a charm!!! Kudos, John, and thank you :) Just came across this content and subscribed ASAP. FYI ;)

  • @shivamkumar-qp1jm
    @shivamkumar-qp1jm 1 year ago +1

    Used it today and it's giving awesome results.

  • @sayyadsalman9132
    @sayyadsalman9132 4 years ago +2

    John, thanks a lot. It's a really, really excellent module for processing big batches of records. Keep making videos of this kind. When can I see you live?

    • @JohnWatsonRooney
      @JohnWatsonRooney  4 years ago +2

      Thanks Sayyad, soon I hope! I will schedule it on YT as early as I can to give as many people as possible the chance to watch.

  • @lukerbs
    @lukerbs 3 years ago +5

    Hi John, thank you for the video. One question: do you know how to output the titles in the original order of the urls list? Thank you.
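
    For anyone wondering the same: executor.map, unlike as_completed, yields results in the order of the input iterable, so the titles come back matching the urls list. A minimal sketch under that assumption, with a hypothetical get_title helper and placeholder URLs:

      import requests
      from bs4 import BeautifulSoup
      from concurrent.futures import ThreadPoolExecutor

      urls = ["https://example.com", "https://example.org"]  # placeholder list

      def get_title(url):
          # Fetch the page and pull out its <title> text.
          soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
          return soup.title.get_text(strip=True) if soup.title else ""

      with ThreadPoolExecutor(max_workers=8) as executor:
          # map() preserves input order even though requests finish out of order.
          titles = list(executor.map(get_title, urls))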

  • @simeoneholanda6420
    @simeoneholanda6420 3 years ago +1

    Hey, thanks, you helped me a lot. It went from 21 seconds to 1.5 seconds.

  • @snopz
    @snopz 1 year ago +1

    I have used this before, but it doesn't mix well with async programming, and there is also the ProcessPoolExecutor class, which I didn't understand.
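
    On ProcessPoolExecutor: it exposes the same interface as ThreadPoolExecutor but runs each task in a separate process, which helps CPU-bound work (parsing, number crunching) where the GIL would throttle threads; for I/O-bound scraping, threads are usually the better fit. A rough sketch of the distinction, not from the video:

      from concurrent.futures import ProcessPoolExecutor

      def cpu_heavy(n):
          # CPU-bound: burns cycles, so threads would fight over the GIL.
          return sum(i * i for i in range(n))

      if __name__ == "__main__":  # required guard for process pools on Windows/macOS
          with ProcessPoolExecutor() as pool:
              print(list(pool.map(cpu_heavy, [10_000_000] * 8)))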

  • @DittoRahmat
    @DittoRahmat 3 years ago +2

    Hi John,
    Thanks for the tutorial.
    Can concurrent futures be used to optimize a "while True" loop with an "if ... break" at the end?
    I watched your tutorial and also did some googling and couldn't find any example.
    Most of the examples are a "for" loop or a "while" loop over a predefined range.
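
    One pattern that fits an open-ended loop is executor.submit, which, unlike map, doesn't need the full range up front: keep submitting until the break condition fires, then collect the futures. A minimal sketch, assuming a paginated site with a placeholder URL:

      import requests
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def fetch_page(n):
          return requests.get(f"https://example.com/page/{n}", timeout=10)

      futures, page = [], 1
      with ThreadPoolExecutor(max_workers=8) as executor:
          while True:
              futures.append(executor.submit(fetch_page, page))
              page += 1
              if page > 50:  # stand-in for the real break condition
                  break
          # Results arrive as each request completes.
          for future in as_completed(futures):
              print(future.result().status_code)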

  • @vishvamnaik9935
    @vishvamnaik9935 3 years ago +1

    I really, really love your content, John.
    It has helped me a lot.
    Thank you!

  • @ricardoamendoeira3800
    @ricardoamendoeira3800 3 years ago +1

    IIRC the system measurement uses 100% to mean one CPU core, so that means it used a little more than one core on average during the run.

  • @11hamma
    @11hamma 4 years ago +1

    John, thanks a lot. This looks really cool.

  • @davida99
    @davida99 3 years ago +1

    Thanks, this helped me understand 😌. This is exactly what I needed.

  • @jonathanfriz4410
    @jonathanfriz4410 4 years ago +2

    Hi John, another excellent video, man! I have a use for this, but one huge doubt. I built a list of links and make a request in a loop over the complete list (8000-12000 links on the same server). It works one by one, but I need to keep the computer on for about 12 hours, since I sleep a random time between requests so as not to overload the server while I'm getting the data. With this, is it possible to make all the requests in minutes? How is that possible; doesn't the server block you or anything like that? I only get 4 lines of text from those links. --- Edited: sorry, I didn't see your answer below. For different sites it works. Thank you!

  • @mr.strange7002
    @mr.strange7002 3 years ago +1

    Very helpful content and to the point... great, it sped up my code... thanks ❤️

  • @highwaygroup2821
    @highwaygroup2821 3 years ago +1

    Thanks, understood the concept.

  • @nachoeigu
    @nachoeigu 3 years ago +1

    Wow, that is an amazing tip. Thank you very much.

  • @siamtourist
    @siamtourist 3 years ago +1

    Thanks a lot for sharing.

  • @ranu9376
    @ranu9376 2 years ago +1

    Great video. What's happening under the hood? Is the speed dependent on the number of processor cores the machine has?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      It's concurrent I/O rather than parallel compute, so it still runs on effectively one core. It uses the time spent waiting for a server response to fire off more requests.
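
      For reference, that overlapping-waits idea is the core of the technique: a thread pool over a blocking HTTP client, where one thread's network wait is another thread's chance to run. A minimal sketch with placeholder URLs:

        import requests
        from concurrent.futures import ThreadPoolExecutor

        urls = [f"https://example.com/page/{i}" for i in range(1, 51)]

        def fetch(url):
            # Blocks on network I/O; while this thread waits,
            # the others are free to start their own requests.
            return url, requests.get(url, timeout=10).status_code

        with ThreadPoolExecutor(max_workers=16) as executor:
            for url, status in executor.map(fetch, urls):
                print(status, url)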

  • @deanemarks8611
    @deanemarks8611 3 years ago +1

    BROOOOOOOOOOOOOOOOOOOOOO!!!!!! YOOOOOOUUUUUUU DAAAAAAA MANNNNNNNNN!

  • @pascal831
    @pascal831 1 year ago +1

    Thanks John!

  • @rahalmehdiabdelaziz8121
    @rahalmehdiabdelaziz8121 3 years ago +1

    Thanks for the great content. However, I've tried this with a function that scrapes a lot of data and it doesn't work; any explanation?

  • @ThespecialOtaku
    @ThespecialOtaku 3 years ago +1

    That was really helpful, thanks a lot.

  • @hypercortical7772
    @hypercortical7772 3 years ago +1

    Wouldn't this get you rate limited? My scraper for a site I like collecting data from is slow because I deliberately put a time buffer between requests to keep from getting rate limited.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      Yes it absolutely can. You would want to combine this with some good proxies to avoid being blocked
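
      One way to keep a time buffer while still overlapping requests is to cap the pool size and pause inside each task, which bounds the request rate. A minimal sketch, not from the video, with placeholder URLs:

        import time
        import requests
        from concurrent.futures import ThreadPoolExecutor

        urls = ["https://example.com"] * 20  # placeholder targets

        def polite_fetch(url):
            r = requests.get(url, timeout=10)
            time.sleep(1)  # per-thread pause: 4 workers => at most ~4 requests/second
            return r.status_code

        with ThreadPoolExecutor(max_workers=4) as executor:
            print(list(executor.map(polite_fetch, urls)))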

  • @shebe3807
    @shebe3807 1 year ago +2

    Great Great and Awesome

  • @violence1371
    @violence1371 2 years ago +2

    Is concurrent.futures compatible with Scrapy or Selenium? If not, would BeautifulSoup be faster with this module?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +1

      It won't work with Selenium, as you need to load up the browser instance. Scrapy is inherently async anyway (Twisted reactor), and although I've never tested them side by side, I'd expect it to be equally fast.

    • @violence1371
      @violence1371 2 years ago +1

      @@JohnWatsonRooney Alright John, thank you so much for all the help you give us!

  • @miguellopez7089
    @miguellopez7089 3 years ago

    Hi, how would you store the printed results in a list when the executor is run?
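
    The usual way is to have the worker return the value instead of printing it, then gather what executor.map hands back. A minimal sketch, assuming the transform function and urls list from the video:

      from concurrent.futures import ThreadPoolExecutor

      with ThreadPoolExecutor() as executor:
          # Each return value lands in the list instead of being printed.
          results = list(executor.map(transform, urls))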

  • @sassydesi7913
    @sassydesi7913 3 years ago +2

    This is great!! Thanks :)
    I have one question.
    For my scraping job, I need to scrape a day's worth of data from an API. The API accepts start_time_epoch and end_time_epoch in the request.
    I have the script ready, but it is taking ~2 hours to complete the job. Now I'm thinking of parallelizing it. Please note that the API rate limits requests coming from a single IP.
    So I'm planning to distribute over a cluster of 24 nodes, each scraping data for one hour. So basically I'll change the input requests:
    1 -> start_time_hour1_epoch, end_time_hour1_epoch
    2 -> start_time_hour2_epoch, end_time_hour2_epoch
    .
    .
    .
    24 -> start_time_hour24_epoch, end_time_hour24_epoch
    What would be the most cost-effective way to accomplish this using an AWS service? These jobs are not super critical, so in case of failure I can just rerun them.
    Any help appreciated.

  • @MadaMediaproduction
    @MadaMediaproduction 4 years ago

    Hello John, thanks a lot for your content. I want to ask how we can add tqdm to show the progress of the concurrent run; can you help me with that?
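
    The usual recipe is to wrap as_completed in tqdm with the total set to the number of futures, so the bar ticks as each task finishes. A sketch, not from the video, assuming a fetch worker and urls list:

      from concurrent.futures import ThreadPoolExecutor, as_completed
      from tqdm import tqdm

      with ThreadPoolExecutor(max_workers=8) as executor:
          futures = [executor.submit(fetch, url) for url in urls]
          for future in tqdm(as_completed(futures), total=len(futures)):
              future.result()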

  • @Lapookie
    @Lapookie 1 year ago

    Is it possible to do much the same with Selenium, with find_element(By.XPATH, "").get_attribute("href")?

  • @sinamobasheri3632
    @sinamobasheri3632 4 years ago +1

    One little question, John:
    doesn't this run us into server refusals, "too many requests" errors, or problems like that?

    • @JohnWatsonRooney
      @JohnWatsonRooney  4 years ago +2

      Yes, unfortunately it does. But in some cases, like the one I showed where we are scraping multiple URLs from different servers, we can scrape much quicker. It would also work well with rotating proxies.

    • @irfankalam509
      @irfankalam509 4 years ago +1

      @@JohnWatsonRooney can you make a video regarding rotating proxies?

    • @JohnWatsonRooney
      @JohnWatsonRooney  4 years ago

      @@irfankalam509 I can (working on it now)
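
      For the impatient, the basic shape of the rotating-proxy idea mentioned above is just cycling through a proxy list, one per request. A minimal sketch with placeholder proxy addresses:

        import itertools
        import requests

        # Placeholder proxies; a real pool would come from a provider.
        proxy_pool = itertools.cycle([
            "http://user:pass@proxy1.example.com:8000",
            "http://user:pass@proxy2.example.com:8000",
        ])

        def fetch_via_proxy(url):
            proxy = next(proxy_pool)
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)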

  • @sinamobasheri3632
    @sinamobasheri3632 4 years ago +1

    Nice this is very useful 👌🏻👌🏻👌🏻🙏🏻🙏🏻

  • @Zale370
    @Zale370 3 years ago +1

    Would be nice if you could make a video on requests-html with the async feature and compare the speed.

  • @w33k3nd5
    @w33k3nd5 3 years ago

    Hey, hi sir. Just to make sure: it won't work if we are using pagination to get the URLs? I tried it with pagination, which first scrapes the URLs and then scrapes each one, and it's not working in that scenario. I just want to make sure whether I am doing something wrong or it is not supposed to work there. Thanks.

  • @SachinGupta-dn7wt
    @SachinGupta-dn7wt 3 years ago

    Great video! I have a doubt. You did not pass the url argument while calling the transform function, yet the code still works. How?
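
    Most likely because the function is handed to the executor uncalled, and the executor supplies each url itself. A sketch of the likely shape, assuming the video used executor.map:

      from concurrent.futures import ThreadPoolExecutor

      with ThreadPoolExecutor() as executor:
          # transform is passed as an object, not called with ();
          # the executor invokes transform(url) once per entry in urls.
          executor.map(transform, urls)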

  • @HWASEON-f6i
    @HWASEON-f6i 3 years ago

    Hi, I have a question.
    On Windows, I can't append to the list...
    How can I append to a list on Windows?

  • @Live_draw_today
    @Live_draw_today 2 years ago

    Hiii sir, in my code one result is printed multiple times. How do I stop it from reprinting? Please reply.

  • @wajdanmahbub3580
    @wajdanmahbub3580 3 years ago +1

    Can you also explain the use of append(row[0]) instead of just append(row)?
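
    For context: csv.reader yields each row as a list of column values, so a one-column file of URLs comes back as something like ['https://...'] per row, and row[0] unwraps the string. A sketch under that assumption:

      import csv

      urls = []
      with open("urls.csv") as f:
          for row in csv.reader(f):
              # row is a list like ['https://example.com'];
              # append(row) would store lists, append(row[0]) stores strings.
              urls.append(row[0])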

  • @sumedhajagtap8319
    @sumedhajagtap8319 11 months ago

    Hi, I have created a scraper using Playwright. I used Chromium to keep it lightweight, but my code is still taking too much time.

  • @AlienZom
    @AlienZom 3 years ago +1

    I created a script to track GPUs on Best Buy (personal use) and your video sped up my process 20x! I used a for loop and, holy, was it slow.
    How do you exit the script with concurrent.futures? I tried Ctrl-C; it stops, but doesn't quite stop.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      That's great. Not sure what you mean by it doesn't quite stop? That shortcut should terminate the program there and then.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      Also check out my async videos, I think you'll find them useful!

  • @ktrades2898
    @ktrades2898 3 years ago

    Is there something similar on Windows?

  • @BringMe_Back
    @BringMe_Back 2 years ago +1

    Thanks man, is this similar to multithreading?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +1

      Yes it uses threads, new video later today covers more like this!

  • @k2icc
    @k2icc 3 years ago

    Wondering if there is a way to use a GPU alongside the CPU to improve performance.

  • @jonathanlee8162
    @jonathanlee8162 2 years ago

    Is it possible to use this with Scrapy as well?

  • @nelbn
    @nelbn 2 years ago

    Hey John! Thanks for your content, really helpful stuff! I'm trying to download about 4 million images using threading and requests sessions. From my estimates, it should take about 30 hours. I was wondering if there is a way to create checkpoints for moments of bad connection, or if the server temporarily blocks my session, and things like that, just so I don't have to start over from scratch. Are there any recommendations in this sense? I'd appreciate any opinions.

    • @ahmaddeviix8146
      @ahmaddeviix8146 2 years ago

      Put the whole thing in a while loop and use try/except for each error.

    • @gshan994
      @gshan994 2 years ago

      Go with Scrapy and proxy middleware.
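
      A simple checkpoint on top of those suggestions: derive the output filename first and skip anything already on disk, so a rerun resumes where it left off. A minimal sketch, not a full solution:

        import os
        import requests

        def download(url, out_dir="images"):
            os.makedirs(out_dir, exist_ok=True)
            path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
            if os.path.exists(path):  # checkpoint: fetched on a previous run
                return path
            try:
                r = requests.get(url, timeout=30)
                r.raise_for_status()
                with open(path, "wb") as f:
                    f.write(r.content)
            except requests.RequestException:
                pass  # leave it missing; the next rerun retries it
            return path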

  • @-__--__aaaa
    @-__--__aaaa 4 years ago +2

    cool work

  • @nelsongomez8547
    @nelsongomez8547 2 years ago

    Hello John, I hope you're very well.
    John, I need orientation. I have a service built with FastAPI, and I'm using Selenium with Python too.
    My code visits one page at a time, for each request.
    How can I integrate concurrent.futures into my service? Remember, this executes for each request.
    Do you have any examples?
    I await your answer.
    Regards, Nelson

  • @sayidinaahmadalqososyi9770
    @sayidinaahmadalqososyi9770 4 years ago

    Bro, can you make a tutorial like this video, but for the case where we want to shut down the process, because maybe something errors?

    • @mikesoertsz222
      @mikesoertsz222 4 years ago

      Ctrl+C stops the process at any point.

    • @sayidinaahmadalqososyi9770
      @sayidinaahmadalqososyi9770 4 years ago

      @@mikesoertsz222 No bro, because in this case we must kill the child processes, but since I'm really a newbie in Python, I don't know what I can do.

  • @TECH_KG
    @TECH_KG 2 years ago +1

    thanks man

  • @shadow_qa
    @shadow_qa 4 years ago +1

    Will this work with a headless browser?

    • @JohnWatsonRooney
      @JohnWatsonRooney  4 years ago

      Yes, I don't see why not, although I'd be wary of the amount of RAM several instances of a headless browser would use.

  • @dataanalysiscourse785
    @dataanalysiscourse785 3 months ago

    Is there any other solution?

  • @vishalprasadacoustic
    @vishalprasadacoustic 2 years ago +1

    Hi John and everyone,
    I have around 27k URLs to visit across 3 websites (around 9k each). While using the requests library, would I need a dynamic IP? Will the websites block so many requests at once?
    Please comment with your opinions/experiences.

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      Yes, I'd recommend proxies; it's likely a single IP will get blocked. But 9k on each is perfectly doable.

    • @vishalprasadacoustic
      @vishalprasadacoustic 2 years ago +1

      @@JohnWatsonRooney Thank you, John, for your prompt reply.

  • @draxler.a
    @draxler.a 1 year ago +1

    You forget that using concurrent.futures does not free memory...
    With large data or a long-running task you'll run out of RAM 😮 and the process crashes... as_completed or shutdown... has no effect 😢
    You can test this simply with one task vs. a for loop:
    the concurrent.futures version will consume more RAM,
    and that is the problem with concurrent.futures.

  • @wajdanmahbub3580
    @wajdanmahbub3580 3 years ago +1

    What is the reason for having str() in requests.get(str(url))? Shouldn't requests.get(url) work fine?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +2

      It makes sure the url is the right data type; if it were an integer, it would not work.

  • @lakshkataria
    @lakshkataria 1 year ago

    Web scraping is such a huge part of my job and the one time I tried this, it didn’t work 😪

  • @-__--__aaaa
    @-__--__aaaa 4 years ago +3

    Bro, scrape with an HTTP client.

    • @11hamma
      @11hamma 4 years ago +1

      What's that? Elaborate a bit. Thanks.

  • @ZENITH-07
    @ZENITH-07 1 year ago +2

    Thank you! It was quite helpful.