PARALLEL and CONCURRENCY in Python for FAST Web Scraping

  • Published 15 Sep 2020
  • In this video I demo how using concurrent.futures can help you speed up your web scraping scripts. I show how long it takes to scrape 1000 urls with and without concurrent.futures and compare the times taken, with just a few lines of code. A minimal sketch of the pattern appears after this description.
    code: github.com/jhnwr/speedupscraping
    -------------------------------------
    twitter / jhnwr
    code editor code.visualstudio.com/
    WSL2 (linux on windows) docs.microsoft.com/en-us/wind...
    -------------------------------------
    Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
    mouse amzn.to/2SH1ssK
    27" monitor amzn.to/2GAH4r9
    24" monitor (vertical) amzn.to/3jIFamt
    dual monitor arm amzn.to/3lyFS6s
    microphone amzn.to/36TbaAW
    mic arm amzn.to/33NJI5v
    audio interface amzn.to/2FlnfU0
    keyboard amzn.to/2SKrjQA
    lights amzn.to/2GN7INg
    webcam amzn.to/2SJHopS
    camera amzn.to/3iVIJol
    gfx card amzn.to/2SKYraW
    ssd amzn.to/3lAjMAy
  • Science & Technology
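
    A minimal sketch of the pattern the video demonstrates (the full code is at github.com/jhnwr/speedupscraping; the urls list and the body of the transform function here are illustrative placeholders):

        import concurrent.futures
        import requests
        from bs4 import BeautifulSoup

        urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder list

        def transform(url):
            # Fetch one page and print its <title> text.
            r = requests.get(url, timeout=10)
            soup = BeautifulSoup(r.text, "html.parser")
            print(soup.title.text if soup.title else "no title")

        # executor.map() calls transform once per url, running several
        # requests at the same time on a pool of threads.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(transform, urls)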

Comments • 93

  • @ugurdev
    @ugurdev 3 years ago +6

    Man, I watched a few videos and read a section of a scraping book to figure this out; with this short video it finally clicked, thank you! 1000 URLs went from a 5-6 minute job to about a minute tops. (My internet is not great, plus I am grabbing quite a bit of data.)

  • @evanfonseka9068
    @evanfonseka9068 3 years ago +2

    Just completed some unit tests; you have increased my speeds by 2/3!!!!! Thanks for that!

  • @BringMe_Back
    @BringMe_Back 2 years ago +5

    Your two lines of code saved me 20 minutes :)

  • @camel4717
    @camel4717 3 years ago +12

    This is exactly what I was looking for. I am a Python newbie. You are awesome!!! Thank you so much!

  • @amineboutaghou4714
    @amineboutaghou4714 3 years ago +4

    Another great video and tips shared. Many thanks, John!

  • @vishvamnaik9935
    @vishvamnaik9935 3 years ago +1

    I really, really love your content, John.
    It has helped me a lot.
    Thank you!

  • @11hamma
    @11hamma 3 years ago +1

    John, thanks a lot. This looks really cool.

  • @davida99
    @davida99 3 years ago +1

    Thanks, this helped me understand 😌. This is exactly what I needed.

  • @mr.strange7002
    @mr.strange7002 2 years ago +1

    Very helpful content and to the point... great, it speeds up my code. Thanks ❤️

  • @ZENITH-07
    @ZENITH-07 9 months ago +1

    Thank you! It was quite helpful.

  • @hardik12361
    @hardik12361 3 years ago +7

    Missed the intro!! :) You did it again, mate... solved a problem in such an easy way!
    I had that issue [one of the scripts I sent in an email] (code review)... After your video my script just got more efficient!! hehe

  • @buxA57
    @buxA57 2 years ago +1

    YouTube recommended me this video just when I needed it. Thanks, it's a really good video.

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +2

      That’s great I’m glad you found it useful!

  • @pascal831
    @pascal831 10 months ago +1

    Thanks John!

  • @ThespecialOtaku
    @ThespecialOtaku 3 years ago +1

    That was really helpful, thanks a lot.

  • @sayyadsalman9132
    @sayyadsalman9132 3 years ago +2

    John, thanks a lot. It's a really excellent module for processing big batches of records. Keep making videos of this kind. When can I see you live?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +2

      Thanks Sayyad, soon I hope! I will schedule it on YT as early as I can to give as many people as possible the chance to watch.

  • @simeoneholanda6420
    @simeoneholanda6420 3 years ago +1

    Hey thanks, you helped me a lot. It went from 21 seconds to 1.5 seconds.

  • @shivamkumar-qp1jm
    @shivamkumar-qp1jm 1 year ago +1

    Today I used it and it's giving awesome results.

  • @srikanthkoltur6911
    @srikanthkoltur6911 1 year ago +1

    I am amazed at how fast it is. I crawled 60000 links in just 30 mins; before, it showed me 10 days. OMG, thanks!

  • @texodus_et6313
    @texodus_et6313 2 years ago +1

    Tested it on my Selenium scripts as well. Works like a charm!!! Kudos, John, and thank you :) ... Just came across this content and subscribed ASAP. FYI ;)

  • @highwaygroup2821
    @highwaygroup2821 2 years ago +1

    Thanks, understood the concept.

  • @nachoeigu
    @nachoeigu 2 years ago +1

    Wow, that is an amazing tip. Thank you very much

  • @siamtourist
    @siamtourist 2 years ago +1

    Thanks a lot for sharing.

  • @sinamobasheri3632
    @sinamobasheri3632 3 years ago +1

    Nice this is very useful 👌🏻👌🏻👌🏻🙏🏻🙏🏻

  • @lukerbs
    @lukerbs 3 years ago +5

    Hi John, thank you for the video. One question: do you know how to output the titles in the original order of the urls list? Thank you
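
    One way to keep the original order (a sketch, not from the video; the urls list and get_title helper are placeholders): executor.map() yields results in the same order as its input, unlike as_completed().

        import concurrent.futures
        import requests
        from bs4 import BeautifulSoup

        urls = ["https://example.com", "https://www.python.org"]  # placeholders

        def get_title(url):
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            return soup.title.text if soup.title else ""

        with concurrent.futures.ThreadPoolExecutor() as executor:
            titles = list(executor.map(get_title, urls))  # same order as urls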

  • @-__--__aaaa
    @-__--__aaaa 3 years ago +2

    cool work

  • @shebe3807
    @shebe3807 1 year ago +2

    Great Great and Awesome

  • @DittoRahmat
    @DittoRahmat 2 years ago +2

    Hi John,
    Thanks for the tutorial.
    Can concurrent.futures be used to optimize a "while True" loop with an if-then-break at the end?
    I saw your tutorial and also did some googling but couldn't find any example.
    Most of the examples are a 'for loop' or a 'while loop' with a predefined range.
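
    A rough sketch of one way to pair the pool with an open-ended loop (hypothetical paginated endpoint; not from the video): submit work in batches and stop when the break condition appears.

        import concurrent.futures
        import requests

        def fetch(page):
            # hypothetical paginated endpoint
            return requests.get(f"https://example.com/items?page={page}", timeout=10)

        results = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            page = 1
            while True:
                # fetch the next 10 pages concurrently
                batch = list(executor.map(fetch, range(page, page + 10)))
                results.extend(r for r in batch if r.ok)
                if any(not r.ok for r in batch):  # break condition: ran past the last page
                    break
                page += 10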

  • @TECH_KG
    @TECH_KG 2 years ago +1

    thanks man

  • @deanemarks8611
    @deanemarks8611 2 years ago +1

    BROOOOOOOOOOOOOOOOOOOOOO!!!!!! YOOOOOOUUUUUUU DAAAAAAA MANNNNNNNNN!

  • @MadaMediaproduction
    @MadaMediaproduction 3 years ago

    Hello John, thanks a lot for your content. I want to ask: how can we add tqdm to show the progress of the concurrent tasks? Can you help me with that?
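
    A common pattern for this (a sketch, assuming the tqdm package; not from the video) is to wrap as_completed() in tqdm so the bar ticks as each future finishes:

        import concurrent.futures
        import requests
        from tqdm import tqdm

        urls = ["https://example.com"] * 50  # placeholder

        def fetch(url):
            return requests.get(url, timeout=10).status_code

        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(fetch, u) for u in urls]
            # tqdm advances once each time a future completes
            for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
                future.result()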

  • @jonathanfriz4410
    @jonathanfriz4410 3 years ago +2

    Hi John, another excellent video man! I have a use for this, but one huge doubt. I made a list of links and I make a request in a loop over the complete list (8000-12000 links from the same server). It works one by one, but I need to keep the computer on for like 12 hours, since I sleep for a random time between requests in order not to overload the server while I'm getting the data. With this, is it possible to make all the requests in minutes? How is that possible, doesn't the server block you or anything like that? I only get 4 lines of text from those links. --- Edited: sorry, I didn't see your answer below. For different sites it works. Thank you!

  • @miguellopez7089
    @miguellopez7089 3 years ago

    Hi, how would you store the printed results in a list when the executor is run?
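
    A minimal sketch (placeholder urls and helper; not from the video): return values from the worker instead of printing, and collect what executor.map() yields.

        import concurrent.futures
        import requests

        urls = ["https://example.com", "https://www.python.org"]  # placeholders

        def get_status(url):
            return requests.get(url, timeout=10).status_code  # return instead of print

        with concurrent.futures.ThreadPoolExecutor() as executor:
            statuses = list(executor.map(get_status, urls))  # results collected in a list
        print(statuses)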

  • @ricardoamendoeira3800
    @ricardoamendoeira3800 3 years ago +1

    IIRC the system measurement uses 100% to mean 1 CPU core, so that means it used a little more than one core on average during the run.

  • @ktrades2898
    @ktrades2898 3 years ago

    Is there something similar in Windows?

  • @SachinGupta-dn7wt
    @SachinGupta-dn7wt 3 years ago

    Great video! I have a doubt: you have not passed the url argument when calling the transform function, yet the code still works. How?

  • @k2icc
    @k2icc 3 years ago

    Wondering if there is a way to use a GPU along with the CPU to improve performance.

  • @Zale370
    @Zale370 3 years ago +1

    Would be nice if you could make a video on requests-html with the async feature and compare the speed.

  • @ranu9376
    @ranu9376 2 years ago +1

    Great video. What's happening under the hood? Is the speed dependent on the number of processor cores the machine has?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      It's async, so still single thread. It uses the time whilst waiting for a server response to create more requests.

  • @rahalmehdiabdelaziz8121
    @rahalmehdiabdelaziz8121 3 years ago +1

    Thanks for the great content. However, I've tried this with a function scraping lots of data and it doesn't work. Any explanation?

  • @skateforlife3679
    @skateforlife3679 1 year ago

    Possible to do much the same with Selenium, using find_element(By.XPATH, "").get_attribute("href")?

  • @nelbn
    @nelbn 1 year ago

    Hey John! Thanks for your content. Really helpful stuff! I'm trying to download about 4 million images using threading and requests sessions. From my estimates, it should take about 30 hours. I was wondering if there is a way to create checkpoints for moments of bad connection, or if the server temporarily blocks my session, and stuff like that. Just so I don't have to start all over from scratch. Are there any recommendations in this sense? I appreciate any opinions on that.

    • @ahmaddeviix8146
      @ahmaddeviix8146 1 year ago

      Put the whole thing in a while loop and use try and except for each error.

    • @gshan994
      @gshan994 1 year ago

      Go with scrapy and proxy middleware
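
    A minimal sketch of the checkpoint idea from these replies (the file name and url list are placeholders): record each finished URL in a file, skip the recorded ones on restart, and leave failures unrecorded so a rerun retries them.

        import os
        import requests

        DONE_FILE = "downloaded.txt"                 # hypothetical checkpoint file
        image_urls = ["https://example.com/a.jpg"]   # placeholder list

        done = set()
        if os.path.exists(DONE_FILE):
            with open(DONE_FILE) as f:
                done = set(f.read().split())

        for url in image_urls:
            if url in done:
                continue  # finished before a crash/restart, skip it
            try:
                r = requests.get(url, timeout=30)
                r.raise_for_status()
                with open(url.rsplit("/", 1)[-1], "wb") as img:
                    img.write(r.content)
                with open(DONE_FILE, "a") as f:
                    f.write(url + "\n")  # checkpoint only after success
            except requests.RequestException:
                pass  # not checkpointed, so a rerun retries it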

  • @AlienZom
    @AlienZom 3 years ago +1

    I created a script to track GPUs on Best Buy (personal use) and your video sped up my process 20x! I used a for loop and holy, it was slow.
    How do you exit the script with concurrent.futures? Tried Ctrl-C; it stops but doesn't quite stop.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      That's great. Not sure what you mean by it doesn't quite stop? That shortcut should terminate the program there and then.

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +1

      Also check out my async videos, I think you'll find them useful!

  • @w33k3nd5
    @w33k3nd5 3 years ago

    Hey, hi sir. Just to make sure: it won't work if we are using pagination to get the urls? I tried it with pagination, which scrapes the urls first and then scrapes each one, and it's not working in that scenario. I just want to make sure whether I am doing something wrong or it is not supposed to work. Thanks.

  • @wajdanmahbub3580
    @wajdanmahbub3580 3 years ago +1

    Can you also explain the use of append(row[0]) instead of just append(row)?
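
    Presumably the urls are read from a CSV file; csv.reader yields each row as a list, so row[0] appends the URL string while row would append a one-item list. A sketch (urls.csv is a placeholder):

        import csv

        urls = []
        with open("urls.csv") as f:  # hypothetical input file
            for row in csv.reader(f):
                # each row is a list like ['https://...'], so row[0]
                # appends the URL string rather than the list wrapping it
                urls.append(row[0])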

  • @snopz
    @snopz 8 months ago +1

    I already used this before, but it doesn't work well with async programming. Also, there is the ProcessPoolExecutor class, which I didn't understand.
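
    For what it's worth, ProcessPoolExecutor has the same interface but runs workers in separate processes, which suits CPU-bound work (parsing, hashing) rather than I/O-bound waiting on HTTP responses. A minimal sketch (the digest function is a stand-in):

        import concurrent.futures
        import hashlib

        def digest(data):
            return hashlib.sha256(data).hexdigest()  # CPU-bound stand-in

        if __name__ == "__main__":  # required for process pools on Windows
            blobs = [b"page one", b"page two", b"page three"]
            with concurrent.futures.ProcessPoolExecutor() as executor:
                print(list(executor.map(digest, blobs)))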

  • @user-kg6st9kv6m
    @user-kg6st9kv6m 3 years ago

    Hi, I have a question.
    On Windows, I can't append to the list...
    How can I append to a list on Windows?

  • @daddy_eddy
    @daddy_eddy 2 years ago

    Thank you!
    Could you make a video about coinmarketcap?
    1. Get all the links and write them to a TXT file.
    2. Get the name, price, and rate of every coin (using our txt file).
    3. Write all the data to a JSON file.
    We need to make this process faster, because we have 10000 links.

  • @sinamobasheri3632
    @sinamobasheri3632 3 years ago +1

    But a little question, John:
    doesn't this expose us to server refusals, "too many requests" errors, or problems like that?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +2

      Yes, unfortunately it does. But in some cases, like the one I showed, if we are scraping multiple urls from different servers we can scrape much quicker. It would also work well with rotating proxies.

    • @irfankalam509
      @irfankalam509 3 years ago +1

      @@JohnWatsonRooney can you make a video regarding rotating proxies?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      @@irfankalam509 I can (working on it now)
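
    A minimal sketch of rotating proxies with requests (the proxy URLs are placeholders): cycle through a list so consecutive requests leave from different IPs.

        import itertools
        import requests

        PROXIES = [  # placeholder proxy endpoints
            "http://user:pass@proxy1.example.com:8000",
            "http://user:pass@proxy2.example.com:8000",
        ]
        proxy_pool = itertools.cycle(PROXIES)

        def fetch(url):
            proxy = next(proxy_pool)  # guard with a lock if called from many threads
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)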

  • @jonathanlee8162
    @jonathanlee8162 2 years ago

    Is it possible to use this with Scrapy as well?

  • @sassydesi7913
    @sassydesi7913 3 years ago +2

    This is great!! Thanks :)
    I have one question.
    For my scraping job, I need to scrape a day's worth of data from an API. The API accepts start_time_epoch and end_time_epoch in the input request.
    I have the script ready, but it is taking ~2 hours to complete the job. Now I'm thinking of parallelizing this job. Please note that the API rate-limits requests coming from a single IP.
    So I'm planning to distribute it over a cluster of 24 nodes, each scraping data for one hour. So basically I'll change the input requests:
    1 -> start_time_hour1_epoch, end_time_hour1_epoch
    2 -> start_time_hour2_epoch, end_time_hour2_epoch
    .
    .
    .
    24 -> start_time_hour24_epoch, end_time_hour24_epoch
    What would be the most cost-effective way to accomplish this using any AWS service? These jobs are not super critical, so in case of failure I can just rerun them.
    Any help appreciated.
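
    A minimal sketch of the hourly split (the epoch value is a placeholder): divide the day into 24 (start, end) windows and hand one to each node.

        HOUR = 3600
        day_start = 1_600_000_000  # placeholder start-of-day epoch

        # 24 (start, end) pairs, one per node
        windows = [(day_start + h * HOUR, day_start + (h + 1) * HOUR) for h in range(24)]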

  • @hypercortical7772
    @hypercortical7772 2 years ago +1

    Wouldn't this get you rate-limited? My scraper for a site I like collecting data from is slow because I deliberately put a time buffer between requests to keep from getting rate-limited.

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      Yes, it absolutely can. You would want to combine this with some good proxies to avoid being blocked.

  • @violence1371
    @violence1371 2 years ago +2

    Is concurrent.futures compatible with Scrapy or Selenium? If not, would BeautifulSoup be faster with this module?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +1

      It won't work with Selenium, as you need to load up the browser instance. Scrapy is inherently async anyway (Twisted reactor), and although I've never tested them side by side I'd expect it to be equally fast.

    • @violence1371
      @violence1371 2 years ago +1

      ​@@JohnWatsonRooney Alright John, thank you so much for all the help you give us!

  • @vishalprasadacoustic
    @vishalprasadacoustic 2 years ago +1

    Hi John and everyone,
    I have around 27k urls to visit across 3 websites (around 9k each). While using the requests library, would I need a dynamic IP? Will the websites block so many requests at once?
    Please comment with your opinions/experiences.

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago

      Yes, I'd recommend proxies; it's likely a single IP will get blocked. But 9k on each is perfectly doable.

    • @vishalprasadacoustic
      @vishalprasadacoustic 2 years ago +1

      @@JohnWatsonRooney Thank you, John, for your prompt reply.

  • @BringMe_Back
    @BringMe_Back 2 years ago +1

    Thanks man, is it similar to multithreading?

    • @JohnWatsonRooney
      @JohnWatsonRooney  2 years ago +1

      Yes it uses threads, new video later today covers more like this!

  • @Ecoute_AI
    @Ecoute_AI 1 year ago

    Hiii sir, in my code one result is printing multiple times. How do I stop it from reprinting? Plz reply.

  • @sumedhajagtap8319
    @sumedhajagtap8319 6 months ago

    Hi, I have used Playwright to create a scraper, but it is taking too much time. I used Chromium for lightweight use, but my code still takes too long.

  • @nelsongomez8547
    @nelsongomez8547 2 years ago

    Hello John, I hope you're very well.
    John, I need guidance. I have a service built with FastAPI, and I'm using Selenium with Python too.
    My code visits one page at a time, for each request.
    How can I integrate concurrent.futures into my service? Remember, this executes for each request.
    Do you have any examples?
    I hope for your answer.
    Regards, Nelson

  • @shadow_qa
    @shadow_qa 3 years ago +1

    Will this work on headless browser?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago

      Yes, I don't see why not, although I'd be wary of the amount of RAM several instances of a headless browser would use.

  • @sayidinaahmadalqososyi9770
    @sayidinaahmadalqososyi9770 3 years ago

    Bro, can you make a tutorial like this video, but for the case where we want to shut down the process, because maybe something errors?

    • @mikesoertsz222
      @mikesoertsz222 3 years ago

      Ctrl+C stops the process at any point.

    • @sayidinaahmadalqososyi9770
      @sayidinaahmadalqososyi9770 3 years ago

      @@mikesoertsz222 No bro, because in this case we must kill the child process, but since I'm really a newbie at Python, I don't know what I can do.

  • @draxler.a
    @draxler.a 10 months ago +1

    You forget that using concurrent futures does not free memory...
    With large data or a long-running task you'll run out of RAM 😮 and the process crashes... as_completed or shutdown... has no effect 😢.
    You can test this simply with one task vs a for loop: the concurrent.futures version will consume more RAM, and that is the problem with it.
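
    One way to keep memory bounded on long runs (a sketch, not a fix from the video; urls and fetch are placeholders): submit the work in fixed-size chunks so each chunk's futures and results can be garbage-collected before the next starts.

        import concurrent.futures
        import requests

        urls = ["https://example.com"] * 1000  # placeholder

        def fetch(url):
            return len(requests.get(url, timeout=10).content)

        def chunked(seq, size):
            for i in range(0, len(seq), size):
                yield seq[i:i + size]

        with concurrent.futures.ThreadPoolExecutor() as executor:
            for chunk in chunked(urls, 100):
                for size in executor.map(fetch, chunk):
                    pass  # handle each result, then let it be garbage collected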

  • @lakshkataria
    @lakshkataria 1 year ago

    Web scraping is such a huge part of my job and the one time I tried this, it didn’t work 😪

  • @wajdanmahbub3580
    @wajdanmahbub3580 3 years ago +1

    what is the reason for having str() in requests.get(str(url))? Shouldn't requests.get(url) work fine?

    • @JohnWatsonRooney
      @JohnWatsonRooney  3 years ago +2

      Making sure the url is the right data type; if it were an integer it would not work.

  • @-__--__aaaa
    @-__--__aaaa 3 years ago +3

    Bro, scrape with an http client.

    • @11hamma
      @11hamma 3 years ago +1

      What's that? Elaborate a bit, thanks.