Fastest Python Web Scraper - Exploring Sessions, Multiprocessing, Multithreading, and Scrapy

  • Published on 8 Feb 2025
  • In this video, we will make a fast web scraper. We will begin with BeautifulSoup.
    🚀 The first script takes 128 seconds and after optimization, takes as little as 2.5 seconds.
    Finally, we will create a scrapy spider without optimization and see what kind of results we get.
    We will use BeautifulSoup, Requests, Sessions, Multithreading, Multiprocessing, and Scrapy.
    👩‍💻 Source Code: github.com/eup...
    You can jump to the sections you like:
    00:31 Scraper Objective
    00:44 Creating Scraper with Requests+BS4
    09:20 First Run
    10:07 Sessions
    13:58 Multiprocessing
    17:22 Multithreading
    22:36 Scrapy Without Optimization
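A minimal sketch of the sessions-plus-multithreading idea listed in the chapters above. It is not the code from the video: the names `fetch`, `scrape_all`, and the `factory` parameter are hypothetical, and giving each thread its own `requests.Session` is a conservative assumption rather than something the video prescribes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# One Session per worker thread, so each thread reuses its own connections.
_local = threading.local()


def get_session(factory=requests.Session):
    """Return this thread's session, creating it on first use."""
    if not hasattr(_local, "session"):
        _local.session = factory()
    return _local.session


def fetch(url, factory=requests.Session, timeout=10):
    """Download one page through the thread's pooled session."""
    response = get_session(factory).get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def scrape_all(urls, max_workers=8, factory=requests.Session):
    """Fetch all URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda url: fetch(url, factory), urls))
```

Calling `scrape_all(list_of_urls)` returns the page bodies as strings. The `factory` argument exists only so a stand-in session can be swapped in for testing without touching the network.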
    Related videos
    -------
    👩‍💻 Watch the Playlist to Learn the Basics of Scrapy: • Scrapy for Beginners
    👨‍💻Join all courses on my site: coderecode.com...
    ----------------------------------------------
    What is Web Scraping?
    In a nutshell: Web Scraping = Getting Data from Websites with Code
    What is Scrapy?
    Scrapy is a Python library that makes web scraping powerful, fast, and efficient.
    There are other libraries for web scraping too, such as BeautifulSoup. However, when it comes to power and flexibility, Scrapy is the strongest option.
    Why Learn Scrapy?
    Most powerful library for scraping
    Easy to master
    Cross-platform: doesn't matter which OS you are using
    Cloud-ready: Can be run on the cloud with a free account
    Most Important: You would be able to earn by taking up some of the web scraping gigs as a freelancer
    #scrapy #fast #beautifulsoup #multiprocessing #multithreading
    ~-~~-~~~-~~-~
    Please watch: "Making Scrapy Playwright fast and reliable"
    • How to make Scrapy Pla...
    ~-~~-~~~-~~-~

Comments • 49

  • @anamashraf8996 6 months ago

    Very well explained and structured video. I love the way you took us from no optimization all the way to Scrapy. Thank you for this video, it was very helpful!

    • @codeRECODE 6 months ago

      You're very welcome!

  • @codeRECODE 3 years ago +6

    Hello everyone. This time the text is smaller than in my other videos. How is the readability? Is it okay, or would larger be better?
    Looking forward to your comments.
    PS: Please subscribe and like (or dislike) this video 🙂

    • @tubelessHuma 3 years ago

      It is ok and readable. Thanks for your effort.👍

    • @dmitrymitrofanov3920 3 years ago

      Hello, everything is OK. Thank you for the video.

    • @gcu1 3 years ago

      I didn't even notice that the text was smaller. The readability is perfectly fine. And the content was terrific as usual. Yours is the best web scraping channel on YouTube (or anywhere else for that matter). I'm looking forward to the master class update! And one more thing... Scrapy FTW!!!!!!!!!!!

  • @ارمینمحمدجانی 3 years ago +1

    Hi Upendra, this is very useful, thanks a lot.

  • @arvininer 2 years ago

    Great video. Thank you!

  • @ataimebenson 2 years ago

    Thanks a lot for this video, it helped me solve a problem 💪🏿

  • @sheikhakbar2067 3 years ago +1

    Very helpful tutorial; thanks a lot!

    • @codeRECODE 3 years ago

      Glad it was helpful!

  • @bruce2790 3 years ago

    Keep up the good work, thanks for the video

    • @codeRECODE 3 years ago

      Thanks for watching!

  • @133839297 2 years ago

    Great teacher.

  • @zone66 2 years ago +3

    Hm, sadly Scrapy is single-threaded and Selenium is blocking if it's called within a spider, so the spiders will not execute concurrently (if they use Selenium instead of requests to resolve a URL). I wonder how it is possible to crawl that fast with Scrapy while also using Selenium for HTML rendering. Great video, by the way!

  • @roshanyadav4459 2 years ago

    😇 Now I have become a big fan of yours

  • @botsboss 2 years ago +2

    Moral of the video - Use scrapy.

  • @billygene589 2 years ago

    Wow! Awesome video. Would you please let me know if it is possible to perform both multiprocessing and multithreading at the same time?
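Combining the two is possible in principle. The sketch below is one way to do it under stated assumptions, not the approach from the video: `fetch`, `chunked`, `scrape_chunk`, and `scrape_all` are hypothetical names, and `fetch` is a stand-in for the real download-and-parse step.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into n_chunks slices, dealing them out round-robin."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def fetch(url):
    """Hypothetical stand-in for the real download-and-parse step."""
    return f"scraped {url}"


def scrape_chunk(urls, threads=8):
    """Runs inside one process: fan its URLs out over a thread pool."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, urls))


def scrape_all(urls, processes=4):
    """One process per chunk, several threads inside each process."""
    results = []
    with ProcessPoolExecutor(max_workers=processes) as pool:
        for part in pool.map(scrape_chunk, chunked(urls, processes)):
            results.extend(part)
    return results
```

`scrape_all` should be called from under an `if __name__ == "__main__":` guard, because worker processes may re-import the module. Results are grouped by chunk, so they do not follow the original URL order.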

  • @DittoRahmat 3 years ago +1

    Hi Upendra,
    Thanks for the tutorial.
    Can concurrent.futures be used to optimize a "while True" loop with an if/break at the end?
    I watched your tutorial and also did some googling, but couldn't find any example.
    Most of the examples are "for" loops or "while" loops with a predefined range.
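One possible pattern for an open-ended loop, sketched under assumptions: pages are requested in fixed-size batches, and the loop stops at the first empty page. `fetch_page` is a hypothetical stand-in that pretends the site has 23 pages; `scrape_until_empty` and its parameters are likewise illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import count


def fetch_page(page):
    """Hypothetical stand-in: returns the items on a page, [] past the end."""
    last_page = 23
    return [f"item-{page}-{i}" for i in range(3)] if page <= last_page else []


def scrape_until_empty(fetch=fetch_page, batch_size=10, max_workers=10):
    """Fetch pages 1, 2, 3, ... concurrently until one comes back empty."""
    items = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in count(1, batch_size):            # 1, 11, 21, ...
            pages = range(start, start + batch_size)
            for result in pool.map(fetch, pages):     # results arrive in page order
                if not result:                        # the old "if ...: break" test
                    return items
                items.extend(result)
```

The trade-off is a few wasted requests: up to `batch_size - 1` pages past the end may be fetched before the empty page is noticed.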

  • @chadGPT6969 1 year ago

    The video I currently need. Just curious, can you make Scrapy faster than that?

    • @codeRECODE 8 months ago

      Of course! There are many ways to increase the speed of scraping.

  • @surajoliver1 3 years ago +1

    Very helpful!

    • @codeRECODE 3 years ago

      Glad it was helpful!

  • @vijay123464 2 years ago

    Hi Sir,
    Awesome video.
    I am getting:
    "It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
    See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
    return cls(crawler)"
    Can you tell me about this, please?
    Also, please tell me where I will find the output file.

    • @codeRECODE 2 years ago

      You can ignore it. In fact, if you read the message carefully, you will notice "In other words, it is normal to get this warning"
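For readers who would rather silence the warning and also want a predictable output file, a settings fragment along these lines may help. It is a sketch under assumptions: it targets Scrapy 2.7 or later (before this setting was itself deprecated), and `output.json` is a hypothetical file name.

```python
# settings.py (fragment)

# Opt in to the newer request fingerprinting so the warning disappears.
# Assumption: Scrapy 2.7 or later, before the setting itself was deprecated.
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"

# Write scraped items to a file, relative to the directory the crawl starts from.
# "output.json" is a hypothetical file name.
FEEDS = {
    "output.json": {"format": "json", "overwrite": True},
}
```

Without a `FEEDS` setting, the output file is whatever is passed on the command line with `-o`, for example `scrapy crawl <spider> -o output.json`.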

  • @MohitAswani 2 years ago

    I encountered a scenario. While using the scraper_helper library to run a spider directly from a script in VS Code, I get the error below:
    "ImportError: attempted relative import with no known parent package"
    I have to import the items file inside the spider which is why it throws this error, any solutions for this?

    • @codeRECODE 2 years ago

      Double-check that the script where you are using scraper_helper is in the same directory as your scrapy.cfg file.

  • @hayathbasha4519 3 years ago

    Is it possible to automate a CLI using Scrapy?

    • @codeRECODE 3 years ago

      I am not sure I understand your question. Scrapy is for automating HTTP. If you want to run it from the CLI, you can do that.

    • @hayathbasha4519 3 years ago

      @codeRECODE Thanks for replying to my message.
      I have a scenario where I have to run some commands in the CLI to log in to a website. Once I have successfully logged in, it returns the website URL.
      I am trying to automate this scenario.

    • @codeRECODE 3 years ago

      Reading CLI arguments can be done with argparse.
      I have posted a video on this channel.
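A small sketch of reading CLI arguments with the standard-library `argparse` module. The `--username` and `--retries` options and the `main` function are hypothetical, chosen only to illustrate the login scenario described above; the real login step is left as a placeholder.

```python
import argparse


def build_parser():
    """Define the command-line options the script accepts."""
    parser = argparse.ArgumentParser(description="Log in and print the returned URL.")
    parser.add_argument("--username", required=True)
    parser.add_argument("--retries", type=int, default=3)
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and describe what would run."""
    args = build_parser().parse_args(argv)
    # Placeholder: the real login step would go here.
    return f"logging in as {args.username} (retries={args.retries})"
```

In a script, `print(main())` under an `if __name__ == "__main__":` guard would let it be run as `python login.py --username alice`, where `login.py` is again a hypothetical file name.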

  • @ataimebenson 3 years ago

    Do you have a video on how to implement multithreading in scrapy?

    • @codeRECODE 3 years ago +1

      Scrapy is multi threaded by default.

    • @ataimebenson 3 years ago

      @codeRECODE Is there a way to make it faster?
      There are 200,000 URLs to send a request to. That will take about 5 days with a download delay of 2 seconds. Is there a way to drastically reduce the number of days?

    • @codeRECODE 3 years ago

      As you are using DOWNLOAD_DELAY, it looks like the website is not allowing faster scraping. Otherwise I would have suggested increasing CONCURRENT_REQUESTS.
      In your case, it looks like using proxies is the only way.
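The "about 5 days" figure can be checked with a rough estimate. The sketch below assumes a deliberately simple model in which each concurrent slot completes one request every `seconds_per_request` seconds; `crawl_days` is a hypothetical helper, and the setting values are illustrative, not recommendations.

```python
def crawl_days(n_urls, seconds_per_request, concurrency=1):
    """Rough estimate: each slot completes one request per seconds_per_request."""
    seconds = n_urls * seconds_per_request / concurrency
    return seconds / 86_400  # seconds in a day


# 200,000 URLs, one request every 2 seconds, no parallelism.
print(round(crawl_days(200_000, 2), 2))  # 4.63

# Scrapy settings that control this trade-off (illustrative values only).
DOWNLOAD_DELAY = 0                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 32             # global cap on requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # cap per domain
```

With a delay in place, the per-site rate stays at roughly one request per delay period, which is why raising concurrency alone does not shorten a single-site crawl.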

    • @ataimebenson 3 years ago

      @codeRECODE I was trying to be on the safe side, that's why I used a download delay. I can remove the download delay and increase the concurrent requests.
      Which paid proxy provider do you recommend for rotating proxies with Scrapy?
      The easiest one to use with Scrapy.

    • @codeRECODE 3 years ago +1

      If your budget allows it, go with Zyte. These are the guys who made Scrapy. The second-best choice is ScraperAPI.
      Check my video on proxies to get an idea.

  • @ashish23555 3 years ago

    Sir, please make some videos on server-side development.

    • @codeRECODE 3 years ago +2

      If you mean web development, it is in the pipeline. It will take a couple of months, though.

  • @nelsongomez8547 3 years ago

    Hello friend, congratulations on such an excellent video.
    I have a problem, and I don't know whether it can be solved this way. I would appreciate your guidance.
    I am creating a web service with FastAPI. It has 2 endpoints that extract data from 2 websites:
    .... /demo1
    .... /demo2
    When I make a request to demo1 from Postman, for example, the browser opens, does the extraction, and everything works perfectly.
    But if I make a request to demo1 and immediately another to demo2, demo2 has to wait for demo1 to finish before it opens the browser and does the extraction.
    Can you please guide me on how to solve that?
    I hope you can help me.
    Greetings.

  • @CherifRahal 8 months ago

    Python is not multithreaded unfortunately

    • @codeRECODE 8 months ago +1

      Hey! Thanks for stopping by to comment.
      Well, Python does support multithreading, apart from the details of CPython's limitations. What I showed in the video works well for many tasks.
      The main point is that Scrapy can make web scraping super fast and easy, without needing to worry about all the threading and multiprocessing details.
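A small sketch of why threads still help for scraping in CPython: the global interpreter lock is released while a thread waits on I/O, so waits overlap. Here `time.sleep` stands in for a network call, and all function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(_):
    """Stand-in for a network call; sleeping releases the GIL like an I/O wait."""
    time.sleep(0.1)
    return "ok"


def run_serial(n=8):
    """Run the downloads one after another."""
    return [fake_download(i) for i in range(n)]


def run_threaded(n=8):
    """Run the same downloads on n threads so the waits overlap."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(fake_download, range(n)))


def timed(run):
    """Return how many seconds run() takes."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start
```

Comparing `timed(run_serial)` with `timed(run_threaded)` should show the threaded version finishing in roughly the time of a single wait. CPU-bound work, such as heavy parsing, would not see the same gain from threads.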

    • @CherifRahal 8 months ago

      @codeRECODE I understand, thank you. I have a quick question: should I learn only Scrapy, or do I also need to learn BeautifulSoup for web scraping? I just want to focus on something powerful.

    • @codeRECODE 8 months ago

      @CherifRahal Forget everything and focus on Scrapy!
      Anything you can do with BeautifulSoup, you can do with the Scrapy selector module, and Scrapy offers 100 more features!
      Start with my mini course:
      courses.coderecode.com/p/scrapy-crash-course?coupon_code=YTFREE&product_id=2412425
      Use the coupon code YTFREE to access it for free.
      After that, you can invest in any paid course for structured learning, or find more free videos posted here.