How to Scrape JavaScript Websites with Scrapy and Playwright

  • Published on 15 Sep 2024
  • No page is out of reach! Using Scrapy and Playwright together, we get the best of both worlds: JavaScript rendering and Scrapy's data scraping capabilities. In this project I will show you how to get started with a basic scraper on a JavaScript-heavy website, using scrapy-playwright. By putting the headless browser in front of Scrapy to make the requests, we can render the page, and even wait for certain selectors to be visible, before the page DOM/HTML is returned and parsed with Scrapy.
    Doing it this way has many benefits: Scrapy items, item loaders, pipelines, and middleware are all available to us. There are a few drawbacks, however. Any web scraping that uses a real browser is inherently slower; this is something we can't avoid, because the method requires starting a browser to access the page. It does, however, give us access to sites that we would previously have had trouble scraping.
    github.com/scr...
    Support Me:
    Patreon: / johnwatsonrooney (NEW)
    Amazon UK: amzn.to/2OYuMwo
    Hosting: Digital Ocean: m.do.co/c/c7c9...
    Gear Used: jhnwr.com/gear/ (NEW)
    -------------------------------------
    Disclaimer: These are affiliate links and as an Amazon Associate I earn from qualifying purchases
    -------------------------------------
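
As a rough sketch of the setup the description refers to: scrapy-playwright is switched on in the project's settings.py by routing requests through its download handler and telling Scrapy to use Twisted's asyncio reactor. The values below follow the scrapy-playwright README as I understand it; treat this as an assumption about a typical project, not the exact file used in the video.

```python
# settings.py (sketch): enable scrapy-playwright for a Scrapy project.

# Send http/https requests through the Playwright-backed download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

Individual requests then opt in with meta={"playwright": True}; requests without that key are downloaded by Scrapy as usual.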

Comments • 169

  • @alexanderscott2456 2 years ago +20

    I started my first playwright project after constantly failing to extract json from an endpoint because of some graphql nonsense. My constant thought was "I sure wish I could integrate playwright this with scrapy." You and the algorithm gods have answered my prayers.

    • @JohnWatsonRooney 2 years ago +2

      That’s great I’m glad i could help!

    • @alexanderscott2456 2 years ago +8

      If it isn't too much trouble, would you mind eventually making a video on hidden json api endpoints that require some kind of cookie or header authentication? Thank you for all the invaluable content :)

    • @learncodeinbangla1852 2 years ago +2

      Nice video again!!

  • @tommifish322 2 years ago +17

    I've only just started scraping(lol) the surface of web scraping so alot of your content goes over my head but, your videos are really great and a complete gold mine to anyone who is trying to learn. Thank you!

  • @drac.96 2 years ago +6

    This YouTube channel is probably the only one with the best website crawling software and techniques I've seen! Thank you very much for the amazing content, John! You should make a course about this stuff, really useful.

  • @adnanpramudio6109 2 years ago +1

    I thought about scrapy + playwright as replacement of selenium and now you upload this. Thank you so much!

  • @realpropagandalf 2 years ago +1

    Hey John! It’s rarely that I comment on youtube videos, but I just must say that your content is golden. Keep it up!

    • @JohnWatsonRooney 2 years ago

      Thanks, appreciate it. Cool user name ha

  • @ruhollahmozafari2343 2 years ago +1

    I have just started using scrapy for crawling, you're videos are very helpful. 👍

  • @SamirMamude 2 years ago +1

    Hi, currently I'm working with crawling in my job, your videos is helping me alot!

  • @celerystalk390 2 years ago +2

    Thanks so much for introducing another great tool! Definitely worth learning after Selenium/Helium. Great job again John!

    • @JohnWatsonRooney 2 years ago

      My pleasure! Glad you enjoyed it

    • @ZainAli-hq1gu 2 years ago

      @JohnWatsonRooney kindly push code on github. thanks

  • @CodePhiles 2 years ago +1

    new great library that helps for dynamic pages, thanks a lot John

  • @automationhungry3617 2 years ago +1

    Great Video Man. Want to see more videos Scrapy with Playwright

  • @gianfrancodagostino3938 2 years ago +2

    Awesome video, very well explained. Definitely worth the time. Pure gold. Thank you.

  • @spotshot7023 1 year ago +3

    Hi John, I am trying to run the exact same code in my Windows machine which you showed here but I am getting lot of errors like "AttributeError: 'PipeTransport' object has no attribute '_output'" and "AttributeError: 'ScrapyPlaywrightDownloadHandler' object has no attribute 'browser_type'". I have done the exact same setting like you did. Kindly help me. Thanks

    • @user-dc8pe5er8o 1 year ago +3

      Have you figured out a solution to this problem? I am having the same issue.

    • @hamzahbhatti2273 10 months ago +1

      Do you find the solution I am also encounter with same problem on windows

  • @jensshumway3652 2 years ago +2

    is anyone else getting this error: AttributeError: 'PipeTransport' object has no attribute '_output'

  • @samibdh 2 years ago +1

    That was exactly what i was looking for thank you ! (splash wasn't able to load javascript)

  • @napeters7069 2 years ago +1

    Hello John, when I go to run this the script it seems to hang, in the cmd console:
    [asyncio] DEBUG: Using proactor: IocpProactor
    [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    I am on win10 machine, using python 3.7.10 and running everything from Anaconda virtual env. Any idea what might be the issue?

  • @vasugupta7265 2 years ago +3

    ERROR: AttributeError: 'PipeTransport' object has no attribute '_output' , same code, can you fix this please?

    • @hamzahbhatti2273 10 months ago

      Are you using windows? Because I am also encountering this error. And after searching it was found that it works good in Linux. Windows is not compatible.

  • @dennistanui7085 2 years ago +4

    Awesome video! Could you also make a video about scraping websites that make repetitive calls to an api and then use javascript to format the json response (i.e making direct calls to the api returns gibberish json values). Thanks a lot mate.

  • @dcevansuk 2 years ago

    Many thanks, as always clear and concise,
    It will be interesting to see how we handle PageCoroutine when loading parent and child pages with different 'wait_for_selector' values.

  • @melih.a 2 years ago

    This video is great! now I got to figure out how to customise this for a login page.

  • @SaMi-se2qs 1 year ago +1

    Hi John....I got an error when i run the spider.
    AttributeError: 'PipeTransport' object has no attribute '_output'
    Please tell me How i can handle this error'?

  • @hamzahbhatti2273 10 months ago +1

    Hi, After following the whole steps I am encountering the error (AttributeError: 'PipeTransport' object has no attribute '_output') and (exception=NotImplementedError()>)

  • @ChristianRevil 19 days ago

    Hey, John maybe you can help me with which will the best to use on my project. I want to scrape on tables from different website more or less 10 websites and from that tables I will compare all each table from what data that I have.

  • @tubelessHuma 2 years ago +1

    Playwright making scraping life easy. Great 💖

  • @HP-wo8kv 4 months ago

    I have started scrapy crawling in my windows with all the things installed, but getting NotImplementedError,AttributeError

  • @sivaranjjan2491 1 year ago

    Hi John, I need to click a "show more" button while scrolling down.
    page.evaluate("window.scrollTo(0, document.body.scrollHeight)") scrolls to the bottom of the page, but the "show more" button is not in the viewport, so Playwright can't find the button. Any solution? I thought of clicking the "show more" button until no more "show more" button is available, then getting the full page_content and storing it as the response.
    page_content = page.content()
    response = HTML(html=page_content)
    Now I can use the response to get the data.
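
The "click until the button is gone" idea in this comment can be sketched as a small async helper. It assumes `page` behaves like a Playwright async Page (`query_selector` returning an element handle or None, and the handle offering `scroll_into_view_if_needed` and `click`). The helper name, the selector and the pause length are hypothetical and would need adjusting for a real site.

```python
import asyncio


async def click_until_gone(page, selector: str, max_clicks: int = 50,
                           pause_seconds: float = 0.5) -> int:
    """Click the element matching `selector` until it is no longer in the DOM.

    `page` is expected to behave like a Playwright async Page.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        button = await page.query_selector(selector)
        if button is None:
            break  # no "show more" button left: everything is loaded
        # Bring the button into the viewport before clicking it.
        await button.scroll_into_view_if_needed()
        await button.click()
        clicks += 1
        # Give the page a moment to append the new results.
        await asyncio.sleep(pause_seconds)
    return clicks
```

After the loop finishes, `await page.content()` would return the fully expanded HTML for parsing. The `max_clicks` cap is there so that a button that never disappears cannot loop forever.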

  • @greoipsec7258 2 years ago +2

    Hi John, thank you for the videos, it helped me alot! I am a bit stuck at the moment with the JS website. How can I do the "callback" to go to the next page when I have 2 functions now? I have tried to run them in a while loop but with little result. How would you do it on this example if it would have multiple pages?

  • @Scuurpro 2 years ago +2

    I'm new to python and scrapy. Following your tutorials in just 3 days I've been able to build and get a much better understanding of scrapy and python. My current site has a pagination that is in javascript it my understanding that I'll need to use splash or playwright. Which one would you recommend for a beginner?

    • @JohnWatsonRooney 2 years ago +1

      That’s great! I’d recommend playwright first

    • @Scuurpro 2 years ago

      @JohnWatsonRooney I do have one question: I'm using a CrawlSpider. Do I put def start_requests after my allowed_domains and start_urls, or before?

  • @franke3562 2 years ago

    Curious: Dynamic websites (SPA and the like) are served by an API (likely a JSON API) to populate their dynamic content. Why not consume this API directly to extract the data we need? I don’t see the use case of rendering pages through a “virtual” browser first, only to then scrape data (that was provided by some network request / API anyways) again by means of CSS selectors and the like. Seems inefficient and much slower. Am I missing something?

  • @learncodeinbangla1852 2 years ago +1

    Nice video. I am very glad to see it. It was not known to me .
    Thank you very much!!!

    • @learncodeinbangla1852 2 years ago

      Could you please show us a tutorial how to submit form and login. And how to click page and pagination. All the detail about page coroutine.

  • @chandrakalagowda3129 10 months ago +1

    Thank you. This video. saved my day!

  • @ShahidulsPerspective 2 years ago

    Hey John, I don't know why it is not working for me. The coroutine is not working for me. What should I do? Does anyone face any problems? How did you solve it?

  • @mahmoudkhair-eldin6814 2 years ago +1

    If I'm having a Scrapy issue, I can count on JWR having a video how to solve it! I'm currently trying to scrape a website with multiple pages and running into this exact issue. In that case, would I use link extractor first, then have playwright open each request pulled by link extractor?

    • @JohnWatsonRooney 2 years ago

      Yes that would work. This just fits in the normal request/response flow which is why it’s so useful

  • @FabioRBelotto 9 months ago

    Does it help to bypass anti bot measures like Amazon has? It's impossible to use scrapy on it anymore

  • @lostfsoul 2 years ago +1

    Perfect ! Thanks John .

  • @joschabisping7910 2 years ago +1

    Your videos are awesome, thanks!

  • @gaifut 4 months ago

    How can I use playwright to click a button with scrapy still scraping? I have a website that I can scrap but on the last page there is a button that should be clicked in order to show more results and it is possible to click it like 5 times to get the extras. I was able to use playwright to get the code for clicking but I dont know how to combine that with scrapy. Was thinking since this is only for the last page to make an if statement and either run playwright there or somehow enable playwritish scrapy there. In any case I have no idea how to let in scrapy scrap that generated page. Can anyone please help?

  • @ataimebenson 2 years ago +1

    This is awesome. Thanks for the video

  • @vintonchen6210 2 years ago +1

    unfortunately it doesn’t work on Windows because of Twisted’s compatibility issue, any fix? thanks as always John

    • @JohnWatsonRooney 2 years ago

      Oh doesn’t it? Sorry I didn’t realise I’ve used either Linux or WSL in windows for years for development

  • @Ligthus 1 year ago +2

    "AttributeError: 'PipeTransport' object has no attribute '_output'"

  • @msakhmat 2 years ago +1

    What about if I want to install playwright on a separate server, can I do that and use the same setup you did?

    • @JohnWatsonRooney 2 years ago

      Yes you can, I haven’t done it in a while I think there is more setup required though

  • @stephenellis4777 8 months ago +1

    This tutorial doesn't work for those of us that use Windows! You should have stated that from the very beginning. I am getting attribute error when trying to run the spider!

  • @akhmadfaizal7792 2 years ago

    when i try on scrapy crawl pwspider -o output.json i can't gain element based on url. what happen that?

  • @LukeGarbuttGaming 2 years ago

    Great video, just discovered your channel and went on a binge.
    I think I've covered the core of your main methods but I could be missing some - how would you go about scraping odds information from somewhere like williamhill? They seem to have an api but I can't figure it out. Would this require playwright as in this video here? It appears so to me, just curious if I'm barking up the wrong tree or not.

  • @jeroenvermunt3372 2 years ago +1

    I need to click on a button on the webpage which basically means: "read more ..." . Would playwright be the suitable tool?

    • @JohnWatsonRooney 2 years ago

      Yes it can absolutely do that. It would be worth check what happens network wise when you click it though, often you don’t need to do that at all and can make the same request without playwright. Check out my video on hidden APIs (best scraping method)

    • @jeroenvermunt3372 2 years ago

      @JohnWatsonRooney thanks for the suggestion, I saw the video on the API but there doesn't seem to be anything I can use.

  • @thewheeldeal8439 2 years ago

    Can one update the response after performing a playwright coroutine? Or do you always have to load a new page via callback?

  • @kabirsainivlogs 2 years ago +1

    Scrapy playwright takes an average time of 6 seconds to scrape a website. Is there any way to speed things up?

    • @JohnWatsonRooney 2 years ago +1

      I only would use this as a last resort, I would explore other methods first, html scraping, reverse engineering the API. This method is good where time isn't a requirement

  • @berionikkk8582 2 years ago

    Thank you John for great work! Could you help me please? I'm scraping a site about books. Some books have a short description, the others a long one hidden under the "expand" button. If I use PageCoroutine (click and wait_for_selector) in meta on long description pages it works well. But on pages without "expand" button I get an error. I don't know how to solve this problem.

  • @rapterkingofthebrozone7490 2 years ago

    Hello! I was just watching your video from about a year ago on scraping shopify stores. I was curious if there was a way to find a total number of products in the shop so that I can set the limit for a python script to pull all the product information at once?

  • @heqlatax8690 2 years ago

    hey ! i made a function to login to a page, then it returns me a session (or HTML session, as I want)
    But, i can't get anything from my session because this website uses JS. When I try to render and print r.html.html, it returns me to the login page, even if I am already logged in. do you have any idea of what I should do ? Thanks a lot !

  • @alphabelta6092 3 months ago

    Thank you very much for this video. I followed you, but I got this error: AttributeError: 'PipeTransport' object has no attribute '_output'. Does anyone have the solutions?

  • @azhari7968 1 year ago +1

    hey this doesn't work on windows :(

  • @kacheck855 2 years ago +1

    Thank you John. Unfortunately, not working on my MacBook. Unsupported URL scheme 'https': No module named 'scrapy_playwright'. Maybe it's the problem of the M1 chip, I can run it on other platform.

    • @JohnWatsonRooney 2 years ago

      Hey did you try pip installing scrapy-playwright?

    • @BoerBell 2 years ago +2

      You have not reported whether John's reply helped. I encountered the same issue when working with Pycharm in a project venv. Resolved by having PyCharm install the package (preferences/project interpreter).
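
The "No module named 'scrapy_playwright'" message in this thread usually points at the package being installed into a different interpreter or virtual environment from the one that runs Scrapy, which matches the PyCharm fix described above. A minimal check, using only the standard library, is sketched below; the helper name is hypothetical.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is running and whether it can see the package.
    print("interpreter:", sys.executable)
    print("scrapy_playwright importable:", is_installed("scrapy_playwright"))
```

Running this with the same interpreter that launches `scrapy crawl` shows whether `pip install scrapy-playwright` landed in the right environment.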

  • @FranzAllanSee 1 year ago

    So what's scrapy for? Sounds like 98% of the heavy lifting was done by playwright. Why not just drop the middleman? 😅

  • @KhalilYasser 2 years ago +2

    Thank you very much. At 356, I tested the code after I applied all the steps correctly but got an error.

    • @KhalilYasser 2 years ago +2

      The error is like this: `Traceback (most recent call last):
      File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 197, in _run_module_as_main
      return _run_code(code, main_globals, None,
      File "C:\Users\Future\AppData\Local\Programs\Python\Python39\lib\runpy.py", line 87, in _run_code
      exec(code, run_globals)
      File "C:\Users\Future\Desktop\venv\Scripts\scrapy.exe\__main__.py", line 7, in
      File "C:\Users\Future\Desktop\venv\lib\site-packages\scrapy\cmdline.py", line 144, in execute
      cmd.crawler_process = CrawlerProcess(settings)
      File "C:\Users\Future\Desktop\venv\lib\site-packages\scrapy\crawler.py", line 280, in __init__
      super().__init__(settings)
      File "C:\Users\Future\Desktop\venv\lib\site-packages\scrapy\crawler.py", line 156, in __init__
      self._handle_twisted_reactor()
      File "C:\Users\Future\Desktop\venv\lib\site-packages\scrapy\crawler.py", line 343, in _handle_twisted_reactor
      install_reactor(self.settings["TWISTED_REACTOR"], self.settings["ASYNCIO_EVENT_LOOP"])
      File "C:\Users\Future\Desktop\venv\lib\site-packages\scrapy\utils\reactor.py", line 66, in install_reactor
      asyncioreactor.install(eventloop=event_loop)
      File "C:\Users\Future\Desktop\venv\lib\site-packages\twisted\internet\asyncioreactor.py", line 308, in install
      reactor = AsyncioSelectorReactor(eventloop)
      File "C:\Users\Future\Desktop\venv\lib\site-packages\twisted\internet\asyncioreactor.py", line 63, in __init__
      raise TypeError(
      TypeError: ProactorEventLoop is not supported, got: `

    • @Rodrigodacostarodrigocostaful 2 years ago

      @KhalilYasser me too

  • @marvio_rocha 1 year ago

    Men, I loved your tutorial, very simple to make it. Way did you use only CSS that the XPath? Cheers from Brazil

  • @anhuynh2689 2 years ago

    hello sir, i want to scrape a webpage that have a list of data place in rows and, at every rows (that containt a link) i want to click it to open the popup page, scrape the data inside the popup and then close the page and go ahead with the next row for the scraping, can you teach me how to do, cause i'm a rookie and got stuck for so long. thanks !!

  • @cosmicblack 2 years ago

    Great video. I just started web scraping; I was using Selenium and I'm going to try Playwright.
    I just have a question: I'm searching for the meta options and I can't find anything related to playwright_include_page in the Playwright or scrapy-playwright documentation.
    Where do I get the possible meta options?
    Are they the methods for pages and locators from Playwright?
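
On the question of where the meta options come from: as far as I know, keys such as `playwright` and `playwright_include_page` are defined by scrapy-playwright itself (its README lists them), while the method names given to a page method, for example "wait_for_selector", are methods of Playwright's Page object. The helper below is a hypothetical illustration of how the common keys fit together; it only builds a plain dict and is not part of either library.

```python
def playwright_meta(include_page: bool = False, page_methods=None) -> dict:
    """Build the `meta` dict for a scrapy.Request handled by scrapy-playwright.

    Hypothetical helper for illustration; only the dict keys come from
    scrapy-playwright.
    """
    meta = {"playwright": True}  # route this request through the browser
    if include_page:
        # Expose the Playwright page to the callback
        # as response.meta["playwright_page"].
        meta["playwright_include_page"] = True
    if page_methods:
        # A list of scrapy_playwright.page.PageMethod objects (called
        # PageCoroutine in older releases) that run on the page before
        # the response is returned to Scrapy.
        meta["playwright_page_methods"] = list(page_methods)
    return meta
```

A request would then be created as `scrapy.Request(url, meta=playwright_meta(include_page=True))`.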

  • @alqam2011 2 years ago +1

    Thank you so much for this tutorial. However when I try to use playwright on windows it's not working I googled it and tried multiple solutions, still not-working with this error message (AttributeError: 'ScrapyPlaywrightDownloadHandler' object has no attribute 'browser_type')
    please if anyone can help me!

    • @JohnWatsonRooney 2 years ago +1

      I've heard from another viewer that playwright has problems on windows but i don't have a solution i'm afraid, maybe checking out their github to see issues might help

    • @alqam2011 2 years ago +1

      @JohnWatsonRooney thank you so much I guess I'll have to find alternative for JavaScript websites other than Scrapy

  • @dragon3602010 2 years ago +2

    Can we use playwright in the headless mode to false with scrapy?

    • @JohnWatsonRooney 2 years ago +1

      I didn’t actually try that! I would expect yes and we can pass it into the meta arguments at the top

    • @georgemuholi6396 2 years ago +3

      Yes you can, using the setting PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}
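
In context, the setting from this reply would sit in settings.py roughly as sketched below. `PLAYWRIGHT_BROWSER_TYPE` and the `slow_mo` launch option are additions of mine for illustration (both exist in scrapy-playwright and Playwright respectively, to my knowledge); only the `headless` entry comes from the comment.

```python
# settings.py (sketch): run the browser with a visible window for debugging.

# Which Playwright engine to launch: "chromium", "firefox" or "webkit".
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Passed as keyword arguments to Playwright's browser launch call.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,  # open a real browser window
    "slow_mo": 250,     # optional: slow each Playwright operation by 250 ms
}
```

Because these are project settings, they apply to the whole crawl; they are not passed per request in meta.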

  • @pkavenger9990 1 year ago +1

    this looks easier and better than splash.

  • @pranit449 2 years ago

    Hello John, can you please make a video on handling javascript alerts (like asking location, clicking allow etc in browser). Can't figure out with selenium or playwright. Thank you very much

  • @cleo0318 2 years ago +1

    With these libraries is it still possible to get your ip banned/blocked?

    • @JohnWatsonRooney 2 years ago

      Yes it is, it does depend a lot on the sites protection and how many requests you are making

  • @georgemuholi6396 2 years ago

    Thanks for the great video. Am having a challenge integrating rotating proxies with scrapy playwright. How can I go about it?
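
One possible starting point for the proxy question, offered as an assumption and not a tested recipe: Playwright's browser launch accepts a `proxy` option, so a single proxy can be set through the launch options in settings.py. The server address and credentials below are placeholders, and this does not rotate proxies by itself.

```python
# settings.py (sketch): send all browser traffic through one proxy.
# The server, username and password are hypothetical placeholders.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://proxy.example.com:8000",
        "username": "proxy-user",
        "password": "proxy-pass",
    },
}
```

Many commercial providers expose a single gateway address that rotates the outgoing IP on their side, in which case one `server` entry like this may be all that is needed.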

  • @drac.96 2 years ago

    How do you deal with the shadow root inside of a webpage? Any tricks to getting through them with Playwright/Puppeteer? Which software work better or worse in these cases? Thanks!
    I know that in-browser you can do `$('#nav-element').shadowRoot()` or similar which also works in Puppeteer, using `await page.addScriptTag({path: "jquery.js"})` to add JQuery if it isn't already included on the page (sometimes it isn't) but JS only.

  • @raisulislam4161 2 years ago

    @John Watson Rooney how can I manage to set "headless = False"?

  • @David-rm1wn 2 years ago +3

    thanks for the video...Your videos are always great..I tried to run as you showed. I got this error...Any idea what went wonrg?
    TypeError: SelectorEventLoop required, instead got:

    • @Rodrigodacostarodrigocostaful 2 years ago +1

      me too

    • @adamdavies1956 2 years ago +1

      Did you manage to fix this? I'm having same problem

    • @oktayozkan2256 2 years ago +1

      scrapy-playwright is not working on windows. you can try it on linux.

    • @valostudent6074 2 years ago

      @oktayozkan2256 me too, windows dont supported

    • @oktayozkan2256 2 years ago

      @oktayozkan2256 replying to @valostudent6074: if the problem persists, wsl (windows subsystem for linux) could be a good solution.
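
The ProactorEventLoop and SelectorEventLoop errors reported in this thread concern the kind of asyncio event loop in use. On Windows, Python 3.8 and later create a ProactorEventLoop by default, which is the loop type the tracebacks above show Twisted's asyncio reactor rejecting. The sketch below is a diagnostic only, using the standard library; it shows what the current interpreter would create and does not fix anything. Newer scrapy-playwright releases may behave differently on Windows, so its README is worth checking.

```python
import asyncio
import sys


def default_loop_name() -> str:
    """Return the class name of the event loop asyncio creates by default."""
    loop = asyncio.new_event_loop()
    try:
        return type(loop).__name__
    finally:
        loop.close()  # do not leave an unused loop open


if __name__ == "__main__":
    print("platform:", sys.platform)
    print("default asyncio loop:", default_loop_name())
```

Inside WSL or on Linux the reported loop is a selector-based one, which is consistent with the commenters' experience that the same code runs there.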

  • @diegovargas3853 2 years ago +1

    Hi John, what's your opinion about playwright vs splash?

    • @JohnWatsonRooney 2 years ago

      Hey, the end goal is the same but they go about it in slightly different ways. Splash is Specifically designed for rendering pages but requires a bit more setup, while the scrapy playwright integration is newer and is self contained rather than a separate service. There are use cases for both but right now I’d lean to playwright, certainly for personal projects

  • @janpost8598 1 year ago +1

    Nice headset haircut. 😆

  • @maggiekay1 2 years ago +1

    Dude I love your canal, you let me learn a new method to replace selenium (again), I thought Splash is the alternative, so the question is which performance is better? And I would love to see more advanced scrapy projects (like scraping some social media website)

    • @JohnWatsonRooney 2 years ago

      Thanks! I would use playwright over splash right now for general projects - it works well and is easy to use and install (no docker needed)

    • @maggiekay1 2 years ago +1

      @JohnWatsonRooney yeah, definitely. But I noticed that framework donot work on Windows, so sad !

  • @drac.96 2 years ago

    How do we re-use browser cookie information to interact with webpage JSON APIs? That would be super useful instead of using the browser to parse HTML for PWA sites. You should create a blog or start up some Udemy courses on stuff like this!

  • @RicRod 1 year ago +1

    Hi, does anyone know what environment or what version of python he handles in his video, I have tried this and other tutorials and I am getting a couple of complicated errors and I think it is because of the version of python and some components.

    • @JohnWatsonRooney 1 year ago +1

      hey its Python 3.9.7 with a virtual environment using venv. I've done a similar thing to this recently on 3.10 without issues as well

    • @RicRod 1 year ago +1

      @JohnWatsonRooney thank you very much for the answer, I was working with version 3.8 and I did not know whether to downgrade to 3.7 or move to 3.10 at once, have a nice day or night.

    • @JohnWatsonRooney 1 year ago +1

      @RicRod no worries, try 3.11 if you can it has some improvements in it

    • @mecrayavcin 1 year ago

      @JohnWatsonRooney This does not work on windows with vscode + anaconda. Did you use Mac pc or Linux pc for this? Thanks

  • @dragon3602010 2 years ago +3

    Awesome can you make a video about playwright stealth for automation

  • @jonas230ph 2 years ago

    Thanks so much for introducing another great tool!, would want to know if playwright able authenticate login and can possible pass data on scrapy for scraping?

  • @amithreddy93 1 year ago

    I am getting few errors, like notimplemented error and attribute error

    • @amithreddy93 1 year ago

      AttributeError: 'PipeTransport' object has no attribute '_output'

  • @johnmarkacala3098 2 years ago +1

    what theme are you using?

    • @JohnWatsonRooney 2 years ago

      I think this is gruvbox material, or gruvbox dark medium

  • @FabioRBelotto 9 months ago

    I am quite a noob. I couldn't understand the advantages of using scrapy instead of just playwright.

  • @MrTASGER 2 years ago

    Async browser in async scraper.
    Great! 👍

  • @ajaniyeseer5947 7 months ago

    You videos are golden!
    Please I will like if can make a video on how to add proxy from company like bright Data to scrapy_playwright project .Thank you

  • @hamzaehsankhan 2 months ago

    Great stuff

  • @raisulislam4161 2 years ago

    Hello John,
    Why I am getting this error?
    "TypeError: ProactorEventLoop is not supported, got: "

    • @adamdavies1956 2 years ago +1

      Did you manage to fix this? I'm having same problem too

    • @raisulislam4161 2 years ago

      @adamdavies1956 Scrapy-Playwright now only works on Linux/ Ubuntu. I setup ubuntu then I managed to run this

    • @thecouchman2112 2 years ago

      @raisulislam4161 Hey Raisul, how do you know it only works on Windows?

  • @StephenForder 1 year ago +1

    Thanks for the excellent tutorials. I'm battling to deploy a spider which uses scrapy-playwright to my scrapyd service. Am I being too ambitious?

    • @JohnWatsonRooney 1 year ago +1

      It’s not something I’ve done before, there might be issues running the actual browser - is that what’s getting stuck?

    • @StephenForder 1 year ago +1

      @JohnWatsonRooney Yup exactly. Runs fine with 'scrapy crawl' command, but 'curl ....' gets stuck at the playwright part. Maybe rely on cron for now?

    • @JohnWatsonRooney 1 year ago +1

      @StephenForder I have limited experience with scrapyd I'm afraid, maybe yeah just run it on cron

    • @StephenForder 1 year ago

      @JohnWatsonRooney thanks John :)

    • @StephenForder 1 year ago +1

      I'm not sure what went wrong on my first server setup attempt, but on my second attempt, scrapy-playwright and scrapyd are playing fine together on Ubuntu 22.04 👍

  • @SecurityTalent 2 years ago +1

    Great bro..

  • @learncodeinbangla1852 2 years ago

    Could you please show us a tutorial how to submit form and login. And how to click page and pagination. All the detail about page coroutine.

  • @return_1101 2 years ago +1

    Good video. You are awesome!

  • @rentman1740 2 years ago +2

    so we need chromium ?

    • @JohnWatsonRooney 2 years ago

      Yes but it will be installed by playwright

  • @gracyfg 5 months ago

    Does it work on Windows ?

  • @vicscrapingmachine 2 years ago

    Hi, im getting this error, can anyone help me?: scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one
    (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

  • @ariankarimi8941 1 year ago

    this method doesn't work any more :(

  • @ThrashSkull 4 months ago

    Hey nice tutorial, can you make one of the same but using CrawlSpider? please

  • @burgasdragonheirsilentgods 2 years ago

    Thank you so much sir 💙

  • @daremotivationeveryday 2 years ago

    Hello please how can I use it to extract number

  • @mohfatkurrozi4069 2 years ago +1

    Amazing videos

  • @KG-lr2qw 2 years ago +1

    It looks so simple, unless you get a huge list of errors, NotImplementedError, Error caught on signal handler, AttributeError: 'ScrapyPlaywrightDownloadHandler' object has no attribute 'browser'. Isn't it fun when a ten minute tutorial turns into hours and hours of googling....

    • @JohnWatsonRooney 2 years ago +2

      Absolutely. I do my best to show you the method but what you don’t see is the hours and hours of learning I did when I was new to it. It’s not easy but every time you overcome and error you learn what it was and why it happened and learn the insides of the errors. I promise if you keep working at it you’ll get there.

  • @apaapsson774 2 years ago

    shit this will come inhandy for me. Thanks

  • @valostudent6074 2 years ago +1

    i think windows dont support it yet

    • @JohnWatsonRooney 2 years ago

      Oh really? I thought it would but I don’t use windows anymore to check

  • @WolfSingh 10 months ago

    your videos are great you should start a YouTube channel

  • @Achiesamablog 2 years ago +1

    playwright seems to be new selenium

    • @JohnWatsonRooney 2 years ago

      Selenium is still used very widely but it’s great to have an alternative and playwright has been brilliant to work with

  • @Raminber 1 year ago +1

    Important: Doesn't work for Windows ;(

  • @markcuello5 2 years ago

    Help me

  • @jensshumway3652 2 years ago +1

    this guy is the laziest tutorial guy i've seen XD