This script I threw together saves me hours.

  • Published on 15 Aug 2023
  • Finding the best way to scrape data from a site is time-consuming. This script uses selenium-wire to view the network requests a site makes and gives you back a list of URLs and JSON responses.
    Proxies: nodemaven.com/?a_aid=JohnWats...
    Patreon: / johnwatsonrooney (NEW free tier)
    Scraper API www.scrapingbee.com/?fpr=jhnwr
    Donations: www.paypal.com/donate/?hosted...
    Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
    Gear I use: www.amazon.co.uk/shop/johnwat...
  • Science & Technology
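
The description above summarizes the approach: load a page through selenium-wire, walk the captured requests, and keep the ones whose response body parses as JSON. A minimal sketch of that idea follows. It is not the script from the video; the names `json_responses` and `capture` are hypothetical, and the browser part assumes the `selenium-wire` package and a local Chrome are installed.

```python
import json


def json_responses(requests, decode=lambda body, encoding: body):
    """Return (url, parsed_json) pairs for requests whose response body is JSON.

    `requests` is any iterable of objects shaped like selenium-wire's captured
    requests (`.url`, `.response.headers`, `.response.body`). `decode` undoes
    the Content-Encoding; the default assumes the body is not compressed.
    """
    found = []
    for request in requests:
        response = request.response
        if response is None:  # the request never received a response
            continue
        encoding = response.headers.get("Content-Encoding", "identity")
        try:
            body = decode(response.body, encoding)
            found.append((request.url, json.loads(body.decode("utf-8"))))
        except ValueError:  # not UTF-8 or not JSON (both are ValueError subclasses)
            continue
    return found


def capture(url):
    """Load `url` in Chrome through selenium-wire and return its JSON responses."""
    # Imported here so json_responses() stays usable without selenium-wire.
    from seleniumwire import webdriver
    from seleniumwire.utils import decode

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return json_responses(driver.requests, decode)
    finally:
        driver.quit()
```

Catching only `ValueError` skips images, HTML and other non-JSON bodies without hiding unrelated bugs.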

Comments • 69

  • @liketheduck
    @liketheduck 3 months ago +3

    Fantastic “apprentice” content. This assumes a basic understanding but also pushes the novice forward. I really appreciate it!

  • @DerekMurawsky
    @DerekMurawsky 2 months ago

    This is really great, and a great foundation, too. I can see this being extended to support so many things, too.

  • @jessejames3169
    @jessejames3169 11 months ago +11

    Love your thought process behind writing this! It makes it easy to follow why you do a certain step, and if it’s necessary for others! Great vids keep it up!

  • @Extrey
    @Extrey 11 months ago +7

    I didn't even know that selenium can be used like this, thank you very much, great work as always))

  • @sandunwijethunga6787
    @sandunwijethunga6787 11 months ago +1

    great video. thank you john❤

  • @TimoTalksTech
    @TimoTalksTech 11 months ago

    Amazing, just what I was looking for. I need to look into whether I could fetch all the IPs too.

  • @kocahmet1
    @kocahmet1 11 months ago +1

    golden content here

  • @jagdish1o1
    @jagdish1o1 10 months ago +5

    I used seleniumwire to create a scraping bot. It’s a very good package for grabbing the backend requests. What I did was log in using selenium, then grab the cookies and the backend API ;) Then I simply closed the browser and used the Python requests lib to make the requests, to make things a little bit faster. Eventually, I dockerized everything, and then I had this container image, which I pushed to AWS ECR and ran in parallel on AWS ECS.
    Pretty amazing.

    • @datacleaningchallenge2029
      @datacleaningchallenge2029 10 months ago

      Impressive. What's your email? I need to ask you a question related to your code.
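
The hand-off described in this thread (log in with a browser, then reuse the cookies with the `requests` library) could be sketched as below. This is an illustration under assumptions, not the commenter's code: the function names are hypothetical, and the input is assumed to be the list of dicts returned by selenium's `driver.get_cookies()`.

```python
def cookie_header(selenium_cookies):
    """Build a Cookie header value from selenium's driver.get_cookies() output."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def session_from_cookies(selenium_cookies):
    """Copy browser cookies into a requests.Session for faster follow-up calls."""
    import requests  # third-party; imported lazily so cookie_header() has no dependency

    session = requests.Session()
    for c in selenium_cookies:
        session.cookies.set(
            c["name"], c["value"], domain=c.get("domain", ""), path=c.get("path", "/")
        )
    return session
```

After logging in, something like `session = session_from_cookies(driver.get_cookies())` followed by `driver.quit()` would let later calls go through `session.get(...)` without a browser. Some sites also check headers such as the user agent, so cookies alone may not be enough.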

  • @kite759
    @kite759 11 months ago +1

    that's very useful, thank you

  • @ivanowdenis
    @ivanowdenis 11 months ago +2

    Hello John, could you make a video on how to scrape data that a server sends through a websocket connection in live mode?

  • @StonedApe420
    @StonedApe420 11 months ago

    Can it make a complete copy of requests, with URL, headers and payload?

  • @tizianonakamader8177
    @tizianonakamader8177 11 months ago +1

    Amazing content thank you

  • @pldvs
    @pldvs 11 months ago +6

    "Because. I. Don't. Care..." 😂😂

  • @zakariaboulouarde4591
    @zakariaboulouarde4591 2 months ago +1

    Hello, thank you for the amazing video. May I ask how I can bypass a 403 Forbidden from Cloudflare when I am requesting an API? Thank you for all your efforts 🙏🏽

  • @maloukemallouke9735
    @maloukemallouke9735 11 months ago

    Thank you,
    I am wondering if you earn money with these tools?

  • @TheCulpritgamer
    @TheCulpritgamer 4 months ago

    can you please share the script that you created for my future reference ??

  • @user-qi2kt8ow5r
    @user-qi2kt8ow5r 10 months ago

    Can I bypass hqq.tv devtool blocking using this?

  • @throwyourmindat
    @throwyourmindat 10 months ago

    Hi
    Are you aware of self-healing selenium scripts? Can you explain the concept of self-healing and how it is even possible? We find an element on a web page using a locator, and if that element isn't found we get an error. How can self-healing find that locator? For example, if an element found by //input[@name=email] is not found, it can automatically guess that the element was updated in the next build to //input[@name=mailing-address] using a self-healing approach. It would be great if you could help us understand that.

  • @darylhunt9070
    @darylhunt9070 11 months ago +1

    Good video. Do you capture API keys in Selenium Wire as well? Some APIs use session keys.

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +2

      you can grab any headers and cookies yeah
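
As a rough illustration of the reply above, the sketch below copies a captured request's details and picks out headers that often carry session keys. It assumes objects shaped like selenium-wire's captured requests (`.method`, `.url`, `.headers`, `.body`); the function names and the default header list are hypothetical guesses, not anything shown in the video.

```python
def request_snapshot(request):
    """Copy the parts of a captured request needed to replay it elsewhere."""
    return {
        "method": request.method,
        "url": request.url,
        # dict() keeps one value per header name; repeated headers collapse.
        "headers": dict(request.headers.items()),
        "body": request.body,
    }


def auth_headers(request, names=("authorization", "cookie", "x-api-key")):
    """Return only the headers that usually carry session keys (names are a guess)."""
    wanted = {n.lower() for n in names}
    return {k: v for k, v in request.headers.items() if k.lower() in wanted}
```

Calling `auth_headers(r)` for each `r` in `driver.requests` would show which requests carry a token or cookie worth reusing.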

  • @AleksT28
    @AleksT28 11 months ago +1

    I was working with selenium / selenium-wire until I hit an issue I could not debug: when dockerised, selenium-wire was not listening on the port where selenium was running.

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      that's interesting, i haven't tried dockerising it but i will keep an eye open for issues

  • @mitvpankaj2454
    @mitvpankaj2454 11 months ago +1

    Great work bro!! I also have one question: whenever I try to scrape Walmart, the "robot or human" pop-up comes up. Can you please guide me on how to bypass this type of bot detection system? Thanks, and I love your content; because of you I learned Python!! 👍

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +1

      Check out undetected chrome driver - there’s some good information for it that might help

    • @mitvpankaj2454
      @mitvpankaj2454 11 months ago

      I tried, bro, but it's still showing the same issue. If you have any reference or video, can you please suggest it? It'll be very helpful for me and others too :)

  • @satyajeetkumar3993
    @satyajeetkumar3993 11 months ago +1

    Hi John!! I really appreciate this new content. I have a query to ask. I was using selenium webdriver in chrome to fetch data from a website. The script is working just fine but after certain iterations, the driver is not working properly or the way it should. I am getting a NoneType error. I tried clearing the cookie and starting a new session and then continue from where I left off but it is still not working. Any suggestions on this?? I really appreciate it!! Thanks!!

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      hard to say but when i get problems like this i always check to see what the direct output from loading the page is, you could be hitting a captcha

    • @satyajeetkumar3993
      @satyajeetkumar3993 11 months ago

      Actually that new page is loading properly. I didn't check for terminal output but the page is loading. After that when I am looking for an element on the same page which I know is available there, I am getting an error.

  • @user-nj2om2vt8u
    @user-nj2om2vt8u 11 months ago +1

    Are you using the JetBrains Mono font? If yes, how does it look so thin?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      it is yeah, I don't know I didn't do anything other than select that font sorry

  • @AllifIzzuddin
    @AllifIzzuddin 11 months ago +1

    So this is kinda like playwright network events right?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +1

      Yes same thing but I found it better to use

  • @iamshiva003
    @iamshiva003 11 months ago +1

    What is the vscode theme and the font used in this video?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +1

      github dark theme and jet brains mono!

    • @iamshiva003
      @iamshiva003 11 months ago

      @@JohnWatsonRooney thank you

  • @satwikawasthi2002
    @satwikawasthi2002 11 months ago +1

    What if the API is only called when some user action occurs?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      the next step to upgrade this would be to run the same but insert clicks on various page links first and check each one

    • @satwikawasthi2002
      @satwikawasthi2002 11 months ago

      @@JohnWatsonRooney thanks for the reply 🙏 Also, most importantly: a POST API that expects custom keys in its headers or payload will not give the expected response. Please make a video on how to handle this.
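
The upgrade suggested in this thread (perform a click, then check what was requested) might look like the sketch below. It assumes a driver whose `.requests` attribute lists everything captured so far, as selenium-wire's does; the helper name `requests_triggered_by` is hypothetical.

```python
def requests_triggered_by(driver, action):
    """Run `action(driver)` and return only the requests captured after it started.

    Requests fired by a click arrive asynchronously, so `action` may need to
    wait for the page to settle before it returns.
    """
    seen = len(driver.requests)  # how many requests existed before the action
    action(driver)
    return driver.requests[seen:]
```

A call could look like `requests_triggered_by(driver, lambda d: d.find_element(By.LINK_TEXT, "Next").click())`, where `By` comes from `selenium.webdriver.common.by` and the link text "Next" is made up for illustration.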

  • @abdelrahmankhaled8239
    @abdelrahmankhaled8239 2 months ago

    complete noob here just started web scraping
    for some reason the seleniumwire import is giving me this error
    import blinker._saferef
    ModuleNotFoundError: No module named 'blinker._saferef'
    I've been searching online for help for hours. changed python versions (currently using the same one you're using in the video)
    nothing seems to work.
    please help
    thank you in advance

    • @DudethatGross
      @DudethatGross 2 months ago

      pip install blinker ?

  • @AhmedThahir2002
    @AhmedThahir2002 11 months ago

    Hi John! Love your work. Could you share the code from your videos?

    • @markbennett5626
      @markbennett5626 11 months ago +1

      Maybe John has the code available to Patreon members ;)

    • @AhmedThahir2002
      @AhmedThahir2002 11 months ago

      @@markbennett5626 Ohhhhh okay, no issues hehe :)

  • @AndyTutify
    @AndyTutify 11 months ago +1

    Are you no longer using neovim?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      I still use neovim, i decided to use VS Code for video demos as i thought it would include more people

  • @user-tk5ir1hg7l
    @user-tk5ir1hg7l 11 months ago +1

    Is this better than Puppeteer network events?

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      I have limited experience with Puppeteer; I expect it to be the same, although I prefer selenium-wire to Playwright for network events.

    • @user-tk5ir1hg7l
      @user-tk5ir1hg7l 11 months ago

      @@JohnWatsonRooney ok, how about playwright network events, does it have similar functionality or would you still recommend going with seleniumwire

  • @Niuroteya
    @Niuroteya 11 months ago +1

    I don't really get it. You can filter the Network tab by link or by the word "api" too if you want to. Plus, this solution will not work for everything, but the Network tab will. Other than filtering for only the needed requests, this solution doesn't seem to do anything. And yeah, you can do a bit more advanced filtering here, but does this really save a lot of time for some kind of task?
    It's just hard for me to see how. Did I miss something? I've been making AJAX scripts dealing with forms for the past year+ and for me it would be absolutely useless.

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago +4

      I use it when I am given a URL and want to do some quick checks - saving any JSON output so I can search inside all from my terminal. I chose to semi automate something I was doing regularly is all.

    • @markbennett5626
      @markbennett5626 11 months ago +2

      Maybe not for everyone but once scripted including user prompt for url, it'll be quicker than using network tab and much nicer response, plus can see adding the ability for the additional steps of recording session keys and further calls.. Thanks John
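
The workflow mentioned in this thread, saving any JSON output so it can be searched from the terminal, can be sketched as follows. This is only an assumed layout, not the author's: `save_json_responses` is a hypothetical name, and `pairs` stands for a list of `(url, parsed_json)` tuples collected however you like.

```python
import json
import re
from pathlib import Path


def save_json_responses(pairs, out_dir):
    """Write each (url, data) pair to its own .json file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (url, data) in enumerate(pairs):
        # Turn the URL into a safe, readable file name; the index keeps names unique.
        slug = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:80]
        path = out / f"{index:03d}_{slug}.json"
        path.write_text(json.dumps({"url": url, "data": data}, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

With one file per response, an ordinary `grep -r` over the output directory finds which endpoint returned a given field.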

  • @linuxkerem
    @linuxkerem 11 months ago +1

    Are you using arch linux sir ? And thanks for the content ! 🥰

    • @JohnWatsonRooney
      @JohnWatsonRooney 11 months ago

      thanks! it's actually just ubuntu + i3

    • @linuxkerem
      @linuxkerem 11 months ago

      @@JohnWatsonRooney Wow, I guess my mind went straight to Arch when I saw a Hyprland-style window manager 😁

  • @spab87
    @spab87 5 months ago

    Hi, thanks a lot, this was very helpful to learn from. I use contextlib.suppress; it's actually faster than try/except and it looks better, I think. Your function would look like this:

    import contextlib

    # Body of the function; resps is the list it builds and returns.
    # decodesw is presumably seleniumwire.utils.decode imported under that name.
    for request in driver.requests:
        with contextlib.suppress(Exception):
            data = decodesw(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity"),
            )
            resp = json.loads(data.decode("UTF-16"))
            resps.append(resp)
    return resps

  • @valoclips2896
    @valoclips2896 11 months ago +1

    Nice idea. But I will still prefer to log the requests via Network tab or Burp suite.
    The chromedriver detection will also kick in for some sites.

      @JohnWatsonRooney 11 months ago +1
      @JohnWatsonRooney  11 หลายเดือนก่อน +1

      fair enough, it does have some uses but also limitations as you say.

  • @twelfth4927
    @twelfth4927 3 months ago

    Guys, I'm watching with passion, but what would this be helpful for? What do web scrapers actually do?

    • @DudethatGross
      @DudethatGross 2 months ago

      Gathering data that would otherwise be difficult to get without a proper API

  • @Septumsempra8818
    @Septumsempra8818 11 months ago

    Anyone else update chrome on their pc and had all their scrapers break?😅

  • @bakasenpaidesu
    @bakasenpaidesu 10 months ago +1

    .

  • @MasoomNini
    @MasoomNini 9 months ago

    Hi John, big fan. Thanks for the tutorials ❤
    I need to contact you on any social media; I need help scraping one site, please.