This script I threw together saves me hours.
- Published 15 Aug 2023
- Working out the best way to scrape data from a site is time-consuming. This script uses selenium-wire to view the network requests a site makes and gives you back a list of URLs and JSON responses.
Proxies: nodemaven.com/?a_aid=JohnWats...
Patreon: / johnwatsonrooney (NEW free tier)
Scraper API www.scrapingbee.com/?fpr=jhnwr
Donations: www.paypal.com/donate/?hosted...
Hosting: Digital Ocean: m.do.co/c/c7c90f161ff6
Gear I use: www.amazon.co.uk/shop/johnwat... - Science & Technology
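The approach the video describes can be sketched roughly like this. This is only a sketch against selenium-wire's documented `driver.requests` API; the function names, stub-friendly structure, and URL are illustrative, not taken from the video.

```python
# Rough sketch: let selenium-wire record the browser's network traffic,
# then pull out the URLs and parsed JSON bodies.
# Assumes `pip install selenium-wire`; names and URL are illustrative.
import json

def is_json_response(headers):
    """True when a response's Content-Type header indicates JSON."""
    return "application/json" in headers.get("Content-Type", "")

def collect_json(requests_seen, decode_body):
    """Return (url, parsed-json) pairs from selenium-wire request objects.
    `decode_body` undoes Content-Encoding (e.g. seleniumwire.utils.decode)."""
    results = []
    for req in requests_seen:
        if req.response and is_json_response(req.response.headers):
            body = decode_body(
                req.response.body,
                req.response.headers.get("Content-Encoding", "identity"),
            )
            results.append((req.url, json.loads(body)))
    return results

# Usage with a real browser (not run here):
#   from seleniumwire import webdriver
#   from seleniumwire.utils import decode
#   driver = webdriver.Chrome()
#   driver.get("https://example.com")   # placeholder URL
#   for url, data in collect_json(driver.requests, decode):
#       print(url)
```
Keeping the filtering and decoding in plain functions means the logic can be tested without launching a browser.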
Fantastic “apprentice” content. This assumes a basic understanding but also pushes the novice forward. I really appreciate it!
This is really great, and a great foundation, too. I can see this being extended to support so many things, too.
Love your thought process behind writing this! It makes it easy to follow why you do a certain step, and whether it’s necessary for others! Great vids, keep it up!
Glad it was helpful!
I didn't even know that selenium can be used like this, thank you very much, great work as always))
great video. thank you john❤
Amazing, just something I was looking for. Need to look into it more to see if I could fetch all the IPs too.
golden content here
I used selenium-wire to create a scraping bot. It’s a very good package for grabbing the backend requests. What I did was: using Selenium I logged in, then grabbed the cookies and the backend API ;) then I simply closed the browser and used the Python requests lib to make the requests, to make things a little bit faster. Eventually, I dockerized everything, and then I had this container image which I pushed to AWS ECR and ran in parallel on AWS ECS.
Pretty amazing.
Impressive! What's your email? I need to ask you a question relating to your code.
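The workflow described a couple of comments up (log in through the browser, then reuse the session cookies with plain requests) can be sketched like this. A sketch only; the function name is illustrative and the usage URL is a placeholder.

```python
# Sketch: copy cookies from a live Selenium driver into a requests.Session
# so further API calls can skip the browser entirely.
import requests

def session_from_selenium(driver):
    """Build a requests.Session carrying the driver's current cookies."""
    sess = requests.Session()
    for cookie in driver.get_cookies():
        sess.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain") or "",
        )
    return sess

# Usage (after logging in through the browser; not run here):
#   sess = session_from_selenium(driver)
#   driver.quit()                                  # browser no longer needed
#   data = sess.get("https://example.com/api/items").json()  # placeholder URL
```
Requests made through the session are much faster than driving the browser, which is the speed-up the commenter describes.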
that's very useful, thank you
Hello John, could you make a video on how to scrape data that a server sends through a WebSocket connection in live mode?
Can it make a complete copy of requests, with URL, headers and payload?
Amazing content thank you
Very welcome
"Because. I. Don't. Care..." 😂😂
haha
Hello, thank you for the amazing video. I wanted to ask: how can I bypass a 403 Forbidden from Cloudflare when I am requesting an API? Thank you for all your efforts 🙏🏽
thank you,
I am wondering if you earn money with these tools?
Can you please share the script that you created, for my future reference?
Can I bypass hqq.tv devtool blocking using this?
Hi
Are you aware of self-healing Selenium scripts? Can you explain the concept of self-healing and how it is even possible? We find an element on a web page using a locator; if that element isn't found, we get an error. How can self-healing find that locator? For example, if an element found by //input[@name=email] isn't found, can it automatically guess that the element was updated in the next build to //input[@name=mailing-address]? It would be great if you could help us understand that.
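A toy illustration of the "self-healing" idea asked about above: instead of a single locator, keep an ordered list of candidates and fall back through them. Real tools (Healenium, for instance) go further, scoring candidate elements by similarity to a stored snapshot of the element; this sketch shows only the fallback part, and all names are illustrative.

```python
# Sketch: try locators in order, returning the first that resolves.
def find_with_healing(find, locators):
    """Return (element, locator) for the first locator that succeeds.
    `find` is any callable that returns an element or raises on failure."""
    last_error = None
    for locator in locators:
        try:
            return find(locator), locator
        except Exception as err:
            last_error = err
    if last_error is None:
        raise ValueError("no locators given")
    raise last_error

# Usage with Selenium (locators here are the hypothetical example above):
#   element, used = find_with_healing(
#       lambda xp: driver.find_element(By.XPATH, xp),
#       ["//input[@name='email']", "//input[@name='mailing-address']"],
#   )
```
Logging which locator `used` ends up being tells you when the primary locator has gone stale and should be updated.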
Good video. Do you capture API keys in selenium-wire as well? Some APIs use session keys.
you can grab any headers and cookies yeah
I was working with selenium / selenium-wire until I hit an issue I couldn't debug: selenium-wire wasn't listening on the right port where Selenium was running when dockerised.
that's interesting, i haven't tried dockerising it but i will keep an eye open for issues
Great work bro!! And I have one question: if I want to scrape Walmart, every time a robot-or-human pop-up comes up, so can you please guide me on how to bypass this type of bot detection system? Thanks, and love your content; because of you I learned Python!! 👍
Check out undetected-chromedriver - there’s some good information for it that might help.
I tried, bro, but it's still showing the same issue. If you have any reference or video, can you please suggest it? It'll be very helpful for me and others also :)
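The package mentioned in the reply above is `undetected-chromedriver` (`pip install undetected-chromedriver`), a patched driver that avoids the most common automation fingerprints. Minimal usage, as an untested sketch; it still won't beat every detection system, and the URL is a placeholder.

```python
# Sketch: swap the normal Selenium driver for undetected-chromedriver.
def open_stealth_browser(url):
    """Open `url` in a patched Chrome that hides common automation tells."""
    import undetected_chromedriver as uc  # pip install undetected-chromedriver
    driver = uc.Chrome()
    driver.get(url)
    return driver

# Not run here (needs Chrome installed):
#   driver = open_stealth_browser("https://example.com")
#   print(driver.title)
#   driver.quit()
```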
Hi John!! I really appreciate this new content. I have a query to ask. I was using selenium webdriver in chrome to fetch data from a website. The script is working just fine but after certain iterations, the driver is not working properly or the way it should. I am getting a NoneType error. I tried clearing the cookie and starting a new session and then continue from where I left off but it is still not working. Any suggestions on this?? I really appreciate it!! Thanks!!
hard to say but when i get problems like this i always check to see what the direct output from loading the page is, you could be hitting a captcha
Actually that new page is loading properly. I didn't check for terminal output but the page is loading. After that when I am looking for an element on the same page which I know is available there, I am getting an error.
Are you using the JetBrains Mono font? If yes, then how does it look so thin?
It is, yeah. I don't know, I didn't do anything other than select that font, sorry.
So this is kinda like playwright network events right?
Yes same thing but I found it better to use
What is the vscode theme and the font used in this video?
GitHub Dark theme and JetBrains Mono!
@@JohnWatsonRooney thank you
What if the API is only called when a user action occurs?
the next step to upgrade this would be to run the same but insert clicks on various page links first and check each one
@@JohnWatsonRooney thanks for the reply 🙏 Also, most important thing: a POST-method API which accepts custom keys in its headers or payload will not give the expected response. Please make a video on executing this.
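The upgrade John suggests above (click through page links first, then check the captured traffic again) could be sketched as follows. Written duck-typed so any selenium-wire driver works; all names are illustrative.

```python
# Sketch: trigger user actions, then re-read selenium-wire's recorded
# requests so APIs fired only on clicks are captured too.
def urls_after_clicks(driver, clickables):
    """Click each element, then return the set of request URLs recorded."""
    for element in clickables:
        try:
            element.click()
        except Exception:
            continue  # element may be stale or not clickable; skip it
    return {request.url for request in driver.requests}

# Usage with Selenium (not run here):
#   links = driver.find_elements(By.CSS_SELECTOR, "a")
#   print(urls_after_clicks(driver, links))
```
Since selenium-wire keeps appending to `driver.requests`, reading it after the clicks picks up both the initial page load and anything the clicks triggered.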
Complete noob here, just started web scraping.
For some reason the seleniumwire import is giving me this error:
import blinker._saferef
ModuleNotFoundError: No module named 'blinker._saferef'
I've been searching online for help for hours. I changed Python versions (currently using the same one you're using in the video).
Nothing seems to work.
Please help.
Thank you in advance.
pip install blinker ?
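For context on the error above: this is a known incompatibility, as newer blinker releases removed the private `_saferef` module that selenium-wire imports. The usual workaround is to pin an older blinker (version number here is from memory, so verify it against your environment):

```shell
pip install blinker==1.7.0
```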
Hi John! Love your work. Could you share the code from your videos?
Maybe John has the code available to Patreon members ;)
@@markbennett5626Ohhhhh okay no issues hehe :)
Are you no longer using neovim?
I still use neovim, i decided to use VS Code for video demos as i thought it would include more people
Is this better than Puppeteer network events?
I have limited experience with Puppeteer; I expect it to be the same - although I prefer selenium-wire to Playwright for network events.
@@JohnWatsonRooney OK, how about Playwright network events - does it have similar functionality, or would you still recommend going with selenium-wire?
I don't really get it. I mean, you can filter the Network tab by link or by the word "api" too if you want to. Plus this solution won't work for everything, but the Network tab will. Other than filtering only the needed requests, this solution doesn't seem to do anything. And yeah, you can do a bit more advanced filtering here, but does this really save a lot of time for some kind of task?
It's just hard for me to see how. Did I miss something? I've been making AJAX scripts dealing with forms for the past year+, and for me it would be absolutely useless.
I use it when I am given a URL and want to do some quick checks - saving any JSON output so I can search inside it all from my terminal. I chose to semi-automate something I was doing regularly, is all.
Maybe not for everyone, but once scripted, including a user prompt for the URL, it'll be quicker than using the Network tab, with a much nicer response. Plus I can see adding the ability for the additional steps of recording session keys and further calls. Thanks John.
Are you using Arch Linux, sir? And thanks for the content! 🥰
Thanks! It's actually just Ubuntu + i3.
@@JohnWatsonRooney Wow, I guess my mind went straight to Arch when I saw a Hyprland-style window manager 😁
Hi, thanks a lot, this was very helpful to learn. I use contextlib.suppress; I think it looks cleaner than try/except. Your function would look like this (with `decodesw` being selenium-wire's decode helper, imported as `from seleniumwire.utils import decode as decodesw`):
import contextlib
import json
from seleniumwire.utils import decode as decodesw

def get_responses(driver):
    resps = []
    for request in driver.requests:
        with contextlib.suppress(Exception):
            data = decodesw(
                request.response.body,
                request.response.headers.get("Content-Encoding", "identity")
            )
            resp = json.loads(data.decode("utf-8"))
            resps.append(resp)
    return resps
Nice idea. But I will still prefer to log the requests via Network tab or Burp suite.
The chromedriver detection will also kick in for some sites.
fair enough, it does have some uses but also limitations as you say.
Guys, I'm watching with passion, but what would this be helpful for? What do web scrapers actually do?
Gathering data that would otherwise be difficult to get without a proper API
Anyone else update Chrome on their PC and have all their scrapers break? 😅
Hi John, big fan. Thanks for the tutorials ❤
I need to contact you on social media; I kindly need help scraping one site.