Python and Scrapy - Scraping Dynamic Site (Populated with JavaScript)

  • Published on Sep 15, 2024

Comments • 201

  • @codeRECODE
    @codeRECODE  3 years ago +8

    Hi everyone, I need your support to get this channel running. *Please SUBSCRIBE and Like!*
    Leave a comment with your questions, suggestions, or a word of appreciation :-)
    I would love your suggestions for new videos.

  • @harshnambiar
    @harshnambiar 4 years ago +24

    You did this without even using Docker or Splash. That is pretty cool. 🌸

    • @codeRECODE
      @codeRECODE  4 years ago +2

      Thank you! 😊

  • @osmarribeiro
    @osmarribeiro 4 years ago +6

    OMG! Amazing video. I'm learning Scrapy now; this video helped me a lot.

  • @julian.borisov
    @julian.borisov 3 years ago +19

    "Without Selenium" caught my attention!

    • @klarnorbert
      @klarnorbert 3 years ago +1

      I mean, Selenium is not for web scraping (it's mostly used for automating web-app testing). If you can reverse-engineer the API, like in this video, Scrapy is more than enough.

    • @k.m.jiaulislamjibon1443
      @k.m.jiaulislamjibon1443 3 years ago

      @@klarnorbert But sometimes you have no option other than to use Selenium. Some web-app developers are clever enough to encapsulate the function calls so that the page doesn't show an XHR request. I had to use Selenium to parse data in a web app.

  • @igorwarzee
    @igorwarzee 3 years ago +2

    It really helped me a lot. Thank you and congrats. Cheers from Brazil!

  • @kenrosenberg8835
    @kenrosenberg8835 3 years ago +2

    Wow! You are a very smart programmer. I never thought of making REST API calls directly and then parsing the response. Very nice. There is a lot to learn in your videos, more than just scraping.

    • @codeRECODE
      @codeRECODE  3 years ago

      Glad it was helpful!

  • @gamelin1234
    @gamelin1234 3 years ago +3

    Just used this technique to scrape a huge dataset after struggling for a couple of hours with requests+BS. Thank you so much for the great content!

    • @codeRECODE
      @codeRECODE  3 years ago

      Glad it helped :-)

  • @lambissol7423
    @lambissol7423 3 years ago +3

    Excellent!! I feel like you doubled my knowledge of web scraping!

    • @codeRECODE
      @codeRECODE  3 years ago +1

      That's awesome!

  • @sebleaf8433
    @sebleaf8433 3 years ago +4

    Wow!! This is awesome! Thank you so much for teaching us new things with scrapy :)

    • @codeRECODE
      @codeRECODE  3 years ago

      Thank you :-)

    • @mohamedbhasith90
      @mohamedbhasith90 9 months ago

      @@codeRECODE Hi sir, I'm trying to scrape a website with hidden APIs like you did in this video, but the data is in a POST request, not a GET request like in the video. I'm really stuck here. Can you make a video on scraping a hidden API with a POST request? I hope you find this comment.

  • @RonZuidema
    @RonZuidema 4 years ago +3

    Great video, thanks for the simple but precise instruction!

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad it was helpful!

  • @helloworld-sk1hr
    @helloworld-sk1hr 4 years ago +2

    Before watching this video, I was doing this with Selenium. While watching your video, I was laughing at myself for what I had been doing.
    This video has saved my day.
    Your videos are amazing 🔥

  • @Chris-vx6eb
    @Chris-vx6eb 4 years ago +6

    This took me 2 days to figure out. If you're having trouble with json.loads(), I found out that the JSON data I scraped was actually a byte string, so I had to decode it BEFORE using json.loads. So where he had (9:47)
    *raw_data = response.body*
    replace it with: *raw_data = response.body.decode("utf-8")*
    then continue on with: *data = json.loads(raw_data)*
    TO CHECK IF YOU NEED TO DO THIS, RUN THIS TEST:
    *raw_data = repr(response.body)* #repr() is a built-in function that (1) turns Python objects into printable objects, so you can see what you're dealing with, and (2) in my case, prints out your object so you can tell whether you have a byte string, because you will get a 'b' in front of your string.
    *print(raw_data)*
    output>>> b'{ {data:...}, otherdata: [{...},{...}] }'
    If you see this b, use the method I described above. Hope I saved someone time; Stack Overflow doesn't have a question for this yet (:

    • @codeRECODE
      @codeRECODE  4 years ago +2

      @chris - Good catch!
      Short answer: replace response.body.decode("utf-8") with response.text
      Detailed answer:
      Let's understand text and body:
      response.body contains the raw response without any decoding.
      response.text contains the decoded response as a string.
      In this video, response.body worked because no special decoding was required.
      Your method is correct. An even better approach would be to use response.text, because the response is actually a TextResponse, which is an encoding-aware object.
      Bonus tip: install IPython and you will have a much better Python console.
      Good luck!
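
      A minimal sketch of the options discussed above, side by side (assuming a Scrapy callback receiving a JSON response; response.json() needs Scrapy 2.2+):
      import json

      def parse(self, response):
          # Option 1: decode the raw bytes manually
          data = json.loads(response.body.decode("utf-8"))
          # Option 2: let Scrapy decode using the detected encoding
          data = json.loads(response.text)
          # Option 3: on Scrapy 2.2+, parse the JSON body directly
          data = response.json()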

    • @Chris-vx6eb
      @Chris-vx6eb 4 years ago +1

      @@codeRECODE awesome, thanks!

    • @tokoindependen7458
      @tokoindependen7458 3 years ago +1

      Bro, paste this as an article on a website so more people can find it easily.

    • @pythonically
      @pythonically 2 years ago

      raise TypeError(f'the JSON object must be str, bytes or bytearray, '
      TypeError: the JSON object must be str, bytes or bytearray, not tuple
      Is this the same error?

  • @navdeeprana8477
    @navdeeprana8477 3 months ago

    The video is really good. I am trying to learn Scrapy, and I thought it would be far too difficult for me to understand,
    but you made it simple.

  • @carryminatifan9928
    @carryminatifan9928 3 years ago +2

    Beautiful Soup and Selenium are not for large-scale data scraping.
    Scrapy is best 👍

  • @lorderiksson3377
    @lorderiksson3377 5 months ago +1

    This technique is fantastic, and thanks a lot for the great content on your YouTube page. Keep up the great job.
    But how do you implement pagination? Bit of a shame it wasn't shown here.
    Let's say the schools are in a list of 25 items per page, 10 pages in total. How do you do it then?

    • @codeRECODE
      @codeRECODE  3 months ago

      Shame is a strong word, no?
      I try to cover one single topic in one video. Pagination is a topic by itself; I have a video on that too.
      If you would rather learn in a structured manner, you can try my course for a week.

  • @daddyofalltrades
    @daddyofalltrades 3 years ago +2

    Sir, thanks a lot!! This series will definitely help me ❤️

    • @codeRECODE
      @codeRECODE  3 years ago

      Glad to hear that

  • @cueva_mc
    @cueva_mc 3 years ago +2

    This is amazing, thank you!

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Glad you like it!

  • @EnglishRain
    @EnglishRain a year ago

    FANTASTIC explanation!!

  • @stealthseeker18
    @stealthseeker18 3 years ago +3

    Can you do web scraping if the website is behind Cloudflare version 2?

  • @yusufrifqi5006
    @yusufrifqi5006 2 years ago

    All of your tutorials are very helpful, big thanks to you, and I will wait for more Scrapy content.

    • @codeRECODE
      @codeRECODE  2 years ago +1

      Coming soon!

    • @yusufrifqi5006
      @yusufrifqi5006 2 years ago

      @@codeRECODE Nice! I will be waiting for Scrapy asynchronous programming.

  • @joaocarlosariedifilho4934
    @joaocarlosariedifilho4934 4 years ago +4

    Excellent. Sometimes there is no reason to use Splash; we only need to understand what requests the JS is making and how. Thank you!

    • @codeRECODE
      @codeRECODE  4 years ago +2

      Exactly! It's much faster, and the web server doesn't have to send all that CSS, JS, images, etc. Everyone is happier :-)

    • @shashikiranneelakantaiah6237
      @shashikiranneelakantaiah6237 4 years ago

      @@codeRECODE Hi there, I am facing an issue with a website. I can hit the first page, but from then on, any request I make redirects back to the first page. It would be of great help if you could summarise why this behaviour occurs on some sites. Thanks. And if I make the request to the same URL with scrapy-splash, I get a lot of timeout errors.

    • @codeRECODE
      @codeRECODE  4 years ago +1

      @@shashikiranneelakantaiah6237 - double-check that you are passing all the request headers except cookies and content-length.
      Cookies will be handled by Scrapy.
      Content-length will vary and will break things instead of fixing them.
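
      A sketch of that advice (the header values are placeholders standing in for whatever DevTools shows, not values from the video):
      # Copy everything from the request headers in DevTools, then drop two entries:
      def start_requests(self):  # inside a spider class
          headers = {
              "accept": "application/json, text/javascript, */*; q=0.01",
              "user-agent": "Mozilla/5.0 ...",
              "x-requested-with": "XMLHttpRequest",
              # "cookie": ...          # omit: Scrapy's cookie middleware handles cookies
              # "content-length": ...  # omit: recalculated per request; breaks if sent
          }
          yield scrapy.Request(self.api_url, headers=headers, callback=self.parse)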

    • @shashikiranneelakantaiah6237
      @shashikiranneelakantaiah6237 4 years ago +1

      @@codeRECODE Thank you for replying; I will give it a try. Please do more videos on Scrapy; your way of explaining the topics is excellent. Thank you once again.

  • @BreakItGaming
    @BreakItGaming 4 years ago +2

    Sir, please complete this series up to the advanced level. I have looked at many YouTube channels, but I didn't find any series that is complete.
    So it is my kind request.
    Anyway, thanks for starting such an initiative.

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad that you liked it. I will add more videos in the future for sure :-)

  • @ruksharalam173
    @ruksharalam173 a year ago

    Wow, learning something new about Scrapy every day.

    • @codeRECODE
      @codeRECODE  a year ago

      Oh yes! It is really vast!

  • @tunoajohnson256
    @tunoajohnson256 4 years ago +1

    This is a great tutorial. You taught me a lot and my app runs way faster than using Selenium now. Many Thanks, I hope to encourage you to keep teaching!

    • @codeRECODE
      @codeRECODE  4 years ago

      Thank you Tunoa!

  • @gsudhanshu
    @gsudhanshu 4 years ago +1

    I am trying to copy what you did in the video, but with the same code I am getting an error on fetching the first API, i.e. getAllSchools:
    2020-08-23 18:57:38 [scrapy.core.scraper] ERROR: Spider error processing (referer: directory.ntschools.net/)
    Traceback (most recent call last):
    File "/home/sudhanshu/.local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
    File "/home/sudhanshu/.local/lib/python3.6/site-packages/scrapy/utils/python.py", line 347, in __next__
    return next(self.data)

  • @AmitKumar-qv2or
    @AmitKumar-qv2or 3 years ago +1

    thank you so much sir....

  • @gracyfg
    @gracyfg 4 months ago

    Can you extend this and show how to scrape all the next pages and all the product details, and make it a production-quality product? Or share some pointers for making this production-quality code, with exception handling, etc.

    • @codeRECODE
      @codeRECODE  3 months ago

      All these topics need a lot of detail. Most of them are covered across many videos.
      You can also try my course and ask for a refund within a week if you don't like it.
      Happy learning!

  • @andycruz3893
    @andycruz3893 a month ago

    Thanks man

  • @jagdish1o1
    @jagdish1o1 3 years ago +1

    It's an awesome tutorial; I've learned a lot, thanks. I have a question: I want to set a default value if a field has no value.
    I've tried pipelines with item.setdefault('field', 'value') in process_item, but it's not working.

    • @codeRECODE
      @codeRECODE  3 years ago

      def process_item(self, item, spider):
          for field in item.fields:
              if item.get(field) is None:  # any other checks you need
                  item[field] = "-1"
          return item

  • @nadyamoscow2461
    @nadyamoscow2461 3 years ago

    Many thanks! I've learned a lot and it all works fine.

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Glad it helped

  • @hayathbasha4519
    @hayathbasha4519 3 years ago

    Hi,
    Please advise me on how to improve/speed up the Scrapy process.

    • @codeRECODE
      @codeRECODE  3 years ago

      You can increase CONCURRENT_REQUESTS from the default of 16 to a higher number.
      In most cases, you will need proxies if you want to scrape faster.
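
      For example, as per-spider settings (the numbers are illustrative, not recommendations):
      import scrapy

      class FastSpider(scrapy.Spider):
          name = "fast"
          custom_settings = {
              "CONCURRENT_REQUESTS": 64,             # default is 16
              "CONCURRENT_REQUESTS_PER_DOMAIN": 32,  # default is 8
          }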

  • @Ankush_1991
    @Ankush_1991 3 years ago

    Hi sir, the video is great because of its simplicity and clarity. I am a beginner in web scraping, and I have been stuck at one point for a very long time now. Can you help me? How do we contact you with our doubts? Please mention something in your video descriptions.

    • @codeRECODE
      @codeRECODE  3 years ago +1

      You can post your doubts here or in the comments section of my website. It is not always possible to reply to every question due to the sheer volume, though. I am planning to start a Facebook group where everyone can help everyone else. Let me know how that sounds.

  • @Pablo-wh4vl
    @Pablo-wh4vl 4 years ago +1

    How would you go about it if, instead of the XHR tab, the content is loaded by calls shown in the JS tab? Is it still possible with requests?

    • @codeRECODE
      @codeRECODE  4 years ago +1

      Tabs are only for logical grouping. You can extract info from any request; it's just that the code will change based on how the data is organized.

  • @UmmairRadi
    @UmmairRadi a year ago

    Thank you, this is awesome. What about a website that gets data using GraphQL?

  • @dashkandhar
    @dashkandhar 4 years ago +1

    Very knowledgeable and clear content, kudos!
    And what if an API is taking a long time to return response data? How do you handle that?

    • @codeRECODE
      @codeRECODE  4 years ago

      Thanks!
      If it is taking time, change DOWNLOAD_TIMEOUT in the settings. Add this to your spider class:
      custom_settings = {
          'DOWNLOAD_TIMEOUT': 360  # in seconds; the default is 180
      }

  • @azwan1992
    @azwan1992 2 years ago

    Nice!

  • @emmanuelowino4291
    @emmanuelowino4291 2 years ago +1

    Thanks for this, it really helped. But what if, instead of a JSON file, it returns an XHR response?

    • @codeRECODE
      @codeRECODE  2 years ago

      Nothing changes. JSON and XHR are just the browser's way of logically grouping information in this case.

  • @AndresPerez-qd8pn
    @AndresPerez-qd8pn 4 years ago +1

    Hey, I love your videos.
    I'm a little stuck with some code; could you help me? That would be very nice (some tutoring).

  • @felinetech9215
    @felinetech9215 4 years ago +1

    I followed along with all your videos to scrape a JavaScript-generated webpage, but the data I want isn't in the XHR tab. Any suggestions, sir?

    • @codeRECODE
      @codeRECODE  4 years ago

      Check the source of the main document

    • @felinetech9215
      @felinetech9215 4 years ago

      @@codeRECODE Any info on how to do that, sir?

  • @charisthawhite2793
    @charisthawhite2793 3 years ago

    Your video is very helpful; it deserves a subscribe.

  • @157sk8er
    @157sk8er 3 years ago

    I am trying to scrape information from a weather site, but the request is not showing up in the XHR tab; it is showing up in the JS tab. How do I scrape data from this tab?

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Nothing changes! All, JS, and XHR are just Chrome's way of organizing URLs. You will find everything under the All tab as well. Just use the same technique.

  • @shamblini_6170
    @shamblini_6170 2 years ago

    What happens when you encounter a 400 code at the API link address? I can't seem to get past the API, as response.text shows "No API key found in request."

    • @codeRECODE
      @codeRECODE  2 years ago

      Find the API key and add it to headers
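
      A hedged sketch: the header name varies by site (x-api-key below is a hypothetical placeholder), so check the request headers in DevTools for the actual name and value:
      def start_requests(self):  # inside a spider class
          headers = {"x-api-key": "<value copied from DevTools>"}  # hypothetical header name
          yield scrapy.Request(self.api_url, headers=headers, callback=self.parse)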

  • @HoustonKhanyile
    @HoustonKhanyile 3 years ago

    Could you please make a video scraping a music streaming service like SoundCloud?

  • @orlandespiritu2961
    @orlandespiritu2961 3 years ago

    Hi, can you help me write code that grabs hotel data from Agoda using this? I've been stuck and am running out of time for an exercise. I just started learning Python 3 weeks ago.

  • @RahulT-oy1br
    @RahulT-oy1br 4 years ago +3

    You just earned ₹7000 in 30 mins. Wowza

    • @codeRECODE
      @codeRECODE  4 years ago +5

      Thank you, but let's be honest: this is NOT a get-rich-quick scheme. There is work involved in learning, analyzing the site, and finally finding someone who will pay YOU for this task. It involves hard work :-)
      That being said, this is one of the fastest paths to actually earning money as a freelancer.

    • @RahulT-oy1br
      @RahulT-oy1br 4 years ago +1

      @@codeRECODE Any particular freelancing or online short-term internship sites you'd recommend?

    • @codeRECODE
      @codeRECODE  4 years ago +3

      @@RahulT-oy1br Any of the freelancing sites is fine. Practice with jobs that have already closed. Once you are confident, start applying for new jobs.

    • @fabiof.deaquino4731
      @fabiof.deaquino4731 4 years ago

      @@codeRECODE great recommendations. Really appreciate all the work that you have been doing! Thanks a lot.

    • @zangruver132
      @zangruver132 4 years ago

      @@codeRECODE Well, I have never done freelancing, nor do I have any idea about it. Can you suggest at least one or two sites for me to start web-scraping freelancing in India? Also, do I need any prior experience?

  • @codingfun915
    @codingfun915 3 years ago

    How can I get the information if I have all the links to the schools and want to extract data from those links? Where should I keep all the links? In start_urls, or where? Please help me ASAP.

  • @l0remipsum991
    @l0remipsum991 3 years ago

    Thank you so much. 1437! You literally saved my a$$. Subbed!

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Thanks for the sub!

  • @kamaralam914
    @kamaralam914 a year ago

    Sir, in my case I am using this for IndiaMART and not getting any data in the Response tab!

  • @niteeshmishra2790
    @niteeshmishra2790 a year ago

    Hi, I am wondering how to scrape multiple fields. Suppose I searched for mobiles on Amazon; now I want to get the brand, name, description, link, and complete details, along with the next page.

    • @codeRECODE
      @codeRECODE  a year ago

      See this th-cam.com/video/LfSsbJtby-M/w-d-xo.html

  • @cueva_mc
    @cueva_mc 3 years ago

    Is it possible to parse the "base_url" instead of copying it?

    • @cueva_mc
      @cueva_mc 3 years ago +1

      Or is it possible to parse the XHR URLs from Python?

    • @codeRECODE
      @codeRECODE  3 years ago

      I am not sure what you want to ask, can you expand your question?

  • @bibashacharya2637
    @bibashacharya2637 2 years ago

    Hello sir, my question is: can we do exactly the same thing with Docker and Splash? Please reply.

    • @codeRECODE
      @codeRECODE  2 years ago

      Yes -- See this th-cam.com/video/RgdaP54RvUM/w-d-xo.html

  • @muhammedjaabir2609
    @muhammedjaabir2609 4 years ago

    Why am I getting this error???
    raise JSONDecodeError("Expecting value", s, err.value) from None

  • @ThangHuynhTu
    @ThangHuynhTu 2 years ago

    (7:00): How can you copy-paste the headers like that? When I try to copy like you do, I have to add the quotes myself. Is there any way to copy as fast as you do?

    • @codeRECODE
      @codeRECODE  2 years ago +1

      Oh, I understand the confusion. I removed that part to keep the video short. Anyway, you can make it quick and easy by following these steps:
      pip install scraper-helper
      This library contains some useful functions that I created for my personal use and later made open source.
      Once you have it installed, you can use the headers that you copied directly, without formatting. Simply use the function get_dict() and pass the headers in a triple-quoted string:
      headers = scraper_helper.get_dict('''
          accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-
          accept-encoding: gzip, deflate, br
          accept-language: en-GB,en;q=0.9
      ''')
      It will also take care of cleaning up unwanted headers like cookies, content-length, etc. Good luck!
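
      The cleaned-up dict can then go straight into a request (a sketch, assuming import scrapy and import scraper_helper at the top of the spider file):
      yield scrapy.Request(url, headers=headers, callback=self.parse)  # inside a spider method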

    • @ThangHuynhTu
      @ThangHuynhTu 2 years ago

      @@codeRECODE Really nice. Thanks for clarifying!

  • @sowson4347
    @sowson4347 4 years ago +1

    Thank you for the easy-to-follow videos done in a calm, unhurried manner. I noticed you used VS Code for part of the work and CMD for running Scrapy. I found it extremely difficult to load Scrapy into VS Code, even with a virtual environment; I could not run it in the VS Code terminal. How did you do it?

    • @codeRECODE
      @codeRECODE  4 years ago +2

      I work with Scrapy a lot, so I have it installed at the system level ("pip install scrapy" at cmd with admin rights). It just saves me a few steps. When I have to distribute the code, I always create a virtual environment and use Scrapy inside it.
      If I want to use the VS Code terminal, I just use the bottom-left area where the Python environment in use is listed, click it, and set it to the current virtual environment.

    • @sowson4347
      @sowson4347 4 years ago +1

      @@codeRECODE Thank you for responding so quickly. I was under the impression that Scrapy could run in VS Code just like BS. I solved the issue after watching your video many times over and reading numerous other sites. What I had failed to comprehend was that Scrapy has to be run in the Anaconda cmd environment, not within a VS Code notebook; VS Code is just an editor being used to create the spider file. Your use of the ntschools.py file in C:\Users\Work also confused me. I have now created my first Scrapy spider and can follow your videos better. Thanks, keep up the good work.
      Scrapy refused to install at the system level; I had to use Anaconda.

    • @codeRECODE
      @codeRECODE  4 years ago

      Good that the issue is resolved. I have never had a problem installing Scrapy with an elevated cmd (run as administrator) or sudo pip3 install, so I don't know why you faced a problem.
      BTW, "work" was just my user id.

    • @sowson4347
      @sowson4347 4 years ago

      @@codeRECODE User Error 101 - RTFM

  • @TheCherryCM
    @TheCherryCM 4 years ago

    Hi,
    Could you help me solve a similar kind of problem? I tried these headers but am still not getting any data.

  • @abukaium2106
    @abukaium2106 4 years ago

    Hello sir, I have made a spider the same as your code, but it shows: twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost. What can I do to solve it? Please reply. Thanks.

    • @codeRECODE
      @codeRECODE  4 years ago

      Some connectivity issue. See if you can connect using scrapy shell.

  • @himanshuranjan7456
    @himanshuranjan7456 4 years ago

    Just one question: does Scrapy have async support? Looking at libraries like requests or requests-html, they have async support, so the time consumed during scraping is much less.

    • @codeRECODE
      @codeRECODE  4 years ago

      Yes, and better!
      It is based on Twisted. The whole framework is built around the idea of async. You would have to use it to appreciate how fast it is.

  • @amarchinta4463
    @amarchinta4463 3 years ago

    Hi sir, I have one question, not about this tutorial. I want to fetch multiple different domains that have the same page structure with a single spider. How can I achieve this? Please help.

    • @codeRECODE
      @codeRECODE  3 years ago

      If the same structure means the same selectors for all those domains, just add them to start_urls or create a CrawlSpider.
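
      A minimal sketch of that idea (the URLs and selector are placeholders):
      import scrapy

      class MultiDomainSpider(scrapy.Spider):
          name = "multi"
          start_urls = [
              "https://site-one.example/products",
              "https://site-two.example/products",
          ]

          def parse(self, response):
              # The same selectors work because the page structure is identical
              for title in response.css("h2.product::text").getall():
                  yield {"title": title, "source": response.url}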

  • @FBR2169
    @FBR2169 2 years ago

    Hello Sir. A quick question. What if the Request Method of the website is POST instead of GET? Will this still work? If not what should I do?

    • @codeRECODE
      @codeRECODE  2 years ago

      Yes, it will.
      See my many videos on POST requests - th-cam.com/users/CodeRECODEsearch?query=post
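
      A sketch of the two common POST styles in Scrapy (the URL and payload are placeholders):
      import json
      import scrapy

      def start_requests(self):  # inside a spider class
          # Form-encoded POST
          yield scrapy.FormRequest(
              "https://example.com/api/search",
              formdata={"query": "schools", "page": "1"},
              callback=self.parse_api,
          )
          # JSON POST, which many hidden APIs expect
          yield scrapy.Request(
              "https://example.com/api/search",
              method="POST",
              body=json.dumps({"query": "schools", "page": 1}),
              headers={"Content-Type": "application/json"},
              callback=self.parse_api,
          )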

  • @yashnenwani9261
    @yashnenwani9261 3 years ago

    Sir, I want to use the search bar to search for a particular thing and then extract the related data.
    Please help!

    • @codeRECODE
      @codeRECODE  3 years ago

      Open dev tools and check the Network tab. See what happens when you click search.
      If you can't figure it out, use Selenium.

  • @chakrabmonoj
    @chakrabmonoj 3 years ago

    In fact, I followed your steps into the XHR tab, and 1. it does not show accept.json (but the site is run by JS, which I checked with the hack you showed here); 2. it also says 'eval' is not allowed on the site (not sure what that means), and it shows no file being generated as you showed for this site.
    What could be happening here?
    I am trying to sort all my connections by the total number of reactions their posts have got.
    Can you help with a suggestion for coding this?
    Thanks.

    • @codeRECODE
      @codeRECODE  3 years ago +1

      I am attaching the link to the code. I just tried it and it works. Make sure that you run this with *scrapy runspider ntschools.py*, not like a regular Python script.
      Source: gist.github.com/eupendra/7900849c56872925635d0c6c6b8f78f5

    • @chakrabmonoj
      @chakrabmonoj 3 years ago

      @@codeRECODE Thanks for the quick reply. What I forgot to mention is that I was trying to use your code on LinkedIn. Does it have such strict privacy protections that no JSON file shows up being generated? Any help appreciated.

  • @harshgupta-ds2cw
    @harshgupta-ds2cw 4 years ago

    I have been trying to find a web scraper that will work on OTT platforms. Your method didn't give me any results. I need help.

    • @codeRECODE
      @codeRECODE  4 years ago

      Scraping OTT is almost impossible, due to technical reasons (they have multiple layers of defenses to stop piracy) AND legal reasons. I am not going to attempt it for sure :-)

  • @the_akpathi
    @the_akpathi 2 years ago

    Is it legally OK to send headers from a script like this? Especially headers like user-agent?

    • @codeRECODE
      @codeRECODE  2 years ago

      This is an educational video aiming to teach how things work. For legal issues, you would need to talk to your lawyer.

  • @maysgumir3972
    @maysgumir3972 4 years ago

    Hi,
    I need your help. I am trying to scrape details from the e-commerce site www.banggood.com. The price is AJAX-loaded, and I cannot retrieve it with Scrapy, so I tried to find the AJAX request manually as you teach in the video, but I cannot find the exact path for the request. Could you please make a video on this particular website (finding the AJAX request manually)? Your help would be much appreciated. You can choose any category for scraping details.
    @Code / RECODE

  • @arunk6435
    @arunk6435 2 years ago

    Hello, Mr Upendra. Every time I start to scrape, my data usage reaches its limit too fast. What is your data plan? I mean, how many GBs are you allowed to use per day?

    • @codeRECODE
      @codeRECODE  2 years ago

      It's really hard to calculate how many GBs your project is going to consume. If you can, run your project on one of the cloud services.
      For any serious work, I would suggest getting a broadband connection with no data cap.

    • @arunk6435
      @arunk6435 2 years ago

      @@codeRECODE Thank you, Mr Upendra. I would like to know what data plan you use. What is your daily data limit?

  • @chapidi99
    @chapidi99 3 years ago

    Hello, is there an example of how to scrape when there is paging?

    • @codeRECODE
      @codeRECODE  3 years ago

      I have covered pagination in many videos. I am planning to create one video that covers all kinds of pagination.

  • @stalluri11
    @stalluri11 3 years ago

    Is there a way to scrape webpages in Python when the URL does not change with the page number?

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Yes, I have covered this in many videos. I am planning to do a dedicated video on pagination.

    • @stalluri11
      @stalluri11 3 years ago

      @@codeRECODE Looking forward to it. I can't find a video on this.

  • @beebeeoii5461
    @beebeeoii5461 3 years ago

    Hi, great video, but sadly this will not work if the site does some hashing/encryption on its API; e.g., a token has to be attached as a header, and the token can only be obtained through some computation done by the webpage.

    • @codeRECODE
      @codeRECODE  3 years ago +2

      If your browser can handle the encryption and hashing, you can do it with Scrapy too. Most of the time, they will just send some unique key that you have to send in the next request.
      If you don't have time to examine how it works, you can use Splash/Selenium or something similar and save time. It will be faster to code but slower in execution.
      If you do figure out the APIs, the scrapes are going to be very fast, especially when you want to get millions of items every day.
      Finally, just think of it as another tool in your arsenal. Use the one that suits the problem at hand :-)
      Good luck!

  • @udayposia5069
    @udayposia5069 3 years ago

    I want to send a null value for one of the formdata fields using FormRequest.from_response. How should I pass a null value? It's not accepting '' or None.

    • @codeRECODE
      @codeRECODE  3 years ago

      Share your code. Usually blank strings work.

  • @shubhamsaxena3220
    @shubhamsaxena3220 2 years ago

    Can we scrape any dynamic website using this method?

    • @codeRECODE
      @codeRECODE  2 years ago +1

      Short answer - No. There are multiple techniques to scrape dynamic websites. Every site is different and would need a different technique.

  • @adityapandit7344
    @adityapandit7344 3 years ago

    Hi Sir,
    How can we scrape JSON data from a website using Scrapy?

    • @codeRECODE
      @codeRECODE  3 years ago

      Create a regular Scrapy request for the URL that contains the JSON data. In the callback method (for example, parse), you can access the JSON directly using response.json() in the newer versions.
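
      A minimal sketch (the URL and the "items" key are placeholders; response.json() needs Scrapy 2.2+):
      import scrapy

      class JsonSpider(scrapy.Spider):
          name = "json_demo"
          start_urls = ["https://example.com/api/items"]

          def parse(self, response):
              data = response.json()  # parses the JSON body into Python objects
              for item in data["items"]:  # assumes a top-level "items" key
                  yield item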

    • @adityapandit7344
      @adityapandit7344 3 years ago

      @@codeRECODE Hi sir, have you posted any video on it?

  • @adityapandit7344
    @adityapandit7344 3 years ago

    Hi sir, when I load the JSON data, I am facing a JSON decode error: "Expecting value: line 1". What is the solution to it?

    • @codeRECODE
      @codeRECODE  3 years ago

      It means that the string you are trying to load as JSON is not in valid JSON format. It may need some cleanup.

    • @adityapandit7344
      @adityapandit7344 3 years ago

      @@codeRECODE Yes sir, the error has been resolved. Now can you give me an idea of how I can link Scrapy with Django? It would be very helpful. Sorry, I am asking too many questions, but I am doing this practically, and that's why I am facing these problems.

  • @oktayozkan2256
    @oktayozkan2256 2 years ago

    This is API scraping. Some websites use CSRF tokens and sessions in their APIs, which makes them nearly impossible to scrape via the API.

    • @codeRECODE
      @codeRECODE  2 years ago

      While CSRF tokens and sessions can be handled, I do agree that this technique does not work everywhere.
      However, this should be the first thing that we try. Rendering using Selenium/Playwright should be the last resort.
      Even after that, many websites will not work, and there will be no workaround. 🙂
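
      A hedged sketch of handling a CSRF token, assuming the site embeds it in a hidden form field (the field name and URL are placeholders):
      import scrapy

      def parse(self, response):  # inside a spider class
          # Token location varies per site; a hidden input is a common spot
          token = response.css('input[name="csrf_token"]::attr(value)').get()
          yield scrapy.FormRequest(
              "https://example.com/api/data",
              formdata={"csrf_token": token, "page": "1"},
              callback=self.parse_api,
          )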

  • @zaferbagdu5001
    @zaferbagdu5001 4 years ago

    Hi, I tried to write the code, but the query response returns 'Failed to load response data'. In the results there are jQuery links; should I use them?

    • @codeRECODE
      @codeRECODE  4 years ago +1

      share your code in pastebin or something similar. I will try to find the problem

    • @zaferbagdu5001
      @zaferbagdu5001 4 years ago

      @@codeRECODE Code here: pastebin.pl/view/ee0b7d3d
      In short, the real page is www.tjk.org/TR/YarisSever/Info/Page/GunlukYarisSonuclari
      I want to scrape the tables on this page.
      Thanks for everything.

  • @WDMatt02
    @WDMatt02 3 years ago +1

    I love you, Indian buddy; thanks for your rook sacrifice.

    • @codeRECODE
      @codeRECODE  3 years ago

      Glad that my videos are helpful :-)

  • @adityapandit7344
    @adityapandit7344 3 years ago

    Hi sir

    • @codeRECODE
      @codeRECODE  3 years ago

      Please watch the XPath video I posted. That will help you. It will be something like this:
      //script[@type="application/ld+json"]

    • @adityapandit7344
      @adityapandit7344 3 years ago

      @@codeRECODE Yes, but it's the second script tag on this page. How can we select the second one?

    • @codeRECODE
      @codeRECODE  3 years ago

      just add [2]

    • @adityapandit7344
      @adityapandit7344 3 years ago

      @@codeRECODE Where can I add the 2? Can you tell me?
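
      Concretely, a sketch of selecting the second matching script tag (XPath positions are 1-based; assumes import json):
      # inside a callback, grab the second <script type="application/ld+json"> block
      raw = response.xpath('(//script[@type="application/ld+json"])[2]/text()').get()
      data = json.loads(raw)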

  • @taimoor722
    @taimoor722 4 years ago

    I need help regarding how to approach clients for web-scraping projects.

    • @codeRECODE
      @codeRECODE  4 years ago

      I will be including some tips in my upcoming courses and videos.

  • @naijalaff6946
    @naijalaff6946 4 years ago +1

    Great video. Thank you so much.

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad you liked it!

  • @harshnambiar
    @harshnambiar 4 years ago

    Also, can you scrape bseindia this way?

    • @codeRECODE
      @codeRECODE  4 years ago

      Haven't tried BSE. Have a look at my blog to see how I did it for NSE:
      coderecode.com/scrapy-json-simple-spider/

  • @monika6800
    @monika6800 4 years ago

    Hi,
    Could you please help me with scraping a dynamic site?

    • @codeRECODE
      @codeRECODE  4 years ago +1

      Which site is that? What is the problem you are facing?

  • @engineerbaaniya4846
    @engineerbaaniya4846 4 years ago

    Where can I get the detailed tutorial?

    • @codeRECODE
      @codeRECODE  4 years ago

      courses.coderecode.com/p/mastering-web-scraping-with-python

  • @sunilghimire6990
    @sunilghimire6990 4 years ago

    scrapy crawl generates an error like:
    DEBUG: Rule at line 1702 without any user agent to enforce it on.
    Help me!

    • @codeRECODE
      @codeRECODE  4 years ago

      What exactly are you trying to achieve? Are you going through the same exercise as I showed in the video?

    • @sunilghimire6990
      @sunilghimire6990 4 years ago

      I am following your tutorials, and I tried to scrape a website:
      Title = response.css('title::text').extract()
      yield Title
      I got the title but also got the unusual error mentioned above.

    • @codeRECODE
      @codeRECODE  4 years ago +1

      @@sunilghimire6990
      It looks like you are either not passing the headers in the request, OR something is wrong with the user-agent part of the header dictionary, OR the header dictionary itself is not correctly formatted.
      Here are a few other things I can suggest:
      1. You are using extract(), which is the same as getall(). This is confusing, and that's why it is outdated now.
      2. You are probably using "scrapy crawl" to run the spider. What I created here is a standalone spider, which needs to be run using "scrapy runspider".
      3. Take up my free course to get the basics clear. I am sure it will help you. Here it is: coderecode.com/scrapy-crash-course
      4. Once you register for the free course, you will find the complete source code that you can run. If you face any problem, you can attach a screen print and code in the comments of my course, and I will surely help in detail.

    • @sunilghimire6990
      @sunilghimire6990 4 years ago

      @@codeRECODE thank you sir

  • @ashish23555
    @ashish23555 3 years ago +1

    Scrapy really is the best, but it takes time to become a pro.

    • @codeRECODE
      @codeRECODE  3 years ago

      Oh yes, Scrapy is best!

  • @zangruver132
    @zangruver132 4 years ago +1

    Hey, I wanted to scrape the number of comments for each game at the following link (fitgirl-repacks.site/all-my-repacks-a-z/), but I can't find it anywhere in the Network tab. Yes, the HTML without JS provides a comment count, but it is an outdated one.

    • @codeRECODE
      @codeRECODE  4 years ago +1

      It's there! Here is how to find it. Open the site, press F12, go to the Network tab, and open any listing. At the top, you will see something like 238 comments. Now, make sure that your focus is on the Network tab and press CTRL+F. Search for this number, 238. You will see quite a few results, and one of them will be a .js file that has this data.
      You will note that this comes from a third-party commenting system.
      Reminder - getting this data using web scraping may not be legal. I do not give advice on what is legal and what is not. What I explained is only for learning how websites work. Good luck!

  • @Ahmad-sn9kh
    @Ahmad-sn9kh a month ago

    I want to scrape data from TikTok. Can you help me?

  • @nimoDiary
    @nimoDiary 4 years ago

    Can you please teach how to scrape badminton players' data from the PBL site?

    • @codeRECODE
      @codeRECODE  4 years ago

      What's the site URL? What have you tried, and what problem are you facing?

    • @nimoDiary
      @nimoDiary 4 years ago

      www.pbl-india.com/
      I am trying to extract the squad data for all teams, with all their details including names, country, world rank, etc.

    • @codeRECODE
      @codeRECODE  4 years ago

      @@naijalaff6946 Thank you for the mention in readme. Feels good :-)

  • @ashish23555
    @ashish23555 3 years ago

    Why the need for Scrapy or Selenium, as these are not helpful with AJAX?

    • @codeRECODE
      @codeRECODE  3 years ago

      I am not sure I understand your question. Can you elaborate?

    • @ashish23555
      @ashish23555 3 years ago

      @@codeRECODE How do you scrape pages from a website protected with reCAPTCHA?

    • @codeRECODE
      @codeRECODE  3 years ago +1

      @@ashish23555 Use a service like 2captcha.com

  • @kaifscarbrow
    @kaifscarbrow a year ago

    Cool price. I've been doing ~500k records for $100 🥲

    • @codeRECODE
      @codeRECODE  a year ago

      Increase your price!

  • @KartikSir_
    @KartikSir_ 2 years ago

    Getting an error:
    [scrapy.core.engine] DEBUG: Crawled (403)

  • @shannoncole6425
    @shannoncole6425 3 years ago

    Nice!