Scrape Dynamic Sites with Splash and Python Scrapy - From Docker Installation to Scrapy Project

  • Published on Feb 2, 2025

Comments • 75

  • @saifmahin7425
    @saifmahin7425 2 years ago

    I have been following your videos for a couple of days now. You describe things very clearly, and I have learned many things from you that have helped me improve my coding. Thank you very much.

    • @codeRECODE
      @codeRECODE  2 years ago +1

      So nice of you

  • @sheikhakbar2067
    @sheikhakbar2067 3 years ago

    As usual, an exceptional and to-the-point tutorial.

  • @umair5807
    @umair5807 1 year ago

    1:56 I got this error here:
    C:\Users\M Umair>docker pull scrapinghub/splash
    Using default tag: latest
    Error response from daemon: open \\.\pipe\docker_engine_linux: The system cannot find the file specified.

  • @rubenpradesgrau8430
    @rubenpradesgrau8430 3 years ago +1

    Thank you ji! All your content is very useful, well explained, and organized. I wish you had been my teacher back when I was studying.

    • @codeRECODE
      @codeRECODE  3 years ago

      Thank you so much 🙂

  • @antulatajain3129
    @antulatajain3129 4 years ago +1

    Very informative
    thank you

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad you liked it!

  • @user8ZAKC1X6KC
    @user8ZAKC1X6KC 2 years ago

    How are you dealing with header issues and splash? I found the documentation, but I can't quite figure out how to implement it. Edit: specifically when using scrapy shell?

  • @villagenaturbd4579
    @villagenaturbd4579 4 years ago

    We are grateful to you because your videos always help us learn new things.
    Thank you very much!!!

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad to hear that

    • @learncodeinbangla1852
      @learncodeinbangla1852 3 years ago

      Sir,
      I am failing to enable the virtual environment. Could you please tell me how I can do it?

  • @cebysquire
    @cebysquire 3 years ago +1

    Hello sir, I've encountered a problem with Python interpreters 3.9 and 3.7:
    ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
    url = to_native_str(url)
    It comes from the scrapy_splash library. Is there any way around this?

    • @codeRECODE
      @codeRECODE  3 years ago +1

      Did you try to_unicode as the message suggests?

    • @cebysquire
      @cebysquire 3 years ago

      @@codeRECODE Yes sir, it still didn't work. I imported it (from scrapy.utils.python import to_unicode) and still got the same deprecation warning.

    • @codeRECODE
      @codeRECODE  3 years ago

      @@cebysquire share your code

    • @cebysquire
      @cebysquire 3 years ago

      @@codeRECODE Hello sir, it's working fine now. The to_unicode method needed an exact encoding parameter, so I added a detect-encoding function for the URL.
      However, the Scrapy log will still show the deprecation warning.
      Code screenshot:
      i.postimg.cc/02FTjYW7/Capture.png
      Thank you for replying, sir.

    • @codeRECODE
      @codeRECODE  3 years ago +1

      @@cebysquire Hey! Came back to this now. This is not the correct approach.
      I guess you are facing issues with exporting in Unicode format. Scrapy exports in UTF-8 by default, except for the JSON format. See this from the documentation:
      docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORT_ENCODING
      FEED_EXPORT_ENCODING
      If unset or set to None (the default), it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
      Use utf-8 if you want UTF-8 for JSON too.
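
      A minimal settings.py sketch of that documented setting, for anyone who wants to apply it directly:

          # settings.py -- sketch only
          # Force UTF-8 for all feed exports, including JSON
          # (otherwise JSON falls back to \uXXXX escapes, as described above).
          FEED_EXPORT_ENCODING = "utf-8"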

  • @gabbhasounds4785
    @gabbhasounds4785 3 months ago

    Thank you. I'm having lots of problems installing Docker.
    WSL2 should be installed first, even changing the configuration in the BIOS.
    Is that right?

    • @codeRECODE
      @codeRECODE  3 months ago +1

      Try Playwright. You don't need splash anymore.

    • @gabbhasounds4785
      @gabbhasounds4785 3 months ago

      @@codeRECODE Thank you master! I'll continue with the rest of the playlist videos.
      Regards!

  • @brunomgfernandes
    @brunomgfernandes 3 years ago

    Thanks for the video series! Will you ever cover how to simply crawl a whole website by following every href in it? Also, what about when websites use Shadow DOM?

  • @psycode5569
    @psycode5569 2 years ago

    Hi, do you have a splash tutorial for pages that have login?

  • @pythonically
    @pythonically 2 years ago

    Using this code I'm only getting all the results in one line in the CSV. Why?

  • @marcossahade9369
    @marcossahade9369 2 years ago

    Is it possible to use Splash with CrawlSpider? Or use LinkExtractor with Splash? Thank you very much for your ...

  • @rabbiaarshad3547
    @rabbiaarshad3547 4 years ago

    After installing Docker, when I run the scrapinghub/splash command, Docker shows the error mentioned below:
    error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post %2F%2F.%2Fpipe%2Fdocker_engine/v1.24/images/create?fromImage=scrapinghub%2Fsplash&tag=latest: open //./pipe/docker_engine: The system cannot find the file specified.
    Kindly tell me how to solve this?

    • @codeRECODE
      @codeRECODE  4 years ago

      Try running the Docker hello-world sample first to see if the Docker installation is working:
      docker run hello-world (this should show something like "not found locally, downloading" and then "Hello from Docker").
      If this doesn't work, check the documentation: docs.docker.com/docker-for-windows/install/
      Good luck!

    • @codeRECODE
      @codeRECODE  4 years ago

      By the way, read the error carefully - "docker client must be run with elevated privileges to connect"
      Did you try running docker with Admin rights? See this: stackoverflow.com/questions/40459280/docker-cannot-start-on-windows

  • @ThallaSampathKumar
    @ThallaSampathKumar 1 year ago

    DNS lookup failed: no results for hostname lookup: x.
    2023-07-29 11:45:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying (failed 2 times): DNS lookup failed: no results for hostname lookup: x.
    2023-07-29 11:45:48 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying (failed 3 times): DNS lookup failed: no results for hostname lookup: x.
    2023-07-29 11:45:48 [scrapy.core.scraper] ERROR: Error downloading
    Does anyone know why I am getting these errors?

  • @cueva_mc
    @cueva_mc 4 years ago

    Thank you, very useful

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad to hear that!

  • @diegovargas3853
    @diegovargas3853 4 years ago +1

    Can you explain how we can use Splash + CrawlSpider? Please.

    • @codeRECODE
      @codeRECODE  4 years ago +1

      Will try to find some samples for you.

  • @villagenaturbd4579
    @villagenaturbd4579 3 years ago

    Sir,
    While using Docker, I fail to enable a virtual environment using CMD. Could you please tell me how I can do it?
    How can I get to the venv file location like you do?
    Thanks.
    Shahidul.

    • @codeRECODE
      @codeRECODE  3 years ago

      Hey, got back to this now. What was the problem?

  • @abukaium2106
    @abukaium2106 4 years ago

    Great tutorial. I follow you every time. Would you make a video about preventing getting blocked in Scrapy?

    • @codeRECODE
      @codeRECODE  4 years ago +1

      Use DOWNLOAD_DELAY (docs.scrapy.org/en/latest/topics/settings.html#download-delay) and AutoThrottle (docs.scrapy.org/en/latest/topics/autothrottle.html#topics-autothrottle). If these two don't work, use proxies. I have already covered proxies in one of my videos.
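
      A minimal settings.py sketch of those two options (the numbers are only illustrative values, not recommendations from the video):

          # settings.py -- example throttling configuration, sketch only
          DOWNLOAD_DELAY = 2                     # wait ~2 seconds between requests to the same site
          AUTOTHROTTLE_ENABLED = True            # enable the AutoThrottle extension
          AUTOTHROTTLE_START_DELAY = 5           # initial download delay
          AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay under high latency
          AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site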

  • @digoingame151
    @digoingame151 3 years ago

    Thanks a lot, bro, you are helping me so much.

  • @samibdh
    @samibdh 4 years ago

    Very useful, thank you!

    • @codeRECODE
      @codeRECODE  4 years ago

      Glad it was helpful!

  • @beefwater
    @beefwater 2 years ago

    I'm viewing this video about a year and a half later, and I wanted to know if you still feel this is valid or if there is a newer, better solution today?

    • @codeRECODE
      @codeRECODE  2 years ago

      Good question. This is one of the solutions. Playwright is getting a lot of attention these days, though.

  • @miladmoradnia2844
    @miladmoradnia2844 3 years ago

    I have a problem: docker pull scrapinghub/splash >>>> unauthorized: authentication required. My Windows version is 10 Enterprise LTSC.

    • @codeRECODE
      @codeRECODE  3 years ago

      Looks like you are able to install but not pull. Windows 10 64-bit Pro, Enterprise, or Education (Build 17134 or later) is supported officially. Try this first:
      *docker login -u username*
      If it doesn't work, then Google would be your friend. Share your findings for others :-)

  • @alichaudhary1832
    @alichaudhary1832 4 years ago

    I downloaded Docker but cannot install it. An error says the installation needs Windows 10 Pro, although my Windows is 10 Pro. I don't understand how to fix it.

    • @codeRECODE
      @codeRECODE  4 years ago

      Check the system requirements: docs.docker.com/docker-for-windows/install/
      You already have Windows Pro; otherwise, for Home the instructions are here: docs.docker.com/docker-for-windows/install-windows-home/

  • @hythamaly9624
    @hythamaly9624 3 years ago

    Can you please tell us which IDE you are using? I cannot find the settings.py file when I create a new Python project using PyDev with Eclipse.

    • @codeRECODE
      @codeRECODE  3 years ago +1

      The IDE does not matter.
      If you run *scrapy startproject yourprojectname* from the terminal, it will create the complete project structure, including settings.py

    • @codeRECODE
      @codeRECODE  3 years ago +1

      By the way, I use VS Code and PyCharm. But again, this does not matter.

    • @hythamaly9624
      @hythamaly9624 3 years ago

      @@codeRECODE Thanks a million, the video really helped me!

  • @BASUDEV87
    @BASUDEV87 3 years ago

    Thank you for providing useful content.
    But I am getting stuck with the error below. Please help me find a solution.
    Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it.

  • @arefebsh5461
    @arefebsh5461 4 years ago +2

    Can you make a video to crawl information from Instagram?
    thank you very much

    • @codeRECODE
      @codeRECODE  4 years ago

      Why not use their API? They explicitly ban scraping, thus no plans to cover it.

  • @ALANAMUL
    @ALANAMUL 4 years ago

    Sir, can you show us how to scrape pages that have a "Load More" button?
    I have been looking for a solution to scrape such sites.

    • @codeRECODE
      @codeRECODE  4 years ago

      See my video on infinite scroll

  • @turanahmad2306
    @turanahmad2306 4 years ago

    Hello sir. Firstly, thanks a lot for the video. I have a question about scraping pages. I am doing the same thing for another website. However, I don't just get the title and price from the first page. Instead, I extract the first 40 items with their links and then send another call with SplashRequest (meaning I create a second parse function) and define the items I want to extract. However, it fails each time and only extracts 5 to 8 items out of 40. Could you please let me know if there is any way to get all the items?

    • @codeRECODE
      @codeRECODE  4 years ago

      Looks like the page is taking longer to load. Try adding a wait to the Splash request - yield SplashRequest(url, args={'wait': 5})
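
      For anyone following along, a minimal spider sketch using that wait argument (the URL and selectors are placeholders, not taken from the video):

          import scrapy
          from scrapy_splash import SplashRequest

          class ProductSpider(scrapy.Spider):
              name = "products"
              start_urls = ["https://example.com/products"]  # placeholder URL

              def start_requests(self):
                  for url in self.start_urls:
                      # Ask Splash to render the page and wait 5 seconds
                      # so JavaScript-loaded content has time to appear.
                      yield SplashRequest(url, callback=self.parse, args={"wait": 5})

              def parse(self, response):
                  for item in response.css("div.product"):  # placeholder selector
                      yield {"title": item.css("h2::text").get()}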

    • @turanahmad2306
      @turanahmad2306 4 years ago

      @@codeRECODE Thanks for the response. I actually use the wait, but it still doesn't help me. The code for the SplashRequest and the output error that I got are below. Please let me know if you have any idea why this happens.
      yield SplashRequest(url=absolute_url, callback=self.parse_product, magic_response=True,
                          meta={'handle_httpstatus_all': True}, endpoint='execute',
                          args={'lua_source': self.script2, 'wait': 25,
                                'timeout': 90, 'resource_timeout': 10
                                })
      This is the code for the second section. It still fails to extract all the items I ask for in the parse_product function. Some links work, some don't. The error:
      [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying

  • @TheWhoIsTom
    @TheWhoIsTom 4 years ago

    Very good tutorial. Can you maybe show how to use a rotating proxy? Can't figure out how to use it with Docker and Splash/Scrapy :/

    • @codeRECODE
      @codeRECODE  4 years ago

      I have covered proxies on my channel ( th-cam.com/video/qHahcxoGfpc/w-d-xo.html ), but not with Splash. ScraperAPI, which I covered in my video, can accept an additional parameter and they will do the rendering. That would be the $249 plan. There are more services, but almost all are more expensive. See this article for a comparison; it should give you a general idea about prices. Don't forget to check the JS Render option at the top of the page. www.scraperapi.com/compare-best-residential-datacenter-rotating-proxy-providers-for-web-scraping

  • @kizord9552
    @kizord9552 4 years ago

    Thanks !

  • @bekhzodortikov421
    @bekhzodortikov421 1 year ago

    Where can I find the results?

    • @codeRECODE
      @codeRECODE  1 year ago

      You can save the output using the -o switch. For example: scrapy crawl laptop -o yourfile.csv
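
      If you would rather configure the export in code than on the command line, a sketch using the FEEDS setting (available in Scrapy 2.1+; the file name is just an example):

          # settings.py -- sketch only; roughly equivalent to `-o yourfile.csv`
          FEEDS = {
              "yourfile.csv": {
                  "format": "csv",  # export the scraped items as CSV
              },
          }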

  • @musiangong4640
    @musiangong4640 4 years ago

    Could you share how to use the item pipeline?

    • @codeRECODE
      @codeRECODE  4 years ago

      This is a good idea for the next video. Thanks

  • @SaMi-se2qs
    @SaMi-se2qs 2 years ago

    Sir, I'm not able to get to the next page when I run this code. What's the problem here? I don't know.
    import scrapy

    class BooksSpider(scrapy.Spider):
        name = 'books'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['books.toscrape.com/']

        def parse(self, response):
            books = response.css('ol.row li')
            for url in books:
                url = url.css('div.image_container a::attr(href)').get()
                url = response.urljoin(url)
                yield scrapy.Request(url, callback=self.parse_books)

        def parse_books(self, response):
            yield {
                'title': response.css('div>h1::text').get().strip(),
                'catagories': response.css('ul.breadcrumb>:nth-child(3)>a::text').get().strip()
            }
            next_page = response.css('.next > a::attr(href)').get()
            if next_page:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse_books)

    Note: I used the same script on other sites and it works fine.