I have been following your videos for a couple of days now. You explain things very clearly. I have learned many things from you that have helped me improve my coding. Thank you very much.
So nice of you
As usual, an exceptional and to-the-point tutorial.
Thank you 😃
1:56 I got this error here:
C:\Users\M Umair>docker pull scrapinghub/splash
Using default tag: latest
Error response from daemon: open \\.\pipe\docker_engine_linux: The system cannot find the file specified.
Thank you ji! All your content is very useful, well explained, and organized. I wish you had been my teacher back when I was studying.
Thank you so much 🙂
Very informative
thank you
Glad you liked it!
How are you dealing with header issues and splash? I found the documentation, but I can't quite figure out how to implement it. Edit: specifically when using scrapy shell?
We are grateful to you because your videos always help us learn new things.
Thank you very much!!!
Glad to hear that
Sir,
I failed to enable the virtual environment. Could you please tell me how I can do it?
Hello sir, I've encountered a problem with Python interpreters 3.9 and 3.7:
ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
It comes from the scrapy_splash library. Is there any way around this?
Did you try to_unicode as the message suggests?
@@codeRECODE Yes sir, it still didn't work. I imported it (from scrapy.utils.python import to_unicode) and still got the same deprecation warning.
@@cebysquire share your code
@@codeRECODE Hello sir, it's working fine now. The to_unicode method needed an explicit encoding parameter, so I added an encoding-detection function for the URL.
However, the Scrapy log still shows the deprecation warning.
code screenshot:
i.postimg.cc/02FTjYW7/Capture.png
Thank you for replying sir.
@@cebysquire Hey! Came back to this now. This is not the correct approach.
I guess you are facing issues exporting in Unicode format. Scrapy exports in UTF-8 by default, except for the JSON format. See this from the documentation:
docs.scrapy.org/en/latest/topics/feed-exports.html#std-setting-FEED_EXPORT_ENCODING
FEED_EXPORT_ENCODING
If unset or set to None (default) it uses UTF-8 for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
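A one-line sketch of how that looks in settings.py, if you do want UTF-8 for JSON output as well:
# settings.py - use UTF-8 for every feed export, including JSON
FEED_EXPORT_ENCODING = "utf-8"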
Thank you. I'm having lots of problems installing Docker.
WSL2 had to be installed first, which even meant changing a setting in the BIOS.
Is that right?
Try Playwright. You don't need splash anymore.
@@codeRECODE Thank you master! I'll continue with the rest of the playlist videos.
Regards!
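For anyone following the Playwright route, a rough sketch of how a scrapy-playwright setup typically looks (check the scrapy-playwright README for your version; the spider name and URL below are only placeholders):
# settings.py - route requests through Playwright instead of Splash
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# spider - ask for a rendered page with the playwright meta key
import scrapy

class QuotesJsSpider(scrapy.Spider):
    name = "quotes_js"

    def start_requests(self):
        # Placeholder JS-heavy page, used only for illustration
        yield scrapy.Request("http://quotes.toscrape.com/js/", meta={"playwright": True})

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}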
Thanks for the video series! Will you ever cover how to simply crawl a whole website by following every href in it? Also, what about websites that use Shadow DOM?
Hi, do you have a splash tutorial for pages that have login?
Using this code, I'm getting all the results on a single line in the CSV. Why?
Is it possible to use Splash with CrawlSpider? Or use LinkExtractor with Splash? Thank you very much for your videos...
After installing Docker, when I run the scrapinghub/splash command, Docker shows the error below:
error during connect: In the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect.: Post %2F%2F.%2Fpipe%2Fdocker_engine/v1.24/images/create?fromImage=scrapinghub%2Fsplash&tag=latest: open //./pipe/docker_engine: The system cannot find the file specified.
Kindly tell me how to solve this?
Try running the Docker hello-world sample first to see if your Docker installation is working.
docker run hello-world (this should print something like "Unable to find image locally", download it, and then show "Hello from Docker!")
If this doesn't work, check the documentation docs.docker.com/docker-for-windows/install/
Good luck!
By the way, read the error carefully - "docker client must be run with elevated privileges to connect"
Did you try running docker with Admin rights? See this: stackoverflow.com/questions/40459280/docker-cannot-start-on-windows
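As a rough sanity check, from a PowerShell window opened with "Run as administrator" (just one way to verify, not the only fix):
docker version               # should list both Client and Server sections
docker run hello-world       # quick end-to-end test
docker pull scrapinghub/splash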
DNS lookup failed: no results for hostname lookup: x.
2023-07-29 11:45:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying (failed 2 times): DNS lookup failed: no results for hostname lookup: x.
2023-07-29 11:45:48 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying (failed 3 times): DNS lookup failed: no results for hostname lookup: x.
2023-07-29 11:45:48 [scrapy.core.scraper] ERROR: Error downloading
Does anyone know why I am getting this error?
Thank you, very useful
Glad to hear that!
Can you explain how we can use Splash with a CrawlSpider? Please.
Will try to find some samples for you.
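In the meantime, a rough sketch of one common pattern, not something shown in the video: use the rule's process_request hook to re-issue every extracted link as a SplashRequest. The domain, link pattern, and wait time below are placeholders, and in Scrapy versions before 2.0 process_request receives only the request (no response argument).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class SplashCrawlSpider(CrawlSpider):
    name = "splash_crawl"
    allowed_domains = ["example.com"]      # placeholder
    start_urls = ["http://example.com/"]   # placeholder

    rules = (
        Rule(
            LinkExtractor(allow=r"/products/"),  # placeholder pattern
            callback="parse_item",
            follow=True,
            process_request="use_splash",
        ),
    )

    def use_splash(self, request, response):
        # Re-wrap every link the rule extracted as a SplashRequest
        # so the page is rendered before parse_item runs.
        return SplashRequest(request.url, callback=self.parse_item, args={"wait": 2})

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}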
Sir,
While using Docker, I failed to enable a virtual environment using CMD. Could you please tell me how I can do it?
How can I get to the venv folder location like you do?
Thanks.
Shahidul.
Hey, got back to this now. What was the problem?
Great tutorial. I follow you every time. Would you make a video on preventing getting blocked in Scrapy?
Use DOWNLOAD_DELAY (docs.scrapy.org/en/latest/topics/settings.html#download-delay) and AutoThrottle (docs.scrapy.org/en/latest/topics/autothrottle.html#topics-autothrottle). If these two don't work, use proxies. I have already covered proxies in one of my videos.
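A minimal sketch of how those settings might look in settings.py (the numbers are illustrative, not recommendations):
# settings.py - slow the crawl down to reduce the chance of getting blocked
DOWNLOAD_DELAY = 2             # seconds between requests to the same site
AUTOTHROTTLE_ENABLED = True    # adjusts the delay based on server response times
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10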
Thanks a lot bro, you're helping me so much.
Happy to help
Very useful, thank you!
Glad it was helpful!
I'm viewing this video about a year and a half later, and I wanted to know if you still felt this was valid or if there was a newer better solution today?
Good question. This is one of the solutions. Playwright is getting a lot of attention these days, though.
I have a problem: docker pull scrapinghub/splash >>>> unauthorized: authentication required. My Windows version is 10 Enterprise LTSC.
Looks like you are able to install but not pull. Windows 10 64-bit: Pro, Enterprise, or Education (Build 17134 or later) are supported officially. Try this first.
*docker login -u username*
If it doesn't work, then Google would be your friend. Share your findings for others :-)
I downloaded Docker but cannot install it. Some errors come up saying that this installation needs Windows 10 Pro, although my Windows is 10 Pro. I don't understand how to fix it.
Check the system requirements. docs.docker.com/docker-for-windows/install/
You already have Windows Pro; otherwise, for Home, the instructions are here: docs.docker.com/docker-for-windows/install-windows-home/
Can you please tell me which IDE you are using? I cannot find the settings.py file when I create a new Python project using PyDev with Eclipse.
The IDE does not matter.
If you run *scrapy startproject yourprojectname* from the terminal, it will create the complete project structure, including settings.py.
By the way, I use VS Code and Pycharm. But again, this does not matter
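For reference, a rough sketch of what that command produces (the project name is just an example):
scrapy startproject yourprojectname
# yourprojectname/
#     scrapy.cfg
#     yourprojectname/
#         __init__.py
#         items.py
#         middlewares.py
#         pipelines.py
#         settings.py    <- the file you are looking for
#         spiders/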
@@codeRECODE Thanks a million, the video really helped me!
Thank you for providing useful content.
But I am stuck with the error below. Please help me find a solution.
Connection was refused by other side: 10061: No connection could be made because the target machine actively refused it..
Your Docker is not running.
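If the Splash container is indeed not running, starting it usually looks like this (default port assumed):
docker run -it -p 8050:8050 --rm scrapinghub/splash
# and in settings.py: SPLASH_URL = 'http://localhost:8050'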
Can you make a video to crawl information from Instagram?
thank you very much
Why not use their API? They explicitly ban scraping, thus no plans to cover it.
Sir, can you show us how to scrape pages that have a "Load More" button?
I have been looking for a solution to scrape such sites.
See my video on infinite scroll
Hello Sir, firstly, thanks a lot for the video. I have a question about scraping pages. I am doing the same thing for another website. However, I do not just get the title and price from the first page. Instead, I extract the first 40 items with their links and then send another call with SplashRequest (meaning I create a second parse function) and define the items I want to extract. However, it fails each time and only extracts 5 to 8 items out of 40. Could you please let me know if there is any way to get all the items?
Looks like the page is taking longer to load. Try adding wait to splash request - yield SplashRequest(url, args={'wait': 5})
@@codeRECODE Thanks for the response. I actually use the wait; however, it still doesn't help. The code for the SplashRequest and the output error that I got are below. Please let me know if you have any idea why this happens.
yield SplashRequest(url=absolute_url, callback=self.parse_product, magic_response=True,
meta={'handle_httpstatus_all': True}, endpoint='execute',
args={'lua_source': self.script2, 'wait': 25,
'timeout': 90, 'resource_timeout': 10
})
This is the code for the second section. It still fails to extract all the items I ask for in the parse_product function. Some links work, some don't. The error:
[scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying
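One thing worth checking here, purely as a guess since script2 isn't shown: with endpoint='execute', the wait argument only takes effect if the Lua script actually uses it. A minimal script along these lines (assigned to self.script2 in the spider) would render the page and respect the wait value:
script2 = """
function main(splash, args)
    -- go to the requested URL, wait for the page to settle, return the HTML
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""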
Very good tutorial. Can you maybe show how to use a rotating proxy? Can't figure out how to use it with Docker and Splash/Scrapy :/
Have covered proxies on my channel (th-cam.com/video/qHahcxoGfpc/w-d-xo.html), but not with Splash. ScraperAPI, which I covered in my video, can accept an additional parameter and they will do the rendering. That would be the $249 plan. There are more services, but almost all are more expensive. See this article for a comparison; it should give you a general idea about prices. Don't forget to check the JS Render option at the top of the page. www.scraperapi.com/compare-best-residential-datacenter-rotating-proxy-providers-for-web-scraping
Thanks !
Welcome!
Where can I find the results?
You can save the output using the -o switch. For example: scrapy crawl laptop -o yourfile.csv
Could you share how to use the item pipeline?
This is a good idea for the next video. Thanks
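Until that video exists, a small sketch of the general idea (the class and field names are made up for illustration):
# pipelines.py - a tiny pipeline that cleans one field before export
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        if item.get("title"):
            item["title"] = item["title"].strip()
        return item

# settings.py - enable it; lower numbers run earlier
ITEM_PIPELINES = {
    "yourprojectname.pipelines.StripWhitespacePipeline": 300,
}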
Sir, I'm not able to get to the next page when I run this code. I don't know what the problem is here.
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['books.toscrape.com/']

    def parse(self, response):
        books = response.css('ol.row li')
        for url in books:
            url = url.css('div.image_container a::attr(href)').get()
            url = response.urljoin(url)
            yield scrapy.Request(url, callback=self.parse_books)

    def parse_books(self, response):
        yield {
            'title': response.css('div>h1::text').get().strip(),
            'catagories': response.css('ul.breadcrumb>:nth-child(3)>a::text').get().strip()
        }
        next_page = response.css('.next > a::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_books)
Note: I used the same script on other sites and it works fine.
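A likely cause, just a guess from reading the code: the .next link only exists on the listing pages, but the pagination code sits in parse_books (a book detail page, which has no .next link), and its callback points back at parse_books instead of parse. A minimal sketch of a fix, assuming books.toscrape.com:
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    # start_urls need a full URL including the scheme
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('ol.row li'):
            url = response.urljoin(book.css('div.image_container a::attr(href)').get())
            yield scrapy.Request(url, callback=self.parse_books)
        # The .next link is only present on listing pages, so follow it here
        # and keep the callback pointing at parse, not parse_books.
        next_page = response.css('.next > a::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_books(self, response):
        yield {
            'title': response.css('div>h1::text').get().strip(),
            'catagories': response.css('ul.breadcrumb>:nth-child(3)>a::text').get().strip(),
        }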