Web Scraping NEWS Articles with Python

John Watson Rooney

74,083 views


Published on 27 Jan 2025

Comments • 59

@jimmysonerian2504 4 years ago ⁺⁷
Very useful video John. Keep them coming. If you had made this video a day earlier it would have saved lots of my time. But for the future it's a good reference.
@parsairani110 1 year ago ⁺⁴
Awesome vid with easy to understand explanations! Thanks John. Would you ever consider adding a script that would then open each article and scrape the contents? That would be super useful to see!
@garrysingh8065 1 year ago
Hey did you create something like that?
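A rough sketch of that second step, assuming the article's HTML has already been downloaded (for example with the same session used to fetch the listing) and that the body text sits in ordinary <p> tags, which will not hold for every site. `ParagraphExtractor` and `extract_paragraphs` are made-up names, and this uses Python's built-in html.parser rather than requests_html:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()          # close any paragraph left open
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def extract_paragraphs(html):
    """Return a list with the text of every <p> in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()                # pick up a trailing unclosed <p>
    return parser.paragraphs


sample = "<h1>Title</h1><p>First <b>bold</b> bit.</p><div>skip</div><p>Second.</p>"
print(extract_paragraphs(sample))
```

Each site lays out its articles differently, so the "everything inside <p>" rule is only a starting point; you would probably need to narrow it to the article container for each site.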
@anirudhnuti9146 3 years ago ⁺²
This was really helpful. Thanks a lot!!
@chileendatos9523 4 years ago ⁺⁴
Thank you. But I have this error "Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead."
@JohnWatsonRooney 4 years ago ⁺⁴
Are you using a Jupyter notebook? It doesn’t work in those, unfortunately.
@jonvincentmedenilla1296 4 years ago ⁺²
@JohnWatsonRooney So how do we go about it in Jupyter then?
@jfqlkd 3 years ago
@jonvincentmedenilla1296
from requests_html import AsyncHTMLSession
url = 'https://news.google.com/topstories?hl=en-GB&gl=GB&ceid=GB:en'
asession = AsyncHTMLSession()
# Top-level await works in a Jupyter cell, which already runs an event loop
resp = await asession.get(url)
await resp.html.arender(sleep=1, scrolldown=5)
articles = resp.html.find('article')
r = resp.html.raw_html
That should be the start of the code now :D
@petkomarinov6897 2 years ago ⁺¹
I can't even start: "No module named 'requests_html'". Please help me.
@utkarshtyagi277 4 years ago ⁺¹
Hey, I got an error in render when I ran this:
AttributeError: 'coroutine' object has no attribute 'newPage'
RuntimeWarning: coroutine 'launch' was never awaited
@ma.t.t.9096 4 years ago ⁺⁸
Thank you for the great video. How can I scrape all the news from every page, not only page 1 of the web?
@jayjoshi64 4 years ago ⁺⁶
+1. Is it possible to make a generic script which can scrape news from any news website?
@wormalpaca3116 3 years ago
You would have to make a separate web scraper for every website, as every website structures its HTML in its own way.
@jfqlkd 3 years ago ⁺³
My list seems to stop at 100 articles? Is there a way to circumvent this?
@JohnWatsonRooney 3 years ago ⁺²
That seems to be some sort of Google soft cap, I’ll look into it
@vispinet 2 years ago
@JohnWatsonRooney I have encountered the same issue
@manasimalbari3293 2 years ago ⁺¹
I am getting this RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
@JohnWatsonRooney 2 years ago
Are you trying to use it in a Jupyter notebook? That’s the error I usually see when that’s the case
@harivignesh4443 2 years ago ⁺¹
@JohnWatsonRooney How do you overcome this in a Jupyter notebook?
@abhishekr5447 1 year ago
@harivignesh4443 Did you figure this out?
@sanadmasoud9898 2 years ago ⁺³
Thanks for the video. Can I use the same code to return only articles with a specific name in the header?
@JohnWatsonRooney 2 years ago
Yea, you would need to add your own logic to match names but it’s definitely doable
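The matching logic could look something like this. It is only a sketch: it assumes the scraped results are a list of dicts with "title" and "link" keys, and `filter_by_keyword` is a made-up helper name, not something from the video.

```python
def filter_by_keyword(articles, keyword):
    """Return the articles whose title contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [a for a in articles if needle in a["title"].casefold()]


# Stand-in data; in practice this would be the list built by the scraper
articles = [
    {"title": "Python 3 tips for scrapers", "link": "https://example.com/a"},
    {"title": "Local election results", "link": "https://example.com/b"},
    {"title": "Why PYTHON keeps growing", "link": "https://example.com/c"},
]

matches = filter_by_keyword(articles, "python")
for article in matches:
    print(article["title"], article["link"])
```

A plain substring check like this also matches inside longer words, so a regular expression with word boundaries may be a better fit if that matters.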
@kids-stories-k7w 1 year ago
Where is the content? If I want to open each article and scrape content such as the title, how do I do that?
@ismaelRR 4 years ago ⁺¹
Great video, John! Can you tell me what html render technically does in our program?
@augastinendeti4448 7 months ago
Great video sir. How can we modify this to save the results in a well-structured spreadsheet?
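One simple route is a CSV file, which any spreadsheet program can open. The sketch below assumes the results are a list of dicts with "title" and "link" keys; `write_articles` is a made-up name, and it writes to an in-memory buffer only so the example runs without touching the disk.

```python
import csv
import io


def write_articles(articles, fileobj):
    """Write the articles as CSV rows with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(articles)


# Stand-in data; in practice this would be the list built by the scraper
articles = [
    {"title": "First headline", "link": "https://example.com/a"},
    {"title": "Second, with a comma", "link": "https://example.com/b"},
]

buffer = io.StringIO()
write_articles(articles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, pass `open("articles.csv", "w", newline="", encoding="utf-8")` instead of the buffer, ideally inside a `with` block. The csv module quotes titles that contain commas, so the columns stay aligned.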
@raypedits 3 months ago
How can I make it get only the newest news and run forever?
Is it efficient to use a while loop?
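A while loop is fine as long as it sleeps between runs, since the script then sits idle most of the time. A minimal sketch, assuming each run yields a list of dicts with "title" and "link" keys; `new_articles`, `poll`, and the `fetch` callable are made-up names standing in for the scraping code:

```python
import time


def new_articles(articles, seen_links):
    """Return the articles not seen before and record their links."""
    fresh = []
    for article in articles:
        if article["link"] not in seen_links:
            seen_links.add(article["link"])
            fresh.append(article)
    return fresh


def poll(fetch, interval_seconds=600, max_rounds=None):
    """Call fetch() repeatedly and print only unseen articles.

    fetch is any function returning the current list of articles.
    max_rounds=None means loop forever.
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for article in new_articles(fetch(), seen):
            print(article["title"], article["link"])
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)  # wait before the next check
```

The `seen` set only lives as long as the process, so after a restart everything counts as new again. Running the script on a schedule (cron, Task Scheduler) and keeping the seen links in a file is a common alternative to a long-running loop.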
@amanmishra6951 4 years ago ⁺¹
Why does r.html.render take so much time for me... but it gets executed quickly for you?
@JohnWatsonRooney 4 years ago
Not sure! It loads up a browser behind the scenes, so it does use a little bit of RAM and processing power
@amanmishra6951 4 years ago
@JohnWatsonRooney OK... Also, it does not render the JS data. I'm scraping trip.com; on this site the hotel details are rendered dynamically using AJAX, but requests-html is unable to render this data
@scg565081 2 years ago ⁺⁴
John, thanks for making everything so easily accessible. Going through this step by step has (a) worn out my pause/play finger but (b) allowed me to understand how you’ve been building this up. Having followed through it, there are a couple of questions I have as a ‘first timer’. I’m using Python from within Anaconda and VS Code, but the ‘render’ isn’t turning blue and it’s telling me “newsarticle” is not defined… Any suggestions? I have to admit I’m on a MacBook Pro, but everything else seems to be fine. Thanks John.
@WalterWhite-kv5jt 10 months ago
How can I get the content of the news rather than the link?
@martinabozzi8702 2 years ago ⁺²
Great video! Would it be possible to scrape the whole content of the news? I am doing a project about fake news detection and I would need the whole content :)
@idmahadraps3119 1 year ago
I got the same project bro. So is scraping the whole article possible?
@gmog7857 4 years ago
How can this be used to do anything since it's not software?
@shubhamsaxena3220 2 years ago
I am getting the same number of articles whether I use scrolldown=0 or scrolldown=5.
Can anyone explain why?
@AshishBangwal 4 years ago ⁺²
Thanks sir, but I think for only the top headline we can just use bs4 and return the first h1 tag's text
@pietpanzerpanzer5335 3 years ago
Sorry, but if I try to do that I get a bunch of errors, the most important of which is "OSError: [WinError 14001] This application couldn't be started, as the side-by-side configuration is invalid." I guess. Did I forget to install something?
@avani112 2 years ago
I fixed this issue by uninstalling all of my Python installations via Control Panel. Then I re-downloaded Python directly from the Python website, NOT the Windows Store.
@17U55IF3R 2 years ago ⁺¹
What VS Code theme are you using?
@JohnWatsonRooney 2 years ago ⁺¹
One Dark Pro, I think
@17U55IF3R 2 years ago
@JohnWatsonRooney Yeah, I think it is, thank you
@SunDevilThor 3 years ago
I ended up getting duplicates in my list for some reason. Each story title and link is listed at least 5 times.
@codevacaphe3763 2 years ago
Try using a hash table or something similar to manage your data
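In Python the hash table is a dict (or a set). A small sketch of that idea, assuming the results are a list of dicts with "title" and "link" keys and that the link is what identifies a story; `dedupe_by_link` is a made-up name:

```python
def dedupe_by_link(articles):
    """Keep the first article seen for each link, preserving order."""
    unique = {}
    for article in articles:
        # setdefault only stores the article if the link is not there yet
        unique.setdefault(article["link"], article)
    return list(unique.values())


# Stand-in data with the same story repeated
scraped = [
    {"title": "Story one", "link": "https://example.com/1"},
    {"title": "Story one", "link": "https://example.com/1"},
    {"title": "Story two", "link": "https://example.com/2"},
    {"title": "Story one", "link": "https://example.com/1"},
]

cleaned = dedupe_by_link(scraped)
print(len(scraped), "->", len(cleaned))
```

Dicts keep insertion order in Python 3.7 and later, so the cleaned list stays in the order the stories first appeared.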
@Rotrix 2 years ago
I only want it to show 10 news items, how do I do that?
@masinde.charles 2 years ago ⁺¹
[:10]
@Rotrix 2 years ago
@masinde.charles Excuse me, could you please explain that in more detail?
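`[:10]` is a list slice: put it after the list of results and you get only the first ten items. A tiny illustration with placeholder data (the `articles` list here is made up; use whatever list the scraper produced):

```python
# Placeholder list standing in for the scraped results
articles = [f"headline {n}" for n in range(1, 26)]

first_ten = articles[:10]   # items at index 0 through 9
for headline in first_ten:
    print(headline)

# Slicing never raises if the list is shorter than ten items
short = ["only", "three", "items"][:10]
```

The same slice works on the result of `find('article')`, since that is a list too.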
@suryamahendran3769 11 months ago
Thanks sir, it's very damn useful
@YunchenChen-t7v 1 year ago
How can I get the date?
@santisaldivar 1 year ago
Look for the element that contains the date.
date = item.find('the element that contains the date', first=True).text
@food.lovestory9847 3 years ago ⁺²
How do I scrape the content of the news?
@alexlytle089 4 years ago ⁺²
Thank you so much for this video. Question: wouldn't this be a perfect situation to use Selenium?
@JohnWatsonRooney 4 years ago ⁺⁴
It would be easy to do with Selenium, yes, but I try to stay away from that as much as possible. I don’t always want the browser to pop up every time. I could run this script on a Linux server every day more easily like this than with Selenium, for example.
@alexlytle089 4 years ago
@JohnWatsonRooney Thanks for the insight
@edcoughlan5742 4 years ago ⁺¹
👊
@JohnWatsonRooney 4 years ago
👊
@aberema7949 2 years ago ⁺²
did anyone else notice how my man wrote 'kink' real quick
@julian-frederiksimon5532 4 years ago ⁺¹
Thank you for the video. When I try to run my code I get multiple error messages. One of them is this: "ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate"
Do you know what went wrong?
@ugurdev 3 years ago
I fixed it by going to the Python folder inside Applications and running the Install Certificates command. As far as I understand, this problem is unique to Mac and has to do with a certain library commonly used by other libraries.
@martinflavell3045 7 months ago
pmsl do any of your tutorials work lad.
