The Biggest Issues I've Faced Web Scraping (and how to fix them)

  • Published on 23 Nov 2024

Comments • 104

  • @PaoloAnzani_1
    @PaoloAnzani_1 6 months ago +53

    In my opinion, having developed multiple web scraping applications, half of the time is spent not coding but trying to reverse engineer the web application. Simple ones are just a matter of looking at requests in dev tools and manually making API calls, while the most complicated ones involve backtracing how content is loaded on the page to find the JS code responsible for it. Basically it's 70% reverse engineering and 30% coding, if you do things the smart way.

    • @pranitmane
      @pranitmane 5 months ago

      Yep!

    • @mateusb09
      @mateusb09 3 months ago +4

      What's the benefit of manually doing API calls instead of just letting selenium click the buttons which will do the exact same thing?

    • @kaj1543
      @kaj1543 3 months ago

      @@mateusb09 Selenium has overhead

    • @Anthony-qg5hj
      @Anthony-qg5hj 3 months ago +2

      @@mateusb09 because it's faster, less code, lower cost, easier to maintain

    • @mateusb09
      @mateusb09 3 months ago

      @@Anthony-qg5hj I had a Selenium project in which I tried the approach you're talking about. Not only did I need to attach the login cookies (which expire) to the request anyway, I also had to construct the request skeleton manually.
      So in the end it took about as much effort as just forcing Selenium to click buttons.
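A minimal sketch of the "manual API call" approach discussed in this thread, assuming the site exposes a JSON endpoint that shows up in the browser's Network tab. The URL, query parameters, and cookie name below are hypothetical placeholders, not taken from any real site; the point is only that the "request skeleton" amounts to a URL, a query string, and a few headers copied from dev tools.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, standing in for whatever the Network tab shows.
API_URL = "https://example.com/api/items"


def build_api_request(url, params, cookies, extra_headers=None):
    """Recreate a request seen in dev tools: query string, cookies, headers."""
    query = urllib.parse.urlencode(params)
    full_url = f"{url}?{query}" if query else url
    headers = {
        "Accept": "application/json",
        # Login cookies copied from the browser; they expire and must be renewed.
        "Cookie": "; ".join(f"{k}={v}" for k, v in cookies.items()),
    }
    headers.update(extra_headers or {})
    return urllib.request.Request(full_url, headers=headers, method="GET")


def fetch_json(request, timeout=10):
    """Send the prepared request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request; call fetch_json(req) against a real endpoint.
    req = build_api_request(API_URL, {"page": 1}, {"sessionid": "PASTE_COOKIE_HERE"})
    print(req.full_url)
```

As the last reply points out, the cookies still have to come from somewhere, so this saves browser overhead per request rather than removing the login step.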

  • @yafethtb
    @yafethtb 8 months ago +17

    Yeah. Scraping a dynamic website really makes me want to scream like Linus Torvalds at NVIDIA. And I also hate Cloudflare 😂

    • @gamecast4432
      @gamecast4432 a month ago

      You can start a new browser or a new context for every "goto()" with a different user-agent; that's how I deal with Cloudflare
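One way the reply above could look with Playwright's sync API, as a rough sketch: each URL gets a fresh browser context with the next user-agent in a rotation. The user-agent strings are illustrative examples, and nothing here guarantees getting past Cloudflare; it only shows the "new context per goto()" pattern.

```python
import itertools

# Illustrative user-agent strings; swap in current, realistic values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def assign_user_agents(urls, agents=USER_AGENTS):
    """Pair each URL with the next user-agent, cycling through the list."""
    return list(zip(urls, itertools.cycle(agents)))


def fetch_pages(urls, agents=USER_AGENTS):
    """Load each URL in its own browser context with its own user-agent."""
    # Imported here so assign_user_agents works without Playwright installed.
    from playwright.sync_api import sync_playwright

    results = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for url, user_agent in assign_user_agents(urls, agents):
            # A new context means separate cookies and storage for this visit.
            context = browser.new_context(user_agent=user_agent)
            page = context.new_page()
            page.goto(url)
            results[url] = page.content()
            context.close()
        browser.close()
    return results
```

A new context is much cheaper than launching a new browser, which is why the sketch reuses one browser and only recreates the context.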

  • @delsix1222
    @delsix1222 8 months ago +30

    interesting timing to see this video, literally the day after I completed my first full-stack application which literally revolves around web-scraping :D

    • @flipygmd
      @flipygmd 8 months ago +1

      You're the next Mark Zuckerberg

    • @Noumaan_Ahamed
      @Noumaan_Ahamed 8 months ago

      How do you web scrape a secure website?

    • @IshaqKhan010
      @IshaqKhan010 5 months ago

      share website url

    • @delsix1222
      @delsix1222 5 months ago

      @@IshaqKhan010 can't share URLs in YT comments, they get autofiltered

    • @pablom8854
      @pablom8854 3 months ago

      And I'm starting a web scraping project

  • @Dalamain
    @Dalamain 8 months ago +24

    I used to web scrape all the time, but stupid JS frameworks' obfuscated CSS class names have made it very difficult.

    • @gamecast4432
      @gamecast4432 a month ago

      I use the [data-something="foo"] selector; luckily most of the sites I need to scrape make use of this attribute
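A small sketch of the data-attribute approach from the reply above, using only the standard library's html.parser. The sample markup, the attribute name data-testid, and the class names are made up for illustration; the idea is that the data-* attribute stays stable while the generated class names change between builds.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DataAttrExtractor(HTMLParser):
    """Collect the text of elements whose attribute `name` equals `value`."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get(self.name) == self.value:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_by_data_attr(html, name, value):
    parser = DataAttrExtractor(name, value)
    parser.feed(html)
    return parser.results


SAMPLE = """
<div class="sc-a1b2c3"><span data-testid="price" class="x9f">$19.99</span></div>
<div class="sc-d4e5f6"><span data-testid="price" class="k2q">$5.00</span></div>
"""

print(extract_by_data_attr(SAMPLE, "data-testid", "price"))  # ['$19.99', '$5.00']
```

With a library that supports CSS selectors, the same lookup is usually a one-line attribute selector of the form shown in the comment.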

  • @rikawrites7104
    @rikawrites7104 15 days ago

    i started learning about web scraping YESTERDAY, and stumbled upon your video today. GODDAMN the way you explain stuff and speak really stuck with me! thank you for providing such value and motivating me to improve my communication skills as well :D

  • @xlafxx
    @xlafxx 8 months ago +1

    I remember starting to watch your videos when I was entering my computer science BA, and now, as a 28-year-old with one semester left to graduate, you're still uploading good content that's unique. Never get tired of your vids, keep it up brother. I'm also concerned about the job market. Can you make a vid about new-grad CS students? For example, it seems almost every job wants front end or something, and my school never taught any of it

    • @mrrobot-mn6re
      @mrrobot-mn6re 8 months ago +1

      You want to get a job from what your school taught you? You are in for a ride, brother. Tech is about your own research and self-learning, every fucking day. I pity people who majored in CS because they heard about a programmer earning 6 figs

    • @Hshjshshjsj72727
      @Hshjshshjsj72727 6 months ago

      Unless u went to an Ivy League school and wanna be a quant, u gotta do front end: JS, React, SQL are key for the majority. School is dumb unless it's Ivy League, except for the piece of paper

  • @JefCollier
    @JefCollier 3 months ago +1

    I saw this video recommended to me about two days after I had to scrape a ton of images and convert them to a PDF. The images are loaded dynamically and I will confess with shame that my script would scroll slowly down the entire page until it couldn't get any further. Then it would queue up all the appropriate image files and compile them into a local directory before turning them into a single PDF file.

  • @v1d300
    @v1d300 8 months ago +7

    I am working on a project that relies heavily on scraping, so I've been doing a lot of research. And it's really hard to find anything good that is not sponsored by Bright Data. I get it, their marketing team has done a great job tapping a perfect niche of creators who provide valuable information, but it also means almost every good resource ends up being about using Bright Data, and that's not something I want to pay for when starting a hobby project.
    Anyway, this is a great video either way. I learned a lot of things I hadn't considered in my planning, like ETL (that's a new rabbit hole I need to dive into) or adaptive content extraction to account for layout changes. I was just assuming I would set up reporting to notify me when I start getting no content, and then fix it.
    So thank you for that.
    Do you set up Redis or something so that some requests are served from a cache of recently requested data rather than scraping again or accessing the DB? Is that necessary?
    And at what point should a webhook be set up, and for what purpose exactly?
    Thank you
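To make the caching question above concrete, here is a rough in-memory sketch of the idea: serve a recently fetched page from a cache and only re-scrape once the entry is older than a time-to-live. The TTLCache class and the fetch callable are hypothetical names for illustration; a shared store such as Redis would take the place of the dict when several workers need the same cache.

```python
import time


class TTLCache:
    """Keep fetched values for `ttl_seconds`, then fetch them again."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}  # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the scrape
        value = fetch(key)  # expired or missing: scrape again
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=300)
    # `fetch` would be the real scraping function; a stand-in is used here.
    page = cache.get_or_fetch("https://example.com/page", lambda url: f"<html>{url}</html>")
    print(len(page))
```

Whether this is necessary depends on how often the same URL is requested within the time-to-live; for a one-pass crawl it adds little.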

  • @EduardoEscarez
    @EduardoEscarez 8 months ago +2

    AFAIK the button highlighting is a feature based on video subtitles, including those generated automatically, but it's still somewhat random. I didn't catch it because I was already subscribed and had liked the video a moment before you said it.

    • @v1d300
      @v1d300 8 months ago

      I don't think it's a video subtitles feature. It just happens randomly in my experience. The thumbs-up button shakes and subscribe highlights. Didn't happen for me on this video though :(

  • @redbill5197
    @redbill5197 8 months ago +6

    Thank you for the amazing video! Much appreciated as a young web developer. By the way, none of the buttons lit up or did any animations... I am a subscriber, so I don't know if that's why.
    Peace!!!

    • @beaconxy
      @beaconxy 7 months ago

      It actually didn't.

  • @danielabraham3022
    @danielabraham3022 8 months ago +2

    To be honest, i subscribed because the button lit up. Also, I love your content.

  • @V4rrow
    @V4rrow 8 months ago +19

    dude is literally Gilfoyle from Silicon Valley (love your vids)

    • @theparten
      @theparten 8 months ago

      i wasn't looking for web scraping video but his face drew my attention, i was like wait this is Gilfoyle right😂❤...

    • @FFl1s
      @FFl1s 8 months ago

      Fr

  • @robinbreed2439
    @robinbreed2439 a month ago

    Great video and really nice energy, and I think you answered my question by using the scraping browser to render JavaScript headlessly. Thank you

  • @xdcountry
    @xdcountry 8 months ago +6

    This guy gets it. I've been there. I can't wait to make this all an easy-ass Python plugin

  • @Smallbusiness0007
    @Smallbusiness0007 8 months ago +5

    The JD bottle in the background 😉

    • @obiwanfisher537
      @obiwanfisher537 a month ago +1

      The cigars on the shelf ;)

  • @LM-ty8xg
    @LM-ty8xg a month ago

    Amazing content,
    Brother, please make a video explaining how to scrape dynamically loading Power BI tables on a website. There is simply no change in the HTML/CSS structure when you engage 😅

  • @doublesushi5990
    @doublesushi5990 8 months ago +2

    such a chill vid

  • @olhodetamarutaca
    @olhodetamarutaca 5 months ago +1

    I really like the way you explain things and also the pronunciation issues

  • @nrgstudios612
    @nrgstudios612 3 months ago

    The subscribe button didn't light up because I was already subscribed 👍

  • @phethindabamkhwanazi3546
    @phethindabamkhwanazi3546 8 months ago +1

    Hey man, do you have another channel where you teach live?

    • @phethindabamkhwanazi3546
      @phethindabamkhwanazi3546 8 months ago

      If you have one, please provide the link so I can start learning more.

  • @olasunkanmioyetunji9254
    @olasunkanmioyetunji9254 7 months ago +1

    Can you recommend a course to learn web scraping? One that teaches the tools and techniques you mentioned, and other concepts

    • @ravimahto3606
      @ravimahto3606 29 days ago

      I am searching for one too, beginner in web scraping

  • @brianmorin5547
    @brianmorin5547 7 months ago

    Is there a reason/advantage to using Bright Data's "scraping browser" product instead of integrating their proxy and IP rotation services into a script I'm running on my own server?

  • @tomasemilio
    @tomasemilio 8 months ago +3

    Boom. Thanks

  • @sakibullah3577
    @sakibullah3577 2 months ago

    Can anyone help me? I can't seem to bypass the Cloudflare loading page with the headless Bright Data web scraper

  • @manumartinezkcxu
    @manumartinezkcxu 5 months ago

    What are the best AI scraping apps? Any suggestions/recommendations? Just looking to see how our nonprofit organization is aligned with other organizations within a county of California, in order to partner with them

  • @javancheongyujing2531
    @javancheongyujing2531 8 months ago +1

    Does web scraping fall under data science or software engineering?

    • @dedswift
      @dedswift 2 months ago

      Depends on the purpose of the data you’re scraping and how it’s used, but it can be both.

  • @Cryogenics12
    @Cryogenics12 8 months ago +2

    Hi Forrest. I was wondering how you still feel about AI and the future of software engineering. With ChatGPT out for over a year now, have your views changed much? Maybe a good topic for another vid.

  • @johnknox4293
    @johnknox4293 8 months ago

    interesting....thanks man

  • @dmytro-skh
    @dmytro-skh 7 months ago

    This video is what I need. But whoa, the screens with code change so fast... I'm too old at 35 to push the pause button that quickly 😅 Do you have some links with those hacks?

  • @juan7114
    @juan7114 4 months ago

    I hate the 502 error, I don't know how to solve it

  • @VishalJangid1
    @VishalJangid1 8 months ago +1

    hopefully brightdata ain't a snitch 🫠

  • @consolemodding1015
    @consolemodding1015 3 months ago

    The funny thing is when they block the IP ranges used by Bright Data xD

  • @realshiiiiiit8349
    @realshiiiiiit8349 8 months ago

    Damn this guy is cool

  • @ramelox
    @ramelox 8 months ago +97

    When I see a Bright Data sponsorship, I instantly stop watching. Paying Bright Data is not a web scraping skill.

    • @zeddscarlxrd4331
      @zeddscarlxrd4331 8 months ago +5

      Do u know how to bypass Cloudflare or captchas without Bright Data?

    • @ZacMagee
      @ZacMagee 8 months ago +7

      Some people 😂
      That's like saying.
      "Oh well, these stupid people who drive cars, why would they do that when we still have horses?"

    • @vasyavasin7364
      @vasyavasin7364 8 months ago +12

      @@ZacMagee why should I pay for it if I can do it for free? 😂

    • @vasyavasin7364
      @vasyavasin7364 8 months ago

      @@zeddscarlxrd4331 You can easily find out how to bypass Cloudflare.

    • @Ohiostategenerationx
      @Ohiostategenerationx 7 months ago +1

      @@vasyavasin7364 do you still not need to scrape a bunch of proxies to use?

  • @storymode9085
    @storymode9085 8 months ago

    wow... i got a long way to go

  • @oeerturk
    @oeerturk a month ago

    u said u prepared the video without the need for Bright Data, but for every issue except data storage u propose using Bright Data for the most important & challenging parts...? :/

  • @carsonjamesiv2512
    @carsonjamesiv2512 8 months ago

    GOOD VIDEO🎉👍

  • @JoaquimDornelles95
    @JoaquimDornelles95 8 months ago

    My fucking hero

    • @einekleineente1
      @einekleineente1 8 months ago

      are there vids of that ???

  • @paulshorey7528
    @paulshorey7528 3 months ago

    I like your mustache

  • @botobeni
    @botobeni 7 months ago

    12:30 nuh uh 🗿🗿

  • @OnlyUseMeEquip
    @OnlyUseMeEquip 5 months ago +1

    If you are using Selenium, Puppeteer, or any other browser automation, you will never be a good web scraper; they are just too damn slow. If you are relying on them to get you past the WAF JavaScript function and generate your cookies before you go scrape, others will beat you to the punch with pure code

    • @consolemodding1015
      @consolemodding1015 3 months ago

      Define slow?

    • @OnlyUseMeEquip
      @OnlyUseMeEquip 3 months ago

      @@consolemodding1015 If you have to log in repeatedly and solve captchas, that delay is almost negated. Pure-code bots just generate new valid cookies: once you hit your 403 Forbidden or 401 captcha, new tokens are loaded and you carry on. Not to mention threads instead of instances. Reversing the WAF JS function is the key. A good pure-code bot is likely to be around 100x more efficient than a good browser bot

    • @mianashhad9802
      @mianashhad9802 3 months ago

      How can you scrape dynamic content without these tools? Anything else besides trying to find the API endpoint?
      I am a beginner who knows how to scrape simple pages. I want to learn how to scrape dynamic content. Would love to know your thoughts.

    • @heritage1834
      @heritage1834 2 months ago +1

      @@mianashhad9802 A method that works is to clone the API calls that get the data from the backend server. You can find them in the Network tab (Fetch filter) of your browser's developer tools

    • @gdolphy
      @gdolphy a month ago

      @mianashhad9802: if the attribute data changes, target the tag. If the tag changes, target the Ajax calls.
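The "pure code" flow described in this thread (hit a 403 or 401, load new tokens, carry on, and use threads instead of browser instances) could be sketched roughly as below. Both callables are hypothetical placeholders: send(url, token) stands for a cloned API call returning (status, body), and refresh_token() stands for whatever produces a fresh cookie or token, which is the hard, site-specific part this sketch does not attempt.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_with_refresh(url, send, refresh_token, max_refreshes=2):
    """Call send(url, token); on 401/403 get a new token and try again."""
    token = refresh_token()
    for attempt in range(max_refreshes + 1):
        status, body = send(url, token)
        if status not in (401, 403):
            return status, body
        if attempt < max_refreshes:
            token = refresh_token()  # blocked: load a new token and carry on
    return status, body  # still blocked after all refreshes


def scrape_all(urls, send, refresh_token, workers=8):
    """Fetch many URLs concurrently with threads rather than browser instances."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: fetch_with_refresh(u, send, refresh_token), urls)
        return dict(zip(urls, results))


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a network: the first token is rejected.
    tokens = iter(["stale", "fresh"])
    demo_send = lambda url, token: (200, "data") if token == "fresh" else (403, "")
    print(fetch_with_refresh("/api/items", demo_send, lambda: next(tokens)))
```

The retry cap matters: without it, a token generator that has stopped working would loop against the site indefinitely.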

  • @justcode_99
    @justcode_99 8 months ago

    Your mustache looks like a hedgehog 😂

  • @GEMSofGOD_com
    @GEMSofGOD_com a month ago

    Thank you Jesus

  • @YouStillNeedToSleep
    @YouStillNeedToSleep 7 months ago

    Examples. Are you a Leo? he he

  • @francishubertovasquez2139
    @francishubertovasquez2139 8 months ago

    Speaking of Females, if Hitler's fuhrer have Magog carrier of motorized machine monsters then the Northern Magog have ice snow predominant in their place near Arctic circle, and ice surface can better conduct gases and science elements and compounds interaction which can attract those science things from everywhere, who between them is stronger except for the Super Magog Dark Matter? Will they suffice at full force during the final battle end times?

  • @abe_is_live
    @abe_is_live 8 months ago

    stop web scraping