“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent

  • Published on 9 Jun 2024
  • Build a universal web scraper for ecommerce sites in 5 min.
    Try CleanMyMac X with a 7-day free trial: bit.ly/AIJasonCleanMyMacX. Use my code AIJASON for 20% off.
    🔗 Links
    - Follow me on twitter: / jasonzhou1993
    - Join my AI email list: www.ai-jason.com/
    - My discord: / discord
    - Universal Scraping Agent: forms.gle/8xaWBBfR9EL5w8jr6
    - Firecrawl: www.firecrawl.dev/
    - AgentQL: docs.agentql.com/
    - Browserbase: www.browserbase.com/
    ⏱️ Timestamps
    0:00 Intro
    3:00 Challenges with web scraping
    6:05 How LLMs enable a universal web scraper
    10:51 Potential solutions
    18:36 Solution 1: API based web agent - Researcher
    25:81 Solution 2: Browser based agent - Universal ecommerce scraper
    👋🏻 About Me
    My name is Jason Zhou, a product designer who shares interesting AI experiments & products. Email me if you need help building AI apps! ask@ai-jason.com
    #agents #webscraping #scrapers #webagent #gpt5 #autogen #gpt4 #autogpt #ai #artificialintelligence #tutorial #stepbystep #openai #llm #chatgpt #largelanguagemodels #largelanguagemodel #bestaiagent #chatgpt #agentgpt #agent #babyagi
  • Science & Technology

Comments • 97

  • @AIJasonZ 24 days ago +10

    If you are interested in the universal web scraper I'm building, please leave your email on this waiting list: forms.gle/8xaWBBfR9EL5w8jr6

    • @teegees 22 days ago

      Can you pass credentials along with the scraper in a secure manner? For example I want to scrape NYTimes but with my NYTimes account.

    • @24-7gpts 22 days ago

      @teegees I don't think that's probable because of privacy and security

    • @thrashassault1 8 days ago +2

      I need to bypass Cloudflare etc.

  • @Joe-bp5mo 24 days ago +29

    It's interesting how much performance gain you got from clean markdown data like Firecrawl's. Sometimes you don't need much stronger reasoning, you just need to give the agent better tools.

    • @nigeldogg 22 days ago +3

      All you need are tools

    • @djpete2009 20 days ago +1

      @nigeldogg Love it!

  • @agenticmark 23 days ago +23

    I am already doing this. It's the same way I trained models to play video games: take a screenshot and convert it to greyscale, but instead of feeding that into a CNN, I pipe it into an agent that I built, which has mouse and keyboard tools instead of the typical Selenium/headless tools. It works pretty damn well, although some models will refuse CAPTCHAs outright.

    • @Hshjshshjsj72727 20 days ago

      CAPTCHAs? Maybe one of those “uncensored” LLMs

    • @Plash14 20 days ago

      How does your mouse know where to click?
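
The greyscale step mentioned in this thread can be sketched as follows. This is a minimal illustration, not the commenter's code: it assumes the screenshot is already available as rows of (r, g, b) tuples and uses the common ITU-R BT.601 luminance weights. In practice an imaging library such as Pillow or OpenCV would do this conversion; `to_greyscale` is a hypothetical helper name.

```python
def to_greyscale(pixels):
    """Convert rows of (r, g, b) tuples into rows of 0-255 luminance values.

    Uses the ITU-R BT.601 weights (0.299, 0.587, 0.114).
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


# A tiny 2x2 "screenshot": white, black, pure red, pure blue.
screenshot = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
grey = to_greyscale(screenshot)
print(grey)
```

A single-channel image like this is smaller to send to a model than the full-colour original, which is one plausible reason to convert before handing the frame to an agent.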

  • @elon-randgul days ago

    I have recently been thinking about this idea too. Many thanks for sharing your result!!

  • @jasonfinance 24 days ago +13

    Gonna try out the 2 examples soon, and please please launch the universal web scraping agent, i will pay you for that in a heartbeat!

    • @ianmoore5502 18 days ago +2

      Pls Jason AI
      - Jason Finance

  • @nestpasunepipe1173 10 days ago

    Dear Jason, I am really an amateur with coding, so I don't have a clue about so many of the topics that I try to execute. I have come across some of your interesting videos while trying to achieve things, but failed miserably on most of them. But today I just came for the thumbnail and am rolling up my sleeves to implement this masterpiece. Thank you so much & peace from 🇹🇷

  • @googleyoutubechannel8554 24 days ago +74

    You talked about 'universal scrapers', then you used a bunch of expensive services to create a very vanilla, hyper-specific scraper that doesn't require LLMs at all.... hmm....

    • @user-il1hu5xp2x 22 days ago +7

      It's just stupid. It's all about using these services and posting the affiliate links rather than finding truly budget-friendly alternatives. I can build the same thing with the public API of an LLM service. It will take hours, but after that I will never need to waste my time again. You can even have the LLM find the names of the classes and IDs you want to scrape, then have the LLM create the code and run it automatically.

    • @colecrouch4389 22 days ago +2

      Yeah, I believe this commenter, and I just unsubbed. What’s with the web scraping grift lately?

    • @kilianlindberg 21 days ago +2

      9:43 lol

    • @rowkiing 19 days ago +4

      I made some videos on my LinkedIn, building in public, about something similar: a web scraper that summarizes a website and writes an outreach message based on that. Everything is free as a Chrome extension; you just need a good computer to run Llama locally.

    • @krisvq 19 days ago +1

      Everything he did can be done for free with the Python libraries he showed. He also explained the issue with scraping very thoroughly and accurately, and then demonstrated the solution quite clearly. And then he explained the use of LLM agents in this context. I really don't understand what you think you just watched.
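
The idea raised in this thread, letting an LLM name the classes to scrape and then running plain code against them, can be sketched as below. This is only an illustration under stated assumptions: the `selectors` mapping is hypothetical and stands in for whatever an LLM would propose after reading one sample page, and `ClassTextExtractor` and `extract` are made-up names. Only the standard-library `html.parser` module is used, so no paid service is involved.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the first text found inside an element carrying each wanted class."""

    def __init__(self, field_to_class):
        super().__init__()
        # Invert the mapping so a class name looks up its output field.
        self.class_to_field = {cls: field for field, cls in field_to_class.items()}
        self.current_field = None
        self.result = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            field = self.class_to_field.get(cls)
            if field and field not in self.result:
                self.current_field = field

    def handle_data(self, data):
        # Simplification: the next non-empty text after a match is taken as the value.
        if self.current_field and data.strip():
            self.result[self.current_field] = data.strip()
            self.current_field = None


def extract(html, field_to_class):
    parser = ClassTextExtractor(field_to_class)
    parser.feed(html)
    return parser.result


# Hypothetical selectors, as an LLM might propose them once per site.
selectors = {"title": "product-title", "price": "price"}
page = (
    '<div><h1 class="product-title">Blue Mug</h1>'
    '<span class="price sale">$12.50</span></div>'
)
record = extract(page, selectors)
print(record)
```

The LLM would only be needed when the site layout changes and the selectors stop matching; every other page is handled by the cheap parser.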

  • @Jim-ey3ry 24 days ago +9

    Holy shit, that universal ecommerce scraping agent in the end is sick, thanks for sharing that framework!!

  • @eduardoribeiro3313 24 days ago

    Great work!! I'm currently tackling web scraping challenges, especially with certain sites where determining the delivery location or dealing with pop-ups obstructing the content poses issues. This often requires user action before the search query can proceed. What do you believe are the most effective methods or tools to overcome these hurdles? Sometimes even AgentQL struggles to resolve these issues.

  • @paulevans3060 24 days ago +3

    Can it be used for scraping estate agents to find a house to buy?

  • @tkp2843 24 days ago +1

    Fire video🔥🔥🔥

  • @amandamate9117 24 days ago +5

    perplexity should use this crawler since their models are hallucinating reference URLs LOL

  • @maloukemallouke9735 21 days ago

    Thank you for sharing

  • @javiermarti_author 24 days ago

    Great work

  • @AhmedMekallach 24 days ago +1

    Is the bounding box method open-source?
    Looking for a function that returns an X,Y coordinate of an element.
    def find_coordinates(instruction, screenshot):
        return (x_coordinate, y_coordinate)
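
One possible shape for such a function, as a hedged sketch only: it assumes element bounding boxes have already been obtained (for example from the DOM or from a detection model, neither of which is shown), so it takes `boxes` rather than a raw screenshot. The `Box` type and the label-matching rule are hypothetical stand-ins, not the method used in the video.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A labelled bounding box in screen pixels (hypothetical structure)."""
    label: str
    left: int
    top: int
    width: int
    height: int


def find_coordinates(instruction, boxes):
    """Return the centre (x, y) of the first box whose label appears in the instruction.

    Substring matching stands in for whatever model would pick the element.
    Returns None when nothing matches.
    """
    wanted = instruction.lower()
    for box in boxes:
        if box.label.lower() in wanted:
            return (box.left + box.width // 2, box.top + box.height // 2)
    return None


boxes = [
    Box("Search", 100, 20, 200, 40),
    Box("Add to cart", 640, 480, 120, 36),
]
point = find_coordinates("click the add to cart button", boxes)
print(point)
```

Clicking the centre of the box rather than its corner is a common choice because it is least likely to miss the element if the box is slightly off.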

  • @MechanicumMinds 13 days ago +1

    I never knew web scraping was so hard. I mean, I've been trying to scrape together a decent Instagram following for years, but I guess that's not what they mean by web scraping.
    Anyway, who knew websites were like the cool kids at school, only loading their content when you scroll into their 'cool zone' and making you jump through hoops to get to the good stuff

  • @dannyquiroz5777 24 days ago +1

    I'm here for the thumbnail

  • @productresearchgeek 20 days ago

    What's the event about scraping that you quoted in your video? Please cite the link.

  • @AllenGodswill-im3op 24 days ago +2

    With all these expensive tools, I think it will be best to build with Playwright.
    It will take weeks or months, but it will be cost-effective.

    • @helix8847 24 days ago +1

      The issue with using just Playwright is that it will be detected as a bot.

    • @AllenGodswill-im3op 22 days ago

      @helix8847 Do you know any better alternative?

  • @damionmurray8244 24 days ago +3

    We are in a world where data is the most sought after commodity. And AI is going to make accessing information trivial. I wonder how Big Business will respond. I suspect they'll start pushing for laws to criminalize web scraping in the not too distant future. It will be interesting to see how this all plays out in the years to come.

    • @krisvq 19 days ago +1

      They would never win with that kind of law. If you show data publicly it's there for the picking. If A.I. can have vision and mimic a human user, it's game over for hiding data.

  • @bobharris5093 3 days ago

    I can never understand why you need an API for the search. Is there any tool that can just type in the Google search bar at all??

  • @AryaArsh 24 days ago +27

    _Advertisements ✅️ Knowledge ❌️_

    • @nonstopper 24 days ago +2

      Average AI Jason video

    • @rajchinagundi7498 24 days ago +4

      @nonstopper True, this guy has stopped creating valuable content

    • @helix8847 24 days ago

      Sadly it does feel like that now. Nearly everything he shows now costs money, while there are free alternatives to most of what he shows.

    • @SamuelJunghenn 23 days ago

      And all the trolls come out.. They have never created a piece of value in their lives for anyone else for free, but they rag on content producers who dedicate a lot of time to bringing value to others. Thumbs up guys, keep your valueless contributions coming, you’re really heroes here.

    • @djpete2009 20 days ago

  • @danielcave9606 20 days ago +1

    The cost per request for this must be through the roof!

    • @krisvq 19 days ago

      Not if you run Llama on Ollama on your own server or local machine, which is doable. Hopefully this cost soon goes further down for the services we can't host ourselves.

    • @danielcave9606 18 days ago +1

      I mean in comparison to other, more specialised ML models currently used in industry, where hundreds of millions to billions of requests are being made and cost per request really matters.
      What LLMs like this CAN give you is speed to data from any site, which is great for a subset of projects, while eliminating the need to write selectors and extraction code, but at the expense of a high cost per request.
      But again, we have ML that can deliver that at scale at a fraction of the cost and at much higher accuracy.
      In a world where simply adding a headless browser to access HTML can 30x the cost per request and kill a project, adding an LLM is simply a no-go.
      I’m excited to see the future of LLMs in scraping, but it’s VERY early days, and I haven’t seen a use case where LLMs used for extracting and structuring the data are significantly faster, cheaper, or better than the existing tech.
      Where I have seen LLMs provide practical utility is in the post-extraction process, where they can be used effectively to extract data from unstructured text such as item descriptions.
      I’m excited for the future of LLMs, when they become practical and the benefits outweigh the cost in real-world applications, but for now I view them as interesting research projects pushing things forward, and as fun tools for smaller personal projects where budgets are not an issue.
      I love these kinds of discussions. Last year I attended and spoke at Extract Summit in Ireland, and I hope to be going again this year to hear more about the latest AI use cases.
      To wrap up, I think the best use of LLMs I’ve seen is to generate XPaths and to use those inside cheap-to-run spiders/crawlers. And I’m looking forward to seeing what people come up with next.
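
The pattern described at the end of this reply, having an LLM generate XPaths once and then running them in a cheap crawler, might look roughly like the sketch below. The `XPATHS` mapping is hypothetical and is simply assumed to be the LLM's one-off output; no LLM call is made here. The standard-library `xml.etree.ElementTree` only supports a limited XPath subset and only parses well-formed markup, so a real spider would more likely use a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPaths, assumed to have been proposed once by an LLM
# after it inspected a single sample page from the site.
XPATHS = {
    "name": ".//h1[@class='name']",
    "price": ".//span[@class='price']",
}


def scrape(page_source, xpaths):
    """Apply the stored XPaths to one page; no LLM call happens per request."""
    root = ET.fromstring(page_source)
    return {
        field: (root.findtext(path) or "").strip()
        for field, path in xpaths.items()
    }


pages = [
    "<html><body><h1 class='name'>Blue Mug</h1>"
    "<span class='price'>12.50</span></body></html>",
    "<html><body><h1 class='name'>Red Mug</h1>"
    "<span class='price'>9.00</span></body></html>",
]
rows = [scrape(page, XPATHS) for page in pages]
print(rows)
```

The cost of the LLM is paid once per site layout rather than once per request, which is the trade-off the reply argues for.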

  • @bernardthongvanh5613 24 days ago +1

    In movies they do all they can so the AI cannot access the internet. In real life: we need web scraping, man, give it access!

  • @justafreak15able 18 days ago

    Building this is far more costly than creating a website-specific scraper and maintaining it.

  • @dipkumardhawa3513 24 days ago

    Hi, I am a student and I want to build the same kind of thing for LinkedIn. Is that possible?
    Thank you so much for sharing this knowledge❤

  • @CordeleMinceyIII 21 days ago

    How does it handle s?

  • @smokedoutmotions_ 15 days ago

    Cool video

  • @gRosh08 4 days ago

    Cool.

  • @syberkitten1 19 days ago +2

    I don't believe it's possible to create a universal scraping solution that would be efficient in many edge cases. A custom solution would likely be faster and cheaper, especially if you need to scale.
    I've evaluated a lot of scraping SaaS services and used everything from Selenium to headless browsers. There are so many protection mechanisms, including headers, API checks, cookies, etc., and I'm sure I haven't seen a fraction of them. Some sites even require the browser to load JS and render changes on screen.
    With AI, we can get closer to an ideal solution. For example, you could take a screenshot if necessary (if the data is graphic and not part of the HTML source) and at the same time scrape the HTML. Then, pass them together to an LLM with your question. The structured data should then answer what you need it to become.
    However, you need to run the LLM yourself. Any solution using an LLM should allow users to provide an extraction schema, which needs to be very flexible as a prompt. This could be a nice service for hobbyists, but for scale, it would be too expensive. A custom implementation would probably serve better.

    • @AIJasonZ 7 days ago

      I agree it is not easy to build a universal one that works for every website. One path I'm exploring now is to build a good scraper for each specific website category, e.g. one scraper for all ecommerce, one scraper for all company websites, one scraper for all blogs, etc. Then you have something that routes to the right scraper.

    • @HarpaAI 3 days ago

      @AIJasonZ Agree with the assessment. In our tests, GPT-4 is still a bottleneck: no matter how good the tools and how clean the data you give it, for a universal scraping / web automation task it often fails to provide a correct next best action to take, goes into loops, performs redundant actions, does not abort / complete execution, etc. If you build your agent around a specific workflow where you predefine the sequence of steps to take, that's a different story. But that approach is far from universal.
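
The routing idea from this thread (one scraper per website category, plus something that picks the right one) can be sketched as below. Everything here is an assumption for illustration: the three scraper functions are empty placeholders, and `classify` uses simple keyword checks where a real system would presumably use an LLM or a trained classifier.

```python
def scrape_ecommerce(url):
    """Placeholder for a scraper specialised in ecommerce pages."""
    return {"scraper": "ecommerce", "url": url}


def scrape_blog(url):
    """Placeholder for a scraper specialised in blogs."""
    return {"scraper": "blog", "url": url}


def scrape_company(url):
    """Placeholder for a scraper specialised in company websites."""
    return {"scraper": "company", "url": url}


SCRAPERS = {
    "ecommerce": scrape_ecommerce,
    "blog": scrape_blog,
    "company": scrape_company,
}


def classify(page_text):
    """Keyword stand-in for the classification step (hypothetical rules)."""
    text = page_text.lower()
    if "add to cart" in text or "checkout" in text:
        return "ecommerce"
    if "posted on" in text or "leave a comment" in text:
        return "blog"
    return "company"


def route(url, page_text):
    """Pick the category-specific scraper for a page and run it."""
    return SCRAPERS[classify(page_text)](url)


result = route("https://example.com/p/1", "Blue Mug. Add to cart")
print(result)
```

Each specialised scraper can then follow a predefined sequence of steps for its category, which is the more constrained workflow the last reply says tends to be more reliable than a fully open-ended agent.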

  • @gamewithmichael 6 days ago

    Hi, are you planning to create a video about making music with some AI model?

  • @sanchaythalnerkar9736 24 days ago +1

    Would it be possible for me to contribute and collaborate on this project? I’m also working on developing a universal scraper myself.

  • @techfren 24 days ago +3

    first lesgoo 🔥

  • @onlineinformation5320 24 days ago

    hey can u make a video on Multion

  • @eugenetaranov4549 days ago

    Curl is a protocol 😂

  • @ShadowD2C 23 days ago

    Hi, I'm building a PDF QA chatbot that answers from 10 long PDFs. I've experimented with RAG, but the chunks I get from the vector DB often don't provide the correct context. What can I do to get reliable answers based on my PDFs? Will passing the entirety of the PDFs to an LLM with a large max tokens help? It doesn't seem efficient to pass the entirety of the PDFs with every question asked.... I'm lost, please help.

    • @matiascoco1999 23 days ago

      Try using Claude models. They have huge context windows and some models are pretty cheap

    • @productresearchgeek 21 days ago

      1) Try different chunk sizes. 2) Add adjacent chunks to what the vector DB returns. 3) Include section titles in the chunks.
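
Tips 2 and 3 from this reply can be sketched in a few lines. This is a minimal illustration with made-up helper names (`chunk_sections`, `with_neighbours`): it assumes the PDFs have already been split into (title, text) sections and that the vector DB returns the index of the best-matching chunk, and neither of those steps is shown.

```python
def chunk_sections(sections, size):
    """Split each (title, text) section into word chunks prefixed with its title."""
    chunks = []
    for title, text in sections:
        words = text.split()
        for start in range(0, len(words), size):
            body = " ".join(words[start:start + size])
            chunks.append(f"{title}: {body}")
    return chunks


def with_neighbours(chunks, hit_index, radius=1):
    """Return the retrieved chunk together with its adjacent chunks."""
    first = max(0, hit_index - radius)
    last = min(len(chunks), hit_index + radius + 1)
    return chunks[first:last]


sections = [("Refunds", "a b c d e f"), ("Shipping", "g h i")]
chunks = chunk_sections(sections, size=3)

# Pretend the vector DB said chunk 1 was the best match.
context = with_neighbours(chunks, hit_index=1)
print(context)
```

Changing `size` is the first tip in the reply; the title prefix keeps each chunk tied to its section, and the neighbour expansion gives the model the text just before and after the match.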

  • @yashsrivastava677 24 days ago +2

    I wonder if this is an Advertisement video or a knowledge sharing video..Nothing is open source.

  • @brianWreaves 24 days ago

    🏆

  • @fathin7480 19 days ago

    Did anyone manage to write the full script? or has access to it?

  • @hernandosierra8759 24 days ago

    Excellent. Thank you.

  • @kilianlindberg 21 days ago

    10:42 i follow tutorial, build scraper with cleanmymac, nothing happen, install twice, Ubuntu 22.04 only get many index.html

  • @chauhanpiyush 24 days ago

    You didn't put the signup link for your universal scraper agent.

    • @AIJasonZ 24 days ago

      thanks for the notes! here is the link: forms.gle/8xaWBBfR9EL5w8jr6

  • @rishabnandi9593 24 days ago +2

    This looks sus selenium could do this why do all this work if gpt 4o is generating selenium scripts faster than an Asian thinking

  • @user-ti7fg7gh7t 23 days ago

    You didn't name the title of the speech or the names of the authors or team. Got to give credit where it's due... Can we get a link to the videos you're using? The source? I would like to see the whole thing.

  • @krisvq 19 days ago

    Good walkthrough. Now we need better hardware to run better models so we can stop paying for lobotomized AI

  • @mble 15 days ago

    Great work, yet I am not willing to use anything that is proprietary

  • @garic4 23 days ago

    Any TLDR here for this nightmare long blob video?

  • @TheSurfingSushiChef 3 days ago

    Does this make money ? Or a waste of time?

  • @yunyang6267 24 days ago

    why are you building a startup every week

  • @Passive_j 23 days ago

    Who wants to be a millionaire? Scrape LinkedIn with AI and become a ZoomInfo competitor. You're welcome.

  • @uwegenosdude 23 days ago

    Hi Jason, thanks for your interesting video. Would it be possible to place your microphone so that we can see your lips when you are talking? For me it's easier to understand English if I can see them. Your huge mic covers so much of your face. Thanks.

  • @Septumsempra8818 24 days ago +1

    My whole startup is based on scraping. I hope this doesn't catch up...

    • @AIJasonZ 7 days ago

      hah what does your startup do?

  • @JD-xm3pe 24 days ago +2

    Your content is fantastic and your English is top-notch, but your accent adds some overhead to understanding. I hope that doesn't feel insulting; your vocabulary and grammar are better than most native English speakers'. So an idea... Could you look at using GPT-4o to improve elocution (not just in English) in a foreign language? It would be quite useful for many people.

  • @vitaly1219 18 days ago

    It’s like pirate game but not to buy it

  • @ashishtater3363 24 days ago +1

    Total nonsense

    • @Phanboy 24 days ago

      Noob

  • @tinato67 18 days ago

    unsubscribed

  • @thichxemphim1981 10 days ago

    Ugly banner image

  • @nullvoid12 4 days ago

    What a waste of time!