GPT4V + Puppeteer = AI agent browse web like human? 🤖

แชร์
ฝัง
  • เผยแพร่เมื่อ 7 ม.ค. 2025

ความคิดเห็น •

  • @AIJasonZ
    @AIJasonZ  ปีที่แล้ว +15

    What AI agent use case do you think it will unlock with browser & computer access? Comment and let me know!

    • @metinEsturb
      @metinEsturb ปีที่แล้ว +9

      It can now watch youtube shorts for me, so that I can focus my lifetime on watching productive videos like yours 😉

    • @minutecodingtips
      @minutecodingtips ปีที่แล้ว +3

      Can you provide github repo for it sir?

    • @ahsin.shabbir
      @ahsin.shabbir ปีที่แล้ว +1

      Find me all relevant documents in my inbox and my desktop so that I can prepare my tax filing for 2024.

    • @rab0309
      @rab0309 ปีที่แล้ว +2

      would be interesting to try it out with models that are open source and not something where you'd need for example an open ai api key

    • @martinchudomel7117
      @martinchudomel7117 ปีที่แล้ว +2

      maybe automated testing

  • @TheHeatzz33
    @TheHeatzz33 ปีที่แล้ว +35

    reCAPTCHA: Are you a human?
    Web AI Agent: *click*

  • @unconv
    @unconv ปีที่แล้ว +11

    Thanks for featuring my project!

  • @elevenchicago
    @elevenchicago ปีที่แล้ว +16

    You have been making some of the best videos on this stuff since day one. Always super practical yet very valuable content Jason. Keep it up and glad to see the HubSpot sponser! Esketit

  • @ElijahZuBailey
    @ElijahZuBailey ปีที่แล้ว +11

    🎯 Key Takeaways for quick navigation:
    00:03 🌐 *Web AI agents with direct computer access are a trending topic, allowing AI models like GPT-4V to control computers and browsers for various tasks.*
    01:28 🏢 *Robotic Process Automation (RPA) is one market category that self-operating computers can impact. RPA automates repetitive tasks in enterprises.*
    03:32 🖥️ *Self-operating AI agents can handle complex and non-standardized tasks, potentially reducing setup costs compared to traditional RPA solutions.*
    04:27 📊 *Understanding the end-to-end workflow of specific job functions is crucial for delivering useful AI worker solutions.*
    05:36 🤖 *Two common implementations for AI agents to control web browsers involve sending simplified HTML or using screenshots with annotations for better instruction.*
    08:09 🖥️ *Self-operating computer frameworks like the one discussed in the video use annotations on screenshots to instruct AI agents for interactions.*
    10:02 🌐 *The tutorial demonstrates how to create a web scraper using GPT-4V for extracting data from web pages, providing cleaner results than traditional scraping.*
    16:59 ⚙️ *A more advanced AI web agent is discussed, capable of interacting with websites, clicking links, and performing complex web tasks.*
    21:19 🌐 *The web agent script can handle URLs provided by the agent, allowing it to navigate to specific web pages.*
    22:29 🔄 *The web agent continuously clicks on links to navigate, find information, and scrape new websites.*
    23:11 📊 *The web agent can interact with websites to answer questions or complete tasks, making it a valuable tool for various purposes.*
    23:53 📷 *The web agent can perform complex tasks like conducting a Google search, visiting Instagram accounts, and extracting information, showcasing its potential for various applications.*
    Made with HARPA AI

  • @avi7278
    @avi7278 ปีที่แล้ว +8

    Bro, you are such a gangster. You're the only person I follow actually putting out valuable work at this level. All these AI grifters are on their six hundredth basic ass autogen demo, just zero value videos for clicks.

    • @stephantual
      @stephantual 11 หลายเดือนก่อน +1

      This exactly.

    • @itisprofile
      @itisprofile 11 หลายเดือนก่อน

      same for me! it is really hard to find valuable info about AI beside just basics

  • @redamarzouk
    @redamarzouk ปีที่แล้ว +2

    I've been very excited about the last video when I first saw the paper of GPT-4v and the automation applications.
    I do RPA full time and even though GPT-4v is not there yet in terms of what it can do on its own for form data entry and complex automation processes, I can see the use cases where it can outshine RPA when it's enhanced.

  • @oshodikolapo2159
    @oshodikolapo2159 ปีที่แล้ว +4

    This is a really smart and intutive process. Thanks Jason!

  • @jhoningsoft
    @jhoningsoft ปีที่แล้ว

    Thanks!

  • @BrianMosleyUK
    @BrianMosleyUK ปีที่แล้ว +11

    This is really valuable work Jason, beautifully conceived and executed. Keep going, this is huge. 🙏👍

  • @jasonfinance
    @jasonfinance ปีที่แล้ว +4

    Great job mate! Love to see all the agent possibilities

  • @paresh-ranaut
    @paresh-ranaut ปีที่แล้ว +8

    Great Work Jason!! would love to see you explore creating a Autogen like framework with Open source Multimodal agents, seems like it can open a lot of possibilities without the huge costs associated with using openAI's APIs for intensive tasks. Keep up the great work.!

    • @EricWatyekele
      @EricWatyekele ปีที่แล้ว +1

      Awesome work Jason, I was thinking the same @paresh3915 - Open source Multimodal agents instead of OpenAI

  • @amandamate9117
    @amandamate9117 ปีที่แล้ว +4

    AutoGen now has a AgentBuilder class. Its exactly doing how it sound, building a swarm of agents.

  • @EverythinTechnology
    @EverythinTechnology ปีที่แล้ว +6

    Great video! I would love to see if this can be done with an open source model like LLaVa.

  • @paulmartz5053
    @paulmartz5053 ปีที่แล้ว +2

    Congrats on your work, really useful content. Keep it up!

  • @Livanback
    @Livanback ปีที่แล้ว +1

    thanks jason for this valuable work. This toutrial are advance but very valuable spec in the market. You could easliy make big money from this method.

  • @akshaygoel8576
    @akshaygoel8576 10 หลายเดือนก่อน

    Amazing work! Thanks for putting in so much effort to bring quality content to the viewers.

  • @carterjames199
    @carterjames199 ปีที่แล้ว

    Great video again Jason love to see it

  • @ossamaji
    @ossamaji 2 หลายเดือนก่อน

    Fantastico ! Mind blowing ; Thx a lot

  • @binthem7997
    @binthem7997 ปีที่แล้ว

    This is an amazing tutorial of how to integrate GPT into daily productivity tools. Thanks man!

  • @RoadTo19
    @RoadTo19 ปีที่แล้ว

    Another great share, cheers!
    One request... please consider adding links in the description to the products mentioned in videos; perhaps as affiliate links for yourself. Makes for improved UX for viewers.

  • @Jim-ey3ry
    @Jim-ey3ry ปีที่แล้ว +4

    Nice, was wondering how to build a chrome plugin to control web like hyperwrite, this is handy 👍 Also the hubspot research is actually good

    • @carterjames199
      @carterjames199 ปีที่แล้ว

      Same here trying to figure out how to connect the local repo to a browser plugin

  • @JayeeeMaooo
    @JayeeeMaooo ปีที่แล้ว +1

    great explanation, like this kind of tutorial on AI projects!

  • @sanesanyo
    @sanesanyo ปีที่แล้ว

    Really awesome job. Keep such videos coming. 🎉🎉

  • @oakbots3035
    @oakbots3035 ปีที่แล้ว

    Great Jason... this is kind of innovation is mind blowing!

  • @JoeyGartin
    @JoeyGartin ปีที่แล้ว

    You are awesome. You should have your own instructional (paid membership) website for all of this.

  • @youwang9156
    @youwang9156 ปีที่แล้ว +2

    Jason, wondering if you have some kinda plans for diving into details of finetuning open source models like llama2?

  • @JewelEyedRanchAndRoofing
    @JewelEyedRanchAndRoofing ปีที่แล้ว +1

    Can we combine ai with something like ui.vision? Have the ai agent learn how to use ui.vision and then it would be able to control computer and browser

  • @Taskade
    @Taskade ปีที่แล้ว

    Absolutely beautiful! Really appreciate you sharing this insightful video. It's always great to see such valuable content. Thanks for broadening our horizons!

  • @deepneuralnetworks
    @deepneuralnetworks ปีที่แล้ว

    This is so sick.. thanks man

  • @leewsimpson1
    @leewsimpson1 ปีที่แล้ว +4

    Nice work- thank you!!
    Do we really need to run node? Can’t we just use python browser libraries ?

  • @muhammadhilal5807
    @muhammadhilal5807 หลายเดือนก่อน

    Great research

  • @DAWEAP1
    @DAWEAP1 ปีที่แล้ว

    Super interesting Jason!

  • @ahsin.shabbir
    @ahsin.shabbir ปีที่แล้ว +1

    What if you have an AI agent for a consumer product that handles event management tasks normally performed by the event host? For example if there is a school doing a field trip, the AI agent can answer many of the parent's queries, send updates about the trip such as weather forecasts that may effect the trip, collect certain forms, and more. Same can be applied for events like conferences, weddings, birthdays, housewarming, etc.

    • @fullcrum2089
      @fullcrum2089 ปีที่แล้ว +2

      i'm building this

    • @ahsin.shabbir
      @ahsin.shabbir ปีที่แล้ว

      @@fullcrum2089 how far along are you?

    • @amandamate9117
      @amandamate9117 ปีที่แล้ว

      build ngga build @@fullcrum2089

  • @williamrich3909
    @williamrich3909 ปีที่แล้ว

    Mind blowing! Fantastic.

  • @leewsimpson1
    @leewsimpson1 ปีที่แล้ว +1

    Hey Jason, how would you implement the ability to type into textboxes? They usually do not have text in them, so the AI cannot identify them. This feature would likely be needed to add the ability to do more stuff like order a pizza.

  • @JoeyZero
    @JoeyZero ปีที่แล้ว +1

    Just throwing this out there: If using OpenCV to drive your annotations, you can annotate coords directly on the image without too much difficulty (to give some guidance for when GPT4V is trying to recommend a location to click). Haven't done this with puppeteer, but I'm sure it can be done ;)
    Anybody wanna collab on some experiments in this space? 🙏 Lolol

  • @sanesanyo
    @sanesanyo ปีที่แล้ว +1

    Any idea about how to implement scrolling in the web agent?

  • @Nuninecko
    @Nuninecko 10 หลายเดือนก่อน

    why are not all open source LLMs working on these autonomous web agents? Are there any open source, PC based, replacements of Rabbit R1?

  • @giovannidamico6523
    @giovannidamico6523 6 หลายเดือนก่อน

    Great work Jason! Been looking for examples of scraping with screenshots for a while! I am just wondering what are the costs implications compared to traditional scrapers

  • @TraveleroftheSoul7674
    @TraveleroftheSoul7674 8 หลายเดือนก่อน

    how can we provide the assistant api key of gpt 4 which is not currently supporting the image input?

  • @parmesanzero7678
    @parmesanzero7678 ปีที่แล้ว

    24:28 Having an AI agent spend tokens to check the weather when the same search in Google results in the same reminds me of that robot that is told its whole purpose is to pass the butter.

  • @stojancejovanovv8108
    @stojancejovanovv8108 6 หลายเดือนก่อน +1

    i have error in openai library and gpt is saying that i need api key

    • @stojancejovanovv8108
      @stojancejovanovv8108 5 หลายเดือนก่อน +1

      the solution for this is you should change into gpt-4-turbo and it works for me

    • @stojancejovanovv8108
      @stojancejovanovv8108 3 หลายเดือนก่อน

      Actually gpt4v

  • @chrisder1814
    @chrisder1814 5 หลายเดือนก่อน

    Hello, can you explain to me what it is possible to do with the 4v API?

  • @ex3aliber
    @ex3aliber ปีที่แล้ว +5

    Hey Jason amazing video as usual! Can you do one using canva to setup and automate a youtube channel..... from creation of videos using canva to scheduling the posts... do you think this will be possible?

    • @wizaaeed
      @wizaaeed ปีที่แล้ว +4

      It's not about automation, it's about being authentic. If you think posting unlimited auto generated videos to youtube will give you an advantage over something, you're far off the line.

    • @avi7278
      @avi7278 ปีที่แล้ว +1

      LOL, you'd end up paying $50-$100 for a single crappy AI video that nobody will watch. There's cheaper way to spam your AI crap videos if you want to do that.

  • @squiddymute
    @squiddymute 11 หลายเดือนก่อน +1

    Are there any of this sort of tutorials out there that don't use an open AI subscription and actually do the job ? Is there an actual value in this if you have to pay open AI for every new page chatgpt 4 has to process??

  • @swetharangaraj4521
    @swetharangaraj4521 ปีที่แล้ว

    how to build query bot ,it should query from the database like mangodb where the data are stored from the application

  • @morohicham2579
    @morohicham2579 8 หลายเดือนก่อน

    One question here , can the AI agent bypass cloudflare captcha check as a humain ?

  • @podcasttakes
    @podcasttakes ปีที่แล้ว +1

    More of JavaScript and node please :)

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 11 หลายเดือนก่อน

    Isn’t reading the dom tree better than screen shots given how the latter needs to interpret images as opposed to structured data?

    • @AIJasonZ
      @AIJasonZ  11 หลายเดือนก่อน +1

      Yea DOM will be more effective, however the issue is each website DOM is very different and including lots of messy data; but if you figure out an universal way to clean up DOM then it will be effective
      Also with DOM, it lost ability to do some img related tasks

  • @new-parents-survival-diary
    @new-parents-survival-diary 11 หลายเดือนก่อน

    can we use Gemini Pro Vision instead of OpenAI? Could you please provide the tutorials?

  • @Roy-h2q
    @Roy-h2q ปีที่แล้ว

    Been looking for such approach for while now. Until i see this video , very exciting.
    But I also concern over the pricing of GPT-4 vision, I hope other alternative works as well like LlvAa 2 vision or CogVLM & CogAgent

  • @matthewnaples5672
    @matthewnaples5672 ปีที่แล้ว

    Did you have trouble getting your chrome profile working with Chrome Canary? That seems to be my only sticking point.

  • @MunirJojoVerge
    @MunirJojoVerge ปีที่แล้ว +1

    Hi Jason, I've been trying to reach out by email for a professional consultation for a company. It looks like the email might have fallen through the cracks of a million others or under the spam folder. Is there any other way I can reach out to you?

  • @techfren
    @techfren ปีที่แล้ว +2

    Amazing as always! I'm still trying to build my browsing ai agent from the sxsw hackathon in Sydney lmao 😅

    • @KamilKaczmarekSolutions
      @KamilKaczmarekSolutions ปีที่แล้ว

      meaning clickolas cage? 😂😂 love that

    • @techfren
      @techfren ปีที่แล้ว +1

      @@KamilKaczmarekSolutions haha yup!

  • @persas1683
    @persas1683 ปีที่แล้ว

    Hi, I do the same as you but my Chrome Canary still require login

  • @emmcode3414
    @emmcode3414 ปีที่แล้ว +17

    Don't sacrifice quality for views! You have an amazing TH-cam channel that could be useful and profitable for you and for your viewers. I think I am not the only one who thinks that you should open a patron account and create premium content for people looking for more developed material 😁

    • @stateportSound_wav
      @stateportSound_wav ปีที่แล้ว

      That’s not a bad idea. Do you feel the content has been watered down for the sake of views?

    • @emmcode3414
      @emmcode3414 ปีที่แล้ว +2

      @@stateportSound_wav No I don't, and I don't want to. That's why I am trying to encourage the channel to open a Patreon account, to preventing ourselves for loosing quality.

    • @stateportSound_wav
      @stateportSound_wav ปีที่แล้ว

      @@emmcode3414 10-4

    • @stateportSound_wav
      @stateportSound_wav ปีที่แล้ว

      @@emmcode3414 I hope that his skills have been able to get him to a point of not relying on youtube for financial stability, but that’s his business. Jason mentioned previously working on startups, so it sounds like he’s no rookie to business, and could be well off to the point your comment could (I hope, for Jason and ourselves!) be irrelevant.

    • @itisprofile
      @itisprofile 11 หลายเดือนก่อน +1

      @@stateportSound_wav Many channel about AI started just posting random news about AI or making clickbait videos just to get views which is really annoying. At this point I only check this channel

  • @prispeshnik-istini2
    @prispeshnik-istini2 11 หลายเดือนก่อน

    Hello! Thanks for the video, and yes, it’s already February 2024, and I would like to know what’s new, what crazy projects are being implemented using this technology? what business cases are already working today? And yes, can you help me set up this process for my task?
    You are super! I looked at it in translation, but I realized that you are very skilled! Bravo!

  • @angrycrypto465
    @angrycrypto465 ปีที่แล้ว

    All I need to know is how I can use this to create a trading bot that can learn and update trading knowledge based on my trades, then it can trade going forward without me

  • @asifabbas8811
    @asifabbas8811 ปีที่แล้ว

    Hi the github link is missing alot of files can you update it please also with the requirement.txt files.

  • @BenedekZajkas
    @BenedekZajkas ปีที่แล้ว

    This is Amazing !

  • @JewelEyedRanchAndRoofing
    @JewelEyedRanchAndRoofing ปีที่แล้ว

    I really appreciate your video, I am wondering if you know a good resource or if you can do a video that walks through how to do this 100% from start to finish using windows and vscode

  • @alperenersan
    @alperenersan ปีที่แล้ว

    Is it possible to implement with Llama 2 ?

  • @vaibhavfromtruffle
    @vaibhavfromtruffle ปีที่แล้ว

    Do people recommend building an AI agent from scratch using Puppeteer and GPT or using an existing self-operating system like OthersideAI or MultiOn?

  • @xonack
    @xonack ปีที่แล้ว

    let's get it!

  • @edgarl.mardal8256
    @edgarl.mardal8256 7 หลายเดือนก่อน

    Can someone create a chatGPT manager to oversee that the prompt you set it to do, is actually done, and fine tune the prompt by it self? I am using chatGPT, with puppeteer, to take screenshot, and respond to the chat based on the dialog, this goes ok, but, it keeps making wrong responses as it messes up understanding the prompt.

  • @normanluismadrid422
    @normanluismadrid422 9 หลายเดือนก่อน

    is this less detectable than using puppeteer alone?

  • @squiddymute
    @squiddymute 10 หลายเดือนก่อน

    does this work with ollama?

  • @alexanderroodt5052
    @alexanderroodt5052 ปีที่แล้ว

    Wicked sick!

  • @bt1940a
    @bt1940a ปีที่แล้ว

    Might i know the cost per screenshot and api?

  • @techxartisanstudio
    @techxartisanstudio ปีที่แล้ว

    Thanks man!

  • @yamani3882
    @yamani3882 ปีที่แล้ว

    Can you scrape websites without getting detected?

  • @moelkblau
    @moelkblau ปีที่แล้ว +3

    Interesting use case. Could you add the OpenAI API costs as reference?

    • @Humpty0Dumpty
      @Humpty0Dumpty ปีที่แล้ว +2

      A simple 4-5 step task will cost you $0.24, try complex tasks and reply back the numbers, if possible

    • @moelkblau
      @moelkblau ปีที่แล้ว

      @@Humpty0Dumpty thanks!

    • @vladimirgetselevich4704
      @vladimirgetselevich4704 ปีที่แล้ว

      @@Humpty0Dumpty Pretty expensive at the moment, to do anything valuable in quantities.

  • @wingman2tuc
    @wingman2tuc ปีที่แล้ว

    Can we use LlaVA?

  • @sidhuk3128
    @sidhuk3128 ปีที่แล้ว

    I dont know python. is there an alternative for Js, google didnt help me

  • @LearnCode_withAI
    @LearnCode_withAI ปีที่แล้ว

    Can you share the repo link?

  • @jorgezamudio8457
    @jorgezamudio8457 ปีที่แล้ว

    mind blowing

  • @CheatLayer
    @CheatLayer ปีที่แล้ว +2

    Cheat Layer Desktop has solved this problem at the desktop level using a new multi-modal model we're launching in the coming days, which enables a generalized agent that can detect UI elements directly: th-cam.com/video/IQuBA7MvUas/w-d-xo.html

    • @avi7278
      @avi7278 ปีที่แล้ว

      I'm not convinced. The step by step setup cost is too high and it still relies on GPT4 so the cost savings aren't even there. Or am I mistaken?

    • @CheatLayer
      @CheatLayer ปีที่แล้ว

      @@avi7278 our marketing agents drive traffic that would typically cost thousands more per month on google, and the sales agents get up to 11% reply rates--better than our human reps with another service.

  • @cas818028
    @cas818028 ปีที่แล้ว

    Why would you jump back and forth between JavaScript and python? Just pic one or the other this seems super ineffective and inefficient

  • @craigparker349
    @craigparker349 ปีที่แล้ว

    Do you thing someone with zero coding ability can build what you've just shown? If not do you build this for people to use?
    Great video, found it very interesting and inspiring.

  • @AmplifyAmbition
    @AmplifyAmbition ปีที่แล้ว

    How much does each run cost on this?

  • @user-wr4yl7tx3w
    @user-wr4yl7tx3w 11 หลายเดือนก่อน

    Can this be done in Python?

  • @ai_osas
    @ai_osas 10 หลายเดือนก่อน

    Has anyone here connected WebQL to a Crew agent? If yes, I would like to know how or a tutorial that can assist me. Thank you.

  • @davidwylie8491
    @davidwylie8491 ปีที่แล้ว

    Amazing

  • @HungVo-vc4no
    @HungVo-vc4no ปีที่แล้ว

    Crazy !!!

  • @croncoder862
    @croncoder862 8 หลายเดือนก่อน

    You sound exactly like a mature Jian Yang 😆

  • @ekstrajohn
    @ekstrajohn ปีที่แล้ว +3

    Twitter bot armies have never been easier to make. :)

  • @vigneshngws6684
    @vigneshngws6684 2 หลายเดือนก่อน

    Cool

  • @aliabassi1
    @aliabassi1 ปีที่แล้ว +1

    you are ... so fking smart man.

  • @GargolaAli
    @GargolaAli ปีที่แล้ว

    Hi Jason, thanks for this practical video. I would like to have your thoughts on my podcast on Process Automation and mining using AI, would be interested ?

  • @milostean8615
    @milostean8615 ปีที่แล้ว

    Market size unlimited

  • @JockJutManhwaRecap
    @JockJutManhwaRecap ปีที่แล้ว

    im lost

  • @patrickfogui3620
    @patrickfogui3620 หลายเดือนก่อน

    can it scrape nsfw websites like pornhub?

    • @shivanshsrivastava9279
      @shivanshsrivastava9279 17 วันที่ผ่านมา

      I think you can, but what are you thinking to do with that data? make another phub, create new distribution channels?

  • @supersal3478
    @supersal3478 8 หลายเดือนก่อน

    why no one has built an online tool with API for this process? @aijason

  • @thegreatestdao
    @thegreatestdao ปีที่แล้ว

    Sky torrents

  • @grubbs420
    @grubbs420 ปีที่แล้ว +1

    never trust this guy you will end up hacked

    • @amandamate9117
      @amandamate9117 ปีที่แล้ว

      i have the same feeling. last time he promoted some free openAI APis and also his codes are fishy. or? also his hair or face looks like is processed with some AI filter. i think he works for chinese government.

    • @mrfalvo6671
      @mrfalvo6671 ปีที่แล้ว

      lol@@amandamate9117

    • @KamilKaczmarekSolutions
      @KamilKaczmarekSolutions ปีที่แล้ว +1

      @@amandamate9117 he 100% certified chinese government agent, agreed. in fact ai agent * 💀

    • @itisprofile
      @itisprofile 11 หลายเดือนก่อน

      @@amandamate9117 you can just check his codes yourself. It is not like secret program where you don't know what data is going off from you computer