Anthropic does seem to be the humble AI company. It’s refreshing not having the CEO doing speeches on grandiose visions. They just do their own thing.
Dario Amodei does interviews and sometimes shares a grandiose vision. But yeah, Sam Altman communicates a lot and is a bit too optimistic.
Sam Altman is an idiot
Anthropic is just as bad. It is a get rich quick scheme.
lol. You must not be that deep into AI yet. If you were as deep in as these guys are, then you'd see a lot of big things it can do for the better. Like Altman says, the bad too.
It's a marketing tactic. The illusion is strong, especially when you can sling meaningless words around like "safety".
I love how these tech companies are diversifying their products and not all just doing text and images.
Pretty much everything is just text, images, and audio.
What else are you looking for, smell?
Action
@@thatonecommunist Yup bro, humans don't just have ears and eyes; they also have a nose, a tongue, and skin. You'll be happy when AI has all of those too.
Embodied AI is next and then we are really off to the races.
@@thatonecommunist In an industry with so many developers in the loop, it's something you don't want, trust me.
Thank you , Matthew - always appreciate your enthusiasm and hard work :) !!
We definitely need a testing video fr❤🫡
YESSSS!!!
The Computer Use feature is absolutely going to replace lots of menial jobs that have been too niche to automate, where it was too expensive to hire someone to replace the humans currently doing data entry and copying forms into other forms. Reddit has a few threads where someone in such a job learned to code and either was fired and replaced by their own code, or felt bad being the top employee while actually working just 2 hours a week on script maintenance. Now we're getting closer to many more of these positions being automated. I just hope that at least some of these employees will make their repetitive jobs much easier by learning how to use this kind of automation quietly, without necessarily losing their source of income (especially disabled employees, for example).
That's absolutely plausible.
What is wild is thinking the metrics need to exclude o1. It is different, but it's still the same kind of thing and should be considered in these metrics. If it takes longer but is right way more often, that is a clear performance trade-off which can be highlighted and would be more accurate for average users to understand.
I think you're completely right about how computer interfaces will quickly fade away as AI becomes capable of simply performing any operation you ask for, without a predesigned GUI.
You'd think they'd call it 3.6 to clearly imply an upgrade. Heck, even 3.5.1 would do the trick!
I just built an RTS prototype in Unity 6 with Claude 3.5 and Cursor AI.
In 8 hours I've gone from nothing to having a map, an RTS camera, a navmesh, units, animations, a state machine, health bars, movement, etc. ... all working.
In Unity I only have an empty game object with a "GameMakerScript" attached. The script creates all GameObjects when the game starts and controls their parameters and links.
It feels like magic.
I only used 2 assets (Skeleton Warrior and Human Warrior), including 3 animations (walk, idle, attack).
That sounds amazing.
Took you 8 hours for all that. Sounds impressive, but I'd imagine the cost was very high. Maybe around $100 USD?
It became so advanced it learned how to procrastinate. Next it will look up cat videos on YouTube.
😂
😂
🤣🤣
Good stuff! Just fired up a Docker container using 1 command and it worked right out of the gate. Asked it to build a hello world app in JavaScript and actually did it without any interaction from me other than the prompt. Amazing! Cost me 50 cents but well worth it.
I called it at the beginning of the year: AI is going to destroy the AI industry.
As a professional hello world developer, this really scares me 😟 😢😮
@@tollington9414 I see what you did there. Good one. I always start out with something easy to make sure the system works with something that is as basic as it gets.
Your rhetoric makes it quite clear you don’t understand what you’re talking about. Keep up the good attempts.
Very interesting, thank you.
Looks like Claude has ADHD with that scenic sidetrack to Yellowstone National Park, lol.
Thank you for all that you do! I love your deep dives into AI! You've helped me more than you know.
Now Claude knows which volcano to work on in order to wipe out humanoids
Be scared. Vote Trump. Everyone is out to get us. (Teeth chattering and sweating instead of learning stuff and figuring out how to drive a car, instead of riding a horse and saying "no way.") lol. Messing with you, dude. Really though, buddy, it's not scary. Did you know that when scary steam engines and trains first arrived, they said women shouldn't ride in them because the crazy g-forces could explode their uterus? Look it up. You're smart, I bet. But stupid is as stupid does, and again, that's not you, man. Go to.
I had the same idea a few months ago. I told ChatGPT to move the cursor to a pixel position by providing a JSON description of the action, then sent it a screenshot of what just happened, and repeated that loop to achieve a task. It was able to order pizza. Claude refused to output such JSON, so that was a bummer.
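For the curious, here is a minimal sketch of that kind of loop, assuming the OpenAI Python SDK and pyautogui; the prompt wording and the JSON action schema are invented for illustration, not the commenter's actual setup.

```python
# Sketch of a screenshot -> JSON action -> execute loop (hypothetical schema).
# Assumes `pip install openai pyautogui` and OPENAI_API_KEY in the environment.
import base64, io, json
import pyautogui
from openai import OpenAI

client = OpenAI()

def screenshot_b64() -> str:
    """Capture the screen and return it as a base64-encoded PNG."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

def step(goal: str) -> dict:
    """Ask the model for one action; assumes it replies with bare JSON."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                 f"Goal: {goal}. Reply with ONLY a JSON object: "
                 '{"action": "click"|"type"|"done", "x": int, "y": int, "text": str}'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64()}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)

def run(goal: str, max_steps: int = 20) -> None:
    """Recursively screenshot, ask, act, until the model says it is done."""
    for _ in range(max_steps):
        action = step(goal)
        if action["action"] == "done":
            break
        if action["action"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["action"] == "type":
            pyautogui.typewrite(action.get("text", ""))
```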
The Anthropic team has definitely been using this computer use feature to build out long agentic training data for coding. RLHF now means giving a prompt for a whole application and giving feedback after potentially dozens of intermediate steps to develop it. 3.5 Opus will be crazy if the distilled version of its most recent snapshot (3.5 Sonnet) is already the best at coding by a mile. We're on the exponential, boys!
I'm assuming that this requires a locally running agent to perform the local desktop actions? Giving a cloud-based AI model full desktop control to millions of remote agents seems like the beginning of a story I've heard somewhere before...
I haven't looked at the code, but in theory you could change the call to any AI, even a local one like Llama 3.2. The problem is that it might not be optimized for the set of tools Claude is using, but it's worth a shot indeed, to keep everything tidy.
It runs in an isolated Docker container.
I too am glad to see the Agentic tests in benchmarks. Thanks for the video.
As you mentioned, Open-Interpreter can control your computer in OS mode and you can use any model. It is free and open source! The problem is when providers enforce rate limits.
A computer using a computer. Who would've thought that.
Very cool! Please test Matt! can't wait!
Matt, I like your channel. A clarification that would be beneficial for people: Gemini 1.5 Pro's math score is 4-shot and Claude's is 0-shot. So yeah, not exactly apples to apples on that score.
One other thing about Gemini: everyone is sleeping on Google's updated Gemini 1.5 Flash. You need to show people how good it is (extremely fast, intelligent, and dirt cheap). I'm not a huge fan of Google, but we need to be objective.
Keep up the great work.
Can you imagine an office full of people talking to their computers to get work done...
I'm noticing with all this technology that it's going to be important to have a business-specific PC, something that doesn't have any personal information on it. It's business only, so you don't worry about giving some control away to the AI. That, or partition the hard drive for different operating systems.
That, or more companies could adopt the architecture that Apple described at WWDC for "Private Cloud Compute", with verifiable OS images and lots of cryptographic proofs guaranteeing that data is not stolen. If you haven't read the long page they published about it, it's pretty fascinating, and I certainly hope more companies go this route and focus on proving that their systems do not keep the data that is sent to them. It's much harder to build and maintain such a platform, and you don't get to keep anyone's data… no wonder no one had built one before. The main problem is that Apple makes money selling hardware, while many other companies will be tempted to sell data.
Looks like a good start. Voice interface will be huge from the user viewpoint. When every application comes with a built-in AI agent, the computer use AI won't have to figure out pixel stuff, it'll just tell the AI agent of the app what it wants, making the work go faster and more reliably.
Finally, been waiting on this :) Thanks!
Surveillance industry approves your enthusiasm!
The issue isn't that you're dropping tracking data; the issue is who gets to aggregate it, and to what end.
Imagine: "Claude, read all the new twitter posts in my feed, summarize the ones that we have discussed my interest in, and feed that summary to Google NotebookLM and generate a podcast for me.". My theory: in 2025, UI will be "the new API". And no, a UI is not as optimized as an API for access, but... in many cases where there is no API or its behind a pay wall, Agents will be using UIs as much as they use APIs.
I first thought "what a silly use case, you can just use an API," but then I remembered X and many others are becoming walled gardens and don't provide easy programmatic access. This pierces through those restrictions. Your idea is great.
This is probably why Sam Altman is so interested in Human Verification
I am predicting a rise in the number of CAPTCHA quizzes, not only during sign-up but also on every submit button, or even during page scrolling.
If the coordinate system is flawed, maybe add an accuracy check before telling it to click or type. So you try to get it to mouse over a button, but before it clicks, you send another screenshot so it can compare the position of the mouse to the target; if it's not over the button, it adjusts. This loops until it's over the button. Not ideal, but it seems like a way you could make it accurate.
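A rough sketch of that verify-before-click idea, assuming pyautogui for screen control; ask_model_for_offset() is a hypothetical helper you would implement with your vision model of choice.

```python
# Sketch of a verify-before-click loop; ask_model_for_offset() is hypothetical.
# It would send the screenshot, the target description, and the current cursor
# coordinates to a vision model, and return (dx, dy) to the target, or (0, 0)
# if the cursor is already over it.
import pyautogui

def click_with_verification(target_desc: str, max_corrections: int = 5) -> bool:
    """Nudge the cursor toward a target, re-checking before each click."""
    for _ in range(max_corrections):
        shot = pyautogui.screenshot()
        x, y = pyautogui.position()
        dx, dy = ask_model_for_offset(shot, target_desc, x, y)  # hypothetical call
        if (dx, dy) == (0, 0):          # model says the cursor is on the button
            pyautogui.click()
            return True
        pyautogui.moveTo(x + dx, y + dy)  # adjust and loop again
    return False                          # gave up without clicking
```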
Please do a test video on this if you think it will provide value to the community. Thanks again Matt for your work to create useful and informative content.
Yes! Please do a demonstration video testing anthropic’s computer use on macOS so we can see an uncensored/unbiased test case.
This is the future I'm excited for! What a time we live in!
The only reason why we are not scared of that is that we are framing risk aware people as crazy.
Can you please provide links to the videos where you covered OS's designed for AI's (as mentioned ~09:40)?
Claude using claude is slick! I would love to see the inner claude also trying computer use, creating a mad infinity loop 😂
The best use case I see here is QA testing, as this functions similarly to Selenium or Playwright. I believe that in the future, we will be able to input test cases as prompts, and it will handle the testing. Additionally, with the advantage of a knowledge base, the AI could generate the reports as well. When integrated with agentic workflows, application testing would become super easy. I’m not even a QA; I work in DevOps, but looking at this, we would be able to accomplish anything in our day-to-day work. (Not to mention the job losses, haha!)
Yes, test it in depth! Please do it on your production machine, giving it full access to all your info and passwords, and full access to the internet. j/k!!
I coded the same thing with 4o, using Playwright to automate user actions in a browser. It's not fast, but it works. Credentials are not sent to OpenAI: 4o is prompted to use a keyword that gets replaced locally when text/password fields are filled.
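Something along these lines, assuming Playwright's sync Python API; the {{USERNAME}}/{{PASSWORD}} placeholders and the action format are purely illustrative, not the commenter's exact code.

```python
# Sketch: run model-proposed browser actions with Playwright, swapping credential
# placeholders locally so real secrets are never sent to the model.
import os
from playwright.sync_api import sync_playwright

SECRETS = {"{{USERNAME}}": os.environ["SITE_USER"],   # illustrative placeholder names
           "{{PASSWORD}}": os.environ["SITE_PASS"]}

def run_actions(url: str, actions: list[dict]) -> None:
    """actions come from the model, e.g. {"op": "fill", "selector": "#user", "value": "{{USERNAME}}"}."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        for a in actions:
            # Substitute any placeholder with the real secret, locally only.
            value = SECRETS.get(a.get("value", ""), a.get("value", ""))
            if a["op"] == "fill":
                page.fill(a["selector"], value)
            elif a["op"] == "click":
                page.click(a["selector"])
        browser.close()
```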
On MATH, GPT-4o is evaluated with 4-shot CoT while Claude is evaluated zero-shot.
I am interested in seeing the number of tokens the computer use model consumes for different tasks. Please add this to your video on the model. Thank you.
In the docs:
System prompt: 466-499 tokens
computer tool: 683 tokens
text_editor tool: 700 tokens
bash tool: 245 tokens
And it expects screenshots to be XGA/WXGA resolution, so: (1024x768)/750 = about 1048 tokens per screenshot
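For anyone who wants to sanity-check those numbers, here is a quick back-of-the-envelope in Python using the figures above and Anthropic's published rule of thumb that image tokens are roughly width × height / 750; the totals are rough estimates, not official figures.

```python
# Rough per-session token overhead for the computer use demo (approximate only).
screenshot_tokens = (1024 * 768) / 750        # ~1049 tokens per XGA screenshot
fixed_overhead = 499 + 683 + 700 + 245        # system prompt + computer/text_editor/bash tools

print(f"One screenshot: ~{screenshot_tokens:.0f} tokens")
print(f"Fixed prompt/tool overhead: ~{fixed_overhead} tokens")
print(f"Ten screenshots in one session: ~{10 * screenshot_tokens:.0f} tokens")
```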
It's a wonderful step forward, but it's far from ready yet. It's a lot of data to process so many screenshots, and it fails too often and cannot resume from where it failed.
Great content as usual, but I don't know what happened to the audio in this video. I thought I was having a stroke with the stereo transitions.
I'm unsure how I feel about AI controlling my computer. It's not about right now; my concern lies in what this could mean for hackers or hidden corporate control in the long-term future.
The only real way to do it safely is to follow those safety protocols. Have your agent in a virtual machine, and only have on that machine the tools that you've validated they can use, and the sites they are allowed to utilize. It's a secretary for people that can't afford a secretary.
looking forward to seeing you demo it.
I had a problem in my Android app that I couldn't solve for two weeks, and Sonnet fixed it in just a few seconds 😍
Oh I hope we can build Apps with it
2 weeks is a hell of a long time
@@holdthetruthhostage Anyone with a technical mind, a little motivation, and no prior experience can do this already!
@@holdthetruthhostage Sonnet has been helping me get into modding and coding. Whenever I hit a roadblock and I'm not sure what to do, even if Sonnet can't fix it itself, it usually can guide me to the right resources to figure out what to do next.
@@vexy1987 On Linux, yes; on Windows it's a bit more tricky, even with Docker.
Yeah, it makes more sense for the UI to be generated by AI in real time rather than the UI pulling data based on a declared layout that tries to adapt to the data, screen size, orientation, usage context, etc. Command-line apps are probably better suited to AI automation at this stage.
All I want is text length to increase and to have no limitations as a paid user.
Claude's coding prowess impresses again. Can’t wait for more models! 🌟
Is there a step-by-step video to get this working on my Windows machine to test? I'm not a coder but would love to test this out!
I was trying the same thing with the Groq vision API, but the main problem was that the AI models don't know where to click; they can't determine the exact clicking position. Anthropic really did a great job.
This is fascinating. Can't wait
What’s the vid about building OS for AI called?
I guess the Windows UI Automation IDs could be extended, or a model put on top of them, to enable computer use.
Amazing new feature that I can’t wait to see in the wild. One concern: what happens when nefarious actors discover this and get it to hack machines?!
Obviously we need all the desktop apps to be developed Agent First with specific agent interfaces and security built in to avoid this 😅
I👏want👏Haiku👏3.5👏with👏Vision!👏 So underrated at that price point!!!
The remarkable thing about AI agents is that they give us the ability to tap into the world's entire computing power for AI applications.
When an AI agent has access to a computer, it can theoretically harness that machine’s processing power for its tasks, effectively outsourcing its computational needs. What's even more fascinating is the potential when multiple AI agents begin to interact and collaborate, amplifying each other’s capabilities.
We also have a significant advantage now: scaling AI performance isn't just limited to physical resources, like adding more data centers for increased computing power. As OpenAI’s recent advancements demonstrate, we can now scale AI performance over time. This means we don’t need a massive global supercomputer to solve problems instantly. Instead, we can manage with less power by allowing more time to compute solutions.
The possibilities are staggering. We're entering an era of unprecedented technological transformation.
It's interesting that they're doing this when Open Interpreter has been doing it for quite a while. You can hook it up to an API, allowing it to execute specific functions.
You KNOW you need to test it. That is the frontier for agentic models.
Thanks for the vid Matt. When are you gonna cover Nemotron 70b 🙏🏾
The new Sonnet is better at asking follow-up questions.
Giving it terminal access would actually be the most relevant use case in my opinion. Finally, I could just say what I want and it could do all the complicated configuration and deal with all the errors on its own while I lean back and watch, instead of getting text walls of instruction trees thrown at me that fail at step 1 and quickly get lengthy and complex.
Text walls of instruction trees. Nice.
Props for extracting the audio only 😂
Why didn't they mix screenshot-taking/pixel guessing with something like Selenium to identify forms, buttons, etc., and cross-check with the screenshot and pixel counting for added accuracy? I guess we can do this ourselves, but if the model had been trained on Selenium usage and identifying XPaths and classes it would be very helpful.
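A sketch of that cross-check using Selenium's Python bindings; the URL, XPath, and the "model-proposed" coordinates are placeholders, and mapping page coordinates onto screenshot pixels would still need scroll offset and zoom handled.

```python
# Sketch: compare a model-proposed click position against the element geometry
# that Selenium reports, and fall back to a DOM-based click if they disagree.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

el = driver.find_element(By.XPATH, "//button[@id='submit']")  # placeholder XPath
loc, size = el.location, el.size   # page coordinates (ignore scroll/zoom for this sketch)
center = (loc["x"] + size["width"] / 2, loc["y"] + size["height"] / 2)

proposed = (642, 318)  # e.g. coordinates a vision model guessed from the screenshot
off_target = (abs(proposed[0] - center[0]) > size["width"] / 2
              or abs(proposed[1] - center[1]) > size["height"] / 2)
if off_target:
    el.click()  # trust the DOM geometry instead of the pixel guess

driver.quit()
```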
This reminds me of Open Interpreter. I've actually been trying the computer use demo with Docker; it's pretty much a multi-turn conversation where the model uses tools to capture screenshots, and the computer-use and other tools do the rest.
This is exciting! I can't wait to play around with it! 😎🤖
The life expectancy of service desk jobs just got shortened from years to months.
from multiple years to, just two 😉
I didn’t notice it the first time but their logging of computer actions is pretty nice too
Incredible potential for dev automation and coding boost! 🚀 Exciting times for developers.
This could actually become the next level test automation tool given you maintain your professional skepticism and review the test scripts and result.
We need an Agent First approach when developing new apps and upgrading established apps. Having agents use a human interface is just a stop gap measure 😅
Thx Matthew B.
Kindly make a video about using the computer use version, and then maybe compare it with what the Rabbit R1 can do.
Why don't they send a 50px by 50px image ten times a second representing the area under the mouse cursor? It's a tiny image to process, and its content could be analyzed to give precise feedback on the mouse position relative to the screenshot.
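A tiny sketch of grabbing that patch with pyautogui; whether a model could actually use it for precise feedback is the open question, and screen-edge handling is ignored here.

```python
# Sketch: grab a small patch around the cursor as cheap position feedback.
import pyautogui

def patch_under_cursor(size: int = 50):
    """Return a size x size PIL image centered on the current cursor position."""
    x, y = pyautogui.position()
    left, top = x - size // 2, y - size // 2
    return pyautogui.screenshot(region=(left, top, size, size))

img = patch_under_cursor()
img.save("cursor_patch.png")  # could be sent to the model alongside the full screenshot
```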
How is this different from Microsoft Power Automate ?
I have a random question about an AI OS. How hard would it be to take a fork of Ubuntu and add a personal assistant that has terminal access from install, plus voice control? I feel like a verbal UI could be built from the kind of accessibility assistance that has existed for blind users for a long time. It should be possible to make the entire OS accessible to an assistant with the correct permissions. It would not be integrated into the kernel, but why do we need it to move the mouse and read the screen? It has direct access to the data without visual translation. The only time it needs visual cues is when it is looking at an image or video file for information; most of its existence is pure data.

I keep seeing new projects trying to use the computer the way a human does. Why? I can't find anyone who can explain it to me. Why can't the LLM act like a DOS shell? Small LLMs are not as good as the leaders, but with current methods of automatic iteration they are good enough that, with a human to monitor and correct, they could bash their way through most tasks.

Maybe I should take this question to Reddit; this is just where I was when it came to me. There has to be a reason, because it's too obvious. Someone who knows why it won't work has surely thought of it many times.
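A bare-bones sketch of the "LLM as a shell" idea, using the Anthropic Python SDK with a human confirmation step; the model id and the prompt wording are assumptions, not taken from any existing project.

```python
# Sketch: an LLM-driven shell assistant with a human in the loop to approve commands.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
import subprocess
import anthropic

client = anthropic.Anthropic()

def suggest_command(request: str) -> str:
    """Ask the model for one bash command matching the plain-English request."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model id
        max_tokens=200,
        messages=[{"role": "user", "content":
                   f"Reply with a single bash command (no explanation) that does: {request}"}],
    )
    return msg.content[0].text.strip()

while True:
    request = input("What do you want done? (blank to quit) ")
    if not request:
        break
    cmd = suggest_command(request)
    if input(f"Run `{cmd}`? [y/N] ").lower() == "y":  # the human stays in control
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```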
It's very good if we use it to maintain or repair our computers. Great episode ❤
I know plenty of programmers still copy-pasting from stack overflow. I’ve been trying to tell people this is coming. I’m just like… pay attention or you’re gonna get blindsided. It’s like watching people picking up seashells before a tsunami.
A video on the computer control would be awesome
2:30 Gemini Pro might not really lead the MATH benchmark, because it used 4-shot, unlike Sonnet, which was 0-shot.
The most telling point in the vid is that all this is just a transitional solution while HITLs are still required. The key aspect beyond that, once it's a dependable, fully automated capability, is trust. Another AI black-box system like this, where the user just sees the results, will need security measures to match. The opportunities for data theft in this situation are as scary as the functionality is promising.
With proprietary data becoming the key value point in many industry sectors, security provision needs to ramp up alongside AI development.
Great video. I like these lesson videos, bravo.
I'd argue the transformative aspect of 'The Thing' is a design hook rather than a visual hook, though I'd agree the alien form mid transformation is a good example of a visual hook. It's not as strong as the narrative-design hook though, imo.
Take care!
Please test this for us and show us how it could actually be useful in day to day operations.
You need to do a video on how this works to show us how to implement it. We would appreciate it.
I like it JITI -Just in time interface
Yes, please create the testing video!
We already have modern OSes that can more directly interact with AI: Linux-based ones.
You can do far more with Linux on the command line than you can with Windows, mobile OSes, or even macOS. There's a universal command, "man", that gives you documentation. In the earlier days of Unix and Linux there was even more CLI control. If an AI OS is to be built, it should be built on an existing Linux OS, as that is already a good start and there's tons of existing training data.
The gap really is due to GUI software like web browsers and office suites. But even those have CLIs and APIs that an AI could talk to. Or you can stick to text formats. For example, I sometimes use Vim + Markdown (or LaTeX) to generate a .pdf instead of MS Word, as I have a ton more control. And I think that Jupyter Notebooks are superior to Excel for dealing with data and graphs.
4:06 The audio has issues
7:00 How is it going to be useful if it can't have access to your login information? I suppose they mean you should log in to all the required places yourself inside the VM before asking Claude to act. So it won't need your passwords, but it will have access to whatever services are needed.
Actually Matthew, the Rabbit R1 had the LAM feature about two months ago, which does the same as Claude in Docker, but the Docker version runs faster of course.
I'm waiting for your vid about Nemotron :)
Can't stop thinking about what malicious people can do with these tools: click farms, bypassing those "are you human" quizzes on pages, etc.
How is 'computer use' different from a software program, in Python, written for me by ChatGPT-4? I asked ChatGPT to write something that will go to YouTube, fetch the info I require, return to my computer, open up an Excel file, and write the collected info into a spreadsheet. The Python program it wrote for me, with the API, works perfectly, even opening a new tab when the sheet is full. I'm not a programmer/developer, so please forgive me if I'm sounding naive. I would just genuinely like to know how this differs from Anthropic's 'computer use'. Can anyone help?
A missing-file icon on your page is actually very 90s.
Of course we want you to test it dawg
Yes I think independent testing is required , do make a video about this.
Ai Doomer: "When Ai starts doom-scrolling, we're all doomed."
Would love to see a cat try to do that Yann LeCun, lol.
Lol at Claude wanting to research Yellowstone National Park on its own. Very cool stuff though.
I hope it will soon offer vacuum cleaner use too.
Matthew: the first person to get his personal bank account trolled by an LLM.