Anthropic’s Claude Computer Use Is A Game Changer | YC Decoded
By Y Combinator
Summary
## Key takeaways
- **Claude Trained on Screenshots**: Claude has been able to analyze images since Claude 3 in March; the new addition is training it on screenshots of a computer so that it outputs click locations and keyboard presses, which took relatively little extra training. [01:45], [02:18]
- **Agent Loop Drives Autonomy**: Claude analyzes the prompt, decides on a tool, takes screenshots to check progress, and loops back with adjustments until it completes the task; this repeatable loop of deciding, evaluating, and acting handles complicated step-by-step tasks. [03:10], [03:42]
- **Demo: Fills Spreadsheet via Search**: Claude takes screenshots, realizes the ant equipment company isn't in the spreadsheet, searches for a match, scrolls the page, and gathers the info needed to fill out the form. [02:38], [03:00]
- **Demos: Hike Planning and OSHA Check**: Claude plans a sunrise hike at the Golden Gate Bridge by searching the web and creating a Google Calendar event; in another demo it monitors a construction site video for safety issues, notes gear, spots problems, and compiles a spreadsheet for OSHA compliance. [03:42], [04:35]
- **Key Limitations Exposed**: Computer use is slower than typical models, tends to crash, missteps in tool selection, gets confused, and in one session veered off task by searching for Yellowstone pictures; it's also vulnerable to prompt injection, such as a website tricking it into uploading password manager contents. [05:22], [06:25]
- **Paradigm Shift: Model Fits Tools**: Up until now developers made tools to fit the model with custom environments; now the model fits the tools, massively lowering barriers for developers and enabling AI to handle drudge work. [04:36], [05:10]
Topics Covered
- Claude reads screenshots to click pixels
- Agent loop automates step-by-step tasks
- Model fits tools, flipping developer paradigm
- AI agents eliminate human drudge work
- Computer-using AI reshapes daily lives
Full Transcript
the rocks can talk but they can also read they can see and now they can use a computer browsing the web clicking
buttons typing text all by itself the age of AI agents is here one of the first out of the gate is Claude computer use
Anthropic's brand new AI agent let's dive into how it works what it can do and how it may change AI forever
[Music] in October Anthropic made waves when it released a set of upgraded models Claude
3.5 Haiku and a new 3.5 Sonnet they also released something special computer use but they're not the only ones in the space we already know Sam Altman is
working to recreate Samantha from the movie Her and OpenAI is said to be releasing its own agent Operator in the new year Google is working on something
similar too the landscape for AI agents is growing fast and so far Anthropic is the first of the big AI labs to get into the game right now Claude computer use
is still in public beta as developers put it to the test but already it's looking like a complete game changer so how does it work Claude had the ability
to understand images for a while so the next step was to train it on how and when to perform specific actions like clicking buttons or writing text based
on what's displayed on the screen Claude has had for a long time since Claude 3 back in March the ability to analyze images and respond to them with
text the only new thing we added is those images can be screenshots of a computer and in response we train the model to give a location on the screen
where you can click and/or buttons on the keyboard you can press in order to take action and it turns out that with actually not all that much additional training the models can get quite good
at that task it's a good example of generalization for this Anthropic needed to train Claude to recognize exact locations on the screen down to the
pixel Anthropic was then able to train Claude to understand what's happening on screen and to reason about how it should use its software tools to do tasks for
example it might help you automate boring and repetitive tasks Claude's going to start taking screenshots of my screen and quickly realizes that the ant equipment company isn't actually in the
spreadsheet luckily we get a search match and Claude then starts scrolling through the page looking for all the information it needs to fill out this form to get started with computer use developers have to run it in a virtual
machine or container like Docker you'll also need an Anthropic API key once that's all set you can then open a dedicated browser window which shows the user prompt on the left and Claude's
activity on the right Claude starts by analyzing the prompt and deciding which tool to use as it works it takes a screenshot at each step to check its
progress making sure the task is on track if adjustments are needed Claude loops back to try different actions or tools until it completes the task this
repeatable loop of deciding evaluating and acting is called the agent loop and it's how Claude handles complicated step-by-step tasks all on its own so what
else can computer use make possible in their own demos Anthropic shows us a few different tasks like this one of Claude helping to plan a sunrise hike at the
Golden Gate Bridge it searches the web figures out some important details and then creates an event in Google Calendar in another example Wharton
Professor Ethan Mollick puts Claude computer use to the test by feeding it a video of a construction site and prompting Claude to monitor the site and look for issues
with safety you'll see Claude takes screenshot after screenshot analyzing different parts of the site making note of all the gear and materials and trying
to spot any potential issues it even finishes up by putting everything together in a nice neat spreadsheet automated OSHA compliance check by now
it should be clear that computer use is a step forward for AI up until now developers have had to make tools to fit the model coming up with custom environments where AIs use specially
designed tools to do different various tasks now we can make the model fit the tools that's a powerful change computer
use opens up so many applications businesses can automate repetitive tasks and increase efficiency while the average user can save time on routine things like booking flights or ordering
food it's easy to see a future where AI agents handle most of the drudge work for us and for developers computer use massively lowers the barriers to entry
LLMs have already made tasks like coding way more accessible to the average person and computer use takes that a whole step further computer use is still a work in
progress so it has some bugs and limitations it's much slower than typical models and has a tendency to crash from time to time so reliability
is still an early concern occasionally Claude will misstep in its tool selection get confused or even sometimes veer off task during one session that
Anthropic shared on YouTube Claude unexplainably started searching for pictures of Yellowstone National Park out of nowhere in the middle of its task
to be fair humans get distracted and sometimes do that too Claude does have guardrails since it could easily be used for abuse it steers clear of things like account creation or content
generation for social media it's also vulnerable to prompt injection a security risk where the model can be tricked to follow different information
or prompts embedded in the online sources it visits rather than sticking to the original prompt imagine a website prompt injecting Claude to upload the
contents of your password manager that'd be bad Anthropic thought about this and tries to keep users safe by keeping actions contained to a secure virtual machine
limiting access to sensitive data and strictly controlling approved sites however many of these limitations could be lifted soon because this beta is just
the beginning Anthropic's already said that computer use will rapidly improve to become faster more reliable and more useful for the tasks users want to
complete plenty of startups are getting into the mix too just recently a YC company Kura released their own browser agents that seem to outperform Claude computer use on the WebVoyager
benchmark achieving a new state-of-the-art in the near future LLMs with the full ability to use and control computers will reshape
everything how developers write software how CEOs run their companies and even how we all live our daily lives each new
groundbreaking application will transform how we work connect and live this kind of AI won't just be an assistant it'll take on entire tasks
that once needed whole teams or companies so what will you build with computer use [Music]
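Editor's sketch: the agent loop described in the transcript (take a screenshot, decide on an action, act, repeat until done) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not Anthropic's implementation or API. Every name here (`Action`, `FakeComputer`, `scripted_model`, `run_agent_loop`) is hypothetical, the "model" just replays a fixed plan, and the "computer" is a log rather than a real virtual machine or container.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One step chosen by the model (hypothetical shape, not Anthropic's schema)."""
    kind: str                              # "click", "type", or "done"
    xy: Optional[Tuple[int, int]] = None   # pixel location for a click
    text: Optional[str] = None             # text for keyboard input


@dataclass
class FakeComputer:
    """Stand-in for the VM or container; it only records what was done to it."""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real setup would return image data; here the "screen" is a short string.
        return f"screen after {len(self.log)} actions"

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.log.append(f"click at {action.xy}")
        elif action.kind == "type":
            self.log.append(f"type {action.text!r}")


def scripted_model(prompt: str, screenshot: str, step: int) -> Action:
    """Placeholder for the model call: replays a fixed plan instead of reasoning."""
    plan = [
        Action("click", xy=(120, 48)),
        Action("type", text="ant equipment co"),
        Action("done"),
    ]
    return plan[min(step, len(plan) - 1)]


def run_agent_loop(
    prompt: str,
    computer: FakeComputer,
    model: Callable[[str, str, int], Action],
    max_steps: int = 10,
) -> bool:
    """Repeat evaluate -> decide -> act until the model says it is done."""
    for step in range(max_steps):
        shot = computer.screenshot()        # evaluate: look at the current screen
        action = model(prompt, shot, step)  # decide: pick the next action
        if action.kind == "done":
            return True
        computer.apply(action)              # act: click or type
    return False                            # step budget ran out before finishing


if __name__ == "__main__":
    pc = FakeComputer()
    finished = run_agent_loop("fill out the vendor form", pc, scripted_model)
    print(finished, pc.log)
```

The `max_steps` cap is a design choice of this sketch: since the transcript notes that the model can get confused or veer off task, bounding the loop keeps a stuck run from continuing indefinitely.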