
The new OpenAI Agents Platform: CUA, Web Search, Responses API, Agents SDK!!

By Latent Space

Summary

## Key takeaways

- **Responses API Unifies APIs**: The Responses API merges the best of the Assistants API and Chat Completions, exposing built-in tools through a single API request instead of six separate objects, while Chat Completions stays supported long-term. [02:31], [03:56]
- **GPT-4o Search Jumps to 90% Accuracy**: The GPT-4o search preview model scores 90% on SimpleQA versus 38% for base GPT-4o, using synthetic data and o1-style distillation to stay factual and cite sources accurately. [08:12], [09:36]
- **Improved File Search Adds Metadata Filtering**: File search now includes metadata filtering, critical for vector stores over 5-10k records, plus query optimization and custom reranking, as a managed RAG service versus DIY. [16:14], [17:47]
- **Computer Use Enables Screen Automation**: The computer use tool, which powers Operator, takes screenshots and outputs actions like click, scroll, and type for multi-turn browser tasks, using a separate fine-tuned model described in the episode as the "GPT-1 of computer use." [18:46], [19:45]
- **Agents SDK Upgrades Swarm with Tracing**: The Agents SDK evolves Swarm with types, guardrails, handoffs, and OpenAI dashboard tracing for multi-agent orchestration, and supports any Chat Completions-compatible API. [22:00], [24:01]

Topics Covered

  • Responses API Unifies Agentic Workflows
  • GPT-4o Search Jumps 38% to 90% Accuracy
  • Combine File Search with Web for Personalization
  • Computer Use Enables Browser Task Automation
  • Agents SDK Evolves Handoffs with Tracing

Full Transcript

Hey everyone, welcome back to another Latent Space Lightning episode. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI. Hi. And today we have a super special episode, because we're talking with our old friend Romain. Hi, welcome. Thank you. Thank you for having me. And Nikunj, who is most famous because, if anyone has ever tried to get access to anything on the API, Nikunj is the guy. So I know your emails, because I look forward to them. Yeah, nice to meet all of you.

Yeah, I think we're basically convening today to talk about the new API. So perhaps you guys want to just kick off: what is OpenAI launching today?

launching today? Yeah, so uh I can kick it off. We're launching a bunch of new

it off. We're launching a bunch of new things today. Uh we're going to do three

things today. Uh we're going to do three new built-in tools. So we're launching the web search tool. This is basically chat GBD for search but available in the API. We're launching uh an improved file

API. We're launching uh an improved file search tool. So this is you bringing

search tool. So this is you bringing your data to OpenAI. You upload it. We

you know take care of parsing it, chunking it, embedding it, making it searchable, give you this like ready vector store that you can use. So that's

the file search tool. And then we're also launching our computer use tool. So

this is the tool behind the operator product in chat GPD. So that's coming to developers today. And to support all of

developers today. And to support all of these tools, we're going to have a new API. So you know, we launched chat

API. So you know, we launched chat completions like I think March 2023 or so. It's been a while. So, so we're

so. It's been a while. So, so we're looking for an update over here to support all the new things that the models can do. And so, we're launching this new API called the responses API.

It is uh you know it works with tools.

We think it'll be like a great option for all the future agentic products that we build. And so, that is also launching

we build. And so, that is also launching today. Actually, the last thing we're

today. Actually, the last thing we're launching is the agents SDK. We launched

this thing called Swarm last year where you know it was an experimental SDK for people to do multi-gen orchestration and stuff like that. It was supposed to be like educational experimental but like

people people really loved it. They like

ate it up and so we were like all right let's let's upgrade this thing. Let's

give it a new name. And so we're calling it the agents SDK. It's going to have built-in tracing in the OpenAI dashboard. So lots of cool stuff going

dashboard. So lots of cool stuff going out. Uh so yeah, excited about it.

out. Uh so yeah, excited about it.

That's a lot. But we said 2025 was the year of agents. So there you have it like a lot of new tools to build these agents for developers. Okay. I guess I guess we'll just kind of go one by one and we'll leave the agents SDK towards

So, Responses API. I think the primary concern that people have, and something I think I voiced to you guys when I was talking with you in the planning process, was: is Chat Completions going away? So I just wanted to let you guys respond to the concerns people might have.

Chat Completions is definitely here to stay. It's a bare-metal API we've had for quite some time, with lots of tools built around it, so we want to make sure it's maintained and people can confidently keep building on it. At the same time, it was optimized for a different world, right? It was optimized for a pre-multimodality world. We also optimized for single turn: text prompt in, text response out. And now, with these agentic workflows, we notice that developers and companies want to build longer-horizon tasks, things that require multiple turns to get the task accomplished, and computer use is one of those, for instance. That's why the Responses API came to life: to support these new agentic workflows. But Chat Completions is definitely here to stay.

And for the Assistants API, we have a target sunset date of the first half of 2026.

So, in my mind, there's a very poetic mirroring of the API with the models. I view this as the merging of the Assistants API and Chat Completions into one unified Responses API, kind of like how GPT and the o-series models are also unifying.

Yeah, that's exactly the right framing. I think we took the best of what we learned from the Assistants API, especially being able to access tools very conveniently, but at the same time we simplified the way you have to integrate: you no longer have to think about six different objects to get access to these tools. With the Responses API, you just make one API request and suddenly you can bring in those tools.
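For reference, here is a minimal sketch of what that "one API request" looks like in the Python SDK, assuming the built-in web search tool is exposed under the type name `web_search_preview` as described in the launch materials; treat the exact names as illustrative rather than authoritative.

```python
from openai import OpenAI

client = OpenAI()

# One request: the model, the input, and a built-in tool are declared inline,
# instead of wiring up separate assistant, thread, message, and run objects.
response = client.responses.create(
    model="gpt-4o",
    input="Summarize today's top AI news in two sentences.",
    tools=[{"type": "web_search_preview"}],  # built-in web search tool
)

print(response.output_text)  # convenience accessor for the final text output
```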

Yeah, absolutely. And I think we're going to make it really easy and straightforward for Assistants API users to migrate over to the Responses API without any loss of functionality or data. Our plan is absolutely to add assistant-like objects and thread-like objects that work really well with the Responses API. We'll also add the code interpreter tool, which is not launching today but will come soon, and we'll add async mode to the Responses API, because that's another difference from Assistants. We'll have webhooks and things like that. But I think it's going to be a pretty smooth transition once we have all of that in place, and we'll give folks a full year to migrate and help them through any issues they face. So overall, I feel like Assistants users are really going to benefit from this longer term, with this more flexible primitive.

How should people think about when to use each type of API? I know that in the past the Assistants API was maybe more stateful, kind of long-running, many-tool-use, file-based things, and Chat Completions is more stateless, like the traditional completions API. Is that still the mental model people should have, or should you by default always try to use the Responses API?

So the Responses API is, at launch, going to support everything that Chat Completions supports, and over time it's going to support everything that Assistants supports. So it's going to be a pretty good fit for anyone starting out with OpenAI; they should be able to just go to Responses. Responses, by the way, also has a stateless mode: you can pass in store: false, and that makes the whole API stateless, just like Chat Completions. We're really trying to get this unification story in so people don't have to juggle multiple endpoints. That being said, Chat Completions is just the most widely adopted API, it's so popular, so we're still going to support it for years with new models and features. But if you're a new user, or if you're an existing user who wants to tap into some of these built-in tools, you should feel totally fine migrating to Responses, and you'll have more capabilities and performance than Chat Completions.

to you was that it is a strict superset, right? Like you should be able to do

right? Like you should be able to do everything that you could do in tracks and with assistance. The thing that I just assumed that because you're you're now, you know, by default is stateful,

you actually storing the the chat uh logs or the the chat state. I thought

you'd be charging me for it. So, uh you know, to me it was very surprising that you figured out how to make it free.

Yeah, it's free. We we store your state for 30 days. You can turn it off. Uh but

yeah, it's it's free. The interesting

thing on state is that it just like makes particularly for me it makes like debugging things and building things so much simpler where I can like create a responses object. It's like pretty

responses object. It's like pretty complicated and part of this more complex uh application that I've built.

I can just go into my dashboard and see exactly what happened. Did I mess up my prompt? Did it like not call one of

prompt? Did it like not call one of these tools? Did I misconfigure one of

these tools? Did I misconfigure one of the tools? And like the the visual

the tools? And like the the visual observability of everything that you're doing is is so so helpful. So, I'm

excited like about people trying that out and and getting benefits from it, too. Yeah, it's uh it's really I think a

too. Yeah, it's uh it's really I think a really nice to have. Uh but I all I'll say is that uh my friend Corey Quinn says that anything that can be used as a database will be used as a database. So,

be prepared for some abuse.

All right. Yeah, that's a good one. Some

of that with the metadata that's some people are very very creative at stuffing data object. Yeah, we do have metadata

data object. Yeah, we do have metadata with responses. Exactly.

with responses. Exactly.

Let's get through all of these. So, web search. When I first saw web search, I thought you were going to just expose an API that returns a nice list of things, but the way it's named is GPT-4o search preview. So I'm guessing you're basically using the same model that's in ChatGPT search, which is fine-tuned for search. I'm guessing it's a different model than the base one, and the jump in performance is impressive. Just to give an example: on SimpleQA, GPT-4o is at 38% accuracy and GPT-4o search is at 90%. We always talk about how the model is not everything you need; the tools around it are just as important. So maybe give people a quick preview of the work that went into making this special.

Should I take that? Yeah. So, firstly, we're launching web search in two ways. One is in the Responses API, which is our API for tools: it's available as a web search tool itself. So you'll be able to go to tools, turn on web search, and you're ready to go. But we still wanted to give Chat Completions users access to real-time information. So in the Chat Completions API, which does not support built-in tools, we're launching direct access to the fine-tuned model that ChatGPT search uses, and we call it GPT-4o search preview.

And how is this model built? Basically, our search research team has been working on this for a while. Their main goal is to gather a bunch of information from all of the data sources we use for search, pick the right things, and then cite them as accurately as possible. That's what the search team has really focused on. They've done some pretty cool stuff: they use synthetic data techniques, and they've done o1 model distillation to make these 4o fine-tunes really good. But the main thing is: can it remain factual, can it answer questions based on what it retrieves, and can it cite accurately? That's what this fine-tuned model really excels at. So yeah, I'm super excited that it's going to be directly available in Chat Completions along with being available as a tool.

Yeah, just to clarify: if I'm using the Responses API, this is a tool, but if I'm using Chat Completions, I have to switch models. I cannot use o1 and call search as a tool. That's right. Exactly.
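A sketch of the two access paths just described, assuming the model name `gpt-4o-search-preview` and the tool type `web_search_preview` from the launch notes; names may differ in your SDK version.

```python
from openai import OpenAI

client = OpenAI()

question = "What changed in the latest Python release?"

# 1) Responses API: web search is a built-in tool on a regular model.
r = client.responses.create(
    model="gpt-4o",
    input=question,
    tools=[{"type": "web_search_preview"}],
)
print(r.output_text)

# 2) Chat Completions: no built-in tools, so you switch to the fine-tuned
#    search model directly. This is a different model, not a tool.
c = client.chat.completions.create(
    model="gpt-4o-search-preview",
    messages=[{"role": "user", "content": question}],
)
print(c.choices[0].message.content)
```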

I think what's really compelling, at least for me and my own uses of it so far, is that when you use web search as a tool, it combines nicely with every other tool and every other feature of the platform. Think about this for a second: imagine you have a Responses API call with the web search tool, and then you turn on function calling and also, let's say, structured outputs. Now you have the ability to structure any data from the web, in real time, in the JSON schema you need for your application. It's quite powerful when you start combining those features and tools together. It's kind of like an API for the internet, almost: you get access to the precise schema you need for your app.
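To make the combination concrete, here is a hedged sketch of web search plus structured outputs in one call. The schema is made up for illustration, and the exact shape of the structured-output parameter in the Responses API (shown here as `text.format` with a JSON schema) is an assumption that may differ from the current SDK; consult the docs before relying on it.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema: structure live product data pulled from the web.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_usd": {"type": "number"},
        "source_url": {"type": "string"},
    },
    "required": ["name", "price_usd", "source_url"],
    "additionalProperties": False,
}

# Web search provides real-time data; structured outputs force the result
# into the JSON schema the application expects.
response = client.responses.create(
    model="gpt-4o",
    input="Find a current price for a 27-inch 4K monitor and return one option.",
    tools=[{"type": "web_search_preview"}],
    text={
        "format": {
            "type": "json_schema",
            "name": "product",
            "schema": product_schema,
            "strict": True,
        }
    },
)

print(response.output_text)  # JSON string matching product_schema
```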

Yeah. And then, just to wrap up on the infrastructure side of it, I read in the post that publishers can choose to appear in web search. So are people in it by default? How can we get Latent Space into the web search API?

Yeah, I think we have some documentation around how website publishers can control what shows up in our web search tool, so you should be able to read that. And I think we should be able to get Latent Space in, for sure.

You know, I compare this to a broader trend I started covering last year of online LLMs. Actually, Perplexity, I think, was the first to offer an API that is connected to search, and then Gemini had the search grounding API. And I actually missed this in my original reading of the docs, but you even give citations with the exact sub-paragraph that matches, which I think is the standard nowadays. I think my question is: how do we think about what a knowledge cutoff is for something like this? Because now there's basically no knowledge cutoff, it's always live. But then there's a difference between what the model has internalized through backpropagation and what it's looking up via RAG.

I think it depends on the use case and what you want to showcase as the source. For instance, take a company like Hebbia that has used this web search tool: for credit firms or law firms, they can combine public information from the internet with live sources and citations, which sometimes you do want access to, as opposed to just the internal knowledge. But if you're building something different, where you just want an assistant that relies on the deep knowledge the model has, you may not need these direct citations. So I think it depends on the use case a little bit, but there are many companies like Hebbia that will need that access to citations to know precisely where the information comes from.

Yeah, for sure. And then one thing on the breadth: I think a lot of the deep research and open deep research implementations have this sort of hyperparameter for how deep they're searching and how wide they're searching. I don't see that in the docs, but is that something we can tune? Is that something you recommend thinking about?

Super interesting. It's definitely not a parameter today, but we should explore that. It's very interesting. I imagine the way you would do it with the web search tool and the Responses API is you would have some form of agent orchestration over here, where you have a planning step, and then each web search call you make explicitly goes a layer deeper and deeper and deeper. But it's not a parameter that's available out of the box. It's a cool thing to think about, though.

Yeah, the only guidance I'll offer there is that a lot of these implementations offer top-k, like top 10, top 20, but actually you don't really want that. You want some kind of similarity cutoff, some matching-score cutoff, because if there are only five documents that match, fine; if there are 500 that match, maybe that's what I want, right? But also, that might make my cost very unpredictable, because the costs are something like $30 per thousand queries, right?

Yeah. I guess you could have some form of a context budget, and then you go as deep as you can, pick the best stuff, and put it into X number of tokens. There could be some creative ways of managing cost. Yeah, that's a super interesting thing to explore.
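The similarity-cutoff-plus-context-budget idea discussed above is independent of any particular API. A toy sketch, where `results` is assumed to be a list of (score, text) pairs from some retrieval step:

```python
# Toy illustration of "similarity cutoff + context budget" instead of a fixed top-k.

def select_context(results, min_score=0.75, token_budget=4000):
    """Keep everything above a relevance threshold, best-first, until the budget is spent."""
    selected, used = [], 0
    for score, text in sorted(results, key=lambda r: r[0], reverse=True):
        if score < min_score:
            break                      # cutoff: ignore weak matches entirely
        cost = len(text) // 4          # rough token estimate (~4 chars per token)
        if used + cost > token_budget:
            break                      # budget: stop once the context is full
        selected.append(text)
        used += cost
    return selected

# Five strong matches: all five get through. Five hundred: as many as fit the budget.
```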

Do you see people using file search and the web search API together, where you search and then store everything in files so that next time I'm not paying for the search again? How should people balance that?

That's actually a very interesting question. Let me first tell you about a really cool way I've seen people use file search and web search together: they put their user preferences or memories in the vector store. A query comes in, you use the file search tool to get someone's reading preferences or fashion preferences and things like that, and then you search the web for information or products they can buy related to those preferences. And then you render something beautiful to show them: here are five things you might be interested in. That's how I've seen file search and web search work together. And by the way, that's a single Responses API call, which is really cool. You just configure these things, go boom, and everything just happens. But yeah, that's how I've seen files and web work together.
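A minimal sketch of that single call, assuming a preexisting vector store holding the user's preferences; the store ID and prompt are placeholders, and tool type names follow the launch docs.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical vector store holding the user's saved preferences / "memories".
PREFERENCES_STORE_ID = "vs_123"  # placeholder ID

response = client.responses.create(
    model="gpt-4o",
    input="Recommend five new sci-fi books I might enjoy this month.",
    tools=[
        # Pull the user's reading preferences from their private vector store...
        {"type": "file_search", "vector_store_ids": [PREFERENCES_STORE_ID]},
        # ...and combine them with what's currently on the web.
        {"type": "web_search_preview"},
    ],
)

print(response.output_text)
```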

But I think that what you're pointing out is like interesting and I'm sure developers will surprise us as they always do in terms of how they combine these tools and how they might use file search as a way to have memory and

preferences like Nik says. But I think like zooming out, what I find very compelling and powerful here is like when you have these like neural nets that have like all of the knowledge that

they have today, plus real time access to the internet for like any kind of real-time information that you might need for your app and file search where you can have a lot of company private

documents, private details. You combine

those three and you have like very very compelling and precise answers for any kind of use case that your company or or your product might want to enable.

There's a difference between sort of internal documents versus the the open web, right? Like you're going to need

web, right? Like you're going to need both. Exactly. Exactly. I never thought

both. Exactly. Exactly. I never thought about it doing memory as well. I guess

again, you know, anything that's a database, you can store it and they will use it as a database. That sounds

But I think you've also been expanding file search: you have more file types, you have query optimization, custom reranking. So it really seems like it's been fleshed out. Obviously, I haven't been paying a ton of attention to the file search capability, but it sounds like your team has added a lot of features.

Yeah, metadata filtering was the main thing people were asking us for for a while, and that's the one I'm super excited about. It's just so critical once your vector store size goes over, you know, more than 5-10,000 records. You kind of need that. So yeah, metadata filtering is coming, too.
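A hedged sketch of what a metadata-filtered file search call might look like. The `filters` shape and `max_num_results` parameter follow the launch docs as best I can tell, and the vector store ID and attribute names are made up for illustration; verify against the current API reference.

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o",
    input="What is our parental leave policy in Germany?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["vs_hr_docs"],   # placeholder vector store ID
        # Only search chunks whose attached metadata matches; this matters
        # once the store grows past a few thousand records.
        "filters": {"type": "eq", "key": "region", "value": "DE"},
        "max_num_results": 5,
    }],
)

print(response.output_text)
```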

And for most companies, it's also not a competency that you necessarily want to rebuild in-house. Thinking about embeddings and chunking and all of that sounds very complex for something that's pretty obvious to ship for your users. Companies like Navan, for instance, were able to build with file search: you take all of the FAQs and travel policies you have, you put them in the file search tool, and then you don't have to think about anything. Now your assistant naturally becomes much more aware of all of these policies from the files.

The question is, there's a very, very vibrant RAG industry already, as you well know. There are many other vector databases and many other frameworks, probably, if it's an open-source stack. I'll say a lot of the AI engineers I talk to want to own this part of the stack, and it feels like: when should we DIY, and when should we just use whatever OpenAI offers?

Yeah. I mean, if you're doing something completely from scratch, you're going to have more control, right? So I'm super supportive of people trying to roll up their sleeves and build their super-custom chunking strategy and super-custom retrieval strategy and all of that. Those are things that will be harder to do with OpenAI's tools. OpenAI has an out-of-the-box solution: we give you some knobs to customize things, but it's more of a managed RAG service. So my recommendation would be to start with the OpenAI offering and see if it meets your needs, and over time we're going to be adding more and more knobs to make it even more customizable. But if you want the completely custom thing, if you want control over every single thing, then you'd probably want to go hand-roll it using other solutions. So we're supportive of both; engineers should pick.

Yeah. And then we've got computer use. Operator was obviously one of the hot releases of the year, and we're only two months in. Let's talk about that. And that's also, it seems, a separate model that has been fine-tuned for Operator and has browser access.

Yeah, absolutely. I mean, the computer use models are exciting. The cool thing about computer use is that we're just so, so early. It's like the GPT-2 of computer use, or maybe the GPT-1 of computer use, right now. But it is a separate model that the computer use team has been working on. You send it screenshots and it tells you what action to take. So the outputs of it are almost always tool calls, and you're inputting screenshots based on whatever computer you're trying to operate.

Maybe zooming out for a second, because I'm sure your audience is super AI-native, obviously, but what is computer use as a tool, and what's Operator? The idea for computer use is: how do we let developers also build agents that can complete tasks for users, but using a computer or a browser instead? How do you get that done? That's why we have this custom model, optimized for computer use, that we use for Operator ourselves. The idea behind putting it out as an API is that, imagine you want to automate some tasks for your product or your own customers: now you can spin up one of these agents that will look at the screen and act on the screen. That means the ability to click, the ability to scroll, the ability to type, and to report back on the action. So that's what we mean by computer use, and we're wrapping it as a tool in the Responses API as well. And that gives a hint at the multi-turn thing we were hinting at earlier: the idea that one of these actions can take a couple of minutes to complete, because there might be 20 steps to complete that task, but now you can.

Do you think computer use can play Pokemon?

Interesting. I guess we should try it.

You know, there's a lot of interest. I think Pokemon really is a good agent benchmark, to be honest. It seems like Claude is running into a lot of trouble. Sounds like we should make that a new eval, it looks like. Yeah, yeah.

Oh, and one more thing before we move on to the Agents SDK; I know you have a hard stop. There are all these "preview" models, right? Search preview, computer use preview. And they all seem to be fine-tunes of 4o. I think the question is: are they all going to be merged into the main branch, or are we basically always going to have subsets of these models?

Yeah, I think in the early days, research teams at OpenAI operate with fine-tuned models, and then once the thing gets more stable, we merge it into the mainline. So that's definitely the vision: going out of preview as we get more comfortable, learn about all the developer use cases, and make sure we're doing a good job at them, we'll make these part of the core models so you don't have to deal with the bifurcation.

You should think of it exactly like what happened last year when we introduced vision capabilities. Vision capabilities were in a vision preview model based off of GPT-4, and then vision capabilities were obviously built into GPT-4. You can think about it the same way for the other modalities, like audio, and for these models optimized for search and computer use.

Agents SDK: we have a few minutes left, so let's just assume that everyone has looked at Swarm. I think Swarm really popularized the handoff technique, which I thought was really interesting for a multi-agent world. What is new with the SDK?

Yeah, for sure. So we've basically added support for types. We've added support for guardrails, which is a very common pattern: in the guardrail example, you basically have two things happening in parallel, and the guardrail can block the execution. It's a type of optimistic generation. And we've added support for tracing, so you can look at the traces the Agents SDK creates in the OpenAI dashboard. We also made it pretty flexible, so you can pick any API from any provider that supports the Chat Completions API format. It supports Responses by default, but you can easily plug it into anyone that uses the Chat Completions API. And similarly, on the tracing side, you can use multiple tracing providers. By default it points to the OpenAI dashboard, but there are so many tracing companies out there, and we'll announce some partnerships on that front too. So we're adding lots of core features and making it more usable, but it's still centered around handoffs as the main concept.

And by the way, it's interesting, because Swarm came to life out of learning directly from customers that orchestrating agents in production was pretty hard. Simple ideas could quickly turn very complex: what are those guardrails, what are those handoffs, and so on. So that came out of learning from customers, and it was initially shipped as a low-key experiment, I'd say. But we were kind of taken by surprise at how much momentum there was around this concept, and so we decided to learn from that and embrace it: okay, maybe we should just embrace this as a core primitive of the OpenAI platform. That's what led to the Agents SDK. And I think now, as Nikunj mentioned, we're adding all of these new capabilities to it, leveraging the handoffs we had, but tracing also. I think what's very compelling for developers is that instead of having one agent to rule them all, where you stuff in a lot of tool calls that can be hard to monitor, you now have the tools you need to separate and spread out the logic. You can have a triage agent that, based on an intent, goes to different kinds of agents. And then on the OpenAI dashboard, we're releasing a lot of new user interface logs as well, so you can see all of the tracing UIs. Essentially, you'll be able to troubleshoot exactly what happened in a workflow when the triage agent did a handoff to a secondary agent, and then a third, and see the tool calls and so on. So we think the Agents SDK, combined with the tracing UIs, will definitely help users and developers build better agentic workflows.

And just before we wrap, are you thinking of connecting this with the RFT API as well? Because I know you already store my text completions and I can do fine-tuning on those. Is that going to be similar for agents, where you're storing my traces and then helping me improve the agents?

Yeah, absolutely. You've got to tie the traces to the eval product so that you can generate good evals. Once you have good evals and tasks, you can use those to do reinforcement fine-tuning, and, you know, there are lots of details to be figured out over here. But that's the vision, and I think we're going to go after it pretty hard, and hopefully we can make this whole workflow a lot easier for developers.

Awesome. Thank you so much for the time. I'm sure you'll be busy on Twitter tomorrow with all the developer feedback.

Yeah, thank you so much for having us, and, as always, we can't wait to see what developers will build with these tools, and how we can learn as quickly as we can from them to make them even better over time.

Awesome. Thank you guys. Thank you. Thank you both. Awesome.
