The new OpenAI Agents Platform: CUA, Web Search, Responses API, Agents SDK!!
By Latent Space
Summary
## Key takeaways
- **Responses API Unifies APIs**: The Responses API merges the best of the Assistants API and Chat Completions, supporting tools conveniently with one API request instead of six objects, while Chat Completions stays supported long-term. [02:31], [03:56]
- **GPT-4o Search Jumps to 90% Accuracy**: GPT-4o search preview achieves 90% accuracy on SimpleQA versus 38% for base GPT-4o, using synthetic data and o1-style distillation to stay factual and cite accurately. [08:12], [09:36]
- **Improved File Search Adds Metadata Filtering**: File search now includes metadata filtering, critical for vector stores over 5-10k records, plus query optimization and custom reranking, as a managed RAG service versus DIY. [16:14], [17:47]
- **Computer Use Enables Screen Automation**: The computer use tool, the one behind Operator, takes screenshots and outputs actions like click, scroll, and type for multi-turn browser tasks, using a separate fine-tuned model described as the "GPT-1 of computer use". [18:46], [19:45]
- **Agents SDK Upgrades Swarm with Tracing**: The Agents SDK evolves Swarm with types, guardrails, handoffs, and OpenAI dashboard tracing for multi-agent orchestration, supporting any Chat Completions-compatible API. [22:00], [24:01]
Topics Covered
- Responses API Unifies Agentic Workflows
- GPT-4o Search Jumps 38% to 90% Accuracy
- Combine File Search with Web for Personalization
- Computer Use Enables Browser Task Automation
- Agents SDK Evolves Handoffs with Tracing
Full Transcript
Hey everyone, welcome back to another Latent Space Lightning episode. This is Alessio, partner and CTO at Decibel, and I'm joined by swyx, founder of Smol AI.

And today we have a super special episode because we're talking with our old friend Romain. Hi, welcome.

Thank you. Thank you for having me.

And Nikunj, who is most famous because, if anyone has ever tried to get access to anything on the API, Nikunj is the guy. I know your emails because I look forward to them.

Yeah, nice to meet all of you.
basically convening today to talk about the new API. So perhaps you guys want to just kick off uh what is OpenAI launching today? Yeah, so uh I can kick
launching today? Yeah, so uh I can kick it off. We're launching a bunch of new
it off. We're launching a bunch of new things today. Uh we're going to do three
things today. Uh we're going to do three new built-in tools. So we're launching the web search tool. This is basically chat GBD for search but available in the API. We're launching uh an improved file
API. We're launching uh an improved file search tool. So this is you bringing
search tool. So this is you bringing your data to OpenAI. You upload it. We
you know take care of parsing it, chunking it, embedding it, making it searchable, give you this like ready vector store that you can use. So that's
the file search tool. And then we're also launching our computer use tool. So
this is the tool behind the operator product in chat GPD. So that's coming to developers today. And to support all of
developers today. And to support all of these tools, we're going to have a new API. So you know, we launched chat
API. So you know, we launched chat completions like I think March 2023 or so. It's been a while. So, so we're
so. It's been a while. So, so we're looking for an update over here to support all the new things that the models can do. And so, we're launching this new API called the responses API.
It is uh you know it works with tools.
We think it'll be like a great option for all the future agentic products that we build. And so, that is also launching
we build. And so, that is also launching today. Actually, the last thing we're
today. Actually, the last thing we're launching is the agents SDK. We launched
this thing called Swarm last year where you know it was an experimental SDK for people to do multi-gen orchestration and stuff like that. It was supposed to be like educational experimental but like
people people really loved it. They like
ate it up and so we were like all right let's let's upgrade this thing. Let's
give it a new name. And so we're calling it the agents SDK. It's going to have built-in tracing in the OpenAI dashboard. So lots of cool stuff going
dashboard. So lots of cool stuff going out. Uh so yeah, excited about it.
out. Uh so yeah, excited about it.
That's a lot. But we said 2025 was the year of agents. So there you have it like a lot of new tools to build these agents for developers. Okay. I guess I guess we'll just kind of go one by one and we'll leave the agents SDK towards
So, the Responses API. I think the primary concern that people have, and something I voiced to you guys when I was talking with you in the planning process, is: is Chat Completions going away? So I just wanted to let you guys respond to the concerns that people might have.

Chat Completions is definitely here to stay. It's a bare-metal API we've had for quite some time, and there are lots of tools built around it, so we want to make sure it's maintained and people can confidently keep building on it. At the same time, it was optimized for a different world, right? It was optimized for a pre-multimodality world. We also optimized for single-turn: text prompt in, text response out. And now with these agentic workflows, we notice that developers and companies want to build longer-horizon tasks, things that require multiple turns to get the task accomplished, and computer use is one of those, for instance. That's why the Responses API came to life, to support these new agentic workflows. But Chat Completions is definitely here to stay. And for the Assistants API, we have a target sunset date of the first half of 2026.

So in my mind there's a very poetic mirroring of the API with the models. I view this as the merging of the Assistants API and Chat Completions into one unified Responses API, kind of like how GPT and the o-series models are also unifying.

Yeah, that's exactly the right framing. I think we took the best of what we learned from the Assistants API, especially being able to access tools very conveniently, but at the same time we simplified the integration: you no longer have to think about six different objects to get access to these tools. With the Responses API, you just make one API request and you can sweep in those tools, right?
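Roughly what that one-request shape looks like with the OpenAI Python SDK — a minimal sketch, not an official snippet; the model name and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# One request: model, input, and any built-in tools, all inline.
response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # built-in tool enabled directly
    input="What did OpenAI launch for developers today?",
)

print(response.output_text)  # convenience accessor for the final text output
```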
Yeah, absolutely. And I think we're going to make it really easy and straightforward for Assistants API users to migrate over to the Responses API without any loss of functionality or data. Our plan is absolutely to add assistant-like objects and thread-like objects that work really well with the Responses API. We'll also add the code interpreter tool, which is not launching today but will come soon, and we'll add async mode to the Responses API, because that's another difference with Assistants. We'll have webhooks and stuff like that. But I think it's going to be a pretty smooth transition once we have all of that in place, and we'll give folks a full year to migrate and help them through any issues they face. So overall I feel like Assistants users are really going to benefit from this longer term, with this more flexible primitive.

How should people think about when to use each type of API? I know that in the past, Assistants was maybe more stateful — long-running, many-tool-use, file-based things — and Chat Completions is more stateless, kind of like a traditional completions API. Is that still the mental model people should have, or should you by default always try to use the Responses API?

So the Responses API is, at launch, going to support everything that Chat Completions supports, and then over time it's going to support everything that Assistants supports. So it's going to be a pretty good fit for anyone starting out with OpenAI; they should be able to just go to Responses. Responses, by the way, also has a stateless mode: you can pass in store: false and that'll make the whole API stateless, just like Chat Completions. We're really trying to get this unification story in so people don't have to juggle multiple endpoints. That being said, Chat Completions is just the most widely adopted API, it's so popular, so we're still going to support it for years with new models and features. But if you're a new user, or if you're an existing user and you want to tap into some of these built-in tools, you should feel totally fine migrating to Responses, and you'll have more capabilities and performance than Chat Completions.
I think the messaging that resonated the most with me when I talked to you was that it's a strict superset, right? You should be able to do everything that you could do in Chat Completions and with Assistants. The thing I just assumed is that because you're now stateful by default, you're actually storing the chat logs, or the chat state, and I thought you'd be charging me for it. So to me it was very surprising that you figured out how to make it free.

Yeah, it's free. We store your state for 30 days. You can turn it off. But yeah, it's free.
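A small sketch of that behavior, assuming the Python SDK's Responses support — store-by-default with a 30-day retention as described above, and `store=False` for the stateless, Chat Completions-style path:

```python
from openai import OpenAI

client = OpenAI()

# Stateless call: nothing retained server-side, just like Chat Completions.
stateless = client.responses.create(
    model="gpt-4o",
    input="Summarize the Responses API in one sentence.",
    store=False,
)

# Stored call (the default): the response can be fetched again later by ID,
# and the same request shows up in the dashboard logs for debugging.
stored = client.responses.create(
    model="gpt-4o",
    input="Summarize the Responses API in one sentence.",
)
fetched = client.responses.retrieve(stored.id)
print(fetched.output_text)
```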
The interesting thing on state is that it just makes things, particularly for me, it makes debugging and building so much simpler. I can create a responses object that's pretty complicated, part of this more complex application that I've built, and I can just go into my dashboard and see exactly what happened. Did I mess up my prompt? Did it not call one of these tools? Did I misconfigure one of the tools? The visual observability of everything that you're doing is so, so helpful. So I'm excited about people trying that out and getting benefits from it, too.

Yeah, it's really a nice-to-have. But all I'll say is that my friend Corey Quinn says that anything that can be used as a database will be used as a database. So, be prepared for some abuse.

All right, yeah, that's a good one. Some of that happens with metadata; some people are very, very creative at stuffing data into objects.

Yeah, we do have metadata with responses. Exactly.
Let's get through all of these. So, web search. When I first heard web search, I thought you were just going to expose an API that returns a nice list of things, but the way it's named is GPT-4o search preview. So I'm guessing you're using basically the same model that's in ChatGPT search, which is fine-tuned for search — a different model than the base one — and the jump in performance is impressive. Just to give an example: on SimpleQA, GPT-4o is 38% accuracy and 4o search is 90%. We always talk about how the model is not everything you need; the tools around it are just as important. So maybe give people a quick preview of the work that went into making this special.

Should I take that? Yeah.

So firstly, we're launching web search in two ways. One is in the Responses API, which is our API for tools; it's available as a web search tool itself, so you'll be able to go to tools, turn on web search, and you're ready to go. But we still wanted to give Chat Completions users access to real-time information, so in the Chat Completions API, which does not support built-in tools, we're launching direct access to the fine-tuned model that ChatGPT search uses, and we call it GPT-4o search preview.

And how is this model built? Basically, our search research team has been working on this for a while. Their main goal is to gather information from all of the data sources that we use for search, pick the right things, and then cite them as accurately as possible. That's what the search team has really focused on. They've done some pretty cool stuff: they use synthetic data techniques, and they've done o1 model distillation to make these 4o fine-tunes really good. But the main thing is: can it remain factual, can it answer questions based on what it retrieves, and can it cite accurately? That's what this fine-tuned model really excels at. So yeah, I'm super excited that it's going to be directly available in Chat Completions along with being available as a tool.

Yeah, just to clarify: if I'm using the Responses API, this is a tool, but if I'm using Chat Completions, I have to switch the model. I cannot use o1 and call search as a tool.

That's right. Exactly.
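The two paths side by side, as a rough sketch with the Python SDK (prompts are placeholders; the search-preview model is the one named in the discussion):

```python
from openai import OpenAI

client = OpenAI()

# 1) Responses API: web search as a built-in tool on a regular model.
resp = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],
    input="What did OpenAI announce for developers today?",
)
print(resp.output_text)  # cited answer; URL citations come back as annotations

# 2) Chat Completions: no built-in tools, so you switch to the fine-tuned
#    search model directly.
chat = client.chat.completions.create(
    model="gpt-4o-search-preview",
    messages=[{"role": "user",
               "content": "What did OpenAI announce for developers today?"}],
)
print(chat.choices[0].message.content)
```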
I think what's really compelling, at least for me and my own uses of it so far, is that when you use web search as a tool, it combines nicely with every other tool and every other feature of the platform. Think about this for a second: imagine you have a Responses API call with the web search tool, but then you turn on function calling and also turn on, let's say, structured outputs. Now you have the ability to structure any data from the web, in real time, in the JSON schema that you need for your application. So it's quite powerful when you start combining those features and tools together. It's kind of like an API for the internet, almost; you get access to the precise schema you need for your app.
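A sketch of that combination — web search plus structured outputs in one Responses API call. The schema and prompt are made up for illustration, and the exact structured-output parameter shape can vary by SDK version, so treat this as approximate:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema: pull live product info from the web into a fixed shape.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price_usd": {"type": "number"},
        "source_url": {"type": "string"},
    },
    "required": ["name", "price_usd", "source_url"],
    "additionalProperties": False,
}

resp = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],
    input="Find a current price for the Framework Laptop 16 and cite the page.",
    # Structured output: ask for JSON matching the schema above (shape may
    # differ slightly across SDK versions).
    text={"format": {"type": "json_schema", "name": "product",
                     "schema": schema, "strict": True}},
)
print(resp.output_text)  # JSON string matching the schema
```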
Yeah. And then just to wrap up on the infrastructure side of it, I read in the post that publishers can choose to appear in web search. So are people in it by default? Like, how can we get Latent Space into the web search API?

Yeah, we have some documentation around how website publishers can control what shows up in our web search tool, and you should be able to read that. I think we should be able to get Latent Space in for sure.

Yeah. You know, I compare this to a broader trend that I started covering last year of online LLMs. Perplexity, I think, was actually the first to offer an API that's connected to search, and then Gemini had the search grounding API. And I missed this in my original reading of the docs, but you even give citations with the exact sub-paragraph that matches, which I think is the standard nowadays. I think my question is: how do we think about what a knowledge cutoff is for something like this? Because now there's basically no knowledge cutoff — it's always live. But then there's a difference between what the model has internalized in its backpropagation and what it's searching up, which is RAG.

I think it kind of depends on the use case, and on what you want to showcase as the source. For instance, take a company like Hebbia that has used this web search tool: for credit firms or law firms, they can combine public information from the internet with the live sources and citations that you sometimes do want access to, as opposed to the internal knowledge. But if you're building something different, where you just want an assistant that relies on the deep knowledge the model has, you may not need these direct citations. So it depends on the use case a little bit, but there are many companies like Hebbia that will need that access to citations to know precisely where the information comes from.

Yeah, for sure. And then one thing on the breadth: a lot of the open deep research implementations have this hyperparameter for how deep they're searching and how wide they're searching. I don't see that in the docs, but is that something we can tune? Is that something you recommend thinking about?

Super interesting. It's definitely not a parameter today, but we should explore that. I imagine the way you would do it with the web search tool and the Responses API is you would have some form of agent orchestration, where you have a planning step and then each web search call you make explicitly goes a layer deeper and deeper. It's not a parameter that's available out of the box, but it's a cool thing to think about.
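One way that orchestration could look — a hypothetical sketch, since (as noted above) depth is not an API parameter; the loop itself plays that role, and the prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def deep_search(question: str, depth: int = 3) -> str:
    """Hypothetical iterative-deepening loop: each pass searches again,
    building on the notes accumulated so far."""
    notes = ""
    for layer in range(depth):
        resp = client.responses.create(
            model="gpt-4o",
            tools=[{"type": "web_search_preview"}],
            input=(
                f"Research question: {question}\n"
                f"Notes so far:\n{notes or '(none)'}\n"
                f"This is pass {layer + 1} of {depth}. Search for whatever is "
                "still missing and add only new findings, with citations."
            ),
        )
        notes += "\n" + resp.output_text
    return notes

print(deep_search("How do the new OpenAI built-in tools compare on pricing?"))
```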
Yeah, the only guidance I'll offer there is that a lot of these implementations offer top-k — top 10, top 20 — but actually you don't really want that. You want some kind of similarity cutoff, some matching-score cutoff, because if there are only five documents that match, fine; if there are 500 that match, maybe that's what I want, right?

Yeah. But also that might make my cost very unpredictable, because the costs are something like $30 per thousand queries, right?

Yeah. I guess you could have some form of context budget, and then you say: go as deep as you can, pick the best stuff, and put it into X number of tokens. There could be some creative ways of managing cost. Yeah, that's a super interesting thing to explore.

Do you see people using file search and the web search API together, where you search and then store everything in files so that next time I'm not paying for the search again? How should people balance that?

That's actually a very interesting question. Let me first tell you about a really cool way I've seen people use file search and web search together: they put their user preferences or memories in the vector store. A query comes in, you use the file search tool to get someone's reading preferences or fashion preferences and stuff like that, and then you search the web for information or products they can buy related to those preferences, and you render something beautiful to show them: here are five things you might be interested in. That's how I've seen file search and web search work together. And by the way, that's a single Responses API call, which is really cool. You just configure these things, it goes boom, and everything just happens. But yeah, that's how I've seen files and web work together.
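That single-call combination might look roughly like this; a sketch where `prefs_store_id` is a placeholder for a vector store you have already created and filled with the user's preferences:

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",
    tools=[
        # Pull the user's stored preferences/memories from the vector store...
        {"type": "file_search", "vector_store_ids": ["prefs_store_id"]},
        # ...and combine them with live results from the web.
        {"type": "web_search_preview"},
    ],
    input="Recommend five new sci-fi releases this user would enjoy, with links.",
)
print(resp.output_text)
```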
But I think what you're pointing out is interesting, and I'm sure developers will surprise us, as they always do, in terms of how they combine these tools and how they might use file search as a way to have memory and preferences, like Nikunj says. But zooming out, what I find very compelling and powerful here is: you have these neural nets with all of the knowledge they have today, plus real-time access to the internet for any kind of real-time information you might need for your app, plus file search, where you can have a lot of company-private documents and private details. You combine those three and you have very compelling and precise answers for any kind of use case that your company or your product might want to enable.

There's a difference between internal documents versus the open web, right? You're going to need both.

Exactly. Exactly.

I never thought about doing memory that way. I guess, again, anything that can be used as a database will be used as a database. That sounds awesome. But I think you've also been expanding file search: you have more file types, you have query optimization, custom reranking. So it really seems like it's been fleshed out. Obviously I haven't been paying a ton of attention to the file search capability, but it sounds like your team has added a lot of features.

Yeah, metadata filtering was the main thing people were asking us for for a while, and that's the one I'm super excited about. It's just so critical once your vector store size goes over, you know, more than 5-10,000 records; you kind of need that. So yeah, metadata filtering is coming too.
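A sketch of what metadata filtering on file search can look like. The filter syntax shown here is approximate (check the file search docs for the exact shape), and `policies_store_id` plus the `department` attribute are placeholders:

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["policies_store_id"],
        # Restrict retrieval to records tagged with a matching attribute,
        # instead of searching the whole store.
        "filters": {"type": "eq", "key": "department", "value": "travel"},
    }],
    input="What is the per-night hotel limit for US travel?",
)
print(resp.output_text)
```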
And for most companies, it's also not a competency that you necessarily want to rebuild in-house. Thinking about embeddings and chunking and all of that sounds very complex for something that's very obvious to ship for your users. Companies like Navan, for instance, were able to build with file search: you take all of the FAQs and travel policies that you have, put them in the file search tool, and then you don't have to think about anything. Now your assistant becomes naturally much more aware of all of these policies from the files.

The question is, there's a very, very vibrant RAG industry already, as you well know. There are many other vector databases and many other frameworks, probably, if it's an open source stack. A lot of the AI engineers I talk to want to own this part of the stack, so it feels like: when should we DIY and when should we just use whatever OpenAI offers?

Yeah. I mean, if you're doing something completely from scratch, you're going to have more control, right? So we're super supportive of people rolling up their sleeves and building their super-custom chunking strategy and super-custom retrieval strategy and all of that, and those are things that will be harder to do with OpenAI's tools. OpenAI has an out-of-the-box solution; we give you some knobs to customize things, but it's more of a managed RAG service. So my recommendation would be: start with the OpenAI thing, see if it meets your needs, and over time we're going to be adding more and more knobs to make it even more customizable. But if you want the completely custom thing, where you want control over every single piece, then you'd probably want to hand-roll it using other solutions. So we're supportive of both; engineers should pick.
Yeah. And then we've got computer use. Operator was obviously one of the hot releases of the year, and we're only two months in. Let's talk about that. And it's also, it seems, a separate model that has been fine-tuned for Operator and has browser access.

Yeah, absolutely. I mean, the computer use models are exciting. The cool thing about computer use is that we're just so, so early. It's like the GPT-2 of computer use, or maybe the GPT-1 of computer use, right now. But it is a separate model that the computer use team has been working on. You send it screenshots and it tells you what action to take, so the outputs are almost always tool calls, and you're inputting screenshots based on whatever computer you're trying to operate.

Maybe zooming out for a second, because I'm sure your audience is super AI-native, obviously, but what is computer use as a tool, and what's Operator? The idea for computer use is: how do we let developers also build agents that can complete tasks for their users, but using a computer or a browser instead? And how do you get that done? That's why we have this custom model, optimized for computer use, that we use for Operator ourselves. But the idea behind putting it out as an API is: imagine you want to automate some tasks for your product or your own customers. Now you can spin up one of these agents that will look at the screen and act on the screen. That means the ability to click, the ability to scroll, the ability to type, and to report back on the action. So that's what we mean by computer use, and by wrapping it as a tool in the Responses API. And that also gives a hint at the multi-turn thing we were hinting at earlier: the idea that one of these actions can maybe take a couple of minutes to complete, because there are maybe 20 steps to complete that task, but now you can do it.
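A rough sketch of that screenshot-in, action-out loop. The payload shapes are approximate, and the `take_screenshot_b64` / `execute` helpers are placeholders you would implement with something like Playwright:

```python
from openai import OpenAI

client = OpenAI()

def take_screenshot_b64() -> str: ...   # placeholder: capture the browser as base64 PNG
def execute(action) -> None: ...        # placeholder: click/scroll/type via your driver

tool = {"type": "computer_use_preview", "display_width": 1280,
        "display_height": 800, "environment": "browser"}

response = client.responses.create(
    model="computer-use-preview",
    tools=[tool],
    input="Find the cheapest direct flight from SFO to JFK next Friday.",
    truncation="auto",
)

while True:
    calls = [item for item in response.output if item.type == "computer_call"]
    if not calls:
        break  # no more actions: the model is done or is asking a question
    call = calls[0]
    execute(call.action)  # action is e.g. a click / scroll / type with coordinates
    # Send back a fresh screenshot so the model can decide the next step.
    response = client.responses.create(
        model="computer-use-preview",
        tools=[tool],
        previous_response_id=response.id,
        truncation="auto",
        input=[{
            "type": "computer_call_output",
            "call_id": call.call_id,
            "output": {"type": "input_image",
                       "image_url": f"data:image/png;base64,{take_screenshot_b64()}"},
        }],
    )
```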
Do you think computer use can play Pokémon?

Interesting. I guess we should try it.

You know, there's a lot of interest. I think Pokémon really is a good agent benchmark, to be honest. It seems like Claude is running into a lot of trouble.

Sounds like we should make that a new eval, it looks like. Yeah.

Oh, and then one more thing before we move on to the Agents SDK — I know you have a hard stop. There are all these "preview" names, right? Search preview, computer use preview. And they all seem to be fine-tunes of 4o. I think the question is: are they all going to be merged into the main branch, or are we basically always going to have subsets of these models?

Yeah, I think in the early days, research teams at OpenAI operate with fine-tuned models, and then once the thing gets more stable, we merge it into the mainline. So that's definitely the vision: going out of preview as we get more comfortable with them and learn about all the developer use cases, and once we're doing a good job at them, we'll make them part of the core models so that you don't have to deal with the bifurcation.

You should think of it exactly the way it happened last year when we introduced vision capabilities. Vision capabilities were in a vision preview model based off of GPT-4, and then vision capabilities are now obviously built into GPT-4. You can think about it the same way for the other modalities, like audio, and for those models optimized for search and computer use.
Agents SDK — we have a few minutes left, so let's just assume that everyone has looked at Swarm. I think Swarm really popularized the handoff technique, which I thought was really interesting for a multi-agent world. What's new with the SDK?

Yeah, for sure. So we've basically added support for types. We've added support for guardrails, which is a very common pattern: in the guardrail example, you basically have two things happen in parallel, and the guardrail can block the execution; it's a type of optimistic generation. And we've added support for tracing, so you can look at the traces that the Agents SDK creates in the OpenAI dashboard. We also made this pretty flexible, so you can pick any API from any provider that supports the Chat Completions API format. It supports Responses by default, but you can easily plug it into anyone that uses the Chat Completions API. And similarly on the tracing side, you can support multiple tracing providers; by default it points to the OpenAI dashboard, but there are so many tracing companies out there, and we'll announce some partnerships on that front too. So just adding lots of core features and making it more usable, but still centered around handoffs as the main concept.
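The basic shape of the SDK, as a minimal sketch with the `openai-agents` Python package (names and instructions are illustrative); by default, runs show up as traces in the OpenAI dashboard, and other providers or tracing backends can be swapped in as described above:

```python
from agents import Agent, Runner

agent = Agent(
    name="Support assistant",
    instructions="Answer briefly and accurately.",
)

result = Runner.run_sync(agent, "What is the Responses API?")
print(result.final_output)
```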
And by the way, it's interesting, right? Because Swarm came to life out of learning directly from customers that orchestrating agents in production was pretty hard. Simple ideas could quickly turn very complex: what are those guardrails, what are those handoffs, etc. So that came out of learning from customers, and it was initially shipped as a low-key experiment, I'd say. But we were kind of taken by surprise at how much momentum there was around the concept, and so we decided to learn from that and embrace it: okay, maybe we should just embrace this as a core primitive of the OpenAI platform. That's what led to the Agents SDK. And I think now, as Nikunj mentioned, we're adding all of these new capabilities to it, leveraging the handoffs that we had, but tracing also. And I think what's very compelling for developers is that instead of having one agent to rule them all, where you stuff in a lot of tool calls that can be hard to monitor, now you have the tools you need to separate and spread the logic. You can have a triage agent that, based on an intent, goes to different kinds of agents. And then on the OpenAI dashboard, we're releasing a lot of new user interfaces and logs as well, so you can see all of the tracing UIs. Essentially, you'll be able to troubleshoot exactly what happened in a workflow — when the triage agent did a handoff to a secondary agent and then a third — and see the tool calls, etc. So we think that the Agents SDK, combined with the tracing UIs, will definitely help users and developers build better agentic workflows.
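A sketch of that triage-and-handoff pattern with the Agents SDK; the agent names, instructions, and the example request are illustrative:

```python
from agents import Agent, Runner

billing_agent = Agent(name="Billing agent",
                      instructions="Handle billing and invoicing questions.")
support_agent = Agent(name="Support agent",
                      instructions="Handle technical support questions.")

# The triage agent routes each request to a specialist via handoffs.
triage_agent = Agent(
    name="Triage agent",
    instructions="Route the request to the right specialist based on intent.",
    handoffs=[billing_agent, support_agent],
)

result = Runner.run_sync(triage_agent, "I was charged twice this month.")
print(result.final_output)
# The full run -- triage decision, handoff, and any tool calls -- appears as a
# trace in the OpenAI dashboard (or another tracing provider you configure).
```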
And just before we wrap, are you thinking of connecting this with the RFT API as well? Because I know you already store my text completions and I can do fine-tuning on that. Is it going to be similar for agents, where you're storing my traces and then helping me improve the agents?

Yeah, absolutely. You've got to tie the traces to the evals product so that you can generate good evals. Once you have good evals and tasks, you can use that to do reinforcement fine-tuning. There are lots of details to be figured out over here, but that's the vision, and I think we're going to go after it pretty hard, and we hope we can make this whole workflow a lot easier for developers.

Awesome. Thank you so much for the time. I'm sure you'll be busy on Twitter tomorrow with all the developer feedback.

Yeah, thank you so much for having us. As always, we can't wait to see what developers will build with these tools, and how we can learn as quickly as we can from them to make them even better over time.

Awesome. Thank you guys.

Thank you. Thank you both. Awesome.