Mark Chen: GPT-5, Open-Source, Agents, Future of OpenAI, and more!
By Matthew Berman
Summary
## Key takeaways

- **GPT-5 Marries Pre-Training and Reasoning**: GPT-5 is one of the first models that marries the pre-training paradigm and the reasoning paradigm together, leaning heavily on the o-series models to provide reasoning when needed and fast responses otherwise. [03:40], [04:00]
- **Synthetic Data Powers GPT-5**: OpenAI uses synthetic data generated by models to improve coverage in key areas like code for GPT-5, believing it can meaningfully improve on predecessors despite industry skepticism. [04:14], [05:28]
- **Mark Chen's Vibe Check Tests**: Mark tests models with specific math problems (such as building a uniform random generator modulo 42 from prime-modulus generators), UI generation, physical simulations, and creative writing to check their intuitive grasp of style, persuasiveness, and reliability. [10:37], [13:06]
- **Coding Leap in GPT-5**: GPT-5 shows a 70%+ win rate over previous models such as GPT-4o and o3 in coding, generating robust code with far fewer hallucinations, single-turn outputs past 1,000 lines, better handling of long repositories, and stronger agentic tool calling. [17:37], [19:11]
- **Unchanged Research Roadmap**: OpenAI's research roadmap toward AGI hasn't changed in years despite Chinese open-source releases like DeepSeek, as the lab pursues its conviction-based path independently. [15:16], [16:28]
- **Open-Source Raises Safety Bar**: The new 20B/120B open-source models run on laptops and phones after extensive safety testing via the preparedness framework, ensuring they stay under high danger thresholds such as cyber risk before release. [31:36], [33:22]
Topics Covered
- Research is the product
- GPT-5 fuses pre-training and reasoning
- Synthetic data scales meaningfully
- Roadmap trumps open-source distractions
- AI doing AI research soon
Full Transcript
>> There's a lot of excitement leading into GPT-5. I'm wondering what the energy is like internally at OpenAI leading up to such a big launch.
>> We have a research road map. In the last several years, this road map hasn't changed much at all.
>> Interesting. It's able to cut the thinking time that it needs to extract the information by several factors.
>> You are head of research. So how do you personally balance that tension between the product side of the organization and the research side?
>> One thing that we're really excited about is to raise the bar in terms of what we consider acceptable for open source.
>> Let's talk about some of the lessons that you learned from GPT-4.
>> GPT-5 is one of the first models we have that marries both the pre-training paradigm and the reasoning paradigm together. We do believe that this technology is going to raise the quality of life for most people.
>> Mark, thanks for joining me today. Good
to see you again.
>> Great to see you, Matt.
>> So I want to start with a broad question: there's a lot of excitement leading into GPT-5. I'm wondering what the energy is like internally at OpenAI in the months, and more recently weeks, leading up to such a big launch, such an important model. What is that energy like? What is it like inside of the company?
>> Every single launch comes with high emotion, right? You have that feeling when you're starting out a project where people are excited, and then you have a period in the middle where there's always this internal uncertainty: is this model going to be good? Is it going to hit the expectations? And then near the home stretch you see everything come together. The energy picks back up, and I would say right now the feelings are very strong. People are excited to get this model out, and we're excited to show it to the world.
>> Yeah, I've tested it and it is absolutely incredible. I'm really just starting to figure out where it really excels, what its personality and tone are. But I visited you guys a few weeks ago, and Greg Brockman had mentioned that OpenAI still sees itself as a research lab, and you are head of research. So how do you personally balance that tension between the product side of the organization and the research side of the organization?
>> Yeah. I think it's important to take stock that we're here to do research, and the research really is the product. Every time we make a big breakthrough, that is something that will lead to a lot of value and utility for people, and of course the product itself enables us to do more research. So I think these two things go hand in hand. It's a delicate balance; you really don't have one without the other, and that's the view that I hold in my head. We want our research to have contact with the world. We want people to be able to experience all this intelligence that we're building, and we're very lucky that it's resulted in a very successful product too.
>> Let's talk about some of the lessons that you learned from GPT-4 and how you applied them to GPT-5. From an outsider's perspective, there doesn't seem to be a large amount of new publicly available data that can be applied to a brand new foundation model. Is that a correct assumption? And if so, how did you solve that data scarcity problem?
>> Right. So I would say it's somewhat accurate but not fully accurate. You can always improve and expand the envelope of the data that you put into the model, so we're always looking for more sources of publicly available data and more sources we can license data from. But I do think there are a couple of axes by which GPT-4 and GPT-5 differ. When you look at GPT-4, it's really the culmination of scaling the pre-training paradigm. GPT-5 is one of the first models we have that marries both the pre-training paradigm and the reasoning paradigm together, so it heavily leans on our o-series of models as well. And one of our philosophies here is that people should get reasoning when they need it and a very fast response when that's the more appropriate mode, without having to pick: do I need reasoning for this, or do I not?
And back to the question of data: one thing that we've also been exploring in GPT-5 is the use of synthetic data. This is data that isn't written by humans but is generated by models. We've had a very healthy synthetic data program led by one of our researchers, Sébastien Bubeck, and it's bearing fruit; it helps us improve coverage in areas where we really want to shine.
>> Yeah. Okay, so I was going to ask you all about synthetic data; let's skip to that. One of my questions was, are you using synthetic data? Obviously, the answer is yes. There are some folks in the industry who say synthetic data, and models trained on synthetic data from prior generations of models, can only really be marginally better than their predecessor. What do you think about that? What's your take? How do you think about synthetic data in future generations?
>> Right. We really do believe in the potential for synthetic data to be higher quality and to improve the model in meaningful ways, beyond just broadening or deepening surface-level knowledge in a particular category. Of course, this is a research program we're still pursuing; I think it still has a lot of room to go. But we've seen enough signs of life that we've decided to use some of it to power GPT-5.
>> And are you able to speak to the mix of synthetic versus human data for GPT-5, and maybe how that compares to the mix for GPT-4?
>> I do think the exact mix is something that we want to keep to ourselves, but over time it's becoming more and more.
>> Are there any categories of knowledge where synthetic data really excels, like math, science, coding? Those seem like maybe the obvious ones, but you tell me: where does synthetic data really work, and maybe where does it fall short?
>> Right. I think fundamentally I believe in the promise of synthetic data in a very broad-based setting, and it's really up to us to choose which domains we want to unleash our set of tools on. We care a lot about code in the GPT-5 release, and certainly that's one of the areas that we've emphasized, but it's by no means the only way that we view synthetic data.
>> Okay. And just to continue on that, are there areas where synthetic data has fallen short for you, or that you maybe wouldn't apply synthetic data to?
>> I don't think so. There are certainly areas that are more amenable, but I don't think we consider the techniques in synthetic data to be not general in a deep sense.
>> So let's rewind months, maybe even years, to when you started thinking about the architecture for GPT-5. What are some of the early bets that you made that maybe at the time you were a little bit nervous about, that hadn't been proven yet, and ended up really working out?
>> With a big model like this, it's going to be the coming together of a lot of advances in architecture, in optimization, in reasoning, and even just in the core infrastructure that we're building. We have exploratory teams in all of these domains: exploratory architecture teams, exploratory optimization teams. In the early phases of a project like this, they're coming up with their scattershot set of ideas, and over time we refine those to the bets that are really working. It's nice to see that winnowing and refinement of some of these ideas, and you integrate them all together and it creates this thing that's really a combination of a lot of these innovations on all of these axes.
>> Were there any early bets that really surprised you, though? Like you didn't know it was going to work or you had some doubts, and then you just saw incredible results from it?
>> Yeah. One thing that I want to stress is the coming together of reasoning and pre-training here, and that might sound pretty obvious on the surface, right? Like, why can't you just get the best of both worlds very easily? There's actually a lot of work done by our post-training team led by Max Schwarzer, and it took a lot of work to make these reasoning models a lot faster, a lot more robust, a lot more reliable, so that we could combine both of these paradigms, which we'd been working on somewhat independently, into one surface where people can access the best of both worlds.
>> When you're thinking about where to invest your compute in pre-training versus RL, how do you think about that mixture as a decision of compute investment, and really monetary investment, because of compute?
>> Right. So today we invest heavily in both. On the RL side it's a new paradigm; there's a lot of very promising work, a lot of GPT-2-era-style work of just exploring all of the different things that influence RL, and it feels very early, so there's a lot to explore. But at the same time, on the pre-training side there's also a lot of energy right now: there's the work in synthetic data like what I just talked about, and I think still a lot of healthy work in optimization and architecture.
>> As you're training the model, as you're doing post-training on the model, how do you decide when a snapshot is like, hey, this is the one we're going with? Because obviously you're going to continue to improve it; that's what you guys do. But how do you decide this is the one we're putting out, it's ready, it's baked, and we'll wait for subsequent iterations for the next versions?
>> Yeah. Well, I think it's a little bit of an art, right? You want to strike this balance: you want to pursue perfection, something that doesn't have any nitpicks. You play around with the models a lot; it has to pass the vibe check, and it shouldn't have any small pathologies or any behavior that you're worried about. And you have to toe the line between waiting too long to train what you consider the perfect model on all axes, and something that just feels good and ready to deploy. So I think we've struck a good balance here, and the post-training team does a great job of vibe checking it all.
>> All right. You mentioned the vibe check. What is Mark Chen's vibe check? How do you make sure that the model is good for you, for your personal life, for your work life? What does your vibe test look like?
>> Yeah, I check it on a couple of different axes. There are a couple of math problems that are my go-to math problems.
>> Can you share?
>> Oh, sure. There's a fun one that I think I shared on Twitter a while back. It has a very simple description: you want to create a uniform random number generator modulo 42, and you have access to uniform random number generators modulo every single prime less than 42. This is actually still a relatively unknown problem; a lot of models, I think, still can't get it consistently right. But you can see a progression of solutions until you get to the most optimal one, and you can see improvement in models in terms of how creative they are in generating steps towards the best solution.
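For readers who want to try Mark's vibe-check problem themselves, here is a minimal sketch of one clean solution (not necessarily the "most optimal" progression he alludes to): since 42 = 2 × 3 × 7 and 2, 3, and 7 are all primes less than 42, independent uniform draws modulo those primes can be combined with the Chinese Remainder Theorem into a single uniform draw modulo 42. The `rand_mod_prime` helper is a stand-in for the prime-modulus generators the problem assumes you are given.

```python
import random

def rand_mod_prime(p: int) -> int:
    """Stand-in for the given oracles: a uniform random integer in [0, p)."""
    return random.randrange(p)

def rand_mod_42() -> int:
    """Uniform random integer in [0, 42) built only from prime-modulus draws.

    42 = 2 * 3 * 7, so the map x -> (x mod 2, x mod 3, x mod 7) is a bijection
    between [0, 42) and residue triples; independent uniform residues therefore
    yield a uniform result modulo 42 via the Chinese Remainder Theorem.
    """
    residues = {p: rand_mod_prime(p) for p in (2, 3, 7)}
    x = 0
    for p, r in residues.items():
        m = 42 // p          # product of the other two moduli
        inv = pow(m, -1, p)  # inverse of m modulo p
        x = (x + r * m * inv) % 42
    return x

if __name__ == "__main__":
    # Sanity check: the empirical distribution should be close to uniform.
    counts = [0] * 42
    for _ in range(42_000):
        counts[rand_mod_42()] += 1
    print(min(counts), max(counts))  # both should hover around 1000
```

The creativity Mark looks for is in how a model gets here: a weaker answer might rejection-sample from products of prime draws, while the CRT construction needs no rejection at all.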
>> Yeah. I understand those cutting-edge math problems, the frontier math problems, the ones that are really hard. Do you have anything that is more like a day-to-day test?
>> So I think I have two others. One is just generating user interfaces, and a lot of physical simulations that are very visual. You can quickly get a test for whether the physics looks right. You see classic examples of this, like balls bouncing around hexagons, but you can also do fluid simulations or other kinds of user interfaces. That gives you a sense for how powerful the model is, how good the underlying physics is, and how good it is at generating robust code and aesthetic code. Another thing that I like to test it with is creative writing. I write a lot day-to-day as well, and probably one of the biggest use cases I use personally is having it comment on and revise, as a thought partner, documents that I write. So it's really: does it have an intuitive grasp of what good style is? Is it persuasive? Is it compelling? Those are the things that I test for.
>> Did you see a noticeable improvement from GPT-4 to GPT-5 for creative writing?
>> Yeah, I do, and I think other users will notice that too.
>> Do you ever test humor, comedy?
>> I always find the models struggle with that, so that's kind of the golden benchmark for me.
>> Yeah. I think there was a time in the GPT-2, GPT-3 era where the models weren't generating that realistic human text, and it would be funny in this way where it's just a little bit off. But now that we're in much more realistic text land, it might take a little bit of time before the models become funny again.
>> Yeah. A lot of dad jokes, I've noticed.
>> Yeah. I think it's getting better; the reasoning models are surprisingly good at humor. I do think that ultimately could be a good test of reasoning, right? Can it figure out deeply what makes something funny?
>> One thing that I use models for frequently is getting life advice, and I don't know if you do this as well, but some complex situation in my life where there are different angles I need to think about it from, and it's kind of a difficult decision. It just helps me understand all of those angles, helps me understand maybe something I hadn't thought of, a new way to approach the problem. Do you ever use it for that?
>> Yeah, I love it as a brainstorming partner. When you need advice on a situation, when you have some kind of complex problem where there are pros and cons to every decision you could take, sometimes you use it as something you can confide in, and it oftentimes provides some interesting perspectives.
>> Over the last few months, even back to the beginning of the year, China has open-sourced a number of incredible models; they've made a lot of innovations on efficiency. Did you take any lessons or techniques from those open-source models and papers as they were coming out and apply them to GPT-5? Or how has it adjusted your thinking and your approach to your own internal models?
>> Right. So one thing that I'm very proud of at OpenAI is that we have a research road map. It's something that we put a lot of effort into; it scopes out what path we want to take towards AGI, and it prescribes a couple of things you want to do in the short and medium term but also in the long term. In the last several years, this road map hasn't changed much at all, and it still hasn't changed, even in light of model releases from DeepSeek and others. I think that's really one of the strengths at OpenAI: we have conviction in the path that we want to take, we know what kind of ideas it implies, and we just pursue those ideas. We're not a very reactionary company in our research road map. At the same time, Chinese labs have been doing really great work too. I think DeepSeek in particular does phenomenal architecture research; they write very efficient kernels, and there are lessons you can learn from that. But by and large the research road map hasn't changed, and we are still executing on the plan that we had six months ago, a year ago.
>> Yeah, that's actually kind of surprising to hear. I assume the research road map is more broad categories of achievements you're looking to make, like a certain amount of efficiency. Or are they specific techniques that you're looking to test and implement?
>> It really did inform the development of reasoning models to begin with. And this is really one of OpenAI's proudest achievements over the last year: just the ability to create models that can think deeply, that reason for longer and longer periods of time before coming up with very intelligent answers. We think that's a very core part of developing general intelligence, and that's something that we've been pushing on and developing independently of what's going on outside.
>> Okay. So let's talk about GPT-5 a little bit more. As it compares to GPT-4, were there any emergent capabilities with the new architecture, the new training flow, that surprised you as you were starting to get towards the end of GPT-5 post-training?
>> Right. Again, I think a lot of these things will take some time to tell. When we launch a new model into the world, it takes a lot of experimentation, and the world discovers a lot of these things for us. One of the things that you can notice right off the bat is the coding is much better; people prefer the coding capabilities of GPT-5, with I think a 70-plus-percent win rate over our previous set of models like 4o and o3. So you really see that step up in terms of coding capabilities, and I do think developers are going to notice the difference here. It's just much more robust code. It's a model that hallucinates a lot less. It's more reliable; you can trust the reasoning for longer and longer periods of time. It's much better at agentic tool calling. So a lot of these capabilities that people focus on for knowledge work are going to improve, and people are going to notice that.
>> During my testing there were two specific things that I noticed with respect to coding. One, the front end: it was just able to create much more beautiful, visually appealing front ends. And then it was also outputting much longer code in a single turn. I think previously, when I was using GPT-4o and even the thinking models, it would start to max out around 500 or 600 lines of code, but GPT-5 is certainly pushing well past 1,000 lines of code in a single turn. So it's really interesting. Was that a decision? Is there a certain training technique that you needed to elicit longer chunks of code from the model?
>> Yeah, I think one important thing about GPT-5 is that it's tailored for developers as well. We care about practical situations, which often involve long code bases and large pieces of code that are generated, and so we've put a lot of focus into things like long context too, the ability to handle real-world long repositories.
>> Your colleague Noam Brown said he believes the future of AI is likely to be an omni-model, one giant model to rule them all, versus many smaller, more specialized models. So just wondering, what's your take on that? Do you believe that? Maybe it's a mixture of both, but I want to hear what you think, Mark.
>> Funnily enough, I haven't talked to Noam directly about this before. I think it makes a lot of sense: having one big brain that's capable of everything, it's going to be able to have sub-modules and learn sub-modules in the right way. But at the same time, one thing that we aspire to build at OpenAI is organizational AI. When you look at the levels-of-AGI framework, we do think in terms of organizations of AI agents working together to achieve and accomplish high-level, difficult objectives. And I think it's always been a fascinating question: do organizations work better, or do single entities work better? I think it's still a very active area of research, and one that we hope to get more signal on in the near future.
>> You touched on agents and I want to pull on that thread a bit. I have been talking about scaffolding being such an incredible opportunity for so many developers. But as you think about a giant or omni model that's going to start to eat some of that scaffolding, I guess my question is: how much headroom for improvement do you think is available for folks who want to build scaffolding on top of GPT-5?
>> I do think scaffolding will always need to be there in some sense, to tailor models to a particular application. But the hope with what we're trying to do in building very general models is that we can chip away at the need to provide very detailed, very complex scaffolding. I think a very intelligent model should be able to soak in all of that context and just be able to tell what you want to do, and it should be able to thrive even with less information. Some of the scaffolding today is built to engineer around deficiencies in the models, and we really hope that part of the work that we're doing on robustness and reliability can remove the need for a lot of scaffolding and allow people to more intuitively tailor models to their applications.
>> And one element of scaffolding is memory, or context management. Do you think we can reach whatever your definition of artificial superintelligence or AGI is, whatever these big milestones in AI are, without having memory internally in the model? Do you think it's possible?
>> Yeah, I think there's just so much rich work that needs to be done with memory. Clearly today there's a context window for models, and that's one of the big limitations. You would really love the model to be able to maintain memory and context over a really long period of time. It should be able to fit code bases. It should be able to fit all of your personal documents and thoughts, and maybe even everything that you see day-to-day, visual signals. All of these things are going to help the model make better decisions for you, to allow you to remove that scaffolding you're talking about, and really have the model act on your behalf autonomously without having to ping you all the time. So we believe memory is a huge limitation and a huge thing to overcome to make the models even more useful in the future.
>> And do you think, let's say you had an infinite or a very, very large context window, is that the ultimate solution for memory? Or do you think there's some other architecture where memory is baked more into the core model itself, and maybe the weights get updated dynamically?
>> This is a little bit above my pay grade, but there are a lot of implementation details in terms of how you could do memory. Even when you say, hey, if we support a huge long context, is that sufficient? A lot of it rides on how you actually implement that long context. You need an architectural primitive that allows you to implement long context that's rich enough to actually integrate all of that past information, and so I think a lot of it comes down to the specific architecture that you develop. It's stuff that we're always improving; you want to make the model that much more efficient at both synthesizing and being able to pull from early memory.
>> Speak for a second about multimodality. What is available today in GPT-5, and what do you have planned in the coming months, if you're able to share, about different modalities for inputs?
>> So we still have the same multimodal feature set as our previous models: they can take images in, they can take audio in. So they're perceptual models, and they're also able to do image generation. I really do think of perception as a core part of intelligence. One thing that we've highlighted in previous demos, in o3 and o4-mini for instance, is the ability for a model to take a very complex image, analyze it, really find the important parts, and extract and foveate around the image to understand what's the most important part for answering the query. GPT-5 is much more efficient at doing the same kind of task. When you give it visual perception tasks, it's able to cut the thinking time that it needs to extract the information by several factors. And that's what we're seeing across the board in terms of GPT-5's reasoning ability: it's just much more efficient. It's faster at getting you the reasoning that you need for your task.
>> I believe I read you are one of the creators of the original Codex model, the coding model.
>> Yeah.
>> Okay, cool. So by the way, when I first saw GitHub Copilot I was absolutely blown away. It was really the first time I'd ever seen something like that. When did you first realize how powerful AI models could be at coding? Was this always something you knew, or was there a certain moment in time when you knew it was just incredible at coding because of the amount of data out there?
>> I think one of the biggest reasons we focused on code when we were producing Codex is that we really saw this as a way of accelerating our own work, fundamentally, because a lot of the work that we do in research is implementing and trying out ideas, and the medium of that is code. So we were all really inspired to work on coding models, because that seemed like a fast way to accelerate our own progress. The problem back then is there wasn't any good way to measure progress on code generation, and part of developing Codex was also developing the evaluation mechanism for figuring out how you compare code models, what the right benchmarks are. I think today it's much more mature, and we've also gone from the problems that we could solve with the original Codex, which were very basic, five lines of code, to the IOI problems that we're solving today, which are very complicated, thousand-line programs that involve a lot of creativity. So I think the models have just come so far, and we still believe in them as this vector for pushing scientific progress forward as well as helping accelerate ourselves.
>> Yeah. For all of the frontier labs, coding seems to be such an emphasis, such a focus. Is that because coding, and as an extension math, are the keys to more capable all-purpose models? Is that the piece of it that allows them to do reasoning at a much higher level?
>> Yeah, I think there are a lot of different ways to teach reasoning. Mathematics is also a very efficient way to learn reasoning. Physics; there are a lot of domains that require a lot of deep reasoning, even stuff like creative writing or telling jokes, like you said. But we do think of coding as specifically aligned, because so much of our day-to-day involves coding, and also so much of the value that people give to the world today through technology is through code as well. So it's definitely an area of strategic importance, and one that I think a lot of us internally benefit from.
>> It's so scalable because it does have verifiable outputs. And so I want to talk about verifiers for a little bit. As part of GPT-5's training, outside of more verifiable domains like STEM, were you able to figure out ways to verify things like creative writing or humor, as we talked about, the less verifiable or more subjective domains?
>> Yeah, I think that's always a big part of our research program. We're trying to make RL more generalizable, and one way to do that is to try to find ways to make verifiers more generalizable as well. Any RL system is going to require verifiers, and we're trying to make our RL systems as general as possible.
>> Can you share a little bit about how you thought about verifiers for GPT-5? Are you using LLM as a judge, or anything you can share?
>> It's a mixture of a lot of approaches. Unfortunately, I think we'll share that at a later date.
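To make the "LLM as a judge" idea the interviewer raises concrete, here is a minimal, hypothetical sketch of how a judge model can act as a verifier for a subjective domain like creative writing. Nothing here reflects OpenAI's actual setup, which Mark declines to detail; the `generate` and `judge` callables and the rubric are invented for illustration.

```python
from typing import Callable

# Hypothetical stand-ins: in a real system these would call a policy model
# being trained and a separate (often frozen) judge model.
Generator = Callable[[str], str]
Judge = Callable[[str], str]

RUBRIC = (
    "Score the draft from 0 to 10 for style, persuasiveness, and coherence. "
    "Reply with only the number."
)

def judged_reward(prompt: str, draft: str, judge: Judge) -> float:
    """LLM-as-judge verifier: turn a judge model's rubric score into a scalar reward."""
    reply = judge(f"{RUBRIC}\n\nPrompt:\n{prompt}\n\nDraft:\n{draft}")
    try:
        score = float(reply.strip().split()[0])
    except (ValueError, IndexError):
        score = 0.0  # unparseable judgments earn no reward
    return max(0.0, min(score, 10.0)) / 10.0  # normalize to [0, 1]

def best_of_n(prompt: str, generate: Generator, judge: Judge, n: int = 4) -> str:
    """Simplest use of the verifier: sample n drafts and keep the highest-scoring one."""
    drafts = [generate(prompt) for _ in range(n)]
    return max(drafts, key=lambda d: judged_reward(prompt, d, judge))
```

In an actual RL setup the same scalar would feed a policy-gradient update rather than a best-of-n filter; the point is only that a judge model can attach a reward signal to domains that lack mechanically verifiable outputs.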
>> Is there anything you want to cover, Mark? I have a couple more questions, but I want to make sure, if there's something especially cool that you're excited to talk about, we get there.
>> It's really just been an exciting month for us, right? We're coming off the heels of not just the GPT-5 release, but there's also the open-source release, and we're just really excited that we've been able to squeeze all this capability into such a small model that's accessible for people who want to run models on-prem. And then even before that, we've had some really good results at competitions like AtCoder and the IMO. It's really just confirmation that these broad-based reasoning models are able to achieve stronger and stronger results; we're continuing to raise the ceiling. For me, the AtCoder result stands out just because it's the first top-three result at a world-class-tier math or programming competition, and I think there's just a world of difference between top 100 and top three. So we're really happy with the progress that the team is making, pushing into that top-tier level of human performance.
>> And so how does achieving such good performance at the IMO or at AtCoder translate into more everyday-use intelligence, where people are actually going to see a difference?
>> Yeah, that's a really great question. When we approached these two competitions, we didn't set out to create, here's an IMO-specific model that we're going to pump a ton of compute into and all it can do is the IMO. We're creating fairly general models. We take general techniques, layer them on top of our reasoning models, and with a very small amount of tuning, if any, we create models that can really perform at this level at the IMO and IOI. And some of the models that we use for math competitions are the same models that we're using in the coding competitions. So they really aren't these very narrow, specific models. We care about a program of producing general intelligence, and so I think what you're seeing is what you're getting: these models are very good at these contests, but they're also just very good generally. We haven't done too much tuning for them.
>> Yeah. That kind of speaks to what Noam, and you yourself, thought about having a generalized omni model in the future: if it's good at all of these different types of very difficult tasks, and those things are all inside one model, then they can apply to more everyday use. So it definitely lines up with what you were speaking about.
>> Absolutely.
>> Let's talk a little bit about the open-source models that were just released.
>> Mhm.
>> Tell me why you are so excited about these two flavors of models, 20B and 120B. Why are you so excited, and how do they help progress OpenAI's mission?
>> Yeah. So in terms of just the sizes of the models, we're happy to deliver models which can run on a laptop and run on a phone. These are very commonplace consumer form factors, and it's really just going to allow a lot of hobbyists to become more involved. People with very specialized constraints, academics, they're going to be able to access models which they have full control over, and I think that's going to be very powerful. And in terms of how it advances our goals, one thing that we're really excited about is to raise the bar in terms of what we consider acceptable for release in open source. We've done a lot of extensive safety work on this model. What I mean here is, we have a preparedness framework which allows us to think about risk for models: is it risky from a bio perspective, or from a chemical perspective, or from a cybersecurity perspective? What we've done with these open-source models is to just test: if some actor decided to fine-tune one of these models to be maximally dangerous in terms of, say, cybersecurity or being able to attack code offensively, what is the ceiling there? And we're making sure that it passes under the bar of what we consider high levels of danger before we release this model. So I think it's really a good chance to set norms for what is safe and responsible to release in open source, while at the same time releasing a frontier-class open-source model.
>> Yeah, I'm super excited to test it out; I've obviously read all about it already. What went into the decision on the total number of parameters? I think obviously it's what can fit on consumer-grade hardware, but then also the active parameters. How did you decide what that ratio looks like?
>> Yeah, I think really it is just motivated by the surfaces and by latency; we just want it to be very easy for people to run. I think that really drove all of the decisions in the open-source model. We engaged with developers very early on in the process. We've taken a lot of input, even starting from I think two or three months ago, and I think a lot of those conversations were very helpful in terms of parameterizing the model.
>> The benchmarks are very impressive, right? Comparable to o3-mini, sometimes o4-mini. Let's talk about benchmarks a little bit. You spoke about GPT-5, or a model, getting close to winning: second place at AtCoder, a gold medal at the IMO. How do you think about benchmarks going forward, when many of these benchmarks are saturated, but you have things like the ARC-AGI benchmark, which is still pretty far off? What are your favorite benchmarks? How do you think about benchmarks, and how do these benchmarks tell the story of the model's capabilities?
>> You're absolutely right that there is a bit of a crisis when it comes to benchmarks today. What I mean is that we used to be able to take hard benchmarks written by humans and just use them to test the models as well. And as the models get really to that frontier of what humans can do, the human benchmarks are not going to cut it anymore. I think today no one will be surprised if a model gets a very good score on the SAT. So now you've had to develop a lot of new types of benchmarks just to test the frontier capabilities of the models, and the problem is that there's no standard or agreed-upon way to do this. People are putting it together in just the last couple of years. We do a lot of that work ourselves too, and I think over time some of these things will become more and more standard. But the danger too is that they tend to have fairly short shelf lives these days. When you see any benchmark that starts with just a couple of percentage points of accuracy, oftentimes within a year you're already getting into 30 to 50 percent. So I do think the shelf life of benchmarks has really been reduced quite a bit, and we spend a lot of active effort just creating orthogonal new types of benchmarks to challenge our models.
>> Yeah. And what do you think about interactive benchmarks like ARC-AGI-3? It's kind of an interactive game benchmark. What do you think of these benchmarks that are set up more as gaming environments for the AI to play within?
>> Yeah, it's certainly a good test of our models, and I think there's a danger in just indexing too much on one specific benchmark. What we try to do is develop these broad-based reasoning capabilities and then use all of these specific thermometers to gauge whether we're making progress in a broad-based way. So I wouldn't say our approach takes one benchmark as front and center and optimizes for that; rather, we try to get a pulse on what all the signal-bearing benchmarks out there are, design more ourselves, and just make sure that those are all small signals that we're pushing broad-based reasoning in the right way.
>> I have a lot of developers and engineers who watch my videos, and a lot of them are quite nervous as AI continues to get better, right? Second place at the AtCoder competition. What do you tell, let's say, new graduates or students who are thinking about getting into coding as a career? I assume you're optimistic, but don't let me put words in your mouth. What do you tell them to make them feel more confident about that as a career choice?
>> Yeah, I would say just lean into using the tools to accelerate yourself. That's what we think about at OpenAI as well: we ultimately want to build AI that can accelerate research, and I think part of that is just accelerating your own productivity. If you learn how to interface with the tools, if you learn how to make yourself 2x, 3x more effective, there's still a lot of value you bring with your ideas and with learning deeply how the technology works. So really lean into that, understand the mechanics, be able to contribute to the mechanics, and just accelerate yourself. I think that's probably one of the most important things going forward.
>> And then more generally, let's say not coding but knowledge work: there again, a lot of people are very fearful of being automated away. Do you apply that same recommendation to general knowledge workers, just learn the tools? And if so, why?
>> I think so. And it's always because I feel like the model is going to automate some surface, but it creates new surface for us to adapt to. We're not going to lie: we think this is going to transform the economy in significant ways, but we also believe that humans are very adaptable. We've always done it in the past. There have been very powerful productivity tools in the past, and I think the more that we adapt to them, we create new surface for us to work in. And ultimately we do believe that this technology is going to raise the quality of life for most people, both through scientific progress and just through economic means. And I think that's one of the things that motivates all of us here.
>> Yeah, I love that. I love the very optimistic view. I'm very optimistic as well. Okay, so let's end on this question. What are you personally most excited about over the next 6 months, and then let's say over the next 24 months?
>> I would say in 6 months, I'm really excited to just continue the reasoning scaling paradigm. I think there are so many different ways that we see pumping more test-time compute into the models and having the models leverage test-time compute effectively. There are a lot of different RL objectives that we're trying out, there are a lot of innovations in RL optimization that we're trying out, and there are a lot of different ways that we're looking into scaling RL. So it's still a field that is rich with ideas, that's not mature in the sense that there's just one recipe, and we're really excited to figure out all those small details. And I would say in 2 years, I just think we really can get to the point where these models are just as effective as I am at doing AI research, and I'd love to create this system where the AI is really driving a lot of the innovations that make future systems more successful.
>> Yeah, self-improving AI is just such a cool concept to think about. If it's able to discover new innovations in artificial intelligence, apply them to itself, and then continue to scale up from there, it's really just a function of how much compute you can throw at it.
>> Absolutely.
>> Mark, thank you so much for chatting with me today. It's been a pleasure. I really appreciate it.
>> Yeah, I really enjoyed the chat. Thank you so much, Matt.