
More New AI Models! OpenAI Drops 5.1 Pro and Codex Pro

By The AI Daily Brief: Artificial Intelligence News

Summary

## Key takeaways

- **GPT-5.1 Pro Rolls Out Quietly**: OpenAI released GPT-5.1 Pro to all Pro users, delivering clearer, more capable answers for complex work with strong gains in writing help, data science, and business tasks, though without a dedicated blog post. [06:03], [05:49]
- **Codex Max Masters Compaction**: Codex Max is the first model natively trained to operate across multiple context windows through compaction, coherently working over millions of tokens to enable project-scale refactors, deep debugging, and multi-hour agent loops. [01:51], [03:11]
- **Token Efficiency Leaps 30%**: Codex Max with medium reasoning achieves better performance than GPT-5.1 Codex while using 30% fewer thinking tokens, a major gain in token efficiency for long-running autonomous coding agents. [02:42], [02:38]
- **24-Hour Autonomous Coding**: In internal evaluations, Codex Max has worked on tasks for more than 24 hours independently, sustaining coherent work over long horizons by pruning history while preserving key context. [03:36], [03:25]
- **Pro Excels in Complex Reasoning**: GPT-5.1 Pro is a slow, heavyweight reasoning model that feels smarter than anything else on tough problems, with standout instruction following (like a contract engineer working from a spec) and strong performance on immunology explanations and scientific synthesis. [07:05], [06:35]
- **Agent Time Horizons Triple**: On METR's long-horizon task measurements, Codex Max completes tasks that take a human programmer 2 hours and 42 minutes at a 50% success rate, and the state-of-the-art agent time horizon has tripled since Claude 3.7 Sonnet in February. [04:44], [05:07]

Topics Covered

  • Compaction Unlocks Agentic Coding
  • Codex Max Sustains 24-Hour Tasks
  • GPT-5.1 Pro Equals Domain Expert
  • Scaling Laws Defy AI Plateau Narrative

Full Transcript

Welcome back to the AI Daily Brief. Boy, did this turn into just a hell of a week. Today we're talking about OpenAI's response to Gemini 3, but we're also talking about what I think will start to happen in the wake of this week, which is a bit of a recalibration in the larger narrative around AI as well.

First though, let's start with the new model releases. When we got GPT-5.1, which frankly no one was really expecting, it became clear that OpenAI knew that Gemini 3 was coming out very, very soon. Now, 5.1, as I've said numerous times, was a major update. It was not a nothing update at all. On the one hand, 5.1 brought more personality back to the model, trying to appeal to the 4o people who had been so mad when GPT-5 came out and felt much more clinical to them. But it has also felt to many, just frankly, like a big step up in capabilities from GPT-5. I know on a personal level, I have significantly increased the amount of time that I've been collaborating in a brainstorming, creative, and strategic ideation capacity since 5.1 dropped. Likewise, it was notable that the pre-Gemini 3 drop did not include a Pro version, leading many to speculate that that would be OpenAI's fast follow to Gemini 3. I'm not sure that people thought it would be this fast to follow, though.

And as it turns out, it was not just 5.1 Pro that we got; in fact, even more emphasis yesterday was placed on a new coding model, GPT-5.1-Codex-Max. In their announcement post, OpenAI writes, "GPT-5.1-Codex-Max is built on an update to our foundational reasoning model, which is trained on agentic tasks across software engineering, math, research, and more. GPT-5.1-Codex-Max is faster, more intelligent, and more token-efficient at every stage of the development cycle, and a new step towards becoming a reliable coding partner." Codex Max, they say, is built for long-running, detailed work. And one of the big new innovations is this new process they call compaction. They write, "It's our first model natively trained to operate across multiple context windows through a process called compaction, coherently working over millions of tokens in a single task. This unlocks project-scale refactors, deep debugging sessions, and multi-hour agent loops." In other words, this model is not only designed for raw capabilities; it's designed to improve performance in the specific context in which it's going to operate: not just as a coding assistant, but as an autonomous coding agent. Now, as with any model release, we got some benchmarks.

And remember, this is a model that is very specifically designed for the purpose of coding. Introducing the benchmarks, they reinforce that it was trained on real-world software engineering tasks, including PR creation, code review, and front-end coding. And in so doing, Codex Max represents a major jump from GPT-5.1 Codex (high) on both SWE-Lancer as well as Terminal-Bench. The value, however, isn't just in output; it's also in token efficiency.

For example, they write that on SWE-bench Verified, Codex Max with medium reasoning achieves better performance than GPT-5.1 Codex with the same reasoning effort while using 30% fewer thinking tokens. They also announced that they're introducing a new extra-high reasoning effort for non-latency-sensitive tasks, i.e., tasks that can run for a long period of time. Overall, then, you're getting better results and more efficient performance.
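As a rough illustration of what that reasoning-effort dial looks like in practice, here is a minimal sketch using the OpenAI Python SDK's Responses API. The model identifier and the "xhigh" effort value are assumptions based on the announcement, not confirmed API values.

```python
# Minimal sketch: choosing reasoning effort per task.
# Assumptions: the "gpt-5.1-codex-max" model id and the "xhigh" effort value
# are taken from the announcement and may not match what the API exposes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_coding_task(prompt: str, latency_sensitive: bool) -> str:
    # Interactive work stays on medium; long, non-latency-sensitive jobs get extra-high effort.
    effort = "medium" if latency_sensitive else "xhigh"
    response = client.responses.create(
        model="gpt-5.1-codex-max",       # assumed identifier
        reasoning={"effort": effort},
        input=prompt,
    )
    return response.output_text

print(run_coding_task(
    "Refactor the payment module to remove the global session object.",
    latency_sensitive=False,
))
```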

And it's clear from the blog post that this is a model that's designed to expand the universe of what's possible with AI and agentic coding. In a section called "Long-running tasks," OpenAI writes, "Compaction enables Codex Max to complete tasks that would have previously failed due to context window limits, such as complex refactors and long-running agent loops, by pruning its history while preserving the most important context over long horizons. The ability to sustain coherent work over long horizons is a foundational capability on the path towards more general, reliable AI systems." Ultimately, they claim that Codex Max can work independently for hours at a time. Indeed, they say, "In our internal evaluations, we've observed Codex Max work on tasks for more than 24 hours." They conclude, "Codex Max shows how far models have come in sustaining long-horizon coding tasks, managing complex workflows, and producing high-quality implementations with far fewer tokens."

fewer tokens." Finally, they clued with some statistics. Internally, they say

some statistics. Internally, they say 95% of their engineers use Codeex weekly, and the engineers that do ship roughly 70% more pull requests since adopting Codeex. So, that's the official

adopting Codeex. So, that's the official blog post. Other members of OpenAI's

blog post. Other members of OpenAI's team focused on different parts.

Researcher Noam Brown used it as a chance to reinforce a message which has been coming up all week: pre-training hasn't hit a wall, he writes, and neither has test-time compute. Ethan Mollick points out, in a theme we'll come back to, that 5.1 Codex was released 6 days ago and now we have 5.1 Codex Max; the use of every naming scheme piled on top of each other, from version numbers to qualifiers like Max, makes it hard to see how big a deal each release is, but this looks like a big jump in ability.

Peter Gostev tested it against a prompt to create an application that allows you to view the Golden Gate Bridge from various angles and said, "This is definitely the best I ever got out of this type of prompt by far." METR's measurement of long time-horizon tasks, which is of course the chart that we've been following very closely as a fast visual cue to understand shifts in capabilities, showed that Codex Max was able to complete tasks that take a human programmer 2 hours and 42 minutes with a 50% success rate. That's 25 minutes longer than GPT-5, which was the previous state of the art, although Grok 4.1 and Gemini 3 have not yet been tested. What all of this adds up to, by the way, on the METR test is that the time horizon for agent capabilities is still doubling roughly every 7 months, but due to a slight inflection point somewhere around the release of o3, the time horizon of capabilities for the state of the art has actually tripled since the release of Claude 3.7 Sonnet in February.
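As a back-of-the-envelope check on that claim, a 7-month doubling trend over the roughly nine months from February to now would predict only about a 2.4x increase, so a 3x jump implies an effective doubling time closer to six months. A quick sketch of the arithmetic, with the dates and figures as assumptions:

```python
# Back-of-the-envelope check on the METR time-horizon trend (dates/figures assumed).
import math

doubling_period_months = 7          # the long-run trend cited in the episode
months_elapsed = 9                  # roughly February to mid-November

trend_multiplier = 2 ** (months_elapsed / doubling_period_months)
print(f"7-month doubling over ~{months_elapsed} months predicts ~{trend_multiplier:.1f}x")  # ~2.4x

observed_multiplier = 3             # the "tripled" claim
effective_doubling = months_elapsed / math.log2(observed_multiplier)
print(f"A {observed_multiplier}x jump implies an effective doubling time of ~{effective_doubling:.1f} months")  # ~5.7
```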

Now, people have not had a lot of time to digest this, but a lot of folks are jumping on this idea of compaction and what it might mean for context windows in the long run. And indeed, you get the sense that a lot of the innovations in Codex Max were basically OpenAI trying out things that it wants to bring to general-purpose AI in what they perceive as the most competitive and highest-value use case area right now, which is AI coding. Now, Simon Willison pointed out that despite Codex Max, the, quote, "bigger news today" may actually be GPT-5.1 Pro. Although, as he points out, that one didn't even get a blog post. It just got this tweet: OpenAI actually retweeted its announcement of GPT-5.1 from last week, saying, "GPT-5.1 Pro is rolling out today to all Pro users. It delivers clearer, more capable answers for complex work with strong gains in writing help, data science, and business tasks." Now, despite it not having a lot of release hullabaloo, there were some people who had early access to it.

Professor Derya Unutmaz writes, "I can confidently say 5.1 Pro has raised the level of my favorite model, GPT-5 Pro, by a significant notch." He gave an example where he asked both 5 Pro and 5.1 Pro about the top unanswered questions in immunology, requesting that both models unpack each question clearly so that someone without an immunology degree could understand their importance. He concludes, "5.1 Pro is clearly better in that someone without an immunology background can more easily understand these explanations, with the importance and potential payoff clearly spelled out. They are also more self-contained, more visual, and more accessible while still being deep." Content creator Theo had tweeted back on November 17th, "Just had my mind absolutely melted by [redacted]. Can't wait to talk about it," and responded yesterday, "OpenAI just quietly released GPT-5.1 Pro, and this is the [redacted] I was talking about."

Matt Shumer did not mince words. He said, "I've had access to GPT-5.1 Pro for the last week. It's an effing monster, easily the most capable and impressive model I've ever used." But he says it's not all positive. His review is ultimately titled "an absolute monster, but trapped in the wrong interface." His summary reads, "5.1 Pro is a slow, heavyweight reasoning model. When given really tough problems, it feels smarter than anything else I've used. Instruction following is the standout. It actually does what you ask for without going off the rails. For serious coding, it feels less like an assistant and more like a contract engineer working from a spec. It is ridiculously smart. It genuinely feels like a better reasoner than most humans, and I expect examples within days of it solving problems people thought were out of bounds for today's AI systems." However, he said there are still areas where it loses to Gemini 3, and there are interface issues. He writes, "Frontend and UX design are still far worse than Gemini 3, and the biggest weakness is the interface. It lives in ChatGPT, not in my IDE, not wired into my existing tools. This friction is beyond limiting and frustrating." He says, "For most day-to-day work, Gemini 3 is just better. Waiting 10 minutes for an answer in a separate interface is not ideal. For anything that requires deep thought, planning, and research, and anything that I need to get right on the first try, I reach for 5.1 Pro."

Ethan Mollick pointed out, "OpenAI feels like it undersells GPT-5 Pro, which is still the model that is most likely to deliver serious value on very hard problems. Partially, it is because these hard problems are complicated, so they're hard to describe to others." Now, Ethan also points out that the right comparison is probably not Gemini 3 but Gemini 3 Deep Think. Still, it is interesting that 5 Pro has always had a bit of a shroud of mystery when it comes to the right use cases. One other person who had early access to 5.1 Pro is Simon Smith.

He wrote, "I was invited to alpha test 5.1 Pro alongside experts in robotics, math, immunology, medicine, music, and more. My focus was life science commercial research and strategy, and some personal use cases. Having used 5.1 Pro for a few days, I find it more like a human domain expert than 5 Pro, with clearer writing, better judgment, fewer tangents, stronger synthesis, and more emotionally aware responses. I ran 5.1 Pro head-to-head against 5 Pro on work tasks like scientific literature synthesis, drug launch planning, and social media analysis. I also tried it for personal financial planning and even journaling. It was more rigorous and comprehensive in research and planning, stronger at reasoning, better at staying on track and avoiding tangents (and, in at least one case, errors), and much clearer, more confident, and more empathetic in its communication style."

Now, he does point out that it's still bad at certain things. He said that it's not good at creating professional-quality presentations or Excel spreadsheets. And he said, "I saw that at least one tester found the model conservatively avoided tackling known open problems in STEM domains, choosing instead to explain why they're open problems." Ultimately, he says it's about a 10 to 15% jump over 5 Pro for the types of things he uses it for. And he says, "Knowing OpenAI's focus on real-world performance, like GDPval and its hiring of domain experts in fields like finance, I think human domain expertise is exactly what they're going for. And with 5.1 Pro, they're getting closer. This bodes well for AI doing even more impactful work in 2026."

Now, to zoom out here, I think the obvious surface-level story is something like "OpenAI claps back in the week that Google wanted to dominate with Gemini 3." And to some extent that's the case, although it's pretty clear that OpenAI is not trying to steal Gemini's general thunder with this, or at least knows that it's not possible with these models. Instead, they chose to release the two updated models that are most specifically about very discrete types of work. They are showing off some new approaches, or at least newly named approaches, like this compaction, that hint at where the future of general models is headed and suggest that there is still much, much more territory to be claimed. Indeed, interestingly, I think that these releases in a weird way are much less about trying to win back momentum from Google and much more about leaning into Google's momentum more broadly. Taken alongside Nvidia's earnings report, you can feel the embers of a little bit of a shift in the AI narrative. For a couple of months now, markets have been flirting with the idea that AI is just a big bubble. And one of the things that they've been looking for as evidence is, of course, plateaus or walls in the ability of these models to continue to improve. The story of this week, as investor Gavin Baker points out, is that Gemini 3 shows that scaling laws for pre-training are intact. He says this is the most important AI data point since the release of o1. Now, he gets into why that is, which is a topic that we'll explore in an episode later this week. But for our purposes here today, I think that takeaway one from these new models from OpenAI is that we all just got even more new tools to play with. And two, in some ways, this week wasn't about competition, but about all the model companies, including Grok with 4.1, standing shoulder to shoulder and telling all of the skeptics: just wait to see what comes next. That's going to do it for today's AI Daily Brief. Thanks for listening or watching as always, and until next time, peace.
