Gemini 3 vs GPT 5.1 Simulation Challenge

By Simulation Sandbox

Summary

## Key takeaways

- **Gemini 3 Bird Flocking Impresses**: Gemini 3's bird flocking simulation features realistic flocking in a foggy river delta with 3D terrain, trees, rocks, and transparent streams, though the trees don't touch the ground and the birds pass through the environment. [01:34], [02:14]
- **Codex Birds More Fluid but Buggy**: GPT 5.1 Codex's birds flock more fluidly and the bird view is smooth, but the birds fly off the map into the abyss and still hit the ground even with the ground clearance control turned all the way up. [03:08], [03:31]
- **Gemini City Multiple Failures**: Gemini 3 built a couple of cities that didn't work, saw the issues in its screenshots, and pretended they were fixed before eventually delivering cars, walking people, parks, and time-of-day changes. [04:35], [04:56]
- **Codex City Layout Disaster**: Codex's city has streets oriented the wrong way, trees in the roads, traffic lights pointing the wrong way, very wide cars, traffic that halts after 10-15 seconds, and floating people with no walking animation. [05:41], [06:08]
- **Gemini Wins on Spatial Reasoning**: Gemini struggled more with agentic coding than GPT 5.1 Codex, but its spatial reasoning is on another level. Codex failed the city challenge even though it was more reliable at getting a working build. [06:30], [06:50]
- **Screenshots Expose AI Blindspots**: Both models test via screenshots, but Codex misses obvious issues like trees in roads, while Gemini sometimes ignores failures it sees. [01:34], [06:30]

Topics Covered

  • Emergent Flocking Beats High-Level Coding
  • Agentic Tools Expose AI Reliability Gaps
  • Gemini's Spatial Edge Trumps Codex Fluidity

Full Transcript

In this video, I'm going to push the two smartest AIs to their limit. Gemini 3 can create beautiful one-shot scenes, like this one it created to introduce itself. And I'll be comparing it to GPT 5.1, or more specifically GPT 5.1 Codex Max (Extra High), but that's not quite as catchy. For really

difficult tasks, these are apparently the two best AIs in the world. So, I'm going to test them on complex simulation tasks that require planning, testing, and iterating over time. The first is a bird flocking simulation set in a beautiful nature scene, and the second will be a full

city simulation with people and traffic. First is bird flocking. It has to be a 3D simulation with realistic bird flocking behavior. I want it set in a beautiful nature scene with vegetation, trees, and any extras the model wants to add. I also want a bird view where the camera follows

a bird to see what it's like to be in the flock. It has to use Three.js, which is the most popular JavaScript 3D graphics library. If you've seen some of my other videos, you know the models are pretty good at animating with it. They can use any other tools they like, and it has to have basic controls for looking around the simulation. That's a lot to get done all at once. So, I'm going to use

agentic coding tools to let the models work for a long time on the task. I'll use Gemini 3 with the Gemini CLI and I'll use GPT 5.1 Codex Max with the Codex CLI. Both with maximum reasoning. The

models have to run the simulation themselves to test it and take screenshots to view what their simulation actually looks like and see what can be improved. And I'm going to ask that every element is as detailed and realistic as possible. I'll add the full prompt in the video description if you're interested. This is Gemini 3's bird flocking simulation. We can see straight away the birds

are flocking in a pretty realistic way. And the nature scene looks like a foggy river delta with 3D terrain, trees, rocks, and transparent streams. Although, if you look closely, the trees don't actually touch the ground. It's quite a big world, and the birds explore a lot

of it if you let it run. But they do get caught in these circles. It has the controls we asked for to tune the simulation parameters like how close the birds like to fly to each other and their speed.

And if you mess with it, you can get some weird behaviors. The bird view is exactly what I wanted, but can be a bit shaky and sometimes breaks. But here we can get an idea of how the simulation works. Each bird is flying around independently based on some rules like when to turn towards or

away from nearby birds. And the flocking behavior you see is emergent from these individual bird rules rather than being programmed in at a high level. Gemini didn't bother to make the birds interact with the environment and they go straight through the trees and the ground. But overall,

I'm pretty impressed with Gemini getting this on its first try with no intervention from me. GPT

5.1 Codex's bird flocks seem a bit more fluid and natural. A few seconds in, they do go flying out of the world and into the abyss, although they do eventually come back. The trees meet the ground correctly for this one, but the water feature is nowhere near as good as Gemini's. The bird mode is very smooth and looks great when the birds are actually flying over the world. And the birds do

hit the ground on this one, which is a bit weird, but I guess more realistic than Gemini's birds, which go straight through the ground. We have the simulation controls similar to Gemini's, which include bird ground clearance, but even with that all the way up, they still sometimes hit the ground. If you play around with the controls, you can get some weird flocks. I found

this lonely bird and managed to catch the moment it detects a nearby flock and turns to join it.
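The per-bird rules described above can be sketched roughly as follows. This is a minimal illustration, assuming a classic boids-style model (separation, alignment, cohesion) plus a soft push that keeps birds inside the world and above the ground. Neither model's actual code is shown in the video, so the `Bird` class, the `step` function, and every constant here are hypothetical, and the ground is assumed to be a flat plane at height 0.

```python
import math
from dataclasses import dataclass
from typing import List

# Hypothetical tuning values, chosen only for illustration.
PERCEPTION_RADIUS = 10.0    # how far a bird can "see" its neighbours
SEPARATION_DISTANCE = 2.0   # closer than this, birds steer apart
MAX_SPEED = 5.0
WORLD_HALF_SIZE = 50.0      # assumed world spans [-50, 50] on x and z
MIN_HEIGHT = 3.0            # assumed clearance above flat ground at y = 0
TURN_BACK = 0.5             # strength of the push back into bounds


@dataclass
class Bird:
    pos: List[float]  # [x, y, z], with y as height
    vel: List[float]


def step(birds: List[Bird], dt: float = 0.1, w_sep: float = 1.5,
         w_align: float = 0.05, w_coh: float = 0.01) -> None:
    """Advance every bird by one time step using only local rules."""
    new_vels = []
    for bird in birds:
        away = [0.0, 0.0, 0.0]     # separation: pushes away from close birds
        vel_sum = [0.0, 0.0, 0.0]  # alignment: neighbours' summed velocity
        pos_sum = [0.0, 0.0, 0.0]  # cohesion: neighbours' summed position
        count = 0
        for other in birds:
            if other is bird:
                continue
            offset = [other.pos[i] - bird.pos[i] for i in range(3)]
            dist = math.sqrt(sum(c * c for c in offset))
            if dist > PERCEPTION_RADIUS:
                continue  # too far away to influence this bird
            count += 1
            for i in range(3):
                vel_sum[i] += other.vel[i]
                pos_sum[i] += other.pos[i]
                if 0.0 < dist < SEPARATION_DISTANCE:
                    away[i] -= offset[i] / dist
        vel = list(bird.vel)
        if count:
            for i in range(3):
                vel[i] += w_sep * away[i]
                vel[i] += w_align * (vel_sum[i] / count - bird.vel[i])
                vel[i] += w_coh * (pos_sum[i] / count - bird.pos[i])
        # Soft containment: nudge birds back toward the map and off the ground.
        for i in (0, 2):
            if bird.pos[i] > WORLD_HALF_SIZE:
                vel[i] -= TURN_BACK
            elif bird.pos[i] < -WORLD_HALF_SIZE:
                vel[i] += TURN_BACK
        if bird.pos[1] < MIN_HEIGHT:
            vel[1] += TURN_BACK
        # Clamp speed so the summed forces cannot accelerate a bird forever.
        speed = math.sqrt(sum(c * c for c in vel))
        if speed > MAX_SPEED:
            vel = [c * MAX_SPEED / speed for c in vel]
        new_vels.append(vel)
    # Apply all updates together so every bird reacts to the same snapshot.
    for bird, vel in zip(birds, new_vels):
        bird.vel = vel
        bird.pos = [bird.pos[i] + vel[i] * dt for i in range(3)]
```

In this sketch, the lonely bird moment corresponds to another bird coming within `PERCEPTION_RADIUS`: the cohesion and alignment terms become non-zero and the bird starts turning toward the flock. Nothing in the code describes a flock as a whole, which is what makes the behavior emergent. A containment term like the one here is one plausible way to address birds leaving the map or hitting the ground, not a description of what either model did.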

Codex's birds were much nicer to play with than Gemini's, but Codex also made the biggest mistake of letting the birds fly off the map. To me, they both did really well with different pros and cons, so I'm going to make the next challenge even more difficult to see if they can handle it. The next

challenge is to build an entire city. I want it to be a 3D simulation in Three.js. It has to have buildings, streets, vegetation, parks, and other things that should go in a city. It needs to have people walking around and cars driving around with realistic traffic interactions. I'm going

to use the agentic coding tools again to let the models run and test their code, take screenshots, and iterate based on what they see. Here is Gemini 3's city. It took a few tries. It built a couple of cities that just didn't work, and it could clearly see they didn't work from the screenshots, but it just didn't care. And even after I asked Gemini to fix it, it just pretended that it was

fixed. But eventually it built this city which has cars and people walking around just like I asked. These parks are scattered around and each one is a bit different. The people have a little

walking animation and they walk on the sidewalks, although some of the sidewalks are missing. They use the road crossings, but they do not care for red lights and they walk straight through cars. The cars do drive on the road and stop at traffic lights, but the rules are a bit strange.

Like this car is driving on the right and then switches to the left. After about a minute, this causes the entire city to get backed up while cars wait for someone on the wrong side of the road to move. You can also change the time of day and nighttime has a very different vibe. But overall,

I'm going to say this is a pretty good job from Gemini 3. Codex's city looks okay from a distance, but is full of weird issues. Half of the streets are oriented the wrong way and things just seem to be placed randomly. Like trees are in the middle of the road and traffic lights don't point the right way. I thought the cars were driving sideways at first,

but the lights and everything are on the right way, so they are just very wide. The traffic

only really functions for about 10 or 15 seconds. Then the cars just seem to stop and never start again. The people are also nowhere near as good as Gemini's. They kind of just float around and

have no walking animation. Overall, I'm going to give Codex a fail on this one. There are just too many issues. I gave it a couple of tries, too, because Gemini had a couple of failed attempts.

Codex was much more reliable than Gemini at getting it working, but it just couldn't get it much better than this. I think the main issue is it just doesn't look at the screenshots properly, so it just misses obvious glaring issues right in front of it. Even though I love Codex and use it every day, I'm going to say Gemini wins this comparison. It did struggle

with the agentic coding more than GPT 5.1 Codex, but its spatial reasoning is on another level.

Thanks for watching. I'll put the full prompt and process I use in the video description. Cheers.
