Behind the scenes of Google's state-of-the-art "nano-banana" image model
By Google for Developers
Summary
## Key takeaways

- **Gemini 2.5 Flash: A Leap in Image Generation**: Google's Gemini 2.5 Flash represents a significant quality improvement in image generation and editing, offering state-of-the-art capabilities that impress users with both visual quality and perceived intelligence. [00:05], [01:05]
- **Natural Language Editing & "Nano Banana"**: Users can interact with the model using natural language for complex edits, like transforming an image into a "nano banana" version, demonstrating the model's creative interpretation and ability to maintain scene consistency across multiple turns. [01:31], [02:31]
- **Text Rendering as a Quality Proxy**: Text rendering, initially a challenging area, has become a key metric for evaluating overall image generation quality, as the model's ability to structure text implies a better grasp of image structure. [04:09], [06:14]
- **Cross-Modal Learning Drives Progress**: Capabilities developed for image understanding positively transfer to image generation and vice versa, with the ultimate goal of a unified multimodal model that learns across different modalities like images, video, and audio. [08:38], [09:52]
- **Interleaved Generation for Complex Tasks**: Interleaved generation allows for complex editing and ideation by breaking down prompts into multiple steps, enabling incremental creation and offering a new paradigm for generating intricate images that maintains context and consistency. [11:25], [16:23]
- **User Feedback Fuels Model Improvement**: Real-world user feedback, gathered from platforms like Twitter, is crucial for identifying failure modes and driving improvements in subsequent model releases, such as enhancing character consistency and overall aesthetic appeal. [20:59], [22:14]
Topics Covered
- The Magic of Iterative Image Generation
- Creative AI Interpretation: From Banana Costumes to Nano Versions
- Text Rendering: A Key to Understanding Image Structure
- When AI's 'Mistakes' Lead to Better Results
- Factuality and Functionality: The Future of Image Generation
Full Transcript
Today we're talking about native image
generation with the team behind the new
model that we're releasing.
>> It's a giant quality leap, the model's
state-of-the-art, and we're really
excited about both the generation and
editing capabilities.
>> You can ask to, for example, render the
character from different angles and it
will look like the exact same character.
>> When users interact with this, not only
are they impressed by the quality of the
images, but they feel like, "Wow, this
is smart."
>> And you can kind of have a fun conversation
with the model over multiple turns. So I
think this iterative process of
creating is kind of the magic
behind it
>> and I think we're just scratching the
surface um on what these models can do.
>> Hey everyone, welcome back to release
notes. My name is Logan Kilpatrick. I'm
on the Google DeepMind team. Today we're
joined by Kaushik, Robert, Nicole, and
Mostafa. These are the folks doing
research and product for our Gemini
native image generation model, which
we're here to talk about today, and which
I'm super excited about. So Nicole, you
want to kick us off? What's the good
news? I'm excited to hear about the release.
>> Yeah, we're releasing an update to our
image generation and editing
capabilities in Gemini 2.5
Flash. And it's a giant quality leap.
Um, the model's state-of-the-art and
we're really excited about both the
generation and editing capabilities. Um,
and why don't I just show you what the
model does, because that's the best way
to kind of get that across.
>> I'm excited. I played around with it
like once, but I have not done as much
playing around as y'all have. So, I'm
excited to see some examples.
>> Um, great. I'm I'm going to take a
picture of you.
>> Okay. Um, and let's just start with,
let's say, zoom out and show him wearing
a giant banana costume and keep his face
visible, because we want to make sure it,
you know, looks like you.
All right,
it's going to take a couple of seconds
to generate, but it's still it's still
pretty snappy, which I think you
remember from our last release. Like, it
was a pretty fast model. Um,
>> This was one of my favorite things, cuz
I feel like this pace of editing
makes these models a ton of fun to
play with. Can you make it slightly
bigger for me? Can you just... You can go
full screen, I think. Click on this.
Click on this.
>> Let me just click on this. So, there we
go. This is Logan. This is still
your face. And what's awesome about this
model is that this still looks like you,
right? Like, this is you, but it's
actually like you're wearing a giant
banana costume, and now there's a
nice background of you walking through a
city.
>> That's so interesting, because this
picture is in Chicago, and that actually
is pretty much what that
street looks like. So
>> World knowledge coming through on
this model. Um, and now let's keep going
and let's say make it nano.
>> What does that mean? What does make it
nano mean?
>> So let's see.
>> Let's see what the model does. Um,
when we first released it on LMArena,
we gave it the code name Nano
Banana.
>> Yeah.
>> And people started speculating that it's
an updated model from us. And it is an
updated model from us. And there
you go. Now the model takes you and
creates this like cute nano version of
you wearing a giant banana costume.
>> I love that. That's awesome.
>> And the awesome thing here is obviously
like this was a very vague prompt,
right? Like you were like, "What does
this mean?"
>> I actually did not know what that meant.
Um, but the model's creative enough
to kind of interpret it and then, you
know, create a scene where
it fulfills your prompt,
it still makes sense in the context,
and it keeps all the rest of the scene
relevant. Um, and this is really
exciting because um, it's the first time
I think that we're seeing kind of LLMs
be really able to like keep the scene
consistent across these multiple edits
and have users use really natural
language to interact with the model,
right? I don't have to put in a super
long prompt. Like I'm just giving it
very natural language instructions and
can kind of have a fun conversation with
the model over multiple turns. Um, so
that's super exciting.
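For readers who want to try the multi-turn editing flow Nicole demos here, the sketch below shows roughly how it could look with the google-genai Python SDK. The API key placeholder, the model identifier ("gemini-2.5-flash-image-preview"), and the file names are illustrative assumptions rather than details from the episode; check the official Gemini API docs for the exact values.

```python
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# One chat session keeps every edit in the same multimodal context.
chat = client.chats.create(
    model="gemini-2.5-flash-image-preview",  # assumed model id; check the docs
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

def save_images(response, prefix: str) -> None:
    """Save any image parts returned in a response."""
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save(f"{prefix}_{i}.png")

photo = Image.open("logan.jpg")  # stand-in for the photo taken in the demo

# Turn 1: a natural-language edit on the uploaded photo.
r1 = chat.send_message(
    [photo, "Zoom out and show him wearing a giant banana costume, "
            "and keep his face visible."]
)
save_images(r1, "banana")

# Turn 2: a vague follow-up; the model reinterprets the scene it just made
# while keeping the rest of it consistent.
r2 = chat.send_message("Make it nano.")
save_images(r2, "nano_banana")
```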
>> I love that. How good is it at like text
rendering stuff, which is one of the use
cases I care the most about.
>> Um, do you want me to
>> Yeah. Yeah.
>> put something on this picture. Why don't
you give me a prompt?
>> Um,
Gemini Nano.
That's the only nano thing that comes to
mind.
I feel like the use case that
I'm always trying to do is
announcement tweets with billboards
with text on them. That's
my use case.
>> All right, let's go. There you go.
>> Nice.
>> Um and so this is a relatively simple
text, right? It's a pretty small number
of letters, like easy words, and that
worked really well. Um we do have some
gaps in text rendering that we call out
in the release. Um, and we're working
really hard on it. Folks on the team, um,
Kaushik maybe can talk about that, are
working on making text rendering even
better in our next model.
>> I love it. Any other examples you want
to show, or is there any other metric
story around this launch? I know one of
the challenges, and I'm curious how you
all think about this, is that the eval
story is a lot of human preference stuff,
that's what you're measuring. It's hard
to have a source of truth, though I think
there's probably some things that you
could have a source of truth on. But I'm
curious how you all think about that for
this release, but also just in general as
we're training these models.
>> I
think generally with, you know,
multimodal stuff like image and video,
it's very hard to hill
climb, and the kind of
historic approach has been to use a
bunch of human preference and kind of
hill climb that. Um, obviously
images are super subjective, so
you're getting
signal from a large group of people, and
it takes time, right? It's not necessarily
the fastest metric, and it takes
real hours to kind of
get anything back from it. So generally
we've been working really hard to
come up with other metrics that we
can hill climb as we train.
Um, and I think text rendering has
been a really interesting story, because
Kaushik has been, you
know, talking about it for a long time,
one of the biggest advocates of it.
And we were kind of brushing
him off for a long time, like,
you know, this guy's a little crazy,
he's really obsessed with text
rendering. Um, but eventually it
became one of the staple things we
looked at. And you can kind of think
about it like this: when the model learns
how to do this structure for text,
it's also able to learn other
structure in an image as well. In
an image you have these
different frequencies: you
have structure, which you can think
of, but you also have
texture and stuff like that. So it
really gives you signal into how
good the model is at generating the
structure of the scene. Um, and I'll let
Kaushik talk a bit more about it, because
he's the main guy.
>> Yeah. I'm also curious what the
initial conviction was. Is it just
that, as you were doing a bunch of
research experiments, it became
clear that this was the case? Yeah,
I'm curious to double click on it.
>> Yeah, I think it started from a place of
figuring out what these models were bad
at. In order to improve
any model, you need a signal for what
is not working well, and then you try a
bunch of ideas, whether related to
the model architecture, data, or other
things. Once you have that clear
signal, you can definitely make good
progress on it. And I think if we go back
a few years, there were pretty much no
models that were doing a decent job, even
on prompts that were on the order of
short lengths, like this Gemini
Nano prompt here, for example. So as we
spent more time looking into this metric
and always tracking it, right, whatever
experiment we run now, if we track
this metric, we can make sure that we
don't regress on it. And just by virtue
of having that as a signal, we might even
find that changes that we didn't
expect to make a difference here
actually do make a difference, and then
we can make sure we continue improving
that metric over time. Yeah, and
like Robert said, it's a great way to
just measure overall image quality, in
the absence of other metrics
for image quality that don't saturate
very quickly. Right? I
was actually a little bit skeptical of
the human rater approach to doing evals
for image generation. But I think what
at least I've realized over time is,
when you have enough humans looking at
enough prompts across a variety of
categories, you actually do get quite a
bit of good signal. But obviously this
is expensive. You don't want to always
be asking a bunch of humans to
grade images. So looking at this text
rendering metric, for example, while a
model is training gives you great signal
as to whether it's performing like you
expect.
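To make the text-rendering proxy metric concrete, here is a hedged sketch of the kind of check one could run externally: OCR each generated image and score how much of the requested text actually appears. The use of pytesseract and the file/prompt names are illustrative assumptions; the team's internal tooling is not described in the episode.

```python
import re

import pytesseract
from PIL import Image

def normalize(s: str) -> str:
    """Lowercase and strip punctuation so minor OCR noise doesn't dominate."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower())

def text_render_score(image_path: str, target_text: str) -> float:
    """Fraction of requested words that OCR can find in the generated image."""
    ocr_text = normalize(pytesseract.image_to_string(Image.open(image_path)))
    words = normalize(target_text).split()
    if not words:
        return 1.0
    return sum(w in ocr_text for w in words) / len(words)

# A fixed prompt set, re-scored for every candidate checkpoint, gives a fast
# regression signal without waiting hours for human-preference ratings.
cases = [("billboard_gemini_nano.png", "Gemini Nano")]
scores = [text_render_score(path, text) for path, text in cases]
print(f"mean text-render score: {sum(scores) / len(scores):.3f}")
```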
That's super interesting. I'm curious
about this interplay between the
native image generation capability and
the native image understanding capability.
We did an episode with Ani, and
that team has obviously been pushing
super hard; Gemini has state-of-the-art
image understanding. Is it a
reasonable mental model that, as our
models get better at understanding images,
some of that capability is
actually transferable to generation as
well, and vice versa? Is that
>> I think, yeah. Um, so basically the hope
with native multimodal understanding and
generation, learning all these modalities
and different capabilities in the same
model within the same training run, is
that you end up having
positive transfer across these different
axes, right? And it's not only about
understanding and generation
for a single modality; it's also
about whether we can learn something
about the world from
images or videos or audio that is going
to help us on the text
understanding or text generation side.
So for sure, image understanding and image
generation are like sisters. We
definitely see they're going
hand in hand in interleaved generation,
for example. But the ultimate goal,
let me just give
you one example. In
language we have this
phenomenon that we call reporting
bias, and what it means is that you go to
your friend's place, and when you come
back you never talk about their normal
sofa in a conversation, right? But if you
show someone an image of that room, it's
there. So if you want to learn
about a lot of things in the whole world,
images and videos have that
information there without, you know,
an explicit request
for that information. What I
want to say is that eventually with text,
or with other
modalities, you can learn a lot about
different things, but it might take
more tokens. So visual signals are
definitely a good shortcut for
learning about the world. And back to the
understanding and generation question,
as I said, these two go
hand in hand, and coming to
interleaved generation, you can see
that there's actually a huge help
from understanding to better generation,
and the other way around. Image
generation can help, like,
you know, you draw something on
the board to solve a problem, so
maybe you can better understand
a problem that is given to you as
a visual image. Um, so maybe we
can actually show some, yeah,
>> interleaved generation that is kind of
related to understanding and generation
going hand in hand with text as well.
Um, let me do transform this subject
into a 1980s American glamour mall shot
in five different ways.
All right, fingers crossed this works.
Okay, this looks promising.
And this takes obviously a little bit
longer, right? Because we're trying to
generate multiple images and then we're
also trying to generate the text that
would describe what's in those images.
And one of the things that you'll notice
about native image generation is that
it's generating these images one after
another. So the model may choose to look
at a previous image and either try to
generate something very different from
it or try to generate a minor
modification of it. It at least has that
context of what is already generated. So
that's what we mean by native image
generation models. They have access to
multimodal context and then they
generate an image.
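Here is a minimal sketch of what the interleaved request demoed above could look like from the API side, assuming the google-genai Python SDK and a placeholder model name; the response interleaves text and image parts in generation order, so each image can condition on everything generated before it.

```python
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
subject = Image.open("logan.jpg")  # stand-in for the photo used in the demo

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model id
    contents=[subject, "Transform this subject into a 1980s American glamour "
                       "mall shot in five different ways."],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Parts come back in the order they were generated, so each image could depend
# on the text and images produced before it.
image_index = 0
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)  # e.g. the name/description of the next variation
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"mall_{image_index}.png")
        image_index += 1
```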
>> Yeah, that's interesting. My
mental model had always been, and I guess
maybe that doesn't
even make sense, that it would have just
been like four independent forward
passes or something like that, but this
is actually all in a single... it's
all in the context of the model.
>> All in the context of the model. That's
super interesting.
>> And what's nice is then the style is
kind of similar, right? It's also the
model's doing this funny thing where it
has you twice in every single
>> Interesting. Could we make some of
these full screen?
>> I'm going to make some of these. So this
is the arcade king logo.
Um, if we scroll, this is red dude.
And see, none of these descriptions
that go with the images were something
that we came up with. The prompt was
just, like, you as a 1980s American
glamour mall shot. Um, "Maltt,"
you should consider some of these
outfits as attire.
>> And the fourth option, chill bro. See,
and like you have a different outfit in
all of them. They all look like you. Um,
the fact that you were there twice is
probably a little bit of a failure mode.
Um, but it's really cool to be able to
see the model kind of come up with these
five separate ideas. Um, give them
different names, give you different
outfits, right? And like keep the
character consistent. Um, and this is
not just useful for character building,
but this is also useful if you have a
picture of your room.
>> Yeah.
>> And you can say like, "Hey, help me
decorate this in five different ways,
right? And maybe you can go from like
really creative to maybe something more
conservative that's a little bit more
incremental to what you're doing." Um,
and we've seen a lot of people on the
team already using it to like redesign
their gardens and homes. And it's been
really cool to see that as a more kind of
practical application, not just us
making fun of
>> Yeah. I
>> 80s Logan.
>> I vibe coded an app for my girlfriend in
AI Studio, actually, to visualize
her office with every different color
of blinds or curtains. She was
like, I don't know what curtain color is
going to fit this vibe. So, it literally
just... this was with 2.0, you know, and
I'll have to retry it with 2.5 to check
all the different vibes. It actually
worked really well. It was very helpful.
Sometimes with 2.0, and actually this will
be a good thing to retest, it
would change the bed, or
other artifacts would change, not
just the curtain. Um,
>> so it was interesting to see that use
case. One of my favorites.
>> You should give it a try. The model
does a pretty good job keeping the rest
of the scene consistent, and we
call this kind of pixel-perfect editing.
Um, and that's really important, right?
Because sometimes you want to just edit
that one thing in your image, but you
actually want everything else to stay
the same. Again, if you're doing
character building, you just want to
turn the character's head, but
everything that they're wearing needs to
stay the same across the scenes. And
the model's really good at that. Um, it
will not always 100% work, um, but we're
really excited about how far it's come.
>> Robert, you're going to say something?
Yeah.
>> Yeah. I was going to say like I think
one really cool thing is like just how
fast it is still, right? Like you know,
>> How long was this whole thing? All
right, let's give this a... this
is 13 seconds.
>> Wow. So I think I think each image was
>> each each image was 13 seconds, right?
Is is
>> And so
>> Okay. This is the cumulative now.
>> Yeah. Yeah.
>> This is me, not AI Studio.
>> Yeah. So I think the cool
thing is, even when 2.0
came out, I was using it for very
similar things. I had a bookshelf,
I had all the stuff on the ground, and
I'm like, decorate this, what
configuration of these items should be
placed on my bookshelf? And, you know, my
girlfriend might not have agreed with
the output, so sometimes we want to
iterate on that. And so
rerunning it really quickly and
iterating, even if sometimes it
kind of fails, you just tweak the
prompt, rerun it, and you get something
really good afterwards. So I think
this iterative process of creating
is kind of the magic behind it.
>> And any difference for folks who had
tried 2.0? One
of the examples for me using 2.0
was wanting to do
only single edits, one at a time. If
you had asked it to
change six different things, the
model would sometimes not do a great
job of that. Is that
still something that you should
do, those types of targeted edits
with this model? Or any other
general usability things
that folks should know as they're
playing around with the model?
>> This is something that I wanted to
mention, basically. One of the
magics of interleaved generation is that
it offers you a new paradigm for
image generation, right? So if you
have a very complex prompt, you know,
you're talking about six different edits,
what if I go with 50 different
edits, right? Now that the model has
a really good mechanism to grab
information from the context, pixel-perfect,
and use it in the next turn, what
you can do is ask the model to
break down the complex prompt, whether it
is editing or image generation, into
multiple steps, and do the edits
one by one over different steps. So for
the first step you do, say, five
different things, and then for
the next one the next five, and
so on and so forth. It's very
similar to the test-time compute that we
have on the language side, right? You
spend more flops and you let the
model bring this
thinking into the
pixel space, plus breaking it down
into smaller pieces means you can
really nail down that specific stage,
but accumulated, you can do
whatever complex task you want. So
I think, again, this is the magic of
interleaved generation: you can think
about incremental generation of
really complex images, as opposed to the
traditional way of doing it, which was
pushing really hard for
getting the best image in one shot,
right? At the end of the day,
there's a capacity to which you
can push the model. At some
point you realize that, okay, you know,
with 100 details, we cannot do that.
But when you have this interleaved
generation breaking it into steps, you
can always go for any capacity and any
complexity that you want to generate.
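A rough sketch of the step-by-step editing paradigm described above: split a long list of edits into small batches and apply them over successive chat turns, so each stage only has to nail a few changes while the context carries everything forward. The SDK usage, model name, batch size, and edit list are all illustrative assumptions, not details from the episode.

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
chat = client.chats.create(model="gemini-2.5-flash-image-preview")  # assumed id

edits = [
    "make the sky a warm sunset",
    "add a neon sign above the door",
    "replace the car with a bicycle",
    "put potted plants on the balcony",
    "change the awning to striped fabric",
    "add string lights across the street",
    # ...a much longer list could continue here
]

def latest_image(response):
    """Return the last image part of a response as a PIL image, if any."""
    image = None
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            image = Image.open(BytesIO(part.inline_data.data))
    return image

current = Image.open("scene.jpg")
BATCH = 5  # arbitrary; small enough for the model to nail each stage
for start in range(0, len(edits), BATCH):
    batch = edits[start:start + BATCH]
    prompt = ("Apply only these edits and keep everything else in the image "
              "unchanged:\n- " + "\n- ".join(batch))
    response = chat.send_message([current, prompt])
    current = latest_image(response) or current  # carry the result forward

current.save("final.png")
```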
>> One of the things that's
always top of mind for me, especially
since, Nicole, you're also the
PM for our Imagen models: how
should people think about this, as
developers or just people who
have knowledge of all the models,
Imagen versus this native
capability that we have?
>> Yeah, and you know this, but our goal is
to always build one model with
Gemini, right? Ultimately our goal
is to bring all the
modalities into Gemini so we can benefit
from all the knowledge transfer that
Mostafa was talking about and ultimately
build towards AGI. On the way there,
there's a lot of usefulness in
having specialized models that are just
very, very good at a specific thing that
you need them to do, and Imagen is an
amazing model for text-to-image
generation, right? And we have a lot of
different Imagen variants that also do
image editing, and those are available in
Vertex AI. They're just optimized
for that specific task, right? So, if
you just want text-to-image, and you want
just one image out of that model, and you
want really amazing visual quality, and
you also want that to be really
cost-effective and kind of snappy in
generation time, Imagen is the
place to go, right? If you want some
of these more complex workflows,
where you want to generate with the
model but then you also want to edit in
that same workflow, and you want to do it
across multiple turns, or you want to do
some of this ideation like we were
doing with the model, like, you know,
what design ideas could you help me
come up with for my room or
this library, then Gemini is the place
to go, right? It really is kind of
that more multimodal creative
partner, where it can output images,
it can output text. You can be kind
of less precise with the instructions
that you give to Gemini, like
when at the beginning we
said "make it nano," because it has
that kind of world understanding, it
will just more creatively interpret your
instructions. But Imagen is still a
great family of models for developers to
go to if they want a super
optimized model for that specific task.
>> Yeah. One of the examples I was
trying today, and I'm curious what your
take is on which model, or if the
native image generation model fixes this
problem: I was saying, generate
this image and make the... this is
my dumb billboard use case again. I
need billboards. Make the billboard in
the style of some company that I
mentioned. Is that something that
native image generation benefits from,
because it's a little bit better at
this world knowledge piece, relative to
Imagen being really good
if you give it a good prompt, but
less good at
inferring the intent behind my prompts?
>> Your actual intent behind it, yeah.
So I think that's part of it. The
other part is, with native image
generation, if you just want to grab
that style reference that you have from
that other company whose style you
were trying to emulate,
you can also insert that into the model
and use that as a reference, right? The
fact that you can also input an
image as a reference helps with
that prompt, and that is just easier to
do in Gemini natively than it is in
Imagen. So you should try it, and
you should let us know; we should add
this to our evals.
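A small sketch of the style-reference idea Nicole mentions: pass the reference image alongside the instruction so the model can match its look rather than infer it from a brand name alone. The SDK usage, model name, and file names are assumptions.

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
style_ref = Image.open("brand_style_reference.png")  # e.g. a poster screenshot

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model id
    contents=[
        style_ref,
        "Generate a city billboard that says 'Gemini Nano', matching the "
        "colors, typography, and overall visual style of this reference image.",
    ],
)

# Save whatever image parts come back.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"billboard_{i}.png")
```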
>> I'll let you know whether or not
the billboard use case works. I'll make a
billboard eval.
Boen email.
>> I love that. Back to this
thread of the progress from 2.0:
one of the most fun things was, when that
model launched, people were sending us
tons of feedback about the experience in
AI Studio and then ultimately the Gemini
app, just general failure modes
for the model and all that stuff. I
made my only contribution to the
original launch, which was adding that
hot tag in AI Studio. We're bringing the
hot tag back for this model, actually, and
it's going to go away on the other
model. Can we talk
about that story of the
progress and the failure modes
that we did get a ton of feedback on,
things that didn't work well for
2.0 that now hopefully work well for
2.5?
>> Yeah, I mean we literally sat on
X, or Twitter, and went through
a bunch of feedback, and literally I
remember Kaushik and I and some of
the other team gathering all the
failure cases and making evals out of
that. So we have a benchmark that
we take from real user feedback,
just from Twitter, and it's just people
tagging us and saying, hey, this
didn't work. And for every model we
make in the future, we kind of just
append to that. So we know, for
example, when we released 2.0,
one of the failure cases we sometimes
would see is that if you edit, it would
add your edit, but it wouldn't
necessarily be consistent with the
rest of the image, right? So that was
one of the things that was in that set
and that we hill climbed, and then
there's plenty more. So we're
always just gathering that feedback.
Yeah, send us the examples that
yeah send us send us the examples that
uh that don't work well any any ones for
you all that like particularly stand out
of things that like just did not work
before that now is like a slam dunk. I
don't know if there's anything top of
mind. You you all play with I think the
like the team plays with this model I
think so much in the I assume in the
process as we're actually building it
and bringing it to life. I don't know if
there's any like go-to use cases for you
all to test and like is this actually a
good model?
>> Yeah, I think one thing I've noticed
specifically while playing with the 2.5
model, compared to the 2.0 model:
actually, one of the things that we
thought was going to be hard was
consistency from image to image, but
specifically the cases where you have an
object or say like a character that
you're building and you want that
character to remain consistent across
images. And uh if you actually leave the
character in the same place that it was
in the input image, it turns out that
this is actually quite easy and the 2.0
model could do this really well. It
could, for example, add a hat, change
the expression and stuff like that while
kind of keeping the pose and overall
structure of the scene the same. What
the 2.5 model adds on top of what these
capabilities look like in 2.0
uh is that you can ask to for example
render the character from different
angles and it will look like the exact
same character but from for example the
side. Or you could take a piece of
furniture and place it into a completely
different context, reorient it, and
create a whole scene, but that piece of
furniture would remain faithful to
the original that you uploaded, while
transforming it in very substantial
ways, rather than just taking the input
image and pasting those pixels into the
output image.
>> I love that. One of the
reactions that I had about some of the
2.0 stuff was, sometimes when you would
add something, like, I'd put in a picture
of my face and add a goofy mustache or a
hat or something, it almost looked like
it was superimposed, or kind of
photoshopped onto it. Is that
something that is similar to this
character consistency? It seems
tangential to it,
but it feels like a
similarish problem, where it's just
taking pixels from memory and
putting them into the image, almost,
versus the pixel transfer. I'm curious
if that's a capability that's
improved.
>> Yeah. And actually I think that comes
down a lot to the actual teams
working on this model. With the previous
model, we were kind of of the
mindset that, okay, it did the edit,
that's it, it was successful. But when we
started working more and more closely
with the Imagen teams, they would
look at the same exact edit
that we were looking at from the Gemini
side, and they'd say, this is terrible,
why would you ever want the model to do
something like this? So this is
one example where blending the
perspectives from both teams helps: on
the Gemini side the instruction
following, world knowledge, all of these
things, and then on the Imagen side
making the images actually look natural,
aesthetically pleasing, and genuinely
useful. I think it takes both of
these, and having these teams work
together on this led to 2.5 being much
better at the stuff you're describing.
>> I love it. Um,
>> Yeah, and just on that point, we
actually have folks on the team, who
mostly come from the Imagen team, who
have a really honed aesthetic
taste. And so a lot of the time
when we do evals, they will actually
just look at hundreds and thousands
of images and be like, "No, this model
is better than this other model." And a
lot of other people on the team will
kind of look at it and be like, "Okay..."
You kind of
have to hone that
sensibility over a couple of years, and
I've gotten a lot better at it over the
years. But there's definitely people
on the team who are amazing at it,
and we always go to them when we try to
pick between models.
>> Can you train auto-raters on
people's personal taste?
>> We haven't been able to do it yet.
>> Fun side project.
>> That's a fun side project. I'm very
excited, as Gemini gets better
at understanding, to have an
aesthetic auto-rater based on, you know,
one of the folks on the team who is
really amazing at this.
>> Just put that person to work providing
training signal.
>> Yes. Yes. And we'll take
that as a side project after this.
>> I love that. Um, lots of progress on 2.5,
and obviously I think folks are
going to be super excited to try out the
model and all that stuff. What comes
next? We've made a great model. I'm
sure we have more stuff cooking in the
pipeline, but I don't know how much
we want to say about the future
direction and what other
capabilities hopefully will land in
the future.
>> Um, so when it comes to image
generation, I think we do care
about the visual quality, but
one thing that is new,
and that we want with a unified
omni model, is smartness. You know,
you want your image generation model to
feel smart. When users interact
with this, not only are they impressed by
the quality of the images, but they feel
like, wow, this is smart. One
example that I have in mind, and I'm
looking forward to seeing this
happening, and it's a bit controversial
because I cannot even define it well,
is when I ask the model to do something
and it doesn't follow my instruction, but
it does something where, at
the end of the generation, I say I'm
glad that, you know, it didn't
follow my instruction, it's even better
than what I actually described,
right? So it has this kind of
edge to it, you know.
>> Is that, like, you think the model
is intentionally doing this, or is it
kind of an
unintended accident? Is that what you're
trying to say?
>> No, no, it's not just
that. Basically, you know,
sometimes you're underspecified, or
sometimes you think wrong about
something that is a reality,
but the outside world, with the
knowledge of Gemini, is different
from your perspective, right?
>> And I think, again, it's not
intentional, it just happens
organically, and you just feel
that you're interacting with
a system that is smarter than you,
right? And when I'm asking
for some images, I don't mind
if it goes off the rails with my
prompt and generates something
that is different from what I asked,
because it's most of the time better
than what I had in mind. So I
think, definitely, smartness at a high
level is the direction that we
are pushing forward, while maintaining
the visual quality or improving it.
But there are so many specifics
and capabilities and use cases,
especially for developers, that,
I think this release has some, but the
next release is going to have more.
And yeah, we have these
coming releases in the pipeline. I
cannot share the timeline, but
it's just so exciting. Yeah,
maybe I should. But I'm
so excited. I'm happy, and
the momentum is unmatched here
on the image generation side.
>> I love that. Any other
capabilities folks are excited about?
>> I'm really excited about factuality. Um,
and so that kind of goes back
to the point that sometimes
maybe you need to make a little diagram
or an infographic for a work
presentation, right? And it's
amazing if it looks nice, but that's not
enough for that case. It
actually has to be
accurate. You can't have any
extraneous text. It just kind of
has to both look good and also be
functional for that purpose. And I think
we're just scratching the surface on
what these models can do with that. And
I'm really excited about some of
these upcoming releases, us getting
better at that type of use case, so that,
my dream one day is that these
models can actually make a slide deck
for me for work that looks nice.
>> This is every PM's dream.
>> Every PM's dream. I'm trying to outsource
that part of my job to Gemini. And I
think we play a really big part in it.
So
>> Awesome. I love it. Well, I think folks
are going to be super excited to try
these models. Thank you, all four of
you, and the rest of the team, for
making this happen. I appreciate all
the hard work. I'm excited for this. Um,
and thanks, everyone, for watching
Release Notes. We'll see you in the next
episode.
[Music]