Behind the scenes of Google's state-of-the-art "nano-banana" image model
By Google for Developers
Summary
## Key takeaways

- **Gemini 2.5 Flash: A Leap in Image Generation**: Google's Gemini 2.5 Flash represents a significant quality improvement in image generation and editing, offering state-of-the-art capabilities that impress users with both visual quality and perceived intelligence. [00:05], [01:05]
- **Natural Language Editing & "Nano Banana"**: Users can interact with the model using natural language for complex edits, like transforming an image into a "nano banana" version, demonstrating the model's creative interpretation and ability to maintain scene consistency across multiple turns. [01:31], [02:31]
- **Text Rendering as a Quality Proxy**: Text rendering, initially a challenging area, has become a key metric for evaluating overall image generation quality, as the model's ability to structure text implies a better grasp of image structure. [04:09], [06:14]
- **Cross-Modal Learning Drives Progress**: Capabilities developed for image understanding positively transfer to image generation and vice versa, with the ultimate goal of a unified multimodal model that learns across different modalities like images, video, and audio. [08:38], [09:52]
- **Interleaved Generation for Complex Tasks**: Interleaved generation allows for complex editing and ideation by breaking down prompts into multiple steps, enabling incremental creation and offering a new paradigm for generating intricate images that maintains context and consistency. [11:25], [16:23]
- **User Feedback Fuels Model Improvement**: Real-world user feedback, gathered from platforms like Twitter, is crucial for identifying failure modes and driving improvements in subsequent model releases, such as enhancing character consistency and overall aesthetic appeal. [20:59], [22:14]
Topics Covered
- The Magic of Iterative Image Generation
- Creative AI Interpretation: From Banana Costumes to Nano Versions
- Text Rendering: A Key to Understanding Image Structure
- When AI's 'Mistakes' Lead to Better Results
- Factuality and Functionality: The Future of Image Generation
Full Transcript
Today we're talking about native image
generation with the team behind the new
model that we're releasing.
>> It's a giant quality leap, the model's
state-of-the-art, and we're really
excited about both the generation and
editing capabilities.
>> You can ask to, for example, render the
character from different angles and it
will look like the exact same character.
>> When users interact with this, not only
are they impressed by the quality of the
images, but they feel like, "Wow, this
is smart."
>> And you can kind of have a fun conversation
with the model over multiple turns. So I
think this iterative process of
creating is kind of the magic
behind it
>> and I think we're just scratching the
surface um on what these models can do.
>> Hey everyone, welcome back to release
notes. My name is Logan Kilpatrick. I'm
on the Google DeepMind team. Today we're
joined by Kaushik, Robert, Nicole, and
Mostafa. These are the folks doing
research and product for our Gemini
native image generation model, which
we're here to talk about today, and which
I'm super excited about. So Nicole, you
want to kick us off? What's the good
news? I'm excited to hear about the release.
>> Yeah, we're releasing an update to our
image generation and editing
capabilities in Gemini 2.5
Flash. And it's a giant quality leap.
Um, the model's state-of-the-art and
we're really excited about both the
generation and editing capabilities. Um,
and why don't I just show you what the
model does, because that's the best way
to kind of get that across.
>> I'm excited. I played around with it
like once, but I have not done as much
playing around as y'all have. So, I'm
excited to see some examples.
>> Um, great. I'm I'm going to take a
picture of you.
>> Okay. Um, and let's just start with,
let's say, zoom out and show him wearing
a giant banana costume and keep his face
visible, because we want to make sure it,
you know, looks like you.
All right,
it's going to take a couple of seconds
to generate, but it's still it's still
pretty snappy, which I think you
remember from our last release. Like, it
was a pretty fast model. Um,
>> This was one of my favorite things, cuz
I feel like this pace of editing
makes these models a ton of fun to
play with. Can you make it slightly
bigger for me? Can you just... You can go
full screen, I think. Click on this.
Click on this.
>> Let me just click on this. So, there we
go. This is Logan. This is still
your face. And what's awesome about this
model is that this still looks like you,
right? Like, this is you, but it's
actually like you're wearing a giant
banana costume, and now there's a
nice background of you walking through a
city.
>> That's so interesting, because this
picture is in Chicago, and that actually
is pretty much what that
street looks like. So
>> World knowledge coming through on
this model. Um, and now let's keep going
and let's say make it nano.
>> What does that mean? What does make it
nano mean?
>> So let's see.
>> Let's see what the model does. Um,
when we first released it on LMArena,
we gave it the code name Nano
Banana.
>> Yeah.
>> And people started speculating that it's
an updated model from us. And it is an
updated model from us. And there
you go. Now the model takes you and
creates this like cute nano version of
you wearing a giant banana costume.
>> I love that. That's awesome.
>> And the awesome thing here is obviously
like this was a very vague prompt,
right? Like you were like, "What does
this mean?"
>> I actually did not know what that meant.
Um, but the model's creative enough
to kind of interpret it and then, you
know, create a scene where
it fulfills your prompt,
it still makes sense in the context,
and it keeps all the rest of the scene
relevant. Um, and this is really
exciting because um, it's the first time
I think that we're seeing kind of LLMs
be really able to like keep the scene
consistent across these multiple edits
and have users use really natural
language to interact with the model,
right? I don't have to put in a super
long prompt. Like I'm just giving it
very natural language instructions and
can kind of have a fun conversation with
the model over multiple turns. Um, so
that's super exciting.
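For readers who want to try the multi-turn editing flow Nicole demos here, the sketch below shows roughly how it could look with the google-genai Python SDK. The API key placeholder, the model identifier ("gemini-2.5-flash-image-preview"), and the file names are illustrative assumptions rather than details from the episode; check the official Gemini API docs for the exact values.

```python
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# One chat session keeps every edit in the same multimodal context.
chat = client.chats.create(
    model="gemini-2.5-flash-image-preview",  # assumed model id; check the docs
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

def save_images(response, prefix: str) -> None:
    """Save any image parts returned in a response."""
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data is not None:
            Image.open(BytesIO(part.inline_data.data)).save(f"{prefix}_{i}.png")

photo = Image.open("logan.jpg")  # stand-in for the photo taken in the demo

# Turn 1: a natural-language edit on the uploaded photo.
r1 = chat.send_message(
    [photo, "Zoom out and show him wearing a giant banana costume, "
            "and keep his face visible."]
)
save_images(r1, "banana")

# Turn 2: a vague follow-up; the model reinterprets the scene it just made
# while keeping the rest of it consistent.
r2 = chat.send_message("Make it nano.")
save_images(r2, "nano_banana")
```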
>> I love that. How good is it at like text
rendering stuff, which is one of the use
cases I care the most about.
>> Um, do you want me to
>> Yeah. Yeah.
>> put something on this picture. Why don't
you give me a prompt?
>> Um,
Gemini Nano.
That's the only nano thing that comes to
mind.
I feel like the use case that
I'm always trying to do is
announcement tweets with billboards
with text on them. That's
my use case.
>> All right, let's go. There you go.
>> Nice.
>> Um and so this is a relatively simple
text, right? It's a pretty small number
of letters, like easy words, and that
worked really well. Um we do have some
gaps in text rendering that we call out
in the release. Um, and we're working
really hard on it. Folks on the team, um,
Kaushik maybe can talk about that, are
working on making text rendering even
better in our next model.
>> I love it. Any other examples you want
to show, or is there any other metric
story around this launch? I know one of
the challenges, and I'm curious how you
all think about this, is that the eval
story is a lot of human preference stuff,
that's what you're measuring. It's hard
to have a source of truth, though I think
there's probably some things that you
could have a source of truth on. But I'm
curious how you all think about that for
this release, but also just in general as
we're training these models.
>> I
think generally with, you know,
multimodal stuff like image and video,
it's very hard to hill
climb, and the kind of
historic approach has been to use a
bunch of human preference and kind of
hill climb that. Um, obviously
images are super subjective, so
you're getting
signal from a large group of people, and
it takes time, right? It's not necessarily
the fastest metric, and it takes
real hours to kind of
get anything back from it. So generally
we've been working really hard to
come up with other metrics that we
can hill climb as we train.
Um, and I think text rendering has
been a really interesting story, because
Kaushik has been, you
know, talking about it for a long time,
one of the biggest advocates of it.
And we were kind of brushing
him off for a long time, like,
you know, this guy's a little crazy,
he's really obsessed with text
rendering. Um, but eventually it
became one of the staple things we
looked at. And you can kind of think
about it like this: when the model learns
how to do this structure for text,
it's also able to learn other
structure in an image as well. In
an image you have these
different frequencies: you
have structure, which you can think
of, but you also have
texture and stuff like that. So it
really gives you signal into how
good the model is at generating the
structure of the scene. Um, and I'll let
Kaushik talk a bit more about it, because
he's the main guy.
>> Yeah. I'm also curious what the
initial conviction was. Is it just
that, as you were doing a bunch of
research experiments, it became
clear that this was the case? Yeah,
I'm curious to double click on it.
>> Yeah, I think it started from a place of
figuring out what these models were bad
at. In order to improve
any model, you need a signal for what
is not working well, and then you try a
bunch of ideas, whether related to
the model architecture, data, or other
things. Once you have that clear
signal, you can definitely make good
progress on it. And I think if we go back
a few years, there were pretty much no
models that were doing a decent job, even
on prompts that were on the order of
short lengths, like this Gemini
Nano prompt here, for example. So as we
spent more time looking into this metric
and always tracking it, right, whatever
experiment we run now, if we track
this metric, we can make sure that we
don't regress on it. And just by virtue
of having that as a signal, we might even
find that changes that we didn't
expect to make a difference here
actually do make a difference, and then
we can make sure we continue improving
that metric over time. Yeah, and
like Robert said, it's a great way to
just measure overall image quality, in
the absence of other metrics
for image quality that don't saturate
very quickly. Right? I
was actually a little bit skeptical of
the human rater approach to doing evals
for image generation. But I think what
at least I've realized over time is,
when you have enough humans looking at
enough prompts across a variety of
categories, you actually do get quite a
bit of good signal. But obviously this
is expensive. You don't want to always
be asking a bunch of humans to
grade images. So looking at this text
rendering metric, for example, while a
model is training gives you great signal
as to whether it's performing like you
expect.
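To make the text-rendering proxy metric concrete, here is a hedged sketch of the kind of check one could run externally: OCR each generated image and score how much of the requested text actually appears. The use of pytesseract and the file/prompt names are illustrative assumptions; the team's internal tooling is not described in the episode.

```python
import re

import pytesseract
from PIL import Image

def normalize(s: str) -> str:
    """Lowercase and strip punctuation so minor OCR noise doesn't dominate."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower())

def text_render_score(image_path: str, target_text: str) -> float:
    """Fraction of requested words that OCR can find in the generated image."""
    ocr_text = normalize(pytesseract.image_to_string(Image.open(image_path)))
    words = normalize(target_text).split()
    if not words:
        return 1.0
    return sum(w in ocr_text for w in words) / len(words)

# A fixed prompt set, re-scored for every candidate checkpoint, gives a fast
# regression signal without waiting hours for human-preference ratings.
cases = [("billboard_gemini_nano.png", "Gemini Nano")]
scores = [text_render_score(path, text) for path, text in cases]
print(f"mean text-render score: {sum(scores) / len(scores):.3f}")
```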
That's super interesting. I'm curious
about this interplay between the
native image generation capability and
the native image understanding capability.
We did an episode with Ani, and
that team has obviously been pushing
super hard; Gemini has state-of-the-art
image understanding. Is it a
reasonable mental model that, as our
models get better at understanding images,
some of that capability is
actually transferable to generation as
well, and vice versa? Is that
>> I think, yeah. Um, so basically the hope
with native multimodal understanding and
generation, learning all these modalities
and different capabilities in the same
model within the same training run, is
that you end up having
positive transfer across these different
axes, right? And it's not only about
understanding and generation
for a single modality; it's also
about whether we can learn something
about the world from
images or videos or audio that is going
to help us on the text
understanding or text generation side.
So for sure, image understanding and image
generation are like sisters. We
definitely see they're going
hand in hand in interleaved generation,
for example. But the ultimate goal,
let me just give
you one example. In
language we have this
phenomenon that we call reporting
bias, and what it means is that you go to
your friend's place, and when you come
back you never talk about their normal
sofa in a conversation, right? But if you
show someone an image of that room, it's
there. So if you want to learn
about a lot of things in the whole world,
images and videos have that
information there without, you know,
an explicit request
for that information. What I
want to say is that eventually with text,
or with other
modalities, you can learn a lot about
different things, but it might take
more tokens. So visual signals are
definitely a good shortcut for
learning about the world. And back to the
understanding and generation question,
as I said, these two go
hand in hand, and coming to
interleaved generation, you can see
that there's actually a huge help
from understanding to better generation,
and the other way around. Image
generation can help, like,
you know, you draw something on
the board to solve a problem, so
maybe you can better understand
a problem that is given to you as
a visual image. Um, so maybe we
can actually show some, yeah,
>> interleaved generation that is kind of
related to understanding and generation
going hand in hand with text as well.
Um, let me do transform this subject
into a 1980s American glamour mall shot
in five different ways.
All right, fingers crossed this works.
Okay, this looks promising.
And this takes obviously a little bit
longer, right? Because we're trying to
generate multiple images and then we're
also trying to generate the text that
would describe what's in those images.
And one of the things that you'll notice
about native image generation is that
it's generating these images one after
another. So the model may choose to look
at a previous image and either try to
generate something very different from
it or try to generate a minor
modification of it. It at least has that
context of what is already generated. So
that's what we mean by native image
generation models. They have access to
multimodal context and then they
generate an image.
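Here is a minimal sketch of what the interleaved request demoed above could look like from the API side, assuming the google-genai Python SDK and a placeholder model name; the response interleaves text and image parts in generation order, so each image can condition on everything generated before it.

```python
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
subject = Image.open("logan.jpg")  # stand-in for the photo used in the demo

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model id
    contents=[subject, "Transform this subject into a 1980s American glamour "
                       "mall shot in five different ways."],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Parts come back in the order they were generated, so each image could depend
# on the text and images produced before it.
image_index = 0
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)  # e.g. the name/description of the next variation
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"mall_{image_index}.png")
        image_index += 1
```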
>> Yeah, that's interesting. My
mental model had always been, and I guess
maybe that doesn't
even make sense, that it would have just
been like four independent forward
passes or something like that, but this
is actually all in a single... it's
all in the context of the model.
>> All in the context of the model. That's
super interesting.
>> And what's nice is then the style is
kind of similar, right? It's also the
model's doing this funny thing where it
has you twice in every single
>> Interesting. Could we make some of
these full screen?
>> I'm going to make some of these. So this
is the arcade king logo.
Um, if we scroll, this is red dude.
And see, none of these descriptions
that go with the images were something
that we came up with. The prompt was
just, like, you as a 1980s American
glamour mall shot. Um, "Maltt,"
you should consider some of these
outfits as attire.
>> And the fourth option, chill bro. See,
and like you have a different outfit in
all of them. They all look like you. Um,
the fact that you were there twice is
probably a little bit of a failure mode.
Um, but it's really cool to be able to
see the model kind of come up with these
five separate ideas. Um, give them
different names, give you different
outfits, right? And like keep the
character consistent. Um, and this is
not just useful for character building,
but this is also useful if you have a
picture of your room.
>> Yeah.
>> And you can say like, "Hey, help me
decorate this in five different ways,
right? And maybe you can go from like
really creative to maybe something more
conservative that's a little bit more
incremental to what you're doing." Um,
and we've seen a lot of people on the
team already using it to like redesign
their gardens and homes. And it's been
really cool to see that as a more kind of
practical application, not just us
making fun of
>> Yeah. I
>> 80s Logan.
>> I vibe coded an app for my girlfriend in
AI Studio, actually, to visualize
her office with every different color
of blinds or curtains. She was
like, I don't know what curtain color is
going to fit this vibe. So, it literally
just... this was with 2.0, you know, and
I'll have to retry it with 2.5 to check
all the different vibes. It actually
worked really well. It was very helpful.
Sometimes with 2.0, and actually this will
be a good thing to retest, it
would change the bed, or
other artifacts would change, not
just the curtain. Um,
>> so it was interesting to see that use
case. One of my favorites.
>> You should give it a try. The model
does a pretty good job keeping the rest
of the scene consistent, and we
call this kind of pixel-perfect editing.
Um, and that's really important, right?
Because sometimes you want to just edit
that one thing in your image, but you
actually want everything else to stay
the same. Again, if you're doing
character building, you just want to
turn the character's head, but
everything that they're wearing needs to
stay the same across the scenes. And
the model's really good at that. Um, it
will not always 100% work, um, but we're
really excited about how far it's come.
>> Robert, you're going to say something?
Yeah.
>> Yeah. I was going to say like I think
one really cool thing is like just how
fast it is still, right? Like you know,
>> How long was this whole thing? All
right, let's give this a... this
is 13 seconds.
>> Wow. So I think I think each image was
>> each each image was 13 seconds, right?
Is is
>> And so
>> Okay. This is the cumulative now.
>> Yeah. Yeah.
>> This is me, not AI Studio.
>> Yeah. So I think the cool
thing is, even when 2.0
came out, I was using it for very
similar things. I had a bookshelf,
I had all the stuff on the ground, and
I'm like, decorate this, what
configuration of these items should be
placed on my bookshelf? And, you know, my
girlfriend might not have agreed with
the output, so sometimes we want to
iterate on that. And so
rerunning it really quickly and
iterating, even if sometimes it
kind of fails, you just tweak the
prompt, rerun it, and you get something
really good afterwards. So I think
this iterative process of creating
is kind of the magic behind it.
>> And any difference for folks who had
tried 2.0? One
of the examples for me using 2.0
was wanting to do
only single edits, one at a time. If
you had asked it to
change six different things, the
model would sometimes not do a great
job of that. Is that
still something that you should
do, those types of targeted edits
with this model? Or any other
general usability things
that folks should know as they're
playing around with the model?
>> This is something that I wanted to
mention, basically. One of the
magics of interleaved generation is that
it offers you a new paradigm for
image generation, right? So if you
have a very complex prompt, you know,
you're talking about six different edits,
what if I go with 50 different
edits, right? Now that the model has
a really good mechanism to grab
information from the context, pixel-perfect,
and use it in the next turn, what
you can do is ask the model to
break down the complex prompt, whether it
is editing or image generation, into
multiple steps, and do the edits
one by one over different steps. So for
the first step you do, say, five
different things, and then for
the next one the next five, and
so on and so forth. It's very
similar to the test-time compute that we
have on the language side, right? You
spend more flops and you let the
model bring this
thinking into the
pixel space, plus breaking it down
into smaller pieces means you can
really nail down that specific stage,
but accumulated, you can do
whatever complex task you want. So
I think, again, this is the magic of
interleaved generation: you can think
about incremental generation of
really complex images, as opposed to the
traditional way of doing it, which was
pushing really hard for
getting the best image in one shot,
right? At the end of the day,
there's a capacity to which you
can push the model. At some
point you realize that, okay, you know,
with 100 details, we cannot do that.
But when you have this interleaved
generation breaking it into steps, you
can always go for any capacity and any
complexity that you want to generate.
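A rough sketch of the step-by-step editing paradigm described above: split a long list of edits into small batches and apply them over successive chat turns, so each stage only has to nail a few changes while the context carries everything forward. The SDK usage, model name, batch size, and edit list are all illustrative assumptions, not details from the episode.

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
chat = client.chats.create(model="gemini-2.5-flash-image-preview")  # assumed id

edits = [
    "make the sky a warm sunset",
    "add a neon sign above the door",
    "replace the car with a bicycle",
    "put potted plants on the balcony",
    "change the awning to striped fabric",
    "add string lights across the street",
    # ...a much longer list could continue here
]

def latest_image(response):
    """Return the last image part of a response as a PIL image, if any."""
    image = None
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            image = Image.open(BytesIO(part.inline_data.data))
    return image

current = Image.open("scene.jpg")
BATCH = 5  # arbitrary; small enough for the model to nail each stage
for start in range(0, len(edits), BATCH):
    batch = edits[start:start + BATCH]
    prompt = ("Apply only these edits and keep everything else in the image "
              "unchanged:\n- " + "\n- ".join(batch))
    response = chat.send_message([current, prompt])
    current = latest_image(response) or current  # carry the result forward

current.save("final.png")
```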
>> One of the things that's
always top of mind for me, especially
since, Nicole, you're also the
PM for our Imagen models: how
should people think about this, as
developers or just people who
have knowledge of all the models,
Imagen versus this native
capability that we have?
>> Yeah, and you know this, but our goal is
to always build one model with
Gemini, right? Ultimately our goal
is to bring all the
modalities into Gemini so we can benefit
from all the knowledge transfer that
Mostafa was talking about and ultimately
build towards AGI. On the way there,
there's a lot of usefulness in
having specialized models that are just
very, very good at a specific thing that
you need them to do, and Imagen is an
amazing model for text-to-image
generation, right? And we have a lot of
different Imagen variants that also do
image editing, and those are available in
Vertex AI. They're just optimized
for that specific task, right? So, if
you just want text-to-image, and you want
just one image out of that model, and you
want really amazing visual quality, and
you also want that to be really
cost-effective and kind of snappy in
generation time, Imagen is the
place to go, right? If you want some
of these more complex workflows,
where you want to generate with the
model but then you also want to edit in
that same workflow, and you want to do it
across multiple turns, or you want to do
some of this ideation like we were
doing with the model, like, you know,
what design ideas could you help me
come up with for my room or
this library, then Gemini is the place
to go, right? It really is kind of
that more multimodal creative
partner, where it can output images,
it can output text. You can be kind
of less precise with the instructions
that you give to Gemini, like
when at the beginning we
said "make it nano," because it has
that kind of world understanding, it
will just more creatively interpret your
instructions. But Imagen is still a
great family of models for developers to
go to if they want a super
optimized model for that specific task.
>> Yeah. One of the examples I was
trying today, and I'm curious what your
take is on which model, or if the
native image generation model fixes this
problem: I was saying, generate
this image and make the... this is
my dumb billboard use case again. I
need billboards. Make the billboard in
the style of some company that I
mentioned. Is that something that
native image generation benefits from,
because it's a little bit better at
this world knowledge piece, relative to
Imagen being really good
if you give it a good prompt, but
less good at
inferring the intent behind my prompts?
>> Your actual intent behind it, yeah.
So I think that's part of it. The
other part is, with native image
generation, if you just want to grab
that style reference that you have from
that other company whose style you
were trying to emulate,
you can also insert that into the model
and use that as a reference, right? The
fact that you can also input an
image as a reference helps with
that prompt, and that is just easier to
do in Gemini natively than it is in
Imagen. So you should try it, and
you should let us know; we should add
this to our evals.
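A small sketch of the style-reference idea Nicole mentions: pass the reference image alongside the instruction so the model can match its look rather than infer it from a brand name alone. The SDK usage, model name, and file names are assumptions.

```python
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")
style_ref = Image.open("brand_style_reference.png")  # e.g. a poster screenshot

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model id
    contents=[
        style_ref,
        "Generate a city billboard that says 'Gemini Nano', matching the "
        "colors, typography, and overall visual style of this reference image.",
    ],
)

# Save whatever image parts come back.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"billboard_{i}.png")
```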
>> I'll let you know whether or not
the billboard use case works. I'll make a
billboard eval.
Boen email.
>> I love that. Back to this
thread of the progress from 2.0:
one of the most fun things was, when that
model launched, people were sending us
tons of feedback about the experience in
AI Studio and then ultimately the Gemini
app, just general failure modes
for the model and all that stuff. I
made my only contribution to the
original launch, which was adding that
hot tag in AI Studio. We're bringing the
hot tag back for this model, actually, and
it's going to go away on the other
model. Can we talk
about that story of the
progress and the failure modes
that we did get a ton of feedback on,
things that didn't work well for
2.0 that now hopefully work well for
2.5?
>> Yeah, I mean we literally sat on
X, or Twitter, and went through
a bunch of feedback, and literally I
remember Kaushik and I and some of
the other team gathering all the
failure cases and making evals out of
that. So we have a benchmark that
we take from real user feedback,
just from Twitter, and it's just people
tagging us and saying, hey, this
didn't work. And for every model we
make in the future, we kind of just
append to that. So we know, for
example, when we released 2.0,
one of the failure cases we sometimes
would see is that if you edit, it would
add your edit, but it wouldn't
necessarily be consistent with the
rest of the image, right? So that was
one of the things that was in that set
and that we hill climbed, and then
there's plenty more. So we're
always just gathering that feedback.
Yeah, send us the examples that
yeah send us send us the examples that
uh that don't work well any any ones for
you all that like particularly stand out
of things that like just did not work
before that now is like a slam dunk. I
don't know if there's anything top of
mind. You you all play with I think the
like the team plays with this model I
think so much in the I assume in the
process as we're actually building it
and bringing it to life. I don't know if
there's any like go-to use cases for you
all to test and like is this actually a
good model?
>> Yeah, I think one thing I've noticed
specifically while playing with the 2.5
model, compared to the 2.0 model:
actually, one of the things that we
thought was going to be hard was
consistency from image to image, but
specifically the cases where you have an
object or say like a character that
you're building and you want that
character to remain consistent across
images. And uh if you actually leave the
character in the same place that it was
in the input image, it turns out that
this is actually quite easy and the 2.0
model could do this really well. It
could, for example, add a hat, change
the expression and stuff like that while
kind of keeping the pose and overall
structure of the scene the same. What
the 2.5 model adds on top of what these
capabilities look like in 2.0
uh is that you can ask to for example
render the character from different
angles and it will look like the exact
same character but from for example the
side. Or you could take a piece of
furniture and place it into a completely
different context, reorient it, and
create a whole scene, but that piece of
furniture would remain faithful to
the original that you uploaded, while
transforming it in very substantial
ways, rather than just taking the input
image and pasting those pixels into the
output image.
>> I love that. One of the
reactions that I had about some of the
2.0 stuff was, sometimes when you would
add something, like, I'd put in a picture
of my face and add a goofy mustache or a
hat or something, it almost looked like
it was superimposed, or kind of
photoshopped onto it. Is that
something that is similar to this
character consistency? It seems
tangential to it,
but it feels like a
similarish problem, where it's just
taking pixels from memory and
putting them into the image, almost,
versus the pixel transfer. I'm curious
if that's a capability that's
improved.
>> Yeah. And actually I think that comes
down a lot to the actual teams
working on this model. With the previous
model, we were kind of of the
mindset that, okay, it did the edit,
that's it, it was successful. But when we
started working more and more closely
with the Imagen teams, they would
look at the same exact edit
that we were looking at from the Gemini
side, and they'd say, this is terrible,
why would you ever want the model to do
something like this? So this is
one example where blending the
perspectives from both teams helps: on
the Gemini side the instruction
following, world knowledge, all of these
things, and then on the Imagen side
making the images actually look natural,
aesthetically pleasing, and genuinely
useful. I think it takes both of
these, and having these teams work
together on this led to 2.5 being much
better at the stuff you're describing.
>> I love it. Um,
>> Yeah, and just on that point, we
actually have folks on the team, who
mostly come from the Imagen team, who
have a really honed aesthetic
taste. And so a lot of the time
when we do evals, they will actually
just look at hundreds and thousands
of images and be like, "No, this model
is better than this other model." And a
lot of other people on the team will
kind of look at it and be like, "Okay..."
You kind of
have to hone that
sensibility over a couple of years, and
I've gotten a lot better at it over the
years. But there's definitely people
on the team who are amazing at it,
and we always go to them when we try to
pick between models.
>> Can you train auto-raters on
people's personal taste?
>> We haven't been able to do it yet.
>> Fun side project.
>> That's a fun side project. I'm very
excited, as Gemini gets better
at understanding, to have an
aesthetic auto-rater based on, you know,
one of the folks on the team who is
really amazing at this.
>> Just put that person to work providing
training signal.
>> Yes. Yes. And we'll take
that as a side project after this.
>> I love that. Um, lots of progress on 2.5,
and obviously I think folks are
going to be super excited to try out the
model and all that stuff. What comes
next? We've made a great model. I'm
sure we have more stuff cooking in the
pipeline, but I don't know how much
we want to say about the future
direction and what other
capabilities hopefully will land in
the future.
>> Um, so when it comes to image
generation, I think we do care
about the visual quality, but
one thing that is new,
and that we want with a unified
omni model, is smartness. You know,
you want your image generation model to
feel smart. When users interact
with this, not only are they impressed by
the quality of the images, but they feel
like, wow, this is smart. One
example that I have in mind, and I'm
looking forward to seeing this
happening, and it's a bit controversial
because I cannot even define it well,
is when I ask the model to do something
and it doesn't follow my instruction, but
it does something where, at
the end of the generation, I say I'm
glad that, you know, it didn't
follow my instruction, it's even better
than what I actually described,
right? So it has this kind of
edge to it, you know.
>> Is that, like, you think the model
is intentionally doing this, or is it
kind of an
unintended accident? Is that what you're
trying to say?
>> No, no, it's not just
that. Basically, you know,
sometimes you're underspecified, or
sometimes you think wrong about
something that is a reality,
but the outside world, with the
knowledge of Gemini, is different
from your perspective, right?
>> And I think, again, it's not
intentional, it just happens
organically, and you just feel
that you're interacting with
a system that is smarter than you,
right? And when I'm asking
for some images, I don't mind
if it goes off the rails with my
prompt and generates something
that is different from what I asked,
because it's most of the time better
than what I had in mind. So I
think, definitely, smartness at a high
level is the direction that we
are pushing forward, while maintaining
the visual quality or improving it.
But there are so many specifics
and capabilities and use cases,
especially for developers, that,
I think this release has some, but the
next release is going to have more.
And yeah, we have these
coming releases in the pipeline. I
cannot share the timeline, but
it's just so exciting. Yeah,
maybe I should. But I'm
so excited. I'm happy, and
the momentum is unmatched here
on the image generation side.
>> I love that. Any other
capabilities folks are excited about?
>> I'm really excited about factuality. Um,
and so that kind of goes back
to the point that sometimes
maybe you need to make a little diagram
or an infographic for a work
presentation, right? And it's
amazing if it looks nice, but that's not
enough for that case. It
actually has to be
accurate. You can't have any
extraneous text. It just kind of
has to both look good and also be
functional for that purpose. And I think
we're just scratching the surface on
what these models can do with that. And
I'm really excited about some of
these upcoming releases, us getting
better at that type of use case, so that,
my dream one day is that these
models can actually make a slide deck
for me for work that looks nice.
>> This is every PM's dream.
>> Every PM's dream. I'm trying to outsource
that part of my job to Gemini. And I
think we play a really big part in it.
So
>> Awesome. I love it. Well, I think folks
are going to be super excited to try
these models. Thank you, all four of
you, and the rest of the team, for
making this happen. I appreciate all
the hard work. I'm excited for this. Um,
and thanks, everyone, for watching
Release Notes. We'll see you in the next
episode.
[Music]