Google DeepMind Developers: How Nano Banana Was Made
By a16z
## Summary

### Key takeaways

- **Nano Banana: Best of Both Worlds**: Nano Banana, a new image generation model, combines the visual quality of Google DeepMind's Imagen models with the multimodal, conversational capabilities of Gemini. This blend is key to its resonance with users. [01:55]
- **Viral Success Driven by User Demand**: The viral nature of Nano Banana surprised even its creators. User demand surged unexpectedly, requiring constant increases to query capacity and demonstrating the model's significant utility. [02:10]
- **Personalization Drives Emotional Connection**: The ability for users to see themselves in AI-generated images, like creating personalized '80s makeover' versions, was a key moment that resonated deeply and highlighted the model's personal impact. [03:41]
- **AI Empowers Artists, Frees Up Creativity**: AI models like Nano Banana are transforming creative work by automating tedious tasks, allowing artists to focus 90% of their time on creativity rather than manual operations. [00:03], [05:40]
- **Intent is Key to AI-Generated Art**: While AI generates outputs, the true essence of art lies in human intent. These models serve as tools for individuals with creative intent to produce inspiring and meaningful work. [07:34]
- **Balancing Control and Simplicity in Interfaces**: Developing user interfaces for AI image models involves a trade-off between offering extensive control for professionals and maintaining simplicity for everyday users, a balance not yet fully realized. [10:35]
### Topics Covered
- Personalization drives emotional resonance in AI art.
- AI empowers artists by reducing tedious tasks.
- Will AI interfaces prioritize suggestions or control?
- Visual AI models will transform education.
- Improving the "worst image" unlocks AI's future.
## Full Transcript
These models are allowing creators to do
um less tedious parts of the job, right?
They can be more creative and they can
spend, you know, 90% of their time being
creative versus 90% of their time like
editing things and doing these tedious
kind of manual operations.
>> I'm convinced that this ultimately
really empowers the artists, right? It
gives you new tools, right? It's like,
hey, we now have, I don't know,
watercolors for Michelangelo. Let's see
what he does with it, right? And amazing
things come out.
Maybe start by telling us about the backstory behind the Nano Banana model. How did it come to be? How did you all start working on it?
>> Sure. So, you know, our team has worked on image models for some time. We developed the Imagen family of models, which goes back a couple of years. And actually, there was also an image generation model in Gemini before the Gemini 2.0 image generation model. So what happened was the teams started to focus more on the Gemini use cases, so interactive, conversational, and editing,
>> and essentially what happened is we teamed up and we built this model, which became what's known as Nano Banana. So yeah, that's sort of the origin story.
>> Yeah, and I think maybe just some more background on that. Our Imagen models were always kind of top of the charts for visual quality, and we really focused on these specialized generation and editing use cases. Then when 2.0 Flash came out, that's when we really started to see some of the magic of being able to generate images and text at the same time, so you can maybe tell a story. Just the magic of being able to talk to images and edit them conversationally. But the visual quality was maybe not where we wanted it to be. And so Nano Banana, or Gemini 2.5 Flash Image,
>> Nano Banana is way cooler.
>> It's easier to say. A lot easier.
>> It's the name that stuck.
>> Yes, it's the name that stuck. But it really became kind of the best of both worlds in that sense: the Gemini smartness and the multimodal, conversational nature of it, plus the visual quality of Imagen. And I feel like that's maybe what resonates a lot with people.
>> Wow. Amazing. So I guess when you were testing the model, as you were developing it, what were some wow moments where you thought: I know this is going to go viral, I know people will love this?
>> So I actually didn't feel like it was going to go viral until we had released it on LMArena. What we saw was that we budgeted a comparable number of queries per second as we had for our previous models on LMArena, and we had to keep upping that number as people kept going to LMArena to use the model. And I feel like that was the first time when I was really like, "Oh, wow. This is something that's very, very useful to a lot of people." It surprised even me. I don't know about the whole team, but we were trying to make the best conversational editing model possible. But then it really started taking off when people were going out of their way and using a website that would actually only give you the model some percentage of the time. Even that was worth it, going to that website to use the model. So I think that was really the moment, at least for me, when I was like, oh wow, this is going to be bigger.
>> That's actually the best way to condition people: only give them rewards some of the time, not all the time, by design.
>> I had a moment earlier. I've been trying similar queries on multiple generations of models over time, and a lot of them have to do with things I wanted to be as a kid: an astronaut, an explorer, or, you know, put me on the red carpet. I tried it on a demo that we had internally before we released the model, and it was the first time the output actually looked like me. And you know, you guys play with these models all the time. The only time I've seen that before is if you fine-tune a model, using LoRA or some other method, and you need multiple images, it takes a really long time, and then you have to actually serve it somewhere. So this was the first time it was zero-shot: just one image of me, and it looks like me, and I was like, wow. And then we ended up with decks that are just covered in my face as I was trying to convince other people that it was really cool. And really, I think the moment more people realized it was a really fun feature to use was when they tried it on themselves, because it's kind of fun when you see it on another person, but it doesn't really resonate with people emotionally until it's personal. It's you, your kids, you know, your spouse, and, I think, that's your dog,
>> your dog. And that's really what started resonating internally. Then people just started making all these '80s makeover versions of themselves, and that's when we really started to see a lot of internal activity and we were like, "Okay, we're on to something."
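
What the zero-shot personalization described above looks like from the developer side is roughly a single request: one reference photo plus a plain-language instruction, no fine-tuning or serving step. Here is a minimal sketch using the google-genai Python SDK; the model name, file names, and prompt are illustrative placeholders, not the exact setup discussed in the episode.

```python
# Minimal sketch: one reference photo + a text instruction, no LoRA fine-tuning.
# Assumes GEMINI_API_KEY is set; model name and file paths are placeholders.
from google import genai
from PIL import Image

client = genai.Client()

reference = Image.open("me.jpg")  # a single photo of the subject
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # "Nano Banana"
    contents=[reference,
              "Give this person an 80s makeover, but keep the face clearly recognizable."],
)

# The response can interleave text and image parts; save any returned images.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        with open(f"makeover_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```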
>> It's a lot of fun to test these models when we're making them, because you see all these amazing creative things that people make and go: oh wow, I never thought that was possible.
>> So it's really fun.
>> No, I mean, we've done this with the whole family, and it's a crazy amount of fun.
>> So, thinking a bit about the long term: where does this lead? I mean, we built these new tools that I think will change visual arts forever, right? We suddenly can transfer style. We suddenly can generate consistent images of a subject. What used to be a very complex manual Photoshop process, suddenly I type one command and it magically happens. But what's the end state of this? Do we have an idea yet? How will creative arts be taught in a university, you know, five years from now? Want to take that?
>> So I I think it's going to be a spectrum
of things, right? I think on the
professional side, a lot of what we're
hearing is that these models are
allowing creators to do um less tedious
parts of the job, right? They can be
more creative and they can spend, you
know, 90% of their time being creative
versus 90% of their time like editing
things and doing these tedious kind of
manual operations. So I'm really excited
about that. Like I think we'll see kind
of an explosion of creativity like on
that side of the spectrum. And then I think for consumers there are probably two sides of the spectrum. One is, you know, you might just be doing some of these fun things, like Halloween costumes for my kid, right? And the goal there is probably just to share it with somebody, right? Your family or your
friends. Um on the other side of the
spectrum, you might have these tasks
like putting together a slide deck,
right? I started out as a consultant. We
talked about that at the beginning. Um,
and you spend a lot of time on like very
tedious things like trying to make
things look good, trying to make the
story make sense. I think for those
types of tasks, you probably just have an agent that you give the specs of what you're trying to do, and it goes out and actually lays it out nicely for
you. It creates the right visual for the
information that you're trying to
convey. And it really is going to be
this, I think, spectrum depending on
what you're trying to do. Like do you
want to be in the creative process and
actually tinker with things and
collaborate with the model or do you
just want the model to like go do the
task and be as minimally involved as
possible?
>> So in this new world, then, what is art? I mean, somebody recently said art is when you can create an out-of-distribution sample. Is that a good definition, or is it aiming too high?
>> Do you think art is out of distribution or in distribution for the model?
>> There we go.
I think that "out-of-distribution sample" definition is a little bit too restrictive. I think a lot of great art is actually in distribution for the art that came before it. So, I mean, what is art? I think it's a very philosophical debate, and there are a lot of people who discuss this. To me, the most important thing for art is intent. And so what is generated from these models is a tool that allows people to create art. And I'm actually not worried about the high end and the creatives and the professionals, because I've seen that if you put me in front of one of these models, I can't create anything that anyone wants to see, but I've seen what people can do who are creative and who have intent and these ideas, and that's the most interesting thing to me: the things they create are really amazing and inspiring for me. So I feel like the high end, the professionals and the creatives, they'll always use state-of-the-art tools, and this is another tool in the tool belt for people to make cool things. I think one of the really
interesting things that I kept hearing
about this model in particular from like
creatives and artists was a lot of them
felt like they couldn't use a lot of AI
tools before because it didn't allow
them the level of control that they
expected for their art. On one side, that was character or object consistency: they really used that to have a compelling narrative for a story, and before, when you couldn't get the same character over and over, it was very difficult. And then the second thing I hear all the time from artists is that they love being able to upload multiple
images and say like use the style of
this on this character or add this thing
to this image which is something that I
think was very hard to do even with
previous image edit models. I guess I'm
curious whether that was something you guys were really optimizing for when you trained this one, or how did you think about that?
>> I mean yeah definitely sort of
customizability and character
consistency are things that we closely
monitored during the development and we
tried to do the best job we could on
them. Um, I think another thing is also
uh the iterative nature of kind of like
an interactive conversation. And you know, art tends to be iterative as well, where you make lots of changes, you see where it's going, and you make more. This is another thing I think makes the model more useful. And actually, that's an area where I also feel we can improve the model greatly. I know that once you get into really long conversations, it starts to follow your instructions a little bit worse. But this is something we're planning to improve on, to make the model more of a natural conversation partner, or a creative partner, in making something.
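
The iterative, conversational editing described here maps naturally onto a chat session where each turn refines the previous image. Below is a rough sketch, again with the google-genai Python SDK; the model name, file names, and prompts are assumptions for illustration, and error handling is left out.

```python
# Sketch of conversational, multi-turn editing: each message refines the result.
# Model name, paths, and prompts are illustrative placeholders.
from google import genai
from PIL import Image

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-flash-image-preview")

def save_images(response, prefix):
    """Write any image parts in the response to disk."""
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data is not None:
            with open(f"{prefix}_{i}.png", "wb") as f:
                f.write(part.inline_data.data)

# Turn 1: start from a reference image and a style instruction.
resp = chat.send_message([Image.open("character.png"),
                          "Restyle this character as a watercolor illustration."])
save_images(resp, "turn1")

# Turn 2: refine without re-describing everything; the chat keeps the context.
resp = chat.send_message("Keep the same character, but place them on a rainy "
                         "city street at night.")
save_images(resp, "turn2")
```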
>> One thing that's so interesting is that after you guys launched Nano Banana, we started to hear about editing models all the time, everywhere. It's like after you launched, the world woke up and said editing models are great, everyone wants them. And then obviously it goes into the customizability, the personalization of it. Oliver, I know you used to be at Adobe, and there's also the software where we used to manually edit things. How do you see the knobs evolving now on the model layer versus what we used to do?
>> Yeah. I mean, I think one thing that Adobe has always done, and that professional tools generally require, is lots of control, lots of knobs. So there's always a balance: we want someone to be able to use this on their phone, maybe with just a voice interface, and we also want someone who is a really professional artist or creative to be able to do fine-scale adjustments. I think we haven't exactly figured out how to enable both of those yet. But there are a lot of people building really compelling UIs, and I think there are different ways it can be done. I don't know, do you have thoughts?
>> Well, I also hope that we get to a point
where you don't have to learn what all
these controls mean and the model can
maybe smartly suggest what you could do
next based on the context of what you've
already done, right? And that feels like it's kind of ripe for someone to take on. So what do the UIs
of the future look like um, in a way
where you probably don't need to learn a
hundred things that you had to before,
but like the tools should be smart
enough to suggest to you what it can do
based on what you're already doing.
>> That's such an insightful take. I definitely had moments when I used Nano Banana where I was like, I didn't know I wanted this, but I didn't even ask for this style; I don't even have the words for what that style is even called. So this is very insightful about how the image embedding and the language embedding are not one-to-one: we cannot map to all the editing tasks with language. So, oh, go ahead.
>> Yeah, let me take a bit of a counterpoint just to see where this goes. The question of how complex the interface can be is limited by what we can express in software, how easy we can make something in software, which to some degree is also limited by how much complexity a user is willing to tolerate. And you know, if you have a professional, they only care about the result; they're willing to tolerate a vast amount of complexity. They have the training, they have the education, they have the experience to use that. Then we may end up with lots of knobs and dials, it's just a very different set of dials. I mean, today, if you use Cursor or so for coding, it's not that it has a super easy, single-text-prompt interface; it has a good amount of "add context here, here are different modes" and so on.
So will we have the ultra-sophisticated interface for the power user, and what would that look like?
>> So I'm a big fan of ComfyUI and node-based interfaces in general,
>> and that is complex
>> and it's complex, but it's also very robust and you can do a lot of things. And so after we released Nano Banana, we saw people building all these really complicated ComfyUI workflows where they were combining a bunch of different models and tools together, and that generated, for example, workflows using Nano Banana as a way to get storyboards or keyframes for video models. You can plug these things together and get really amazing outputs. So I think that at the pro or developer level, these kinds of interfaces are great. At the prosumer level, I think it's very much unknown what it's going to look like in a couple of years.
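
The storyboard-to-video pattern people wire up in ComfyUI can be sketched outside a node graph too: the image model pins down consistent keyframes, then a video model animates between them. The outline below is conceptual; generate_keyframe reuses the API call shown earlier, while animate_between is a hypothetical stand-in for whatever image-to-video model or node you plug in, not a specific API.

```python
# Conceptual version of the keyframes-to-video workflows described above.
# animate_between() is a hypothetical placeholder for an image-to-video step.
import io
from google import genai
from PIL import Image

client = genai.Client()

def generate_keyframe(reference: Image.Image, shot_description: str) -> Image.Image:
    """Ask the image model for one keyframe that keeps the reference character consistent."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",  # placeholder model name
        contents=[reference, f"Same character, new shot: {shot_description}"],
    )
    data = next(p.inline_data.data for p in resp.candidates[0].content.parts
                if p.inline_data)
    return Image.open(io.BytesIO(data))

def animate_between(start: Image.Image, end: Image.Image) -> bytes:
    """Hypothetical image-to-video step (e.g. a Veo-style API or a ComfyUI video node)."""
    raise NotImplementedError("plug in your video model here")

hero = Image.open("hero.png")
shots = [
    "standing at the castle gate at dawn",
    "walking through a crowded market at noon",
    "reaching the harbor as the sun sets",
]
keyframes = [generate_keyframe(hero, s) for s in shots]
clips = [animate_between(a, b) for a, b in zip(keyframes, keyframes[1:])]
```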
>> Yeah. I think it just really depends on
your audience, right? Because for the
regular consumer, like I use my parents
always as an example. The chatbot is
actually kind of great.
>> Oh, yeah.
>> Because you don't have to learn a new UI. You just upload your images and then you talk to them, right? It's kind of amazing that way. Then for the pros, I agree that you need so much more control. And then there's somewhere in between, probably, which is people who may want to be doing this but were too intimidated by the professional tools in the past. And for them, I do think there's a space where you need more control than the chatbot gives you, but you don't need as much control as what the professional tools give you. What's that kind of in-between state?
>> There's a ton of opportunity there.
>> There's a ton of opportunity there. It is interesting you mentioned ComfyUI, because it's at the far end of the workflow spectrum: a workflow can have hundreds of steps and nodes, and you need to make sure all of them work. On the other side of the spectrum there's Nano Banana, where you kind of describe something with words and you get something out, and, like, I don't know the model architecture, stuff like that. So I guess, is your view that the world is moving toward a single model hosted by one provider doing it all, or do you think the world is moving more toward everyone building a workflow, with Nano Banana as one of the nodes in a ComfyUI workflow?
>> I definitely don't think that the broad range of use cases will be fully satisfied by one model at any point. So I think there will always be a diversity of models. I'll give you an example: we could optimize for instruction following in our models, make sure it does exactly what you want, but that might be a worse model for someone who's looking for ideation or inspiration, where they want the model to kind of take over and do other things, go crazy. So I just think there are so many different use cases and so many types of people that there's a lot of room in this space for multiple models. So that's where I see us going. I don't think this is going to be a single model to rule them all.
>> Makes complete sense. Let's go to the very other end of the spectrum from the professional. Do you think kindergarteners in the future will learn drawing by sketching something, you know, on a little tablet, and then you have the AI turn that into a beautiful image, and that's how they get in touch with art?
>> I don't know if you always want it to turn into a beautiful image, but I think there's something there about the AI being, again, a partner and a teacher to you in a way that you didn't have. So, I didn't know how to draw, still don't, don't have any talent for it really. But I think it would be great if we could use these tools in a way that actually teaches you the steps and helps you critique, and maybe, again, shows you kind of an autocomplete for images: what's the next step that I could take, right? Or maybe show me a couple of options, and how do I actually do this? So I hope it's more that direction. I don't think we all want, you know, every 5-year-old's image to suddenly look perfect.
>> We would probably lose something in the process. As someone who struggled in high school with the art and sketching class more than any of my other classes, I actually would have preferred it, but I know a lot of people want their kids to learn to draw, which I understand.
>> It's funny, because we've been trying to get the model to create childlike crayon drawings, which is actually quite challenging.
>> Ironically, you know, sometimes the things that are hard to make are the ones where the level of abstraction is very large, right?
>> So it's actually quite difficult to make those types of images. Your dedicated pre-K fine-tune.
>> We do have some evals right now
>> to try to see if we're getting better.
>> In general, I'm very optimistic about AI for education. And part of the reason is I think that most of us are visual learners, right? AI right now as a tutor, basically all it can do is talk to you or give you text to read, and that's definitely not how students learn. So I
think that these models have a lot of
potential as a way to help education by
giving people sort of visual cues. You
know, imagine if you could get an
explanation for something where you get
the text explanation, but you also get
images and figures that kind of like
help explain how they work. I think everything will be much more useful, much more accessible, for students. So I'm really excited about that.
>> On that point, one thing that's very interesting to us is that when Nano Banana came out, it almost felt like part of the use case is as a reasoning model. You have a diagram, right, and you can explain some knowledge visually. So the model is not just doing an approximation of the visual aspect; there's a reasoning aspect to it too. Do you think that's where we're going? Do you think all the large models will realize that, oh, to be a good LLM or VLM, we have to have both image and language and audio and so on?
>> 100%. I definitely think so. Um the the
future for these AI models that I'm most
excited by is where they are tools for
people to accomplish more things. Like I
think if you imagine a future where you
have these agentic models that just talk
to each other and do all the work, then
it becomes a little bit less necessary
that there's like this visual mode of
communication. But as long as there's
people in the loop, and as long as the motivation for the task they're solving comes from people, I think it makes total sense that the
visual modality is going to be really
critical for any of these AI agents
going forward.
Will we get to a point where, you know, I'm asking you to create an image and it sits for two hours, reasons with itself, has drafts, explores different directions, and then comes back with a final answer?
>> Yeah. Absolutely. If necessary. Like,
>> and maybe not just for a single image, but to the point of, you know, maybe you're redesigning your house and you actually really don't want to be involved in the process, but you're like, okay, this is what it looks like, this is some inspiration that I like. And then you send it to a model the same way that you would send it to a designer.
>> It's visual deep research.
>> It's like visual deep research, basically. I really like that term. And then it goes off and does its thing and searches for maybe the furniture that would go with your environment, and then it comes back to you and maybe presents you with options, because maybe you don't want to sit for two hours waiting on one thing, an art book, you know, a ten-slide deck. I also think, if you think about instruction manuals or IKEA directions or something, breaking down a hard problem into many intermediate steps could be really useful as a way to communicate.
>> So when can we generate Lego sets?
>> Yeah, soon, maybe.
Do we at some point need 3D as part of it?
>> Right.
>> I mean, there's a whole debate around world models and image models and how they fit together. Thoughts? Enlighten us here. What's the short summary of where we'll end up there?
>> I mean, I don't know the answer. Obviously the real world is in 3D. So if you have a 3D world model, or a world model that has explicit 3D representations, there are a lot of advantages; for example, everything stays consistent all the time. Now, the main challenge is that we don't walk around with 3D capture devices in our pockets. So in terms of the available data for training these models, it's largely the projection onto 2D. So I think that both viewpoints are totally valid for where we're going. I come a bit from the projection side: I think we can solve almost all the problems, if not all of them, working on the projection of the 3D world directly and letting the models learn latent world representations. We see this already in that the video models have very good 3D understanding; you can run reconstruction algorithms over the videos you generate, and they're very accurate. And in general, if you look at the history of human art, it starts as the projection, right? People drawing on cave walls. All of our interfaces are in 2D. So I think humans are very well suited to working with this projection of the 3D world onto a 2D plane, and it's a really natural environment for interfaces and for viewing.
>> That is very true. So, I'm a cartoonist in my spare time, and drawing in 2D is just light and shadow, and then you present it as 3D: we trick ourselves into believing it's 3D even though it's on a piece of paper. But then what a human can do, that a drawing or a model can't, is navigate the world: we see a table, we know we can't walk through it. I guess the question becomes, if everything is 2D, how do you solve that problem?
>> Well, so if we're trying to solve robotics problems, I think maybe the 2D representation is useful for planning and visualizing at a high level. I think people navigate by remembering kind of 2D projections of the world. You don't build a 3D map in your head; you're more like, oh, I know, I see this building, I turn left.
>> Yeah.
>> So I think for that kind of planning it's reasonable, but for the actual locomotion around the space, 3D is definitely important there.
>> Robotics, yeah. They probably need 3D. That's the saving grace.
>> Yeah. So, character consistency, which you mentioned earlier: I really love the example of how, when a model feels so personal, people are so tempted to try it. How did you unlock that moment? The reason I ask is that character consistency is so hard. There's a huge uncanny valley to it: if it's someone I don't know and I see their AI generation, I'm like, okay, it's maybe the same person; but if it's someone I know and there's just a little bit of a difference, I actually feel very turned off by it, because I'm like, this is not a real person. So in that case, how do you know what you're generating is good? Is it mostly by user feedback, like "I love this," or is it something else? Do you look at faces, you know, with face detection, and,
>> No. So, not even before we released this, right? When we were developing this model, we actually started out doing character consistency evals on faces we didn't know, and it doesn't tell you anything, right? And then we started testing it on ourselves and quickly realized, okay, this is what you need to do, because this is a face that I'm familiar with. So there's a lot of eyeballing of evaluations that happens, and just the team testing it on themselves, and generally on people they know. Like, Oliver probably knows my face well enough at this point to be able to tell whether or not it's actually me when it's generated. And so we do a lot of that. And then, you know, you ideally test it on different sets of people, different ages, different kinds of groups of folks, to make sure that it works across the board.
>> Yeah, that's right. I mean, that touches a little bit on this bigger issue, which is that evals are really difficult in this space, because human perception is very uneven in terms of the things it cares about. So it's very hard to know how good the character consistency of a model is, and is it good enough, is it not good enough? I think there's still a lot of improvement we can make on character consistency, but for some use cases we got to a point, and, you know, we weren't the first edit model by any means, but I think once the quality gets above a certain level for character consistency, it can kind of just take off, because it becomes useful for so much more.
>> And I think as it gets better, it'll be useful for even more things too. Yeah,
>> I think one of the really interesting things we're seeing across a bunch of modalities, of which image editing and generation is obviously one, is that the arenas and benchmarks and everything are awesome, but especially when you have multi-dimensional things like image and video, it's very hard, as all of the models get better and better, to condense every quality of a model into one judgment. So you're judging: okay, you swap a character into an image and you change the style of the image. Maybe one model did the character swap and consistency much better and the other did the style much better; how do you say which output is better? It probably comes down to what the person cares most about and what they want to use it for. Are there certain characteristics of the model that you guys value more than others when making those trade-offs, when deciding which version of the model to deploy or what to really focus on during training?
>> Yes, there are. One of the things I like about this space is that there is no right answer. So actually there's quite a lot of, I don't know if it's taste, but preference, that goes into the models. And I think you can see the difference in preferences of the different research labs in the models that they release.
>> So when we're balancing two things, a lot of it comes down to, oh well, I just like this look better, or this feature is more important to us.
>> I'd imagine it's hard for you guys, too, because you have so many users, right? Like, Google, being in the Gemini app, everyone in the world can use that, versus many other AI companies that can just think: we're only going for the professional creatives, or we're only going for the consumer meme makers. You guys have the unique and exciting but challenging task that literally anyone in the world can do this. How do you decide what everyone would want?
>> Yeah. And sometimes we do make these trade-offs. We do have a set of things that are super high priority, that we don't want to regress on. So now, because character consistency was so awesome and so many people are using it, we don't want our next models to get worse on that dimension, right? So we pay a lot of attention to it. We care a lot about images looking photorealistic when you want photos, and this is important. One, I think we all prefer that style too. And, you know, for advertising use cases, for example, a lot of it is photorealistic images of products and people, and so we want to make sure that we can do that. And then sometimes there are just things that will fall by the wayside. So for this first release, the model is not as good at text rendering as we would like it to be, and that's something we want to fix in the future. But it was one of those things where we looked at it: okay, the model's good at X, Y, Z, it's not as good at this, but we still think it's okay to release, and it will still be an exciting thing for people to play with.
>> If you look at the past, for previous model generations there were a lot of things we did with sidecar models, like ControlNet or something like that, where we basically figured out a way to provide structured data to the model to achieve a particular result. It seems like with these newer models that has taken a step back, just because they're so incredibly good at just prompting, or, you know, giving a reference image and picking things up from there. Where will this go long term? Do you think this will come back to some degree? You know, from the creator's perspective, having, I don't know, OpenPose information so I can get a pose exactly right for multiple characters seems very tempting, right? Or to rephrase it a little bit: does the bitter lesson hold here, that at the end of the day everything's just one big model and you throw things in, or is there a little bit of structure we can offer to make this better?
>> I mean, I think there will always be users that want control the model doesn't give you out of the box. But we tried to make it so that, because really, what an artist wants when they want to do something is for their intent to be understood. And I think these AI models are getting better at understanding the intent of users. Often, when you ask text queries now, the model gets what you're going for. So in that sense, I think we can get pretty far with understanding the intent of our users. And maybe some of that is personalization: we need to know information about what you're trying to do or what you've done in the past. But I think once you can understand the intent, then you can generally do the right type of edit. Is this a very structure-preserving edit, or is this more free-form? We can learn these kinds of effects, I think. But still, of course, there's the one person who's going to really care about every pixel, where this thing needs to be slightly to the left and a little bit more blue, and those people will use existing tools to do that.
>> I mean, it's like, you know, I want an image with 26 people spelling out every letter of the alphabet, or something like that. That's sort of the thing where I think we're still quite a bit away from getting it right on the first try. On the other hand, with pose information, it could potentially get there.
>> But then the question, I guess, is: do you really want to be the one extracting the pose and providing that as information, or do you just want to provide some reference image and say, this is actually what I want, model, go figure this out?
>> There are 26 people, every one in a different pose and style. Fair enough. Yeah, I think in that case I wouldn't spend a ton of time building a custom interface for making this picture of 26 people; it seems like the kind of thing that we can solve.
>> Just transfer.
>> Do you think the representation of what the AI images are will change? The reason I ask is that, as artists, there are different formats we play with. There are SVGs, where we have anchor points and Bézier curves. And on the other side there's, you know, Procreate or, like, Fresco, what have you, where there are layers we can also play with. There's the other parameter, which is what brush you use, the texture of it. So for every one of these parameters you can write a script and actually do something very personal with it. Do you think pixels are the right representation, the endgame for image generation models, or do you think there's a net new representation that we haven't invented yet?
>> That's an easy question... wow. I'll say that everything is a subset of pixels.
>> That's true.
>> So text is a subset of pixels, because I could just render all the text as an image. So how far can we get with just pixels is an interesting question. I think if the model is really responsive and handles multi-turn interactions well, then you can probably get pretty far, because the primary reason you would want to leave the pixel domain is editability. And so in cases where you need to have your font, or you want to change the text, or you want to move things around with control points, it could be useful to have a kind of mixed generation that consists of pixels and SVGs and other forms. But if we can do it all, if the multi-turn interaction is good enough, then I think you can get pretty far with pixels. I will say that one of the things that's exciting about these models with native capabilities is that you now have a model that can generate code and it can generate images.
>> So there are a lot of interesting things that come at that intersection, right? Like maybe I want to write some code and then make some things be rasterized and some things be parametric.
>> Yeah.
>> Stick it all together,
>> train it together. This would be very cool.
That's such a good point, because I did see a tweet of someone asking Claude Sonnet to replicate an image in an Excel sheet where every cell is a pixel, which is a very fun exercise. It was a coding model that doesn't really know anything about images, yet it worked.
>> Yeah, there's the classic pelican-riding-a-bicycle test.
Yeah, totally. I have one on interfaces, if that's okay; sorry if I'm bringing up too much product stuff, guys, I'm just very curious on the product front. I'm curious how you think about owning the interface where people are editing or generating images with Nano Banana, versus really just wanting a ton of people to use the model for different things via the API. We've talked about so many different use cases: ads, education, design, architecture. For each of those there could be a standalone product built on top of Nano Banana that prompts the model in the right way, or allows certain types of inputs, or whatever. Is your vision that the product in the Gemini app is a playground for people to explore, and then developers will build the individual products that are used for certain use cases, or is that something you're also interested in owning?
>> I think it's a little bit of everything.
Um, so I definitely think that the
Gemini app is an entry point for people
to explore. And the nice thing
about Nano Banana is I think it shows
that fun is kind of a gateway to utility
where you know people come to make a
figurine image of themselves but then
they stay because it helps them with
their math homework or it helps them
write something, right? And and so I
think that's a really powerful kind of
transition point. Um there's definitely
interfaces that we're interested in
building and exploring as a company. And
so, you know, you may have seen Flow from Josh's team in Labs, which is really trying to rethink what's the tool for AI filmmakers, right?
and for AI filmmakers image is actually
a big part of the iteration journey
right because video creation is
expensive a lot of people kind of think
in frames um when they when they
initially start creating and a lot of
them even start in the LLM space for
like brainstorming and thinking about
what they want to create in the first
place. And so there's definitely a place that we have in that space, of us trying to think about what that looks like. We have the
advantage of it kind of sitting close to
the models and the interfaces so we can
kind of build that in in a tight
coupling. Um, and then there's
definitely the case that, you know, we're probably not going to go build software for an architecture firm. My dad is an
architect and he would probably love
that. Um, but I don't think that's
something that we will do, but somebody
should go and do that. Um, and that's
why it's exciting because we do have the
developer business and we have the
enterprise business and so people can go
use these models and then figure out
what's the next-generation workflow for this specific audience so that it can help them solve a problem. So I think the answer is kind of: yes, all three.
>> Yeah.
>> Yeah. I brought that up because, I don't know if you guys have been following the reception of Nano Banana in Japan, but I'm sure you have; it's been insane. And it's so funny: now half of my X feed is these really heavy Nano Banana users in Japan who have created Chrome extensions, there's one called Easy Banana, specifically for using Nano Banana for manga generation and specific types of anime and things like that. They go super deep into basically prompting the model for you and storing the outputs in various places, using obviously your underlying model to generate these amazing anime that you would never guess were AI generated, because the level of precision and consistency and that sort of thing is just beyond what I've seen any single model be able to do today.
I guess, to Justin's point, what are some force multipliers that you guys have seen in the model? What I mean by this is, for example: if you unlock character consistency, you can generate different frames, and then you can make a video, and then you can make a movie, right? These are the things where, if you get them right and get them really well, there are so many more downstream tasks that can derive from them. Just curious how you think about what force multipliers you want to unlock.
>> What's the next big one?
>> What's the next big wave, yeah, of people who can just use Nano Banana as the base model for all the downstream tasks.
So I think one current one actually is the latency point, because it makes it really fun to iterate with these models when it just takes ten seconds to generate the next frame. If you had to sit there and wait for two minutes, you would probably just give up and leave; it's a very different experience. So I think that's one. There has to be some quality bar, because if it's just fast and the quality isn't there, then it also doesn't matter; you have to hit a quality bar, and then speed becomes a force multiplier. I think this general idea of just visualizing information, to your education point from earlier, is sort of another one, right? And that needs good text. It needs factuality, right? Because if you're going to start making visual explainers about something, it looks nice, but it also needs to be accurate, right?
>> And so I think that's probably the next level, where at some point you could also just have a textbook personalized to you, right? Where it's not just the text that's different, but also the visuals.
>> Yeah, The Diamond Age, that was basically it.
>> Yeah, basically. And then it should also internationalize really well, right? Because a lot of the time today
you might actually be able to find a
diagram that explains the thing that
you're trying to learn about on the
internet, but it's maybe not in the
language that you actually speak. Um,
right? And so I think that becomes just
like another way to improve and open up
accessibility um of information to just
a lot more people and again visually
because a lot of people are visual
learners.
>> Interesting. How do you think about the images being generated? The reason I ask is that there's another very cool example I've seen someone make work with Nano Banana: he wrote a script and then kept prompting the model to generate the frame one second after the current one, and then it became a video. And when I saw it, I was like, well, is every image just one frame in a continuum? Like there's always a continuum in a parallel universe, and you could have generated any one of them.
>> It's one big directed graph that,
>> right, exactly, and then maybe it's video at the end of the day. So how do you see that? Where does it intersect or not intersect?
>> Yeah, video and images are very closely related. And I think what we're seeing in these what-comes-next, sequence-prediction use cases is the generalization and world knowledge of the model as well. So where do I think it's going? I think video is an obvious next domain. When you have editing, a lot of times what you're asking is, what happens if I do this, and that's what video has: the time sequence of actions. So it's like we have a slow frames-per-second video that you can interact with,
>> but obviously making something that's fully interactive and real-time is the direction this field is headed.
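
The "generate the frame one second after this" trick mentioned a moment ago is essentially a loop that feeds each output image back in as the next input. A rough sketch using the same SDK as the earlier examples; the model name, prompt, frame count, and file paths are placeholders, and in practice results tend to drift over many iterations.

```python
# Sketch of the "next frame" trick: repeatedly ask for the frame one second later,
# feeding each generated image back in. Model name and paths are placeholders.
import io
from google import genai
from PIL import Image

client = genai.Client()
frame = Image.open("frame_000.png")  # starting frame

for t in range(1, 10):
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[frame, "Generate the same scene exactly one second later."],
    )
    image_parts = [p for p in resp.candidates[0].content.parts if p.inline_data]
    if not image_parts:
        break  # the model answered with text only; stop rather than loop on nothing
    frame = Image.open(io.BytesIO(image_parts[0].inline_data.data))
    frame.save(f"frame_{t:03d}.png")
```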
>> So you are probably in the, I don't know, top 0.001% of the most experienced people in the world at using image models. What are your personal favorite use cases? How do you use it day-to-day, if you're not just testing an existing model?
Well, I'm not sure I am in the very top, but I'll tell you what:
>> I mean, it's like we were saying earlier: the personalization aspect is the thing that totally drives it home for me. I have two young kids, and the best things that I do with the model are the things I do with my kids, like making their stuffed animals come to life, those types of applications, and it's just so personal and gratifying to see. We also see a lot of people taking old pictures of their family, for example, and restoring them. So I think that's the real beauty of the edit models: you can make it about the one thing that matters most to you. So that's what I use it for, my kids, basically.
>> Very nice.
>> Yeah. You're basically making content that you probably would have never made before, and it's for the consumption of one person, right? Or one family. You're telling these stories that you would have never told before. So, kind of similar: I do a lot of family holiday cards and birthday cards and whatnot. Now, anytime I make a slide deck, I force myself to generate some images that are contextually relevant and try to get the text right, and all of those things. And then we try to push the boundaries around things like, can you make a chart in pixel space? Do you want to? That's another question, right? Because you also want the bars in the bar chart to be accurately positioned relative to one another. So I think we do a lot of these things. I'm actually really impressed with the people we work with on the team, who are just very creative. We have a team who works really closely with us on models that we're developing, and they just push the boundary. They'll do crazy things with the models.
>> What's the most surprising thing you've seen here, like, I didn't know our model could do this?
>> There are even just simple things, where people have been doing texture transfer. They will take a portrait of a person and ask, what would it look like if it had the texture of this piece of wood? And I would have never thought of this being a use case, because my brain just doesn't work that way. But people just push the boundaries of what you can do with these things.
>> That is an interesting example of world knowledge, because texture technically is 3D; there's the whole 3D aspect of it, the light and shadow of it, but this is a 2D transfer. So that's very cool.
I think for me, the thing I'm most excited by, and maybe most impressed by, is the use cases that test the reasoning abilities of the models. Some people on our team figured out you could give geometry problems to the model and ask it to, you know, solve for X here, or fill in this missing thing, or present this from a slightly different view. And these types of things, which really require the world knowledge and reasoning ability of a state-of-the-art language model, are the things that make me go, wow, that's amazing, I didn't think we would be able to do that.
>> Can it generate compilable code on a blackboard yet? Like, if I take a picture of, I don't know, code on my laptop, would the image model know if it compiles?
>> I've seen examples where people give it an image of HTML code and have the model render the web page, and it can do that.
>> That's very cool. The coolest example I saw, so I came from academia, so I spent a lot of time writing papers and making figures: one of our colleagues took a picture of one of the result figures from one of their papers, with a method that could handle a bunch of different types of applications, and sort of erased the results. So you have just the inputs, and they asked the model to solve all of these in picture form, in a figure of a paper, and it was able to do that. It could actually figure out what problem each part of the figure is asking for, find the answer, put it in the image, and then do that for a bunch of different applications at the same time, which was really amazing. Very cool.
>> That's very cool. Has anyone built an application on top of that capability yet? What's the application that will come out of that?
>> I think there are a lot of very interesting zero-shot transfer capabilities, problem-solving type things, that we don't even know the boundary of yet. And some of these are probably quite useful. You know, if you want a method that solves some problem X, I don't know, finds the normals of the scene or something, the surface orientations or something, you probably can prompt the model to give you a reasonable estimate. So I think there are lots of problems, sort of understanding problems and other types of things, that we could maybe solve with zero- or few-shot prompting that we don't know about yet. Yeah, there's
one thing you mentioned that I found super interesting, which is the world knowledge transfer. In a lot of world models, or video models, there is always something that keeps the state: just because you look away doesn't mean the chair should disappear or change color, because that's not what the state of the world is. How do you see that? Do you think there's relevance there for image models? Is that something you even consider optimizing for?
>> Yeah, I mean, if you think about an image model that has a long context, where you can put other things in that context, like text, images, audio, video, then you're definitely reasoning over that context of things to produce a final output image or video. So yeah, I think there's definitely some model capability to do this type of stuff already.
>> Got it. I haven't tested it out yet for this big use case, but I'll let you know. That's one of my favorite things about these models, just finding things out. And I'm sure it's really fun for you guys, and you probably have much more of a hint than we do about what they can do. But sometimes you'll just see some crazy X or Reddit or wherever post about some incredible thing that someone has figured out how to do, that you would never expect the model might be able to do, and then other people build on that and say, oh, and then I tried the next iteration of this thing. And suddenly you have this almost entirely new space that's been discovered in terms of what the models are capable of. It must be fun, as people much more deeply involved in building these models and building the interfaces, to watch that happen.
>> Yeah.
>> So if you talk to visual artists today, and I personally love this stuff, I post about it on the internet, you can get some very skeptical answers. People say, "Oh, this is terrible." Do you have any idea what triggers this reaction? I'm convinced that this ultimately really empowers the artists, right? It gives you new tools. It's like, hey, we now have, I don't know, watercolors for Michelangelo; let's see what he does with it, and amazing things come out. It's a similar thing. But what triggers this strong reaction against it?
>> So I think it's something to do with the amount of control over the output. In the beginning, when we had these kinds of text-to-image models, they would be very much one-shot: you put in some text, you get an output, and people would say, oh, this is art, this is this thing I made. And I think that maybe rubs people from the creative community a little bit the wrong way, because most of the decisions that were made were made by the model, by the data that was used to train it.
>> You can't express yourself anymore, basically, right?
>> Yeah, exactly. As a creative person, you want to be able to express yourself. So I think as we make the models more controllable, a lot of these concerns, like "oh, the computer is doing everything," may go away. And the other thing is, I think there was a period of time where we were all so amazed by the images these models could create that we were pretty happy to just see this stuff come out of them. But I think humans get bored of this type of thing really fast. There was a big rush, and now if you see an image that was just a single prompt, where the person didn't think about it much, you can kind of tell: that's an AI-generated image, not that interesting. So I think there's still this boundary where now you need to be able to make interesting things with the AI tools, which is hard, but this will always be a requirement. We need someone to be able to do this. And I think
>> we still need artists.
>> We still need artists. And I think artists will also be able to recognize when people have actually put in a lot of control and intent
>> and still not be an artist.
>> Maybe. But there is a lot of craft and a lot of taste, right, that you accumulate sometimes over decades. And I don't think these models really have taste. So I think a lot of the reactions that you mentioned maybe also come from that. And so we do work with a lot of artists across all the modalities that we work with, image, video, music, because we really care about building the technology step by step with them; they really help us push the boundary of what's possible. A lot of people are really excited, but they really do bring a lot of their knowledge and expertise, kind of like 30 years of design knowledge. We just worked with Ross Lovegrove on fine-tuning a model on his sketches so that he can then create something new out of that, and then we designed an actual physical chair that we have a prototype of. And so there are a lot of people who want to bring the expertise they've built, and the rich language they use to describe their work, and have that dialogue with the model so that they can push their work to the frontier. And it doesn't happen in one prompt and two minutes. It does require a lot of that taste and human creation and craft that goes into building something that actually then becomes art.
>> At the end, it's still a tool that requires the human behind it to express the feelings and the emotions and the story and everything.
>> Yeah, absolutely. Absolutely.
>> And that's what resonates with you when you look at it, right? You will have a different reaction when you know there's a human behind it who has spent 30 years thinking about something and then poured that into a piece of art.
I think there's also a bit of this phenomenon where most people who consume creative content, maybe even the ones who care a lot about it, don't know what they're going to like next. You need someone who has a vision and can do something that's interesting and different, right? And then you show it to people and they go, "Oh, wow. That's amazing." But they wouldn't necessarily think of it on their own, right? So when we're optimizing these models, one thing we could do is optimize for the average preference of everybody. But I don't think you end up with interesting things by doing that. You end up with something that everyone kind of likes, but you don't end up with things that make people say, "Oh, wow. That's amazing. I'm going to change my whole perspective of art because I saw that."
>> There's the avant-garde edition of the model, if I can use that term, and then there's, I don't know, what's the other end of the spectrum, the marketing edition or so, where it's very predictable and
>> very straightforward.
>> Yeah. Well, since we're coming up on time, a last couple of questions. One is: what's one feature that you know the model is capable of that you wish people asked you about more?
>> Interleaved.
>> Yeah, interleaved. I think we've always been amazed that nobody ever posts anything about interleaved generation, which is what we call the model's ability to generate more than one image for a specific prompt. So you can ask for a story, like a bedtime story or something, and have it generate the same character over a series of images. And I think, yeah, people haven't really found it useful yet, or haven't discovered it. I don't know.
>> Oh, interesting. Well, if you're listening to the podcast, go try this out.
>> Try it.
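
For listeners who do want to try interleaved generation, the request is just a single prompt that asks for several images, and the response then alternates text and image parts. A minimal sketch with the same SDK as the earlier examples; the model name and the story prompt are placeholders.

```python
# Sketch of interleaved generation: one prompt, several images plus narration back.
# Model name and the story prompt are illustrative placeholders.
from google import genai

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=["Tell a four-part bedtime story about a small robot exploring a garden. "
              "Generate an illustration for each part, keeping the robot consistent."],
)

for i, part in enumerate(resp.candidates[0].content.parts):
    if part.text:                        # narration for this part of the story
        print(part.text)
    elif part.inline_data is not None:   # one of the story illustrations
        with open(f"story_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```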
>> Yeah. And what's the most exciting technical challenge that you look forward to tackling in the next, I don't know, months or years?
>> So, I think there's really a high ceiling in terms of quality for where we're going. People look at these images and say, "Oh, it's almost perfect, we must be done." And for a while we were in this cherry-picking phase, where everyone would pick their best images, so you look at those and they're great. But actually, what's more important now is the worst image. We're in a lemon-picking stage, because every model can cherry-pick images that look perfect. So now I think the real question is how expressible this model is, and what's the worst image you would get given what you're trying to do. I think by raising the quality of the worst image, we really open up the number of use cases for things we can do. There are all kinds of productivity use cases beyond these immediate creative tasks that we know the model can do, and I think that's the direction we're headed: if these models can do more things reasonably, then the use cases will be far greater.
>> So that's the moral equivalent of the monkeys on typewriters, basically: any model, given enough tries, will eventually make something amazing.
>> But the other way around is hard.
>> Yeah, the other way around is hard; one monkey writing a book would be very hard.
>> It would be a good monkey, for that one.
>> What are the applications you think will come out when we raise that lower bound?
>> So, the one I'm most interested in, we mentioned this before, is education and factuality. You know, I don't know how many times a month I want to use these models for creative purposes, but I have way more use cases for information seeking, factuality, learning, education-type use cases. So I think once that starts working, it'll open up all these new areas.
Amazing.
>> There's also something about taking more advantage of the model's context window. You can input a really large amount of content into these LLMs. And some companies, you mentioned a few before, will have 150-page brand guidelines on what you can and cannot do, and they're very precise: colors, fonts, and, like, the size of a Lego brick, maybe. And so being able to actually take that in and follow it to a T when you're doing generation, that's a whole new level of control that we just don't have today, making sure that you're actually following it to a T. I think that will build a lot of trust with very established brands. And rather than having a second creative compliance review model that then double-checks everything, the model should do it on its own, right? It should have this loop: okay, I generated this, but page 52 says I shouldn't have, so I'm going to go back and try again, and then two hours later it will come back to you with that respected.
>> Yeah.
>> And we saw with the text models how much this inference-time scaling can help, right, being able to critique your own work. Yep.
>> So this feels really important.
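
The self-checking behavior described here, generate, check the output against a long guideline document, and retry with the violations fed back in, can be sketched as a simple outer loop around two calls. The helper logic, model names, prompts, and guideline file below are assumptions for illustration, not a description of how any production system actually works.

```python
# Sketch of a generate -> critique -> retry loop against brand guidelines.
# Model names, prompts, and the guideline file are illustrative assumptions.
import io
from google import genai
from PIL import Image

client = genai.Client()
guidelines = open("brand_guidelines.txt").read()  # e.g. colors, fonts, logo rules

prompt = "A product hero image of the new water bottle on a beach at sunrise."
feedback = ""

for attempt in range(3):
    gen = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[prompt + ("\nFix these issues: " + feedback if feedback else "")],
    )
    image = next(p.inline_data.data for p in gen.candidates[0].content.parts
                 if p.inline_data)  # assumes at least one image part came back

    # Second pass: a text/vision model critiques the image against the guidelines.
    review = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[Image.open(io.BytesIO(image)),
                  "List any violations of these brand guidelines, or reply OK:\n" + guidelines],
    )
    feedback = review.text.strip()
    if feedback.upper() == "OK":
        break

with open("approved.png", "wb") as f:
    f.write(image)
```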
>> Boy, an incredibly exciting future for image models.
>> Yes. And congrats on all the amazing work.
>> Thank you.
>> Thank you.
>> Thanks for having us.
>> Well, thank you so much for coming on
the pod.
[Music]