Google DeepMind Developers: How Nano Banana Was Made
By a16z
## Summary

### Key takeaways

- **Nano Banana: Best of Both Worlds**: Nano Banana, a new image generation model, combines the visual quality of Google DeepMind's Imagen models with the multimodal, conversational capabilities of Gemini. This blend is key to its resonance with users. [01:55]
- **Viral Success Driven by User Demand**: The viral nature of Nano Banana surprised even its creators. User demand surged unexpectedly, requiring constant increases to query capacity and demonstrating the model's significant utility. [02:10]
- **Personalization Drives Emotional Connection**: The ability for users to see themselves in AI-generated images, like creating personalized '80s makeover' versions, was a key moment that resonated deeply and highlighted the model's personal impact. [03:41]
- **AI Empowers Artists, Frees Up Creativity**: AI models like Nano Banana are transforming creative work by automating tedious tasks, allowing artists to focus 90% of their time on creativity rather than manual operations. [00:03], [05:40]
- **Intent is Key to AI-Generated Art**: While AI generates outputs, the true essence of art lies in human intent. These models serve as tools for individuals with creative intent to produce inspiring and meaningful work. [07:34]
- **Balancing Control and Simplicity in Interfaces**: Developing user interfaces for AI image models involves a trade-off between offering extensive control for professionals and maintaining simplicity for everyday users, a balance not yet fully realized. [10:35]
### Topics Covered
- Personalization drives emotional resonance in AI art.
- AI empowers artists by reducing tedious tasks.
- Will AI interfaces prioritize suggestions or control?
- Visual AI models will transform education.
- Improving the "worst image" unlocks AI's future.
## Full Transcript
These models are allowing creators to do
um less tedious parts of the job, right?
They can be more creative and they can
spend, you know, 90% of their time being
creative versus 90% of their time like
editing things and doing these tedious
kind of manual operations.
>> I'm convinced that this ultimately
really empowers the artists, right? It
gives you new tools, right? It's like,
hey, we now have, I don't know,
watercolors for Michelangelo. Let's see
what he does with it, right? And amazing
things come out.
Maybe start by telling us about the backstory behind the Nano Banana model. How did it come to be? How did you all start working on it?
>> Sure. So, you know, our team has worked on image models for some time. We developed the Imagen family of models, which goes back a couple of years. And actually, there was also an image generation model in Gemini before the Gemini 2.0 image generation model. So what happened was the teams started to focus more on the Gemini use cases, so interactive, conversational, and editing,
>> and essentially what happened is we teamed up and we built this model, which became what's known as Nano Banana. So yeah, that's sort of the origin story.
>> Yeah, and I think maybe just some more background on that. Our Imagen models were always kind of top of the charts for visual quality, and we really focused on these specialized generation and editing use cases. Then when 2.0 Flash came out, that's when we really started to see some of the magic of being able to generate images and text at the same time, so you can maybe tell a story. Just the magic of being able to talk to images and edit them conversationally. But the visual quality was maybe not where we wanted it to be. And so Nano Banana, or Gemini 2.5 Flash Image,
>> Nano Banana is way cooler.
>> It's easier to say. A lot easier.
>> It's the name that stuck.
>> Yes, it's the name that stuck. But it really became kind of the best of both worlds in that sense: the Gemini smartness and the multimodal, conversational nature of it, plus the visual quality of Imagen. And I feel like that's maybe what resonates a lot with people.
>> Wow. Amazing. So I guess when you were testing the model, as you were developing it, what were some wow moments where you thought: I know this is going to go viral, I know people will love this?
>> So I actually didn't feel like it was going to go viral until we had released it on LMArena. What we saw was that we budgeted a comparable number of queries per second as we had for our previous models on LMArena, and we had to keep upping that number as people kept going to LMArena to use the model. And I feel like that was the first time when I was really like, "Oh, wow. This is something that's very, very useful to a lot of people." It surprised even me. I don't know about the whole team, but we were trying to make the best conversational editing model possible. But then it really started taking off when people were going out of their way and using a website that would actually only give you the model some percentage of the time. Even that was worth it, going to that website to use the model. So I think that was really the moment, at least for me, when I was like, oh wow, this is going to be bigger.
>> That's actually the best way to condition people: only give them rewards some of the time, not all the time, by design.
>> I had a moment earlier. I've been trying similar queries on multiple generations of models over time, and a lot of them have to do with things I wanted to be as a kid: an astronaut, an explorer, or, you know, put me on the red carpet. I tried it on a demo that we had internally before we released the model, and it was the first time the output actually looked like me. And you know, you guys play with these models all the time. The only time I've seen that before is if you fine-tune a model, using LoRA or some other method, and you need multiple images, it takes a really long time, and then you have to actually serve it somewhere. So this was the first time it was zero-shot: just one image of me, and it looks like me, and I was like, wow. And then we ended up with decks that are just covered in my face as I was trying to convince other people that it was really cool. And really, I think the moment more people realized it was a really fun feature to use was when they tried it on themselves, because it's kind of fun when you see it on another person, but it doesn't really resonate with people emotionally until it's personal. It's you, your kids, you know, your spouse, and, I think, that's your dog,
>> your dog. And that's really what started resonating internally. Then people just started making all these '80s makeover versions of themselves, and that's when we really started to see a lot of internal activity and we were like, "Okay, we're on to something."
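
What the zero-shot personalization described above looks like from the developer side is roughly a single request: one reference photo plus a plain-language instruction, no fine-tuning or serving step. Here is a minimal sketch using the google-genai Python SDK; the model name, file names, and prompt are illustrative placeholders, not the exact setup discussed in the episode.

```python
# Minimal sketch: one reference photo + a text instruction, no LoRA fine-tuning.
# Assumes GEMINI_API_KEY is set; model name and file paths are placeholders.
from google import genai
from PIL import Image

client = genai.Client()

reference = Image.open("me.jpg")  # a single photo of the subject
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # "Nano Banana"
    contents=[reference,
              "Give this person an 80s makeover, but keep the face clearly recognizable."],
)

# The response can interleave text and image parts; save any returned images.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.inline_data is not None:
        with open(f"makeover_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```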
>> It's a lot of fun to test these models when we're making them, because you see all these amazing creative things that people make and go: oh wow, I never thought that was possible.
>> So it's really fun.
>> No, I mean, we've done this with the whole family, and it's a crazy amount of fun.
>> So, thinking a bit about the long term: where does this lead? I mean, we built these new tools that I think will change visual arts forever, right? We suddenly can transfer style. We suddenly can generate consistent images of a subject. What used to be a very complex manual Photoshop process, suddenly I type one command and it magically happens. But what's the end state of this? Do we have an idea yet? How will creative arts be taught in a university, you know, five years from now? Want to take that?
>> So I I think it's going to be a spectrum
of things, right? I think on the
professional side, a lot of what we're
hearing is that these models are
allowing creators to do um less tedious
parts of the job, right? They can be
more creative and they can spend, you
know, 90% of their time being creative
versus 90% of their time like editing
things and doing these tedious kind of
manual operations. So I'm really excited
about that. Like I think we'll see kind
of an explosion of creativity like on
that side of the spectrum. And then I think for consumers there are probably two sides of the spectrum. One is, you know, you might just be doing some of these fun things, like Halloween costumes for my kid, right? And the goal there is probably just to share it with somebody, right? Your family or your
friends. Um on the other side of the
spectrum, you might have these tasks
like putting together a slide deck,
right? I started out as a consultant. We
talked about that at the beginning. Um,
and you spend a lot of time on like very
tedious things like trying to make
things look good, trying to make the
story make sense. I think for those
types of tasks, you probably just have an agent that you give the specs of what you're trying to do, and it goes out and actually lays it out nicely for
you. It creates the right visual for the
information that you're trying to
convey. And it really is going to be
this, I think, spectrum depending on
what you're trying to do. Like do you
want to be in the creative process and
actually tinker with things and
collaborate with the model or do you
just want the model to like go do the
task and be as minimally involved as
possible?
>> So in this new world, then, what is art? I mean, somebody recently said art is when you can create an out-of-distribution sample. Is that a good definition, or is it aiming too high?
>> Do you think art is out of distribution or in distribution for the model?
>> There we go.
I think that "out-of-distribution sample" definition is a little bit too restrictive. I think a lot of great art is actually in distribution for the art that came before it. So, I mean, what is art? I think it's a very philosophical debate, and there are a lot of people who discuss this. To me, the most important thing for art is intent. And so what is generated from these models is a tool that allows people to create art. And I'm actually not worried about the high end and the creatives and the professionals, because I've seen that if you put me in front of one of these models, I can't create anything that anyone wants to see, but I've seen what people can do who are creative and who have intent and these ideas, and that's the most interesting thing to me: the things they create are really amazing and inspiring for me. So I feel like the high end, the professionals and the creatives, they'll always use state-of-the-art tools, and this is another tool in the tool belt for people to make cool things. I think one of the really
interesting things that I kept hearing
about this model in particular from like
creatives and artists was a lot of them
felt like they couldn't use a lot of AI
tools before because it didn't allow
them the level of control that they
expected for their art. On one side, that was character or object consistency: they really used that to have a compelling narrative for a story, and before, when you couldn't get the same character over and over, it was very difficult. And then the second thing I hear all the time from artists is that they love being able to upload multiple
images and say like use the style of
this on this character or add this thing
to this image which is something that I
think was very hard to do even with
previous image edit models. I guess I'm
curious whether that was something you guys were really optimizing for when you trained this one, or how did you think about that?
>> I mean yeah definitely sort of
customizability and character
consistency are things that we closely
monitored during the development and we
tried to do the best job we could on
them. Um, I think another thing is also
uh the iterative nature of kind of like
an interactive conversation. And you know, art tends to be iterative as well, where you make lots of changes, you see where it's going, and you make more. This is another thing I think makes the model more useful. And actually, that's an area where I also feel we can improve the model greatly. I know that once you get into really long conversations, it starts to follow your instructions a little bit worse. But this is something we're planning to improve on, to make the model more of a natural conversation partner, or a creative partner, in making something.
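
The iterative, conversational editing described here maps naturally onto a chat session where each turn refines the previous image. Below is a rough sketch, again with the google-genai Python SDK; the model name, file names, and prompts are assumptions for illustration, and error handling is left out.

```python
# Sketch of conversational, multi-turn editing: each message refines the result.
# Model name, paths, and prompts are illustrative placeholders.
from google import genai
from PIL import Image

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-flash-image-preview")

def save_images(response, prefix):
    """Write any image parts in the response to disk."""
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data is not None:
            with open(f"{prefix}_{i}.png", "wb") as f:
                f.write(part.inline_data.data)

# Turn 1: start from a reference image and a style instruction.
resp = chat.send_message([Image.open("character.png"),
                          "Restyle this character as a watercolor illustration."])
save_images(resp, "turn1")

# Turn 2: refine without re-describing everything; the chat keeps the context.
resp = chat.send_message("Keep the same character, but place them on a rainy "
                         "city street at night.")
save_images(resp, "turn2")
```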
>> One thing that's so interesting is that after you guys launched Nano Banana, we started to hear about editing models all the time, everywhere. It's like after you launched, the world woke up and said editing models are great, everyone wants them. And then obviously it goes into the customizability, the personalization of it. Oliver, I know you used to be at Adobe, and there's also the software where we used to manually edit things. How do you see the knobs evolving now on the model layer versus what we used to do?
>> Yeah. I mean, I think one thing that Adobe has always done, and that professional tools generally require, is lots of control, lots of knobs. So there's always a balance: we want someone to be able to use this on their phone, maybe with just a voice interface, and we also want someone who is a really professional artist or creative to be able to do fine-scale adjustments. I think we haven't exactly figured out how to enable both of those yet. But there are a lot of people building really compelling UIs, and I think there are different ways it can be done. I don't know, do you have thoughts?
>> Well, I also hope that we get to a point
where you don't have to learn what all
these controls mean and the model can
maybe smartly suggest what you could do
next based on the context of what you've
already done, right? And that feels like it's kind of ripe for someone to take on. So what do the UIs
of the future look like um, in a way
where you probably don't need to learn a
hundred things that you had to before,
but like the tools should be smart
enough to suggest to you what it can do
based on what you're already doing.
>> That's such an insightful take. I definitely had moments when I used Nano Banana where I was like, I didn't know I wanted this, but I didn't even ask for this style; I don't even have the words for what that style is even called. So this is very insightful about how the image embedding and the language embedding are not one-to-one: we cannot map to all the editing tasks with language. So, oh, go ahead.
>> Yeah, let me take a bit of a counterpoint just to see where this goes. The question of how complex the interface can be is limited by what we can express in software, how easy we can make something in software, which to some degree is also limited by how much complexity a user is willing to tolerate. And you know, if you have a professional, they only care about the result; they're willing to tolerate a vast amount of complexity. They have the training, they have the education, they have the experience to use that. Then we may end up with lots of knobs and dials, it's just a very different set of dials. I mean, today, if you use Cursor or so for coding, it's not that it has a super easy, single-text-prompt interface; it has a good amount of "add context here, here are different modes" and so on.
So will we have the ultra-sophisticated interface for the power user, and what would that look like?
>> So I'm a big fan of ComfyUI and node-based interfaces in general,
>> and that is complex
>> and it's complex, but it's also very robust and you can do a lot of things. And so after we released Nano Banana, we saw people building all these really complicated ComfyUI workflows where they were combining a bunch of different models and tools together, and that generated, for example, workflows using Nano Banana as a way to get storyboards or keyframes for video models. You can plug these things together and get really amazing outputs. So I think that at the pro or developer level, these kinds of interfaces are great. At the prosumer level, I think it's very much unknown what it's going to look like in a couple of years.
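
The storyboard-to-video pattern people wire up in ComfyUI can be sketched outside a node graph too: the image model pins down consistent keyframes, then a video model animates between them. The outline below is conceptual; generate_keyframe reuses the API call shown earlier, while animate_between is a hypothetical stand-in for whatever image-to-video model or node you plug in, not a specific API.

```python
# Conceptual version of the keyframes-to-video workflows described above.
# animate_between() is a hypothetical placeholder for an image-to-video step.
import io
from google import genai
from PIL import Image

client = genai.Client()

def generate_keyframe(reference: Image.Image, shot_description: str) -> Image.Image:
    """Ask the image model for one keyframe that keeps the reference character consistent."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",  # placeholder model name
        contents=[reference, f"Same character, new shot: {shot_description}"],
    )
    data = next(p.inline_data.data for p in resp.candidates[0].content.parts
                if p.inline_data)
    return Image.open(io.BytesIO(data))

def animate_between(start: Image.Image, end: Image.Image) -> bytes:
    """Hypothetical image-to-video step (e.g. a Veo-style API or a ComfyUI video node)."""
    raise NotImplementedError("plug in your video model here")

hero = Image.open("hero.png")
shots = [
    "standing at the castle gate at dawn",
    "walking through a crowded market at noon",
    "reaching the harbor as the sun sets",
]
keyframes = [generate_keyframe(hero, s) for s in shots]
clips = [animate_between(a, b) for a, b in zip(keyframes, keyframes[1:])]
```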
>> Yeah. I think it just really depends on
your audience, right? Because for the
regular consumer, like I use my parents
always as an example. The chatbot is
actually kind of great.
>> Oh, yeah.
>> Because you don't have to learn a new UI. You just upload your images and then you talk to them, right? It's kind of amazing that way. Then for the pros, I agree that you need so much more control. And then there's somewhere in between, probably, which is people who may want to be doing this but were too intimidated by the professional tools in the past. And for them, I do think there's a space where you need more control than the chatbot gives you, but you don't need as much control as what the professional tools give you. What's that kind of in-between state?
>> There's a ton of opportunity there.
>> There's a ton of opportunity there. It is interesting you mentioned ComfyUI, because it's at the far end of the workflow spectrum: a workflow can have hundreds of steps and nodes, and you need to make sure all of them work. On the other side of the spectrum there's Nano Banana, where you kind of describe something with words and you get something out, and, like, I don't know the model architecture, stuff like that. So I guess, is your view that the world is moving toward a single model hosted by one provider doing it all, or do you think the world is moving more toward everyone building a workflow, with Nano Banana as one of the nodes in a ComfyUI workflow?
>> I definitely don't think that the broad range of use cases will be fully satisfied by one model at any point. So I think there will always be a diversity of models. I'll give you an example: we could optimize for instruction following in our models, make sure it does exactly what you want, but that might be a worse model for someone who's looking for ideation or inspiration, where they want the model to kind of take over and do other things, go crazy. So I just think there are so many different use cases and so many types of people that there's a lot of room in this space for multiple models. So that's where I see us going. I don't think this is going to be a single model to rule them all.
>> Makes complete sense. Let's go to the very other end of the spectrum from the professional. Do you think kindergarteners in the future will learn drawing by sketching something, you know, on a little tablet, and then you have the AI turn that into a beautiful image, and that's how they get in touch with art?
>> I don't know if you always want it to turn into a beautiful image, but I think there's something there about the AI being, again, a partner and a teacher to you in a way that you didn't have. So, I didn't know how to draw, still don't, don't have any talent for it really. But I think it would be great if we could use these tools in a way that actually teaches you the steps and helps you critique, and maybe, again, shows you kind of an autocomplete for images: what's the next step that I could take, right? Or maybe show me a couple of options, and how do I actually do this? So I hope it's more that direction. I don't think we all want, you know, every 5-year-old's image to suddenly look perfect.
>> We would probably lose something in the process. As someone who struggled in high school with the art and sketching class more than any of my other classes, I actually would have preferred it, but I know a lot of people want their kids to learn to draw, which I understand.
>> It's funny, because we've been trying to get the model to create childlike crayon drawings, which is actually quite challenging.
>> Ironically, you know, sometimes the things that are hard to make are the ones where the level of abstraction is very large, right?
>> So it's actually quite difficult to make those types of images. Your dedicated pre-K fine-tune.
>> We do have some evals right now
>> to try to see if we're getting better.
>> In general, I'm very optimistic about AI for education. And part of the reason is I think that most of us are visual learners, right? AI right now as a tutor, basically all it can do is talk to you or give you text to read, and that's definitely not how students learn. So I
think that these models have a lot of
potential as a way to help education by
giving people sort of visual cues. You
know, imagine if you could get an
explanation for something where you get
the text explanation, but you also get
images and figures that kind of like
help explain how they work. I think everything will be much more useful, much more accessible, for students. So I'm really excited about that.
>> On that point, one thing that's very interesting to us is that when Nano Banana came out, it almost felt like part of the use case is as a reasoning model. You have a diagram, right, and you can explain some knowledge visually. So the model is not just doing an approximation of the visual aspect; there's a reasoning aspect to it too. Do you think that's where we're going? Do you think all the large models will realize that, oh, to be a good LLM or VLM, we have to have both image and language and audio and so on?
>> 100%. I definitely think so. Um the the
future for these AI models that I'm most
excited by is where they are tools for
people to accomplish more things. Like I
think if you imagine a future where you
have these agentic models that just talk
to each other and do all the work, then
it becomes a little bit less necessary
that there's like this visual mode of
communication. But as long as there's
people in the loop, and as long as the motivation for the task they're solving comes from people, I think it makes total sense that the
visual modality is going to be really
critical for any of these AI agents
going forward.
Will we get to a point where, you know, I'm asking you to create an image and it sits for two hours, reasons with itself, has drafts, explores different directions, and then comes back with a final answer?
>> Yeah. Absolutely. If necessary. Like,
>> and maybe not just for a single image, but to the point of, you know, maybe you're redesigning your house and you actually really don't want to be involved in the process, but you're like, okay, this is what it looks like, this is some inspiration that I like. And then you send it to a model the same way that you would send it to a designer.
>> It's visual deep research.
>> It's like visual deep research, basically. I really like that term. And then it goes off and does its thing and searches for maybe the furniture that would go with your environment, and then it comes back to you and maybe presents you with options, because maybe you don't want to sit for two hours waiting on one thing, an art book, you know, a ten-slide deck. I also think, if you think about instruction manuals or IKEA directions or something, breaking down a hard problem into many intermediate steps could be really useful as a way to communicate.
>> So when can we generate Lego sets?
>> Yeah, soon, maybe.
Do we at some point need 3D as part of it?
>> Right.
>> I mean, there's a whole debate around world models and image models and how they fit together. Thoughts? Enlighten us here. What's the short summary of where we'll end up there?
>> I mean, I don't know the answer. Obviously the real world is in 3D. So if you have a 3D world model, or a world model that has explicit 3D representations, there are a lot of advantages; for example, everything stays consistent all the time. Now, the main challenge is that we don't walk around with 3D capture devices in our pockets. So in terms of the available data for training these models, it's largely the projection onto 2D. So I think that both viewpoints are totally valid for where we're going. I come a bit from the projection side: I think we can solve almost all the problems, if not all of them, working on the projection of the 3D world directly and letting the models learn latent world representations. We see this already in that the video models have very good 3D understanding; you can run reconstruction algorithms over the videos you generate, and they're very accurate. And in general, if you look at the history of human art, it starts as the projection, right? People drawing on cave walls. All of our interfaces are in 2D. So I think humans are very well suited to working with this projection of the 3D world onto a 2D plane, and it's a really natural environment for interfaces and for viewing.
>> That is very true. So, I'm a cartoonist in my spare time, and drawing in 2D is just light and shadow, and then you present it as 3D: we trick ourselves into believing it's 3D even though it's on a piece of paper. But then what a human can do, that a drawing or a model can't, is navigate the world: we see a table, we know we can't walk through it. I guess the question becomes, if everything is 2D, how do you solve that problem?
>> Well, so if we're trying to solve robotics problems, I think maybe the 2D representation is useful for planning and visualizing at a high level. I think people navigate by remembering kind of 2D projections of the world. You don't build a 3D map in your head; you're more like, oh, I know, I see this building, I turn left.
>> Yeah.
>> So I think for that kind of planning it's reasonable, but for the actual locomotion around the space, 3D is definitely important there.
>> Robotics, yeah. They probably need 3D. That's the saving grace.
>> Yeah. So, character consistency, which you mentioned earlier: I really love the example of how, when a model feels so personal, people are so tempted to try it. How did you unlock that moment? The reason I ask is that character consistency is so hard. There's a huge uncanny valley to it: if it's someone I don't know and I see their AI generation, I'm like, okay, it's maybe the same person; but if it's someone I know and there's just a little bit of a difference, I actually feel very turned off by it, because I'm like, this is not a real person. So in that case, how do you know what you're generating is good? Is it mostly by user feedback, like "I love this," or is it something else? Do you look at faces, you know, with face detection, and,
>> No. So, not even before we released this, right? When we were developing this model, we actually started out doing character consistency evals on faces we didn't know, and it doesn't tell you anything, right? And then we started testing it on ourselves and quickly realized, okay, this is what you need to do, because this is a face that I'm familiar with. So there's a lot of eyeballing of evaluations that happens, and just the team testing it on themselves, and generally on people they know. Like, Oliver probably knows my face well enough at this point to be able to tell whether or not it's actually me when it's generated. And so we do a lot of that. And then, you know, you ideally test it on different sets of people, different ages, different kinds of groups of folks, to make sure that it works across the board.
>> Yeah, that's right. I mean, that touches a little bit on this bigger issue, which is that evals are really difficult in this space, because human perception is very uneven in terms of the things it cares about. So it's very hard to know how good the character consistency of a model is, and is it good enough, is it not good enough? I think there's still a lot of improvement we can make on character consistency, but for some use cases we got to a point, and, you know, we weren't the first edit model by any means, but I think once the quality gets above a certain level for character consistency, it can kind of just take off, because it becomes useful for so much more.
>> And I think as it gets better, it'll be useful for even more things too. Yeah,
>> I think one of the really interesting things we're seeing across a bunch of modalities, of which image editing and generation is obviously one, is that the arenas and benchmarks and everything are awesome, but especially when you have multi-dimensional things like image and video, it's very hard, as all of the models get better and better, to condense every quality of a model into one judgment. So you're judging: okay, you swap a character into an image and you change the style of the image. Maybe one model did the character swap and consistency much better and the other did the style much better; how do you say which output is better? It probably comes down to what the person cares most about and what they want to use it for. Are there certain characteristics of the model that you guys value more than others when making those trade-offs, when deciding which version of the model to deploy or what to really focus on during training?
>> Yes, there are. One of the things I like about this space is that there is no right answer. So actually there's quite a lot of, I don't know if it's taste, but preference, that goes into the models. And I think you can see the difference in preferences of the different research labs in the models that they release.
>> So when we're balancing two things, a lot of it comes down to, oh well, I just like this look better, or this feature is more important to us.
>> I'd imagine it's hard for you guys, too, because you have so many users, right? Like, Google, being in the Gemini app, everyone in the world can use that, versus many other AI companies that can just think: we're only going for the professional creatives, or we're only going for the consumer meme makers. You guys have the unique and exciting but challenging task that literally anyone in the world can do this. How do you decide what everyone would want?
>> Yeah. And sometimes we do make these trade-offs. We do have a set of things that are super high priority, that we don't want to regress on. So now, because character consistency was so awesome and so many people are using it, we don't want our next models to get worse on that dimension, right? So we pay a lot of attention to it. We care a lot about images looking photorealistic when you want photos, and this is important. One, I think we all prefer that style too. And, you know, for advertising use cases, for example, a lot of it is photorealistic images of products and people, and so we want to make sure that we can do that. And then sometimes there are just things that will fall by the wayside. So for this first release, the model is not as good at text rendering as we would like it to be, and that's something we want to fix in the future. But it was one of those things where we looked at it: okay, the model's good at X, Y, Z, it's not as good at this, but we still think it's okay to release, and it will still be an exciting thing for people to play with.
>> If you look at the past, for previous model generations there were a lot of things we did with sidecar models, like ControlNet or something like that, where we basically figured out a way to provide structured data to the model to achieve a particular result. It seems like with these newer models that has taken a step back, just because they're so incredibly good at just prompting, or, you know, giving a reference image and picking things up from there. Where will this go long term? Do you think this will come back to some degree? You know, from the creator's perspective, having, I don't know, OpenPose information so I can get a pose exactly right for multiple characters seems very tempting, right? Or to rephrase it a little bit: does the bitter lesson hold here, that at the end of the day everything's just one big model and you throw things in, or is there a little bit of structure we can offer to make this better?
>> I mean, I think there will always be users that want control the model doesn't give you out of the box. But we tried to make it so that, because really, what an artist wants when they want to do something is for their intent to be understood. And I think these AI models are getting better at understanding the intent of users. Often, when you ask text queries now, the model gets what you're going for. So in that sense, I think we can get pretty far with understanding the intent of our users. And maybe some of that is personalization: we need to know information about what you're trying to do or what you've done in the past. But I think once you can understand the intent, then you can generally do the right type of edit. Is this a very structure-preserving edit, or is this more free-form? We can learn these kinds of effects, I think. But still, of course, there's the one person who's going to really care about every pixel, where this thing needs to be slightly to the left and a little bit more blue, and those people will use existing tools to do that.
>> I mean, it's like, you know, I want an image with 26 people spelling out every letter of the alphabet, or something like that. That's sort of the thing where I think we're still quite a bit away from getting it right on the first try. On the other hand, with pose information, it could potentially get there.
>> But then the question, I guess, is: do you really want to be the one extracting the pose and providing that as information, or do you just want to provide some reference image and say, this is actually what I want, model, go figure this out?
>> There are 26 people, every one in a different pose and style. Fair enough. Yeah, I think in that case I wouldn't spend a ton of time building a custom interface for making this picture of 26 people; it seems like the kind of thing that we can solve.
>> Just transfer.
>> Do you think the representation of what the AI images are will change? The reason I ask is that, as artists, there are different formats we play with. There are SVGs, where we have anchor points and Bézier curves. And on the other side there's, you know, Procreate or, like, Fresco, what have you, where there are layers we can also play with. There's the other parameter, which is what brush you use, the texture of it. So for every one of these parameters you can write a script and actually do something very personal with it. Do you think pixels are the right representation, the endgame for image generation models, or do you think there's a net new representation that we haven't invented yet?
>> That's an easy question... wow. I'll say that everything is a subset of pixels.
>> That's true.
>> So text is a subset of pixels, because I could just render all the text as an image. So how far can we get with just pixels is an interesting question. I think if the model is really responsive and handles multi-turn interactions well, then you can probably get pretty far, because the primary reason you would want to leave the pixel domain is editability. And so in cases where you need to have your font, or you want to change the text, or you want to move things around with control points, it could be useful to have a kind of mixed generation that consists of pixels and SVGs and other forms. But if we can do it all, if the multi-turn interaction is good enough, then I think you can get pretty far with pixels. I will say that one of the things that's exciting about these models with native capabilities is that you now have a model that can generate code and it can generate images.
>> So there are a lot of interesting things that come at that intersection, right? Like maybe I want to write some code and then make some things be rasterized and some things be parametric.
>> Yeah.
>> Stick it all together,
>> train it together. This would be very cool.
That's such a good point, because I did see a tweet of someone asking Claude Sonnet to replicate an image in an Excel sheet where every cell is a pixel, which is a very fun exercise. It was a coding model that doesn't really know anything about images, yet it worked.
>> Yeah, there's the classic pelican-riding-a-bicycle test.
Yeah, totally. I have one on interfaces, if that's okay; sorry if I'm bringing up too much product stuff, guys, I'm just very curious on the product front. I'm curious how you think about owning the interface where people are editing or generating images with Nano Banana, versus really just wanting a ton of people to use the model for different things via the API. We've talked about so many different use cases: ads, education, design, architecture. For each of those there could be a standalone product built on top of Nano Banana that prompts the model in the right way, or allows certain types of inputs, or whatever. Is your vision that the product in the Gemini app is a playground for people to explore, and then developers will build the individual products that are used for certain use cases, or is that something you're also interested in owning?
>> I think it's a little bit of everything.
Um, so I definitely think that the
Gemini app is an entry point for people
to explore. And the nice thing
about Nano Banana is I think it shows
that fun is kind of a gateway to utility
where you know people come to make a
figurine image of themselves but then
they stay because it helps them with
their math homework or it helps them
write something, right? And and so I
think that's a really powerful kind of
transition point. Um there's definitely
interfaces that we're interested in
building and exploring as a company. And
so, you know, you may have seen Flow from Josh's team in Labs, which is really trying to rethink what's the tool for AI filmmakers, right?
and for AI filmmakers image is actually
a big part of the iteration journey
right because video creation is
expensive a lot of people kind of think
in frames um when they when they
initially start creating and a lot of
them even start in the LLM space for
like brainstorming and thinking about
what they want to create in the first
place. And so there's definitely a place that we have in that space, of us trying to think about what that looks like. We have the
advantage of it kind of sitting close to
the models and the interfaces so we can
kind of build that in in a tight
coupling. Um, and then there's
definitely the case that, you know, we're probably not going to go build software for an architecture firm. My dad is an
architect and he would probably love
that. Um, but I don't think that's
something that we will do, but somebody
should go and do that. Um, and that's
why it's exciting because we do have the
developer business and we have the
enterprise business and so people can go
use these models and then figure out
what's the next-generation workflow for this specific audience so that it can help them solve a problem. So I think the answer is kind of: yes, all three.
>> Yeah.
>> Yeah. I brought that up because, I don't know if you guys have been following the reception of Nano Banana in Japan, but I'm sure you have; it's been insane. And it's so funny: now half of my X feed is these really heavy Nano Banana users in Japan who have created Chrome extensions, there's one called Easy Banana, specifically for using Nano Banana for manga generation and specific types of anime and things like that. They go super deep into basically prompting the model for you and storing the outputs in various places, using obviously your underlying model to generate these amazing anime that you would never guess were AI generated, because the level of precision and consistency and that sort of thing is just beyond what I've seen any single model be able to do today.
I guess, to Justin's point, what are some force multipliers that you guys have seen in the model? What I mean by this is, for example: if you unlock character consistency, you can generate different frames, and then you can make a video, and then you can make a movie, right? These are the things where, if you get them right and get them really well, there are so many more downstream tasks that can derive from them. Just curious how you think about what force multipliers you want to unlock.
>> What's the next big one?
>> What's the next big wave, yeah, of people who can just use Nano Banana as the base model for all the downstream tasks.
So I think one current one actually is the latency point, because it makes it really fun to iterate with these models when it just takes ten seconds to generate the next frame. If you had to sit there and wait for two minutes, you would probably just give up and leave; it's a very different experience. So I think that's one. There has to be some quality bar, because if it's just fast and the quality isn't there, then it also doesn't matter; you have to hit a quality bar, and then speed becomes a force multiplier. I think this general idea of just visualizing information, to your education point from earlier, is sort of another one, right? And that needs good text. It needs factuality, right? Because if you're going to start making visual explainers about something, it looks nice, but it also needs to be accurate, right?
>> And so I think that's probably the next level, where at some point you could also just have a textbook personalized to you, right? Where it's not just the text that's different, but also the visuals.
>> Yeah, The Diamond Age, that was basically it.
>> Yeah, basically. And then it should also internationalize really well, right? Because a lot of the time today
you might actually be able to find a
diagram that explains the thing that
you're trying to learn about on the
internet, but it's maybe not in the
language that you actually speak. Um,
right? And so I think that becomes just
like another way to improve and open up
accessibility um of information to just
a lot more people and again visually
because a lot of people are visual
learners.
>> Interesting. How do you think about the images being generated? The reason I ask is that there's another very cool example I've seen someone make work with Nano Banana: he wrote a script and then kept prompting the model to generate the frame one second after the current one, and then it became a video. And when I saw it, I was like, well, is every image just one frame in a continuum? Like there's always a continuum in a parallel universe, and you could have generated any one of them.
>> It's one big directed graph that,
>> right, exactly, and then maybe it's video at the end of the day. So how do you see that? Where does it intersect or not intersect?
>> Yeah, video and images are very closely related. And I think what we're seeing in these what-comes-next, sequence-prediction use cases is the generalization and world knowledge of the model as well. So where do I think it's going? I think video is an obvious next domain. When you have editing, a lot of times what you're asking is, what happens if I do this, and that's what video has: the time sequence of actions. So it's like we have a slow frames-per-second video that you can interact with,
>> but obviously making something that's fully interactive and real-time is the direction this field is headed.
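
The "generate the frame one second after this" trick mentioned a moment ago is essentially a loop that feeds each output image back in as the next input. A rough sketch using the same SDK as the earlier examples; the model name, prompt, frame count, and file paths are placeholders, and in practice results tend to drift over many iterations.

```python
# Sketch of the "next frame" trick: repeatedly ask for the frame one second later,
# feeding each generated image back in. Model name and paths are placeholders.
import io
from google import genai
from PIL import Image

client = genai.Client()
frame = Image.open("frame_000.png")  # starting frame

for t in range(1, 10):
    resp = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[frame, "Generate the same scene exactly one second later."],
    )
    image_parts = [p for p in resp.candidates[0].content.parts if p.inline_data]
    if not image_parts:
        break  # the model answered with text only; stop rather than loop on nothing
    frame = Image.open(io.BytesIO(image_parts[0].inline_data.data))
    frame.save(f"frame_{t:03d}.png")
```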
>> So you are probably in the, I don't know, top 0.001% of the most experienced people in the world at using image models. What are your personal favorite use cases? How do you use it day-to-day, if you're not just testing an existing model?
Well, I'm not sure I am in the very top, but I'll tell you what:
>> I mean, it's like we were saying earlier: the personalization aspect is the thing that totally drives it home for me. I have two young kids, and the best things that I do with the model are the things I do with my kids, like making their stuffed animals come to life, those types of applications, and it's just so personal and gratifying to see. We also see a lot of people taking old pictures of their family, for example, and restoring them. So I think that's the real beauty of the edit models: you can make it about the one thing that matters most to you. So that's what I use it for, my kids, basically.
>> Very nice.
>> Yeah. You're basically making content that you probably would have never made before, and it's for the consumption of one person, right? Or one family. You're telling these stories that you would have never told before. So, kind of similar: I do a lot of family holiday cards and birthday cards and whatnot. Now, anytime I make a slide deck, I force myself to generate some images that are contextually relevant and try to get the text right, and all of those things. And then we try to push the boundaries around things like, can you make a chart in pixel space? Do you want to? That's another question, right? Because you also want the bars in the bar chart to be accurately positioned relative to one another. So I think we do a lot of these things. I'm actually really impressed with the people we work with on the team, who are just very creative. We have a team who works really closely with us on models that we're developing, and they just push the boundary. They'll do crazy things with the models.
>> What's the most surprising thing you've seen here, like, I didn't know our model could do this?
>> There are even just simple things, where people have been doing texture transfer. They will take a portrait of a person and ask, what would it look like if it had the texture of this piece of wood? And I would have never thought of this being a use case, because my brain just doesn't work that way. But people just push the boundaries of what you can do with these things.
>> That is an interesting example of world knowledge, because texture technically is 3D; there's the whole 3D aspect of it, the light and shadow of it, but this is a 2D transfer. So that's very cool.
I think for me, the thing I'm most excited by, and maybe most impressed by, is the use cases that test the reasoning abilities of the models. Some people on our team figured out you could give geometry problems to the model and ask it to, you know, solve for X here, or fill in this missing thing, or present this from a slightly different view. And these types of things, which really require the world knowledge and reasoning ability of a state-of-the-art language model, are the things that make me go, wow, that's amazing, I didn't think we would be able to do that.
>> Can it generate compilable code on a blackboard yet? Like, if I take a picture of, I don't know, code on my laptop, would the image model know if it compiles?
>> I've seen examples where people give it an image of HTML code and have the model render the web page, and it can do that.
>> That's very cool. The coolest example I saw, so I came from academia, so I spent a lot of time writing papers and making figures: one of our colleagues took a picture of one of the result figures from one of their papers, with a method that could handle a bunch of different types of applications, and sort of erased the results. So you have just the inputs, and they asked the model to solve all of these in picture form, in a figure of a paper, and it was able to do that. It could actually figure out what problem each part of the figure is asking for, find the answer, put it in the image, and then do that for a bunch of different applications at the same time, which was really amazing. Very cool.
>> That's very cool. Has anyone built an application on top of that capability yet? What's the application that will come out of that?
>> I think there are a lot of very interesting zero-shot transfer capabilities, problem-solving type things, that we don't even know the boundary of yet. And some of these are probably quite useful. You know, if you want a method that solves some problem X, I don't know, finds the normals of the scene or something, the surface orientations or something, you probably can prompt the model to give you a reasonable estimate. So I think there are lots of problems, sort of understanding problems and other types of things, that we could maybe solve with zero- or few-shot prompting that we don't know about yet. Yeah, there's
one thing you mentioned that I found super interesting, which is the world knowledge transfer. In a lot of world models, or video models, there is always something that keeps the state: just because you look away doesn't mean the chair should disappear or change color, because that's not what the state of the world is. How do you see that? Do you think there's relevance there for image models? Is that something you even consider optimizing for?
>> Yeah, I mean, if you think about an image model that has a long context, where you can put other things in that context, like text, images, audio, video, then you're definitely reasoning over that context of things to produce a final output image or video. So yeah, I think there's definitely some model capability to do this type of stuff already.
>> Got it. I haven't tested it out yet for this big use case, but I'll let you know. That's one of my favorite things about these models, just finding things out. And I'm sure it's really fun for you guys, and you probably have much more of a hint than we do about what they can do. But sometimes you'll just see some crazy X or Reddit or wherever post about some incredible thing that someone has figured out how to do, that you would never expect the model might be able to do, and then other people build on that and say, oh, and then I tried the next iteration of this thing. And suddenly you have this almost entirely new space that's been discovered in terms of what the models are capable of. It must be fun, as people much more deeply involved in building these models and building the interfaces, to watch that happen.
>> Yeah.
>> So if you talk to visual artists today, and I personally love this stuff, I post about it on the internet, you can get some very skeptical answers. People say, "Oh, this is terrible." Do you have any idea what triggers this reaction? I'm convinced that this ultimately really empowers the artists, right? It gives you new tools. It's like, hey, we now have, I don't know, watercolors for Michelangelo; let's see what he does with it, and amazing things come out. It's a similar thing. But what triggers this strong reaction against it?
>> So I think it's something to do with the amount of control over the output. In the beginning, when we had these kinds of text-to-image models, they would be very much one-shot: you put in some text, you get an output, and people would say, oh, this is art, this is this thing I made. And I think that maybe rubs people from the creative community a little bit the wrong way, because most of the decisions that were made were made by the model, by the data that was used to train it.
>> You can't express yourself anymore, basically, right?
>> Yeah, exactly. As a creative person, you want to be able to express yourself. So I think as we make the models more controllable, a lot of these concerns, like "oh, the computer is doing everything," may go away. And the other thing is, I think there was a period of time where we were all so amazed by the images these models could create that we were pretty happy to just see this stuff come out of them. But I think humans get bored of this type of thing really fast. There was a big rush, and now if you see an image that was just a single prompt, where the person didn't think about it much, you can kind of tell: that's an AI-generated image, not that interesting. So I think there's still this boundary where now you need to be able to make interesting things with the AI tools, which is hard, but this will always be a requirement. We need someone to be able to do this. And I think
>> we still need artists.
>> We still need artists. And I think artists will also be able to recognize when people have actually put in a lot of control and intent
>> and still not be an artist.
>> Maybe. But there is a lot of craft and a lot of taste, right, that you accumulate sometimes over decades. And I don't think these models really have taste. So I think a lot of the reactions that you mentioned maybe also come from that. And so we do work with a lot of artists across all the modalities that we work with, image, video, music, because we really care about building the technology step by step with them; they really help us push the boundary of what's possible. A lot of people are really excited, but they really do bring a lot of their knowledge and expertise, kind of like 30 years of design knowledge. We just worked with Ross Lovegrove on fine-tuning a model on his sketches so that he can then create something new out of that, and then we designed an actual physical chair that we have a prototype of. And so there are a lot of people who want to bring the expertise they've built, and the rich language they use to describe their work, and have that dialogue with the model so that they can push their work to the frontier. And it doesn't happen in one prompt and two minutes. It does require a lot of that taste and human creation and craft that goes into building something that actually then becomes art.
>> At the end, it's still a tool that requires the human behind it to express the feelings and the emotions and the story and everything.
>> Yeah, absolutely. Absolutely.
>> And that's what resonates with you when you look at it, right? You will have a different reaction when you know there's a human behind it who has spent 30 years thinking about something and then poured that into a piece of art.
I think there's also a bit of this phenomenon where most people who consume creative content, maybe even the ones who care a lot about it, don't know what they're going to like next. You need someone who has a vision and can do something that's interesting and different, right? And then you show it to people and they go, "Oh, wow. That's amazing." But they wouldn't necessarily think of it on their own, right? So when we're optimizing these models, one thing we could do is optimize for the average preference of everybody. But I don't think you end up with interesting things by doing that. You end up with something that everyone kind of likes, but you don't end up with things that make people say, "Oh, wow. That's amazing. I'm going to change my whole perspective of art because I saw that."
>> There's the avant-garde edition of the model, if I can use that term, and then there's, I don't know, what's the other end of the spectrum, the marketing edition or so, where it's very predictable and
>> very straightforward.
>> Yeah. Well, since we're coming up on time, a last couple of questions. One is: what's one feature that you know the model is capable of that you wish people asked you about more?
>> Interleaved.
>> Yeah, interleaved. I think we've always been amazed that nobody ever posts anything about interleaved generation, which is what we call the model's ability to generate more than one image for a specific prompt. So you can ask for a story, like a bedtime story or something, and have it generate the same character over a series of images. And I think, yeah, people haven't really found it useful yet, or haven't discovered it. I don't know.
>> Oh, interesting. Well, if you're listening to the podcast, go try this out.
>> Try it.
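
For listeners who do want to try interleaved generation, the request is just a single prompt that asks for several images, and the response then alternates text and image parts. A minimal sketch with the same SDK as the earlier examples; the model name and the story prompt are placeholders.

```python
# Sketch of interleaved generation: one prompt, several images plus narration back.
# Model name and the story prompt are illustrative placeholders.
from google import genai

client = genai.Client()
resp = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=["Tell a four-part bedtime story about a small robot exploring a garden. "
              "Generate an illustration for each part, keeping the robot consistent."],
)

for i, part in enumerate(resp.candidates[0].content.parts):
    if part.text:                        # narration for this part of the story
        print(part.text)
    elif part.inline_data is not None:   # one of the story illustrations
        with open(f"story_{i}.png", "wb") as f:
            f.write(part.inline_data.data)
```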
>> Yeah. And what's the most exciting technical challenge that you look forward to tackling in the next, I don't know, months or years?
>> So, I think there's really a high ceiling in terms of quality for where we're going. People look at these images and say, "Oh, it's almost perfect, we must be done." And for a while we were in this cherry-picking phase, where everyone would pick their best images, so you look at those and they're great. But actually, what's more important now is the worst image. We're in a lemon-picking stage, because every model can cherry-pick images that look perfect. So now I think the real question is how expressible this model is, and what's the worst image you would get given what you're trying to do. I think by raising the quality of the worst image, we really open up the number of use cases for things we can do. There are all kinds of productivity use cases beyond these immediate creative tasks that we know the model can do, and I think that's the direction we're headed: if these models can do more things reasonably, then the use cases will be far greater.
>> So that's the moral equivalent of the monkeys on typewriters, basically: any model, given enough tries, will eventually make something amazing.
>> But the other way around is hard.
>> Yeah, the other way around is hard; one monkey writing a book would be very hard.
>> It would be a good monkey, for that one.
>> What are the applications you think will come out when we raise that lower bound?
>> So, the one I'm most interested in, we mentioned this before, is education and factuality. You know, I don't know how many times a month I want to use these models for creative purposes, but I have way more use cases for information seeking, factuality, learning, education-type use cases. So I think once that starts working, it'll open up all these new areas.
Amazing.
>> There's also something about taking more advantage of the model's context window. You can input a really large amount of content into these LLMs. And some companies, you mentioned a few before, will have 150-page brand guidelines on what you can and cannot do, and they're very precise: colors, fonts, and, like, the size of a Lego brick, maybe. And so being able to actually take that in and follow it to a T when you're doing generation, that's a whole new level of control that we just don't have today, making sure that you're actually following it to a T. I think that will build a lot of trust with very established brands. And rather than having a second creative compliance review model that then double-checks everything, the model should do it on its own, right? It should have this loop: okay, I generated this, but page 52 says I shouldn't have, so I'm going to go back and try again, and then two hours later it will come back to you with that respected.
>> Yeah.
>> And we saw with the text models how much this inference-time scaling can help, right, being able to critique your own work. Yep.
>> So this feels really important.
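
The self-checking behavior described here, generate, check the output against a long guideline document, and retry with the violations fed back in, can be sketched as a simple outer loop around two calls. The helper logic, model names, prompts, and guideline file below are assumptions for illustration, not a description of how any production system actually works.

```python
# Sketch of a generate -> critique -> retry loop against brand guidelines.
# Model names, prompts, and the guideline file are illustrative assumptions.
import io
from google import genai
from PIL import Image

client = genai.Client()
guidelines = open("brand_guidelines.txt").read()  # e.g. colors, fonts, logo rules

prompt = "A product hero image of the new water bottle on a beach at sunrise."
feedback = ""

for attempt in range(3):
    gen = client.models.generate_content(
        model="gemini-2.5-flash-image-preview",
        contents=[prompt + ("\nFix these issues: " + feedback if feedback else "")],
    )
    image = next(p.inline_data.data for p in gen.candidates[0].content.parts
                 if p.inline_data)  # assumes at least one image part came back

    # Second pass: a text/vision model critiques the image against the guidelines.
    review = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[Image.open(io.BytesIO(image)),
                  "List any violations of these brand guidelines, or reply OK:\n" + guidelines],
    )
    feedback = review.text.strip()
    if feedback.upper() == "OK":
        break

with open("approved.png", "wb") as f:
    f.write(image)
```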
>> Boy, an incredibly exciting future for image models.
>> Yes. And congrats on all the amazing work.
>> Thank you.
>> Thank you.
>> Thanks for having us.
>> Well, thank you so much for coming on
the pod.
[Music]