The "Nano Banana" of AI Video is Here!
By Theoretically Media
Summary
## Key takeaways
- **Kling 01: Nano Banana for Video**: Kling have released their 01 model, and you can kind of think of it as Nano Banana, but for video. Kling is touting 01 as the first unified multimodal video model. [00:08], [00:39]
- **Clown Bartender Swap via Natural Language**: Take an initial image reference of a woman at a bar, add "A clown is working behind the bar as a bartender. The woman orders a drink," and everything in the initial image remains the same, only now there's a clown. To swap in a specific clown, generate the clown image, load it, prompt "change the clown in video one," and get the same video output with the different clown. [02:36], [03:26]
- **Time Travel: Previous/Next Shots**: We can prompt for things that happen before or after our input video, like "based on video one, generate the previous shot" for a tracking shot of the man walking towards the blue box, or "generate the next shot" for the runaway bride making her getaway in a classic car. [05:23], [05:43]
- **Elements Library for Consistent Characters**: Create your own library of subjects, like characters, to continually re-reference for consistency, such as building Flamethrower Girl with full-body, close-up, and side-profile shots, then auto-describing her and swapping her into videos with "replace the character in video one with flame girl." The model handles multiple custom characters well. [07:47], [09:41]
- **Location Referencing Boosts Dynamics**: Providing an image reference and calling out "at image three as the location" in prompts, like standing on a busy Shanghai street, makes outputs more dynamic, with better color balancing and characters that actually feel like they're in the location, compared to no location image. [10:46], [11:11]
- **Synthetic Characters Hold Better**: Kling 01 is very good at character retention, but it holds a lot better with synthetic characters than real ones, like Planet Hell's Tom in a gray t-shirt, who is less likely to drift. [11:36], [12:07]
Topics Covered
- Videos Evolve Beyond Input Frames
- Swap Characters Without Reshooting
- Generate Preceding or Following Shots
- Build Persistent Character Libraries
- Restyle and Extend Existing Videos
Full Transcript
Some pretty massive news coming from the world of Kling: we have a new model from them, and this isn't just a version update. This is a whole new thing. Kling have released their 01 model, and you can kind of think of it as Nano Banana, but for video. Today I've got a full walkthrough of Kling 01, and I've got to say, there's a lot you can do with this. It's pretty exciting. All right, let's hop in.
So, just as a quick FYI, I did partner with Kling for early access for this video. But overall, think of it as a first look and tutorial of all the things you can do with 01, because, again, it's a lot. Kling is touting 01 as the first unified multimodal video model. Again, think of it as Nano Banana, but for video. That said, it does also handle images; we'll touch on that as we go through this video. Overall, the idea with 01 is that it does all the things: text-to-video, image-to-video, inpainting and outpainting, stylization, transformation. On and on it goes. And it does so with a semantic understanding of just about everything you throw at it.
Now, the UI does differ a bit from old-school Kling, which is all still there. In order to use the 01 model, you'll just want to make sure that you're on this tab, which brings you into 01. This first area here contains a history of your recent outputs, down here is your prompt area, and then on this side are your outputs. Moving down to the prompt area, we have a couple of different options. We can generate video or images here, and we can provide reference images or video elements, which we're going to get into in just a little bit. We also have options for aspect ratio (9:16, 1:1, 16:9), duration, and output amounts. There are also modules at the top for elements, transformation, video reference, and frames, all of which we will be getting into.
So, at baseline, we're going to start very simple and ramp our way up. We can take an image like this and give it the prompt to create a video of the woman in @image 1. (Every time you upload something, it creates a tag for it.) The direction: just entering the location and taking a seat at the bar. Generating this, we end up with an output like this. And what's important to note here is that this isn't traditional image-to-video, in that our first frame is not our image. Actually, our first frame starts in a completely different spot, showcasing an area of the location that does not exist in that initial image input. But obviously, we're following directions.
Now, as a note on that output: no native audio generation here yet. Keyword being yet. But here's where things start to get really interesting: we can take that initial prompt and add something like, "A clown is working behind the bar as a bartender. The woman orders a drink." Now, why is there a clown working as a bartender? I don't know. Maybe he's working his way through clown college. The important thing here is that everything in our initial image reference remains the same. Only now we have a clown.
Now, the question I'm sure many of you are asking right now: what if we want a very specific clown? Well, yes, we can do that, by heading over to the image generation model. Switching this off of video and over to image, and just prompting for a full-body photorealistic image of a clown at 9:16, we end up with this guy. Overall, I've got to say, no notes here. So, to swap him into the video, all we have to do is hit this "go to 01 to create" option, which reloads our video into the prompt box. And with very simple natural language, we can just prompt: change the clown in @video 1. Then we load in our clown, selecting from history the clown in @image 1. I hit generate, and there you go: same video output, different clown. Still can't pour a Manhattan to save his life, though.
life, though. Now, I do want to quickly note, again, going back to this multimodal idea, you don't have to use image inputs. We can also use video
image inputs. We can also use video inputs. For example, taking this well
inputs. For example, taking this well real stock footage drone shot of Dodger Stadium, bringing that in and issuing the prompt to change it to sunset. we
end up with this as an output which is kind of crazy good. I mean everything in our initial video input is still here.
It's just I get the the time of day has now changed. So actually like one really
now changed. So actually like one really great use case here is if you have some old VO3 outputs from back when we were all putting text on the first frame to provide direction. Uh such as our period
provide direction. Uh such as our period piece of two romantic lovers who have just discovered that they are living in a simulation.
>> These last few days with you have felt like I'm living in a simulation.
>> I feel the same.
So now we can simply say: remove the text and red neon boxes in @video 1, and we end up with this as a result.
>> These last few days with you have felt like I'm living in a simulation.
>> I feel the same. If we're prompts, I don't care. I want to marry you.
Speaking of removals, Plasmo takes this surrealist piece, and I'm not exactly sure what's going on in it. It does look pretty cool, though. The thing is, this hand that comes out wasn't intended. So now, utilizing the 01 model, Plasmo can remove the hand, and the output comes out as intended. We can also now change shot composition and even camera movement. For example, in this output from Ludovic the creator, where we have this forlorn guy staring at the sea, probably wondering where his pirate lady love has vanished to, we can now change it to a crane overhead shot, which actually starts to feel a little bit darker. Don't do it, old-timer. I'm sure she's going to pull into port any day.
Now, an additional use case, and this one is pretty insane: we can prompt for things that happen before or after our input video. Yeah, you heard that right. Kind of like time traveling. Kind of. So, given that is a theme, I had to take this shot of not-Doctor-Who entering, well, I mean, it's not the TARDIS. It's like the TARDIS you would get if you ordered it from Wish.com. But you can issue the prompt based on @video 1 (essentially your input video), generate the previous shot, and then give it some direction; in this case, a tracking shot of the man walking down the street towards the blue box. You end up with results like this, or like this, which obviously represent what happens before that input video. Or, as a bit of a throwback, we can take our woman in the red dress as the runaway bride from her wedding to the man in the green tuxedo, and then issue the prompt: based on @video 1, generate the next shot. Apparently, it's our woman in the red dress making her getaway in, well, it looks like a pretty classic car there. Now, to be fair, this isn't exactly one-to-one. This isn't acting like first frame, last frame.
Now, this does, I don't want to say require, but it does need you to tell it what the next shot is. Left to its own devices, it'll generate something, as it did here where I left that blank and just said generate the next shot. We ended up with what I presume is a much happier ending for our groom. And of course, in typical AI craziness, if you aren't specific with your prompt, you could end up with an output like this, in which our runaway bride is getting into the car, which is still inside the chapel. I presume that if we generate the next scene, it's going to be the car busting through that wall. And once again, the model can do first frame, last frame as well, as showcased here by Gcomo Malamasi. It actually has a few tricks up its sleeve: in another example by the same creator, we can see our opening frame here, our ending frame here, and then this kind of special-effect frame in the middle, sort of treating it like first, middle, and last frame. And again, that semantic understanding of references is really where this model shines. For example, here are a few input images by Mad Pencil. Putting the four of these together ends up with a video output like this, I guess kind of One Piece-ish, rock solid all the way through. And importantly, here again, all references are accounted for.
Now, if we have some subjects that we know we want to continually re-reference, and we want to kick the consistency up a notch, well, here is the feature that I'm probably most excited about, because we can now create our own library. If we head down to this elements button and hit it, we'll discover a number of presets in the library. But if you head over to "my subjects," you'll be able to create your own subject that you can essentially add to any prompt. So in this case, going with this kind of animated look, we can tag it as a character, an animal, an item, a costume, etc. In this case, obviously, a character. From here we can actually bring in additional reference images, so it is obviously very handy to have the 01 image model on hand to create different reference looks. From there, I ended up generating another character, and a location for them to be hanging out in, and was then able to generate this animated sequence pretty quickly. Again, no sound or dialogue here, but you should just visually be able to tell everything that's happening. And the important part is: the characters are staying consistent, the location is staying consistent, everything feels pretty solid. Of course, I ended up training up a version of myself, and this one looks pretty rock solid. A little cranky, but AI always seems to be a bit on the cranky side. So, as far as generating a pretty solid character model, we're going to use channel fan favorite Flamethrower Girl. Flamethrower Girl fans, you're going to be pretty happy with the rest of this video. She appears quite a bit.
So, just bringing that in and loading her in as an image reference, I was able to very quickly generate a full-body shot of her, simply with the prompt: create a full body shot of the woman in our reference image standing against a neutral studio backdrop. Of course, we did have the flamethrower there, and because I knew baking that into the reference could cause issues down the line, I simply ran it again with remove the flamethrower. From there, I ended up creating a close-up of her face as well as a side profile shot. Once you have everything loaded in (I actually don't have to do this, because I've already created her), you simply name her, and then there's a description down here. You don't have to get too crazy about this; in fact, you can just hit the auto button, and it does a pretty good job of describing your character. After that's done, you simply hit generate. So now we can take this, I guess, kind of Joan of Arc inspired output that we ran a little while back, and have Flamethrower Girl time travel back to the medieval era, simply by issuing the prompt: replace the character in @video 1 with flame girl. Now, one thing I do have to note is that sometimes, when you're just doing a character swap like this, the colors do seem a little bit off, so you might have to bring it in and do some light color correction. It doesn't always happen; in fact, we're going to take a look at a sort of solution for that in just a minute.
The model does seem to handle multiple characters from your library pretty well. And since I'm already there, I get to hang out with Flamethrower Girl; I guess we're on the case to find her missing flamethrower. I've got to say, I am really impressed with how good the character modeling is here, even though this is kind of a weirder output. This was supposed to be us discovering a mysterious symbol written on the wall. I mean, that really does look like Flamethrower Girl. That really does look like me. The whole thing kind of comes off like we're the oddest couple on The Amazing Race ever.
Now, one thing I did discover that I think plays a pretty big part is location referencing. For example, in this prompt, I did not provide a location image reference; I just said that we are standing on a busy Shanghai street. However, by providing this image as an image reference and then rerunning that same prompt, only this time calling out things like @image 3 as the location, we end up with this as an output, which does look a lot more dynamic. Again: color balancing, everything. The characters actually feel like they're in the location. And across the board, the camera does seem to be a lot more dynamic as well when you're image-referencing a location. Overall, I highly recommend image-referencing a location in your video outputs.
Now, one thing I did want to note in terms of character retention: Kling 01 is very good, but it will occasionally do things like, well, I kind of feel like my face gets a little bit on the squished side here. That said, overall, I do find it to be a lot better than Sora's Cameo. It does seem to hold a lot better with synthetic characters, for whatever reason. So, I ended up taking this guy, who was our guy from Planet Hell, the AI short trailer I did fairly recently. I took him out of the orange jumpsuit he was in and just put him into a gray t-shirt, then built out a model of him for the library. We named him Tom. And in general, as I ran him, I found that he was less likely to drift. So, I don't know if the model is just better at consistency with synthetically generated characters, or it's the fact that Tom there is way more handsome than I am. He's not real, but he is more handsome. That's something I will continue experimenting with. Hey, look, our girl got her flamethrower back. Yeah, the model is actually very good at doing these handheld tracking running shots. That's really good. An alternate version of that prompt was this one, which is my favorite, where Tom is just about halfway through the action sequence, like: you know what? I don't like cardio. You go on, I'll catch up.
What's really exciting about this is that I feel like we're barely scratching the surface of what this model is capable of, so let's push things just a bit further with this Sora animated output.
>> What happened to the world? I don't know. But whatever did this, it's still out there.
I brought that into the 01 image model and had it turn it into live-action. It did miss some stuff, like the backpack here, and this guy's outfit still looks a little bit on the animated side, but I actually kind of like that as sort of a weird hybrid look. So, restylizing that output, we ended up with this, which actually holds together pretty well. But at the end, the giant Cthulhu monster up there wasn't that impressive to me. So I was really curious to see if Kling would be able to handle this image not as a last frame, but rather as a reference to something that appears after the camera has tilted up. To be honest, I really wasn't sure if it was going to work. But ultimately, Kling 01 ended up coming through for me in a pretty big way. So yeah, overall, this is a hugely exciting model, and it is only launch day. As the old saying goes, this is the worst it'll ever be. I'm really excited to see what you all discover about Kling 01. Do let me know any discoveries that you make in the comments. In the meantime, I thank you for watching. My name is Tim.