Inside NotebookLM with Raiza Martin and Steven Johnson
By Google DeepMind
Summary
## Key takeaways

- **AI can make any content interesting**: NotebookLM's Audio Overview feature uses Gemini and advanced audio generation to make even mundane or complex content, like a PhD thesis or a document filled with repetitive words, engaging by presenting it as a conversation between two hosts. This addresses the challenge of making information interesting, a task previously difficult for computers. (01:45, 20:40)
- **Audio Overviews tap into ancient learning methods**: Humanity has learned through conversation for millennia, making the conversational format of Audio Overviews a powerful and resonant way to consume information. This approach connects with a deep, ancestral part of us, making it more engaging than simply reading text. (07:17, 45:02)
- **NotebookLM prioritizes source grounding for accuracy**: Unlike general AI models, NotebookLM is built on the principle of 'source grounding,' where it draws exclusively from user-uploaded documents. This approach significantly reduces hallucinations and ensures that all generated content is fact-checkable against the provided material. (10:53, 11:46)
- **User-uploaded data remains private**: Information uploaded to NotebookLM is treated as part of the model's short-term memory (context window) and disappears once a session is closed. This ensures user data is private and not used to train the general model, allowing for the secure input of personal or sensitive information. (14:43, 39:20)
- **Future NotebookLM to include diverse voices & interaction**: Future developments for NotebookLM aim to expand capabilities with more voices, languages, and personas for the AI hosts. There are also plans to allow users to interact with and even interrupt the hosts, creating a more dynamic and collaborative content experience. (26:46, 40:18)
Topics Covered
- Source grounding creates a personalized, hallucination-resistant AI expert.
- AI finds 'interestingness' by detecting controlled surprise.
- AI voices need 'disfluencies' to sound truly human.
- Future AI hosts will debate ideas from different perspectives.
- AI creates hyper-niche podcasts for an audience of one.
Full Transcript
[MUSIC PLAYING]
HANNAH FRY: Welcome to "Google DeepMind-- the Podcast."
I'm Professor Hannah Fry.
Now, I want to start today, unusually,
perhaps with a clip from another podcast.
Listen to this.
[AUDIO PLAYBACK]
- What's the overall message here?
Is it social commentary, artistic expression, or just
a really elaborate joke?
- That's the beauty of this piece, I think.
It defies easy categorization.
It exists in this liminal space between language
and non-language, between art and absurdity.
[END PLAYBACK]
HANNAH FRY: This is a very interesting discussion,
which, as you might have guessed, is AI generated.
But what is notable about this particular clip,
aside from the fact that neither of the two podcast hosts
has ever existed, is that their conversation,
a mini treatise on human nature and our relationship with art,
was generated from the most unusual of prompts.
The podcast itself was created by a new feature
called Audio Overview, part of NotebookLM,
a personalized AI research assistant from Google Labs.
Now, NotebookLM is powered by Gemini,
and it lets you upload your sources, anything from PDFs
to videos to generate insights, explanations and, of course,
podcasts.
We often think of AI as just crunching through data
and spitting out answers, but NotebookLM
draws on expertise from storytelling
to present information in an engaging way.
And we wanted to see what happens
when you ask NotebookLM to analyze what most people would
consider to be nonsense, a single document containing
just two words repeated a thousand times over--
"cabbage" and "puddle."
And here is the result.
[AUDIO PLAYBACK]
- So I have to admit, at first, I was like,
what is going on here?
But the more I think about it, [LAUGHS] the more curious I get.
- It is fascinating, isn't it?
We're like, dealing with this one-piece puzzle.
- Right.
- And we're trying to figure out, well,
what does this piece tell us?
What do you think?
What's your first impression?
- Honestly, it's almost like hypnotic or something.
If you were really staring into a puddle and all
you saw were these cabbages floating around--
- I can see it.
- --it's a little unsettling but also kind of funny.
[END PLAYBACK]
HANNAH FRY: Several minutes of intellectual analysis
packed to the brim with seemingly relevant ideas that
are nowhere in the original document.
It's actually quite impressive, really.
I am joined today by two people who are deeply involved
in writing the NotebookLM story.
Joining us from San Francisco is Steven Johnson,
NotebookLM's editorial director and also a "New York Times"
best-selling author.
And in Mountain View, California,
Raiza Martin is a senior product manager for AI at Google Labs
who leads the team behind NotebookLM.
Welcome to the podcast, both of you.
Now, I want to start with the feature
that everybody's been talking about, this Audio Overview.
And well, I understand that you've got a little clip
that you want to play me.
STEVEN JOHNSON: Yes, let's play the clip.
I think you will enjoy this, Hannah.
HANNAH FRY: OK, here we go.
[AUDIO PLAYBACK]
- Welcome back, everyone.
Ready for another deep dive?
Today, we're shrinking down, way down.
- Microscopic, you might say?
- Exactly.
Think about those tiny little droplets of water,
you know, like the ones you see on a freshly washed car.
- Oh yeah.
- But imagine those droplets clinging to an airplane wing.
HANNAH FRY: [LAUGHS] Oh my gosh.
- Or on a plant leaf.
- Right.
Being sprayed with pesticides.
The way those droplets behave--
HANNAH FRY: Oh, you guys.
- --is actually incredibly important for all kinds
of things--
- It is?
- Making planes safer.
- More efficient farming.
- Even figuring out how rain forms.
- Wow, that's fascinating.
- We're diving into some serious research today.
[END PLAYBACK]
HANNAH FRY: That was my PhD, the first page of my PhD thesis.
Extraordinary.
I mean, frankly, there is no good stuff in there,
apart from heavy math equations.
OK.
Lots of things to notice about that.
For starters, they made it sound much more
exciting than it actually is.
STEVEN JOHNSON: That's the point.
[LAUGHS]
HANNAH FRY: But also, though, the sort of back and forth.
I mean, the two voices, they were finishing
each other's sentences.
It felt very fluid, pardon the pun, very natural.
STEVEN JOHNSON: Imagine defending your dissertation now.
You could just play the podcast and leave it at that, I think,
if you'd only had that at your disposal back then.
HANNAH FRY: Raiza, have you been surprised by people's reaction
to this?
Because it's had really quite serious uptake, hasn't it?
RAIZA MARTIN: Yes, and I think the most surprising thing to me,
and, really, equally delightful, is how people are using it.
I think I imagined how they might.
But I think the beautiful thing about launching something
with this much sort of excitement around it
is you see a whole new universe of what everybody has been
trying, from things that are funny,
things that are entertaining, things that are inspiring
or really meaningful.
It's just been incredible.
I actually probably spend a good chunk
of my day, a third of my day, just listening to these.
[LAUGHS]
STEVEN JOHNSON: [LAUGHS] Really?
HANNAH FRY: You set up a Discord server, didn't you,
just to let people share stories about the ways
that they're using it?
What kind of things have come up?
STEVEN JOHNSON: So that was an interesting example,
playing your dissertation, because one of the things that I
think genuinely surprised us is people
would put their CVs and their resumes in there,
and it was almost like a little like hype machine.
If you were feeling down about yourself,
you would listen to like a 10-minute audio conversation
between two very enthusiastic hosts.
You're like, wow, Steven has really done a lot in his career.
It's very impressive.
But actually, a more serious version of that--
I mean, that's kind of fun and playful,
but people are using it, like, you can kind of workshop
things you're working on.
So you can upload a short story you're working on
and say, hey, give me some constructive criticism on this.
And you listen to people talking about your work
and they're very good at pulling out
the kind of interesting twists or focusing
on the characters that are particularly compelling or not.
And so it's a way of getting a little--
it's almost like a little focus group for stuff
that you're working on, which is really amazing.
HANNAH FRY: I guess also hearing people actually
talk about it out loud adds that kind of extra layer of, I don't
know, objectivity almost, Raiza.
RAIZA MARTIN: I would say it's been really surprising
because if we think about it, a lot of the content or content
generation, if you just render it in text, it's not new.
It's like if I upload my CV and then I
have an LLM spit out something that
says, like, oh, here's Raiza's career,
a summary of sorts--
maybe there's a few interesting tidbits
that it pulls out here and there--
that was novel two years ago, and everybody
was excited by that.
But I think adding that new layer or that new modality
of just very human-like voices, I
think it connects with people in a very different way.
Personally, I call this type of technology human-like, where
you recognize it as being very similar to you
and it resonates with you in a different way as a result.
And I think the first time I listened to my CV,
I knew what to expect.
But when I heard it, I still felt that bubble inside of me,
like the woo!
[LAUGHS] And I think that's the magic of new modalities.
STEVEN JOHNSON: I think the other point on this
is that human beings have been learning and exchanging
information through conversation for hundreds of thousands
of years.
We've been learning by reading structured text
on a page for 500 years, and structured text
on a screen for 30 years.
And so when you activate that sense of a genuine human-like
conversation, it's just a deep, ancient kind of ancestral part
of who we are that--
I think that's one of the reasons why it just
lights up people when they hear it for the first time.
HANNAH FRY: Also interesting, I think
that you decided to have two hosts rather
than just one person sort of talking into space, as it were,
which--
I guess it speaks to the point that you're making, Steven.
STEVEN JOHNSON: Yeah, it's just a very different format.
If you just have one person, it feels like text to speech,
right?
We've heard text to speech before.
You're just like, the computer is turning the text that it just
wrote into something I can listen to, which is great.
And we're interested in trying to figure out ways we
can do that in other formats.
But to get the conversation right--
and we can dive into this in more detail--
there are all these subtle things
that you have to make work.
Nobody wants to listen to two robots talk to each other.
That will fail and be unlistenable after 30 seconds.
You have to master all these very subtle, weird things
that people do in conversation for it to work.
HANNAH FRY: To make it human-like, exactly as you said.
Raiza, I want to come back to those features
a little bit later, to the Audio Overview,
because I also wanted to discuss the origins of NotebookLM.
How did it come about, Raiza?
RAIZA MARTIN: For one, I think a lot of people
think that NotebookLM is new because of the Audio Overview
feature.
We had such a massive influx of people and people
were like, wow, what is this?
A brand-new thing from Google.
But actually, we've been working on NotebookLM for over a year.
We first announced it at Google I/O
last year, as Project Tailwind.
And before then, we actually had been incubating it inside
of Google Labs.
And it's actually how Steven and I met.
Steven was brought in.
What was your original title, Steven?
STEVEN JOHNSON: I was visiting scholar.
RAIZA MARTIN: Yeah.
[LAUGHS]
STEVEN JOHNSON: Yes.
Then I became editorial director.
RAIZA MARTIN: That's right, he was promoted.
And at the time, Josh Woodward, who now leads Google Labs--
he's the vice president--
told me, he was like, I want you to build a new AI business.
And I thought to myself, what does it
take to actually do that?
But what I'll say is, one of my early inspirations
was just watching Steven work.
Honestly, just understanding how he does, what he does,
I was like, wow, that could be a real superpower if you
could give that to people.
STEVEN JOHNSON: It was a mix of "Steven
is abnormal in his research habits" and "maybe we could
turn this into a mainstream pursuit somehow."
Yeah, it was interesting because I had had this long history
writing books, and Josh had read some of those books
and had read some things that I was writing about,
tools for thought, basically, like how do you use software
to help you think and help you develop your ideas and research?
This is the middle of 2022, so language models
were at the top of the list then.
And so he kind of reached out to me and said,
hey, any chance you would want to come to Google
and help build the tool that you have always
wanted to help people learn and organize their ideas,
now built on top of language models?
And what Raiza and I-- kind of right from the beginning--
I think I met Raiza day two at Google.
We were like, let's build something new.
HANNAH FRY: This came about at a time
when large language models were at the top of the agenda.
In those early conversations, how
did you see this as being fundamentally different to just,
I don't know, like uploading a document on Gemini
and getting it to summarize it for you?
STEVEN JOHNSON: From the very beginning--
we call it source grounding.
That's the way we describe it.
You supply the source information
that you want to work with.
It might be the story you're writing.
It might be the book you're researching.
It might be your journals.
It might be the marketing documents you're working on.
And uploading that to the model then creates
a kind of personalized AI that is an expert in the information
that you care about.
And that was not-- no one was talking about that
in the middle of 2022.
So that was like the first thing we built, was like--
I mean, we uploaded part of one of my books,
and I could have this very crude conversation with the model that
was not at all like what you see now in text or with audio.
But you could get a little taste of what
it would be like to have all the ideas you were working with
instead of just talking to an open-ended model that just
had its general knowledge, actually have
that personalized knowledge.
And it was great because it also reduced hallucinations.
It made it more factual.
You could fact-check it.
You could go back and see the original source material.
That's a big part of the whole NotebookLM experience.
That was the beginning of it.
And everything we've done is built on that platform,
and Audio Overviews is just, OK, take that insight of,
I supply my sources and now, I turn it into something else.
In this case, it's an audio conversation.
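The source-grounding idea Steven describes can be pictured with a short sketch. The snippet below is purely illustrative, not NotebookLM's implementation: the sample passages and the names build_grounded_prompt and check_citations are invented for this example. It shows the two halves of the idea, restricting the model to supplied sources and mapping inline citations back to the passages so answers stay fact-checkable.

```python
# A minimal sketch of source grounding, assuming an invented passage
# format; not NotebookLM's actual implementation.
import re

sources = {
    "S1": "Droplet behavior on wings matters for aircraft icing and safety.",
    "S2": "Spray droplet size determines how evenly pesticides coat a leaf.",
}

def build_grounded_prompt(question: str) -> str:
    """Restrict the model to the supplied passages and demand citations."""
    passages = "\n".join(f"[{pid}] {text}" for pid, text in sources.items())
    return (
        "Answer ONLY from the passages below. Cite passage IDs like [S1] "
        "after each claim. If the answer isn't in the passages, say so.\n\n"
        f"{passages}\n\nQuestion: {question}"
    )

def check_citations(answer: str) -> list[str]:
    """Map inline citations back to the original passages for fact-checking."""
    return [sources[pid] for pid in re.findall(r"\[(S\d+)\]", answer)
            if pid in sources]

print(build_grounded_prompt("Why do droplets matter for aviation?"))
print(check_citations("Droplets affect icing [S1]."))
```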
HANNAH FRY: Because I guess the real key difference
here is that it's very focused on the sources
that you're giving it and anything
that's connected to that rather than just as you
say, this general model.
RAIZA MARTIN: Yeah, I think that--
I'll say, too, that what we've seen
is I think it's a little bit harder to get started
with this paradigm because it's so new.
The idea that one, you're talking to an AI, two,
you have to bring your own stuff.
So I think there's a little bit of a layer where
it's like, OK, you have to convince somebody
that it's worth doing.
But once you can get somebody over that hump,
it's just massively useful because--
I think about the work that I do every day,
the work Steven does every day, and many people
around the world that work on computers every day,
we are working with very specific sets of information,
shared contexts that we have with others.
We do research, we pull it in, we
want to extract our own insights from it.
I think that's what makes NotebookLM really special
and has made it special from the beginning.
HANNAH FRY: So it does include these text elements too then,
because as you say, the podcast part
is the bit that's most notable.
RAIZA MARTIN: That's right.
So the podcast thing is the most recent development
in NotebookLM, but we actually launched a year ago,
where it was primarily a chat feature.
So you're chatting with the system using your sources,
and it's always referencing back to exactly what pieces
of your content that it used.
HANNAH FRY: So give me some more mundane examples
of how people are using this, like on a day-to-day level
then, Steven.
STEVEN JOHNSON: Yeah, so we actually
see a huge amount of usage of the product
just with the text features.
And suddenly, you have this amazing resource
that can answer any question about all-- hundreds of pages
of documents.
And in the text version, you get citations and everything.
It's a very scholarly thing, actually,
and you would appreciate it.
You get your answers back and every fact that the model says
has a little inline footnote, and you
can click directly on that footnote
and go and read the original passage.
Writers, journalists, obviously, are using it.
This comes a little bit out of my involvement in the project.
I have one notebook that has thousands and thousands
of quotes from books that I've read over the years,
plus a lot of the text of books that I've written.
And that notebook has basically like my brain
kind of captured in the AI.
And so whenever I work on anything, I
have a new idea for something, I'll go into that notebook
and be like, hey, what do you think about this idea?
And the AI will say, hey, Steven,
you read something related to that seven years ago.
What about this passage?
And so it's a true extension of my memory, so that kind
of stuff.
And the other thing, last thing I'll say,
is we're not training the model on this information.
So your information is secure, it's private.
It's not going to get into the general knowledge of the model
and be used by somebody else.
So you can put private information in there.
And when you put a couple of years of your journal
in a large-context model like this,
you can get these amazing insights
and you can turn them into audio overviews
and listen to two people talk about yourself.
Or you can just be like, what was I thinking about last May?
Give me an overview of all the stuff that was going on.
And 20 seconds later, you'll have this amazing kind
of document of your own life.
HANNAH FRY: Rather than just recalling stuff,
they can actually be insightful, in terms of your own journals.
RAIZA MARTIN: I would say yes, because I've
used it for that purpose.
And one of the things that I like to ask it after uploading--
I do these weekly journals-- is, how much have I
changed over time?
And it's really remarkable.
It's been able to pull out really interesting nuances for me
that I haven't been able to observe about myself.
It's been able to say things like hey,
you tend to associate a lot of negativity
with this particular topic.
You associate a lot of positivity with this topic.
And it's just really interesting because I think, to your earlier
question around the mundane use cases,
I think we see a lot more of those,
which is just people trying to take the work that they're
doing every day.
For example, sales teams use this a lot
to share knowledge with each other.
Makes a lot of sense.
There's a lot of technical, complex changing documentation,
so it's really nice to have an AI partner.
I think that's really different from how a lot of AI systems
work today, right?
I use everything.
I use everything that's out there,
and the prompts that I write are massive.
The first thing that I write is, you are a blah.
This is what we are doing.
Here are the documents that are relevant.
And I think for NotebookLM, this just shortcuts it.
It's just a project space.
It knows what you're talking about.
You can have a conversation forever.
It takes up to 25 million words.
It's just contextually quite massive.
STEVEN JOHNSON: I think one of the things that was interesting
and maybe a little bit distinctive about it
was so many of the questions about what makes this product
work or not work are not so much technological questions
as they are editorial stylistic questions.
Like, what is the right kind of answer when you get
an audio overview that works?
What's the style?
What's the house style for those conversations?
What level should they be pitched at?
And those are not technological questions.
Those are language questions.
And that's the crazy reality of the language-model age,
is that all these things that used to be just mostly
a question of, let's get the programming right now,
become more about the rhetoric of it all.
HANNAH FRY: Well, actually, I want
to dig into some of the house style a little bit more,
I guess.
Why did you decide to go into Audio Overview?
What was it that inspired that?
I mean, there are already quite a lot of podcasts.
Let's be honest.
STEVEN JOHNSON: When Audio Overviews really began--
it was a great example of the lab's structure,
I think, really working well because it
was another small team inside of Labs
that was just kind of focused on the audio version of this.
And part of the idea of it was not
so much to compete with podcasts but rather
that there was a whole universe of content
that you would never-- the economics of generating
a podcast for it would never make any sense.
But if you could generate one automatically,
you might have five people that would want to listen to it,
or one person who would want to listen to it, or 20 people,
but not 200,000.
And so we want to create a podcast
based on our team meetings from the last week
so we can review them.
That's not going to be a commercial business.
No one's going to ask you to host that,
but actually, it might be useful for that team.
And so they had started developing this thing,
and Raiza and I heard it probably
in March or April of this year.
And like everyone who's heard an Audio Overview,
initially, we were just like, wow, what did I just hear?
That was amazing.
But we realized pretty early on that part of our mission
with NotebookLM was to build a tool that
helps people understand things.
And suddenly, we were like, oh wait, people really
understand, and remember, and pay attention
when they hear something in the form of an engaging conversation
between two smart people.
We released it internally to Googlers over the summer,
and that was, I think, when we started
to think this is going to be a hit,
because you could just see the delight that people had with it.
So while we were surprised that it
went quite as crazy as it did, we knew we were on to something.
HANNAH FRY: Now, I remember last season,
we got to hear a demo of WaveNet, which, of course, is
one of the first AI models to generate this human-like speech.
And it was quite impressive back then,
but I mean, presumably, there have
been technological advancements that have happened since, that
have been necessary to make something like Audio Overview
possible.
RAIZA MARTIN: I think the underlying model for NotebookLM
is Gemini 1.5 Pro, and that just creates, really,
to me, incredible content.
The voice models, the audio model that we use, that,
by itself, is a breakthrough.
And I think that's what you're talking about,
which is the realism of the human voice,
the human-like voices that we hear,
and pair that with the approach that we've taken--
and Steven can speak more to this, too--
of editorializing the content to thinking about,
how do we create something really useful
and really fun for you that's engaging?
STEVEN JOHNSON: Yeah, that's a great segue actually,
to something I was going to say, which is about interestingness.
So Simon, who's one of the leads on the audio side,
he sometimes has a slogan for Audio Overviews, which
is make anything interesting.
[LAUGHS]
So, like, whatever, make your dissertation interesting.
I'm sure it was interesting.
HANNAH FRY: It wasn't.
[LAUGHS]
STEVEN JOHNSON: And so it's a great example
of a convergence of three different technologies
or breakthroughs that make something magical happen.
Gemini itself, and it can do this with text as well,
is incredibly good at pulling out
interesting facts or ideas or stories
from the material you give it.
So I do this all the time.
I upload something new and say, tell me
the most interesting things from this, just in text.
Computers could never do that before.
You couldn't Command-F for interestingness.
This was not a search query you could do.
HANNAH FRY: But how are you defining it even?
I mean, what does it mean?
STEVEN JOHNSON: I believe that it comes out
of the basic idea behind language models, which
is that they're predictive.
They're like, given this string of text,
I expect the next thing to happen.
And so what interestingness is, is a kind of controlled surprise.
I thought this was going to be the case,
but actually, there's some new information here
that I wasn't expecting.
And so it makes sense, in a way, that the language models
would be good at this because their basic circuitry is
prediction.
And so they're looking through all this information.
Given their training data, what in this information
is novel or--
HANNAH FRY: Surprising.
STEVEN JOHNSON: --defies their expectations.
So it's very good at that.
So that's an underlying Gemini thing, right?
And the hosts of the show are instructed
to find the interesting material and present it
to the user in an engaging way.
So that's one capability.
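Steven's "controlled surprise" maps neatly onto the standard notion of surprisal, -log p of a word under a predictive model: the less expected the text, the higher it scores. The toy sketch below uses a unigram model as a stand-in for a real language model; it illustrates the idea only, and is not how Gemini actually scores material.

```python
# A toy sketch of "interestingness as controlled surprise": text the
# model assigns low probability has high surprisal, -log2 p(word).
# A unigram model stands in for a real predictive model here.
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
total = sum(counts.values())
vocab = len(counts)

def surprisal(word: str) -> float:
    """-log2 p(word), with add-one smoothing so unseen words score high."""
    p = (counts[word] + 1) / (total + vocab + 1)
    return -math.log2(p)

def avg_surprisal(sentence: str) -> float:
    words = sentence.lower().split()
    return sum(surprisal(w) for w in words) / len(words)

# The sentence full of unexpected words ranks as more "interesting".
for s in ["the cat sat on the mat", "the octopus juggled flaming mats"]:
    print(f"{avg_surprisal(s):5.2f}  {s}")
```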
The second thing that is really cool about this
is that the instructions take the script that is generated,
and they add noise to the script.
So they add what are called disfluencies,
so all the stammers, and the "likes,"
and the interjections that humans actually have when they
speak.
And it turns out you need that because if you
don't have that noise, it sounds too robotic.
HANNAH FRY: Mhm.
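As a rough illustration of the disfluency step Steven describes, here is a minimal sketch that injects fillers into a clean script line at random points. The FILLERS list, the rate parameter, and the add_disfluencies function are all invented for the example; the actual Audio Overviews pipeline is not public.

```python
# A minimal sketch: add the small "noise" of human speech to a script
# line. Illustrative only; not the real Audio Overviews pipeline.
import random

FILLERS = ["um,", "you know,", "like,", "I mean,"]

def add_disfluencies(line: str, rate: float = 0.15, seed: int = 0) -> str:
    """Insert a filler before each word with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for word in line.split():
        if rng.random() < rate:
            out.append(rng.choice(FILLERS))
        out.append(word)
    return " ".join(out)

clean = "Droplet behavior on a wing is incredibly important for safety."
print(add_disfluencies(clean))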
STEVEN JOHNSON: And then finally, there's
the audio voices themselves, and what
they do is all these subtle things, like in English,
speakers will raise their voice a little bit if they're not
sure about what they're saying.
Or for emphasis, they will slow down what they're saying.
All these things that we do natively,
we never even think about it.
But no computer could do that until now,
and that's the part of it that just like lights up.
And that's the underlying vocal model,
the audio model, that didn't exist a year ago.
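The prosodic cues Steven mentions, rising pitch for uncertainty and slower delivery for emphasis, are the kind of thing speech markup such as W3C SSML makes explicit. A hedged sketch: the helper functions below are invented for illustration, and this shows how such cues can be written down in general, not how Google's audio model is driven internally.

```python
# Illustrative only: prosody cues expressed in W3C SSML markup.
# The helpers emphasize() and hedge() are invented for this example.

def emphasize(text: str) -> str:
    """Slow the delivery, as a speaker does for emphasis."""
    return f'<prosody rate="slow">{text}</prosody>'

def hedge(text: str) -> str:
    """Raise the pitch slightly, as a speaker does when unsure."""
    return f'<prosody pitch="+10%">{text}</prosody>'

line = (
    "Those droplets are " + emphasize("incredibly important") + ", and "
    + hedge("they might even shape how rain forms") + "."
)
print(f"<speak>{line}</speak>")
```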
HANNAH FRY: It's the voice modulation, right?
It's like when you--
I remember years ago at the BBC, being taught
to make content sound engaging.
And they give you a copy of "Winnie the Pooh" to read.
And then they say, OK, read it as you would a newsreader
and you read it very, very flat.
And then they say, read it as you would to a child,
and you notice, exactly as you say,
Steven, that your voice goes up at certain points
and it goes down at other points.
The range that you have and the speed completely change.
But you've built all of those aspects into this.
I mean, how on Earth do you do that?
STEVEN JOHNSON: Yeah, we should make it very clear,
we do not build the vocal model.
HANNAH FRY: You plural, you plural.
STEVEN JOHNSON: We have no idea how it was built,
and geniuses inside of Google built that,
and we inherited that technology.
And we have been running with it and showing
how it could be useful, but we did not build it.
One of the questions that people have is--
it's English only right now, and people
are very eager for it to come into different languages.
And we are very eager for that, too,
because we have a wonderful international audience.
But it's not something you can do easily
because the intonations and all those little conversational tics
are different in every language.
And so you can't just be like, change the words into Spanish
and press Play.
RAIZA MARTIN: I was just going to add that DeepMind actually
has a recent blog post about the audio model,
and how it was built, and who built it, and all the research
papers underneath it.
I think if we could share that, we should.
HANNAH FRY: Yeah.
No absolutely.
I think one thing that's really noticeable playing around
with this is how it is very versatile across different types
of data that you give it.
And so the way that you're describing
this, Steven, is that you're sort
of coding in all of the disfluencies,
but how do you stop this thing from just
sounding like a bunch of cliches every single time?
Raiza.
RAIZA MARTIN: I actually think it's
hard to get it not to sound like a bunch of cliches every time.
I think because of trying to standardize interestingness,
that's really, actually quite difficult.
And so interestingness tends to sound the same after hearing it
enough times.
And that's why we actually introduced the first improvement
to this particular launch, which is, we're letting users--
I call it pass a note to the hosts,
where you could slip them a little instruction on, hey, you
know what, maybe less of the cliche.
Go deeper on this topic.
And it will change the way that they talk about whatever content
you've given them.
HANNAH FRY: Should I imagine this
as almost though you have different kind of dials?
Like, maybe you turn up the quirky dial,
and maybe you turn up the historical-fact dial.
Or how can I think of this?
STEVEN JOHNSON: Well, imagine one thing
that I'm very interested in.
What if you could give each of the hosts
a different kind of field of expertise?
Right now, they basically are kind of interchangeable.
They don't have defined perspectives on the world.
One takes the lead in the conversation,
and we switch back and forth randomly.
But what if you were like, OK, I'm a city planner,
and I'm working on this design for this new town square.
And I want one of them to be an environmental activist,
and I want one of them to be an economist.
And now, let's have a conversation
and let's have a debate.
And suddenly, they have different perspectives
because one of the things-- this is something
I've written about a lot in my books over the years
--is that people are more creative, make better decisions,
when they have a diverse pool of expertise helping
them make the choices or come up with the ideas they're
trying to do.
And that's also on our roadmap for 2025.
HANNAH FRY: Will I actually be able to interact
with these hosts in the future?
Like, I don't know, interrupt them
and join their conversation?
STEVEN JOHNSON: Well, we actually
showed a version of this at I/O, the big Google Developers
Conference where we first rolled out this feature, announced it.
And they do their audio-podcast format,
and then Josh Woodward, the head of Labs, in the demo,
interrupts and says, hey, can you--
they're talking about physics.
And he's like, hey, can you use a basketball metaphor here,
because my son is listening?
And they're like, oh, great.
OK.
Someone called into the show, basically, and they're like,
let's do it in a basketball metaphor.
So that has been, publicly, part of what we wanted to do,
and you can imagine we're very eager to bring that to people.
HANNAH FRY: I mean, you paint a really compelling picture.
I do also wonder, though, is there the danger
here that it could pick up
on a minor detail in the corpus of text
and then make it into a much bigger thing than it necessarily
is?
I mean, we're still at the situation
where large language models can kind of hallucinate,
or not necessarily put the right emphasis
on different parts of what they're reporting.
STEVEN JOHNSON: In the early days, three weeks
ago, when we were testing this customization,
pass-a-note-from-the-producers feature that Raiza's talking about,
I uploaded an article I'd written a couple of years ago,
and I gave them the instructions to give me
relentless criticism of this piece in the style of an insult
comic at a roast because, again, they're kind of instructed
to be enthusiastic.
So I uploaded this piece, and it was cool.
They immediately were like, what is Johnson's problem?
Did he even do any research for this piece?
But they also kind of reached for a criticism of it that,
genuinely--
I'm not just saying this because I
wrote it and I'm defensive-- was kind of wrong.
It kind of misread it a little bit.
And I couldn't quite tell whether it
was because I'd instructed them to be so extreme
or whether they just-- it's almost like,
I keep saying this to people.
It's like they don't really hallucinate in the way
that the first-generation models do.
It's just that they sometimes get confused
or they misinterpret something in a way that humans do,
and their take is a little bit off.
HANNAH FRY: Well, what about humor though?
I mean, we're talking about all of these different types
of examples.
Have they ever made you laugh?
RAIZA MARTIN: Yes.
Yes.
STEVEN JOHNSON: Yes.
RAIZA MARTIN: Actually, I will say
that they have made me laugh through the cleverness,
and the humor, and the exploration of other people,
because I myself--
I don't think I could have come up
with the funny cases on my own.
But just seeing what people have tried
in the outside world with the technology,
that's been really funny.
And somebody uploaded a document to NotebookLM,
and the document just had the words "poop" and "fart" in it.
And when I saw that that's what it was-- the person
posted it on Twitter.
They're like, that's all this is.
Listen to the podcast.
I was like oh dear.
What is this about to be?
But it was hilarious.
It was so good.
And the thing that makes it so funny
is that there were moments that were truly hilarious,
and then it would dip into, but what does it really mean?
And it would be thoughtful.
It would be bizarre.
It would be thought provoking.
And I'm like, am I really listening to this?
But I took it very seriously.
It was great.
HANNAH FRY: Yeah, I guess in some ways,
though, that's sort of hilarious in the way
that the AI is kind of oblivious to the absurdity
of the challenge it's been set.
RAIZA MARTIN: I think on that one, they mentioned,
is somebody trying to trick us into just saying a bunch
of "poop" and "fart"?
And I was like, I think so.
HANNAH FRY: I do also think that the more traditional forms
of humor, so not just laughing at how oblivious the AI is,
but a lot of that seems to me like it's about the build-up
and release of tension.
STEVEN JOHNSON: Yeah.
HANNAH FRY: So, it's the kind of similar thing
about you're making a prediction of where you're
expecting a sentence to go, and then it
goes in a different direction.
Is this something that you think that it will
be able to do in the future?
Because I don't think it's particularly good at it now.
STEVEN JOHNSON: I actually had this sense
in the early days, the first couple of weeks,
really, that it was out--
I actually wrote about this briefly--
which was that they actually weren't very good at humor.
They had banter and they were playful,
but they didn't really like crack good jokes
or have genuinely funny things.
And then it turned out, as Raiza said,
that users were able to push them into being genuinely funny.
They had to be put in a funny situation, as it were.
Like, we've been given this poop-fart document.
Another one was a completely coherent-looking scientific
paper with charts, and graphs, and footnotes
and everything, except that every word in the paper was
"chicken," just "chicken," "chicken," "chicken," "chicken,"
"chicken," "chicken."
And every footnote was "chicken," "chicken," "chicken,"
"chicken."
All the charts were "chicken."
And so they gave them that, and that was the first part where
I actually really laughed.
They were just like, what is even happening?
And they made some funny jokes.
And so it's like they have to be prodded
into it by an unusual situation, in a weird way.
HANNAH FRY: You did mention something there, actually,
that I want to pick up on.
There are people who have criticized this technology,
saying that it's a threat to the podcasting world, that you could
be flooding it with lots
of generic, low-quality AI-generated podcasts.
Is there a response that you have to that?
RAIZA MARTIN: What is most interesting and nuanced
about it is that what we've found
is that people are creating content
about things that probably don't have a podcast
to begin with.
It really is--
I don't want to say mundane.
But it really is things that nobody is going
to make a whole show about.
And I think that is interesting.
I think we're putting power in people's hands
to create content that they want that they ordinarily
wouldn't have access to.
On the second piece of this, around the low-quality content,
I would say that for most of the content that I have heard--
the ones on the internet, just people posting on the Discord--
the quality is quite high.
On the third note, all of the generations
from NotebookLM are also watermarked with SynthID.
We've taken a very responsible and cautious
approach: as we launch machinery where you can create
audio outputs that are very human-like,
we want to make sure that we approach that with watermarking.
STEVEN JOHNSON: One of the other things
that's interesting here that I think
you're getting at a little bit in this line of questioning
is, we are personifying these people.
They do sound human, and we do all these things
to make them sound human.
And the interesting thing about this
is that the philosophy we'd
had with the product up until Audio Overviews
was that, in the text version of NotebookLM,
it actually does not try to sound particularly human.
It's very kind of factual, and it
doesn't try to be your friend on some level.
HANNAH FRY: Yeah, it's quite cold almost.
STEVEN JOHNSON: Yeah, it's almost cold.
And that was kind of the idea of the house style,
but you can't do that with voice.
That's the thing that became very clear the second we first
heard these.
It's like, you can't say, convey this through a conversation,
but don't sound human.
Don't pretend to be a person.
There's no place where the human ear will tolerate that.
HANNAH FRY: I do wonder about that, though,
because I mean, in that way, you are, as you say,
leaning in a different direction to--
I mean, lots of the other conversations
that I've had with Google DeepMind about how
you should try and avoid anthropomorphization.
You should avoid trying to think of them as "they."
We've been describing the podcast host as "they"
the entire conversation.
I mean, are there dangers or concerns
that are associated with anthropomorphization
of these characters?
RAIZA MARTIN: I think that by personifying them
to a certain extent in the way that we have, like
adding texture to the way that they describe things,
making them sound more human-like,
I think it's a way to make information easier to consume
and to make something more useful.
And I think that the reality is that we probably
shouldn't resist these types of approaches
if we believe that there is enough value associated
with them.
And I really do.
I really think that--
I've seen-- I don't know if you've seen on TikTok --all
of these people uploading their study materials,
and they're like, wow, I can study so much faster.
I think about the cases like that where I'm like,
are these people being harmed?
What is the actual danger?
And I'm not saying this to be like, well, clearly, right?
It's good for society.
But I really am thinking, what are they losing
as part of this experience?
And I think that it's less about the personification
or the anthropomorphization of the hosts themselves
and more about, OK, what did you lose by listening
instead of reading?
Maybe that's it.
STEVEN JOHNSON: Yeah, and that's a great point, Raiza.
And the other thing that I would add
on that is, it turns out that a very powerful way to learn
and to understand is through dialogue,
and through asking follow-up questions,
and steering the focus towards the things
that you need to in a complex body of work.
But that kind of dialogue, if you
wanted to have a conversation about a book
and really engage with it, most people
don't have access to the author of the book.
Most people don't have access to an expert tutor that understands
the complexities of the book.
But now, with AI, those kinds of conversational explorations
are possible.
HANNAH FRY: It's kind of a much more ancient way
to explore things, exactly as you describe.
I do wonder, though-- I mean, you're talking here
about people not having access to the author--
but what's to stop somebody from uploading a book where,
actually, you really don't want them to have
a conversation with the author?
I'm thinking here like putting in "Mein Kampf"
or the "Anarchist Cookbook."
STEVEN JOHNSON: Yeah, I mean, there's
a kind of underlying safety layer
that Google and DeepMind spent
a lot of time working on.
So if there are obviously offensive, dangerous things,
those you can catch.
The trickier thing is, what happens in terms of politics?
So if you upload something that's
within the bounds of conventional political
discussion, but it may be more right wing or more left wing,
how should the host respond to that?
And so we specifically included instructions that say, listen,
if it feels political, then you should
adopt the attitude of hey, we're not taking sides in this.
We are just going to have a conversation about what
this document says, and we're not going to endorse it
or critique it in that way.
And we figured that was the best compromise
for those kinds of complicated political stances.
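As a purely hypothetical illustration of the kind of instruction Steven describes, a neutrality rule might be prepended to the prompt that generates the hosts' script; the actual wording NotebookLM uses is not public, and every name below is invented.

```python
# Hypothetical only: a neutrality rule of the kind described,
# prepended to a script-generation prompt. Not NotebookLM's wording.

NEUTRALITY_INSTRUCTION = (
    "If the source material is political, do not take sides. "
    "Discuss what the document says without endorsing or critiquing it."
)

def host_script_prompt(source_text: str) -> str:
    """Prepend the neutrality rule to the script-generation prompt."""
    return f"{NEUTRALITY_INSTRUCTION}\n\nSource material:\n{source_text}"

print(host_script_prompt("An op-ed arguing for a new tax policy."))
```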
RAIZA MARTIN: I think there's also
the interesting sort of line, where, I think, there's
the safety concern and then I think
there's censorship concern.
And actually, in the early days, we ran into this a lot
before the safety filters became more sophisticated.
People study difficult topics, people
study things that happened in history that have quite
a bit of violence, racism.
These are topics that are fraught.
But I think it would be wrong to create a tool that blocks
content generically, without any thought around the intent
of the user.
We want to make sure we're not allowing users
to create harmful content, but at the same time,
most of our users, especially in the beginning,
were learners and educators.
Like if you're studying history, you
are definitely going to run into a safety filter.
STEVEN JOHNSON: Well, that was my problem.
The last book that I wrote was-- actually,
you mentioned "The Anarchist Cookbook."
Part of it is about the history of anarchism
and the kind of roots of terrorism in the early anarchist
world.
And so I was using NotebookLM to help me research that book
because I was writing it.
And it was constantly like, I'm sorry,
I can't answer that question because you are obviously
a terrorist, Steven.
And I'm like, no, no.
HANNAH FRY: You're definitely on a list
somewhere, Steven, aren't ya?
[LAUGHS]
RAIZA MARTIN: That's right.
STEVEN JOHNSON: Maybe I still have a job.
[LAUGHS]
HANNAH FRY: There is also this question about personal data.
I know that this is something that
has been really subject to a lot of discussion
with large language models and people uploading documents to it
and being concerned about it, kind
of feeding into the next generation of models.
So how do you make sure, in NotebookLM,
as you said, that the information that you upload
can be private and remain so?
STEVEN JOHNSON: Yeah, so this actually
is an opportunity to explain something
that I think is really important here,
which is the idea of the context window of the model.
So a context window is effectively
like the short-term memory of a language model.
The long-term memory is like its training data,
like its general knowledge of the world.
And the context is the stuff you put in with your query
when you ask a question.
And anything in the context window is transitory.
The second you close your session, it disappears.
It gets wiped from the memory of the model.
What that also means is that's why it's private.
We're not training the model on your information.
All we're doing is putting it in the short-term memory
in the model, letting the model answer questions.
And then when you close the session,
it's like the model has completely forgotten anything
that you've given to it.
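Steven's short-term/long-term memory distinction can be sketched in a few lines. Everything below, the Session class and its upload, ask, and close methods, is invented for illustration: the point is only that documents live in per-session context and are discarded on close, never written into the model's weights.

```python
# Invented names, illustrative only: per-session context as the
# "short-term memory", frozen weights as the "long-term memory".

class Session:
    def __init__(self) -> None:
        self.context: list[str] = []  # short-term memory, per session

    def upload(self, document: str) -> None:
        self.context.append(document)

    def ask(self, question: str) -> str:
        # Each query is answered from frozen weights plus this context;
        # nothing is ever written back into the weights.
        _prompt = "\n".join(self.context) + "\n\nQ: " + question
        return f"(answer drawn from {len(self.context)} private sources)"

    def close(self) -> None:
        self.context.clear()  # the model "forgets" everything uploaded

s = Session()
s.upload("My private journal, January through May.")
print(s.ask("What was I thinking about last May?"))
s.close()  # the next session starts with a blank slate
```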
HANNAH FRY: So in terms of the future of this--
I mean, this is still quite a young product.
What are the things you're hoping to include in it?
RAIZA MARTIN: I think we've seen so much excitement
about the audio feature, so I think
we can definitely commit to that being on the future roadmap.
We've alluded to more controls, more voices, more
personas, more languages.
I think that's just such an exciting horizon for us.
STEVEN JOHNSON: The one that I'm so
excited to think about, which we've just started
to scratch the surface of, is--
there's a lot of tools for asking questions and listening
to explanations of things, but what
about writing with these sources at your disposal?
How do we write in a source-grounded environment?
And so just as a writer myself, I
think that that's going to be an amazing thing.
So we have some really, really cool things in the works.
HANNAH FRY: I do also wonder about different modalities.
I mean, you've gone to audio, but, presumably, you
could go to video at some point too.
RAIZA MARTIN: Yeah, and, actually, there's a fun idea
we have for video, which is like--
we're not talking about fully generative video yet,
but imagine if you could do even something really basic.
You upload these slide decks, they have charts,
they have diagrams, you have PDFs of papers.
Just take the content that's already there.
And NotebookLM is already incredible at this
because of our citations model.
The fact that we know exactly where every piece of the answer
comes from--
we use it to generate audio overviews,
we use it to generate textual answers.
I think it wouldn't be that big of a leap
to generate short videos using your own content.
HANNAH FRY: I do really like, Steven,
how you're describing this often as the thing that you
use to make the podcast that nobody else would want to make.
But the point here, I guess, is that you're not
trying to replace all podcasts.
There are presumably things that you expect NotebookLM will never
be able to do.
STEVEN JOHNSON: Yeah, people, I think,
will generally always prefer to hear two actual humans talking
about a topic.
If there is economics or passion enough
to generate a podcast on a topic,
humans actually talking to each other will be the choice.
It just turns out that there's this vast,
uncharted territory that just wasn't-- no one ever thought
about making a podcast based on the family trip to Alaska,
[LAUGHS] because it just didn't make sense to rent a studio
to do that.
But now, you can just take everybody's journal entries
and photos and upload it to NotebookLM,
and you can have a podcast based on your family trip.
And so I think that's where it turns out
there's just all this untapped kind of blank space on the map
that we've just started to explore.
HANNAH FRY: Do you think that there
are elements of, like, human-content creation
that are really hard to capture with AI,
or that AI will maybe never be able to capture?
STEVEN JOHNSON: Yeah, that's the thing
we're trying to figure out.
I mean, the one idea I think that I'm really interested in
is like, how capable are these models
at thinking and developing ideas that are really long form?
So book writing-- so when you're coming up with the idea
for a book, you're really thinking it's-- one,
it's an incredibly long-term process and you're thinking
about a presentation of information that's going to go
on for 300 pages.
It's going to involve all this complexity, all this narrative
complexity.
And you couldn't approach that all
with a language model right now.
You could work on little bits of it.
You could say, OK, I'm trying to set up this scene
or I'm trying to figure out what the narrative should be,
but you can't actually imagine the whole thing.
That, right now, is just a human-exclusive capability.
And I think it will be for a long time, and it may always be.
But who knows where we're going to end up?
HANNAH FRY: Both the wood and the trees simultaneously.
STEVEN JOHNSON: Yeah.
Yeah, and I think there are the kind of seeds of that.
There's some promising signals, but people
who write books for a living, I think,
can feel confident that they will
continue to be able to do that.
HANNAH FRY: Yeah, although writing books for a living
is one of the most torturous professions there is.
[LAUGHS]
As someone who's trying to write one at the moment,
I want you guys to hurry up, please.
Well, thank you both for joining me.
That was a really, really fascinating discussion.
Appreciate it.
STEVEN JOHNSON: Thanks for having us.
RAIZA MARTIN: Thank you.
Thanks for having us.
HANNAH FRY: You know, I think there's actually
something quite heartwarming about the way
that NotebookLM has captured people's imagination,
because on the one hand, you've got this technology that
is operating at the absolute cutting edge of what
is possible with some of the most sophisticated AI models
out there.
And it's something that's designed
to deal with this very modern problem about how we are often
overwhelmed with having to process
these large amounts of, often, quite dense and maybe
quite boring information.
And they've hit upon a solution that is so innately human, so
ancient and appealing, the idea of listening
in to a conversation between two excitable and interested people.
And, of course, the fastest way to make
a human prick up their ears and pay attention is through gossip.
And this is like sitting around a fire
while an AI uses that very trick to help
you digest 25 pages of a snorefest lecture series.
I mean, put it this way.
If it can make my PhD thesis sound interesting,
then this has the potential to be quite a powerful tool.
You have been listening to "Google DeepMind-- the Podcast,"
with me, Professor Hannah Fry.
If you enjoyed that episode, then
do subscribe to our YouTube channel.
And you can also find us on your favorite podcast platform.
And of course, we have got plenty more episodes
on a whole range of topics to come, so do check those out too.
See you next time.
[MUSIC PLAYING]