Deep Dive into Long Context
By Google for Developers
## Summary
## Key takeaways
- **Tokens ≈ Slightly Less Than Words**: A token is basically slightly less than one word in the case of text; it could be a word, part of a word, or punctuation like commas or full stops. [01:17], [01:47]
- **Context Window: In-Context Memory**: The context window is the explicit in-context memory supplied to the model, such as prompts, previous interactions, or uploaded files; it is much easier to modify than in-weight pre-training memory. [05:55], [06:57]
- **RAG + Long Context Synergy**: Long context and RAG work together: long context allows retrieving more relevant needles with RAG to increase recall, especially for enterprise knowledge bases with billions of tokens. [13:17], [13:34]
- **Attention Competition with Distractors**: Attention has a drawback due to competition between tokens; hard distractors attract more attention, leaving less for relevant information, and the effect worsens with larger context sizes. [24:11], [24:55]
- **Context Caching for Efficiency**: Rely heavily on context caching: the first long-context query is slower and costlier, but subsequent questions on the same context are cheaper and faster; put the query after the context. [41:21], [44:04]
- **Future: 10M Tokens for Coding**: Near-perfect 1-2M context quality comes first; then falling costs make 10M-token context a commodity, unlocking superhuman coding by holding entire large codebases and connecting the dots precisely. [49:53], [53:13]
## Topics Covered
- Tokens Distort Human Word Views
- Context Overrides In-Weight Memory
- RAG Complements Long Context Synergy
- Attention Competes Across Tokens
- 10M Context Unlocks Superhuman Coding
## Full Transcript
I'm really impressed by the work of our inference team. I've got a bunch of spicy RAG versus long context questions for you. You can rely on context caching to make it both cheaper and faster to answer. What's the limitation of continuing to scale up beyond 1 to 2 million? This thing is going to be incredible for coding applications. We will have lots more exciting long context stuff to share with folks.
[Music]
Welcome back to Release Notes, everyone. How's it going? Today we're joined by Nikolay Savinov, a staff research scientist at Google DeepMind and one of the co-leads for long context pre-training. Nikolay, how are you?

Yeah, hi. Thanks for inviting me.

Let's start off at the most foundational level and we'll build up from that. What is a token, and how should folks think about it?
So the way you should think about a token: it's basically slightly less than one word in the case of text. A token could be a word, part of a word, or it could be punctuation like commas, full stops, etc. For images and audio it's slightly different, but for text, just think of it as slightly less than one word.
Yeah. And why do we need tokens? Humans are generally familiar with characters, so why do AI and LLMs have this special concept of a token? What does it actually enable?

Well, this is a great question, and actually many researchers have asked it themselves. There have been quite a few papers trying to get rid of tokens and just rely on character-level generation. But while there are some benefits to doing that, there are also drawbacks, and the most important one is that generation is going to be slower, because you generate roughly one token at a time, and if you are generating a word in one go, that's much faster than generating every character separately. So those efforts, I would say, didn't really succeed, and we are still using tokens.
Yeah. For folks who haven't spent a bunch of time thinking about tokens, there are a bunch of good Andrej Karpathy videos and tweets about how tokenizers are the root of all the weirdness and complexity in LLMs. Most of the weird edge cases you run into are rooted in the fact that the model is not looking at things at a character level; it's looking at them at a token level. The pertinent example folks love to go to these days is counting the characters in a single word, like how many Rs are there in strawberry, which turns out to be a weird problem to solve. My understanding is that's because tokenizers break the word into different parts; the model isn't actually looking at the word at the individual character level. Is that an apt description?

Yeah, I think that's a pretty good description of the problem. One thing you should realize is that, due to tokenization, those models view the world very differently from how humans view the world. When you see a strawberry, you see a sequence of letters. But for the model it could even be one token, and then you ask, hey, count the number of R letters in this token. It's pretty hard to get this knowledge from pre-training, because you would need to associate the R-letter token that you encountered somewhere on the web with the word strawberry, which is also one token. If you think about the mental load of doing that, it's not such a trivial task, I would say. Although obviously when the model can't do it, we start complaining: hey, if it's AGI, how come it can't count the number of R letters in strawberry, when a child could do that?

Yeah, it is super weird. And actually another interesting thing is that if you watch some of the Karpathy videos, there are a lot of problems with whitespace. This is an interesting point, because normally most tokens are prefixed with a whitespace, and then some really weird effects might happen, because you might encounter problems on the boundaries when you think you are concatenating something, but this concatenation is very unusual for the model to see.

Interesting. That is super interesting.
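To make the strawberry example concrete, here is a minimal sketch in Python with a hypothetical, hand-built vocabulary (not any real model's tokenizer): once "strawberry" maps to a single token ID, the character count is simply not visible in the sequence the model receives.

```python
# Toy illustration of why character counting is hard at the token level.
# The vocabulary below is hypothetical, not a real tokenizer's vocabulary.
TOY_VOCAB = {"how": 1, " many": 2, " r": 3, "s": 4, " in": 5, " strawberry": 6, "?": 7}

def toy_encode(text: str) -> list[int]:
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    while text:
        match = max((t for t in TOY_VOCAB if text.startswith(t)), key=len, default=None)
        if match is None:           # skip characters the toy vocabulary can't cover
            text = text[1:]
            continue
        ids.append(TOY_VOCAB[match])
        text = text[len(match):]
    return ids

prompt = "how many rs in strawberry?"
print(toy_encode(prompt))           # [1, 2, 3, 4, 5, 6, 7]
# " strawberry" is a single ID (6): the model never sees its letters, so counting
# the three r's requires knowledge memorized elsewhere during pre-training.
print("strawberry".count("r"))      # a character-level view trivially gives 3
```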
I think this takes me to talking more generally about context windows. Obviously we're talking about long context, which sort of assumes you know what a context window is, but can you give the lay of the land of how folks should think about what a context window actually is? Why do I, as a user of an LLM or somebody building with AI models, need to care about the context window?

So the context window is basically exactly those context tokens that we are feeding into the LLM, and it could be the current prompt, or the previous interactions with the user, or the files that the user uploaded, like videos or PDFs. And when you supply context to the model, the model actually has knowledge from two sources. One source is what I would call in-weight, or pre-training, memory. The LLM was trained on a slice of the internet and it learned something from there; it doesn't need additional knowledge to be supplied in context to remember some of those facts. So even without context, there is some kind of memory present in the model. But another kind of memory is the explicit in-context memory that you are supplying to the model. And it's pretty important to understand the distinction between those two, because in-context memory is much, much easier to modify and update than in-weight memory.

For some kinds of knowledge, in-weight memory might be just fine. If you need to memorize simple facts, like that objects fall down and not up, these are very basic common facts and it's fine if this knowledge comes from pre-training. But there are some facts which are true at the time of pre-training and then become obsolete at the time of inference, and you need to update those facts somehow; the context provides you a mechanism to do this update. And it's not only about up-to-date knowledge. There are also other kinds of knowledge, like private information: the network doesn't know anything about you personally and it can't read your mind. So if you want it to be really helpful for you, you should be able to supply your private information into context, and then it will be able to personalize. Without this personalization, it's going to give you the generic answers it would give to any human instead of answers tailored to you. And the final category of knowledge which needs to be inserted in context is rare facts, knowledge which was encountered very sparingly on the internet. I must say I suspect this category might go extinct with time; maybe future models will just learn the whole slice of the internet by heart and we will not need to worry about those. But the reality at this point is that if something is mentioned once or twice on the whole internet, the models are actually unlikely to remember those facts, and they are going to hallucinate the answers. So you might want to insert those explicitly into context.

And the kind of trade-off we are dealing with is that for short context models, you have limited ability to provide additional context; basically, you would have a competition between knowledge sources. If the context is really large, then you can be less picky about what you insert, and you can have higher recall and coverage of relevant knowledge. And if you have higher coverage in context, that means you're going to alleviate all those problems with in-weight memory.
Yeah, there are so many angles to push on. That was a great description. One of the follow-ups from this: we talked about in-weight memory and in-context memory. The third class is around how to bring context in through RAG systems, retrieval-augmented generation. Can you give a high-level description of RAG? And then I've got a bunch of spicy RAG versus long context questions for you.

Yeah, sure. So what RAG does is, well, it's a simple engineering technique. It's an additional step before you pack the information into the LLM context. Imagine you have a knowledge corpus, and you chunk this corpus into small textual chunks, and then you use a special embedding model to turn every chunk into a real-valued vector. Then, based on those real-valued vectors, if you get a query at test time, you can embed the query as well, and you can compare the query's real-valued vector to those chunks from the corpus. For the chunks which are close to the query, you're going to say, hey, I found something relevant, so I'm going to pack those chunks into context, and now I'm running the LLM on this. That's how RAG works.
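As a minimal sketch of that pipeline in Python (the `embed` function here is a stand-in for whatever real embedding model you use, and the chunk size, similarity measure, and corpus file are illustrative assumptions, not anything prescribed in the conversation):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: hash characters into a unit vector."""
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def chunk(corpus: str, size: int = 500) -> list[str]:
    """Split the corpus into fixed-size textual chunks."""
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks whose embeddings are closest to the query."""
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]   # cosine similarity on unit vectors
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

corpus = open("knowledge_base.txt").read()           # hypothetical corpus file
context = "\n\n".join(retrieve("What is our refund policy?", chunk(corpus)))
prompt = f"{context}\n\nBased on the information above: what is our refund policy?"
# The prompt is then sent to the LLM; with a longer context window you can
# simply raise top_k to increase recall instead of setting a conservative cutoff.
```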
And this is maybe a silly question. My sense has always been that RAG exists because there are hard limits on the context you can pass to the model. We have 1 million, we have 2 million, and that's awesome, but if you look at internet scale, Wikipedia has billions of tokens, whatever it is. Why is RAG, this notion of bringing the right context to the model, not just baked into the model itself? Is it just that, to the point of the conversation, the model doesn't work well for that, or is it the wrong research direction? Why don't we build that mechanism in? Because my face-value perspective is that it seems like it would be useful if the model could just do RAG itself: I could pass a billion tokens and let the model figure out, heuristically or through whatever mechanism, what the right tokens are. Or is that a problem somewhere else in the stack that should be solved, so the model shouldn't have to think about it?
Well, one thing I want to say is that after we released the 1.5 Pro model, there were a lot of debates on social media about whether RAG is becoming obsolete. From my perspective, not really. Say, enterprise knowledge bases constitute billions of tokens, not millions, so for that scale you still need RAG. What I think is going to happen in practice is that it's not like RAG is going to be eliminated right now, but rather long context and RAG are going to work together. The benefit of long context for RAG is that you will be able to retrieve more relevant needles into the context by using RAG, and by doing that you're going to increase the recall of the useful information. If previously you were setting a rather conservative threshold and cutting out many potentially relevant chunks, now you're going to say, hey, I have a long context, so I'm going to be more generous and pull in more facts. So I think there's a pretty good synergy between those, and the real limitation is the latency requirements of your application. If you need real-time interactions, then you'll have to use shorter context, but if you can afford to wait a little bit more, then you're going to use long context, just because you can increase the recall by doing that.
Is 1 million just a marketing number, or is there something intrinsic, something technically happening around the million token mark from a long context perspective? Or is it literally just a number that sounds good, and then we made the technology work from a research perspective?

Well, when I started working on long context, the competition at the time was at about 128k, or maybe 200k tokens at most. So I was thinking about how to set the goals for the long context project, which at the time was a small part of Gemini, and I originally thought, well, just matching competitors doesn't sound very exciting. So I thought, let's set an ambitious bar. One million seemed like an ambitious enough step forward: compared to 200k, that's about 5x. And very soon after we released 1 million, we also shipped 2 million, which was about 10x larger, one order of magnitude larger than the previous state of the art. That's a good goal; that's what makes it exciting for people to work on.
Yeah, I love that. My spicier follow-up question from that: we shipped 1 million, and we shipped 2 million rapidly after that. What's the limitation on continuing to scale up beyond 1 to 2 million? Is it that from a serving perspective it's too costly? Or does the architecture that makes 1 to 2 million work just fundamentally break down when you go larger than that? How come we haven't seen the frontier for long context continue to push?

Yeah. So when we released the 1.5 Pro model, we actually ran some inference tests at 10 million, and we got some quality numbers as well, and for, say, single-needle retrieval it was almost perfect for the whole 10 million context. We could have shipped this model, but it's pretty expensive to run this inference, so I guess we weren't sure if people were ready to pay a lot of money for this. So we started with something more reasonable in terms of the price. In terms of quality, that's also a good question, because it was so expensive to run that we didn't run many tests, and just bringing up this server again is also quite costly, and right now we don't have the chips to ship it to a lot of customers.
Yeah. Do you think that will continue to hold? I don't know if it's an exponential increase in capacity that's needed as we do more long context stuff, but do you have an intuition about whether we need fundamental breakthroughs from a research perspective for that to change, so that we can actually keep scaling up? Or is 1 to 2 million going to be what we stick with, and if you want more than that, do RAG and be really smart about bringing context in and out of the context window?

So my feeling is that we actually need more innovations. It's not just a matter of brute-force scaling to actually get close to perfect 10 million context; we need more innovations. But then in terms of RAG and which paradigm will be more powerful going into the future, I think the cost of those models is going to decrease over time, and we're going to try to pack more and more context retrieved with RAG into those models, and because the quality is also going to increase, it's going to be more and more beneficial to do that.
Yeah, that makes sense. Can you take us back to when we originally landed long context? My understanding of the story is that 1.5 Pro wasn't built for long context to begin with. You had tried to kick off that workstream with others inside of DeepMind, and it ended up being that the pace of research progress was super fast: we had the breakthroughs, we realized it worked, and shortly thereafter it ended up actually landing in the model. What was the timeline from the effort starting to actually landing it in a model that was available externally to the world?

Oh, I think that was indeed pretty quick. And just to clarify, we were wishing to go long and achieve, say, 1 or 2 million context, but we kind of didn't expect ourselves to get there that fast. And when it actually happened, we thought, hey, this is really impressive, we actually made some strides on this task, so now we need to ship it. And then we managed to assemble a pretty awesome team very quickly, and the team worked really hard. To be honest, in my life I've never seen people working this hard. I was really impressed.
I love that. That's awesome. And that was for the original 1.5 Pro series. We landed it for 1.5 Flash as well, we now have it for 2.0 Flash, and we have 2.5 Pro. Can you give us the lay of the land of what's been happening from a long context perspective, from that original launch, when we knew long context was possible and released the technical report for 1.5 Pro showing the needle-in-a-haystack results and a bunch of stuff like that, to today, where I think a lot of what's actually making this 2.5 Pro model blow people's minds is how strong it is at long context, which has been awesome for coding use cases and stuff like that? What's happened in the long context world from original launch to today?

Yeah. So I think the biggest improvement was actually the quality, and we made strides on quality both at, say, 128k context and at 1 million context. If we look at the benchmark results for the 2.5 Pro model, we observe that it's better compared to many strong baselines like GPT-4.5, Claude 3.7, o3-mini-high, and some of the DeepSeek models. To actually compare to those models we had to run the evals at 128k so that they're all comparable, and we saw quite a big improvement for 2.5 Pro. And in terms of 1 million context, we compared it with 1.5 Pro and we also saw some significant advantages.
advantages. This is maybe a weird question, but like does the quality like eb and flow at different context sizes?
Like do you see like is it like like almost like linear quality like on a you know 100,000 token input versus like 128,000 or like a 50,000 versus 100?
like is it like pretty consistent across or is there like weird like I'm I'm trying to imagine maybe the it all generalizes when you make it into the final model and there's no there's no difference but like is there any like
nuance in in that perspective have we done evals that show anything like that?
Yeah, internally we looked at some some of those evals. Um I guess maybe your question goes into these effects that people observed in the past like uh very
popular one was lost in the middle effect. M and to answer your question uh
effect. M and to answer your question uh the lost in the middle effect where you have a deep in the middle of the context we don't really observe this with our
models but what we do observe is that if it's a hard task not like a single needle but uh some task with hard distractors then the
quality slightly decreases with uh increasing context and that's something we want to improve Yeah. And just for my own mental model
Yeah. And just for my own mental model: when I think about putting 100,000 tokens into the context window of the model, should I, from a developer perspective, or as a user who's actually using the long context functionality, assume that the model is actually attending to all of the different context? I know it can definitely do the one needle, it can pull that out, but is it actually reasoning over all those tokens in the context window? I have a bad mental model of what's happening behind the scenes when you have that much context in the context window of the model.

Yeah, I think that's a good question. One thing you need to keep in mind is that attention in principle has a bit of a drawback, because there's a competition happening between tokens: if one token gets more attention, then other tokens will get less attention. The thing is, if you have hard distractors, then one of the distractors might look very similar to the information that you're looking for, and it might attract a lot of attention, and now the piece of information that you are actually looking for is going to receive less attention. And the more tokens you have, the harder the competition becomes. So it depends on the hardness of the distractors and also on the context size.

This is another silly follow-up question, but is the amount of attention always fixed? Is it possible to have more attention, or is it just, whatever, a value of one that's spread across all of the tokens in the context window, so the more tokens you have, the less attention there is, and there's no way for that to change?

Normally that's the case: the whole pool of attention is limited.
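That "fixed pool" intuition comes from the softmax in attention: the weights over the context always sum to 1, so adding tokens, especially similar-looking distractors, dilutes what the relevant token can receive. A tiny numeric sketch with illustrative scores, not real model values:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())
    return e / e.sum()

# One query attends over a needle plus N distractors.
needle_score = 5.0
easy_distractor, hard_distractor = 0.0, 4.5   # hard distractors score almost like the needle

for n_distractors, score in [(10, easy_distractor), (10, hard_distractor), (10_000, hard_distractor)]:
    scores = np.array([needle_score] + [score] * n_distractors)
    weight_on_needle = softmax(scores)[0]
    print(f"{n_distractors:>6} distractors (score {score}): needle gets {weight_on_needle:.4f} of the attention")

# The weights always sum to 1, so more tokens and harder distractors
# leave less attention for the information you actually want.
```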
Yeah. From that example you gave about hard distractors causing the model to do a lot more work and split its attention: has your team, or other teams on the applied side, explored pre-filtering mechanisms, anything like that? Say you want long context to work really well in production; it sounds like the best outcome is to have very dissimilar data in the context window. If there's a lot of similar data and you're asking a question that could be relevant to all of it, you'd expect the performance to be worse in general in that use case. Is that just something that developers or the world needs to figure out, or do you have any suggestions for how folks should approach that problem?

For me as a researcher, I think that would kind of be a move in the wrong direction. I think we should work more on improving the quality and robustness instead of coming up with some hacks for filtering. One practical recommendation, though, is of course to try not to include totally irrelevant context. If you know that something is not useful, then what's the goal of including it in the context? At the very minimum it's going to be more expensive. So why would you do it?
Yeah, that's interesting, because I feel like in some sense that goes against the core way people use long context. The examples I see online are people saying, oh, I'll just take all this random data, throw it in the context window of the model, and have it figure out what's useful for me. Given how important it sounds like it is to remove some of that stuff, you'd almost expect the model to do the pre-filtering itself, to only include the relevant parts. Not that humans are lazy, but I feel like one of the selling points has been that I don't need to think about what data I'm putting into the context window. So do you think there's a world where the model, maybe as a multi-part system or something like that, is doing some of that elimination of extraneous data based on what the user's query is, and making sure that when the context actually goes to the model, it's a little bit easier?

Over time, as the models get better quality and cheaper, you just won't need to think about this anymore. I'm just talking about the current realities: if you want to make good use of it right now, then let's be realistic, just don't put in irrelevant context. But I also agree with your point that if you spend too much time manually filtering or handcrafting which things to put into context, that's annoying. So I guess there should be a good balance between those.

Yeah, I think the point of context is to simplify your life and make it more automatic, not to make it more time consuming or make you spend time handcrafting things.
Yeah, I've got to follow up on this around evals, and the evals that you're thinking about from a long context quality perspective. Needle in a haystack was obviously the original one that we put into the 1.5 technical report, and for folks who aren't familiar, needle in a haystack is just asking the model to find one piece of context in 1 million, 2 million, 10 million tokens of context; the models are extremely good at this. How do you think about the other set of long context benchmarks? Needle in a haystack gets talked about a little bit, but is there another set of standard benchmarks that you're thinking about from a long context perspective?

Let's see. I think evaluation is pretty much the cornerstone of LLM research, and especially if you have a large team, evaluation provides a way for the whole team to align and push in a common direction. The same applies to long context: if you want to make progress, you need to have great evaluations.
Now, single needle in a haystack is a solved problem, especially with easy distractors. If it's, say, Paul Graham's essays and you insert the phrase "the magic number for the city of Barcelona is 37" and then ask "give me the magic number for the city of Barcelona", that's really a solved problem. But now the frontier of capabilities is handling hard distractors. If you, for example, pack your whole context with phrases like "the magic number for city X is Y", say the whole million-token context filled with these key-value pairs, that's a much harder task, because then the distractors actually look very similar to what you want to retrieve. Another thing which is hard for LLMs is retrieving multiple needles. So I feel like these two things, the hardness of distractors and multiple needles, are the frontier.
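A minimal sketch of how you might generate that kind of hard-distractor, multi-needle haystack yourself (the phrasing, sizes, and scoring are illustrative assumptions, not the exact eval described here):

```python
import random

random.seed(0)

def build_haystack(n_pairs: int, n_needles: int):
    """Pack the context with near-identical key-value facts and pick some as needles."""
    cities = [f"city-{i}" for i in range(n_pairs)]
    values = {c: random.randint(0, 999) for c in cities}
    lines = [f"The magic number for {c} is {values[c]}." for c in cities]
    random.shuffle(lines)
    needles = random.sample(cities, n_needles)
    question = "What are the magic numbers for " + ", ".join(needles) + "?"
    expected = {c: values[c] for c in needles}
    return "\n".join(lines), question, expected

context, question, expected = build_haystack(n_pairs=50_000, n_needles=5)
prompt = f"{context}\n\n{question}"
# Every line looks just like the needles, so retrieval has to be precise;
# scoring would check the model's answer against `expected`.
```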
frontier. But also there are there are additional considerations for the evals.
One consideration you might have is oh well like those new in the haststack evals even with hard destructors they're pretty artificial so maybe I want something more
realistic. This is a valid argument but
realistic. This is a valid argument but the thing you need to keep in mind is that once you increase the realism of the eval you might actually lose the ability to measure the core long context
capability.
For example, if you are asking a question to a very large codebase and the question is basically can be answered by just one file in this codebase and then the task is to
actually implement something complicated then you're not really going to be exercising the long context capability.
Instead you are going to be exercising the coding capability and then it will give you a wrong signal for heel climbing. It will you will basically
climbing. It will you will basically heel climb on coding instead of long context. Yeah. So that's one
context. Yeah. So that's one consideration. Another consideration is
consideration. Another consideration is that something which people call retrieval versus uh synthesis evolves.
So theoretically if you need to just retrieve one needle from the haststack that can be solved by by rack as well.
But the tasks that we should really be interested in are the tasks which integrate information over the whole context.
And for example, well summarization is uh is one such task and uh rack would have a hard time dealing with this. But now these tasks
this. But now these tasks they it sounds nice and the right direction to go but they're actually not so easy to use for automatic evaluation.
For example, the metrics for summarization we know that they are like metrics like rouch etc. they are imperfect. Mhm. And if you're doing hill
imperfect. Mhm. And if you're doing hill climbing, then you actually you're better off using something which is more
um how do I say less uh less gameable metrics? And and just a quick follow
metrics? And and just a quick follow what makes them less useful like summarization as an example? Is it just that it's like more subjective of like what a good summary is versus what isn't and it doesn't have like a ground truth source of source of truth or what makes
that use case hard? Yeah, those of us are going to be uh pretty noisy because there will be a relatively low agreement between the
even between the human raers. Of course
this is not to give an impression that we shouldn't work on summarization and we shouldn't measure summarization.
These are important tasks. I'm just
talking about uh like my personal preferences as a researcher is to heel climb on something which has a very strong signal. Yeah
Yeah, that makes sense. Long context, especially for Gemini, is a core part of the capability story that we're telling the world; it's a core differentiator for Gemini. And yet at the same time it feels like long context has always been an independent workstream, as in, not everything is long context. There are a ton of other teams hill climbing on a bunch of other stuff: factuality, reasoning, etc. Do you think the direction, from a research and modeling perspective, is that long context just gets fused into every other workstream? Or does it still need to be an independent workstream because it's fundamentally different in how you get the model to do useful stuff with long context versus, say, reasoning as a corollary example?

So I guess my answer will be twofold. First of all, I find it helpful to have an owner for every important capability. But second, I think it's important for the workstream to also provide tools for people outside of the workstream to contribute.
That makes a ton of sense. I have another follow-up around reasoning, and I'm curious about the interplay between reasoning and long context. We had Jack Rae on, and we were both at dinner with Jack last night talking about reasoning stuff. Have you been surprised, and you can correct me if this is wrong, by how much the reasoning capability actually makes long context more useful? Is that just a normal, expected outcome because the model's spending more time thinking, or is there some inherent deep connection between reasoning capabilities and long context that makes it much more effective?

I would say there's a deeper connection, and the connection is that if the next-token prediction task improves with increasing context length, then you can interpret this in two ways. One way is to say, hey, I'm going to load more context into the input, and the predictions for my short answer are going to improve as well. But another way to look at this is to say, well, the output tokens are very similar to input tokens, so if you allow the model to feed the output into its own input, then it kind of becomes like input. So theoretically, if you have a very strong long context capability, it should also help you with reasoning.

Another argument is that long context is pretty important for reasoning, because if you are just going to make a decision by generating one token, even if the answer is binary and it's totally fine to generate just one token, it might be preferable to first generate a thinking trace. And the reason is simply architectural: if you need to make many logical jumps through the context when making a prediction, then you are limited by the network depth, because that's roughly the number of attention layers. That's what's going to limit you in terms of jumps through the context. So you're limited. But now, if you imagine that you are feeding the output into the input, then you are not limited anymore. Basically, you can write into your own memory and you can perform much harder tasks than you could by just utilizing the network depth.
That's super interesting. Related to this reasoning plus long context story, you and I have both been pushing for a long time to try to get long output landed in the models. I think developers want this; I see pings all the time, and I'm going to start sending them to you now so that you have to answer this question. But lots of people are saying, "Hey, we want longer than 8,000 output tokens." We sort of have this to a certain extent now with the reasoning models: they have 65,000 output tokens, with the caveat that a large portion of those output tokens is actually for the model to do the thinking itself versus generating some final response to the user. How connected are the long context input versus long output capabilities? Is there any interplay between those two things? Because I feel like the core use case a lot of people want is, you know, dump in a million tokens and then refactor that million tokens. Do you think we'll get to a world where those two things are actually the same capability? Do you look at them as the same capability, or are they two completely, fundamentally different things from a research perspective?
No, I don't think they are fundamentally different. I think the important thing to understand is that straight out of pre-training there isn't really any limitation from the model side on generating a lot of tokens. You can just put in, say, half a million tokens and tell it, I don't know, copy these half a million tokens, and it will actually do it; we actually tried it and it works. But this capability requires very careful handling in post-training, and the reason is that in post-training you also have this special end-of-sequence token. If your SFT data is short, then what's going to happen is the model is going to see this end-of-sequence token pretty early in a sequence, and then it's just going to learn: hey, you're always showing me this token within context length X, so I'm going to generate this token within context length X and stop generation; that's what you are teaching me. This is actually an alignment problem. But one point I want to make is that I feel like reasoning is just one kind of long output task. For example, translation is another kind. Reasoning has a very special format: it packs the reasoning trace into some delimiters, and the model actually knows that we're asking it to do the reasoning in there. But for translation, the whole output, not just the reasoning trace, is going to be long. And this is another kind of capability that we want to encourage the model to produce. So it's just a matter of properly aligning the model, and we are actually working on long output.
I'm excited. People want it very badly. I think that gets to a broader point around how developers should be thinking about best practices for long context, and also for RAG potentially as well. I know you gave a bunch of feedback on our long context developer documentation, so we have some of this stuff documented already, but what's your general sense of the suggestions for developers as they're thinking about how to most effectively use long context?

So I think suggestion number one is to try to rely heavily on context caching. Let me explain the concept of context caching. The first time you supply a long context to the model and ask a question, it's going to take longer and it's going to cost more. But if you're asking a second question after the first one on the same context, then you can rely on context caching to make it both cheaper and faster to answer. That's one of the features we are currently providing for some of the models. So yeah, try to rely heavily on this: try to cache the files that the user uploaded into context, because it's not only faster to process, it's going to cost you on average about four times less on the input token price.

And just to give an example of this, and correct me if this is wrong or not the same mental model that you have, the most common application where this ends up being really useful is the chat-with-my-docs, chat-with-PDF, chat-with-my-data type of applications, where the original input context, to your point, is the same. And again, correct me if my mental model is wrong, that's one of the requirements of using context caching: the original context you supply has to be the same. If for some reason that input context was changing on a request-by-request basis, context caching doesn't end up being that effective, because you're paying to store some set of original input context that has to persist from user request to user request.

Yeah, I guess the answer is yes to both. It's important for cases where you want to chat with a collection of your documents, or some large video you want to ask questions about, or a codebase. And you are correct to mention that this knowledge shouldn't change, or if it changes, the best way for it to change is at the very end, because what we're going to do under the hood is find the prefix which matches the cached prefix, and we're just going to throw away the rest. Sometimes developers ask: where should we put the question, before the context or after the context? Well, this is the answer: you want to put it after the context, because if you want to rely on caching and profit from the cost saving, that's the place to put it. If you put it at the beginning, and you intend to put all your questions at the beginning, then your caching is going to start from scratch.
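A minimal sketch of that ordering advice in Python. The document path is a hypothetical stand-in, and no specific provider API is shown; the point is only that the large, stable context comes first so repeated requests share the longest possible prefix:

```python
import os.path

def build_prompt(context: str, question: str) -> str:
    # Stable, cacheable part first; the part that changes per request goes last.
    return f"{context}\n\nQuestion: {question}\nAnswer:"

document = open("annual_report.txt").read()   # hypothetical large uploaded file

prompts = [
    build_prompt(document, "Summarize the revenue section."),
    build_prompt(document, "List the main risk factors."),
]

# Both prompts share the entire document as a common prefix, so a provider-side
# prefix cache (or an explicit cache you create once and reference per request)
# only pays the full long-context cost on the first call.
print("shared prefix length (chars):", len(os.path.commonprefix(prompts)))

# If the question were placed before the document, the prompts would diverge at
# the very first characters and the cache would start from scratch every time.
```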
Yeah, that's awesome. That's helpful. Other tips, anything else besides context caching that folks should be thinking about from a developer perspective?

One thing we already touched on is the combination with RAG. If you need to go into billions of tokens of context, then you need to combine with RAG. But also, in some applications where you need to retrieve multiple needles, it might still be beneficial to combine with RAG even if you need much shorter contexts. Another thing which we already discussed: don't pack the context with irrelevant stuff; it's going to affect multi-needle retrieval. Another interesting thing: we touched on the interaction between in-weight and in-context memory. One thing I must mention is that if you want to update your in-weight knowledge using in-context memory, then the network will necessarily get two kinds of knowledge to rely on, and there might be a contradiction between those two. I think it's beneficial to resolve this contradiction explicitly by careful prompting. For example, you might start your question by saying "based on the information above", etc. When you say "based on the information above", you give a hint to the model that it actually has to rely on in-context memory instead of in-weight memory, so it resolves this ambiguity for the model.
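As a small illustration of that prompting tip (the document path and the exact wording of the instruction are just one possible phrasing, not an official template):

```python
# Hypothetical document whose contents contradict older pre-training data.
context = open("pricing_update_2025.txt").read()

prompt = (
    f"{context}\n\n"
    "Based on the information above, and ignoring anything you may remember "
    "from your training data if it conflicts with this document, "
    "what is the current price of the Pro plan?"
)
# The explicit "based on the information above" phrase nudges the model to
# resolve the in-weight vs. in-context contradiction in favor of the context.
```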
I love that. That's a great suggestion. And on your comment about this tension between in-weight knowledge and context, and again we talked a little bit about this, how do you think about the fine-tuning angle of this from a developer perspective? The only thing that's maybe more controversial than "is long context going to kill RAG" is "should people be fine-tuning at all". Simon Willison has a bunch of threads about this: does anyone actually fine-tune models, and does it end up helping them? How do you think about this? Would it be useful to do fine-tuning alongside long context for a similar corpus of knowledge, or does fine-tuning potentially lead to better outcomes in general? How do you think about that interplay?

Yeah. So let me maybe elaborate on how fine-tuning could actually be used on a knowledge corpus. What people sometimes do is, well, they get additional knowledge. Let's say you have a big enterprise knowledge corpus, say a billion tokens, and you could continue training the network just like we're doing in pre-training: you could apply the language modeling loss and ask the model to learn how to predict the next token on this knowledge corpus. But you should keep in mind that this way of integrating information, while it actually works, has limitations. One limitation is that because you're actually going to train the network instead of just supplying the context, you should be prepared for various problems: you will need to tune hyperparameters, you will need to know when to stop the training, you'll have to deal with overfitting. Some people who actually tried to do that reported increased hallucinations from using this process, and they hinted that maybe it's not the best way to supply knowledge into the network. But obviously this technique also has advantages. In particular, it's going to be pretty cheap and fast at inference time, because the knowledge is in the weights, so you're just sampling. But there are also some privacy implications, because now the knowledge is cemented into the weights of the network. And if you actually want to update this knowledge, then you are back to the original problem: this knowledge is not easy to update, it's in the weights. So how are you going to do it? You will have to again supply this knowledge through the context.
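For readers who haven't seen it, "continue training with the language modeling loss" is just next-token cross-entropy over the new corpus. A minimal PyTorch-style sketch, where `model` and `corpus_ids` are hypothetical stand-ins for whatever stack you use, and the hyperparameters are placeholders, which is exactly the tuning burden being described:

```python
import torch
import torch.nn.functional as F

# Hypothetical: `model` maps token ids -> logits; `corpus_ids` is the enterprise
# corpus tokenized into one long 1-D tensor of token ids.
def continued_pretraining_step(model, corpus_ids: torch.Tensor, seq_len: int, optimizer):
    # Sample a random window from the corpus.
    start = torch.randint(0, corpus_ids.numel() - seq_len - 1, (1,)).item()
    window = corpus_ids[start : start + seq_len + 1]
    inputs, targets = window[:-1].unsqueeze(0), window[1:].unsqueeze(0)

    logits = model(inputs)                      # (1, seq_len, vocab_size)
    loss = F.cross_entropy(                     # standard next-token LM loss
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Learning rate, stopping point, and regularization all need tuning, and
# overfitting or increased hallucination are the failure modes mentioned above.
```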
Yeah, I think it's such an interesting trade-off problem from a developer perspective: how rapidly do you want to be able to update the information? And the cost piece of it: RAG is actually pretty reasonable, you're paying for a vector database, there are a lot of offerings, and it's reasonably efficient at scale, but continuously fine-tuning new models is often not cheap. A lot of interesting dimensions to take into account. I'm curious about the long-term direction, maybe not from a fine-tuning perspective but from a long context perspective. What can folks look forward to in the next three years for long context, maybe from an experience perspective? Will we even talk about long context in three years, or will it just be that the model does this thing and I don't need to care about it, and it just works? How are you thinking about this?

So I'll make a few predictions.
What I think is going to happen first is that the quality of the current 1 or 2 million context is going to increase dramatically, and we're going to max out pretty much all of the retrieval-like tasks quite soon. The reason I think that's going to be the first step is, well, you could say, hey, why don't we extend the context? Why stop at 1 million or 2 million? But the point is that the current million context is not close to perfect yet, and while it's not close to perfect, there's a question: why do we want to extend it? What I think is going to happen is that when we achieve close-to-perfect million context, it's going to unlock totally incredible applications, things we could never imagine would happen; the ability to process information and connect the dots will increase dramatically. This thing can already simultaneously take in more information than a human can. Go watch a one-hour video and then immediately after that answer some particular question about that video, like at what second someone drops a piece of paper: you can't really do that very precisely as a human. So what I think is going to happen is that these superhuman abilities are going to become more pervasive; the better the long context we have, the more capabilities we could never imagine are going to be unlocked. So that's going to be step number one: the quality is going to increase, and we're going to get nearly perfect retrieval.

After that, what's going to happen is the cost of long context is going to decrease. I think it will take maybe a little bit more time, but it's going to happen, and as the cost decreases, longer context also gets unlocked. So I think reasonably soon we will see that 10 million context window become a commodity; it will basically be normal for providers to give a 10 million context window, which is currently not the case. When this happens, that's going to be a game changer for some applications like coding, because I think with 1 or 2 million you can only fit somewhere between a small and medium-sized codebase in the context, but 10 million actually unlocks large coding projects being included in the context completely. And by that point we'll have the innovations which enable near-perfect recall over the entire context. This is going to be incredible for coding applications, because of the way humans code: you need to hold as much as possible in memory to be effective as a coder, you need to jump between the files all the time, and you always have this narrow attention span. But LLMs are going to circumvent this problem completely. They're going to hold all this information in their memory at once, and they're going to reproduce any part of this information precisely. Not only that, they will also be able to really connect the dots; they will find the connections between the files, and so they will be very effective coders. I imagine we will very soon get superhuman coding AI assistants. They will be totally unrivaled, and they will basically become the new tool for every coder in the world. So when this 10 million happens, that's the second step. And going to, say, 100 million, well, that's more debatable. I think it's going to happen, I don't know how soon it's going to come, and I also think we will probably need more deep learning innovations to achieve it.
Yeah, I love that. One quick follow-up across all three of those dimensions: how much, in your mind, is this a hardware or infrastructure story relative to a model story? There's obviously a lot of work that has to happen to actually serve long context at scale, which is why it costs more money to do long context, etc. Do you think about this from a research perspective, or is it, hey, the hardware is going to take care of itself, the TPUs will do their job, and I can just focus on the research side of things?

Oh, well, yeah. Just having the chips is not enough. You also need very talented inference engineers, and I'm actually really impressed by the work of our inference team. What they pulled off with the million context, that was incredible, and without such strong inference engineers I don't think we would have delivered 1 or 2 million context to customers. So this is a pretty big inference engineering investment as well, and no, I don't think it's going to resolve itself.

Yeah, our inference engineers are always working hard, because we always want long context on these models, and it's not easy to make it happen.
How do you think about the interplay of a bunch of these agentic use cases with long context? Is it a fundamental enabler of different agent experiences than you could have before, or what's the interplay between those two dynamics?

Well, this is an interesting question. I think agents can be considered both consumers and suppliers of long context. Let me explain. For agents to operate effectively, they need to keep track of the past state, like the previous actions that they took, the observations that they made, etc., and of course the current state as well. To keep all these previous interactions in memory, you need longer context. That's where longer context is helping agents; that's where agents are the consumers of long context. But there's also another, orthogonal perspective: agents are actually suppliers of long context as well. And this is because packing long context by hand is incredibly tedious. If you have to upload all the documents that you want by hand every time, or upload a video, or, I don't know, copy-paste some content from somewhere on the web, that's really tedious. You don't want to do that; you want the model to do it automatically. And one way to achieve this is through agentic tool calls. The model can decide on its own, hey, at this point I'm going to fetch some more information, and then it's going to pack the context on its own. So yeah, in that sense, agents are the suppliers of long context.
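A minimal sketch of the "agent as supplier of long context" loop. Everything here is hypothetical (the `fetch_url` tool, the `model_decide` stand-in for an LLM call, the tool-call budget); it only illustrates the shape of a loop where the model's own tool calls keep appending to its context:

```python
import urllib.request

def fetch_url(url: str) -> str:
    """Hypothetical tool: pull raw text from the web into the context."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def agent_answer(question: str, model_decide) -> str:
    """model_decide is a stand-in for an LLM call that returns either
    ("fetch", url) when it wants more context or ("answer", text) when done."""
    context = ""
    for _ in range(10):                              # cap the number of tool calls
        action, payload = model_decide(context, question)
        if action == "fetch":
            context += "\n\n" + fetch_url(payload)   # the agent packs its own context
        else:
            return payload
    return "No answer within the tool-call budget."
```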
Yeah, that's such a great example. My two cents, and I've had many conversations with folks about this: I think this is actually one of the main limitations of how people interact with AI systems. To your example, it's so tedious. The worst part about doing anything with AI is that I have to go and find all the context that might be relevant for the model and personally bring that context in. In many cases the context is already on my screen or on my computer, or I have the context somewhere, but I have to do all the heavy lifting. So I'm excited; we should build some long context agent system that just goes and gets your context from everywhere. I think that would be super interesting, and I feel like it solves a very fundamental problem, not only for developers but from the perspective of an end user of AI systems. I wish the models could just go and fetch my context and I didn't have to do it at all.

Yeah, MCP for the win.

I love that. Nikolay, this was an awesome conversation, and thank you for taking the time. I'm glad we got to do this in person. I appreciate all the hard work from you and the long context teams, and hopefully we'll have lots more exciting long context stuff to share with folks in the future.

Yeah, thanks for inviting me. It was fun to have this conversation.

Yeah, I love it.