Build Hour: Agent Memory Patterns
By OpenAI
Summary
## Key takeaways - **Context Engineering: Art and Science**: Context engineering is both an art because it involves judgment to decide what matters most at a given step of reasoning or action processes, and a science because there are concrete patterns, methods, and measurable impacts to make context management more systematic and repeatable. [01:55], [02:17] - **Four Context Failure Modes**: Context burst is a sudden token spike from tool outputs, context conflict from contradictory instructions like refund rules, context poisoning from incorrect information propagating via summaries, and context noise from redundant tool definitions. [07:42], [08:37] - **Context Trimming Drops Old Turns**: Context trimming drops older turns while keeping the last n turns to provide fresh context, better attention, and reduced latency when approaching context limits. [20:20], [21:08] - **Compaction Removes Tool Results**: Context compaction drops tool calls or results from older turns while keeping the rest of messages and tool placeholders intact, ideal for tool-heavy agents to reduce noise. [21:12], [21:57] - **Summarization Compresses to Golden Summary**: Context summarization compresses prior messages into structured summaries injected into history, creating a dense golden summary of valuable information like device details and tried steps. [23:29], [24:28] - **Cross-Session Memory Injection Personalizes**: Long-term memory from session summaries injected into new session system prompts enables personalized responses like recalling MacBook internet issues after OS update. [36:33], [37:30]
Topics Covered
- Context Engineering: Art and Science
- Finite Context Bottleneck Limits Agents
- Four Context Failure Modes
- Reshape and Fit Techniques
- Cross-Session Memory Continuity
Full Transcript
Hi everyone, welcome back to another build hour.
I'm Michaela on the startup marketing team and I'm here today with two members of our solution architecture team. Emry
live in the studio and Brian joining virtually to help address Q&A throughout the hour.
>> Hi, I'm Emry. I work as a solution architect at OpenAI supporting digital native customers on building various of AI use cases including longunning AI agents.
So today's topic is agent memory patterns which is a very exciting topic in Emry and I's first ever build hour.
So if you've been following along, we started with how to build agents from scratch using responses API, then moved into agent RFT and today exploring agent memory patterns. All of the sessions are
memory patterns. All of the sessions are up on our YouTube channel, so definitely check them out if you want to catch up or revisit earlier builds.
Though the focus on the build hour is to empower you with the best practices, tools and AI expertise to scale your
company using open AI APIs and models.
So for today's build hour, we'll start with an introduction to context engineering, the foundation for agent memory, and then Emry will walk through several live demos covering memory
patterns like reshape and fit, isolate and route, and extract and retrieve.
We'll end with best practices, resources, and of course, live Q&A. On
the right hand side of the screen, you can drop questions into the Q&A box any time during the session. Our team is monitoring both in the room and virtually to help answer throughout, and we'll save a few for the end to go
through live. All right, with that, I'll
through live. All right, with that, I'll hand it over to Emry to kick things off.
>> Thanks, Michaela. Hi, everyone. Um so
I'll start the first part of this um the session with context engineering definition.
So this nice definitions from Andre Karpati. I'll start by emphasizing that
Karpati. I'll start by emphasizing that the context engineering is both an art and a science. So it's art because it involves judgment. So you have to decide
involves judgment. So you have to decide what matters most at a given step of uh a reasoning or action processes. It's
science because there are concrete patterns, methods, and measurable impacts to make context management more systematic and and repeatable. So, I'll
highlight that modern LLMs don't just perform based on the model quality, but they perform based on the context you you give them.
In this slide, I want to talk about different disciplines that comes together to basically present the context engineering. Uh so it is a
context engineering. Uh so it is a broader discipline than any single technique like prompt engineering or or retrieval. So the diagram visually
retrieval. So the diagram visually represents the ecosystem of context optimization layers that together shape uh what the model sees and and understands. So you see prompt
understands. So you see prompt engineering uh as as a core principle structured output rack state and history management memory is also a crucial part. So using persistent or
part. So using persistent or semi-persistent storage like files, databases or memory tools to upload and retrieve key information and all of this is contained inside the
larger sphere of context engineering. Um
so we can also connect these capabilities into different uh product capabilities.
Here is a nice summarization slide um that talks about the the core principles um like why it matters because longunning
tools longunning and tool heavy agents blood tokens and degrade quality uh via uh poisoning noise and and and confusion and and bursting. We have three core
strategies uh as we discussed in the beginning like reshape and fit uh to the context window, isolate and route the right amount of context to the right agent and extract high quality memories
to retrieve uh in the right time. We
also have prompt and tool hygiene as a core principle. So keeping system
core principle. So keeping system prompts lean, clear and well structured.
uh use a small canonical set of uh fusion examples and minimize overlap in tools and get the tool selection. And
our goal in Northstar is basically aiming for the smallest high signal context that maximize the the likelihood of the desired uh outcome.
And then in this slide is where our transition from why context injuring matters to how to to actually do it in in practice. So I'll frame this as a
in practice. So I'll frame this as a toolkit of of techniques. So these are not mutually exclusive. So most real world agent architectures combine
multiple strategies depending on the use case uh and the context budget. So the
first technique is reshape and fit. We
can apply context trimming, compaction and summarization. The second one is
and summarization. The second one is isolate and route. uh we can offload context uh and tools to specific sub
aents with a selective handoff and the last bucket is extract uh and retrieve um we can talk about memory extraction
state management and memory retrieval in in that last um bucket when we talk about context engineing it's essential to distinguish between
short-term and long-term memory because they solve very different problems uh I think we group uh the first two bucket as short-term memory which we also call
as insession techniques and then the last bucket is long-term memory uh we call it cross session. So that means you can um collect different information from multiple sessions uh and you can
retrieve back in in the next session or other sessions uh in in the future. So
short-term memory is all about like making the most of the context window uh during an active interaction and active conversation.
Uh and then in contrast long-term memory is is about building continuity uh across across the the sessions.
Cool. So we we often get excited about how powerful our agents are becoming how AI models are getting them better and better. They can handle complex tasks uh
better. They can handle complex tasks uh route between tools. They can plan multi-step workflows. But underneath all
multi-step workflows. But underneath all that there is a there is a core bottleneck because context is is finite.
So every piece of information we add to the prompt instructions um conversation history tool outputs competes for a space in in a fixed uh
token budget.
And this is why uh it matters slide. Uh
I want to make the problem concrete here. So I'll frame it around before and
here. So I'll frame it around before and after contrast. So you see two
after contrast. So you see two conversations. Um what happens without
conversations. Um what happens without memory on the left and what happens with memory on the on the right. So on the left hand side um the user started with
the issues like Wi-Fi, battery and over overheating in it troubleshooting agent.
After many turns the agent has forgotten the earlier context. it falls back to reasking for information that the user already gave. Right? But on the right
already gave. Right? But on the right hand side, the agent remembers the original issues even after many turns.
It can pick up the unresolved thread. Uh
it references previous actions like firmware update, background sync which makes it feel intelligent uh and reliable. So this is such a stateful
reliable. So this is such a stateful behavior which is the foundation of uh a longunning agent.
Now I'll switch gears to to failure mode. So we can group these failure
mode. So we can group these failure modes into four categories. The first
one is context burst. So you can imagine it as a sudden token spike in one of or multiple components um to do the limited external control or
increased calls. Context conflict if
increased calls. Context conflict if there's any contradictory instructions or information in your context. Context
poisoning uh if there's an incorrect information uh that enters the context and propagates over the turns. It can be via summaries or memory objects, state
objects you're injecting into the context. And then finally context noise.
context. And then finally context noise.
So you can imagine it as multiple um tool definitions or like way more many many tool definitions uh coming into your context at the same time. This can
be redundant or overly similar items. So that can make a noise uh in the context.
Uh here's a nice um visualization of context burst uh in tool heavy workflows. So you'll see that um there
workflows. So you'll see that um there will be like a specific increase uh in one specific turn and you'll be injecting uh like large amount of tool
tokens here.
And then the next one is is context conflict. So we can easily visualize it
conflict. So we can easily visualize it here. So you can imagine in one of the
here. So you can imagine in one of the turns uh there's specific tool call and here in this tool call you see that in the system instructions I never issue a
refund if warranty status is not active.
But in the middle of the turn you're also uh saying that it's eligible refund for VIP customers. But at the end of the turn and and your agent is responding, hey, you know, given your urgent travel,
I can issue a full refund. So, this is a nice vis visualization or example for the specific um context conflict that can be coming from uh one of the the
tool results.
And then last one is context poisoning.
uh so you can imagine it as a hallucination or something inaccurate mixed into the context in any step and propagates across different terms. So
here we have a couple possible pitfalls here. Uh losses summarization edits can
here. Uh losses summarization edits can can be causing this.
If you're using a free form uh note that accumulates over time that can be contradict and then finally uh older summaries override the newer ones and
you'll be basically uh causing any hallucinations uh because of the summary logic and you'll be injecting that hallucination into the into the context
and propagating uh over time.
Cool. Now I'll stop sharing my screen and switch to the demo that I prepared for you.
Uh and then I'll go over all of these some of these challenges um to show you like how it actually
works in a real world scenario. Okay,
let me share my screen.
Cool.
So here I prepared a demo app for for this build hour. It is an ID troubleshooting agent for software solving issues uh for for issues related
to both software and hardware. Uh and
this is a dual agent demo that lets you run two agents side by side. So backend
logic states inside the nextJS app and I'll be using OpenAI agents SDK.
So we have two tools connected to to that agent. One of them is get orders
that agent. One of them is get orders and other one is get policy. So here I can start sending a message uh and saying hi to to both of the agents,
right? Uh and then you'll see that both
right? Uh and then you'll see that both of the agents are are responding to my message and then here I can say hey my laptop fan is making weird noises while
I playing games. Is it normal? So here
you see that the the configuration for both of the agents are same. They're at
the same model, same reasoning level. um
but I'm sending the same message and there is no memory um basically there is no memory configuration yet. So here you also see the context usage bars at the
at the top. So you're you're seeing different type of components uh that is already in the context now. Um and I can say hey before I want to see my orders
my order number is 1 2 3 uh four five and so now I'm expecting the model to to make a tool call and show me uh the
orders I have. So here you see that the order status the items I have and it's powered by a specific tool called. So as
you see over time um it will accumulating different tokens and different type of tokens here and [clears throat] in the context life cycles I'm visualizing what is happening
under the hood across multiple turns. So
here you see that I have uh 84 tokens of system instructions. My user input is
system instructions. My user input is increasing uh slowly. Uh but the core component here uh is agent output that
will be generated by by a model.
Cool. Um so this is a typical real world scenario. So now I also want to showcase
scenario. So now I also want to showcase like how context burst is is happening.
So I can still uh start with high um and I can say hey this time I'm having an overheating issue on my laptop
uh and then the model is responding to to my to my message to my issue basically uh and then here it's telling me like specific um instructions and
it's asking me some questions to better understand uh what's happening right and then I can say hey thanks before that I want to see the refund policy of my
MacBook Pro 2014.
So while I'm sending this message, I also want to quickly show you the code and core concept of or how it's working.
So it's powered by open agent SDK and here you see um the agent definition. So
it's a customer support assistant. Uh
I'm having specific instructions here um that I'm adding. Uh I'm using like different models here. Um, and also I can show you the system prompt really
quickly instructions.
Um, so it's basically I'm I'm saying that hey you you're a customer support assistant for for devices. Uh, and then I'm using like very slight prompting and
instructions for for that specific uh agent here. So let's go back to um the
agent here. So let's go back to um the response. So, since I asked about the
response. So, since I asked about the the specific refund policy for from a MacBook Pro 2014, uh it's made a tool call get refund and
it's basically returning uh a specific um refund policy that I added before.
So, here you see that in between turn two and turn three there is a specific spike um in the in the context window.
So, in turn two, I I I had maybe around like 300 400 tokens, but now I have more than 3,000 tokens because I am just dumping lots of information into the
context. And then this is a nice example
context. And then this is a nice example for for context burst here. um
instead of maybe just dumping all of these information into the context uh as as a refund policy, I can be more careful about my tool definitions and
tool outputs um to basically make a decision about like what should I inject into the context. So maybe not all of these information are are valuable, but
as you see that I'll be injecting lots of information into the context that's visualized here uh in this context life cycle tab.
Cool. Now I'll stop sharing my screen and go back to to the deck. So I'll
continue uh with the next uh steps here.
Okay, nice. So, we talked about challenges and
nice. So, we talked about challenges and what's going on under the hood and a specific example about context. So, now
let's talk about the solution, right? Um
so the solution is basically uh managing context efficiently using different techniques such as um tming compaction
uh state management memories and make the natural step beyond prompt engineer.
Uh again this is also another visualization about different components in the context. Uh so you see that across the times it's increasing the token counts are increasing and these
these tokens can be coming from the system message user message maybe you might be injecting uh memories or maybe you might be injecting different uh type of specific tokens that can be added
into your context.
Uh here I want to group uh AI agents in terms of context profiles. So we can group them into three categories. So
first one is rack heavy assistance. So
you can imagine reports policy QA agents. Um in these type of agents
agents. Um in these type of agents context is most dominated by retrieved knowledge and citations. The second one is tool heavy workflows. Uh context is
mostly dominated by frequent tool calls and returned payloads. And last one is conversational concier. So you can think
conversational concier. So you can think about planning agents, coaching agents.
uh and in this case context is mostly dominated by uh growing dialogue history. There'll be lots of tokens in
history. There'll be lots of tokens in conversation history like assistant usage tokens that scales with uh with session lengths
and then to better understand the the solution and uh the techniques uh we can go over what is fixed in our context and what is dynamic and and a variable. So
here you see different type of components like usually system instructions tool definitions and examples unless you're doing a rack uh approach is is mostly static in the
context. What is dynamic is is tool
context. What is dynamic is is tool results retrieved knowledge memories and and conversation history. So these are
the nice examples for dynamic u and static context and tokens. uh so you have a control on dynamic tokens and you
can apply different techniques to to control it efficiently.
So I would like to start with prompting best practices to aid avoid context conflict here. Um you can also find it
conflict here. Um you can also find it from our our prompting guides and cookbooks. Uh the first rule is being
cookbooks. Uh the first rule is being explicit and structured. We suggest you to use clear direct language and specific enough to to guide action. uh
you should give room for planning and self-reflection. I think this is
self-reflection. I think this is becoming more and more important with uh reasoning models like GPT5. Uh and you should avoid conflicts. So keep the tool
set small and non-over overlapping.
Don't use uh ambiguous uh definitions.
Even if a human can pick a tool, the model won't either. So be careful with conflicting instructions and and tool definitions.
So for context noise, we talked about like many many tool definitions and many tools attached into the context as as an example situation.
Um again, you you should be explicit and structured in your prompts. Um more
tools isn't always equal to to better outcomes. Uh so favor targeted tools
outcomes. Uh so favor targeted tools with clear tool decision boundaries and then return meaningful context from your tools. So in the demo um you show that
tools. So in the demo um you show that specific uh example of context burst. So
we suggest you to control basically what is the tool output uh and then return high signal semantically useful fields um and prefer human readable uh
identifiers.
Nice. So now I'll switch gears to engineering techniques and I'll start with the first one which is reshape and fit. So the first technique here is
fit. So the first technique here is context trimming. So it is a pretty
context trimming. So it is a pretty basic technique. It basically means that
basic technique. It basically means that dropping older turns while keeping the last uh end turns. Uh here in the turn we have like limited context. Uh it's
getting getting noisy. There are lots of information in the context. It can be coming from a tool user message or different sources. And there's higher
different sources. And there's higher likelihood of losing track because uh we are getting close to to context limit.
But once we trim the older uh conversations, older messages uh now we have fresh context. It has better attention uh and you'll see that uh it will also increase the latency uh that
you're using.
So it basically keeps the last end messages and trips the the previous older messages. Uh and these are the
older messages. Uh and these are the some parameters and knob we have control over in in context trimming.
The second technique is context compaction. uh it basically means just
compaction. uh it basically means just dropping tool calls or tool call results from the older terms while keeping the rest of the messages. So if you have a
tool heavy agent, you can consider these techniques. Um you'll see that your
techniques. Um you'll see that your context will be most dominated by tool results. It will be noisy. There will be
results. It will be noisy. There will be maybe some context noise and lots of information coming from different tools.
And after compaction you'll see that there will be fresh context uh better attention and and faster uh processing uh and you will also be keeping the tool
placeholders uh intact after even after the context compaction here and a question you might have uh would be like okay how can I decide heristics
about trimming and compaction uh so here I can share a couple suggestions so first you can you can analyze your sessions uh you can collect
like context snapshots from from production or from your users. You can
collect times down and dislike context to see what's going wrong there. Think
about the average token size of a context. Uh what type of task do you
context. Uh what type of task do you have in one session? Secondly, um like do not trim mid turn and break turn blocks. So a turn basically means that a
blocks. So a turn basically means that a user message and all the other message until the next user message, right? Uh
so if you just break or or don't respect these turns uh there will be higher likelihood of losing track. And then
finally don't wait to hit context window limits. Uh so keep track of of context
limits. Uh so keep track of of context allocation. You can set thresholds like
allocation. You can set thresholds like 40 or 80%. So if you're getting closer to hitting the context window limits, uh these thresholds will help you to to
better understand basically uh when you should trigger some of these operations.
You can control tool outputs uh and you can also keep track of token saves. So
these techniques are also really nice for for cost um cost reducing cost purposes. Uh, and you can always keep
purposes. Uh, and you can always keep track of like how much token you're saving while you're increasing the the overall capability of the of your agent.
And then the next technique we have is context summarization. Um, so it
context summarization. Um, so it basically means compressing prior messages into structured summaries and you can inject into the context history.
So here uh in turn n you see lots of messages uh like noisy context again and you're basically keeping the last n messages and just summarizing or
compressing the previous ones um so that you'll have fresh context uh better attention faster processing and at the end of the day you'll have a golden summary. So this will be like a a
summary. So this will be like a a valuable information because you'll be basically compressing all of the valuable informations. Uh and at the end
valuable informations. Uh and at the end of today it will have like a very dense um object that you can keep track of and that will be also useful for you to
better understand what happened in the conversation.
And here's a nice visualization uh in the context life cycle uh about the summarization. So let's say you perform
summarization. So let's say you perform the summarization a specific turn. So
you'll see that you are compressing uh all the previous information and injecting it back uh into as into the context as as a memory object. So you
see there is a new component um called as memory after the summarization performed.
Nice. Um so here here is a comparison of summarization versus trimming. Uh so
there are different dimensions that you can consider while you're designing uh a memory pattern for your for your agent.
Um you can see that in trimming uh you'll just keep last entrance and you'll be dropping the oldest ones. So
basically it'll be pretty straightforward operation. So it'll be
straightforward operation. So it'll be very fast uh you know and there's no no latency. Uh but the trade-off is that
latency. Uh but the trade-off is that you you might be losing some information uh that is already there. So I think this is the main main trade-off that we
have. Uh it's it's really best for tool
have. Uh it's it's really best for tool heavy ops and and short uh workflows.
And then in in summarizing, you're basically just keeping track of all the information. So you you're not throwing
information. So you you're not throwing away any anything. Uh this can add this can add a little bit of um latency and cost because you'll be doing another
summarization call to a model. Uh but
you'll be just collecting all the information. So you can think about a uh
information. So you can think about a uh an agent or use case. If you have like multiple tasks in your longunning agent that are independent from each other,
you can definitely consider u trimming because probably the trimmed or throwed away information is not important for the agent for the next turns. But if
you're collecting useful information across multiple turns and tasks are dependent to each other, then you can definitely consider uh summarization here.
Nice. Uh so now I'll stop sharing my screen and go back to to the demo. I'll
show you couple examples of these techniques that we already covered here.
Let me quickly share my screen.
Nice. So here let's go to configurations page uh on the demo. So for agent P I want to uh enable trimming and I can
basically set max turns S3 to trigger uh trimming operation. I want to keep
trimming operation. I want to keep recent turns uh as S3. So here I can start again to test my my agent and
saying hi. So this time I want to
saying hi. So this time I want to understand the refund policy um that I want to to check. Maybe I want to refund the laptop I just bought. I can say,
"Hey, I want this refund policy for my MacBook about a month ago. Uh, and I want to understand like what's happening here um in in terms of refund. So if I'm
eligible or if I'm not eligible. So the
model is now making uh a tool call uh and calling the get refund tool. Here as
you see it's returning that specific information. I have 30 return window for
information. I have 30 return window for uh for returning that specific laptop.
And I also want to check my my order. So
I changed my mind and I'm saying, "Hey, can we also check my order? My order
number is 1 2 3 4 5." Uh and I want to see like if it's if it's on the way, if it's coming. So you see that the model
it's coming. So you see that the model is doing another tool call as as get order.
And now I want to switch gears. I can
say, "Hey, thanks. I'm having an issue with the internet connection. So until
this turn, you see that I have lots of tool tokens here in the context in context life cycle. So it's getting uh
accumulating over turns and in that specific turn now it's telling me that hey let's sort it out. It's asking me a
couple questions um about my my device and I can say hey I tried to to load an internet page and still see 40 404
error. Um I basically share a couple uh
error. Um I basically share a couple uh important information about the situation and you see still see that the across the turns it's accumulating lots
of lots of tokens and even agent output is is increasing.
Okay. So let's say it's still happening on on Safari. Um and then this is the last message probably uh that I that I want to share about the
current situation and I'm waiting to to have more guidance and instructions uh here. And here you see that at the end
here. And here you see that at the end of turn six the context is trimmed. Uh
if I go back to here to visualize what's happening. So when I hit the the turn
happening. So when I hit the the turn six, you see that it trimmed the context. So it basically removed all
context. So it basically removed all these tool tool outputs and tool tokens.
So now I have a fresh context here. Um
and then now I can continue talking about the same specific issue or I can continue to to talk about like different
different information.
Nice. So let's go back and then here I also want to show you how summarization works. So also the
summarization works. So also the compaction is also works in a similar way as a trimming. So here I can set uh like compaction trigger as for and keep
recent recent turns too. So you'll see that at the end of turn two, it'll be compacting and removing all these tool outputs and I'll have a a fresh context
similar to to the trimming approach.
But now I want to be a little bit more advanced here. So here I want to enable
advanced here. So here I want to enable summarization. Um
summarization. Um I want to set the the summarization trigger as five and I want to keep the the recent three turns. So here I I
clicked save and now I want to just see how it's summarizing all these information. So here I'm sending hey I
information. So here I'm sending hey I I'm having internet connection again. So
this time I decided to share more and more information about my my situation.
So I can say where I bought this computer um what is is the model. So I
can say, "Hey, I have a 2014 MacBook Pro 14inch and I live in the US, but I bought it from Amsterdam. I received it from from battery change service and
just updated the OS version last week and they asked me to update the OS version to Mac me OS Seoia.
So as you see, I'm sharing like many information uh and these informations are really available for an IT trouble troubleshooting agent, right? Uh and
here I can go back and say okay these are nice clarify the problem uh like what I need from you and I can say hey I already tried hard reset after
checking the FAQ docs but it didn't work. So this is still in a form of
work. So this is still in a form of memory because I'm sharing like which steps I tried which worked and and which
didn't work. I can go back um and
didn't work. I can go back um and continue talking with the agent. So now
it's reasoning itself and providing me like more detailed guidance uh and instructions for specific to to a MacBook. So I I go back I went back to
MacBook. So I I go back I went back to my computer and I saw that the Wi-Fi icon is is not active. Um and then I'm
thinking maybe it's related to to Wi-Fi or maybe it's related to a specific um software issue. So, as you see across
software issue. So, as you see across the turns, it's it's getting more complex. The the the agent needs to
complex. The the the agent needs to reason and the agent also needs to keep track of what's in in its context and make sure that it's not uh there's no
burst, there's no conflict. Um there's
no poisoning and other type of failure modes. It's telling me uh some specific
modes. It's telling me uh some specific steps and I can say, "Hey, I tried it already and I'm wondering if it's a specific software issue, right?" Um and
then here I'm waiting for the response from from the agent. So again, I shared lots of information here. Uh lots of available information. So now the agent
available information. So now the agent knows my device, where I bought this device, which steps I tried. Um and
basically what I what type of um steps I performed before uh doing that.
Cool. Uh so now you see that it's responded to me like a very well structured instructions given what you described. Wi-Fi icon is not active blah
described. Wi-Fi icon is not active blah blah. And then here I see that the
blah. And then here I see that the context is summarized and I notice that there is an orange uh component we count
as memory item and the memory item is basically um the summary uh that we had.
So here between turn four and turn five you see that I'm I'm condensing some part of the context and I'm injecting it
back as a user message uh as as a memory component. So again here memory is
component. So again here memory is basically um the the summarized context from the previous terms.
So now I want to go back to the code and show you the the summary prompt uh and go over like some important
topics about this specific prompt. So as
you see uh here's my summary prompt. I'm
saying hey you are a senior customer support assistant for tech devices setup and software issues. Um, and then before
you write, I'm saying, uh, hey, be careful with contradictions.
Uh, make sure you are having a temporal ordering and make sure you're having a hallucination control. So, I think these
hallucination control. So, I think these are very important things to consider when writing a well-crafted summarization prompt. And then I'm tying
summarization prompt. And then I'm tying this summary to my specific use case.
So, I'm saying, hey, in your summary, write a structured factual summary. And
then just think about product environment reported issues, what worked and what didn't work. Uh which steps you tried, include identifiers, which is
important, key timelines, uh timeline milestones, tool performance insight, current status, and next recommended steps.
So this is a really nice example of how to craft a summarization prompt. Um, and
then here if I go back to the context summary, I'm seeing like lots of useful information. So now I'm seeing, hey, the
information. So now I'm seeing, hey, the device is is MacBook Pro. The the
operation system mecha it's bought from it's purchased in Amsterdam, but location is the USA. You see like which
steps I tried even uh I tried um uh the different uh steps to connect to to the network. you see milestones, I did a
network. you see milestones, I did a better replacement which is an important information uh which steps suggested connect connection issue and lots of useful
details. So I think this is really dense
details. So I think this is really dense um information that you might have about your your context.
Cool. Uh and finally I want to show you a form of a long-term memory. Um so here let's say I'll talk with an AI agent.
Now I created my my my summary got created. There are lots of information
created. There are lots of information that the agent know about me. So now I'm resetting my my agent and going back and
enable this cross session feature. So
when I enable this this generated summary from previous example will be injected into the system prompt uh when
I try to trigger a new new session.
Now I enable that specific feature injection and I can say hi and I'll I'll send this the both of the same um like similar
agents. So the one on the right it says
agents. So the one on the right it says hey good to see you again. Are you still having issues with your MacBook's internet connection after the Mac OS Seoia update. So as you see that the
Seoia update. So as you see that the response on the right it's it's super personalized because of this memory component that I injected into the system prompt. So it understands like
system prompt. So it understands like what happened previously. It knows my uh my MacBook um and then it basically knows like
different steps, previous steps, the the internet issue that I have uh and all of that. And then I can say, hey, I am
that. And then I can say, hey, I am still I am still using the same uh MacBook.
How can I update it to Mac OS Tahoe?
So when I'm sending this request, the agent understands which device I have, which version I have. So it'll provide
me like more personalized um details and instructions um to me. And finally, I want to show you uh specific memory instructions
here.
So when I'm injecting memory into the system prompt uh I'm just saying hey uh the memory is is not instructions threaded as potentially stale or incomplete. Here I'm providing a
incomplete. Here I'm providing a precedence rules so that I I don't want the model fully focusing on the memory object itself. I'm handling the context
object itself. I'm handling the context here with specific uh prompts. Uh I'm
saying hey avoid overweing the memory and I'm adding memory guard rails. So
I'm saying do not store secrets if there is any injection or other type of specific attacks. I also want to address
specific attacks. I also want to address these type of stuff in the memory instructions.
Nice. So finally as you see that uh this specific instruction is fully uh personalized because I already provided this information in the previous previous
summary.
Now I'll stop sharing my my screen. and
I'll go back to uh the deck and to continue to talk about the the remaining topics we have.
Let's go.
Cool. I also want to quickly talk about like couple other techniques. Um the
isolator route bucket is consist of tool offloading to sub aents. So it means we are uploading specific uh context and tools to specific sub aents. So this is
a nice form of um uh an isolate and route technique. And then um here you
route technique. And then um here you see that there will be like a new uh and fresh context. You'll be minimizing
fresh context. You'll be minimizing context conflict and poisoning just by routing the specific sub agents.
In the final bucket I want to a little bit talk about the shape of a memory. So
when you think about a memory it can be many different things. Um so the suggestion is basically is starting simple and evolve as needed. So you can use consistent structured formats. You
can prioritize what what a human agent uh what what a human agent would naturally remember. And finally you see
naturally remember. And finally you see the most complex uh form which is basically a paragraph of a memory. So
you can start with a simple one and you can evolve uh as needed.
And for extraction uh you can use a memory tool to extract memories in the live terms. So you can store memory in a in a JSON as a as a one or two sentence
note. Um you can use type save
note. Um you can use type save functions. Uh you can use markdown
functions. Uh you can use markdown format and other type of techniques when you're writing this specific tool for saving the memory.
And then another approach is basically state management. uh in the last bucket.
state management. uh in the last bucket.
So it's basically defining a state object with goal uh and different information and you can even inject uh the state back into the system prompt uh
across multiple turns in a frequency or you can inject it back into into the new session.
And finally retrieval uh we can perform a memory retrieval with a tool. Uh so
it's similar to a rag approach. So you
can basically store these these memories into a long-term store and a vector DB and during the live turns you can um
basically like make a search filter rank and inject it back into into the agents.
Nice. So finally I want to wrap up. Um
so I want to reiterate best practices in in in agent memory design. So first one is basically understanding your typical context.
uh and you should define what is meaningful for you and for your agent.
The second point is deciding when and how to remember and forget. So you can promote stable reusable facts to memory and activity forget temporarily stale or
lo confidence information and you'll see that your memories will be evolving uh over time uh and you can continuously clean merge and consolidate memories uh
and you can optimize these steps uh in iterations and finally evals is is also super important. So you can run your own
super important. So you can run your own evals to see if there is any uh uh improvement with memory on and off. You
can even build your memory specific evals for long running task and and long context.
>> Awesome. With that, let's move on to some Q&A. We've had a ton of great
some Q&A. We've had a ton of great questions come in. So why don't we re refresh the presentation and we'll pull up the next slide and get into a few.
>> Nice. Okay. So let me go back to the Q&A session and jump into the questions we have. So let me quickly
share it again.
Nice.
Okay.
Yeah. Let's start with the the first questions.
Yeah. So there are li any libraries or packages uh to recommend for context engineering. So this demo is built by
engineering. So this demo is built by using uh open agents SDK package and library. It gives you a really good
library. It gives you a really good flexibility um to implement your own sessions. Uh and in these sessions you
sessions. Uh and in these sessions you can easily implement like trimming, compaction, summarization and all of that type of uh techniques easily here.
Um so I see many different libraries uh that are evolving really fast to basically um make your life easier for for context engineering. So as you see
we have too many um techniques and each technique has different parameters to tune. So I also see that uh there is an
tune. So I also see that uh there is an uh evolving uh part of the all of these um libraries but I can suggest open
agents SDK as a starting point uh to basically start implementing specific uh context engineering techniques uh and and go from there.
Nice.
Next one. So how do you evaluate or measure uh the memory feature is evolving uh the memory is improving performance. So this is a really nice
performance. So this is a really nice question. Um yeah after this the session
question. Um yeah after this the session you might think about hey you know I implemented a specific memory approach but I don't know if it's if it's good or
or not how it's performing well. So we
can maybe split this into couple um uh portions. So first one is basically just
portions. So first one is basically just running your your regular evals with memory and and without memory. So I
think this is a really nice way to to start thinking about if if memory feature works or not. So if you have some specific eval metrics like completeness
uh like upwards downloads or that type of uh numeric metrics you can see if there's an increase or decrease or if there's any statistically significant uh
uplift coming from the memory uh and then maybe your evals might not be capturing well that type of um memory
based boost or improvement then I I suggest you to think about more memory based eval. So what I mean by memory
based eval. So what I mean by memory based eval is basically uh evaluating the model on long running tasks and long context. So if you are not hitting any
context. So if you are not hitting any um context thresholds maybe uh your agent uh doesn't need any of these memory um improvements at all. So again
you can start with your core core evals if you have already and then secondly you can start creating your own memory based evals. So you can even evaluate
based evals. So you can even evaluate the the quality of the summary, you can evaluate the injection time, you can evaluate the injection prompt. So there
are different ways to to evaluate it. Uh
but of course in most EVAs you also might need to prepare a golden data set first and think about maybe like couple 50 examples like golden examples of a of
a good summary or you can try different horistics that I mentioned before to basically find the right balance of trimming uh and compacting. So I think
we can just group this into three different buckets. First one is running
different buckets. First one is running your own evals to see see if there's an uplift. Second one is building memory
uplift. Second one is building memory specific evals. And the third one is
specific evals. And the third one is basically finding the right heristics uh and parameters to apply in the in the context engineering techniques.
Next one. So should we use hierarchical context like entire project context for immediate task and context for immediate
file edit in questions? Um so yes the qu the answer is yes but it's mostly dependent on the use case. So we also have a a concept called memory scope. So
you can think about this memory scope as as a global scope that means if you have a customer or user of your agent probably there are some information that you should always remember about that
specific user. Maybe this user likes
specific user. Maybe this user likes more friendly tones. Maybe this this user lives in the US. So these are some examples for global memory. Um but you
can also have um uh a scope based on the specific session. So let's say I want to
specific session. So let's say I want to uh book a travel uh and then this time I prefer window seats because I want to sleep. Uh so this also a nice example
sleep. Uh so this also a nice example about the the session scope and session memories. So I think this is a good
memories. So I think this is a good practice to maybe separate these into two two buckets and you can keep track of uh session memories with session
scope and over time you can graduate session memories into global memories and you can keep track of like what is really important uh about uh the
specific user. So in travel concier
specific user. So in travel concier example if user is always saying hey this time I want window seat uh like maybe multiple times and you can finally
graduate that memory into global memories and keep track uh keep it in in agent's mind basically and remember that for the next uh next bookings.
Nice. Okay. So what strategies do you do you use to keep memory flash or prune so the agent doesn't become overloaded with stale or yeah this is a really um this
is another good question uh and in the real world you see that memories are are evolving really fast so after some time you'll see that there are some memories
that you need to prune and the agent needs to forget. So in that specific case there are a couple techniques to apply. So first of them first of it is
apply. So first of them first of it is basically keeping a temporal tax. Okay,
I I learned this memory uh from the user but I learned it maybe like two months ago. So if you can keep track of these
ago. So if you can keep track of these um timestamps uh or temporal tags, the model will understand what is old and
what is new. So if I say I like dogs, if I said I like dogs like two months ago and today I say I like cats. So you'll
see that the model is going to understand my favorite animal now is maybe cats and it will override the memory um with the right instructions.
So this also falls into a little bit to memor consolidation.
So how to prune uh stale memories, how to basically update uh and override the new ones into the existing ones. So
temporal text is one technique that you can apply. Uh the other one is you can
can apply. Uh the other one is you can use a way decay or um basically a window function uh and you can basically uh
focus more on the recent memories uh and you can basically downgrade the the oldest ones. So it really depends on the
oldest ones. So it really depends on the nature of your use case. So if you think that what I said a year ago is not important for your agent, uh you can
definitely prune these old ones and implement a weighted average probably for all your memories. But if you think that all of this memory is is equally
important for your agent, then you can consider maybe like memory consolidation and memory override with temporal text.
So we can talk about two different uh techniques um to manage the overloaded and stale uh memories.
Nice. Okay. So how do you manage scaling agent memory systems when you have many users with individual and shared memory pools?
Yeah, this also another good good example from real world. Um so once you see the memories are are uh evolving over time and you'll see that you're
collecting tons of memories from um from your users. Um so there are different
your users. Um so there are different ways to scale it. I think the first path or first first decision criteria starts with if you are basically performing uh
a retrieval or search base long long-term memory approach or you're just using um basically summarizing the the context. So if it's the the second one
context. So if it's the the second one that means you're just storing all of this information and persisted into into a disk. So you can think about some
a disk. So you can think about some scaling methods about like data management how to manage like a large amount of memory nodes uh as a text in a
text format or you can basically think about um scaling the first approach which is basically you have to think about like
how to scale a search uh and retrieval system. you might be storing all of this
system. you might be storing all of this information um into into a vector database and then in this vector database you can try to scaling uh the
storage you can scale all these all these vectors filtering ranking system and all of that so I think the first bucket is mostly about this long-term
memory so we talked about like memory as a tool so if you can think about extracting memories with a tool and retrieving back during the live turns probably this is the situation where
you're going to uh hit this question about scaling uh for many users. Uh in
this case you can think about u scaling techniques for vector databases. Uh you
can use shorting you can optimize your embeddings model probably if you're using like customized embedding model.
Um and you can basically optimize retrieval process similar to to a rag approach. Uh again the first one uh is
approach. Uh again the first one uh is mostly about scaling a retrieval system.
Uh the second one is mostly about basically like data storage, how to store specific data, how to manage uh like tons of information and sentences.
I think to wrap up uh we can basically put it into two buckets. One of them is uh scaling um and optimizing a retrieval system. Uh
the second one is is also u making more efficient for storing and persistent uh in in the disk.
So this is also a common question that I hear from from my customers. Um I think you can maybe follow like a like pilot
approach and you can turn on this new uh memory techniques for for for a subgroup of your users and you can think about okay how it's evolving over time. Maybe
you'll see that most of the memories that your users are saying are pretty limited. So think about this travel
limited. So think about this travel concier agent. So probably I'll just
concier agent. So probably I'll just sharing my my memories about my seat preference. Maybe if you want to book a
preference. Maybe if you want to book a hotel, I like maybe higher floors. Maybe
I like the specific menu or breakfast.
Uh so I think this is more limited type of groups uh type of memory memory possibilities I can say and that type of agent. But you are if you're building a
agent. But you are if you're building a life coach or life coach agent. So there
are tons of memories that you need to remember uh about me my life. uh and
you'll see that these type of memories and memory pools are evolving really fast. So yeah the third point is that
fast. So yeah the third point is that try to understand the the evolution of memory and possibilities of memory in your AI agent. So we have two examples
here travel concierge memories and then life coach memories. So yeah as you see in the second one you'll be collecting tons of information that is valuable for
uh yeah for my life. Um and then my dreams, my goals, uh what I was thinking a month ago or a year ago. So the second one is is mostly like super advanced and
complex and sophisticated memorable that requires uh lots of scaling uh for sure.
Okay. Um so yeah, that was the end of the the question the the Q&A session probably.
>> Yeah.
>> Okay. Um yeah, and then we can just switch to resources. So um all right, this has been awesome. To wrap things up, um we're we've linked a few great resources here, including the context
engineering cookbook, which was referenced, and the context summarization cookbook and our agents Python SDK. I know we've gotten a lot of
Python SDK. I know we've gotten a lot of questions on is this available in GitHub. So you can explore all of these
GitHub. So you can explore all of these links on the right and the full build hour repo is available on GitHub.
Um, so good news. We're likely going to squeeze one or two more of these in before the end of the year. So keep an eye out on our build hours page linked here.
And um, a big thank you all so much for tuning in and a big thanks to Emmery who's did an amazing job with this session.
>> Yeah, thanks everyone. Uh, we hope you enjoyed uh, this build hour on an agent member patterns. I know we covered like
member patterns. I know we covered like lots of different techniques uh lots of different information about memory, how to think about memory, how to design
memory. So, so overall as you see there
memory. So, so overall as you see there are too many options but the core uh idea is basically better understanding what your agent should remember and how
it should remember and how it should forget. So you can think about these
forget. So you can think about these three things when you're designing your own agent u memory. Uh and this is still an evolving field. So you you might see
like some um like new features coming uh about memory overall. Uh but yeah, I just wanted to show you like different design tradeoffs uh and and guide you
with the with the best option. So
finding the right balance between these techniques are uh usually like related to your specific uh use case. And then
you can keep track of all of all the news and cookbooks uh in the resources section. So, I'll be also upload
section. So, I'll be also upload uploading this um demo page uh so demo application to to our build hours GitHub. Uh and then yeah, thank you for
GitHub. Uh and then yeah, thank you for your time and thank you for for listening uh all of this.
>> Yeah, have a great rest of your day and we'll see you next time.
Loading video analysis...