
AWS re:Invent 2025 - Integrate any agent framework with Amazon Bedrock AgentCore (AIM396)

By AWS Events

Summary

## Key takeaways

- **AgentCore Runtime Deploys Any Framework**: With Runtime, you can deploy any open source agent as well as MCP servers and A2A servers, with no or minimal code changes. These runtime agents run in a session-isolated environment. [02:30], [07:54]
- **Minimal Code Changes for Deployment**: Make a few lines of code change: make the right imports from the Bedrock AgentCore SDK, initialize the app, and add a decorator at the app entry point pointing to your handler. What happens inside that handler function is completely up to you: any agent, any framework. [09:30], [09:51]
- **MCP and A2A Standardize Integrations**: MCP allows you to securely access external data and tools through a common interface; you can host your own MCP servers on Runtime with no code changes. A2A standardizes the interface when you want agents to talk to each other, even when they are built with different frameworks. [13:30], [14:16]
- **LangGraph Memory Checkpointing**: In LangGraph, use the AgentCore memory saver to save and load your checkpoints, storing information like your user-AI conversation or your graph execution state, and rehydrate the container to restart where you left off. [16:11], [16:29]
- **Cohere Health's 30-40% Faster Reviews**: Cohere Health uses AI agents to surface clinical information in prior authorization reviews, resulting in 30 to 40% faster reviews for clinicians, without ever using AI to deny claims. They integrated LangGraph with AgentCore for healthcare compliance. [41:38], [38:19]
- **OpenTelemetry Streams to Langfuse**: Open source frameworks are auto-instrumented to collect telemetry; add two lines of code to initialize an exporter, and traces flow into Langfuse for visual observability of agent executions. [24:12], [27:14]

Topics Covered

  • AgentCore Deploys Open Source Agents Seamlessly
  • MCP and A2A Eliminate Integration Code
  • Evolve DevOps to AgentOps with OpenTelemetry
  • Precision Beats Latency in Healthcare Agents

Full Transcript

Hey everyone. Welcome to AIM 396, where we're going to talk about how to integrate open source frameworks with Amazon Bedrock AgentCore.

My name is Shreyas Subramanian. I'm a principal data scientist at AWS.

I also have Aris, who's a senior GenAI specialist SA at AWS, and we're happy to also share the stage with Keith, who's a VP of Engineering at Cohere Health.

Before we get started, a quick show of hands: how many of you have built an agent with either a no-code tool, low code, or an open source framework?

Quite a lot of us, yeah. So you might recognize some of the frameworks up here, and also the protocols.

So building a basic agent requires piecing together all these different complex components, right? You're picking a framework, picking what tools to integrate with, and you also have state management. But fortunately, some of these open source frameworks came first, and they really simplify the way you can build these agents.

We also have open protocols like MCP and A2A that help with connecting agents to agents or agents to tools.

We have open protocols and standards like OAuth for authorization.

We have OTel for observability, and all of these help you start off really quickly and build a POC.

But going from POC to production is a totally different ball game, right?

At AWS, we've been contributing to open source since our inception, and we're proud to do that.

Uh, we think open source is good for everyone.

We've been contributing to the Linux Foundation and to open source projects on GitHub. Since we're in the agents-related talks, we also contribute to Strands, which is our own agentic framework, as well as to other open source frameworks. We believe open source is good for everyone, we're committed to bringing it to our customers via managed services, and you'll hear more about this from Keith at Cohere Health as well.

So today we'll be talking specifically about how AgentCore works with open source frameworks. If you haven't been in other AgentCore sessions, AgentCore is our end-to-end agentic platform that helps you build and operate agents at scale.

Zooming out to a 30,000-foot view of AgentCore, look at all of these components that you see on your screen.

AgentCore is composed of all of these individual services. You can pick and choose what you want.

In the center of your screen is AgentCore Runtime.

With Runtime, you can deploy any open source agent, as well as MCP servers and A2A servers, with no or minimal code changes. These runtime agents run in a session-isolated environment.

It can be a back-and-forth agent where you have a conversational interface, or a long-running agent that can run for hours and hours, like a deep research agent, right?

Runtime also integrates with all of these other AgentCore services.

For example, Runtime integrates with Identity, which is our agent identity and credential management service. It also integrates with Memory, one of the first enterprise-grade memory capabilities: with zero infrastructure setup, you get short-term and long-term memory that can attach to your agents. We also recently introduced Policy and Evals in AgentCore, which help you constrain and judge how well your agents are doing in production.

What connects all of these different services together is observability, or AgentCore Observability, which works on the OpenTelemetry (OTel) format.

The nice thing with AgentCore, as I said, is that you can pick and choose whatever services you want to build your agentic application.

Today we'll see different flavors of observability. Keith will talk about Arize and how they use it in production, and Aris will talk about how we can use Langfuse with AgentCore as well.

Before we get into the details, let's start with a really basic scenario, right?

Let's say you're a customer and you want to track an order. You usually contact customer service, and the customer service representative might have to look into an order management system, hit an external shipping provider API, or maybe even call them to get the order status, right?

They might also have to work with other technical support staff, so you have two teams of people working together to get your final answer, which a lot of times is "have you tried restarting it?". After all of that effort, you might get back an answer in a couple of days.

We're going to try and replace these two human teams with agents: a customer service agent and a technical support agent. Now, these agents may not replace the entire functionality of the jobs these people do; they might just do a partial job, or be a domain-specific copilot.

So how do you get started? The first gut reaction is to look at open source frameworks, because they have deep and wide integrations, community support, and great examples you can start from. But how do you pick which framework to start with? There are a lot of different factors, including which frameworks your team is already working with, but we're listing three top frameworks in our short list, LangGraph, CrewAI, and Strands Agents, along with a few factors you can use to choose your agentic framework. LangGraph offers graph-based execution with extensive open source integrations.

So it's powerful, but then it has a really steep, uh, learning curve compared to some of the other frameworks.

It's still good for complex agent orchestration, with strong community support.

CrewAI, on the other hand, is really easy to get started with, especially for role-based and team-based agents where you're running a multi-agent crew; that's the best place to start with it. Strands is our own open source agentic framework. We provide a model-driven approach, and it has access to local as well as AgentCore memory, for example. It's an ideal starting point for enterprise-grade agentic applications running on AWS.

So you have all of these choices, and you might have many more. In fact, we have many applications and use cases in our GitHub repository that you can check out.

The good thing is you can pick and choose the right framework for the job. In this case we're picking Strands for the customer service agent and LangGraph for the technical support agent.

You don't have to choose two different frameworks; you can use the same framework for both. But in this example we're choosing Strands for customer support and LangGraph for technical support.

So what's the best way to get started with that? Let's say you have a Strands customer support agent running locally on your laptop, with local credentials. Those credentials let you access APIs that are hosted in AWS.

For the technical support agent, you're doing some hard-coded integration between the customer support agent and the technical support agent.

What you also see is that the customer support agent has an API key that may integrate with an external service provider.

But you can quickly see that this is not going to scale, right? This is running on your laptop.

It's not something you can host and have multiple users interact with, but your goal is to put this out for tens of thousands of users. So how do you do that?

That's where AgentCore Runtime comes in, and we spoke about Runtime a little bit already. We mentioned LangGraph, Strands, and other agentic frameworks: you can very easily host these agents, and you get a dedicated endpoint.

These endpoints scale automatically based on your requests, and you can go from zero to production really quickly. You can also connect these agents to any model provider: it doesn't have to be Bedrock, it can be OpenAI, pretty much any model provider. And you can work within really strict security guidelines, like a VPC-only mode, or allowing only OAuth.

Uh, we also introduced, uh, resource-based access policies recently.

Today we also launched full support for the MCP spec, so if you have an MCP server, you can host it without any change in code.

Bi-directional streaming was also launched recently, which is useful for something like our customer service agent. Right now imagine a text-based interface, but say you want to convert that to a voice-based interface. You can very easily do that, because Runtime now supports bi-directional streaming.

All right. So how do you actually host these two agents, the Strands customer support agent and the LangGraph agent?

How do you host that on runtime?

There are two main methods. You can package up your code in a Docker container.

That Docker container gets pushed to an ECR repository, and that ECR repository can be used as, uh, the starting point for deploying to runtime.

The other, even easier way is to zip up your deployment into a zip file and push that to an S3 bucket, and that S3 bucket becomes the starting point for your deployment to Runtime. So there are two different methods. To do this, you have to make a few lines of code change. This basically says: make the right imports from the Bedrock AgentCore SDK and initialize the app, which is actually a Starlette app under the hood.

Then you have a decorator. The @app.entrypoint decorator just points to this handler as the starting point for your agent execution.

What happens inside that my_agent function is completely up to you: any agent, any framework you're integrating with, any business logic. Then you finally return a response.

Whatever you expect to work, whether it's sync or async responses, or streaming responses, all of those should work as though the agent was running locally. And they will with Runtime.
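A minimal sketch of that pattern might look like the following (the Strands agent inside is just an illustration; any framework or business logic could sit inside the handler):

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent()  # could just as well be a LangGraph graph, a CrewAI crew, etc.

@app.entrypoint
def my_agent(payload):
    # The payload shape is whatever your caller sends; "prompt" is just an example key.
    user_message = payload.get("prompt", "")
    result = agent(user_message)
    return {"result": str(result)}

if __name__ == "__main__":
    app.run()  # serves the handler, locally or inside the Runtime container
```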

Now that you've made those little changes in your code, the next step is to use our Bedrock AgentCore starter toolkit.

This is a really simple way to go from code that exists locally to code that can be deployed. You have three steps.

The first step is to run agentcore configure.

agentcore configure sets up things like the ECR repository you want to land in, your role, and whether you want IAM or OAuth; all of that sits in a local config file. You can go back in and edit it on your own as well, but it's easy to manage with the agentcore CLI.

The agentcore CLI also has the next step, which is deployment: agentcore launch will deploy to Runtime using one of those two choices from the previous slide, either the container version or the zip deployment version.

And finally, you know, within a minute it'll be deployed and you can start invoking your agent.

All of the scaling and infrastructure management, all of that is, you know, happily managed behind the scenes, and you don't have to worry about it.
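As a hedged sketch of what invoking that deployed agent can look like from code (the ARN and payload shape below are placeholders for whatever your entrypoint expects):

```python
import json
import uuid
import boto3

# Invoke a deployed AgentCore Runtime endpoint with the boto3 data-plane client.
agentcore = boto3.client("bedrock-agentcore", region_name="us-west-2")

response = agentcore.invoke_agent_runtime(
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/customer_service-EXAMPLE",
    runtimeSessionId=str(uuid.uuid4()) + "-customer-42",  # each session id maps to an isolated session
    payload=json.dumps({"prompt": "Where is order 12345?"}),
)
print(response["response"].read())  # simple non-streaming JSON body in this sketch
```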

Once you've launched the agents, you can start invoking them. In this case we're using two different runtimes, one for the Strands agent and one for the LangGraph agent. There are use cases where you want both to coexist, but here you have two different containers running for you.

Everything else remains the same, meaning the AgentCore Runtime has an execution role that lets you access AWS APIs.

You can also use Identity to store secret credentials, for example an API key for your external shipping provider if you want to track a shipment through that API. But notice that the integration between these containers and the API, as well as from one container to another, is still something that you manage.

Um, before we dive in even deeper there, let's talk about context engineering for agents.

You might have heard about context engineering and prompt engineering before, but let's apply it to agents and zoom into one of those agents, either the Strands one or the LangGraph one.

Uh, we can see that the context that you provide to that agent is extremely important, right?

You can have user instructions, system instructions, retrieved information from RAG systems, short-term memory coming in with previous conversations, and external or internal tools giving you additional context.

Imagine if a tool gives you the wrong context or a wrong response, or worse, the LLM calls the wrong tool on your behalf.

That still remains in the agent's context, and your agent is going to respond based on it. So how can we make sure that all of this is handled, without having to handle the integration points ourselves?

You as a developer are basically responsible for connecting those different nodes. You have the customer service agent in Strands, you're writing integration code to connect to your order management system, and separate integration code to do an API call. And when you're connecting from one agent to another agent, you're writing yet another piece of integration code. Quickly, integration code becomes the bulk of your entire code base; you're not focusing on your business logic, just on all of this integration code. But what if you could use some of these open protocols to help you out there?

One of the things you've probably already heard about is MCP. MCP allows you to securely access external data and tools through a common interface.

There are two connection points to MCP within AgentCore. One: if you host your own MCP servers, or you have the code for your MCP servers, you can host them on Runtime with literally no changes in code.
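For a sense of what such a server looks like, here is a hedged sketch using the MCP Python SDK; the host/stateless settings are assumptions about what a streamable-HTTP deployment typically needs, and the tool itself is hypothetical:

```python
from mcp.server.fastmcp import FastMCP

# A small MCP server that could be containerized and hosted as-is.
mcp = FastMCP("order-tools", host="0.0.0.0", stateless_http=True)

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Hypothetical tool: look up an order in the order management system."""
    return f"Order {order_id} is in transit."

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
```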

Two: you might have external APIs, Smithy models that connect to internal AWS tools, or even an API Gateway setup that you want to expose over MCP.

You can use Gateway for that. AgentCore Gateway helps you create an MCP interface to all of your existing APIs. So those are the two ways you can take AgentCore components and connect them back to MCP.
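On the consuming side, an agent can treat a Gateway like any other MCP server. A hedged sketch with Strands (the gateway URL and bearer token are placeholders; the MCPClient/list_tools_sync names follow the Strands MCP integration docs):

```python
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

GATEWAY_URL = "https://example-gateway.gateway.bedrock-agentcore.us-west-2.amazonaws.com/mcp"

gateway = MCPClient(lambda: streamablehttp_client(
    GATEWAY_URL,
    headers={"Authorization": "Bearer <token-from-agentcore-identity>"},
))

with gateway:
    tools = gateway.list_tools_sync()   # tools the gateway exposes over MCP
    agent = Agent(tools=tools)
    agent("Track the shipment for order 12345")
```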

So that's from your agent to tools, right?

So what about agent to agent? As the name suggests, A2A standardizes the interface when you want agents to talk to each other. It was originally developed by Google, but was donated to the Linux Foundation soon after.

Uh, so agents may be built with different frameworks, different languages, different models, and might be coming in from different providers as well.

Uh, but you still want them to talk to each other, uh, you know, natively.

On the right we're showing an example of three different agents talking to each other with A2A. Since A2A, like MCP, is a protocol natively supported in Runtime, once again you don't have to make any changes in code, and you can directly host your A2A servers.

You can have an A2A client, similar to an MCP client, connect to these A2A servers to test them out as well.
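A rough sketch of exposing an agent over A2A, assuming the Strands multi-agent helpers (the A2AServer import path, constructor arguments, and serve() call are assumptions; check the Strands A2A docs for the current API):

```python
from strands import Agent
from strands.multiagent.a2a import A2AServer  # assumption: Strands A2A helper

# A technical-support agent published over A2A so other agents (any framework) can call it.
tech_support = Agent(
    name="technical-support",
    description="Diagnoses device and connectivity issues.",
)
server = A2AServer(agent=tech_support)  # publishes an agent card plus A2A endpoints

if __name__ == "__main__":
    server.serve()
```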

Great. Now we're going to change that little architecture diagram a bit.

You have a customer service agent that's connected through the blue line, which is A2A, to the LangGraph technical support agent. The red lines are now MCP, a standard interface to go from the Strands agent to an order management API, which, by the way, is an internal AWS setup that you have. There's also another connection to an external shipping provider API.

With Gateway, for example, if you have an OpenAPI spec for your external shipping provider API, you can import that into an MCP gateway.

What we also have is deeper open source integrations into some of these AgentCore components from the frameworks themselves.

We're constantly contributing this development back into the open source frameworks.

One of the things you can do is have the Strands customer service agent use the built-in session manager and add AgentCore Memory to it, both short-term memory (STM on the slide) and long-term memory as well.

Similarly, in LangGraph we have some deep integrations as well. You can use what's called the AgentCore memory saver, which lets you save and load your checkpoints. Checkpoints store information like your user-AI conversation or your graph execution state; that can be loaded into memory, and once you restart your application, your agentic container, you can rehydrate the container and pick up where you left off.

We're excited to continue our partnership with LangGraph and LangChain; many of our examples are based on these open source frameworks.

For LangGraph specifically, we do this via the langchain-aws library, so you can pip install langchain-aws.

This gives you all of the AWS LangGraph superpowers, like the memory checkpointing we saw earlier.
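A hedged sketch of what that checkpointing wiring can look like, based on what's described in the talk; the exact package, class name, and constructor arguments are assumptions, so check the langchain-aws / langgraph-checkpoint-aws docs for the current API:

```python
from langchain_aws import ChatBedrockConverse
from langgraph.prebuilt import create_react_agent
from langgraph_checkpoint_aws import AgentCoreMemorySaver  # assumption: package/class name

# Checkpoints (conversation + graph execution state) are persisted in AgentCore Memory.
checkpointer = AgentCoreMemorySaver(memory_id="<agentcore-memory-id>")
model = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")
agent = create_react_agent(model, tools=[], checkpointer=checkpointer)

config = {"configurable": {"thread_id": "session-123", "actor_id": "customer-42"}}
agent.invoke({"messages": [("user", "What did we decide about my refund?")]}, config)
# Restarting the container with the same config rehydrates the saved state.
```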

Strands, of course, is our own homegrown agent SDK, but it also has 4,000 stars and a million downloads and is growing; definitely check it out if you haven't already. It's easy to get started with, and we have a bunch of examples.

Uh, we are the primary maintainers of strands, but then it's a completely open source project.

Lastly, the way we collaborate with some other frameworks like CrewAI is by making direct contributions to their repositories themselves. Aris actually has a blog up there that you can scan and take a look at, on building agentic systems with CrewAI and Amazon Bedrock. Separately, I helped LlamaIndex with their RAG implementations and also with integrations into AgentCore.

We're happy to keep doing this over and over again. So let's get back to our application.

Now you have AgentCore Runtime being used to run your customer service agent and your LangGraph technical support agent.

You have A2A as the interface between those two agents, and then you have memory resources attached to each agent.

There's also an alternative: the two agents could share a common memory resource if that's the way your application is set up. But in this case, imagine two separate memory resources, one for the customer support agent and one for the technical support agent, and finally a gateway that connects to the external tools as well.

With that, we'll have Aris come up and talk about how engineering teams are transitioning from traditional DevOps to what we're calling agent ops. Aris?

Thanks, Shreyas. So, now that we have learned how we can build powerful agents with AWS services like AgentCore and the open source frameworks, and about the powerful integrations between the two, we want to take it one step further and think about how we can put that into production with operational excellence, obviously, right?

In the next 20 minutes, I want to talk about a couple of things we have worked on over the last months, together with some major players in the open source space.

I want to start with operational excellence. If you think about operational excellence in traditional software engineering, probably everyone would say DevOps is really the thing here, right? And we would probably also agree that DevOps is not one-dimensional.

DevOps has a lot of different dimensions that we really wanna get right if we wanna be good in, in, in that field.

Starting from infrastructure: we need to manage our infrastructure from top to bottom. We need to pick the right tools, with a lot of automation.

And then we are all people, we are an engineering community, and we work in processes embracing a specific culture.

Now, in the early days of DevOps, all of these things were really not easy. We had a cloud, but we didn't have a lot of managed services, right?

We didn't have the set of comprehensive tooling that we have today, and we hadn't yet agreed on a new common culture. A lot of things were still very waterfall-like, so engineers really had a hard time.

But we at AWS have pioneered this over the last decade, and a lot of organizations out there, and also the open source space, have been inventing a lot of cool things over the past 10-15 years, right?

At AWS we launched powerful services like Lambda, EKS, S3, CloudWatch, and many more services that make all of our lives easier.

Entire companies have risen out of the open source space; think about GitHub or Docker.

And then we all, as a community, have embraced a new way of doing software engineering that is super agile, with a lot of automation and CI/CD, which is benefiting us a lot. So what I want to talk about today is how we can take what's working well in traditional software engineering

and evolve it into the agentic era that is starting now, with all the new functional and non-functional requirements that are arising.

And I wanna do that sequentially, one pillar after the other before bringing it together at the end.

Starting with infrastructure, as I said, we have been pioneering that field, um, for traditional software engineering with services like Lambda, S3, and many more.

And we haven't stopped reinventing on your behalf.

Shreyas has mentioned a service that we launched in July at the New York Summit:

Agent Core really specialized, uh, for agentic workloads.

I'm not going to spend a lot of time here; you have heard a lot from Shreyas, and probably a lot about this service over the past days. But it is really useful in this specific scenario because it is a serverless service: it spins up agents quickly.

There are a lot of things the service is managing for you, so we strongly recommend using a service like AgentCore, and you will see how you benefit from it down the line as we go through the other pillars sequentially.

Moving on to tools: tools has been a pretty vibrant space over the last one to three years.

Organizations have built a lot of stuff; we have built a lot of tooling within Bedrock, but also in AgentCore, if you think about observability or the evaluations primitive we launched here at re:Invent.

But the open source space and the startup space have also been pretty vibrant, and I have been working alongside a lot of those startups over the last one to two years.

The Y Combinator cohorts were full of them, and a lot of great tooling has been popping up.

Since this is an ecosystem and open source session, we're going to focus on that space today, and as Shreyas has already said, I'm going to focus on Langfuse, which is one of the largest open source providers for observability and agent ops.

I've been collaborating closely with those folks over the last months to develop the approach we're going to present. But again, whether proprietary or open source, you can really pick and choose, because everything is built upon open standards.

And if you think about open standards for observability, operations, DevOps, and agent ops, really one of the most important here is OpenTelemetry.

What is OpenTelemetry? It's not a new thing; it's been around for years in traditional software engineering.

But what we have done as a community is further develop it to work with the newly arising functional and non-functional requirements.

It defines a set of semantic conventions, uh, which you can see here on the right side.

So we basically standardize how we capture the data points, the telemetry, as objects.

And we have also standardized the way we collect them and ship them into so-called OpenTelemetry backends.

So the way that works is we all use these open source frameworks. They are auto instrumented.

That means there is already code in there that is collecting those metrics automatically for you if you configure them right.

There are standardized exporters, built by organizations or by the open source community, which export that data and stream it to the backends. And these backends can be the platforms we have talked about: our observability platform that is tightly integrated with CloudWatch, or third-party open source ones like Langfuse.

Now, I want to spend a minute on Langfuse to make sure we are all on the same page. What is Langfuse? It's an open source LLM engineering platform that is one of those OpenTelemetry backends. This means you can stream your traces there.

Your telemetry is stored there, and on top of that there's a rich set of features you can leverage. You have core observability,

where you can look, in a visual way, into what's really happening in your agents.

And then there are a lot of other features as well, like prompt and dataset management, evals, experiments, human annotation queues, and many more things, and we're going to use a couple of those further down the line.

To come back to the solution architecture that Shreyas has been building up in this session, you might recognize that we have removed a couple of components that aren't crucial for agent ops. What we really want to do is use Langfuse, so we need to host it.

It's open source under an MIT license, so you can just host it in your AWS account if you want.

There's also a Marketplace offering where you can choose the fully managed SaaS version of Langfuse, called Langfuse Cloud.

And this is also the one that you will find being used in the code repository we're going to share at the end of this session.

Now you see two black lines here connecting AgentCore Runtime and Langfuse.

I want to focus on the top one, because this is really the last thing we need to do before we are up and running.

We need to configure how the streaming of the traces actually works. It's not difficult, but there are a couple of things we need to configure.

We need to configure a couple of settings, for example which endpoint we're going to stream our telemetry to and how we're going to authenticate against Langfuse.

A good way to do that is by setting environment variables when we deploy the agents in AgentCore; you can see that on the right side.

And then on the left side, this is basically a boilerplate agent implementation in Strands for AgentCore; you will recognize a couple of things that Shreyas shared before.

All we really need to do is add two lines of code. I unfortunately don't have a pointer here right now, but it's two lines of code where you initialize an exporter.

This will pull the environment variables in, the configuration gets set up, and you will see traces flowing into Langfuse as you use your agents.
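A hedged sketch of that setup: the OTLP endpoint and Basic-auth header scheme follow Langfuse's OpenTelemetry docs, and the StrandsTelemetry helper is an assumption about where the "two lines" live in the Strands SDK (in practice the environment variables would be set at deployment time rather than in code):

```python
import base64
import os

# Langfuse exposes an OTLP endpoint; credentials go in as a Basic auth header.
auth = base64.b64encode(
    f"{os.environ['LANGFUSE_PUBLIC_KEY']}:{os.environ['LANGFUSE_SECRET_KEY']}".encode()
).decode()
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "https://cloud.langfuse.com/api/public/otel"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"Authorization=Basic {auth}"

# The "two lines" from the slide: initialize an OTLP exporter for the agent's telemetry.
from strands.telemetry import StrandsTelemetry  # assumption: exporter helper location
StrandsTelemetry().setup_otlp_exporter()        # traces now stream to Langfuse
```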

And this is what it looks like: you have a rich set of information, like timestamps and the input and output of your agent at a high level. You can see how many observation levels the agent went through, so how many turns it took to solve the problem, token counts, latency, a lot of information, and then you can also zoom into the different traces.

With the observation levels on the top left, we also have a graph execution view, which can be useful for us humans as a visual way to see complex executions.

And on the right side you see, you know, the data points we have collected in the trace based on the semantic convention we have defined.

All right, so now we've connected infrastructure and tooling, uh, we know what's going on.

The next question is obviously how we can build processes around that, processes that leverage this and make our life easier on the way to production.

We were thinking a lot, together with the Langfuse folks, about what would be a good approach to start with.

We started working backwards from DevOps. We were thinking about what works well in DevOps and how we can evolve it, and the first thing we realized is that a crucial concept in DevOps is testing.

DevOps doesn't work without testing.

And the analogy to testing in GenAI and agents, even though it's a bit more complicated, is evaluation.

So we were thinking about how evaluation plays a crucial role across the entire life cycle of an agentic application, and we found a pretty cool illustration from Evidently

that showcases exactly that. We usually start in a mode where we are trying to figure out a good implementation.

So we are changing the code, maybe different frameworks, different models, prompts, sets of tools.

We want to figure out what works best, so we run experiments based on some datasets we have put together, and we run evals on top of that. It is an iterative process, and at some point we have a state that we think is good enough to deploy into production.

We deploy to production, we collect the traces, as we have seen earlier, and then we enter a phase where we are striving towards, first of all, operational excellence,

by moving to a more online way of running evals, building dashboards, and creating automated alerts, for example.

So this is operational excellence.

We want our stuff to be running well.

But then we also want to close the loop. We want to learn from what's happening in production, which means we can feed learnings from production back into our test datasets,

for example if we realize that the dataset distribution is not matching the real distribution in production. But we can also get inspired by what's happening in production

to implement new things, whether that's fixing bugs, fixing issues, or just the next best feature.

All right, so the next step: this is a great theoretical framework, but how can we make it a bit more tangible, a sequential process that we can actually implement later on?

What we came up with was this three-stage approach. We start with an experimentation / HPO phase where we want to try out what works well. Then we move further towards production, where we have guardrails on the way to production: we really want to make sure that everything is working, so we have a QA and testing phase, and we want to automate that with CI/CD, because that works really well in DevOps too. And then we go into production: operational excellence, learning from what's working in production, and closing the loop.

Now, that was a lot of theory; let's get a bit more practical and go one step deeper. We have talked about the process and the culture that we want to evolve. Now I want to come back to the infrastructure and tooling pillars to talk about how we can actually implement that. I'll start with infrastructure, then go over to tooling, but before I do that, we have two core assumptions that we are building all of this on.

The first assumption is that we strongly recommend using a multi-environment setup.

Different staging environments: dev, test, prod. It really makes sense. You don't want to experiment in your production environment and break the backend of your production application.

Probably a lot of you have already implemented that; if not, we strongly recommend it. The second thing is maybe a controversial one for developers, but we strongly recommend end-to-end remote, cloud-based development, especially when we have serverless services at hand.

I know a lot of people like to develop locally, to get started on a POC locally, in notebooks maybe, and this is great. But as you mature in your work, as you have a lot of teams doing that and a lot of use cases, you reach a point where it doesn't scale anymore. You have remote MCP tools that you don't own, which you might not be able to access locally.

And you really don't want your developers to spend weeks of development work uh just to realize that the stuff they have built is working on their local machine, but not in the target environment.

So with those assumptions, let's move on and get started with the implementation.

First phase: experimentation and HPO. We want to figure out what works well across a set of hyperparameters or implementations.

But we don't want to do that by rule of thumb, just trying a couple of things and seeing what works. We want solid data points. We want a proven methodology.

So what we can do, and this is where automation concepts like infrastructure as code, and the fact that you can spin up agents in a matter of seconds with AgentCore, come in really handy, is just go ahead and launch a fleet of agents.

For example, one agent per hyperparameter combination. This is then a grid search in hyperparameter optimization.

You spin them up, you run your evals, you tear them down again; a super frugal way to get your data points. You have tested all the combinations, and you have a solid data basis to ground your decision on. Then we move on: once we have found the ideal setup, we push it into the code versioning repository, GitHub, CodeCommit, whatever you want to use, and this triggers an automated CI/CD pipeline for QA and testing.

It's going to deploy another ephemeral agent.

Just one agent now. We run our tests, we get the results. If we are below the bar that we have set, we go back to the first phase.

But let's assume we are good enough, then we can deploy into production. Now it's a persistent agent that we want to use as back end for our production application.

And we close the loop by learning from what's happening in production.

So this is the infra side.

Now what does that look like on the tooling side?

I want to start by recognizing that all three of these phases are based on the assumption that we have traces available. Without the traces, we don't know what's going on in the agents. We basically need them; this is the core assumption we are building the rest upon.

Now, the first two phases are actually pretty similar in a way.

We are running experiments of different sizes.

They're based mainly on offline evals with datasets we have configured; the main difference is that we might want to use different evals because we want to test different things in those phases, and the size of the datasets can vary.

If you think about it, in the first step you have a fleet of agents; if you use a huge dataset, it gets expensive.

In the second step, you have only one agent to test, so with the same budget you can maybe do some more testing cycles. But this is really a trade-off we need to think through.

And then in the last phase, we transition a bit more into an online way of doing evals, even though offline evals are also possible, and we might want to use a cool feature in Langfuse, the annotation feature, where we can have humans annotate traces and feed them back into our datasets.

To bring it all together: we have talked about how to evolve DevOps into something we might want to call agent ops, satisfying the new requirements that have been popping up, with powerful services like AgentCore that integrate with third-party observability and agent tooling platforms out there, and we stay really flexible because they integrate well.

We've also spoken about a new culture, with new processes that we are proposing here and that we hope you will all embrace, in order to have a new era of successful software engineering in the agentic space. Now, before I close up and hand over to Keith, I want to point out that we have a couple of resources for you.

We have actually built all of this. There is a code repository that you can fork and use as a starting point within your org. There are going to be some things that are different in your org, but I think it's a pretty good starting point.

And if you want to dive deeper than what I could do in 20 minutes, you might want to check out the YouTube video here on the right side. It's a deep dive that I recorded with the CEO of Langfuse a couple of weeks ago.

And with that I want to hand over to Keith, who is VP of Engineering at Cohere Health and is leveraging a lot of the things we have spoken about today with his teams in their day-to-day work.

Thanks, Aris.

So, at Cohere Health we recently had the opportunity to partner with the AWS team on the beta and launch of AgentCore.

And we built on top of it what we call Review Resolve, a product we're using to help drive improved prior authorization processes.

For those of you who aren't familiar with prior authorization: in the US healthcare system, it's common practice that before a clinician performs a procedure, they need to make a request to a health insurance company to get that procedure approved.

And the determination about whether or not that procedure can be approved is dependent on a policy.

That policy can be defined by the insurance company; you also have state-level policies and federal policies.

In those policies, you've got medical necessity guidelines that have to be met before a procedure is approved.

For example, you might have a knee injury, you're talking with your doctor, and she recommends that you get surgery for your knee.

Prior to performing surgery, she would have to submit a prior authorization request to the healthcare company.

We would take a look at the policy. The policy might indicate, for example, as one part of the medical necessity criteria, that prior to knee surgery, more conservative therapy, like physical therapy over two months, has to have been attempted before you can get approval for knee surgery.

That example is a very straightforward one. The reality is that the prior authorization process is a very complex one in American healthcare.

Um, It has an estimated cost of about $35 billion annually.

It creates an enormous amount of administrative overhead for clinicians, which reduces their time with patients.

It also affects the timeliness with which patients get the appropriate treatment they need.

So at Cohere Health, we're really focused on helping clinicians and patients get to yes faster.

We do a lot of auto-decisioning: based on the information we receive in a prior authorization request, the clinical information we have about a patient, and the policy we're examining, in some cases we can get up to 85% auto-approval rates.

However, the remaining 15% needs to go to a clinician for review.

We never use machine learning or AI to deny a claim or deny a prior authorization request.

But what we do use AI and agents for is to help surface clinical information in that review process,

to help the clinician figure out whether or not the medical necessity guidelines are being met and the procedure can be approved.

This results in 30 to 40% faster reviews for the clinicians, doctors, and nurses who interact with our review tools to make a determination on prior authorization cases.

So I'm gonna talk a little bit just about like the problem space here, uh, and some of the issues that we're solving with agents.

Typically, if you're working with agents or language models, you're doing some sort of optimization between latency, cost, and some accuracy measure.

In this case, for us, latency and cost aren't the primary factors; precision and accuracy are paramount.

Our clinicians are making decisions that determine whether or not a patient gets the appropriate care.

We're looking through potentially huge amounts of data for very precise information.

The language here is complex and specialized. Frontier models, state-of-the-art models, do a reasonably good job of picking up on some of this.

But there are some complex problems here. Take lower back pain as an example.

That might be expressed as lower back pain, low back pain, pain in the lower back, or LBP. Everyone who's worked with language models or NLP is thinking, well, that's a problem that's fairly easily solved by semantically grouping those.

The problem, though, is that in order to do that effectively, it needs to use the same ontology and taxonomy that the policy is using, so that term normalization and concept resolution match up to the language in the policy.

Or you're taking the policy, your clinical information, and a third ontology and taxonomy, and mapping everything to that.

So this is a very complex space where out-of-the-box language models aren't able to handle a number of these problems. Agent management is also complex: we're dealing with both structured and unstructured data.

Structured data might be claims history, for example; unstructured data could be clinical records.

There can be an enormous amount, uh, of either of those two.

If you've done a lot of work with language models at this point, I think intuitively, based on the precision and accuracy requirements, you can see that something like traditional vector RAG similarity search isn't going to be particularly useful here.

There are times when it is useful, but you're not looking for something similar, you're looking for a very precise match.

So we need to do a lot of in-context learning with our models, passing the structured and unstructured data in.

We also need different representations of the data.

Combining the structured and unstructured data into, say, a graph representation, and giving the agent options for how it might want to fetch that data to reason over, is critical.

There's also an important temporal component here. As I mentioned previously for the knee surgery and conservative therapy, you need to look at clinical data or claims to see, hey, over the last two months, has the patient undergone physical therapy?

Uh, you might be looking at blood values over time, you might be looking at very specific, very specific dosing information over a time window.

Uh, and that kind of temporal data, uh, retrieval is something that language models typically don't handle very well.

There's also a context window management problem here. You can easily overflow a context window with claims and clinical data.

So how do you narrow down what you're looking for? What heuristics can you use to determine what you pass to the model in context?

And then the scope of your evaluations expands rapidly. You're dealing with different data representations, you're dealing with a lot of data, and you need a lot of ground truth data to run your evals against.

The complexity here grows rapidly.

So I did want to talk, since this is obviously an AgentCore session, about why we're using AgentCore. As Shreyas and Aris have talked about, bring-your-own-tooling is a key principle here.

We were already in the LangGraph and Langfuse, sorry, LangGraph and LangChain ecosystem.

So we were able to fairly quickly move out of ECS into AgentCore.

We're using Arize for telemetry and observability.

Aris was just talking about Langfuse, which is great. Arize has an open source product called Phoenix, which is also great. We use Arize's enterprise edition, Arize AX, but their open source Phoenix is a great point of comparison against Langfuse.

And we also use LiteLLM as an LLM gateway. I'm not sure how many people here are using LLM gateways; I'll talk about it a little on the next slide.

And then something that really appeals to us as a startup is the minimal, really no, infrastructure management here.

Anytime my team is not managing hardware or infrastructure, they're focused on product development. We're in a regulated industry, so the microVM architecture, fast spin-up and spin-down, no shared memory across sessions, is important for us.

Shreyas brought up the CLI; that fits really neatly into our existing CI/CD pipelines, and OTel works really well, like I said, to hydrate Arize.

All of this works pretty seamlessly out of the box with minimal code changes.

Two things that are really important for us are AgentCore Gateway, and governance and security, and I see them as very tied together. Let's say we have a specialized sub-agent that is a claims-related sub-agent, so it's able to pull claims for a patient and search through data and claim history.

And we might have an internal API with different endpoints for interacting with claim information. What's nice is we can put AgentCore Gateway in front of it. But what's critical, from not just a security-by-design perspective but a compliance-by-design one, is that we need to know who's interacting with the main agent in a review session.

If there's a call to a sub-agent to look at claims data, what's the identity of the user who made that call to the sub-agent?

What are they authorized to see, and can we ensure that they only retrieve information for a patient that they have access to?

This is critical for compliance and audit.

Shreyas also mentioned short- and long-term memory.

Uh, so short term memory supports, uh, text and binary data.

He mentioned LangGraph checkpointing, which needs to use binary data; it's a very straightforward plug-in from the AWS team that you drop into LangGraph to get short-term memory storage.

Long term memory, uh, is more complicated, I think, regardless of the framework you're using.

One thing we found useful is namespaces for managing a multi-tenant environment.

You need to spend some time thinking about your memory architecture here. This is, I'd say, significantly more complex than short-term memory, especially if you have multiple agents.

So I'm going to talk about our architecture a bit here; yeah, I wish I had a pointer.

It looks busy, but the complexity is actually reduced. I'm only showing one agent, which is in the review platform, and just in the middle it shows the chatbot supervisor.

That's running in LangGraph, and we're connecting to AgentCore short-term memory there.

Like I said, a very seamless integration. The chatbot itself, this particular agent, is running on AgentCore infrastructure, on the microVM.

All of our traffic is running through LiteLLM to Bedrock.

LiteLLM is an AI gateway.

If you haven't worked with one of these, it functions as sort of a reverse proxy for talking to different model providers.

It gives you really nice features like rate limiting, insight into token usage by user, and cost controls, and you can put common guardrails in place.

We're mostly working with Bedrock, but it's seamless for us to pull in Gemini or anything from OpenAI behind LiteLLM, and if a client in front of LiteLLM wants to use those, you're really just changing an API key and the name of a model. All the other work is handled by the gateway.
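To make that concrete, here is a hedged sketch of how a client typically talks to a LiteLLM proxy: it exposes an OpenAI-compatible endpoint, so switching models is mostly a model-name change. The base URL, key, and model alias are placeholders for your own gateway configuration.

```python
from openai import OpenAI

# The LiteLLM proxy speaks the OpenAI API, so the standard OpenAI client works against it.
client = OpenAI(
    base_url="https://litellm.internal.example.com/v1",
    api_key="sk-your-litellm-key",
)
resp = client.chat.completions.create(
    model="bedrock-claude-sonnet",  # an alias the gateway routes to a Bedrock model
    messages=[{"role": "user", "content": "Summarize the claim history for this case."}],
)
print(resp.choices[0].message.content)
```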

You can see on the left, we're using Arize prompt management.

I'll talk about evals in a minute.

For us, we tend to separate our prompts from our application code so that we have a life cycle and a deployment life cycle for prompts that is separate from code.

You can make updates to prompts without having to redeploy containers.

At the bottom you can see the indication-level LLM prompt, in the lower left there.

Thinking about that claims agent for a moment: you might have this supervisor agent, which is what the doctor or the nurse is interacting with, and you might make a call to a specialized sub-agent for claims. Something we talk about a lot when we're talking about MCP is tools.

I think how we use prompts in MCP is really interesting.

I don't see as much written about it; maybe I'm reading the wrong authors.

But in this case, you know, you could pass the uh medical necessity criteria uh to that claims agent.

It could decide what claim history it wants to grab, and also select the medical necessity criterion, or the indication level, which is the terminology we'd use:

what prompt do I want to grab in order to look for this information in the claim history?

There are multiple ways we handle clinical data extraction, including with fine-tuned models.

But this one is particularly interesting, and it also adds to the complexity of your evals.

It's a powerful methodology, but as your clinical areas grow and the number of prompts grows, being able to evaluate whether you retrieved the right prompt and applied it in the right way for that sub-agent, against that data, to get the information you're looking for becomes another set of evaluations you have to run.

The announcement of session-level evals is really exciting; being able to run evals across the whole set of agentic interactions is really critical.

And then I'm just going to touch on evals here briefly. My slide is much busier than Aris's.

Like I mentioned earlier, we're already in the Arize ecosystem.

We have online evals: from LangGraph and AgentCore, we have data going into Arize.

That can get piped into our online evaluation. We could have an LLM as a judge there, looking at things like hallucination rate, or at guideline and guardrail violations; there are a number of things you might want to look at for online evals,

especially if you want to determine whether performance is changing in any way.

We also hook into our offline evals here. We have clinicians on staff, doctors, nurses, subject matter experts, so we're able to have human annotators in the loop, and that can all be fed into our offline evals process, where we're determining the effectiveness of a particular prompt, of a fine-tuned small language model, or of the agentic system, the set of agentic calls, as a whole.

And just like with Langfuse, since we already had something like this, we were able to keep it mostly working as is, plug it into AgentCore, and get all this data hydrated the way we needed.

That's it for me. I'm going to hand it back to Shreyas.

Thank you, Keith. In summary, we saw that AgentCore can integrate with multiple open source frameworks as well as protocols.

Aris walked us through agent ops as well as a deeper integration with Langfuse, and we saw a really exciting real-world application with Cohere Health as well.

I'm going to leave this slide up for you to scan; I know folks try to hurry up and scan quickly before we change slides. Thank you so much for attending our session, and I hope you found it informative.

Uh, we're gonna hang out here, uh, for a little bit to take any questions later on.

Thank you so much.
