The Hidden Bottlenecks Slowing Down AI Agents
By MLOps.community
Summary
## Key takeaways
- **Eval Bottleneck Is Data, Not Tools**: The bottleneck in evaluation is not the tool itself but having the ability to generate curated eval sets and a product feedback loop to measure against new models and create a real flywheel. [00:00], [03:41]
- **Pizza-Fueled Labeling Parties**: We buy pizza and invite anybody in the company to labeling parties that create eval sets by scoring images of food, code snippets, or customer conversations in different languages. [02:18], [03:03]
- **Buy Tools, Live in the Future**: We actively push the team to buy off-the-shelf tools to focus on what makes the product better, test the latest solutions for problems others will have in 12-18 months, and potentially invest in promising vendors. [04:01], [06:28]
- **Coding Agents Excel on Existing Codebases**: Coding agents like Devin work great on existing, documented codebases with CI/CD, where you ask them to make incremental tweaks from Jira tickets, but they struggle to create maintainable architectures from scratch. [13:47], [14:55]
- **Vendor Tools Fail on PII Privacy**: Vendor tools like Composio and Mem0 expose full user conversations and PII in dashboards without proper safeguards, making legal approval impossible, compared to traditional tools like Datadog that alert on PII. [39:31], [41:12]
- **Build In-House for Go Reliability**: Using Go provides compiled reliability for agent orchestration and allows custom fixes, like chunk timeouts with Vertex AI, that Python-centric frameworks lack SDKs for, saving time on edge cases. [22:34], [24:41]
Topics Covered
- Eval Bottleneck is Data, Not Tools
- Buy Tools to Focus on Differentiation
- Coding Agents Excel on Incremental Tasks
- Build Core for Control and Compliance
- Vendor Tools Leak PII in Production
Full Transcript
The eval one: typically the bottleneck is actually not the tool itself. It's about having the ability to generate those eval sets and having a product feedback loop, which you can then take data from to measure against, you know, new models and improve, and sort of have a real flywheel. That's it.
[Music] We approach the build versus buy question when it comes to building agents and what kind of vendors you want to onboard in an enterprise setting. Bruce and Paul talk us through this. Paul is the VP of AI at Prosus and Bruce is an AI engineer there. I myself am Demetrios, the founder of the MLOps Community. Let's jump into this conversation.
>> You're saying, "Hey, we have this culture of we should buy it, we shouldn't build it, we don't want to think about building it." Despite that, as we'll hear from Bruce later, it's hard. It's so immature at this moment in time, the tooling that's out there. And I kind of went step by step over the main areas where you have these categories forming.
>> Well, maybe what I can emphasize is, you know, sometimes the tool is not the hard part.
>> Yeah.
>> So let's say you want to have an eval solution. You want to know whether your models are good. That's the problem you have: you want to know whether they get worse, how they perform, where they do worse and where they do better. The hard part is not the tool. The hard part is you need an eval set that you curate. You need to have real users that give you, you know, new conversations that you can measure against. You need to have that capability. If you don't have that, it doesn't matter whether you build the tool yourself or not.
>> Yeah.
>> And so in that sense the tool itself, at least in our setup, is not the thing that unlocks our ability to do evals better. And human time is expensive.
>> Yes. What I think about, though, is there's an argument for: if you have a tool that makes it much easier for humans to get that eval set, then you're saving time, right?
>> I think you're absolutely right. But, you know, what we do in the team, to give you an idea, is we have these labeling parties. We buy pizza, we invite folks, everybody's welcome, right? Not just folks from IT or engineers; anybody in the company is welcome. And we'll, for example, show them: here is a certain set of answers for a prompt set. So we'll do, I don't know, images of food as one example, or we do code. Obviously, if you do code evaluation you need the folks who can rank or score or evaluate code snippets. We'll do, let's say, customer conversations, and then you need people in different languages, whether it's Polish or Brazilian Portuguese or whatever. So we'll do these labeling parties to create data sets, and these are the eval sets that we can then score. You can see that on Prolm, which Zukof will talk about in another episode, I think.
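To make the labeling-party output concrete, here is a minimal Go sketch of what a labeled eval item and a simple aggregate score could look like; the field names and the 1-to-5 scale are assumptions for illustration, not Prosus's actual schema.

```go
package main

import "fmt"

// EvalItem is one labeled example from a labeling party: a prompt, the model's
// answer, and a human score (1-5 here, an assumed scale).
type EvalItem struct {
	Prompt   string
	Answer   string
	Language string // e.g. "pl" or "pt-BR" for multilingual conversations
	Score    int    // human label, 1 (bad) to 5 (good)
}

// meanScore aggregates human labels into a single number per eval set, so a new
// model run can be compared against the previous one.
func meanScore(items []EvalItem) float64 {
	if len(items) == 0 {
		return 0
	}
	total := 0
	for _, it := range items {
		total += it.Score
	}
	return float64(total) / float64(len(items))
}

func main() {
	set := []EvalItem{
		{Prompt: "Describe this dish", Answer: "A margherita pizza", Language: "en", Score: 5},
		{Prompt: "Opisz to danie", Answer: "Pizza z ananasem", Language: "pl", Score: 2},
	}
	fmt.Printf("eval set mean score: %.2f\n", meanScore(set))
}
```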
So the hard part is getting those eval sets. It's not that everything would change if we all of a sudden had a tool. And we've tried them all, right? Whether it's Spellbook or Humanloop or Orq.ai and so on. Some of them are really great, and by the way, they offer 15 things; one of them is evals. The eval one: typically the bottleneck is actually not the tool itself. It's about having the ability to generate those eval sets and having a product feedback loop, which you can then take data from to measure against new models and improve, and sort of have a real flywheel, let's say.
>> Well, one thing that I think is wild is how you're actively encouraging the team to go out there and buy tools. You would prefer that they do that, because as Bezos said back in the day, focus on what makes the beer taste better, and creating your own orchestration system is not what makes the beer taste better. It's not going to make Toqan a better tool. If there's something off the shelf, go and grab that. And also, you have the angle of: if you find out about a tool, then you might want to become an investor in it. Despite that, you have had a lot of trouble, per se. I guess that might not be the best word, but you've had a lot of difficulty finding tools that you can actually buy.
>> Yeah, less adoption of outside tools for our agents in production systems than I'd like, or than we'd like. But listen, time and people are scarce resources, right? So I always say to the team: if there's a tool that we built that we could have bought outside, there are no brownie points for that. There's just no value in that. We could have done other things, moved faster, whatever. So I very actively push the team: hey, check out this new open-source solution, check out this new tool out there for doing evals, for doing observability, for doing whatever. And I do think our team lives a little bit in the future in some ways, because our job is to test the latest and greatest and then figure out what makes sense. We have real problems that probably other folks will have, you know, 12 to 18 months down the road. I want my team to have all the time they can to spend on the things that make a difference, and so I encourage them to buy and use tools wherever that makes sense.
It's also that if we understand, hey, these guys built something for a problem we know we have and others will have down the road, we should partner with those guys. Let's make sure everybody in this group, and like I said, we've got thousands of engineers all over the world that have or will have those problems, some already have them, gets exposed to this tool: hey, did you know that such and such can now help you with authentication in the agentic flow, or with observability of such and such part of the stack. And that happens. It also means, because Prosus has a big investment arm, that we can partner with those founders if they're raising money, to say: "Hey, one, we're already using you, we want to use you. Two, you may want to have other users in the group, because we're a big global company. And third, if you need help, you know, investments and so on, we can provide that." Because we use tools in my team to try and solve real problems that we think others will have soon.
>> I tend to trust your opinions about what's real and what's not a lot. I heavily weigh them, because you do live in the future a little bit and you're able to tell me what is hype and what is real. And so when I see you saying, "Oh, this is great, but it's not for us," that is a signal for me to say, "Huh, there's something wrong with that piece. Why is that? What is it?" And the fact that you haven't onboarded a lot of these common tools that you hear about every day is a very big signal for me. Whether that is an orchestration thing like a LangChain or LangGraph, whatever, or an observability tool like a Langfuse, insert your favorite "Lang" in here, or an eval tool. That for me is a huge signal. It makes it all the more important, when you do onboard a tool, that I stop and I listen to why you chose that tool.
>> Yeah, I think sometimes, like I said, we do live in the future in some ways, because we're earlier than many in trying to build things at scale: production agents or other, let's say, AI-powered products. So one thing that we do understand well is what the problems are that others will have down the road. We're also a bit of a strange team, and I've come to realize that as I look back. Because we're much closer to the models, we train our own models, we work with the biggest labs out there, we have our own evals, we are not representative of many of the users that are coming down the line. Other teams may already have ML models or maybe they don't, they may be smaller teams at smaller companies, so they will have different needs, and even if you fast forward them 18 months, they will still not be the same as a full AI team. Some of the problems are common, but the ways to solve them differ: a team that is less AI-experienced will maybe favor a more drag-and-drop solution, for example.
I'll give you an example. There are a lot of observability tools out there, and we looked at all of them, and many of them had this sort of drag-and-drop solution. It turns out that for some people that's great, because they don't want to get into the intricacies of the coding and so on. But my team would say: well, listen, I need an SDK, or I need this to be headless, or basically to be able to interact directly with the underlying things; we need to debug it, I need to see the logs and so on. It's like if you look at the observability tools and the ones that create these chains of agents: you have Zapier, which anybody who has access to a keyboard and a mouse can use, and then you have the tools that are native in GCP or AWS that help you with observability and chaining these things together. Those are very different end users. The people that access the AWS environments and the GCP environments and the Azure environments are your SREs or engineers. The Zapier users are, you know, anybody. And so you need to solve for those different needs as well.
>> Are you using coding agents?
>> Yes, we are. We started using them early on. And by the way, by coding agents I assume you mean things like Devin and Manus and many others.
>> That's something that you haven't built yourself.
>> No, we have not. That's a clear area where we see amazing teams. Shout out to the Cursor team, for example, who have really focused on the persona being the software engineer and how to really change the user experience for them to be able to use these things, right? That's not us; we build products that are consumer-facing. We would never build a coding agent, but we use them all the time. We test them all the time. We're very excited by all the things coming out. We're investing in some of them as well.
But if you think about, let's say, the first generation of tools in this space, the first ones were GitHub Copilot; they were the first to come into the space. They of course partnered with OpenAI, and we could already see in the early days: this is actually pretty good. And then soon thereafter you saw newcomers, of course Cursor, but also
>> Sourcegraph, Codeium.
>> Yeah, Replit and so on, that came and started to do this slightly better, and then they evolved, and in my view they overtook the first movers, in this case Copilot, to build what are now software-native, sort of agentic flows. And this conversational interface that Cursor built, where they trained their own models to actually execute these tasks, very quickly got better, and in fact my whole team moved to Cursor. And now we are sort of at the next level. Think of it like autonomy levels, like cars have, level 1, 2, 3, 4. We're now at the next level, where, let's say, Devin-like agents don't just do autocomplete, they don't just edit code, they can create entire environments to execute a task end to end. We were playing with them, I think it started about a year ago when the Devin announcement came out, and yeah, it was a little bit underwhelming. Very promising, but underwhelming.
>> Now fast forward almost a year, and we see that it's much, much more useful. You still need to look at the different use cases, though. If I'm on my hobby project at home and I need to create a little app or a little dashboard to track, I don't know, household chores, by myself, I can use Devin today and it will work basically the first time. Before, I had to try it five times; now it works. For something simple, great. But how much of that really happens at work?
>> Right, so legacy.
>> Yeah, I need to commit stuff, I have maybe smaller projects, some of them are more demo-like. There you can also start to use Devin. But what happens there, in my experience at least today, is that if you let it create the entire codebase from scratch for something that we need to actually work on together and maintain, it creates it in ways you don't like: the architecture isn't exactly as you prefer, it's not to your taste, and it's not easy to maintain. That's where I don't see it working yet. Where we are now finding it works really nicely is the other type of project, where you already have a codebase. Not a codebase that 3,000 engineers are involved in, but one that maybe dozens of engineers, maybe hundreds, are involved in, with clearly documented repositories, where you already have a CI/CD pipeline, where things are documented better because you've got more people than you can individually see. So the agent comes into something, and you can ask it to do these tasks that are now a card on a Jira board.
>> Right, a ticket.
>> Like, I don't know, make sure this becomes XYZ-compatible, or insert this feature; give it a shot. That's where these things, in my experience, are now becoming great, because it's an existing codebase, we already thought about how the architecture should look, but you're making tweaks and modifications that are somewhat incremental. And the task is not so large that the agent can't handle the context.
And of course there are all these features that we're seeing now, like Devin creating its own wiki and indexing the codebase, various repos together. You can give it instructions that it then knows for anybody else working with Devin on that codebase. That's cool. And we're tracking how many PRs Devin is actually making on the team.
>> And is that a KPI or something?
>> Well, we have this duty now. We're basically saying, like, Devin duty, right? So somebody's on duty this week on Devin, and this week it's Devin but next week it'll be Manus or Cursor or Replit or whatever. The goal is that during that Devin duty, any task you pick up from the Jira board, you should first think: let me run it with Devin.
>> Wow, cool.
>> To see, right? Because otherwise how do we know if this thing got better and where to apply it? So that's the KPI: everyone needs to try it, be on duty, to sort of learn and discover. And then eventually we may set something like, hey, 10% of the PRs should be Devin-ready, if we believe that's feasible, or 20% or 25% or whatever.
>> How do you ensure the cognitive load doesn't just go through the roof because people are submitting a bunch of shitty PRs?
>> Yeah. Well, we had this, right? We actually built our own PR reviewer, and that thing was so bloody verbose that people turned it off after a couple of hours, because it was just inserting a ton of comments. And then the cognitive load of having to read those comments, compared to the submitted PR, it's easier just to look at it yourself and spare yourself the commenting from this verbose AI thing. So that's the question: what's the sweet spot where the work done offsets the additional cognitive load you need to put into understanding the AI's work?
>> Yeah.
>> And that's what we're trying to figure out. So, you know, how many PRs is it worth giving to an AI software agent? We're going to play with an S sur agent soon.
>> So we just try to understand where it is useful and where not. And to some extent there's also a team thing and an individual thing: more senior engineers versus more juniors, are you familiar with the codebase or not, and so on. For onboarding, for example, it's great.
>> Tell me more.
>> I mean, for me, I don't spend a lot of time coding, unfortunately, let's say committing production-level code, but I obviously want to understand what's happening in the code. I can just interrogate a codebase using Cursor or any of these other tools. New hires have the same: they come in and they're like, hey, I need to work on this new repository. We move people from project to project, and then they can come in and say, hey, describe to me how this codebase works, what are the key endpoints, what are the key services, what are the standards we agree on, what are the utils, and it'll just describe that to you. So it's a good way to get people up to speed, or people like me that aren't day-to-day in the codebase.
I think we should bring on Bruce now to talk about what tools you all have tested and where you decided to build your own and why.
>> Great.
>> You're here now.
>> I am.
>> Who are you and what are you doing here, Bruce?
>> So, I'm Bruce. I work at Prosus in the AI team; I've been there over two years now, and I work as an AI engineer in the team. What I spend most of my time on is building Toqan, which is our own agent, or agent platform, and we distribute that to the portfolio companies. So I work as an AI engineer doing software engineering, a mix of both of those.
>> And you've spent a good amount of time trying to incorporate different vendor tools that would help you build Toqan faster. I know that you've been working on Toqan for years, and so there's this maturity factor that goes into the tools that are out there that you can use and buy. Even if you wanted to pay money for something, maybe it just doesn't exist, or it's not at a solid maturity level. I typically see folks using tools in a few different places, and you can tell me why or why not you ended up buying a tool in these places, and we can talk about which areas there's actually value in, which areas you potentially see value in the future if they mature, and where there's no value. One is prompting tools, or prompt tracking.
>> Yeah. I think for a lot of products out there, and also for what we are doing, prompts are very essential to your system. Storing them somewhere else, I guess, is fine; of course, we also store them in a database. But we're not storing them in an external service that does the evaluations for us, because it feels so essential to have a very good prompt that I think it's something you want to spend time on yourself.
>> Mhm.
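For a concrete picture of "we also store it in a database", here is a minimal Go sketch of a versioned prompt record and a lookup of the latest version; the fields and the layout are assumptions for illustration, not the actual Toqan schema.

```go
package main

import (
	"fmt"
	"time"
)

// PromptVersion is one row in an assumed prompts table: prompt text is versioned
// so any conversation can be traced back to the exact prompt it used.
type PromptVersion struct {
	Name      string // logical prompt name, e.g. "summarize_conversation"
	Version   int    // incremented on every change
	Template  string // the prompt text, possibly with placeholders
	CreatedAt time.Time
}

// latest returns the highest version stored for a given prompt name.
func latest(rows []PromptVersion, name string) (PromptVersion, bool) {
	var best PromptVersion
	found := false
	for _, r := range rows {
		if r.Name == name && (!found || r.Version > best.Version) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	rows := []PromptVersion{
		{Name: "summarize_conversation", Version: 1, Template: "Summarize: {{input}}", CreatedAt: time.Now()},
		{Name: "summarize_conversation", Version: 2, Template: "Summarize briefly: {{input}}", CreatedAt: time.Now()},
	}
	if p, ok := latest(rows, "summarize_conversation"); ok {
		fmt.Println("active prompt version:", p.Version)
	}
}
```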
>> But then the evaluation piece?
>> So we've built our own evaluation flow for that, for two reasons. One big one is that within the AI team we also have a team that does evaluations and has a leaderboard, so we have the knowledge to do those things. And a second big reason is that if you're going to use an evaluations tool, you're going to need to send all the conversation data to an external party, which is hard to convince our legal department to do.
>> Oh, you bring up a great point there: maybe you wanted to use this tool, but to actually get it through and get the okay from legal is a whole other beast.
>> Yeah, I guess so. It's not only getting the okay from legal; I also want to be confident that something like that goes well. And if there are very new startup companies out there saying, "Oh, you can just send the whole conversation and we keep everything on track," it does feel a bit strange to send it all out there.
>> Yeah.
>> Because it's core to your product that you do well when you change a prompt, for example. And let's say I'm talking to a colleague at another company and they ask me this question; I'd be a bit hesitant to tell them how we do this, because they're probably not going to like it if we send all the PII to these eval sets. I think it really helps, then, if these companies have a self-hosted version, so the data still stays at your place.
>> Yeah, that's a common design pattern or vendor offering, where it's like bring your own cloud or VPC, and then you don't have to send that data anywhere.
>> No, that's true.
>> But even still, I can imagine legal is one vector that you're thinking about when you're thinking about that build versus buy.
>> Yeah.
>> And so you said we're going to do the prompting tools and the evaluation tools in-house.
>> Yeah.
>> Those are two big ones in my mind that people will pay for. The other one is orchestration. These are frameworks like the LlamaIndexes and the LangGraphs, LangChains, that type of thing.
>> So I have an example from our codebase. We built our own orchestrator, we had just added Vertex AI, and we noticed that sometimes we were hitting timeouts. The request was taking too long, and we had set the timeout to 5 minutes, so that's very long. Because we're doing the requests ourselves, we did the API implementation ourselves rather than using the SDK, we were pretty close to what the code does. So we easily added a metric to see how long it takes to get the first chunk from Vertex AI. We saw it was super fast, so we still didn't know what was happening. Then we added an extra metric for the time between chunks, and we noticed that sometimes the time between chunks is super high, like minutes, which is probably just Vertex AI malfunctioning, which is fine. But if you're using a library to do that, you're probably missing a way to fix it. Because we did it ourselves, it was very easy for us to add a special timeout saying: okay, if the time between chunks is really large, we just try again, so the user gets a reply quicker. And because we're making the whole API implementation ourselves, we saw that in the headers there was a request ID that Vertex AI sent us; we could easily add that to the metric and to the logs that we have, export those, and send them to Google, and they can easily see what happened. I think you'd need to be quite lucky for an orchestrator to let you do precisely that, because it's quite niche, but it is a fix. I feel like an orchestrator probably wouldn't have that, or you would need to be very lucky for it to be passed all the way up to the top.
>> It's that transparency piece. You have so much more control over what you are able to see and what you're able to do. And you do say that this is a very niche and specific situation, but it can be generalized: if it's not this specific situation, I'm sure you encountered five more like it.
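A minimal Go sketch of the inter-chunk timeout and retry Bruce describes, written against a plain streaming HTTP response rather than any particular SDK; the URL, header name, timeout value, and retry policy are assumptions for illustration.

```go
package main

import (
	"bufio"
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// errChunkTimeout signals that the gap between streamed chunks was too long,
// so the caller can retry instead of leaving the user waiting on a stalled stream.
var errChunkTimeout = errors.New("timed out waiting for next chunk")

// streamWithChunkTimeout reads a streaming response line by line and fails fast
// if the time between chunks exceeds maxGap.
func streamWithChunkTimeout(ctx context.Context, url string, maxGap time.Duration, onChunk func(string)) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Providers often return a request ID in a response header; logging it makes
	// support tickets much easier to follow up on. The header name is an assumption.
	fmt.Println("request id:", resp.Header.Get("X-Request-Id"))

	chunks := make(chan string)
	readErr := make(chan error, 1)
	done := make(chan struct{})
	defer close(done)
	go func() {
		defer close(chunks)
		scanner := bufio.NewScanner(resp.Body)
		for scanner.Scan() {
			select {
			case chunks <- scanner.Text():
			case <-done:
				return
			}
		}
		readErr <- scanner.Err()
	}()

	timer := time.NewTimer(maxGap)
	defer timer.Stop()
	for {
		select {
		case line, ok := <-chunks:
			if !ok {
				return <-readErr
			}
			onChunk(line)
			timer.Reset(maxGap) // watch the gap *between* chunks, not total request time
		case <-timer.C:
			return errChunkTimeout
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	// One retry on a stalled stream so the user gets a reply sooner; values are illustrative.
	for attempt := 0; attempt < 2; attempt++ {
		err := streamWithChunkTimeout(context.Background(), "https://example.com/v1/stream",
			20*time.Second, func(chunk string) { fmt.Print(chunk) })
		if !errors.Is(err, errChunkTimeout) {
			break
		}
		fmt.Println("\nslow gap between chunks, retrying...")
	}
}
```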
>> Yeah. And then it's nice, because we can say: okay, one of the main advantages of using us instead of a different orchestrator or agent framework is that we take care of that. We're very reliable.
>> Mhm. On the reliability piece, do you think that the orchestration frameworks, although they give you a lot on the abstraction and they're able to make the prototype experience nice, drop on the reliability experience? So it's like you have to balance those. That trade-off is very clear: speed to prototype is very high, but then reliability is very low.
>> Yeah, although reliability on those frameworks can be super good as well, and if they just have what you need, then it's perfect.
>> Well, explain why they didn't have what you needed. I know that you had mentioned the different SDKs, and specifically that a lot of these frameworks are in Python.
>> Yeah. So we use Go, and there are Go SDKs out there that do this kind of stuff, but even Google and Anthropic say: okay, this is our beta version. Which is funny, because Anthropic says in their documentation: use our SDKs if you're streaming, because the direct API implementation might not work as nicely.
>> So it's not only with the orchestration or the evals or these other tools. It's with the foundation models too.
>> Yeah, definitely. I think a lot of these AI products out there are really focused on Python; they're probably very good at building the Python SDK, but if you're not using Python, it's pretty hard.
>> And what are some downsides or pitfalls that you've had because of that?
>> Of not using Python?
>> Yeah.
>> We were using Python two years ago, or even a year ago still, I think. So one thing we're missing is the SDKs, because we're sometimes doing these small tweaks that I just talked about. You could do that in Python with monkey patching, which is not possible in Go. But moving to Go was such a nice transition for us, because it's compiled, and there's already so much uncertainty in what comes out of a model. If the code, at least, is, you know, okay, then that's fine.
>> And you didn't look into using something like Pydantic?
>> We did.
>> Why didn't that work?
>> That's just for the models that come in, the objects. But beyond that, in the code, I think it just requires a lot more effort to make Python fully fail-safe, where with Go it's way easier to do something like that.
>> It's almost like it comes out of the box in Go, and in Python you have to do this extra work.
>> Uh-huh. And when I switched from Python to Go, it felt super restricted at first, but in the end it feels way faster, because I just know: okay, I can't do this, I should do it this way. Something new comes in and it's just easier to switch.
>> It's funny that you say that: it's restricted in a good way.
>> Yeah, I guess so, yeah.
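As a rough Go analogue of what Pydantic gives you in Python, here is a minimal sketch of parsing a model's JSON output into a typed struct and rejecting anything that doesn't fit; the ToolCall shape is invented for illustration, not Toqan's actual schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ToolCall is the shape we expect the model to produce (an invented example).
type ToolCall struct {
	Tool      string            `json:"tool"`
	Arguments map[string]string `json:"arguments"`
}

// parseToolCall rejects unknown fields and missing values so a malformed model
// response fails loudly instead of silently flowing through the agent.
func parseToolCall(raw []byte) (ToolCall, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var tc ToolCall
	if err := dec.Decode(&tc); err != nil {
		return ToolCall{}, fmt.Errorf("model output is not a valid tool call: %w", err)
	}
	if tc.Tool == "" {
		return ToolCall{}, fmt.Errorf("model output is missing the tool name")
	}
	return tc, nil
}

func main() {
	good := []byte(`{"tool":"send_slack_message","arguments":{"channel":"operations"}}`)
	bad := []byte(`{"tool":"","arguments":{}}`)

	if tc, err := parseToolCall(good); err == nil {
		fmt.Println("dispatching tool:", tc.Tool)
	}
	if _, err := parseToolCall(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```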
>> So on one hand, without Python, you're able to have that reliability and you get all of these things out of the box. But then on the other hand, when you're working with the different foundation models or orchestration or any kind of vendor tool, you have to do extra work. Do you see it that way, or do you see it a little bit differently?
>> Yeah, it's a little extra work; I would think that's the case. Of course, there are frameworks and orchestrators out there that are super hard to replicate because they do something very well. That's also the reason why, for example, we use LLMs that we don't make ourselves for this agent, because that would be too much work. And for agentic tools as well: it's too much work to build a whole OCR pipeline, so we can just use something that's out there. But for these core functionalities, like the LLM orchestrator in your agent, yeah, it's a little more work, but that's at the beginning. In the end, if you're trying to find that time between chunks, it's way easier if you have the code, and you don't have to fork anything and change the library, or, if you're in Python, monkey patch something to get that one thing you need.
>> Yeah. So I see it a little bit as: you're investing more time up front, but then you save it on the back end, and when you have these edge cases, that's where you really see things shine, because you can go and debug much more easily.
>> Mhm. And we try to figure out what we need, and then it's way easier to build something yourself, and then maybe in six months we feel like, oh, this needs to be expanded. Let's say we do memory, memory on the user level. First we just build it on what the user and assistant are saying, but then after a couple of months we say: okay, it would also be nice if it remembers how it did tool calls, because we're now integrating a lot of OAuth tools. Maybe if I say, okay, I want to send something in a Slack channel, and I always call that channel "operations" but the actual ID is something like "toqan-operations", it would be super nice if it remembers that. But if you didn't build that whole memory thing yourself to begin with, and they don't provide that extra feature, then you're going to need to rewrite the whole thing, instead of: if you have built it yourself, you know what you want, you just add it. And of course you can do a feature request. But I just like that if something is so core to your product and you want to expand the feature, it's easy to do, instead of having to hack around or change the library for the newest feature, right?
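A minimal sketch of the kind of extension Bruce describes: a per-user memory that starts with conversation facts and is later extended to remember tool-call details, such as which Slack channel ID a user means by "operations". All names, IDs, and the structure are assumptions for illustration.

```go
package main

import "fmt"

// Memory is a per-user store. It began as plain conversation facts; because it
// is built in-house, adding tool-call aliases later is just another field.
type Memory struct {
	Facts       []string          // e.g. "prefers short answers"
	ToolAliases map[string]string // e.g. "operations" -> actual Slack channel ID
}

// MemoryStore keeps one Memory per user ID (an in-memory stand-in for a real store).
type MemoryStore struct {
	byUser map[string]*Memory
}

func NewMemoryStore() *MemoryStore {
	return &MemoryStore{byUser: make(map[string]*Memory)}
}

func (s *MemoryStore) get(userID string) *Memory {
	m, ok := s.byUser[userID]
	if !ok {
		m = &Memory{ToolAliases: make(map[string]string)}
		s.byUser[userID] = m
	}
	return m
}

// RememberAlias is the later addition: store how a user refers to a tool target.
func (s *MemoryStore) RememberAlias(userID, alias, id string) {
	s.get(userID).ToolAliases[alias] = id
}

// ResolveChannel prefers a remembered alias and falls back to the literal name.
func (s *MemoryStore) ResolveChannel(userID, name string) string {
	if id, ok := s.get(userID).ToolAliases[name]; ok {
		return id
	}
	return name
}

func main() {
	store := NewMemoryStore()
	store.RememberAlias("bruce", "operations", "C0123456789") // hypothetical channel ID
	fmt.Println(store.ResolveChannel("bruce", "operations"))
}
```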
>> The next piece that I think a lot of people end up paying for, or that there is a lot of attention around, is the observability piece. So tools like Langfuse, or I think LangSmith also does this. Obviously, insert your favorite word after "Lang" and there's probably a tool that is called that.
>> Mhm.
>> Are you all using a specific observability tool for the agents?
>> Not specific for the LLM. I think most of those "Lang-something" tools are all for agent-specific observability, but we use Datadog for observability. I guess those Lang tools do a little more out of the box for observability on the agent, but we started using Datadog, and it's easy to just add the metrics and the logs there that you would probably also get from other observability tools.
>> Mhm.
>> It also depends a bit on your back end as well. Most of the time with those Lang products you can also rerun if you make a change to your prompt. That's something we have in the back end ourselves; that's how it's built, with event streaming. So yeah, I guess you could also use traditional tools, but if you don't have that in place, then it's probably nice to use something like that as well.
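As an illustration of treating LLM calls as ordinary custom metrics, here is a minimal sketch using the Datadog statsd client; the metric names, tags, and agent address are assumptions, and the timings are hard-coded stand-ins for real measurements.

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

// A minimal sketch: the agent emits plain custom metrics to Datadog for each
// model call, rather than relying on an LLM-specific observability product.
func main() {
	client, err := statsd.New("127.0.0.1:8125") // local Datadog agent; address is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	tags := []string{"model:gemini-pro", "provider:vertex-ai"} // illustrative tag values

	start := time.Now()
	firstChunkAt := start.Add(300 * time.Millisecond) // stand-in timings for the sketch
	finishedAt := start.Add(4 * time.Second)

	// Time to first chunk and total latency: the two numbers that made the
	// slow-gap problem visible in the first place.
	_ = client.Timing("llm.time_to_first_chunk", firstChunkAt.Sub(start), tags, 1)
	_ = client.Timing("llm.total_latency", finishedAt.Sub(start), tags, 1)
	_ = client.Incr("llm.requests", tags, 1)
}
```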
>> And you don't feel like you're losing out by using a Datadog, which is not specific for it?
>> No. Well, maybe a little. Of course, when there's a new model coming out, it takes a little more effort for us to roll out that model and check whether it's still performing as well, because we don't have those tools. But then we save on not needing to send all the data to that tool. And we have a vendor tool that does the LLM logs and traceability, and then we have Datadog, which does more of the software engineering side, and now it's all just combined, which is also nice to work with.
>> And so it does go back to this idea of
bringing on a new tool being a bit of a journey. You need to get it through all of these different things, whether it's contracts and pricing and legal, etc. So when you look at what's out there and their capabilities versus what you're using in-house with a Datadog, you say: I think if we squint, we could probably make Datadog do 90, 95% of what you get from all these other tools, and we don't have to go through any of those hoops.
>> Yeah, I think so. I always catch myself when I look at any of these new products: I'm super excited about them, I say, oh, they can do this and they can do that, and it's better than Datadog because it can do this and this. And I convince everyone in the team we should do it, and I make the PR, and we test it on staging, and then we find out that either something is missing, or we could have just done this in Datadog, it's not that different, or in a more traditional software engineering tool. Not that Datadog is traditional, but in a sense it's not AI.
>> It seems like that's a little bit of a trust issue also, right? There's the advertisement and what they say they can do, and then what they actually do, and you don't find that out until you test it out yourself.
>> Yeah. So I have another example of that. We were looking into adding a vendor tool that could help us roll out tools more easily. We have a bunch of what we call more general system tools: OCR, image generation, all that kind of stuff that you expect from an agent. But then we also wanted to do OAuth tools, so going to Google with the user's OAuth credentials and maybe creating a doc, or going to GitHub and listing PRs. That's a lot of work, because you need to go into the documentation of Google, go through the API, and be creative and think: if I have these scopes and I have this endpoint, I could probably make a tool called "create blank document".
>> Which is nice to have.
>> Why can't this just be done with OAuth?
>> Oh, it can be done. We want to do it with OAuth, but finding out and writing the tool descriptions, coming up with what a tool should do... It's not like the Google documentation says, oh, these endpoints are super nice to create an agentic tool for. It kind of sounds like I'm talking about MCP now.
>> Yeah.
>> But a step back: we were looking for a company that would provide all these tools for us, that would do the OAuth flow, and we could just provide a user ID, saying, okay, this user ID, these arguments, and this tool. So we found one, Composio, and they had a huge list of tools available. So I was super convinced: okay, this is it, this is super nice. I tested a couple of them, they worked. So I made the PR, went to staging, we all started testing it. And then we noticed that a lot of the tools weren't working, and there were just minor things each time that needed a change, like, oh, you need to request an extra scope, or there was something you needed to add in the dashboard but you didn't do that. And it was also not really well documented, because new startups don't focus that much on documentation.
>> Sadly. I wish more did.
>> So yeah, that's definitely an example of where I got really excited, and I was like, "Oh, this was a little too soon to communicate to the team that we should do this."
>> Cuz then the team's looking at you like, "Dude, come on."
>> It's more that I'm saying, "Okay, this task is done at the end of the week," because then we have 20 tools and then we have 30 tools, and I tested five of them and then you test the other 15, and then you need extra scopes, and getting certain scopes at Google is so much work. Just things you didn't expect because they weren't in the documentation. What I do like about these startups, though, is that they're always very much listening to you. So if you would tell them something like this, they would probably help you out, and they also love the feedback, I guess, because it feels like a feature request, because they're just new anyway and the roadmap is not that defined yet.
>> So then you scrapped that
one and you decided to just go with MCP.
>> No, we didn't. A lot of our users are not used to working with AI or are not very technical; people in HR are using this, and we're distributing our agents across the portfolio companies. So someone from HR is not going to go online, download some MCP server, and hope that the Docker container runs instantly. I don't think even a regular person can easily set that up. So we considered it, and in the future it might be nice if, for example, Google themselves offer MCPs, and maybe you go to your Google account and you can create an MCP link. That would be nice, I guess. But if the users need to run the MCPs locally themselves, it's going to be too hard for them.
>> So basically, you check that one off the list.
>> Yeah. And then another thing with these startups: we really tend to also look into the legal part and the PII and the privacy, because we need to be compliant and that kind of thing. I'm used to: if I go into Datadog, I won't find PII. If there's PII, we get an alert saying, okay, there's probably PII in this. We never log PII. Sentry, the same thing. If I want to look at user data, I need consent for that per conversation. But then you start using some of these vendor tools. When I started using Composio and some users added their accounts so we could test it out, there were buttons where you could pick one of the tools, for example search documents, and you could pick one of the user IDs and just execute the tool.
>> So you had god mode.
>> Yeah, basically. So I thought, okay, maybe this is a one-off, it's not that bad; maybe they're going to add a feature later, a PII feature. But then I looked recently into Mem0, which is a long-term memory service, and added that to our system, just locally, to test it. I went to their dashboard, and I saw my whole conversation sitting there, like on the front page of the dashboard, saying, oh, this is your recent traffic. And then I went to the memory section and saw all the memories, which are basically all PII, because it's about: okay, this user likes this, and this user prefers it if you do that.
>> It's a little eye-opening. It's like you weren't trying to look, but you had to. And then all of a sudden I could see you looking at your screen and then trying to put your head down.
>> It was just locally, just me, but... I didn't mean to see that, I promise.
>> It does beg this question: inherent in the memory capability is that you need to know things about people. But the way these companies are setting it up, they now have to think: oh, if we want to go out to various users, we want to make sure that we're not inadvertently making their lives hard by doing things like keeping PII or making it front and center.
>> For prototyping, you need to be able to see that quickly, right?
>> Yeah.
>> But out of the box it seems just so different from the more traditional software engineering tools, where you definitely don't want PII in places like Datadog.
>> Yeah. And it's like you would have to add this extra layer of getting rid of all this PII. So adding that extra work makes you come back to the question of: should we just try this ourselves?
>> Yeah. And the self-hosted version is then a really good option as well.
>> Yeah. It feels like you're taking two steps forward and then one step back, or two steps back and one step forward, sometimes with the different tools, because they give you certain capabilities. It makes your life easier on one vector, but then when you are looking at other vectors, you're saying, "Oh my god, to actually get this into production and to get rid of all this PII and to comply with these norms that we have set up, we would have to do so much extra work. It's not worth it for us."
>> No. And then if it's easy enough to do, then you're probably going to save some time in the long run if you start from the beginning just experimenting and building something like that yourself.
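A minimal sketch of the "extra layer of getting rid of all this PII" idea: a naive regex-based scrubber that would sit between the agent and any external sink or log. Real PII detection (names, addresses, IDs across languages) needs far more than this; the patterns are purely illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// Very naive redaction patterns, purely for illustration.
var (
	emailRe = regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.-]+`)
	phoneRe = regexp.MustCompile(`\+?\d[\d\s-]{7,}\d`)
)

// scrub is the "extra layer" that would have to sit in front of any external
// tool or log sink that must never see raw user data.
func scrub(text string) string {
	text = emailRe.ReplaceAllString(text, "[REDACTED_EMAIL]")
	text = phoneRe.ReplaceAllString(text, "[REDACTED_PHONE]")
	return text
}

func main() {
	msg := "Reach me at jane.doe@example.com or +31 6 1234 5678 about the contract."
	fmt.Println(scrub(msg))
}
```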
>> And do you see this becoming more common or is this just a maturity thing?
>> No, I still think these vendor tools out there are probably going to stay for a long time, because there are always people building prototypes, and you don't always need to be super compliant. You don't always need to be able to scale it indefinitely. So I do think these vendor tools out there are perfectly fine, just not if you have an agent in production and you need it to be compliant, you need it to be scalable, etc.
>> Let's change gears for a second and talk about how you're leveraging different coding tools. Are you using Windsurf, Cursor, Devin, a little mix of all of them, Toqan even?
>> Yeah, we're doing a mix right now. Of course we're using our own tool, but now that the IDE systems have gotten so good, we also use those quite a lot. So I use Cursor, we use Devin, and we started using GitHub Copilot now to do PR reviews. The PR reviews, for example, really took away that first step where you maybe made a spelling error that didn't get caught by a linter, or you copy-pasted some adapter and needed another adapter, you just copy-pasted the whole thing and then forgot to change one little name, and the GitHub Copilot PR review is pretty good at catching that. We did try other ones out there as well, but we stopped using those because they were super chatty and very confident. With GitHub Copilot, it collapses the comments it's not confident about. So it really helps: if I made an error and send it off to my colleague, and I see, oh, GitHub Copilot already said, okay, I don't think this is right, that saves a lot of time, because it saves the round trip where I ask the colleague, the colleague goes back, okay, here is the comment, I need to fix it, and then I need to wait again till he has time. So that's super nice to have.
>> So these coding tools add to this piece
that I find fascinating, which is: you've been building a lot of tools for building agents, like the eval tool or the orchestration tool. You do not go out there and buy something, but it's not because there's a lack of push from the team. I was talking to Paul and he was saying that he's really trying to encourage you guys to buy stuff.
>> Mhm.
>> Despite that, you come back and you say, "Ah, it's just not there yet." And there's this vector maybe that is worth exploring in your eyes: before, when you thought about building it yourself, it was with one scope in mind, but now that you are using these coding tools, maybe you're a bit more ambitious about being able to build it yourself. Do you look at it that way, where you say, "Yeah, I can probably build this myself with the help of Cursor and Windsurf or whatever in a weekend"?
>> Definitely. If something comes by and I think I get how it works, it's so much easier to have something like Cursor explaining what I think it should do, and then also me and Cursor trying to find out how it works. If Cursor wasn't there, it would have been so much work to try to copy something you think is worth copying, instead of buying it, like copying the functionality. Yeah, I really think that tools like Cursor enable you to quickly build these versions of existing tools where you think: okay, they're lacking something like this, we can probably do it better. But if Cursor wasn't there, it's super hard. It takes way longer to experiment for a very small feature. Definitely.
>> That's all we've got for today. But the good news is there are 10 other episodes in this series that I'm doing with Prosus, deep diving into how they are approaching building AI products. You can check it out in the show notes; I'll leave a link.
And as always, see you on the next one.
[Music]