
Colin Jarvis | Head of Forward Deployed Engineering at OpenAI: Trust. Product. Impact.

By Altimeter Capital

Summary

Key takeaways

- **Morgan Stanley's 98% Adoption**: Morgan Stanley deployed GPT-4 for wealth advisers to access research reports, achieving 98% adoption and a 3x increase in usage after a 6-8 week technical build plus 4 months of pilots and evals to build trust. [02:45], [04:24]
- **Semiconductor Bug-Fixing Agent**: Forked Codex into a debug investigation and triage agent that investigates bugs and raises PRs using an execution environment, aiming for a 50% engineer efficiency gain; currently at 20-30% in rolled-out divisions. [05:45], [08:05]
- **Eval-Driven Development**: No LLM-written code is done until verified by evals; FDEs add scaffolding like deterministic parts and simulators around the eval framework to ensure production trust. [09:00], [09:53]
- **Klarna-to-Agents SDK Path**: Parameterized instructions and evals scaled Klarna customer service from 20 to 400 policies, became the open-source Swarm framework, extended to T-Mobile, and evolved into the Agents SDK and AgentKit. [10:30], [12:56]
- **Supply Chain Demo Guardrails**: Live demo orchestrates an LLM for tariff impact analysis and simulations with deterministic constraints like minimum suppliers, verifiable tables, and maps to enforce rules outside the LLM. [15:45], [18:36]
- **Avoid Generalizing Too Early**: The biggest mistake is generalizing solutions before deeply solving specific customer problems; high-concept ideas without clear problems fail, while deep dives yield generalizable outcomes. [27:00], [29:20]

Topics Covered

  • Target High-Stakes Problems
  • Trust Trumps Technical Hurdles
  • Embed to Automate Painful Tasks
  • Custom Builds Excrete Reusable Products
  • Solve Specific Pain Before Generalizing

Full Transcript

You know, there's this whole narrative out there about AI being a bubble, the MIT study that 95% of enterprise deployments don't work. You are the 5% making them work.

>> We aim at problems that are fairly high value. So we generally will aim at problems that are going to be saving customers or generating customers to the tune of, like, you know, tens of millions to sometimes the low billions in terms of value.

>> My former boss Shyam Sankar used to say at Palantir, you know, the job of a forward deployed engineer is to eat pain and excrete product. Sounds like you've successfully excreted some product here.

>> A lot of pain. That's what it is.

>> That's right.

>> Today I'm joined by Colin Jarvis, head of forward deployed engineering at OpenAI. Colin, thanks for doing it.

>> Yeah, thanks for having me.

>> I'm so thrilled to be doing this with you. You were actually somebody I was talking to a little bit. You joined in November 2022, the month of ChatGPT.

>> Back in the day. I think OpenAI was like less than 200 people back in the day.

>> Yeah, something like that. Yeah. I think I was like one of two or three people in Europe when I started.

>> Wow. Wow.

>> Well, I'm so thrilled to do this, you know, for a couple of reasons. One, I started my own career as a forward deployed engineer at Palantir about a decade ago.

>> Typical.

>> And second, you know, there's this whole narrative out there about AI being a bubble, the MIT study that 95% of enterprise deployments don't work. You are the 5% making them work.

>> That is certainly what we say in house.

Yeah. Yeah. Definitely. Yeah. Yeah.

We've been working with enterprises for a while. So maybe start from the start, you know, the origin story of forward deployed engineering at OpenAI. How did you guys come up with this? What was the motivation?

>> Yeah.

>> Uh, take us back.

>> Totally. Yeah, definitely. Like, when it started in 2022, you know, ChatGPT came out, there was tons of hype and people were really excited, but it was also kind of hard to get value from the models at that point. Like, I think copywriting was the big use case, just, like, you could write papers with it. And I think what we were excited about with forward deployed engineering is, over those couple of years following ChatGPT, we were able to get customers in production, but it was pretty inconsistent and it took a ton of time every time. And the thing that was consistently successful in getting those folks to production was just sort of embedding with them, understanding their domain really deeply, and sitting with the users on the ground and getting them to properly adopt the technology. And I think we sort of found our previous hands-off technical support model was just not reliably getting these folks to production. And that kind of led to us deciding to set up a forward deployed model so that we could, I guess, repeatedly get these enterprises to production.

>> Makes a ton of sense. And maybe double-clicking, what does forward deployed engineering actually do?

>> It would maybe help if I kind of ground this in an example. So one of the customer examples that led us to want to set up a forward deployed practice was Morgan Stanley, who we deployed in 2023. They were in fact our first enterprise customer to deploy with GPT-4, and the use case for them was that they wanted to take their wealth management practice... and this is kind of a consistent feature that we see with people who are successful deploying GenAI: they don't do a kind of edge case over to the side of the company. They pick something with genuinely high stakes. In this case, Morgan Stanley Wealth Management is kind of one of the biggest parts of their business. And what we were trying to do was put their research in the hands of all the wealth advisers, so all of them would, I guess, have easier access and be able to give more actionable insights to their customers. And so, I guess, the problems to be solved were twofold. At that stage RAG was not a thing yet. It was kind of like, we have a bunch of research reports, we need to get those in the hands of the wealth advisers, and we need them to trust them. So what we did was we embedded with Morgan Stanley. We worked with our engineers. We first of all did a bunch of retrieval tuning to get to the point where we could kind of trust the outputs of the research reports, and then we ran a bunch of pilots getting the wealth advisers to label data, and eventually got them to the point where they trusted the insights and started actually using them to give to their customers. And I think that's the key thing with them: the main technical hurdle was probably solved within like 6 to 8 weeks. But it took a further 4 months of doing pilots, collecting evals, and iterating to get to the point where the advisers actually trusted it, to actually use it in anger. But the outcome was really good. Like, I think in the end about 98% of them adopted it. It resulted in a 3x increase in the amount of usage of their research reports.

>> Wow, six months to effectively build trust.

>> Yeah.

>> But the original product was working within a couple of weeks as you said.

>> Yeah. Yeah. I'd say probably within six to eight weeks we had the pipeline, we had some guardrails, we had, you know, the retrieval... we tuned the search enough that we were getting decent results. But it's almost like, how good is good enough for production with these enterprises? And especially Morgan Stanley, working in a regulated environment, there's a very high, I guess, need for accuracy. But also this is a probabilistic technology, like, you do just have to build the right kind of framework and then also get the users to trust it and give them the tools to verify it where they don't.

>> Yeah.

>> So that was 2023.

>> Yeah. Fast forward to today, two and a half, three years in, forward deployed engineering is a hot topic, probably the hardest job to hire for. There's startups across the spectrum, from AI-native startups like Sierra, to infrastructure businesses like Baseten, to even business and financial services like Ramp. They all, you know, have forward deployed engineering teams. Really wide range, means a lot of different things to different people, different use cases. But maybe we can start with a use case where forward deployed engineering has been particularly useful.

>> Yeah.

>> Take us deep on an example or two where you've moved the needle and delivered, like, tangible ROI.

>> Totally. Yeah. Happy to take you through. So probably one of our biggest projects right now is with a semiconductor company in Europe. And what we're doing there is, I guess, going back to the theme with Morgan Stanley, it's like you've got to pick a big kind of problem to solve. And in this case they basically said: look across our entire value chain, and we want you to effectively take the biggest areas of waste in that chain and then make them more efficient with AI. And they basically left it fully open with us as to what we could do, and we basically embedded on site, spent like a couple of weeks just understanding their business in a lot of detail and looking across their value chain. There's the design process, the verification process, and then measuring performance once you've actually shipped these chip designs. And the one that we really dove in on was verification, which is, again, I guess the thing that the engineers don't like doing, but probably they have to spend like 70, 80% of their time on, you know, just fixing bugs, maintaining compatibility with old versions of chips, all this kind of stuff. Yeah.

>> So what we did was we delivered like 10 different FDE use cases across their value chain, with the aim of eventually delivering... I think the total figure that we're aiming for is like a 50% efficiency saving in the engineers' time, and currently where we're sitting is about 20 to 30% for the first couple of divisions that we've rolled out with. And if I give you an example of some of the things we do: one of them is that they have so many tests to run on these chips that they have to schedule a bunch of them and run them overnight every night. And when they come in in the morning, every engineer sort of has something between a few hundred bugs which have now come to their desks, and they're just sitting looking at them like, oh my god, I have so much stuff to do, and already I just have to start debugging and work from almost nothing. So what we did was we actually took Codex to start with, and then we started to build on Codex to sort of tailor it for their domain. And what we did was start off where the model would just take the bug, go and do a bunch of investigation, and then write a ticket for the engineer to say, this is probably what's wrong. And when we got to the point where they trusted the results, we then started to actually have the model go and actually try and fix it and then raise a PR. And when we did that, we discovered actually we need to give the model an execution environment so it can test its code. And it's sort of this iterative process where we started off with just, this is a painful thing, let's just help it be advisory. What we ended up with was what we call the debug investigation and triage agent, which basically goes in and tries to fix a significant proportion of these bugs. So when the engineer comes in in the morning, our vision, hopefully by, you know, mid next year, is that the engineer will come in and most of them will be fixed. The hardest ones will be clearly documented; they'll be able to just go and do them. And the impact for the engineer: sometimes it would take them like four to six weeks to actually get through these, because they'd be trying to do their work while, you know, swivel-desking between fixing bugs and going and writing new code. Now what we're hoping is we can get them to a world where they're just constantly writing new code and the model is doing a lot of the supporting

>> Bug fixing. Yeah.

>> Yeah. This is it.
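To make that progression concrete, here is a minimal, hypothetical sketch of the advisory-to-autonomous flow described above. The function names (`investigate`, `propose_fix`, `run_tests`) and data shapes are illustrative placeholders, not the actual Codex-based system.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Bug:
    bug_id: str
    failing_test: str
    logs: str

@dataclass
class TriageResult:
    ticket: str              # human-readable investigation summary
    patch: Optional[str]     # proposed fix, if the agent attempted and verified one
    tests_passed: bool       # result of running the patch in the execution environment

def triage(bug: Bug,
           investigate: Callable[[Bug], str],
           propose_fix: Callable[[Bug, str], str],
           run_tests: Callable[[str], bool],
           autonomous: bool) -> TriageResult:
    """Advisory mode: investigate and write a ticket for the engineer.
    Autonomous mode (enabled once the team trusts the results): also
    propose a patch and verify it in a sandboxed execution environment
    before it would be raised as a PR."""
    ticket = investigate(bug)            # e.g. an LLM call that reads logs and summarizes the likely root cause
    if not autonomous:
        return TriageResult(ticket, None, False)
    patch = propose_fix(bug, ticket)     # e.g. an LLM call that drafts a code change
    passed = run_tests(patch)            # deterministic check: does the failing test now pass?
    return TriageResult(ticket, patch if passed else None, passed)
```

The `autonomous` flag stands in for the trust gate described above: the agent only graduates from writing tickets to proposing fixes once its investigations have earned trust, and even then a deterministic test run decides whether a patch survives.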

>> Wow. This is an interesting tango with the model and your team. Where does the model end and your team take over, and is that line moving over time?

>> This is a great question. I think ideally you would just go in and you would just say, hey, here are the tools I have as an engineer and here's the problem, and just go fix it. And I think that's maybe a world we eventually get to. I think what we find is, the first thing we did was fork Codex and add a whole bunch of telemetry so that we could build really detailed evals. We could be like, okay, if I as an engineer was going to fix this thing, there's a detailed trajectory I would follow of, you know, maybe 20 different actions where I'm checking different logs and saying whether or not there's a problem. And our first step is kind of figuring out, okay, making a set of five with the experts from the customer, making this labeled set, and then starting to work from there. I kind of feel like, taking a step back, even if you do have this perfect future where the AI is able to go away and do these things autonomously, you're going to want some sort of ability to check its work. And so we build this, I guess, kind of eval framework starting point. We call it eval-driven development, where effectively a bit of LLM-written code is not done until you have a set of evals that verify the efficacy, and that's sort of where we start with this. So I think, coming back to your original question, if the models get better, hopefully we just run that eval set and it works out of the box. In practice, the FDEs then have to add a whole bunch of scaffolding around that to make sure that, okay, these bits need to be deterministic; these bits, we actually have a different simulator we're just going to execute as a tool and get the results. So it's generally a bit of a mix of the two approaches.
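As an illustration of "eval-driven development" in the sense used here, a minimal sketch under stated assumptions: the expert-labeled trajectory format, the pass threshold, and the `agent` callable are all invented for the example, not the team's actual framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    bug_id: str
    inputs: dict                  # logs, failing test name, telemetry, etc.
    expected_actions: list[str]   # expert-labeled trajectory, e.g. ["read_log", "bisect", "patch"]
    expected_outcome: str         # e.g. "tests_pass"

def run_evals(agent: Callable[[dict], tuple[list[str], str]],
              cases: list[EvalCase],
              threshold: float = 0.9) -> bool:
    """Return True only if the agent's trajectories and outcomes match the
    expert-labeled set often enough. LLM-written code is treated as
    'not done' until this gate passes."""
    passed = 0
    for case in cases:
        actions, outcome = agent(case.inputs)
        trajectory_ok = all(step in actions for step in case.expected_actions)
        if trajectory_ok and outcome == case.expected_outcome:
            passed += 1
    return passed / len(cases) >= threshold
```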

>> Super helpful. So that's your coordination with the model side. And then I'll ask you another one on the product side. You know, the big pushback, and sorry to use the C word, is that FDEs are just consultants.

>> Yeah.

>> But is it true that you guys are actually building products?

>> Yeah.

>> And if so, where's the line between products that are reusable for other customers and custom work or consulting work that you might do for one customer?

>> Even in that brief example there's sort of an element of product, where we're kind of figuring out, okay, is this eventually going to be a PR back to Codex that's going to just make it better at doing these enterprise use cases? Is there a different product, which is different from Codex, which is going to solve this problem? But I think maybe a better example actually comes from all the way back in 2023, which was another one of our customers, Klarna, and I worked with them...

>> Of course, a very vocal customer.

>> Yeah. Yeah. Exactly. And I worked with them for a while to do a bunch of things, but the first thing we shipped with them was this customer service application. And what we found was, effectively, customer service was very hard to scale. Like, if you're manually writing a prompt for every single policy and you've got 400 policies, this is not a scalable method of approaching this. So we worked out a way of parameterizing the instructions and tools and then wrapping each intent with a set of evals, so that we could scale from, say, 20 policies all the way up to 400 policies or more. And this method worked pretty well at Klarna. What we then decided was, actually, we were going to take this and boil it into an open-source framework that people used internally, which we called Swarm, and we eventually open-sourced that, and the market reaction was actually fairly good. And in parallel with that, we started an engagement with T-Mobile, which was, I would say, even more complex, like maybe 10x more complex in terms of volume and the number of policies and the complexity of policies than Klarna. But what we found was that this framework actually worked pretty well there with a few extensions. And so what we did was we then worked with our product team and we were kind of like, hey, look, we've got some production customers, these primitives are working, this Swarm thing is actually getting a lot of stars on GitHub, which actually has quite a lot of influence internally. And we eventually decided to work with product to stand up a team to build what became the Agents SDK. And I think recently you might have seen that we released AgentKit, which is kind of a visual-builder type thing, but that is actually just a continuation of that story that started in 2023, which is: let's make a way to parameterize instructions and tools; let's make a framework which then makes it much simpler to roll out; and now let's make it easier to adopt that framework through something like AgentKit. And that all sort of came, I guess, originally from what was a solution architect engagement, but very much an FDE style of delivery, and today the FDEs all use it as a standard tool in their toolbox.
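A hedged sketch of the parameterization idea, not the actual Swarm or Agents SDK code: each policy becomes data (an intent, an instruction template, a tool list, and its own eval cases), so scaling from 20 to 400 policies means adding entries rather than hand-writing 400 prompts. All names here are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Policy:
    intent: str                      # e.g. "refund_request"
    instructions: str                # templated guidance for this intent
    tools: list[str]                 # tool names the agent may call for this intent
    eval_cases: list[tuple[str, str]] = field(default_factory=list)  # (user_message, expected_resolution)

def build_prompt(policy: Policy) -> str:
    # One shared template, parameterized per intent, instead of hundreds of hand-written prompts.
    return (f"You handle '{policy.intent}' requests.\n"
            f"Policy: {policy.instructions}\n"
            f"Available tools: {', '.join(policy.tools)}")

def evaluate(policy: Policy, agent: Callable[[str, str], str]) -> float:
    # Wrap every intent with its own evals so new policies can be added safely.
    if not policy.eval_cases:
        return 0.0
    hits = sum(agent(build_prompt(policy), msg) == expected
               for msg, expected in policy.eval_cases)
    return hits / len(policy.eval_cases)

policies = [
    Policy("refund_request", "Refund within 14 days of purchase.",
           ["lookup_order", "issue_refund"],
           [("I want my money back for order 123", "issue_refund")]),
    Policy("delivery_delay", "Offer a tracking link, escalate if more than 7 days late.",
           ["lookup_order", "escalate"]),
]
```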

>> Wow. So what started with Klarna grew with T-Mobile, and now AgentKit.

>> Yep.

>> That's fascinating. It's a real product that's actually out there, built by the forward deployed engineering team.

>> Exactly.

>> You know, my former boss Shyam Sankar used to say at Palantir, you know, the job of a forward deployed engineer is to eat pain and excrete product.

>> Sounds like you've successfully excreted some product here.

>> A lot of pain.

>> That's right.

>> Well, maybe, you know, this is actually a great example. So while you were ingesting all that pain, maybe tell us a little bit about the before and after. You know, if I was now on the other side, a customer or an enterprise that had the good fortune of working with you, what does my before and after look like? How am I different before you showed up and after your engagement?

>> I'd say there's kind of two things. There's the problem we're solving. So there's going to be a whole bunch of demonstrable ROI. You're going to have a piece of software which you didn't have before. It's going to solve your problem. Yeah. And generally we work with customers where we'll either have a set of engineers on the other side that we're going to hand over to, or we work alongside maybe a partner who's going to then take that forward and operate it, because we see our FD team very much as a zero-to-one team. Like, we're going to come in, break the back of the difficult, novel problems that are at the root of what you want to do, but then we want to move on to the next problem. We don't want to be there in the long term. So that's generally the model. I think, apart from the software and the ROI that they're getting out of that software, there's also a bit of an understanding of the FDE sort of ethos and the way that we deliver. Which is, they should know the way you develop with LLMs: you orchestrate whatever your end-to-end workflow is, you wrap evals around it, you put guardrails in to protect against the things you can't really control that are going to happen at runtime, and then you basically build up an eval set, test the consistency of this application, and you learn how to make applications you can trust. And what we hope is that you don't really need the FDEs to come back once we've solved this initial problem. Although in practice, what we've seen with some of the bigger ones is actually those building blocks, like for the semiconductor example I mentioned earlier, start to become an application in their own right, and maybe we start to just keep building on top of the foundations. But the preferred method is that we find a hard problem, we solve a hard problem, and then we extract the learnings and we move on to the next customer.

>> Fascinating. Fascinating. Is there like a product that you can show us that you've left behind?

>> Yeah. Yeah, definitely. I've actually got one here that we did for a customer in the manufacturing sector in APAC.

>> Let's see it. Yeah, let's do it.

>> Happy to see it.

>> Nothing better than a live demo.

>> Yeah. Yeah, for sure. Let's take you through. Cool. So, this dashboard here, just to explain what you're looking at: this was for a customer in APAC who worked in the automotive industry, and this is kind of a toy version of what we produced for them. Their problem was that they had an extremely complex supply chain. So if anything happened that was going to affect the supply of parts, they would need to communicate through all these different teams: the manufacturing division, then the logistics team in terms of where they were going to source those parts and then move them to the factories where they were actually going to build everything. So the coordination of this was all done manually, by teams sort of calling each other on the phone. And what we tried to do with them was build a data layer to start with... we didn't move the data, but we just created a couple of APIs so that we could then put an LLM in as an orchestrator that could effectively query that data and start to produce an initial stab at the insights that would have taken you, you know, hours or days of wrangling of these other teams. And so the use case I'm going to go with here is that we have a customer who has just heard that there's a 25% tariff on goods coming from China to South Korea. And normally this would require coordination across these teams. They would have to figure out which parts it affects, how that affects their supply chain. And then they would have to run a bunch of simulations to figure out what is the right way to reroute our supply chain to get around this problem.

>> And in this case what we're going to do is use an LLM to deal with this problem. And the problem with this is that there's a huge number of variables, and as we're familiar with with LLMs, it's a probabilistic technology. If you give it full openness to just do whatever it wants, then we're probably not going to trust the results. So we're also going to build in some deterministic guardrails, so that whatever plan the model is coming up with, we can always verify deterministically, make sure the numbers add up, before we actually go and make any genuine business decisions. So what we created for them was this sort of view here. Whichever plan we're going to make with our supply chain, there's a bunch of deterministic constraints that are going to be tested here. Like, for example, we always need minimum two suppliers for tires, so that we're not going to get stuck from that perspective. A few on lead time, a few on, like, whichever suppliers we're using, we need to have all materials covered. All these sorts of things. These are things that you want to be done 100% of the time. You don't want to trust the LLM to enforce those rules; you want to do those deterministically. That's kind of, I guess, a consistent feature of how we approach things in the FDE team: always trading off, whenever you can use determinism, do it, and then use LLMs for what they're best at, which is this probabilistic approach where there's some kind of nuance required as to what you need to do.
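Here is a minimal sketch of what "enforce the rules outside the LLM" can look like. The constraints follow the demo (minimum two tire suppliers, all materials covered, a lead-time ceiling), but the plan structure and the specific limits are assumptions for illustration.

```python
# Hypothetical shape of a plan proposed by the LLM orchestrator.
plan = {
    "suppliers": {"tires": ["SupplierA", "SupplierB"], "steel": ["SupplierC"]},
    "required_materials": ["tires", "steel", "glass"],
    "max_lead_time_days": 21,
}

def check_plan(plan: dict, lead_time_limit: int = 30) -> list[str]:
    """Deterministic guardrails applied to every plan the model proposes.
    These run 100% of the time; the LLM is never trusted to enforce them."""
    violations = []
    tire_suppliers = plan["suppliers"].get("tires", [])
    if len(tire_suppliers) < 2:
        violations.append("need at least two tire suppliers")
    covered = set(plan["suppliers"])
    missing = [m for m in plan["required_materials"] if m not in covered]
    if missing:
        violations.append(f"materials with no supplier: {missing}")
    if plan["max_lead_time_days"] > lead_time_limit:
        violations.append("lead time exceeds the allowed ceiling")
    return violations

print(check_plan(plan))  # e.g. ["materials with no supplier: ['glass']"]
```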

>> So let's play out this scenario. I'm going to put in: what's the impact of a 25% tariff from China to South Korea?

>> And there's a couple of controls we built in here, because the customer was quite conservative and did not trust LLMs very much when we started. So we added in a step where it would always explain its reasoning to start with. Again, these are fairly basic, but I'm just showing you all the different layers that we provide. And we also provided it a bunch of widgets so that it could display things clearly. Like, for example, here's the tariff, here's the impact on the different parts. Often folks would still not trust it, so we also added other kind of deterministic checks, like it could just expose tables so that you could check its work if you wanted. So in this case, if I wanted to check that those values that it applied for the tariffs are correct, I could pop up this detailed modal. You can see here that indeed all of the parts that originate from China now have a tariff applied to them of the percentage that I asked. You can also, if we want to, look at a map view. So we'll show you, okay, yeah, the tariffs are hitting this factory that's producing these parts, whatever this is.

>> Um, cool. So this first part is just, like, this business intelligence previously required a bunch of different BI teams to go in and write SQL queries. Now we've got a model just doing all this and kind of democratizing access to those. But the real value is that then you have to do a bunch of complex simulations to be like, right, you know, what is the right combination of factories that is going to minimize my lead time, minimize my cost, balance off all these trade-offs. And equally, that is something that you don't want to ask an LLM to do just out of the blue. But what we did was we gave it access to a simulator where it could basically play with the parameters that an educated business user would play with, and then we would get it to use those to basically run a bunch of simulations and then decide what the best one is. And in this case, I'll just set it off here because it'll execute a few. So it'll actually take a couple of seconds. This is, again, kind of a toy example, so it's going to do five different optimization runs. But in practice, this is actually another area where we've seen folks recently use agentic sort of approaches pretty successfully: effectively getting them to run, you know, hundreds or thousands of simulations offline and then come back to you with a well-documented approach as to why this is the right set of trade-offs that you could do. So once this is done, it's going to come back with a little table here just showing, of those five runs, here's the trade-offs, here's the best one based on what you asked me to do. And then we'll pick one, we'll ship that, and then we're kind of good to go.
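A sketch of the simulator pattern under stated assumptions: the candidate parameter sets stand in for what the LLM would propose, the simulator is a stub, and the best run is chosen by an explicit, auditable trade-off function rather than by the model's judgment.

```python
import random

def simulate(params: dict) -> dict:
    """Stand-in for the real supply-chain simulator exposed to the model as a tool."""
    random.seed(hash(frozenset(params.items())))  # deterministic per parameter set
    return {"cost": random.uniform(0.9, 1.3) * params["tariff_factor"],
            "lead_time_days": random.randint(10, 30)}

def score(result: dict, cost_weight: float = 0.7) -> float:
    # Lower is better: an explicit trade-off between cost and (normalized) lead time.
    return cost_weight * result["cost"] + (1 - cost_weight) * result["lead_time_days"] / 30

# In the demo the LLM would propose these; here they are illustrative placeholders.
candidate_params = [
    {"route": "china_direct", "tariff_factor": 1.25},
    {"route": "taiwan_reroute", "tariff_factor": 1.00},
    {"route": "korea_local", "tariff_factor": 1.05},
]

runs = [(p, simulate(p)) for p in candidate_params]
best_params, best_result = min(runs, key=lambda r: score(r[1]))
print(best_params["route"], best_result)
```

The selected plan would then still pass through the deterministic `check_plan`-style guardrails before anyone acts on it.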

>> Fascinating.

>> Yeah. Oh, here we go. Yeah, there they are. So we can see the different costs and lead times that we've come back with, and the model suggesting one. So, again, I'm just going to accept the plan as final. And you can see here as well the new plan that's come back. We can see that it's passed our deterministic checks, so we're comfortable. We can also see that the tariff has been applied here, so we can confirm that. And we can also use the map to just visually see that actually there's a difference. So we're now using a Taiwanese factory where previously a bunch of things were coming from China. So that's just a quick demo of one of the things that we built for a customer. Again, we've kind of simplified a lot of the details to make this easily consumable. But each of the aspects under here are things that we've genuinely delivered for folks, both in the manufacturing industry and in others. And yeah, I hope that's illustrative of the sort of thing we're building.

>> Well, that's a very involved engagement, clearly, with this one partner. And, you know, the median public software company earns maybe hundreds of thousands of dollars per customer. Palantir is probably the highest end of that, earns about five, six million median ACV per customer. This feels very involved. So what is the median size of the prize for your team to get involved? I can't imagine that being a low number.

>> This is an interesting one. I think at OpenAI we definitely don't see the FD function being a function that is going to be a cast of thousands. We definitely want to be very focused, focusing on problems that we think are likely to be generalizable and turn into platform in the future, or going to push our research in a new direction. So I think we aim at problems that are fairly high value. So generally we'll aim at problems that are going to be saving customers or generating customers to the tune of, you know, tens of millions, sometimes the low billions, in terms of value, so that we can ensure that if we're going to solve this, it's going to have a big kind of economic impact at the end.

>> This is the SWAT team.

>> Yeah. Well, yeah, that's a way of saying it.

>> How big is the forward deployed engineering team at OpenAI?

>> Yeah. So when we started at the start of the year it was just myself and my colleague Arno in Paris. Yeah. We did our first engagement with John Deere with just the two of us, and primarily him, and then we're now 39, and we're going to be 52 by the end of the year.

>> Two to 52 in a year.

>> Yeah. So it's been a quick, a meteoric rise for the team. Yeah.

>> All right. Let's go. Wow. You guys are going fast.

>> Yeah.

>> There's a lot of... this explains it. I mean, there's not a lot of FDEs out there. You know, I see founders and CEOs and, honestly, customers wanting to build a forward deployed engineering motion. They're trying to understand, hey, what are the economics of it? What is the right set of my customers where I should deploy forward deployed engineers?

>> Yep.

>> You know, what advice would you have for startup founders and CEOs in the age of AI of, like, how and where to deploy FDEs?

>> The first thing is to be very clear as to what your FDE team is going to accomplish. Like, are you going to be relying on the services revenue as a key revenue stream? Because that is, I think, a very different motion than what we're trying to do here, which is, we're almost betting on the increase in revenue that we're going to get from successful product bets that are going to come out of these FDE engagements.

And I think that's probably the key thing. I think people try and do both.

And then inevitably, I think, from working in consulting in the past, I always saw a failure mode in those consulting firms where they had a vision of, we're eventually going to be a product company, but unfortunately the short-term lure of services revenue just starts to drag everyone in that direction, and then suddenly you kind of lose the strategic view. I think at OpenAI we're sort of lucky, because at our heart we're a research company, and then we became a product company, and that allows us to be strategic in terms of prioritizing those customers. So I think, coming back to your question of what the advice would be, it would be: be very clear as to what the purpose of this FDE team is, and then fully push it towards that, and be prepared to say no at some very difficult times, like when, you know, somebody's going to offer you a lot of money to do something that's not strategic and you have to sort of hold firm with that.

>> And the thing that you've got to hold firm with specifically is that you've got to produce a product or a platform at the end of it.

>> Yeah.

>> That's true.

>> Yeah. And I think that is where at OpenAI there's a bit of a trade-off. Like, I'll sort of break my own rule a little bit there, because there's kind of two things. There's sometimes very economically valuable problems where we're not sure if that's going to lead to platform, but we're kind of like, you know what, that's such a valuable problem that we're pretty sure there's at least some research learnings here, that if the model were better at this sort of problem solving, that would be a net good for OpenAI. So let's prioritize that customer. And we actually split our capacity in the team by those two levers. So for some of them we have definite product hypotheses, and we're looking for the perfect design partner, the perfect person to build a customer service thing, or the perfect person to build a clinical-trial doc-authoring solution with. Whereas on the other hand, we also have just sort of industries which have interesting problems, like semiconductors or maybe life sciences. And for those we're more just looking for the right partners who have these big problems that they want to solve. And we split the team according to those two axes. Yeah.

>> Fascinating. And just to follow up: as you are creating all this product for these customers, will OpenAI release these products for other long-tail customers where maybe your team is not involved? Should we expect to see, like, an OpenAI product in customer service or life sciences and semiconductors and so on?

>> Without giving away too many details, I think it's an option that we're kind of open to. Like, we're looking for additional monetization options in the future. And what I definitely would say is the FDE team is a zero-to-one team. Like, the way that I would almost see it is, on this product side, we want to prove the product hypothesis. We then want to do it maybe two or three more times. So if the first thing was maybe 20% reusable, the next two get us to maybe 50% reusable, and then we almost want to push it into the scaled part of the business and try and do this thing a whole bunch of times across the market. So that is definitely a motion that we want to build with this team. Although I would say, to contextualize, we're very much at the start of that journey. We have a bunch of hypotheses, we're trying to figure out where our bets are going to go, and that's kind of where we're at right now. That's on the product bet side. If I go back to these more researchy bets of trying to solve these big industry problems, I'd say we're considering on that front how we would do this. But I'd say we're less likely to go into a vertical and have, like, a vertical-specific product that we support as something through the FD team. But we do hope that by solving these very valuable problems, some kind of horizontal platform that is useful might fall out of them. So we're open to that option, but I think it's more on this product side that we're looking for the reusability.

>> Super helpful. So zero to one, maybe another one or two to get it up to like 50%, 60% as good, and then spread it out to the rest of the world.

>> Yes. Exactly.

>> That makes sense. Maybe one question on what not to do. You've seen this enough times where I'm sure you've seen a good set of examples where it did not lead to a product.

>> Yeah.

>> Or did not lead to learnings. Was there a common thread across those experiences of, like, hey, these are things to not do in forward deployed engineering?

>> I mean, we've made plenty of mistakes over the last year, but I think one of the biggest ones is probably generalizing too early. Like, I think there's been some cases where, as OpenAI, because we have ChatGPT, there's some functions where you just look at them as a feature in ChatGPT and you're like, that would make a great generalizable solution for enterprises, and you kind of go looking for a problem for them. And what we found in a lot of those cases is we then don't do the zero to one. We're just trying to deploy this generalized thing and it just doesn't really work. Like, I feel like generally in the LLM ecosystem there's too many high-concept solutions out there that don't really have a clear problem that they solve. Whereas conversely, whenever we've just gone super deep on the customer's problem and just been like, we're just going to solve this problem, almost every time there's been some kind of generalizable thing that's popped out at the end. So I think definitely generalizing too early is, I guess, the biggest mistake that we've made and then learned from this year.

>> Right. I can't remember, I think it was Paul Graham who said that, you know, don't worry about scaling too early.

>> Yeah.

>> You'll fix the specific problem.

>> Yeah. I think doing what doesn't scale is, like, one of the watchwords of the team in terms of how we try to approach things.

>> That makes sense. That makes sense. I have a set of rapid-fire questions for you.

>> Yep.

>> We'll launch into them. My favorite one, which is the long-short game, which we love at Altimeter.

>> Yeah.

>> Pick an idea, a startup, a business that you think is going to benefit and is underappreciated.

>> Yeah.

>> And the same thing on the other side. An idea or a business that you think is more sizzle than steak.

>> This is a fun one. So I think starting off with the things I'm long on. If I look at where we spend almost the majority of our time in FD getting things to production, it is creating the translation layer between the data, like the raw data, and then kind of injecting the business logic in some way so that the LLMs can make effective use of that data. And, like, people often just start with MCP and they just stick the MCP connectors in, but I think the thing that we've had to repeatedly do is build almost like a light logic layer in between the two that the model then uses to make best use of it. It's like, you know, when would I take the structured data in the data warehouse and mix it with the contents of SharePoint to answer this complex question? And I think the companies that are in this sort of space, it's almost like the metadata translation layer, which interestingly is actually sort of an old-school concept. It's like, you think of tools like Collibra and stuff where you're just documenting metadata, but actually now that it's LLMs making use of it, this kind of thing is actually fairly useful. And so I think the people who can solve that problem, I'm fairly long on those folks.

>> This is not too different from, you know, where we place a lot of value at Palantir. Like, the ontology layer is where most of the secret sauce is.

>> Yeah. And interestingly, I mean, the problem is definitely not new. It's the classic problem of, do we move the data to somewhere where it's easy to use, or do we build some kind of layer and use it where it is? And that has not changed. I think the thing that has changed is that now, instead of people having to write queries, the model can do it for you. And that means, like, do you have to move the data at all? These are the kind of decisions... I don't know if we have a final answer there, but I think that's a really interesting space, because it would just save so much time in these FDE-type deployments and also just democratize access to the data a lot better.
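To make the "light logic layer" idea concrete, a hedged sketch: a small metadata registry describes each source in business terms, and a routing step (a keyword stub here, where an LLM call over the registry descriptions would normally sit) decides which sources to combine for a question. The source names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    kind: str          # "warehouse" (structured) or "documents" (unstructured)
    description: str   # business-level metadata the router reasons over

REGISTRY = [
    Source("sales_warehouse", "warehouse", "order volumes, revenue, margins by region"),
    Source("sharepoint_policies", "documents", "contract terms, supplier policies, meeting notes"),
]

def route(question: str) -> list[Source]:
    """Decide which sources to combine for a question.
    A keyword stub stands in for the LLM call that would normally read the
    registry descriptions and pick sources."""
    q = question.lower()
    picked = []
    if any(w in q for w in ("revenue", "volume", "margin", "how many")):
        picked.append(REGISTRY[0])
    if any(w in q for w in ("policy", "contract", "terms", "why")):
        picked.append(REGISTRY[1])
    return picked or REGISTRY  # fall back to everything if unsure

print([s.name for s in route("Why did margin drop, and what do the supplier contracts allow?")])
```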

>> What a great point. What a great point.

Um, short?

>> Yeah, so what am I short on? So, and I'll caveat this by saying that I'm biased because I primarily work with enterprises. I think the businesses that I'm short on are the ones that... if you look across the whole of what are the steps to get an LLM application to production, it's like: I've got to orchestrate it, I've got to put tracing in, I've got to label those things, build evals, and then I need to stick some guardrails in. Like, these are the kind of rough components. The companies that do only one part of that chain are the ones that I don't see getting broad adoption at a lot of these enterprises. And the reason for that, I think, is it is difficult enough just to get an LLM application, assuming everything was the same tool, to production. Wiring up all these tools and then dealing with all the integration problems as well is just a bridge too far for a lot of the enterprises, and they'd really want kind of one throat to choke with this problem. So these are the sort of businesses that I think I'm more short on.

>> Makes sense. Favorite underrated AI tool?

>> Yeah. Underrated AI tool. So, definitely... this one's not underrated, I think people like this one, but Codex is definitely the one where, you know, the first time that I just turned it on and then went into four hours of meetings and came back and something was done, I was like, "Oh, this is like magic." But I would say the underrated one is the Playground, like the OpenAI Playground. It's in platform.openai.com, and it's basically as simple as, if somebody...

>> Is this the thing pre-ChatGPT? This is where you could chat with, like, the DaVinci models.

>> Yeah. Yeah. But it's still there, you know. It's still there. And there's actually a new version, there's a voice playground that you can use for the Realtime API. It's basically like somebody just stuck a nice UI over the API, and you can just interact with the API. But the amount of demos I've done with that that blow people away... it's actually such a powerful tool for just asking, is this use case even possible, and you can validate it in the Playground. You know, I'll give one example: I was trying to figure out whether this browser automation use case would work, and I just screenshotted the web page, stuck it in the Playground, started hitting it, and was like, can I, with an N of 10, make this work? And if it's like seven or eight times, then maybe this is going to work in production. It's just such a simple tool, and I think underrated for just doing that quick sense check when you're trying to validate a use case.
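The same N-of-10 sense check can be scripted once you've poked at a use case in the Playground. A minimal sketch, assuming the standard OpenAI Python client, an illustrative prompt, and a crude string match as the pass criterion:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "Given this page text: 'Add to cart | Checkout | Sign in', which button starts a purchase?"
N = 10

passes = 0
for _ in range(N):
    response = client.chat.completions.create(
        model="gpt-4o",                        # illustrative model choice
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = response.choices[0].message.content or ""
    if "checkout" in answer.lower():           # crude pass criterion for the sense check
        passes += 1

print(f"{passes}/{N} passes")  # roughly 7-8/10 suggests the use case might survive in production
```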

>> I'm glad you reminded me of it. That's actually where I had my aha moment before ChatGPT. I was out playing with DaVinci 003. I remember chatting with them on the Playground. I was like, wow.

>> Yeah.

>> Then ChatGPT happened. I was like, well, I told you.

>> Back to the Playground.

>> Yeah. Yeah. I mean, it's an underrated tool. Yeah. For sure. For sure.

>> That's right. You know, a lot has happened since you joined, since ChatGPT happened.

>> What's your favorite, you know, highlight of your time at OpenAI, and maybe a lowlight of your time at OpenAI?

>> Yeah. I mean, my favorite one... actually, the thing that I joined OpenAI for was when DALL·E 2 came out. And I was just like, you know, I try to be creative. Like, I like tattoos and stuff, it's lots of fun, but I'm not artistic at all. And I was like, "Oh, this is super cool. I can express myself." And so we got a chance to do an engagement with Coca-Cola where we did this Create Real Magic campaign, where they were the first partners to get to use DALL·E 3 before it even came out, and we had to make a version of DALL·E 3, and we got to generate all this... And it was the process of tuning this so that it made perfect Christmas imagery with Coca-Cola. I mean, it's sort of just a fun use case. But I just thought it was sort of coming full circle: that brought me in, and then I got to actually ship it to all these people around the world who were using GenAI for cool stuff. I really like that one. It was not the hardest use case, but there were genuine risks, like people could have jailbroken it, caused lots of problems with the model. So it was also satisfying to ship it, but really just sort of a fun use case and a nice consolidation of, like, wow, look how far we've come, you know. It was like from pre-ChatGPT all the way to DALL·E 3. There's that one. And what was the other question?

>> Yeah. Yeah. The lowlight.

>> Good question. So I think at OpenAI we go through this, you know, B2C, B2B sort of ebb and flow, and, given that I've always been on the B2B side, I think there's been times... I think it was maybe from the first Dev Day, like, we released the Assistants API and stuff, and then there was maybe a period until the Dev Day of the next year where it just felt like the company's focus was so much on the B2C side. And I talked about open-sourcing Swarm; like, one of the reasons we open-sourced Swarm is there was just not a lot of interest in this sort of framework, which was viewed as more for a B2B audience. So I'd say that was maybe my least favorite time, in that I felt like in that time we shipped, you know, we shipped Morgan Stanley, we shipped Klarna, all these super cool enterprise use cases, but there just wasn't the investment from the business, and we kind of felt like our words were falling on deaf ears. Until suddenly, you know, towards the end of 2024, the pendulum swung again and then it was time to go all in on B2B, and ironically that was when the FD business case got approved, and then it was like, right, let's go, you know. And yeah, it's been good since then.

>> That's awesome. Well, you know, as you said, ChatGPT was a big story of '23, a big part of '24 was agents, the year of agents. What is 2026 going to be the year of? What's the next big wave, you think? What are we going to be talking about next year?

>> Yeah, that is a great question. So I can kind of only guess here, but I feel like, when people are talking about this year as the year of agents, I feel even now with agents there's still a lot of work to get those things to production. And I feel like if you were to take a step back from agents and fine-tuning and all these techniques, what you really want is an LLM application that you just turn on, and it just sort of learns from your behavior, and then it just sort of works. And I wonder whether 2026 is actually going to be the year where we go back to fine-tuning models. Where it's like, you know, we now have these agentic networks, we orchestrate them, but now we finally have all the plumbing to make the training data, label it really quickly, build a training set, fine-tune a model for specifically this agentic use case, and then suddenly it just works beautifully. I feel like the building blocks have been getting built this whole time. That's why people can now use agents. Like, before, they couldn't even think of doing complex things. And I just wonder whether that's going to be sort of the next frontier, where agents aren't just being used, they're actually being used for very specialist domains where you do need to tune them to be good at, like, chip design or drug discovery or something. We actually get to that point where we go back to fine-tuning models now that we have all the plumbing and stuff. So it's maybe the year of optimization or something, or maybe the year of, like, one click to production, you know, who knows?

>> Yeah, that's such a good framing. I was thinking wearables, hardware.

>> Possible, possible.

>> The state of hardware leaves so much to be desired.

>> It does. It does. Yeah.

>> But, uh, well, this was a lot of fun, Colin. Thank you so much. Thanks for telling us all about forward deployed engineering.

>> Cool. No, thanks for having me.

>> Awesome.
