Databricks Lakebase (OLTP) Technical Deep Dive Chat + Demo w/ @Databricks Cofounder, Reynold Xin
By Josue Bogran Channel
Summary
## Key takeaways

- **OLTP Databases Stagnant for 20-30 Years**: OLTP databases like MySQL, Postgres, or Oracle look more or less the same as in the '90s, while analytical databases have evolved two or three orders of magnitude faster thanks to cloud and database engineering advances. [00:43], [01:29]
- **AI Agents Create 80% of Neon Databases**: 80% of databases on Neon were created and managed by AI agents post-acquisition, up from 30% a year ago, shifting the persona from humans to agents, with humans supervising them. [01:49], [02:06]
- **Neon Separates Storage from Compute for Postgres**: Neon built a safekeeper for replicated write-ahead logs and a page server for caching Postgres pages on S3, enabling minimal changes to Postgres for low-latency OLTP access. [09:46], [10:54]
- **Lakebase Scales Compute to Zero in 500ms**: Autoscaling provisions Postgres nodes in under 500 milliseconds, allowing databases to scale down to zero during idle periods like nights or lunch, changing the OLTP paradigm. [12:02], [13:05]
- **Lakebase vs Aurora: Open S3 Beats Proprietary**: Unlike Aurora's proprietary storage engine, Lakebase uses open S3 or blob stores for lower cost and enables direct analytics queries with Spark or DuckDB on Postgres pages. [15:07], [16:13]
- **Demo: Instant Free Trial Branches**: Lakebase creates trial branches from production instantly with copy-on-write, costing nothing extra in storage until writes occur, and autoscales compute to zero after inactivity. [21:10], [23:14]
Topics Covered
- OLTP Databases Stagnant for 30 Years
- AI Agents Now Create 80% of Databases
- Neon Separates Postgres Storage from Compute
- Scale OLTP to Zero in Milliseconds
- Lakebase Unifies OLTP and Analytics
Full Transcript
I'm here with one of the co-founders of Databricks, Reynold Xin. Did I say that correctly? Awesome.
>> Close enough.
>> Close. Look, we were just talking about my name and how difficult it can be, so I think we're even. Reynold, you want to go ahead and tell us a little bit as to why Lakebase matters so much, not just to current Databricks customers, but to the ecosystem as a whole?
>> Yeah, I hope I pronounced it correctly.
>> Couldn't have said it better myself.
>> Well, thanks for having me here. From a broader industry point of view, and I was talking about this at Data and AI Summit, we felt like OLTP databases really haven't changed that much in the last 20 to 30 years. If you look at analytical databases, they have evolved a lot: we are now probably two or three orders of magnitude faster than we were in the '90s and maybe even the early 2000s, and a lot of that is because we built on foundational technologies that are both cloud-based and also just hardcore database engineering. But with OLTP databases, if you crack open a MySQL, Postgres, or Oracle today, there are a lot of changes, but they look more or less the same as they were back in the '90s. So we felt it's the right time to think about how we can disrupt and build next-generation databases that are cloud native and also agent native.

One thing that was pretty interesting, which we actually didn't know until the Neon acquisition, is that 80% of the databases on Neon were created and managed by AI agents. That's an incredible stat, because a year ago that number was 30%. So it went from 30% to 80%, and if you just extrapolate that line, at some point 99% of databases will be created by AI agents. Historically, databases were really provisioned and operated by humans. With the LLM trend and agentic coding, we'll see the persona change: there might still be a human somewhere supervising all those agents, but the persona is shifting from humans to agents. And that shift in persona comes with a lot of new requirements that historically simply weren't on the requirements list for databases, or weren't even in the top 100. An example: you want to parallelize a lot of different agents to work on a task, maybe have each agent run some version of an experiment, and compare which one does best. At that point, you want to be able to snapshot your data and provide a different instance of the database that's fully isolated and sandboxed just for that one agent, because agents can make mistakes; you don't want an agent impacting your actual data. And every one of these operations has to be pretty fast, because agents operate at pretty high speed. So there are a lot of new requirements coming in that historically were simply not even a concern for database system developers, and we love that. When new requirements come in, it gives us an opportunity to rethink the fundamentals and how we should be architecting databases, so we can build the databases of the future. The databases of the past were built for human operators; the databases of the future will be built for humans supervising a lot of different agents. It's also intellectually challenging, but ultimately it comes down to this: we think we can build a new era of databases that are just much better than what was done in the past, especially given the new persona.
>> And I also think, the agent side aside, apps have obviously been a big investment for Databricks too, and I think a very good growth opportunity. So you have that going for you as well. I know everything is agents right now, but even agents aside, I think having everything together in one single place matters.
>> No, absolutely. Databricks Apps might have been the fastest-growing product in the history of Databricks. It used to be Databricks SQL, but I think Apps actually took that title. It's still early, so it's a little hard to predict exactly how it goes in the future, but it's been growing super fast, and the number one feedback we've been getting from customers is: can I just get a provisioned database together with every app I provision? So we're definitely seeing that trend.
>> Yeah. No, and I just wanted to add that, because again, there's so much on agents, and I know there's a lot of value there. But for me personally, as someone that has dabbled back and forth in application development in the past, having so many of the pieces together helps me actually get stuff done. So I appreciate that part.
>> The truth is, what's really good is that a lot of the requirements for agents, once we can actually support them, end up helping humans a lot too. I would say for humans they're more of a nice-to-have, whereas for a lot of agents they become a necessity. Take what we talked about: creating snapshots and branches of your databases quickly. That's an incredibly useful feature even just for human coding. One thing we see a lot of customers doing now is that for every git branch, they automatically create a branch of the database, and developers really love it because it makes it much easier to test against high-fidelity data.
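As a rough illustration of that per-git-branch pattern, here is a minimal Python sketch of a CI hook that creates a database branch named after the current git branch. The REST endpoint, payload shape, and environment variables are hypothetical placeholders, not the actual Lakebase or Neon API.

```python
import os
import subprocess

import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint and credentials -- placeholders for illustration.
API_URL = os.environ["DB_API_URL"]      # e.g. https://api.example.com/v1
API_TOKEN = os.environ["DB_API_TOKEN"]
PROJECT_ID = os.environ["DB_PROJECT_ID"]


def current_git_branch() -> str:
    """Return the git branch this CI job is building."""
    return subprocess.check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"], text=True
    ).strip()


def create_db_branch(name: str) -> dict:
    """Create a copy-on-write database branch off production.

    Because branches are copy-on-write, this should return quickly and
    add no storage cost until the branch diverges from its parent.
    """
    resp = requests.post(
        f"{API_URL}/projects/{PROJECT_ID}/branches",  # hypothetical route
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"name": name, "parent": "production"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    branch = create_db_branch(f"git-{current_git_branch()}")
    # Hand the isolated, high-fidelity database to the test suite.
    print(branch.get("connection_uri", "<connection uri>"))
```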
>> Yeah, I know some founders tend to be more on the business side, some more on the technical side, and I know you are definitely on the heavy technical side. So give us a little more detail as to how the architecture really works underneath the hood for Lakebase.
>> Yeah. So Lakebase is built on technology we acquired from Neon, and the interesting thing is we had been working together long before the acquisition. That actually led to the acquisition itself, because we realized how great the tech was.

Earlier we talked about how analytical databases look very different today from 20 years ago, and one of those changes was the lakehouse. The lakehouse has one very important property, which is separating storage from compute: you can store your data in massive volume in cloud object stores, like S3 or a blob store, and then launch very ephemeral compute instances to process the data in the object stores. There are many, many advantages here. One of them is that it enables elastic scaling: you can dump in as much data as you want, launch more compute resources when you have a big job to run, and shut them down when you don't. It's much more cost-effective, and cloud object stores also tend to be probably the cheapest durable storage medium for data.

What the Neon team designed and built is a similar architecture, but it's harder to accomplish in OLTP. In analytics, a 100-millisecond query is considered pretty fast, so the tail-latency spikes of object stores are fine. But in OLTP you want to be able to process queries in a millisecond or less: a second is way too slow, and even 100 milliseconds is often considered too slow.
The other thing is that Postgres has kind of won and become the lingua franca of OLTP databases. If you look at all the database usage trends, the Stack Overflow surveys, the DBMS rankings, Postgres is going up super fast.
>> A large community too, right?
>> Yeah, a large open source community, and I think thanks to that large open source community its usage and adoption is growing like crazy. But Postgres fundamentally does not separate storage from compute. If you go to the vast majority of Postgres providers, they give you a box. The box has storage in it and compute in it; you cannot independently scale them. If you need more storage, you now need more compute, and even that move itself is very difficult, because it's fixed-capacity provisioning. So the Neon team figured out a really interesting way to implement separation of storage from compute for Postgres specifically.
They built this storage service; there are actually two key services that sit on top of cloud object stores like S3: a safekeeper and a page server. The safekeeper, think of it as a replicated write-ahead log. The way databases work, instead of modifying the data in place, they write to a write-ahead log first, and the safekeeper is basically a distributed, replicated write-ahead log service. The page server is the service that handles the actual Postgres pages: Postgres splits all its storage into small, granular pages, and the page server is a distributed storage system for all those pages. It ultimately stores everything in S3, but because we don't want to pay the latency hit of going to S3, it's a distributed farm that caches as much of the data as we want, to provide much lower-latency access. Then they made extremely minimal changes to Postgres to swap out the underlying storage layer. They found a very narrow waist in Postgres, so instead of writing to local disk, both the write-ahead log and the data pages just use these two services on top of S3.

There are a lot of benefits made possible by this interesting architecture. One is that the storage cost is super low, and storage is automatically durable: it's as durable as the cloud object storage, which is often multi-AZ or multi-region. Whereas in the past, when you got a fixed-capacity Postgres instance, that storage was only as durable as the storage on that box. If that box goes away, you lose all your data. So people came up with backup solutions, or used something like EBS, which is super expensive compared with S3.
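To make the division of labor concrete, here is a toy Python sketch of the two roles just described: a replicated write-ahead log service on the write path, and a caching page service that fronts the object store on the read path. The class names, the in-memory S3 stand-in, and the replication scheme are all illustrative, not Neon's actual implementation.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for S3/blob storage: durable but comparatively slow."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._blobs[key] = value

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class Safekeeper:
    """Toy replicated write-ahead log: records are only appended, never
    modified in place, and each append goes to every replica."""

    def __init__(self, replicas: int = 3):
        self.logs: list[list[bytes]] = [[] for _ in range(replicas)]

    def append(self, record: bytes) -> int:
        for log in self.logs:            # replicate the WAL record
            log.append(record)
        return len(self.logs[0]) - 1     # toy log sequence number (LSN)


class PageServer:
    """Toy page service: keeps the durable copy of every page in the
    object store and serves reads through an LRU cache to hide its
    latency."""

    def __init__(self, store: ObjectStore, cache_size: int = 1024):
        self.store = store
        self.cache: OrderedDict[int, bytes] = OrderedDict()
        self.cache_size = cache_size

    def write_page(self, page_id: int, data: bytes) -> None:
        self.store.put(f"page/{page_id}", data)    # durable in "S3"
        self._cache_put(page_id, data)

    def read_page(self, page_id: int) -> bytes:
        if page_id in self.cache:                  # fast path: cache hit
            self.cache.move_to_end(page_id)
            return self.cache[page_id]
        data = self.store.get(f"page/{page_id}")   # slow path: object store
        self._cache_put(page_id, data)
        return data

    def _cache_put(self, page_id: int, data: bytes) -> None:
        self.cache[page_id] = data
        self.cache.move_to_end(page_id)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)         # evict least recent
```

The real system is far more involved (the page server, for instance, replays WAL to materialize pages as of a given LSN), but the shape of the write and read paths is the point.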
The other benefit is that it makes super fast autoscaling possible on top of this architecture. The Neon team built an autoscaling service that provisions compute nodes, Postgres nodes, in less than 500 milliseconds. And because you can do this now, I think it changes the paradigm both for agents and for humans; this isn't just a marketing term, it changes the paradigm of databases. While there is one class of databases that is extremely latency sensitive, where you simply cannot afford a tail-latency spike at all, I think for most applications, especially the internal applications a lot of enterprises build, it's okay to have a latency spike of a few hundred milliseconds every once in a while. You don't want seconds, but 100 milliseconds is fine, because human perception time is roughly 300 milliseconds. If we can provision and acquire compute resources in hundreds of milliseconds, we can scale the database itself entirely down to zero, and not pay for it when there's no traffic to your service, which happens all the time. You might have a lot of traffic for an internal app in a specific time zone during nine-to-five, but past those hours it drops to approximately zero for many companies. For many companies the lunch hour drops too, or in some cases you have a spike at 9 a.m. because everybody gets into work and starts looking up some app, and then at 10 a.m. it goes down. Not to zero, but it goes down. So they built this autoscaling architecture that lets the database adjust its resources dynamically based on load, and go all the way down to zero. This is super cool, because I think it's probably the first time you can have an OLTP database be so responsive and so elastic.
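A toy sketch of the scale-to-zero control loop this implies: suspend compute after an idle window and cold-start it on the next request. Only the sub-500-millisecond provisioning figure comes from the conversation; the timeout, class names, and mechanics here are illustrative.

```python
import time

IDLE_TIMEOUT_S = 300.0   # e.g. suspend after 5 minutes with no traffic
COLD_START_S = 0.5       # provisioning budget cited above: <500 ms


class ScaleToZeroEndpoint:
    """Toy connection front end that wakes compute on demand and lets a
    background controller suspend it when idle. A real endpoint would
    do this behind the Postgres wire protocol, so clients just see a
    slightly slower first query after an idle period."""

    def __init__(self):
        self.compute_running = False
        self.last_request = time.monotonic()

    def handle_query(self, sql: str) -> str:
        if not self.compute_running:
            time.sleep(COLD_START_S)      # simulate the cold start
            self.compute_running = True
        self.last_request = time.monotonic()
        return f"executed: {sql}"         # stand-in for real execution

    def reap_if_idle(self) -> None:
        """Called periodically by a background controller."""
        idle_for = time.monotonic() - self.last_request
        if self.compute_running and idle_for > IDLE_TIMEOUT_S:
            # Compute cost drops to zero; the data stays durable in the
            # object store, so nothing is lost by suspending.
            self.compute_running = False
```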
>> One of the comparisons that I've heard and read some people make is: isn't this similar to how Aurora works on AWS? So I'm curious to hear your thoughts on that.
>> Yeah, it's a great question. I think at a very high level there are some similarities, especially since Aurora might have been the most popular separated-storage-from-compute OLTP offering out there. But if you look deep enough and think about the technical details, it's actually very, very different, and that leads to very different benefits for the end user. One of them is that Aurora started more than a decade ago, and at the time the object stores were much worse, so they had to build a completely proprietary storage engine that is not open at all and that only Aurora services can access. Whereas on the Lakebase side, Neon is built on the cloud data lake: it's built on S3, it's built on Azure Blob Storage. This might not seem like a big deal, a proprietary storage system versus a more open storage system, but there's actually a massive difference here. Engineering-wise it's more complicated to make this work on object stores, so what's the benefit? Why should the user care? One reason is that cloud object stores are far cheaper than EBS: EBS is considered more of a high-end network storage, whereas cloud object stores are more of a data lake, and data lakes are fundamentally cheap. You just have much better unit economics.
The second thing is that because it's built on cloud object stores, the data stored in it is also open-spec and open source: it's literally just Postgres pages. So we're kind of recreating the interesting lakehouse paradigm here, except for OLTP: you have the actual source of data stored in a cloud object store, which can be very high throughput, in an open format. So on top of it, we're not just building an OLTP database with Postgres able to access that data. We haven't done all of it yet, and this is still in the works, but it enables the possibility of having, for example, Spark directly querying all the data, or DuckDB querying the data, provided there's the right adapter for it. Because it's an open format, and honestly it's just Postgres pages, either we or the community could be creating those adapters.
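Because a Postgres heap page has a documented on-disk layout, an adapter can start by decoding the fixed 24-byte page header. Below is a simplified Python sketch of that first step, following the standard 8 KB page layout; a real adapter would go on to walk the line-pointer array and decode tuples, and the input file name here is hypothetical.

```python
import struct

PAGE_SIZE = 8192  # Postgres default block size

# PageHeaderData, 24 bytes (assuming little-endian, as on x86):
# pd_lsn (xlogid, xrecoff), pd_checksum, pd_flags, pd_lower, pd_upper,
# pd_special, pd_pagesize_version, pd_prune_xid
HEADER_FMT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # == 24


def parse_page_header(page: bytes) -> dict:
    """Decode the fixed header of a Postgres heap page.

    pd_lower/pd_upper bound the hole between the line-pointer array and
    the tuple data, so (pd_upper - pd_lower) is the page's free space.
    """
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, size_version, prune_xid) = struct.unpack(
        HEADER_FMT, page[:HEADER_SIZE]
    )
    return {
        "lsn": (xlogid << 32) | xrecoff,  # last WAL record touching page
        "checksum": checksum,
        "flags": flags,
        "free_space": upper - lower,
        "page_size": size_version & 0xFF00,       # size is packed with
        "layout_version": size_version & 0x00FF,  # the layout version
        "prune_xid": prune_xid,
    }


if __name__ == "__main__":
    # Hypothetical input: the first 8 KB block of a heap relation file.
    with open("relation_block0.bin", "rb") as f:
        print(parse_page_header(f.read(PAGE_SIZE)))
```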
So this opens up a lot of use cases, and I can give you an example of what that might be. One very simple one: there used to be this giant divide between analytics, meaning the lakehouse, data warehouses, data lakes, and OLTP.
>> Yeah.
>> When your OLTP application becomes successful enough, you want to start applying analytics over the data. How do you do that? You start trying to build pipelines to get the data out, and you learn complicated concepts like CDC and all that. But if the actual OLTP data is in an open format on top of a data lake, which is basically the lakehouse now, it's no different from lakehouse data: just go read it directly. You can read it using Spark, in a massively parallel fashion; if you have a petabyte of data, go read it, it's not an issue. I think that changes the scaling story and also makes a lot of operations simpler.

Every large enterprise operates tens of thousands, if not millions, of databases, and they are disjoint database instances, because the storage is not shared. By having all of that storage in one data lake, basically the lakehouse here, you can now enable workloads that sweep across all of them. For example, say you want to do a global scan. Hypothetically, suppose you found one of your databases got compromised, and a table got created that says, hey, wire bitcoin to this address and I'll give you your data back. In the case of Lakebase you hopefully don't have to worry about getting your data back, because we have time travel and snapshots of the data, but at the very least you want to know: what's going on with my other databases? Did anything else get compromised? How do you do that across maybe a million different instances? If they are all disjoint, it's an incredibly difficult, time-consuming, laborious job. But if all of them are actually in the lakehouse, you could write a single Spark query to scan all of them, or use DBSQL if you prefer SQL. That, I think, is going to be revolutionary to how we operate and manage databases as well.
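As a sketch of what such a fleet-wide scan could look like once those adapters exist, here is hypothetical PySpark. The `postgres_pages` data source, the bucket path, and the column names are invented for illustration; as noted above, these readers are still in the works.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fleet-scan").getOrCreate()

# Hypothetical reader that decodes Postgres pages in the object store
# into rows. No such adapter ships today; the format name and the path
# layout are placeholders.
catalogs = (
    spark.read.format("postgres_pages")
         .load("s3://lakebase-storage/*/pg_catalog/pg_class")
)

# One parallel scan over every instance's table catalog, flagging any
# database where a suspicious ransom-style table appeared.
compromised = (
    catalogs
    .where(F.lower(F.col("relname")).rlike("ransom|bitcoin"))
    .select("database_id", "relname")   # database_id: invented column
)
compromised.show(truncate=False)
```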
>> Okay. This is not a super complicated demo; I just want to quickly show the developer experience and how fast it is with Lakebase. This is just your standard Databricks workspace UI, which any Databricks user should be pretty familiar with. But there's this new thing here, a little dropdown that goes to this thing called Lakebase. In Lakebase you can create a new project, which is effectively a database with many different potential instances. Let's name it, say, "Josue test", and it's going to create a new one here. So now we've created a new database project, and everything is super snappy, super fast. In this case, creating it automatically creates a development branch of the database as well. If we just come in here, the primary replica of the production branch is now available, so let's run some sample queries. It's running, and all the SQL queries have completed; it's fairly fast.
fairly fast. But what is uh interesting is if you want for example to create a trial branch of this production
like my trial um and you could actually uh have it automatically disappear. Um so for
automatically disappear. Um so for example if you know that you are uh only using this for less than a day you could do it uh but if you don't it's even okay and then you can uh create a branch
based on whatever data that's in your production branch you can actually based on past data I mean we don't really have anything in the past we just create it here you just create and now that branch is created and this branch is actually
it's effectively free uh because he has this copy on write uh architecture that unless you start adding more and more data to it doesn't really cost you
anything. Um, it only cost you when you
anything. Um, it only cost you when you need to use some compute um against it, which hopefully is pretty short because you're using it for development. Um, and
it only cost you if you start writing additional data to it. But otherwise,
you can have like say a terabyte of data in your production and then you create a branch. Um, the branch is copyright. So,
branch. Um, the branch is copyright. So,
it doesn't actually cost anything additional. But then you actually get
additional. But then you actually get all the branches um for basically as separate um independent branches that
each of them is kind of its own mini Postgress but just as a storage layer um they were pointing back to the same uh storage. Um so this trip quite literally
storage. Um so this trip quite literally it's it's not that you are duplicating the data in itself but it's your bas
>> yeah you are you're it's it's a logic it's a logic level basically that's ultimately in in a very simple way it's a logic that's telling you hey this are the records that should belong here
these are the records that should belong in this other branch.
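Here's a toy Python sketch of that copy-on-write idea: a branch starts out as a pointer back to its parent's pages and only materializes private copies of the pages it writes. The names are illustrative, not Lakebase internals.

```python
class Branch:
    """Toy copy-on-write branch: shares parent pages until written."""

    def __init__(self, parent: "Branch | None" = None):
        self.parent = parent
        self.pages: dict[int, bytes] = {}  # only pages this branch wrote

    def read(self, page_id: int) -> bytes:
        # Walk up the branch chain until some ancestor owns the page.
        node: "Branch | None" = self
        while node is not None:
            if page_id in node.pages:
                return node.pages[page_id]
            node = node.parent
        raise KeyError(page_id)

    def write(self, page_id: int, data: bytes) -> None:
        # The first write stores a private copy; the parent's copy is
        # untouched, so a branch costs storage only as it diverges.
        self.pages[page_id] = data


# A fresh branch of a large production database stores zero new pages.
prod = Branch()
prod.write(0, b"customer rows ...")
trial = Branch(parent=prod)
assert trial.read(0) == b"customer rows ..."  # shared, not copied
trial.write(0, b"scratch edits")              # diverges: one private page
assert prod.read(0) == b"customer rows ..."   # production is unaffected
```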
>> Yeah, exactly. And in this case we're doing autoscaling from 0 to 4. Those units approximate the memory, the hardware, but you can make it super small, and you can have it scale to zero. So if there's no usage after about five minutes, this whole thing just disappears, but if it needs to come back, it comes back in hundreds of milliseconds.
>> So the compute is disappearing, but you're still paying for the storage.
>> You're paying for whatever you have in storage, but you're not really paying anything more for compute in that moment.
>> And the storage is super cheap.
>> Yeah, exactly.
>> So this is basically the entire experience. It's super simple, and it's a full-blown Postgres; we didn't really show that here, but it is. You can connect to it, and you could use the built-in SQL editor, but the vast majority of people will actually be connecting to it from their applications.
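Since it's full-blown Postgres speaking the standard wire protocol, any ordinary client library should work. A minimal Python example with psycopg2 is below; the host, database, and credentials are placeholders for your instance's connection details.

```python
import psycopg2  # standard Postgres driver: pip install psycopg2-binary

# Placeholder connection details -- substitute your instance's endpoint.
conn = psycopg2.connect(
    host="your-instance.example.com",
    port=5432,
    dbname="postgres",
    user="app_user",
    password="...",
    sslmode="require",  # hosted endpoints generally require TLS
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 16.x ..."
conn.close()
```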
>> Right. And this is the new and improved Lakebase?
>> Yeah, so it's a little bit confusing. We announced the Lakebase public preview at Data and AI Summit, and that is more of a fixed-provision Lakebase: you provision whatever resources you want, and that's it. You can reprovision up or down, but it takes a while. It's still separated storage from compute, the same Neon storage technology under the hood. What I have shown here is an improved version of that, in beta right now, which has basically all of neon.com's technology. So it's not just the storage part, but everything: the autoscaling, the developer experience, all of that. That's where we now have snapshotting, branching, backup, restore, and everything is super simple; it's just in the UI, or you can use the command line, and all of that works. It's also much snappier. We actually see this as the future, eventually replacing the current provisioned Lakebase.
>> It's kind of like a branch of Lakebase right now.
>> Yeah, exactly.
>> That's awesome. Thank you for that.