TLDW logo

How to write your own Deterministic Simulator

By TigerBeetle

Summary

## Key takeaways - **Testing trumps coding in DistSys**: The hardest part of building a difficult software program isn't the coding, it's actually the testing. These systems are very hard and you spend so much time testing them. [04:07] - **Formal models tease but don't deliver**: Formal models like T+ give certainty that a design is correct by checking safety properties, but they only confirm it's not impossible to implement, leaving uncertainty in the actual code. The model tells you it could work if coded perfectly, but implementation fails. [11:04] - **Raft proven yet implementations broken**: Raft has formal proofs and models, yet almost every implementation is broken, leading to data loss and split brain, as shown in GitHub issues and a 2018 University of Wisconsin paper, because proofs ignore real-world disc failures and OS behaviors. [19:50] - **DST superpowers: fast-forward replay**: DST turns simulations into fast-forwardable movies where you control logical time, accelerating physical time by 700x in TigerBeetle, allowing massive state space exploration across space and time in seconds instead of hours or days. [31:00], [38:15] - **Determinism forces superior design**: DST requires controlling the event loop and making software logical, revealing atomic stretches and pruning nondeterministic trees into bonsai-like structures, transforming code into a reasoning tool with explicit guarantees. [42:37], [46:14] - **Simulator late, found 30 Heisenbugs**: TigerBeetle wrote the deterministic simulator a year in, taking three weeks to implement, then found and fixed 30 Heisenbugs in three weeks at five per day, bugs that would take months to reproduce otherwise. [54:43]

Topics Covered

  • Formal Proofs Fail Implementations
  • DST Fast-Forwards Bug Discovery
  • Determinism Scales State Exploration
  • DST Forces Superior Design

Full Transcript

and we are live hello everybody thank you very much for joining I am Dominic too and I'm super excited to be here on this uh live stream with uh yuran the

CEO and founder of uh tiger beetle and also the one who uh introduced me to deterministic uh simulation testing hey

yuran hey hey Dominic I'm also so excited I think we were working out we do this once a year and it seems like it's always a February and now now I think it must become a thing you know

this is so it's always so nice to to chat with you yeah up to this point we just in the preparation we just looked when did we do this last time and it was February right so up to this point it

wasn't planned but from here on out we should probably plan that def definitely all right and then today's topic is of course uh deterministic

simulation testing so that's also reflected in this uh live streams name the deterministic dojo and uh yeah as I said you

introduced me to the concept of uh deterministic simulation testing up to that point uh I did not I was actually not aware of uh deterministic simulation

testing and where it has been used in the ecosystem up to that point so maybe it's like you can kick it off by um telling everybody how you got into

it yeah sure so I should start by saying it's this principle of determinism and it's so powerful because if you if you think

about just take a step back and think how you make things and and you apply determinism into the mix um you know so T I don't think tiger beetle would be

here today if it if it wasn't for this and it was touch and go so it was July July 2020 um we were I I was designing tiger

beetle and up to this point point you know I've actually I realized you know this year I turn a certain age which means I've been coding for 30 years and I think most of my journey in coding was

okay let me learn how to code and I wrote you know basic and it was uh and and you could do things there with cars and sign in those early days and both it it was actually a flight simulator that

was one of the first um programs that I made was like a graphical simulator of a plane and then but most of like the time

of being a coder was like coding uh and then along the way it was like well you know you should write tests and and out was testing and in the few years before

tiger beetle I was doing some very difficult projects that were distributed so they were running software a program that runs across machines and I kept

failing and failing and failing um and then I wrote uh I wrote like a a simulator that could simulate the program running on real machines and try

and break things there wasn't determinism but there were also early parts of you know tiger style that were using tiger beetle with assertions and various techniques and and finally like

this the third version of the system it worked you know then it really worked but that that was coming into tiger beetle so there was learning how to code

um there was learning testing like unit tests I did a lot of like fuzz tests uh it wasn't always a enough you know just to fuzz individual modules of

code and then there were elements of tiger style in in this pre- igab project um and there was the idea of a simulator so you know you've gone from coding to

testing to simulating testing simulating and then in July 2020 there was this you know this of course we're going to we're going to do fuzzing and it's a simulator it's like chaos engineering like Jepson

you've got a harness these these are very powerful things um also because you realize that the hardest part of building a difficult software program isn't the coding like I used to think

it's actually the testing that you know these systems are very hard and so you spend so much time testing them and so coming into tiger beetle the the thing there was how can we take a step back

and think about how we think how can we engineer the engineering and and I happened to read this blog post by suj jokar at Dropbox on the testing they did

and they were bringing in more determinism so not only simulating but simulating the whole system in a repeatable way so that a

given simulation it's like the worm you know worms the game or Sim um uh scorched I think scorched Earth the levels for those games were randomly

generated from a seed but if you had the same seed you could generate the same game you know and so this whole idea is not only to simulate but let the whole

simulated environment um uh spring from a little number you know and uh like a big bang and it all goes out and so they

they were doing this at Dropbox and uh and so I thought we should we should do that because my previous experience was you know if you have these simulators it's very powerful but it can take you

you know you will find a bug but it can take you a year that you know the bug is there that that you have to fix it and so yeah Beetle was kind of the The Next

Step where it's like not only testing not only simulation but this principle of determinism and that's how I got in but how about you Dominic I had uh I had uh some like was

like just listening right I can hear I can hear some some uh similarities and uh parallels so and I also realized how

how old I actually am because I'm also like on this journey for about like 30 years so I also started with B

experience yeah yeah okay I also started with basic so but um so when I when I

started my my uh career and um say the starting point of like the the serious part right of the career when I when I

um joined sap in in Silicon Valley in 2008 um I was just excited right so it's like I was just excited and I was like

uh you know any young engineer just like run run run right but um over time I I

became frustrated like run run run was was not was not enough because with with every step I took I also kind of

realized and and understood what I what I don't know right and uh that I think is is one is on this like experience ladder when you climb the experience

ladder right at some point you think I got this right I I I have it all in the back and then you you take a few more steps and then you come to the realization oh I I I don't have this

right so and um so that was then especially true when I moved into distributed systems which when I joined sap 2008 that was the year or in 2006

was the year the the cloud came about right right so it's like I started um my career with the cloud in lock step and with the cloud in lock step that basically means that it's like every

system is a deter is a is a um distributed system and uh at some point just it's not a question if your system is going to be a distributed system it will be it's only a question of how

distributed is it right the degree of distribution and we saw that eventually when like microservices were all the then a degree just exploded right

and um then I what I what I find my I found myself at a spot that was marked by uncertainty it's

like I was just uncertain right I was constantly uncertain about my systems I was uncertain about like what is going to happen next what is going to happen

in this case I was just uncertain right and um it is I feel I I I feel at my worst when I uncertain

because then I'm not confident and um it's it's a it's a very like uncomfortable

feeling and my initial reaction was basically it's like I want this confidence I want certainty right and at that point I started looking what is out

there that that grants me certainty and uh so my first like um touch point with that certainty was not

deterministic simulation testing it was um formal modeling specifically uh t plus and then also to a degree alloy and

that basically means model checking right and then I consumed and at that time there was actually not that much material and especially not material for

like practitioners and beginners but um so I consumed everything there was and big part of

that was a Blog of H Wayne he did a lot of um uh like beginner oriented and

practitioner oriented uh models in t+ and I realized this like t plus is a thinking tool or formal models as a

thinking tool right just like all of a sudden gave me certainty right it's like I I learned a certain way to think about systems right that what I like to call

the distributed systems brain and then I found other authors like for example Martin kman and consumed anything they everything they wrote right and it's like I basically flexed my my

distributed systems brain and and tried to develop that muscle and um I enjoy it this you see this model and the model specifically

States everything that can happen right and then it methodically explores the space that Behavior space right

incredible of course Behavior space right Behavior space of Real World Systems can be infinite right um but the tools do a fantastic job uh to cover a

large ground right and so they methodically basically and I say that like just quote unquote methodically check everything that's

interesting right so you can basically say they walk out the entire space and then they say then they tell you it's like if your system is correct right is it good and that was an incredible

feeling incredible feeling but unfortunate unfortunate yes now you know your design is right yeah but not the

implementation man yeah so it's almost teasing you it's saying you know it could work if you if you code it perfectly it could worka

exactly yeah so it basically now tells you this model tells you right uh by checking its liess properties by checking the safety properties it tells

you that it is not impossible to implement this right which is good right that is that

is good but that it doesn't get you any further than that right so then that is still there is a lot you still have to cover

right yeah that's when the uncertainty creeps back in and came back you worse because you you know it

was possible but you know but we failed but you know but it can be done but yeah not for us it is it is almost like a difference from you never caught

something or you caught something but then it got away right so it's even more frustrating yeah so exactly it's like teasing you right but then you

introduced me to the deterministic simulation testing and then I consumed everything there was about that so it's like all the YouTube videos uh from the folks that did Foundation DB and then

also the Articles you recommended about uh Dropbox and um it was basically this

promise that formal models made right it's like we will replace this daunting uncertainty

with nothing but certainty right and uh but now it promises this actually for the implementation right not for a model

of your system but for your system right and then it's like oh man at this point you laid the perfect trap

right you catch it and you keep it yeah yeah that was my that was my journey into deterministic simulation testing and I I remember the first time we

chatted so it was a VI stamp replication made famous sticker that that we had made you know because this was our consensus protocol for tiger beetle this is what was scary for tiger beetles like

we have to write a distributed system a consensus protocol is like climbing the matter horn or Everest and you know it can be fatal to this is one of

the hardest things to attempt you know for me that's how I saw it and so I knew you know I actually saw it was funny on Twitter today James Carling posting who

was the one of the authors of the revised VI stamp paper in 2012 with barar lisov but he he was saying you know um I actually asked him for advice

as I set out to climb the matter horn for Tiger Bei I said you know James fan of your work at Dropbox we've picked view stamped any tips please you know um

I know this is not easy he says you know just have fun it's a fun protocol but then today on Twitter he says well you know learning consensus is like learning to to fight you know so that you know

you don't want to do it and then yeah so I love that you picked the the deterministic dojo because that came to my mind you know well there is a way to do it with

determinism but again it was James and them at Dropbox he was also you know on that team at Dropbox doing these techniques so but I remember then you and I connected over view stamp and I

just showed you our deterministic simulator running in my terminal and you said to me but this is like formal methods or you know model checking but on the code and I guess

obviously there's a little bit of gaps you know some formal methods can actually go you know you can annotate your code and um but it's not you know yeah it's it's still different and and I

guess sometimes you can get formal proofs that the thing is entirely 100% correct and obviously DST doesn't promise that it just gives you confidence that as far as your as far as

you are exploring the state space um it's not finding you know what what you tell it to find so you you also have to do all of that work but I remember how

how it clicked for you as I showed you the probabilities in the simulation and the you know the the categories we would then use it but this is like model

checking but on the actual code so it was an incredible moment for me because it's like

the the um consensus algorithm this actually it's a perfect example so when I started with distributed systems and

when I heard about consensus algorithms and that was before um formal models right before ta plus before alloy before um ma or anything

else you read about the consensus algorithms right and then it's like raft is also a great example so you read about it and the rap paper itself right

the the promise the promise that ra makes is not the consensus algorithm the promise is going to be it's going to be easy it's not it's not it's not it's

really not and you you try to think through it right and you try to think and then it's like I watched the YouTube videos where people like explain the different steps and different transitions of and it's like I always

came up with a question with one question that I like couldn't answer just based on the paper or couldn't answer just based on the video that was always a situation I couldn't answer

right but then when I got into formal models and you look at um the and this like raft because of its popularity you can almost argue that

that um raft is the hello world of of formal models right so it's like every single formal model um or a formalism that has ever been invented has been

applied to Raft yeah U because it's like everybody understands the problem everybody understands that the problem is really really hard and then you see it as like it's like does

specification um the the model does that does that look good is that relatable is that tangible is that something that I want to use right and then you can work

your way through the t plus model for example and at that point you start to understand you you get it it's like you can just like and it's not even that

long right and you you can go through it and you start to understand and whatever question you have you will find the path through the model right that is like oh

this is how that works so really really very powerful but at that moment oh I'm sorry oh just to jump in I I I I couldn't help but I love how you said it's like a thinking tool and I remember

you saying to me it's also it's a great communication tool for foral methods but yeah back to you Dominic yeah but at that moment whenever I was talking about somebody or to

somebody about like um consensus whether was raft or any other protocol but let's say like here is the model of raft right

which is the tool proofed this works right um here's a model of raft and then it's like oh should we implement it I was like no don't just don't do not

implement this it's like at that time I know there was a very popular implementation of uh raft I believe it was a goang library made by Hashi Corp where you could switch out the state

machine but they imped the protocol around it and I was I always recommended people before you ever ever think of implementing ra go to this repository

and just look at the issues that are opened right and then you will see that these issues right they basic they do not Trace back to this model right so

then you have some issue of a combination of uh this uh goang version on this operating system in this version

right or uh then you have um other situations when you say as like hey wait a second how is that behavior of uh fs and we can think FS gate and other

things all of that just is an implementation problem that does not make it into the mar right yes I I love that you know it's formally proven and

yet almost every implementation of raft is broken and can lose data and can into split brain and have catastrophic cluster data loss and there's a paper on

this from University Wisconsin medicine from 2018 protocol aware recovery for consensus based storage and the this is

the irony is yes there's a formal proof but the proof wasn't in the real world you know so if you run this code on physical Hardware the code is broken and

not correct the protocol is is broken um because the proof for the protocol didn't consider all the fault models so it assumes that diss are

perfect and and then again as implementers this doesn't really help us because again look at the GitHub issues

um and discs do fail you know so um yeah and generally you will not right you will not find this in the models

right you you just won't you will not find it in the model then but eventually there's like there was the next thing that I came across then and I listened

to some presentation of Kyle Kingsbury right where there's like oh wait a second right now we're actually looking at the actual

implementation yes and it started to feel like hey that is that's kind of close what the what the model checking

tries to do but on the system but also not so it's like hm so it's definitely trying a lot of the things right that um

where you think well I wouldn't find it in the model but I will find it in a real world implementation right so it it comes and

like adds latency to the network right a lot of models can actually deal with that right because they make very few timing guarantees but then it does something it manipulates the clocks on

the Node show me a pla plus model that does Z will not yeah oh yeah

it's yeah I I was also doing a bit of security work before tiger be and I realized that you know hackers always go through the back doors um there's always

there's a little Gap semantic gap between the system as modeled and like the code has written and hackers just walk through these doors and they chain them together but coming into tiger be I

realize well distributed system safety is you you spin that security coin around and the same techniques that can stop hackers just you know protecting

your interfaces and closing the semantic gaps between model and code yes this also helps you know and I I guess I should say too what you were saying earlier you and I both love this

principle of you know endtoend principle so I think it's just this desire that we're like how can we test the actual code please like we because you know

proofs for consensus you know VI stamp dra P house dime a dozen but like how can we now you know please can we test the actual code um end to end um you

know in a in a real like real that it works some real Hardware so guess that's just what we were trying to do and and just to get some certainty not not even

100% proof but exactly it's better than what we had so exactly and I love uh what you said it's like that's basically

a nutshell right so it's like yes it's formally proven correct a dime doesn't in in many different variations but formally proven

correct yet basically every implementation out there is broken so it's like oh my God right so that is

that is just mindblowing that disconnect right and uh yeah and again uh yepsen was was a or is a is a very

cool tool right that um uh is like tries to close the gap but um is short of the deter the

real the real binary too you know it's like it's the real thing you know it's a real thing it's extremely effective yeah it's a real thing like on real Hardware

on real software on the real operating system in the right version on the real runtime in the right version right and it's incredible right it's incredible it

get it gets you it gets you far right but to be fair deterministic simulation deterministic simulation testing gets you a step further and it's that step that closes

the gap for me where I have enough certainty to be actually confident and why why is it Dominic what's the difference between simulation

you know where where you you now are simulating the real thing or you're also doing fault injection we've we've got you know there's we we being we thinking about fault models what are the

guarantees does this program give to the user um you know we have Network fault model we have storage fault model diss

fail process fault model processes crash uh clock fault model you know some you know some projects even have like a CPU fault model and memory fault model but I

guess then you're into like Byzantine fault tolerant distributed systems which you know so it's kind of like let's choose how how hard we want to make it

for ourselves so tiger de it's a storage Network memory and clock process um and then ECC is the answer you know um and and there's there's also ways of

detecting you know you know compute errors if you have a lot of assertions you know um but I guess yeah so I guess this is the question what's the difference so we you know we've got

these fault models we've thinking about guarantees we want to make we injecting now faults we've got the simulation but what's the difference because that that is what you know je Jepson would do and

is effective with that but what's what's the little Gap you think um that because you and I have also done these techniques and we love them you know they're so powerful yeah and yet you know

um you know the obvious answer the answer that like anybody immedi medely right you tell somebody about deterministic simulation testing anybody immediately understands immediately

grasps and also immediately sees the benefit from it is the reproducibility and that is basically what you mentioned right you see the you

see the big bang and from that big bang the entire universe unfolds meaning that if you seed yet another Universe with

the same seed it unfolds exactly the same way right so so the reproducibility is basically the obvious answer here and the reproducibility is great right

because there is a difference between finding a bug and then um fixing a bug right yeah because if you do if you do random

simulation testing right you may find that bug you may never find it again right and then it's also very reassuring because you I mean this actually happened to me I wrote a simulator of a

dist you know the pre- tiger beetle um other system that I worked on and it found a bug and it could find this bug and it took me a year to fix it yeah and

I knew it was there like so it's so reassuring you know that your system's broken and you you literally takes a year it it actually turned out to be a c

it turned out to be a buffer overflow in lib UV on Windows if you use like an event notification mechanism and if you

have multiple very long directories and if you have multi-te codings and if you are combining NFD NFC normalization

conversion all of that had to be chained together so and it took a year to figure that out um and that was actually the first C code I wrote was to fix a buff

overflow in the BB which is why I Zig um yeah but my very first SE was you know finding a buff overflow uh that is jumping into the deep end yeah but that was when I realized like you know and

coming into tiger beetle I saw oh there's this principle of determinism that means I don't have to wait a year to you know to fix it because I can just replay it so quickly because otherwise

you have to you know run the simulator in real time you know um and and it'll never get to that result because it's so requires these probabilities all to

align you know once in a year bug to find um but maybe we should come back to this so it's all about repr ibility replay but what what would you say to

people that say you know DST is not that great because obviously testing is all about State space exploration that's that's the superpower what what would you say you know this is that true I

want I wanted to actually ask you yeah so I take the I I took I took the easy I took the easy answer right I took a reproducibility of the board yeah then I

was like but what is it uh what is it for for you I know that this like reproducibility is a big part right and it is it is a fantastic property no question because exactly we don't have

to wait a year right but there is more to deterministic simulation testing that is not as obvious as

reproducibility that's it yeah so I think um sh shall we leave that as like should we build some tension here so what what what's what's DST really about you know even even Beyond being

deterministic you know being producible is that where we're going that's totally where we're going unfortunately I didn't line up any sponsors for that video because otherwise I could play an ad now right yeah this would be the time for

the ad is that like but to tell everybody that DST is not about having reproducible simulations and and fixing bugs faster uh exactly exactly okay

unfortunately we don't have we don't have any any ads lined up so okay so let let's imagine that we've got a movie Dominic so so like if d as a movie um

you're saying that now for the first time we can watch if we like a movie again if we like a movie we can actually watch it again we can we can replay the

movie and that's nice yes yeah and we can we can replay it again and again and we can now add you know logging and and and fix it um but I think this is the thing that surprised me you know doing

this with tiger beetle is that that's how I see DST it's a it's a movie that you can replay but before that actually it's a movie that you fast forward and

so you can you've got the movie um and you just you fast forward it you because even time becomes something in your model that you can accelerate it you DST

is all about taking the physical random non-deterministic stuff network storage you know um time and you make that all

logical you know so you have a logical clock that you tick um and in this whole simulation you you control everything you even the clock on the wall in the

Matrix you just spin spin that faster so now actually your simulation is a movie that you can rewind yes but the real superpower I think is that you can fast

forward that movie to see if it has a happy ending and if it does you just forget that movie and you watch a new one okay happy ending okay forget it a new one and each each of these are just

little seeds and from that random seed you have a random stream and you use if else decisions like 50% probability let's do some fault or you know and you

you can there's lots of techniques we can come back to that but I think this is the real thing is that it's a movie you can fast forward to find bugs you know if the movie has a a sad ending

it's like Choose Your Own Adventure then you can replay and see where it went wrong you know butterfly effect and and make sure it doesn't oh that's good yeah but I think this is the other thing you

know is that it this is just determinism is a principle so DST is really what we're saying is we believe in testing yes we believe in and and all the

literature of all test techniques you know for good testing yes simulation yes but let's also make our movies fast forwardable and replayable so it's

actually it's just complimentary really I think to all the literature on state exploration minimization all these things are all compatible with the DST

so I think maybe sometimes you know there's a misconception that you know DST is a bit overhyped because like but you know State Expressions where at why

not both you know uh we don't have to be exor determinism is a principle um and we should be bringing it in U because I I I think it's a good idea you know that

um yeah so I think that's kind of the most surprising thing for me is that um it's just a it's a very good philosoph

iCal principle and and like philosophy you can now do things in The Matrix you know yeah what what are your thoughts on

on all of this um what should we dive into a bit more yeah I uh I very much agree so the the way so with reproducibility out of the way right

where we say yes it's now reproducible we can look at like the other um other enablers basically is like what does what does DST do for you

and I think there's like there are multiple levels to that there's like one more subtle than the next so the first one the most obvious one is reproducibility the next one is

um where DST enables scale and DNE uh DST enables scale across space and across

time so what what what do I mean by that right so we can we can view a system as a as a generator of behavior right so

it's like I I I have a system and I tell the system run right and then the system generates a behavior transitions from state to state to state to

stages so um turns out though systems have a happy path right and that's why that's why we have a word for it happy

path right so systems have a happy path so if I if I just if I just tell the system to run right it's like it's going down its happy path right and if I tell

the system to run a thousand times the chances are very very high that it goes down its happy path like 999 times right

because it's just very very rare that I accept a new network connection in the exact same moment that like a configuration reload is triggered it

just happens very rarely right but behind this specific path through the system there may be a bug right that's why I love this word Heisenberg right so

so from the Heisenberg uncertainty principle yeah so comes out of nowhere as soon as you try to look at it it's gone right so then it's like you you

measure it you can't see it again yeah so um but with with uh DST right and uh it's like and again it's like you seed

the universe right you seed the universe you seed it with a thousand different seeds it goes down this particular path right and it's not the same path that

the next seed is going to generate and it's not the same path that the next G seed is going to generate so if I just kick the system into gear and a thousand

times it's like chances are high the system is just going to go down like a thousand times basically same path but with DST I have a chance to actually if

I kick it into gear a thousand times to go down a th distinct path right so that's what I think is this idea of it

scales across space right State exploration yeah yes exactly it's a state exploration the other thing is

this uh idea of scaling across time right it's also what you said systems are time dependent right is like we have

time distributed failure detection usually depends on timeouts it's like um detecting message loss depends on a timeout if you need to test something

like rate limiting here you have a time interval right so real systems depend on time so if you want to test a system over a one system run over the over the

course of 24 hours you got you need 24 hours right so but with DST if you can virtualize time and you want to test the system over 24

hours depending on your test setup but it's like if in doubt it's less than a second right that's and now right and now you you multiply that it's like

scaling across time or scaling across space scaling across time and then you see it's like what is like the simulated runtime how long would I have to run

this system on how many machines to get the same experience the same coverage that I get here in a mere

hour it's incredible is just incredible yes I've actually chatted with people and they've had this misconception about DST that it's it's slow because you you

run it logically single-threaded and there's no multi-threaded parallelism and then I say to them but you know that you can make time this physical time you could make it logical and speed it up in

while true Loop and you can go 700 times faster so you don't you don't need and and then they're like okay wow you know never never thought that you can fast forward the movie again you

know so like like you say like scales in space and time time it's it's that kind of philosophy you know you take the physical and make it logical and

determinism encourages you to to do to to think like this and then now you look at your event Loop in node.js or libuv a little differently you know your timers

timer mths frequencies that are every 1 second doesn't have to be a second just if you simulate your clock you know spin it faster um so I think tiger beetle we

get like a seven we we we've worked this out roughly it's it's probably changed but it's roughly like a 700x speed up over physical time so one

second gives you 700 seconds and then you can run this on Parallel you know you take your single threaded simulator but Run a thousand of them so we we do have like a thousand dedicated CPU cores

they run a thousand simulators different seeds you can also do better than random so you can use different distributions like swarm testing you can CL um so much

so again it's all complimentary I think to existing techniques but it's this principle that is so powerful yeah

absolutely and for me there was so this is again reproducibility is like the most obvious right everybody like carries that on their sleeve when they say DST

reproducibility great the next one is a little bit more like subtle right where it's like you already have to like think about a little longer or have a bit of a longer conversation that is like this

scaling across space State SP State space exploration uh scaling across time speed up of time but then there's there's yet more

right there's yet more to come and that is well for Det deterministic simulation testing right your your system has to be deterministic and um there are now guess

we should say this is this is the the sad news if people are getting excited you know someone I saw in the chat wants to ask you know can you apply DST to spanner

um or or should you hire 300 engineers in QA um oh yeah it's like if you it's a good question right if like if you don't have the time to do it right the first

time right do do you think you have the time to fix it over and over again a good question I mean it might you know at some point it might be quicker just to rewrite it and and save on the you

few hundred test Engineers yeah but as like as we know like lat there's also you can get you can get uh determinism on like different

levels right so there is for example um antithesis who provide a deterministic hypervisor so that you can run physically uh what we can consider an otherwise non-deterministic system in a

in a deterministic way and that has a that has a lot of advantages in it of itself because of course now you can say we can apply this to a pre-existing

system but I in addition this like I am a big fan now big big fan of uh deterministic simulation testing basically on the application Level you build your application with

deterministic simulation testing in mind because it is a transformative ative experience um also not just for ourselves but for our system and for the

code the way that it is built and uh the first time I implemented determinism which is basically for resonate so is like uh me and my

co-founder when we when we started we started on a on a we had the luxury starting on a Green Field right so we started implementing the resonance server with deterministic simulation Tes

in mind but it was it was a transformative experience beyond the beyond the testing because it turns out that also DST is a

is a reasoning tool also DST is a reasoning tool about your code so the first thing you need to do if you want to do DST you need to take control of

the event Loop right of the schedu nobody ever takes control of the event Loop right we just throw something on to the python event Loop onto the

node event loop on to pick your favorite platform and throw your code at the event Loop right yeah and I know I know you really you you really love go you

know and you you had a a fight with this the scheduler there right yeah yeah love is a strong word but it's like so we did we did choose go

for the for the um we did choose go for the for the uh resonate server and um if you do that if you good choice

by the way um it's operational heaven right because out comes a little binary that is self-contained and off you go yeah but

um if you're looking for DST then it's like it's development hell right because it's like go doesn't provide you any of these

capabilities definitely not out of the box right you have a set of of um go routines that's basically the main unit of abstraction right or the main unit of

decomposition so you have a set of go routines and you throw that just like at the runtime right and you do not know you do not

know the yeah the cut points the slice points you do not know it's like is it going to

run these two lines is it going to run these eight lines of code you also don't know that that may change at any point

in time right so you have this massive massive nondeterministic tree of

possibilities that if you have to right so now you also communicate with uh channels and messages over channels where it's like go says this is the main

way of like synchronizing it's like where you prune this tree again right where you come back in turns out even for go that was not that

wasn't true right so now we also have mutexes and semaphor and whatnot so first off we let the thing explode right it's like completely explode and then we

try to rain it back in in certain in certain situations right so it's like oh but in this case let's like let's let's bring it all back in right and uh with

DST if you implement it to be deterministic on an application Level and you take control over the event Loop one thing that happens is it's like what

manifests is the atomic stretches of your application right because now you also have to implement um not preemptive

but Cooperative concurrency control now you have Atomic stretches you see that your system promises you that right it's a promise this will never be interrupted

I'm not promising you that this will run before that or this will run before that but I promise you once this happens it will run to completion before something

else will happen that's that's amazing a transformative exper your code your system gives you so much more guarantees so much more to latch onto to reason

about the possibility of your system right this wild Behavior tree is from the very beginning nicely pruned like a bonsai it's fantastic it's absolutely

fantastic yes I it's been my experience too you know I thought you know started off with coding get into testing and figure out how can we condense testing

time because that's you know 9/10 of the you know it's the reason why consensus normally takes 10 years to get right and then you still have you know all the issues with a non deterministic system

so like let let's engineer that problem okay determinism and then you have this this wonderful discovery that my code got better the design got better it's a forcing function I've got a first class

event Loop I'm thinking in terms of like the scheduling I want you know and and I think languages in future are going to get better um because determinism is is

is now a thing you know and uh so yeah but even you know with Zig we didn't use zig's event Loop because we wanted to control but tiger Beetle's design got better the

architecture so I love that too you know that it's a forcing function for good design because it forces you think of your guarantees that you make think about what is correctness what are the

what are the faults we need to survive you make a matrix of what you know what you know network storage Etc um and then test that and and on the real code you

know um so all all these things just get better being more explicit um I I did think one one thing too is that what's Curious I think you know with with

resonate you you knew from the beginning DST and and I I gather how for you like how quickly did you actually have your simulat up and running like were you

coding and then after a while you wrote the simulator or was it like upfront and you already had a simulator and then coding in it upfront upfront we we

immediately started with uh the the simulator the the harder part was uh was

for goang right to find um an abstraction a unit of decomposition right that is not the go routine um find a unit of decomposition

that also provides us a good developer experience then um to to express ourselves in so uh what we ended up

doing is we we implemented a kernel which is a a schul and instead of go routines we use Co routines right and

then uh so we implemented co- routines on top of G routin and channels and we inter leave them so that only one

goutine at a time can be activated by the goang schuer therefore we have full control over the over the scheduling okay so you were careful to

take everything that's physical or UND deterministic and Abstract it logically so that you have a complete logical piece of code correct correct and then

we uh then we were also inspired by by uh I uring um to to basically say we have this we have this deterministic core right and then

we hand something out into the non-deterministic world and this eventually comes back and it's phased back into the deterministic world right

yes and uh that is basically all we had to all I mean but what the what we had to simulate them was basically well schedu for the for

the um Co routines yes and since that doesn't allow any concurrency that's super straightforward and then we had to simulate basically the subsystems what

we call subsystems where we hand when we hand IO requests to the outside world had to had to simulate that and um then

everything came together very naturally that's that's also what I love about it is that it forces you to think where where the L if we had a t-shirt you know I would I would have a t-shirt

that says software should be logical because if your software is truly logical and I mean logical you know that it's not non-deterministic but

if your code is truly logical you can do DST but if your software is illogical let me not say that but uh you know with great respect but I think you know

software should be logical and so DST determinism forces you you know as a if you're architect in your system designing interfaces where are interfaces touching the physical world

think very carefully there and and just I found you know three good tips in tiger beetle minimize surface area of the interface make it very narrow then because you know the more

surface area the more State space explosion you're going to have to try and hunt down in your testing later so you know people often say like how how do you do state exploration and in tab

we like well design it way in the first place like try and reduce dimensionality in the first place now narrow those interfaces in what the state explosion they will allow because you will pay for it later in your

testing you know so so narrow narrow the The Logical to physical interface wherever you wherever you touch the physical world narrow those points then align them in terms of your

fault models what you know how can that physical Network or storage fail make sure your interface is aligned so in tiger beetle if you write to the dis we're not trying to make the interface

like posix because that would be a nightmare you know the state space explosion uh so don't do that you know don't be careful where you model your

your you know where that where that logical physical translation happens in tiger beetle if we want to write to the dis it's the function is Right sectors

here's a buffer at this location um there's a call back and that's it there's no error so it doesn't promise anything maybe it wrote it maybe it

didn't but that's how discs work that's now explicit to The Logical program The Logical program is now responsible for checking and and handling disc

corruption writing to the law so it's like minim minimize your interface surface area design dimensionality try and save yourself from State exploration

letter align to the the fault models you know of of that particular physical interface um and and promise nothing like the interface should make no

guarantees the less guarantees you make the easier it is you know down the line because that's just how the real world is um don't get your hopes up you know um you know otherwise we we you know

doing formal proofs you know like back in the day but we we're we're not being realistic um the real world everything just fails even where it promises not to

it still fails um so yeah I think that's that that's also yeah that's just been some of our experience um but oh do Dominic guess guess how long it was for

tiger beetle like do you think we started with the simulator or or when you know how long into the projects until we had the vopper I'm sorry you were you were the

the audio was cutting out I could not understand your question what did you what did he say okay so so guess for tiger beetle you know did we start with a deterministic simulator or when when

did we create it I take a guess oh I would think are you asking when you started to include the determinism into tiger

beetle or when you implemented the specifically the whoer when we actually started testing it like when yeah when when did because I think with testen driven design people think it's like you

know you write the test first so maybe in DST you you like you build you have to build the simulator first from the very beginning if you want to do this but you know for Tiger be guess guess when we actually built the simulator

when we actually just started testing huh that is a very good question I never thought of that okay now the way you ask the

question is like leads me to believe that it was not the very beginning but it must have been pretty soon though yeah it was actually at least a year so

we I think maybe this is I I I say this because I think I hope it's encouraging to people if you start with that t-shirt software should be logical principle of determinism if you understand these

ideas from the beginning I mean this is what we planned for Tiger be in the design thinking of all these things in the very beginning we only wrote the simul letteror year later it

was a whole year um oh that wow that is pretty late so it was a year of work you you know where you're going you know the principles um and we were quite

comfortable like we went off we go for a year um took three weeks to actually create the first version of our vopper simulator view stamped operation replicator Tri to war games Whopper uh

so it took three weeks to make the simulator it didn't take long just three weeks to write our test code and then we

ran it and then found 30 30 really like heisen bugs in three weeks uh so each day we were fixing five bugs a day and

the these were bugs that sometimes you know to reproduce they require zero millisecond round trip times on the network we we would each of these bugs would have taken a month two months but

we like fining fixing five a day this was just the first three weeks you know after switching on the simulator I mean yeah so you and as you develop new code

you're always finding bugs but um I think this is the idea is that DST doesn't have to be tdd you can just have the principles in mind write your code you're you're going to re architect a

lot of your codes so probably you know you can benefit from letting that simmer a little bit and churn and then you'll know when the right time is to give you velocity then you switch on the

simulator and then um yeah we had a we had a very interesting experience in that regard because we started with DST right but because of

the principles that you mentioned this um the the the connection from the from the logical to the

physical so my my co-founder and I eventually we called it the the not the distributed systems brain but the DST

brain so if you if you if you work with DST right you start anticipating before you even run the simulator right uh

because you you basically you everything is very explicit now this is a scheduling point right here the schul may choose something else this is an IO

point so this is concurr non-determinism introduced by concurrency right this an IO Point here um the nondeterminism introduced by distribution basically

partial failure right here a request May either be successful or it may generate it may generate a failure right so now it's everything all of these points are

incredibly obvious right they turned out to be so obvious that um our DST

like didn't find a bug it's like my co-founder David was like kept texting me all excited and was like oh oh there

like there's something else and then I was like did the DST find it I was like oh no no I just thought of it I was like

for the love man stop I want the DST to find the bug yeah so the the DST made it so obvious that um we actually from the

very beginning in the resonate server had a very high level of uh of quality and the DST just found the most

outrageous things if like this request and this request is basically duplicated and comes in in the very same batch but from two different clients something we

like it would just basically never happen right it's like the perfect hacker it's chaining like eight exploits together and each of them are incredibly

rare but it finds it you know and then it finds it correct so we were fixing bugs from the very get-go well was like in a million years I don't think anybody who deploys that system would actually

ever see that bug but we were we were fixing these Buck from the get-go an incredible feeling an incredible feeling yes it I I'll never forget the feeling when we

switched on that simulator like I remember listening to Kings of leyon crawl like that was plain and the the simulator was just burning

CPU and I you know coded for many many years and then you finally like you determinism the

principle kicks in and yeah I would I would never I mean at tarer Beetle we would never build any system with without this principle um yeah I think maybe we should give people some you

know some hope though what if what if you've got a big system that that is doesn't apply this you know what what could you do is there such a thing as a deterministic computer um and I think

neither of us have sharen antithesis so uh so like we really like it's just they've also you know Pioneers from Foundation DB but they actually built a deterministic computer so if you have a

non- deterministic binary put it in the deterministic computer and you can wear the T-shirt software should be logical um and they they do a ton of State

exploration and testing you know um but then what would you say Dominic like should you only have like an external audit function like a deterministic you

know auditor that you know like antithesis what or should you only have an internal audit function like your own

simulator what would you say um I would I would go for both um because the and the the reason for that is that it's

like first off this uh this um transformative experience right cells like in it transforms how you design your system and it's like the implementation of your system will be

different and better so I do like this application Level um application Level

uh uh DST and um taking control of the event Loop and taking control of your failure models and also the scheduling

points and failure points is incredible in terms of system design but that doesn't preclude that you can

also run that now in a uh deterministic computer right because the only thing that happens is you you can only find more you you can find less right and

it's like you know Will Wilson said to me Swiss swiss cheese model if you have I mean if you can have two determinist simulators great you know because now if one misses something the other finds a

Swiss Cheese model defense defense in depth another way to stop hackers you know just if you're going to be a bug in in resonate or tagab Beetle good luck but you'll have to get through two

simulators um I mean that then then I think at that point it's like maybe three or but I think two is enough you know internal audit external audit uh I

could actually also spring and I also have no shares uh in or no stakes in in uh yepsen but uh it's a application Level simulator platform level simulator

like the humanistic machine and then as like AEP test together um for for the

for the for the sheer realism that yepsen is is able to to to produce right very very realistic setups of systems

that yon is is able to produce like starting from the the software version of that operating system right in the software version of that compiler that you used in the software version of that

runtime that you used so very realistic settings that it that it can and at that point I think you basically reach the Mount Everest of of T is it's just not going to get any higher at that point

it's like yes if you if you have the resources and you can formally prove every single line of code good for you I don't think we will ever get there but

so it's like the DST and the Fon in combination that is something that is doable and the return on investment is is definitely there I'm so glad you say

that because that's kind of also how we think of tiger like Kyle's skill is phenomenal like he um because you see this is the other thing is that it's it's great to have a deterministic

simulated world but you need creativity you need the human you know you need you know are we actually doing a complete model Checker like how are we covering the whole product surface you know the

everything that's thing and Kyle is just great at that like the actual you know he he he yeah it's it's just phenomenal so yeah I think that's that's really I

think this is the way to do things you know have DST then use Jepson and and at tiger beetle we kind of thinking maybe I should say this but we've actually

written a non-deterministic simulator ourselves it's called Vortex so we've got a blog post coming soon with details but uh yeah so this is kind of where

it's at I think is that you do 90% of your local development on your machine DST DST DST but there comes a time where you want defense in depth and a

non-deterministic simulator can be great you know what you can do with it it's then it's like a canary because you should never be finding bugs there if you do you know there's something wrong

with with the automated deterministic CI yes yes yeah yeah very very well said so that would actually be if if uh

yepsen or and I'm already now super excited to read that blog post but if yepsen or something like Vortex finds a bug I Would by no means jump on fixing

that bug I would immediately jump on the question why did the simulator not find it and that's that's what's so amazing the feedback loop now that you get and I think this is the other thing you know

you it can take you a week or two three weeks I mean it took us longer at tiger beetle I always feel slower and slower but like it takes you three weeks to like set something up if you have the

principles in place that we've shared but then you never stop investing in that because whenever you you're just making your Simulator more and more protocol aware um and I think this is

the difference with the deterministic computers that if you have your own in-house simulator as well and a and an external audit your own simulator can

get very you can start pushing things to the theoretical limit of your protocol um the fault injection can be and that's what we do with tiger beetle

so and you just keep investing in that so exactly as you say you find a bug the question is I'm not even going to try and fix this under the non-deterministic simulat vortex why didn't the vper pick

this up and and and this is exactly what we do at Tiger be all which which is really fun you know so and that is actually that is a that is uh probably

also since we're a little bit over time a very good very good closing closing thought is that we also basically pay

tribute to the evolution right of of uh deterministic simulation testing because as you said once you once you adopt that

you keep improving the deterministic simulation testing itself right yes and that just keeps improving the uh systems

many times over right and so it's it's it's just complimentary with everything like you say it's Evolution but you you still use Jepson you all

these techniques it's just comp complimentary you know so that's and I I love like you say I love it to like pay tribute you know all the pioneers and we also just really lucky you know like

lucky timing you know that someone else you know suj James shared this idea of determinism yeah very true standing on the shoulders of giants

and having fun and having fun and having fun in the process it's like I was I was just lucky I saw the sticker yeah of the fuse stemed replication that's what

started this journey yeah no me I mean imagine if I hadn't chatted to you I mean there's there as you know there's like impact on tiger beetle

like that sticker you know all comes back to consensus which is only possible with DST so yeah all comes back to consensus okay for so

I guess eventually we just have to have James on the on the call as well yeah awesome hey thank you very much

thank you very much yuran that was really really fantastic I enjoyed that a lot thank you everybody for for uh joining I saw that there was like lots

of activity in the in the threat the the talk was so engaging I couldn't uh uh jump on the questions but I think um

there like the participants also uh chatted with uh each other so um thank you very much for that for everybody who jumped in and posted

references and um like answered questions that other uh other people in the chat in the ecosystem had right so I it's like just all here and then paying

it forward helping each other super cool that's a that's a really fantastic spirit and and um yeah if you have any if you have any questions it's like um

it's like of course I don't want to speak for yuran but from personal experience he's very open and very helpful so I think it's totally cool reaching out to yuran it's absolutely

cool reaching out to me so if you wanna if you want to know anything if you want to brainstorm if you want to just bounce off ideas or have uh questions please don't hesitate and uh contact me you can

either find me on Twitter or you can um join the uh resonate uh Discord and yuran what's the best way to reach out to

you uh put a DM uh email your on tiger beetle and yeah second everything Dominic says what do the Vulcan salute

Dominic the WAN salute that should be this no yeah yeah so I think I guess we say live long and prosper long and prosper that's right

yeah keep keep yourself logical yeah yeah any anytime always great to chat so yeah so love to you know help people out and I like like how you say you know

paid forward so as people have done to us you know exactly exactly very cool all right then it's like the next the

next uh um deterministic Dojo scheduled for uh February 26 yeah but uh until until until then yeah there will

definitely be a lot of chatter about a DST outside of the deterministic doo okay I know what we'll talk about next year there's something new on the horizon like what what's the next thing

you know yeah what is the next thing are we already talking about it this year uh I think we'll have to speed up

time and and find out but uh it will be interesting to see you know what what what we've got C cliffhanger just like Cliffhanger but I'm sure that movie has a happy ending so that's a that's a

that's a good part yeah all right so then thank you very much uh everybody thanks again uh yuran and uh then I see everybody virtually

and hopefully soon thank you thanks so much Dominic bye-bye

Loading...

Loading video analysis...