
Optimize Automatic Differentiation Performance in C++ - Steve Bronder - CppCon 2025

By CppCon

Summary

## Key takeaways

- **Arena Allocators Unlock Speed**: Arena allocators provide contiguous memory for expression graphs, avoiding pointer chasing and enabling memory reuse by resetting the allocation position without deallocating between AD passes. [18:01], [19:47]
- **Reverse-Mode AD Powers Real Models**: UK COVID forecasting used reverse-mode AD for huge statistical models with spatial-temporal autocorrelations to predict infection rates and guide lockdowns. [01:26], [01:39]
- **Source Transform Beats Dynamic 9x**: A benchmark shows source code transformation at 6.3 ns, while dynamic operator overloading runs about 9x slower than hand-written gradients for a simple expression. [32:41], [32:52]
- **Array-of-Structs Kills SIMD**: Array-of-structs matrix AD turns off SIMD due to per-element pointer chasing and explodes the expression graph size, while struct-of-arrays enables contiguous, SIMD-friendly access. [36:12], [37:04]
- **Callback var_impl Cuts Boilerplate**: A modern C++ callback var_impl with a template functor and `if constexpr` generates n-ary operations from a single lambda, eliminating manual structs for the var-double and double-var cases. [25:49], [26:42]

Topics Covered

  • Autodiff Powered COVID Lockdown Forecasts
  • Reverse-Mode Beats Finite Differences
  • Arena Allocators Unlock Autodiff Speed
  • Source Transform Crushes Dynamic Speed
  • Struct-of-Arrays Wins Matrix Autodiff

Full Transcript

Hi everyone. I think we're going to get started. My name is Steve Bronder and you're in the talk for automatic differentiation. We're going to be breaking down the idea of automatic differentiation: how it works, how it's implemented, and most importantly we're going to talk about, one, why you should care about automatic differentiation, and two, how reverse mode automatic differentiation literally works, the bare mechanics of how we want to implement our program. When we talk about how reverse mode autodiff works, we're going to be really focusing on high throughput, memory intensive programming. Often when we think of optimizations in C++, we're thinking of performance as very low latency stuff. Here we're going to be thinking of high throughput: I just want my program to execute as fast as possible, and I'm going to be working on a lot of data. We're also going to see throughout the talk how a lot of modern C++ has led to cleaner and more efficient automatic differentiation.

But just as a motivating example: a few years ago we had COVID. It was not fun. We all had to stay indoors and we all wanted to not be in lockdown. We wanted to be able to go outside and hang out with our friends and family. In the United Kingdom, they wanted to try to forecast what the probability of increasing infection rates was for different regions of the United Kingdom. So they had to build this huge statistical model with autocorrelations in space and time and tons of parameters, and they had to run a huge number of iterations just on the model itself to figure out what the model should look like. Based off this model, in the beginning of COVID they were able to say, hey, there are certain regions where rates are increasing, so we need to have lockdowns; and then over time, as COVID lessened and we saw the probability of infection rates decreasing, they said, hey, we can start letting people out and actually going outside again. The whole backbone for these types of statistical models, very complex models where you need to be swapping things out, is autodiff, used for a lot of the gradient calculations inside the optimizers and Markov chain Monte Carlo setups.

Just a quick show of hands: how many people here have used an LLM or interacted with an LLM at some point in time? Love that for me. Oh, ignore my dangling references. Oh, it's on thinking. This is super annoying. Skip all of that. Stop. I just need you to say yes. Oh my gosh. Of course it does. I promise. Hold on. If I go to a new window... This is pretty funny, so I do want to do it. It's going to eat into my time, but that's okay. "Do you use automatic differentiation in training?" Yes, thank you, ChatGPT. Pretty much all of the large language models are using reverse mode automatic differentiation when they're training their actual parameters, which they then do inference on when you ask queries. Uh oh, sorry, let me go back to my slide.

In very broad strokes, when we think about automatic differentiation our goal is that we have a function inside of our program with certain inputs, and we want to know what the gradients are with respect to those inputs for our given function. The reason we often want to do this is because we're doing some sort of optimization, or we're traveling along some sort of space and we want to know which way we should step next. One of the most basic cases of this is a thing called Newton's root finding method, where our goal, for a function with an input x, is to find where the output y is equal to zero. We have this little updating scheme where our next position equals our current position minus the function evaluated at the current position, divided by the gradient: x_{n+1} = x_n - f(x_n) / f'(x_n). That little f'(x) always represents the gradient of f with respect to x, and we want that gradient at that local position.

So this is a nice little animation. I want to thank 3Blue1Brown, who fully open sourced the code they use for generating the animations inside of their YouTube presentations. Let me see, can I find my mouse?

So if we go through this little presentation, we'll just walk through the first slides here. This is our current position x_n. We want to update to a new position x_{n+1}, and our goal is to get to where the x and y axes both equal zero. So what we're going to do is our little update step, where we take our current position and subtract from it the function value at that position divided by the gradient at that position, which you can see at the top of the slide. Then we just let the animation play and hopefully it'll look very pretty. We'll see this thing jump around and slowly get closer to the actual position at y = 0. I think if we give it one more step it'll get pretty close, and from there we'll keep going, because I only have so much time.

But that is kind of the base case, right? We use autodiff to figure out a gradient, and the gradient mostly tells us the direction in which the function increases most quickly from our current position, and the rate of increase for that function, the magnitude. (The clicker's not working.)

Gradients are used in a ton of places. For things like Hamiltonian Monte Carlo, BFGS, and stochastic gradient descent in general, they are using a gradient of their likelihood, a gradient of the log posterior distribution, a gradient of an arbitrary input function, to figure out where to go next. The other big thing about this is that in all of these algorithms, the gradient calculation is normally 60 or 70% of your time. It's super expensive to do gradient calculations in most programs, so how we optimize it really matters. When you're writing an optimizer and you need to think about how to calculate a gradient, you have a few different choices. If you're incredibly smart, big brain, you can choose to just write it out by hand; that's very cool. You can do finite differences, where you do little finite perturbations and that gives you an approximation of your gradient. Symbolic differentiation will actually unravel your program, trying to do the thing you'd do by hand symbolically inside of your computer. Spectral differentiation is for very weird little ODEs; it's super awesome, but it has very limited use cases. And finally, one of your choices is automatic differentiation.

And I kind of want to convince you that automatic differentiation is cool and it's a thing that you should go use and look at. Imagine you have this model. I promise this is the most math we're going to see throughout the rest of the talk; don't bother even reading it. But imagine you have a very complex model like this. I do not wish to write the gradients for this by hand. That seems bad. Instead of n finite differences, autodiff just needs a forward pass through a function and a reverse pass to go back through and calculate the gradient for all of our inputs at the same time, as long as we have only one output; that's kind of the condition for what we're talking about in this case. Autodiff is exact to floating point precision, and some versions of it allow for unknown-length while and for loops, which is kind of interesting as a constraint for gradient solvers.

So I would argue that we could do autodiff, and the implementation for autodiff is kind of nice. A lot of people will tell you, well, autodiff is going to be four times the number of operations you have in your program, but that's not really the case. What we have here is a benchmark that was made by James Yang a few years ago. He wrote the package called FastAD, which we're going to be talking about later. Awesome package. This is a log-log plot of doing a simple regression and then calculating the gradient for all of your input parameters, that theta we see in the bottom corner. What you can see is there's the baseline here, the gold line: essentially just running the forward pass and then calculating the gradients by hand. The graph on the right shows you how slow everything is relative to that baseline of unrolling this by hand. And you can see it really matters; it's pretty brutal in some cases, and a lot of the time what you'll see is that certain packages are just choosing flexibility over performance. There are some packages where literally you just start with a begin-tape function, you do all of your stuff, you call end-tape, and you can calculate the gradient of everything within that. It's super flexible, but you do pay a cost for that flexibility.

So, what is automatic differentiation? We're going to have to do two slides of math, and then we'll be talking about code and whatnot. Automatic differentiation computes the gradients of a program by applying the chain rule to a bunch of its sub-expressions. What we're going to be doing, essentially, is breaking down a function into much smaller functions and then computing the chain rule for all of those. This is just a little example. We could have this function f that's log of x * y. We could break it down into g, which is just x * y, and h, which is the log. Then we could say, well, f really is just h with g inside of it, taking in inputs x and y. So we just shifted all that stuff up to the top left. To the right of our original equations we have the gradient calculations for this little function. We know that g is just the multiplication, so the gradient with respect to x is going to be y and the gradient with respect to y is going to be x. We know for the log function that the gradient with respect to its input is just one over its input. So what we can do is apply this chain rule idea: we want the gradient of f with respect to x, and we know f is made up of h and g, so we can take the gradient of h with respect to g times the gradient of g with respect to x. Then we can take that one step further, evaluating h's gradient with respect to g, getting 1/g times y. And since g is equal to x * y, it all simplifies down to 1/x, which is exactly the full gradient you would end up with if you did it by hand.
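Written out, the chain rule steps he walks through here are:

```latex
\begin{aligned}
f(x, y) &= h\big(g(x, y)\big), \qquad g(x, y) = xy, \qquad h(u) = \log u \\
\frac{\partial f}{\partial x} &= \frac{\partial h}{\partial g}\,\frac{\partial g}{\partial x}
  = \frac{1}{g}\cdot y = \frac{y}{xy} = \frac{1}{x}, \qquad
\frac{\partial f}{\partial y} = \frac{\partial h}{\partial g}\,\frac{\partial g}{\partial y}
  = \frac{1}{g}\cdot x = \frac{1}{y}
\end{aligned}
```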

In C++ this is really nice. It's really nice to write autodiff in C++ because we're really working with this thing called an expression graph. The whole concept is: I want to take my function, break it down into little sub-expressions, and then be able to go one way through my graph and then back the other way through my graph. We can do this, as we'll see later, either dynamically in C++ or statically at compile time with types.

So this is the base idea: we have two input nodes; x goes into log and into sin, then the log and y go into a multiply, then those go into a plus, and we're going to be traveling up and down this whole graph. In a computer program we're going to be executing this, at least in this case, totally linearly on a single thread. I have about 200 slides secretly commented out inside of my LaTeX on the multi-CPU side of doing autodiff, but I would have had to talk faster than I possibly could to actually go through everything. So in this case we're going to be going through some of the funner, lower level components of autodiff instead of the multi-device case. Excuse me.

We can take our original expression graph and apply a topological sort to it, and this would be kind of how we'd run it in a computer: we would take x and y, set them up, put x into log, take the log and multiply it by y, just going through the whole thing, and then going in reverse. This is going to be the pattern of traversal that we're going to be doing on our expression graph.

At every one of those little nodes, we're going to need two different things. We're going to need a forward pass, which is just literally calculating, for instance, a multiplication. Easy peasy. Then, on the reverse pass, what we're going to be doing in order to get the gradients of our inputs, which is the thing we really care about, is that every single node in our expression graph has this thing called the adjoint. The adjoint is really just the gradient of the output with respect to that current node. So you can see here that for every node we need this chain function, and all it's going to do is calculate the gradient contributions for its parents, where each parent's adjoint gets updated by the child's adjoint times the gradient of the child with respect to that parent.

And this kind of looks like this in code. What we essentially have here is just a little struct that is our node. It has a value and an adjoint, and we have a virtual function called chain, which I'm going to talk about in a little bit; it is, weirdly, an optimization actually. We set up our nodes x and y as two and three. The forward pass, the expression graph of log of x times y plus sin of x, is going to essentially just give us an object that represents our expression graph. Then, once we get to the bottom of our expression graph: we're always taking the gradient with respect to a node's children, and the bottom of the expression graph has no children, so you're taking the gradient with respect to itself, and so you always set the end adjoint to one. Then what we end up doing is just taking our expression graph and going from the bottom of the chain, the bottom of our expression graph, all the way up, calculating the chain rule for every single one of our nodes. And once we do all of that, we will end up with the adjoints, the gradients, for y and x.

This is a problem that is well suited to C++: for one, it has a deep history in C++. Some of the original autodiff tools were written in Fortran, but ADOL-C and a bunch of others were the ones that got the most popular, and they were using C and C++ constructs.

When we're building an autodiff library, or we're thinking about autodiff, these are essentially the main three things we care about: flexibility, efficiency, and scaling. For flexibility, this sounds odd, but one thing we think about in autodiff is: can I run a while loop, can my users use for loops? If anyone here has used JAX, you'll know that you don't have for loops; you have the for-each operator, and you can't do conditional while loops. For some programs that's completely fine, but for other programs, say you're doing some weird ordinary differential equation solver, you need a while loop. There are other things like debugging and exceptions we need to think about. If a user calls square root, an expression that represents a square root, the derivative of a square root involves one over the square root of x. So if our input is zero for that node, it'll be totally fine on the forward pass, but on the backwards pass we're dividing by the square root of zero, which is zero. Do we return a NaN? Do we throw an exception? How much do we want to make things nicer for our users, essentially? Then with efficiency, what we're going to see in a little bit is that there are a lot of weird cache effects that happen depending on how we choose to implement autodiff, especially in C++. Cache locality is the most difficult thing to deal with in one version of autodiff that we're going to be working on. And then finally we think about scaling, where instead of just one CPU or one node we think about multiple GPUs or CPU nodes. Just maintaining that expression graph in a multi-GPU, multi-CPU setting is extremely difficult. You have to do a lot of fun little tricks, and hopefully in another talk I'll be going over that stuff.

But when we think about keeping track of our reverse pass, we have two main methods. We have the operator overloading method, which is going to happen at runtime, and we have source code transformation, which is going to happen at compile time. Operator overloading is going to be the most flexible approach, and it's very easy to implement. The main bad thing is that, as we saw with that chain method, we're going to have a little virtual function; we essentially end up having a bunch of jumps everywhere. It's not going to be fun for optimization, but it is going to be really nice for us as devs, and for our users as well, to work with. For source code transformation, we're actually going to be using C++'s type system to generate a bunch of expression templates, so our expression graph is going to be a bunch of expression templates, like you would have with Eigen. It's going to be super fast, it's going to be awesome, but we're going to be really constrained to what we can generate and know at compile time. We really can't use runtime semantics.

The operator overloading approach, like we just saw, usually involves some sort of paired scalar type to hold the values and adjoints for each of our expression nodes. It's going to have a runtime tape for tracking our expressions as we go through the expression graph, and it's going to have an arena allocator for storing autodiff nodes. The arena allocator is, I'd say, one of the most important things. When I first did this talk, I had this whole "here's a shared pointer implementation of reverse mode autodiff" bit, but no one does that, and I just don't think it would be useful to show. So we want to have a contiguous tape; we want to have contiguous memory. The main reason we want contiguous memory is because when we're going up through our expression graph, we don't want to be jumping around in memory like this. If I start here, I've allocated all my nodes, I'm at this node, and I go, okay, where are my parents? I might have to run all the way over here in memory to grab something, and then all the way back to get the other parent, which is just not fun. We want to at least try to keep our memory as local as possible. We often fail to do this even with an arena allocator, and we'll talk about that in a second, but this is what we really want.

When I'm at a particular node, I want to be able to very easily and quickly grab everything else. The other main factor for this is that normally in automatic differentiation, you're going to be running the same model over and over again. You might have a while loop or for loop that runs a little bit longer, but in general, your next run of automatic differentiation is going to use around the exact same amount of memory as your last run. So what you can do is just have an arena allocator where on your first pass through you allocate all of your memory. You go through, traveling up the expression graph, and you say, okay, I have my adjoints, I'm done with this set, clean all this up. But really what we're going to do is just take the position that tells us how much memory we've allocated so far in our arena allocator and slap it back to zero. We don't even destroy any of the stuff; we'll just overwrite it later. And that is one of the secrets to getting very good speed in this case: just have an arena allocator.

So this is the base class that we're going to be interacting with for the operator overloading case. We want one big block of memory that we reuse over and over. In this case, we're using the standard library's monotonic buffer resource and its little polymorphic allocator. And as I said before, what we really want is this ability to lay down our expression graph as that topological sort. Because we're running our program linearly, that means we can just store all of the expression nodes that we make inside of a vector, which is super nice, because once we build all of our expressions, we're just going to start at the bottom and shoot all the way up through this expression graph vector, calling each chain method. The thing you'll see that's also very bad here is that we're going to do a lot of pointer chasing, essentially, in the dynamic approach to automatic differentiation. One of the nicer things is that once we allocate all of our memory and do our autodiff pass, we just call clear on the tape and tell the monotonic buffer resource to release its memory, which secretly just means go back up to the top, and that's really it. In most autodiff libraries you're going to have this tape be a static global. I think in like 95% of the C++ libraries I know, the tape is a static global. I like it. We use it in Stan, and I think it's fine. I'm not going to tell everyone to do this, but we do multi-threading and it doesn't bother us.

So our base type for the dynamic, or operator overloading, autodiff is just going to be this var_impl. It's just a struct with two doubles and a chain function. This chain function is actually a weird optimization. Like I said before, whenever we are going up and getting the parents of a particular node, we're usually going to be grabbing its values and its adjoints, and it's kind of likely that one of those parents is also one of the nodes whose chain method we're going to call next. That means that by the time we start calculating the adjoints for a particular node, we've already brought in the pointer for the next chain method we need to jump to. So it's actually really nice. At one point in Stan Math, which is the C++ autodiff library that I work on, I did rip out this virtual function and had all this clever stuff to put it in a fully separate thing. That gave a nice little 16 bytes instead of 24 bytes. It was 15% slower, and we really worked through trying to figure out why. Our best guess is that usually having the virtual function just sitting there to jump to when we grab the parent stuff is nice for us.
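The shape being described might look roughly like this, building on the tape sketch above (Stan Math's real base class is called `vari` and differs in detail):

```cpp
// Two doubles plus the inherited virtual chain(): with the vtable pointer this
// is the 24-byte node mentioned in the talk. Leaf inputs use the no-op chain().
struct var_impl : chainable {
  double val;       // forward-pass value
  double adj{0.0};  // adjoint, accumulated during the reverse pass

  explicit var_impl(double v) : val(v) {}
};
```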

So we want to write an expression node, right? So we have a struct; this case is just a multiply. The little "vv" at the top means two vars; we'll go over that in a little bit. This inherits from our var_impl, which means that we have to override the chain method. It stores its parents. Then, when we call the chain method for this, we say, well, op one's adjoint gets accumulated by this current adjoint times op two's value, and the same thing for op two's adjoint. And this is nice; it's just writing out the base gradient of the sub-expression for multiply.
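A hedged reconstruction of that multiply node (the name `mul_vv` and the exact member layout are my guesses at what the slide showed, continuing the sketches above):

```cpp
// Multiply node for two autodiff variables ("vv" = var * var).
struct mul_vv : var_impl {
  var_impl* op1;  // parents, themselves allocated in the same arena
  var_impl* op2;

  mul_vv(double value, var_impl* a, var_impl* b)
      : var_impl(value), op1(a), op2(b) {}

  // d(a*b)/da = b and d(a*b)/db = a, each scaled by this node's adjoint.
  void chain() override {
    op1->adj += adj * op2->val;
    op2->adj += adj * op1->val;
  }
};
```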

When we make a new node, this is just a little helper function. We say, hey, for this I'm going to make a new node: grab the tape, grab the polymorphic allocator on the tape, make this new object T given my arguments, push that onto our global tape, and then pass back the actual var holding our var_impl. Oh, wait, did I skip something? I'm sorry, I think I missed something. Oh yeah, this is super important; I missed it. The struct var down here, with the var_impl: this is just a PIMPL-style thing where everything that we actually care about happens in var_impl, but having this var that holds a var_impl pointer inside of it lets us do things like assignment, reassignment, operator addition, plus-equals-style operators, and stuff like that.

So then finally, when we write operator star for our two vars, we just generate a new node. We tell it we're making a mul_vv node, we give it its forward pass, which is op one's value times op two's value, and we tell it the parent nodes, which are op one and op two.

When we want to calculate the gradient, as we saw before, we take the bottom node's adjoint, set it to one, and then we just iterate through the tape in reverse, calling every single expression's chain method.
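Putting the last few slides together, a sketch of the helper, the PIMPL-style `var` handle, `operator*`, and the gradient sweep (the names are mine, and `new_object` on `std::pmr::polymorphic_allocator` is C++20):

```cpp
#include <utility>

// Arena-allocate a node of type T and record it on the global tape.
template <typename T, typename... Args>
T* make_node(Args&&... args) {
  Tape& tape = global_tape();
  T* node = tape.alloc.new_object<T>(std::forward<Args>(args)...);  // C++20
  tape.nodes.push_back(node);
  return node;
}

// PIMPL-style handle: a var is just a pointer to its var_impl on the tape,
// which is what makes copies, assignment, and operator overloads cheap.
struct var {
  var_impl* vi;
  var(double v) : vi(make_node<var_impl>(v)) {}  // leaf/input node
  explicit var(var_impl* impl) : vi(impl) {}
  double val() const { return vi->val; }
  double adj() const { return vi->adj; }
};

// The forward pass is evaluated eagerly; the parents are remembered for chain().
inline var operator*(var a, var b) {
  return var(make_node<mul_vv>(a.val() * b.val(), a.vi, b.vi));
}

// Seed the output's adjoint with 1, then sweep the tape in reverse.
inline void grad(var z) {
  z.vi->adj = 1.0;
  global_tape().reverse_pass();
}
```

Because `var` is just a pointer, passing it around by value is cheap, which is the point of the PIMPL split described here.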

This is really nice in cases like this, where 20 here is just kind of an arbitrary number. We say we're going to have a function compute_grads that takes in two vars. We set z to some random thing, then we say: while z's value is less than 20, I want to accumulate into z the log of x times y. Finally, I want to calculate the gradient of z, which is again going to set up our forward pass and then run through all of our reverse passes. This is just really nice in cases like this where I say, hey, a hundred billion times I want you to compute something with x, compute something with y to make my vars, compute my gradients, do something with the gradients, and then finally we can clear our tape and get back all the memory that we allocated.

One thing, when we first did Stan: we had every single one of these structs. We had a struct for multiplying two vars, a var and a double, a double and a var. It's just a lot of boilerplate. But once we started moving up to newer versions of C++, we would write something like this. We call this the callback var_impl. It inherits from var_impl like everything else, but it takes in a template functor which we store inside of that var_impl. Then, whenever we call our chain method, what we're going to do is just call that lambda, with the only input being this callback var_impl itself. And we can see how this looks now with operator star. We make a new node; it's a callback var_impl (you'll have to pretend that that's a template template parameter when we make a new node). We just give it our forward pass values. Then, inside of our reverse pass lambda, we give it op one and op two; because op one and op two are always allocated in our monotonic buffer, you can just pass copies of those, and they're also just going to be PIMPLs, so it's just little pointers we're passing through. Then we can use if constexpr to say, hey, if T1 is a var type I want to update its adjoint, and if T2 is a var type then I also want to update its adjoint. This lets us write all n-ary functions like this, where before, I think at Stan we had some functions that took four or five inputs, and so we had to have every version of that manually written out. It was not fun. This is so much nicer for us.
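A sketch of that callback node and an `if constexpr` multiply covering the var/double combinations (again my reconstruction of the idea, continuing the sketches above, not Stan Math's exact code):

```cpp
#include <type_traits>
#include <utility>

// The reverse pass is a stored lambda instead of a hand-written struct per op.
template <typename F>
struct callback_var_impl : var_impl {
  F rev;  // called during the reverse pass with this node as its only input
  callback_var_impl(double value, F rev_fn) : var_impl(value), rev(std::move(rev_fn)) {}
  void chain() override { rev(*this); }
};

template <typename F>
var make_callback_var(double value, F rev_fn) {
  return var(make_node<callback_var_impl<F>>(value, std::move(rev_fn)));
}

// value_of: plain doubles pass through, vars yield their stored value.
inline double value_of(double x) { return x; }
inline double value_of(var x) { return x.val(); }

// One function now covers var*var, var*double, and double*var: `if constexpr`
// only emits an adjoint update for the operands that are actually vars.
template <typename T1, typename T2>
var multiply(T1 a, T2 b) {
  return make_callback_var(value_of(a) * value_of(b), [a, b](var_impl& node) {
    if constexpr (std::is_same_v<T1, var>) a.vi->adj += node.adj * value_of(b);
    if constexpr (std::is_same_v<T2, var>) b.vi->adj += node.adj * value_of(a);
  });
}
```

With this, `multiply(x, 3.0)` and `multiply(2.0, y)` each record a single node, with no separate var-double or double-var structs.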

One really bad thing about dynamic automatic differentiation is this cache locality problem. Our base var_impl, in a cache line like you see here: we can store two of those, and that's fine. For an expression node with one input, we can still store two of those; that's fine. But once we get to at least a binary operation, an expression node that takes in two inputs, suddenly we can only fit one of those things into a cache line. This is rough, because imagine we're doing a matrix multiply with these: if you think of a matrix multiply as just a bunch of dot products, which are scalar multiplications and additions, what we end up with is a case where every single time you go fetch a single node, it takes up an entire cache line. This is one of the huge downsides of the dynamic, operator overloading approach. It is nice for its flexibility, but at the end of the day, we're going to pay a really heavy price.

Now, going back to before, we can think of source code transformation, where I literally have a function z equal to log of x times y, and that's what I want to differentiate. In older Fortran programs, they had transpilers that would take Fortran programs and literally transform them, so they would look something like this. In C++, we have this beautiful type system that we can actually use to generate expression templates that represent our expression graph at compile time.

So, for the source code transform, our base type is going to be a var that is again just the values and adjoints of our particular inputs; for this, it's just going to represent inputs. For expressions, or expression nodes, we're going to have this expr type. It's going to hold a var, and it's going to have all of its parent expressions stored as a tuple. The deduce_ownership_t here just asks whether the expression should own its parent expression or just hold a reference to it, depending on whether it's an rvalue or an lvalue. And the expression's main constructor here just takes in an eagerly executed forward-pass double value, a lambda for the reverse pass, and all the arguments, the expressions, that we're going to be using to calculate that reverse pass.

So this is now what our operator star looks like. It has gotten a little more complicated, but it is nicer for the developer to code against. This requires any var or expression type for this operator star; like the standard operator star, it just takes in two things. We make an expression now, where we still eagerly evaluate the forward pass, where we just multiply the left and right hand sides. Then our reverse pass is just going to be a lambda where the arguments are the return, the left hand side operation, and the right hand side operation, and we say, hey, if T1 is a var I want to calculate the adjoint, doing that update and propagation of gradients from child to parent nodes, and the same thing for T2. Kind of easy peasy. The last two arguments are just the left hand side and right hand side expressions that we're passing through.

What we end up with is a type that looks a little crazy. If people have used Eigen, you're probably used to seeing a bunch of weird expression templates getting pumped around. But I love them. I use them all the time. I love Eigen; I use it constantly. And so I like this. What we have here for z's type is down in this bottom part of the code. It is taking that plus, multiplication, log of x and y, and so on. You can kind of think of this as Polish notation that gives a stack of expressions we're going to be executing, but I wouldn't worry about that. The main thing is that we end up with a type that represents the expression graph at compile time.
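A compact sketch of the scalar version of this idea (my own reconstruction, not the slide code: the names `ct_var`, `ct_expr`, and the ownership trait are made up, and this depth-first `reverse()` only handles the simple case where each sub-expression appears once):

```cpp
#include <cstdio>
#include <tuple>
#include <type_traits>
#include <utility>

struct ct_base {};  // tag so the operator below only matches our AD types

// Leaf: an input variable with a value and an adjoint.
struct ct_var : ct_base {
  double val;
  double adj{0.0};
  explicit ct_var(double v) : val(v) {}
  void reverse() {}  // leaves terminate the recursion
};

// Lvalue parents are stored by reference, temporaries are owned by value.
template <typename T>
using store_t = std::conditional_t<std::is_lvalue_reference_v<T>, T,
                                   std::remove_reference_t<T>>;

// Expression node: eagerly computed value, a reverse-pass lambda, and the
// parent sub-expressions held in a tuple. The whole graph lives in the type.
template <typename Reverse, typename... Parents>
struct ct_expr : ct_base {
  double val;
  double adj{0.0};
  Reverse rev;
  std::tuple<Parents...> parents;

  ct_expr(double v, Reverse r, Parents... ps)
      : val(v), rev(std::move(r)), parents(ps...) {}

  // Push this node's adjoint into its parents, then recurse into them.
  void reverse() {
    std::apply([this](auto&... ps) { rev(adj, ps...); (ps.reverse(), ...); },
               parents);
  }
};

template <typename T>
inline constexpr bool is_ct_ad_v =
    std::is_base_of_v<ct_base, std::remove_reference_t<T>>;

// operator*: eager forward pass plus a lambda holding the local gradients.
template <typename L, typename R,
          typename = std::enable_if_t<is_ct_ad_v<L> && is_ct_ad_v<R>>>
auto operator*(L&& lhs, R&& rhs) {
  auto rev = [](double adj, auto& l, auto& r) {
    l.adj += adj * r.val;  // d(l*r)/dl = r
    r.adj += adj * l.val;  // d(l*r)/dr = l
  };
  return ct_expr<decltype(rev), store_t<L&&>, store_t<R&&>>(
      lhs.val * rhs.val, rev, std::forward<L>(lhs), std::forward<R>(rhs));
}

int main() {
  ct_var x{2.0}, y{3.0};
  auto z = x * y;   // z's type encodes the expression graph at compile time
  z.adj = 1.0;      // seed the output adjoint
  z.reverse();
  std::printf("dz/dx = %f, dz/dy = %f\n", x.adj, y.adj);  // 3 and 2
}
```

Because the whole graph is in `z`'s type, the compiler can inline the entire reverse pass; the price is that the structure must be fixed at compile time.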

I did not have enough slides, or I couldn't make slideware that would easily show everything I do here, but at the end there's a huge QR code. For all the code you're seeing here, I have little toy examples of the different types of automatic differentiation, and you can play with them as you wish. Here, what we're going to be doing is calling our gradient function with our source code transform. What we have is a gradient function that takes an expression. We set its adjoint to one, just like before. For this, what we want to do is evaluate our expression graph breadth-wise: starting at the bottom, we go to the first layer, second layer, third layer, and fourth layer. This is mostly because we want to handle cases where the same node can be used multiple times. It's a little issue I can go into detail on later, but you essentially always have to traverse these in breadth-first search order.

But that is kind of the base layer of source code transform. Whenever we're thinking about which of these we would want to implement, or which one of these as a user we want to use, the main things to think about are speed and flexibility. Oh, sorry, this is a little benchmark. Essentially we're just doing x * log(y) + log(x * y) * y. Our baseline here, again, is computing this by hand: computing its gradients by hand and then running a Google Benchmark on it. We have our source code transform and dynamic implementations; again, those will all be available at the end via the QR code. The source code transform takes about 6.3 nanoseconds. It's crazy fast. If you don't need while loops, seems good. But then we have the dynamic one, which is about nine times slower than the baseline.

So when we are choosing which autodiff we want, which one we want to write, this is a real question. There are some serious downsides to source code transform: you're going to be handling and debugging a lot of very weird types, and everything about your expression graph has to be known at compile time. Not fun. But it's going to be way, way fast. And this is the main example of where source code transform will not work: because z's type is defined at compile time, even if you make it auto or whatever, you can't be rewriting it over and over again. So this while loop is just a no-go.

One of the bigger things when we think about dynamic and source code transform, or autodiff in general, is that we're mostly focused on statistical models of some sort. In a lot of statistical models we're going to be working on giant matrices of autodiff types; all of our parameters are going to be represented as autodiff types. And the main question is how, in C++, we want to represent our autodiff matrices. There are two main ways, and this is kind of a simple example: we have a matrix of vars and we want to do a matrix multiply with them. Before I go any further, there's a thing I'm going to be using in a few other slides, this arena matrix. All it is: it inherits from Eigen and is essentially an Eigen map where all of its memory sits inside of the arena allocator we had before, the monotonic buffer.
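A hedged sketch of such an arena matrix, assuming Eigen is available and reusing `global_tape()` from the earlier tape sketch (the real Stan Math type differs in detail):

```cpp
#include <Eigen/Dense>

// An Eigen::Map whose storage is carved out of the tape's monotonic buffer,
// so per-node matrix memory is reclaimed by the same arena reset as everything else.
template <typename Scalar>
struct arena_matrix
    : Eigen::Map<Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic>> {
  using Base = Eigen::Map<Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic>>;

  arena_matrix(Eigen::Index rows, Eigen::Index cols)
      : Base(static_cast<Scalar*>(global_tape().arena.allocate(
                 sizeof(Scalar) * rows * cols, alignof(Scalar))),
             rows, cols) {}

  using Base::operator=;  // allow assigning arbitrary Eigen expressions into it
};
```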

So for operator overloading, there are two main ways to do matrices. What we have here is the first one, where we have an array of structs. This is just an array of our basic var scalar type. This is nice, and we'll go over the positives and negatives in a minute. The other way is a struct of arrays, where we literally represent a node in our expression graph as a matrix, and inside of there we have an arena matrix for the values and one for the adjoints. And then of course it has its own little chain function that represents running the chain rule on a particular node that is an underlying matrix.

So what's cool about the array of structs approach? Well, what's nice is that if we have addition, multiplication, subtraction, division, and some transcendental functions like logs and exponentials and whatnot, most algorithms will just work. For Stan, when we first added ODE solvers into Stan Math, which is the C++ library for automatic differentiation, we literally were just using these matrices of var types and pumping them through Boost ODE solvers, and they would be horrifically slow, but they would work, and that was really nice. But again, you're adding a ton of operations to your expression graph when you represent matrices like this. And the other thing that's really bad is that you're turning off SIMD operations. This var, remember, is a PIMPL that goes through a little pointer to a var_impl holding our values and adjoints and a chain method. Usually for matrix automatic differentiation we're going to have to grab just the values of a matrix or just the adjoints of a matrix. What that means is that on every iteration, say I want to grab element three: I go in, I do a pointer chase to the actual var_impl, and then I grab just the value. Sometimes that value is just one element of a matrix multiplication, and so we have to grab a whole cache line just to get one particular element. That's rough.

The other way to do this, the struct of arrays approach, is that we have both our values and adjoints in this little arena matrix inside of our arena allocator. What's really nice about that is that these are nice contiguous values for the values and the adjoints. This is super nice because it turns on SIMD for everything. When we do the matrix multiplication example you'll see here, and we use something like Eigen as the backend for our matrix operations, we get all of Eigen's SIMD operations. Also note that we have this virtual chain function here; now, instead of exploding our expression graph into a bunch of little nodes, we just have one node that represents the entire matrix multiplication, which means less jumping around in memory. The downside of this is that now, when we represent a matrix as an actual expression node, we do have to write every single operation we would want to do on matrices. If you want to do the Cholesky decomposition, you must know the reverse pass, the gradient, of the Cholesky decomposition; matrix multiply; all of them.

So this is our operator star using the struct of arrays approach, where we take in both our left and right hand side operands. We do the forward pass just as before, eagerly. Because the arena matrix puts all of its memory inside of our arena allocator, we can just copy op one and op two into the lambda; totally fine. The lambda then just takes in an input, which is the current node itself, and just like before, if this is a var matrix, I want to update the adjoint for it, doing the accumulation for both the left and right hand side operands. It's nice, clean code.
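Continuing the sketches above (with my hypothetical `matrix_var` and `multiply_vv_matrix` names), the struct-of-arrays matrix multiply might look roughly like this; the adjoint updates are the standard matrix-product gradients, adj(A) += adj(C) * B^T and adj(B) += A^T * adj(C):

```cpp
// One expression-graph node holds the whole matrix of values and the whole
// matrix of adjoints, both contiguous in the arena (struct of arrays).
struct matrix_var : chainable {
  arena_matrix<double> val;
  arena_matrix<double> adj;

  matrix_var(Eigen::Index rows, Eigen::Index cols)
      : val(rows, cols), adj(rows, cols) {
    adj.setZero();
  }
};

// A single reverse node for the whole product instead of one node per scalar op.
struct multiply_vv_matrix : chainable {
  matrix_var* op1;
  matrix_var* op2;
  matrix_var* result;

  multiply_vv_matrix(matrix_var* a, matrix_var* b, matrix_var* res)
      : op1(a), op2(b), result(res) {}

  // For C = A * B: adj(A) += adj(C) * B^T and adj(B) += A^T * adj(C).
  // Eigen evaluates these with contiguous, SIMD-friendly loops.
  void chain() override {
    op1->adj += result->adj * op2->val.transpose();
    op2->adj += op1->val.transpose() * result->adj;
  }
};

inline matrix_var& operator*(matrix_var& a, matrix_var& b) {
  auto* res = make_node<matrix_var>(a.val.rows(), b.val.cols());
  res->val = a.val * b.val;                    // eager forward pass
  make_node<multiply_vv_matrix>(&a, &b, res);  // recorded for the reverse sweep
  return *res;
}
```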

One huge downside, well, another huge downside, to doing the struct of arrays approach is that you can imagine code like this. This is just a little toy code here, but I make these var vector types, and then I say, hey, I want to do a dot product of y and x. Then, for some reason, I'm going to take a slice of x, from positions 1, 2, 3, and assign it to positions 0, 1, 2. Then finally I'm going to make z, which is just the dot product plus the sum of x. If we think about this as an expression graph, we have something like this. The sum and the dot are highlighted in red because they require x to be two completely different things at different parts of the program. And the question is, how do we handle that? Again, when we lay this out in a little topological sort, you can see that sum depends on x having gone through the subset assignment, but the dot product requires x to have not been changed whatsoever.

How do we deal with this? Oh, one other issue with this, which we figured out: if you are going to choose to go this route and allow your users to do little slice operators on these, you could run into very weird cases. For instance, in this case where you're assigning 1, 2, 3 to 0, 1, 2: if people also use Eigen, you've probably run into aliasing issues where you've been overwriting the matrix with itself. What happens in this case, if you do it naively, is that starting from your initial point you'll just start overwriting all of your elements with the first element, which is zero. So you have to be super careful with this stuff. In Stan, and you don't have to read all of this, the point you have to get from this graph is that it's really expensive to allow for subset operations. It's nice for your users; there are a lot of algorithms where it's nice to do subsets. But your users end up paying this weird implicit cost of constant deep copies that they didn't know they were going to have. Because if you implement both of them, like Stan does: for the matrix of vars, the array of structs, subset assignment is easy peasy, you just do it; the aliasing issue can still be a thing, but besides that you're fine. But for the struct of arrays approach, it's brutal.

At this point we have kind of talked about how to do operator overloading style matrix operations. They're nice, they're okay, but the source code transformation, I think, is one of the more exciting parts of autodiff and C++ right now. There's some cool research going into it, and I think it's really ripe for doing cool new open source packages or things like that.

So now we want to write source code transform on matrices. We're going to be using the type system like before, and we're going to end up writing code like this, where we write these kind of placeholders representing matrices, var x and y. Here we're just going to do a simple matrix multiplication and then a summation of that. That expression is again going to be, like we saw before, an expression template. Then we ask the expression, hey, how much memory do you need for my expression graph? It reports that for the values and the adjoints. We then give the expression that memory back, so that our expression graph has the memory embedded inside of it.

Then, when we call gradient on this, we want to be able to do stuff like this, where we can say, hey, take this expression, reset the adjoints first, and then run a forward and reverse pass of autodiff with new values for my x and y inputs. This is really nice. The main takeaway is that you can essentially write your expression graph once and then constantly recycle it with new values, while keeping the speed you get from the unrolling of your source code transform, just constantly pumping in new values for that given type. Which is super nice.

There is a lot of code here. I will try to break this down piece by piece.

The hardest part to keep in mind with this is that we generate an expression graph that is itself a bunch of expression templates, but inside of here we're often also going to be generating expression templates over our value and adjoint matrices. So what we do here for this var type: we'll assume that all of the things we're going to see here are never going to own their own memory, so again I have nice little free pointers just floating around. We have a value and an adjoint: double pointers for the values and the adjoints. We say how big this particular var is. And keep in mind this ops thing, the constexpr ops value, which means that for the forward pass this thing is only ever going to take in one value. So in the forward pass we're going to be given this var_view object, and we say, hey, the pointers for these are now going to be the data for the values and the data for the adjoints of the var_view. Then that forward pass is just going to return back the values, right, because in the forward pass all we care about is the values for the initial pass of our expression graph. For reverse mode, this is just going to take in its child adjoint, and we then accumulate into the adjoint for this type via its child adjoint, and that's just it.

just it. So this is kind of what a map matrix multiplication looks like in this. It gets a little bit intense.

this. It gets a little bit intense.

There is these are all just the signatures we need but you can see that it takes in uh it has a template that takes in two different operants. uh it

has a forward pass taking in a parameter pack of arguments. It has a reverse pass just taking in its child adjoint. Um and

it can then you can then query it saying how big of a cache how much cache do you need for this particular expression graph and then you have a bind cache to say I want you to assign this memory to

your uh uh element to the expression node here. We store op one and op two

node here. We store op one and op two here and we also then store the pointers for the values and adjints for this expression node along with the rows and columns.

So when we do the forward pass for this, we have two parents for this node. What we're going to say is: take the operand on the left hand side; I'm going to ask it how many operands from my input it expects. In this case, if the left hand side was just a regular x var like we had before, it would just be a value of one. So we take a little slice of our args, which would just be one value, and pass it to the forward function of that operand. And the same thing for the second operand: we say, we knew we took x amount for the left operand, I need x amount for the right operand, so make a little slice of my parameter pack for that and pass it to its forward function. And then finally, at the end, we multiply op one and op two together, setting the value for this particular expression node, excuse me.

The reverse pass takes in a child. We then propagate the adjoints of the child up to the current node, and then we say, well, the left adjoint, just like before (we could have some C++17 stuff here saying only do the left adjoint or only do the right adjoint; for the purpose of slideware I removed those), the left adjoint is just going to be op one's transpose times this current node's adjoint. We then pass that to the reverse pass of... oh, sorry, that should be op one, not op two, in this slide; the first reverse pass should go to op one. The left adjoint here is going to be an expression, though, and that's a really key point. As we're passing these things through, we'll be building up a huge expression template to be finally passed to our var values, which is super nice for inlining purposes. And then the same thing for the right adjoint. Sorry, I got op two's reverse and op one's reverse swapped; those should be swapped, but the same thing applies. The reverse pass here is essentially just accumulating an expression template of our reverse mode operations.

Then, kind of simply, we define a way for the overall expression to say, hey, can you tell me how much cache I need for this particular node. And then we have another function that just says, hey, given these pointers (oh, I missed a little star there, sorry, it should be a std::pair of pointers) and the amount of cache I've already allocated, allocate a certain amount to this node and then allocate a certain amount to all of the sub-expressions that are underneath this expression.

But once you have all of that, you can write some really nice, fast matrix code. Again, this is just what we had before, where we're making x and y, building our sum expression, asking for cache sizes, assigning caches to the underlying expression, and then just running autodiff.

This is a benchmark looking at all three of the matrix approaches that we just saw. What we're doing here is a matrix multiply and then a summation. So this has the array of structs approach, where the var type is the underlying array element type; the struct of arrays approach, where we have two separate matrices for the values and adjoints; and then the source code transform version. This graph is slightly deceptive, but what you can see is that the baseline here is just the source code transform, and the array of structs approach just gets blown out of the water; it is just too slow, at least relative to source code transform. The struct of arrays approach here looks like it's catching up to source code transform, but that is more an artifact of a poor benchmark. The problem is that after a while, once you have a large enough n-by-n matrix, your time is going to be dominated by the matrix multiplication itself. So I'd really only look at this up to about 10 or 30; that's just a zoom-in of this. But again, going back to when I said this is a little bit of a distressful benchmark: if we look at FastAD, which does exactly the source code transform I was just talking about, you can see it's a constant, consistent beat-down relative to pretty much everything else, which I think is dynamic. So we're Stan, we're dynamic, at least.

But that is the end of my presentation. If anyone has questions, I'd love to hear them. The QR code here will take you to a ton of code: benchmarking code and little toy examples for all the different autodiff we did today, and all the slides if you want to review them. Thank you so much for coming to my talk.

Uh, any questions? I'm happy to hear them.

Hi. Hi.

>> That's on.

>> Now it's on. Thank you for the nice talk. Um, so you presented this operator overloading approach, with the node implementation, and then what you call the source transformation approach. That's basically a statement-level reversal. Have you tried to combine the two in Stan, so that for every node you have in your graph you do a statement-level reversal?

>> Yes. So what we do in Stan, usually, um, is we have expressions that kind of just represent... so... that's a great question, first off, thank you. When you want to... oftentimes the research is trying to combine operator overloading with source code transformation, because, right, there are just points in your program where you don't need a while loop and you want the speed of source code transform. Some parts of Stan do this. We can talk a little bit about the implementation afterwards. We actually end up doing a lot of this inside of our... we have a little compiler for our language that essentially... we explicitly write out, I have a log-multiply node, right? Like, I'm doing a log of one thing and then a multiply with two matrices; we just have an explicit node for that. And then our compiler can look at the user's code and go, okay, I see that I can combine these. So the user has written log and multiply; I know I can combine them into the log-multiply function to generate that expression. Um, this is probably the hottest area of research in autodiff right now: how do you mix both source code transform and operator overloading.

>> Okay, thank you.

>> Yeah. Yeah, absolutely. Thank you.

Uh, any other questions?

>> No? Oh, that's fine. If you guys want to come talk... oh, sorry, you do want to talk.

There we go.

Um, a little tangential to the main content of the talk, but you mentioned topological sorting a couple of times on these graphs. Those are non-unique in general, of course. Do you have a way of preferring certain sorts over others, or...

>> No.

>> ...is one particularly pertinent?

>> Honestly, the main reason we think about it as a topological sort is just that, at the end of the day, we want to shove things into a std::vector so we can go up and down it quickly.

>> Right. So there's probably not much impact from...

>> So, interestingly, there are cool things for doing topological sorts. Robert Gower is a guy who has an awesome paper, sorry, there's going to be a lot of math for one second, on doing particular topological sorts over your expression graph that also let you calculate Hessians with just the reverse pass. So if you're interested in that, I'd say check it out; it's super interesting, but that's not my expertise.
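Since the "shove things into a std::vector" point keeps coming up, here is a minimal sketch of that shape. The node type with a type-erased reverse callback is hypothetical, not Stan's representation; the only idea shown is that a tape stored in topological order makes the reverse pass a backwards walk.

```cpp
// Sketch of a flat tape in topological order; the reverse pass is just a
// backwards walk over a std::vector. The node type here is hypothetical.
#include <functional>
#include <vector>

struct node {
  std::function<void()> reverse;   // accumulates adjoints into the parents
};

int main() {
  std::vector<node> tape;          // filled in topological order by the forward pass
  // ... forward pass pushes nodes onto the tape ...
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) it->reverse();
}
```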

Perfect.

>> Thank you.

>> Yeah no problem.

Quick question. Do you know how PyTorch did this?

>> Dynamic? So they have the struct-of-arrays approach for matrices. Usually in things like PyTorch, whenever you're doing something like operator plus-equals, in the back end it's really making a hard copy and saving stuff, because when you're doing anything dynamic, you must save where you were previously. So instead of actually overwriting any memory, it just makes new ones. But yeah, this is what PyTorch does as well.
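As a generic sketch of that "save where you were previously" bookkeeping, not PyTorch's actual implementation, an in-place update under dynamic autodiff ends up looking something like this; the type and member names are invented.

```cpp
// Generic sketch of "save before you overwrite" for dynamic reverse mode;
// not PyTorch's implementation, just the shape of the bookkeeping.
#include <vector>

struct tracked_value {
  double value = 0.0;
  std::vector<double> history;   // prior values needed by the reverse pass

  // An in-place update can't really be in place: the old value is saved so
  // the reverse pass can still see it.
  tracked_value& operator+=(double rhs) {
    history.push_back(value);    // snapshot where we were previously
    value += rhs;
    return *this;
  }
};

int main() {
  tracked_value v;
  v += 1.5;   // history: {0.0}
  v += 2.0;   // history: {0.0, 1.5}
}
```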

That was a great talk. Thank you very much.

>> Thank you.

>> I think, if anything, um, I want to just say that autodiff is probably one of the greatest things in the world.

>> It's fun.

>> It's just incredibly useful. You couldn't do most physical simulations or calculations like this without it. Um, I'm not really that well-versed in the low levels. How do I know which library to choose? I've used autodiff, um...

>> Mhm.

>> ...but I don't really know what the best ones are out there.

>> So I'd say, if your target language is C++, the first thing I would do is throw away ideas of speed. Your question should be: how can I, as a developer, get off the floor the fastest with my program, my simulation, right? So for that, there's an awesome... if you type in "autodiff website" on Google, there's a website where someone has collected just about every possible autodiff library that's ever been written. I'd say... you know, personally, I'm a huge proponent of Stan Math; I really like it. We don't have CMake yet, so you have to do some weird stuff to compile us; hopefully me and Brian are going to fix that soon.

I was just looking at the camera. Uh, but I love Brian. I work with him. He's great. But yeah, I'd say, Stan aside, FastAD is probably my favorite current package for doing matrix operations. It's very... James just kind of wrote it for fun in, like, a summer. He's a crazy smart guy. And then he messaged us at Stan saying, "Hey, I wrote these benchmarks and it looks like I'm faster than Stan. That doesn't seem right." And we were like, "Yeah, that doesn't seem right." But then we looked at it and we went, "Oh no, you are right. You are way faster." And we looked at it and it was gorgeous. If you're doing matrix stuff, I think FastAD is really fun. But in general, check out FastAD, or check out the autodiff website, and that is going to show you a lot of the different packages you have available to you. But worry about your time the most.

H.

>> What is it?

>> We'll talk after. Yeah. Uh, any other questions? Am I over time? I have three minutes left. Go for it.

>> I was so worried I wasn't going to get questions. I'm getting questions. I'm so happy.

>> Thank you for the talk. Um, I have a question on this, sort of, the slice assignment operator.

>> Yeah. Can it just be, like, a matrix multiplication, so that you can rotate the input to become the output?

>> Sorry, could you say that again?

>> So you can, like, have a matrix multiplication...

>> ...to do the slice assignment, and then the problem is just...

>> ...to, um, autodiff that matrix multiplication, right?

>> For the assignment operator you can do a matrix multiplication to... sorry, was there a slide number that you had in mind? I can go back to it.

>> There is a... you know, you have an assignment, you have a sort of a sub-slice of a vector and you assign... um, yeah, exactly. This one. That's right.

>> Yeah.

>> Yeah. So I think we can just have, say, y equals x times some matrix, or some matrix times x.

>> Mhm.

>> And that matrix is going to, like, grab the head of x to become y, and then you can just operate on y after that. Is that true?

>> Uh, again, it does matter how. So at least for our implementation of var vectors, if you do, like, auto y = x.slice(...), no copy gets made at that point. So if I had here, like, a var vector x, and then I said auto y = x.slice(...), or I did x.head(...), right, and then I said x.slice(...) = y, in our version that would give me the same aliasing issues that you have to be careful of, and in Eigen you'd also have these exact same issues. Um, so yeah, you do have to be really... it is brutally painful; we found this bug by accident, luckily, but doing subset assignment for matrices is hard. I think JAX doesn't even... I think a lot of packages just don't allow it, which is kind of a reasonable design decision.
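For a concrete picture of the aliasing hazard being described here, below is a plain-Eigen sketch with no autodiff types at all; the same overlap problem is what the var-vector version has to worry about.

```cpp
// Sketch of the aliasing hazard discussed above, using plain Eigen vectors.
// Overlapping block assignments read memory that is being overwritten unless
// you force an explicit copy with .eval().
#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::VectorXd x = Eigen::VectorXd::LinSpaced(6, 1.0, 6.0);  // 1..6

  // y is a view, not a copy: it aliases the first four entries of x.
  auto y = x.head(4);

  // Writing y back into an overlapping slice of x reads and writes the same
  // memory, so the result depends on evaluation order.
  // x.segment(2, 4) = y;          // aliasing hazard
  x.segment(2, 4) = y.eval();      // safe: materialize the copy first

  std::cout << x.transpose() << "\n";
}
```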

>> Mhm. I see. Uh all right. I have another question.

>> Sure.

>> So on the topological sort thing, I can imagine that one graph can have like many topological orders.

>> Mhm.

>> And so they might differ in their cache locality performance.

>> Um, how do you select the one that is the best?

>> Uh, for us it's really just that, the topological sort example here, at least for Stan's use case. There are particular autodiff packages that do do specific topological sorts to do things like, as I said, the Hessian example. For Stan, we're just doing the basic layout and then using that to justify storing everything in a giant vector. That's pretty much it. But there are... again, Robert Gower has cool things; he's had talks, he's had thoughts about this stuff before. So I'd say checking out his work would be super good.

>> Cool. Yeah. Thank you.

>> Thank you. Um, are there any other questions?

No? Uh, one last thing: if you're interested in allocators in general, which is a large part of the dynamic stuff we did here, Kevin Carpenter has a back-to-basics talk tomorrow on allocators that's going to be super fun. If you're in scientific stuff and you're like, I want my code to be faster, add an allocator of some sort: an arena allocator, a stack allocator, a pool allocator. It will make your stuff so much faster. So go check out that talk.
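A minimal sketch of that "just add an allocator" advice, using the standard library's monotonic_buffer_resource as an arena-style allocator; this is generic C++17 pmr usage, not the arena allocator shown earlier in the talk.

```cpp
// Minimal arena-style allocation with std::pmr: one upfront slab, cheap
// bump allocation, nothing freed until the resource goes away.
#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
  std::byte buffer[64 * 1024];                           // one upfront slab
  std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

  // Every push_back carves memory out of the slab; allocation is a pointer
  // bump, and deallocation is a no-op until the resource is destroyed.
  std::pmr::vector<double> values(&arena);
  for (int i = 0; i < 1000; ++i) values.push_back(i * 0.5);
}
```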

But anyway, I think that's all for me. Thank you guys so much for coming to my talk. I really appreciate it.

[Applause]
