
impl Rust: One Billion Row Challenge (live version)

By Jon Gjengset

Summary

## Key takeaways

- **Naive Rust beats Java baseline**: The naive single-threaded Rust implementation using BufReader, HashMap, and string parsing takes 98 seconds on 1 billion rows, already faster than Java's equivalent 2–3 minute baseline. [33:03], [34:17]
- **Memory mapping cuts a third of the time**: Switching from BufReader's lines() to a manual mmap with memchr for newlines drops runtime from 48 to 32 seconds by avoiding buffered reads and enabling sequential-access hints. [01:25:46], [01:26:07]
- **Custom hashing halves lookup time**: Replacing std::hash::DefaultHasher with a custom 64-bit mixer using 8-byte chunks and rotates cuts HashMap lookups from 90% to 40% of runtime, bringing the total to 21 seconds. [09:00:19], [09:16:19]
- **SIMD-friendly parsing eliminates branches**: Manual i16 parsing of the fixed-format temperatures with conditional moves instead of f64::parse avoids branches and reduces parsing from 10% to 2% of execution time. [07:49:54], [08:01:31]
- **Inline string keys reduce allocations**: A custom StrVec packs short station names inline in HashMap keys, eliminating Vec allocations for 500+ unique stations and saving pointer chasing. [04:46:16], [04:51:32]
- **32 threads scale to 1.1 seconds**: std::thread::scope with available_parallelism splits the mmap into per-thread chunks aligned to newlines; merging via an mpsc channel scales the single-core 21 s down to 1.1 s on 32 cores. [08:51:38], [09:01:06]


Full Transcript

A one, a two, a one, two, three, four.

Does that work? Did that scare anyone? I gave you a count-up at least.

Let's see. Does anyone actually hear me now? And how many people got scared this time?

Yes. Okay, cool. I managed to scare people again.

Yes, here. Yes. See, "no" is the correct state of affairs for now.

Let's see.

Got to do this.

Slowly but surely, we're getting there.

This is… all right. My light was wrong, but now my light is correct, but also very sharp. All right. Let's see if you can see me. Like so. In theory, you can now also see my face, but this light is very bright on me. I think it might actually look correct to you. Or is it too sharp and bright?

I made some changes to my settings and now I got to make sure the settings are actually correct.

The volume is low.

All right, let's see if I can do something about that. I can certainly move this closer to my face, but then I can also do… properties. No, advanced audio properties is what I want. And then over here, it should be about right. Looking at my volume meter, at least my output volume is correct. But maybe your input volume is down.

You're crystal clear, but maybe too bright.

Okay. Yeah, I know. I need a soft light on the other side. One of the current problems with my setup is that I don't have anything over on that side.

Let's try this, shall we? That seems slightly less wild.

And we'll even do this.

Okay, that should be slightly better now. It should be less overexposed, and the volume should be roughly right, which is all we need. Much warmer? I mean, I did not make it warmer, I just made it less bright. But this time at least we're not stuck in the OBS settings; now it's just the physical things. Cool. All right: we have roughly the right lighting and roughly the right sound, which I think means we're just about ready to go. Yeah, I swap back and forth between these glasses and these glasses.

And anytime I wear these ones, people go, "Oh, you're wearing glasses." When I wear these ones, they don't even register.

So, I figured, why not do a stream with these this time?

I should be a full-time streamer. I mean, someone's got to pay me to make that work. But, uh… I mean, maybe I would play a bunch of programming games, too. It's funny: after the one stream I did where I played an assembly-level-ish, or not assembly-level, embedded-device-ish programming game, I got a bunch of emails from other creators on Steam who were like, "Hey, try our game too please. We also have programming in it." So I could probably do that just sort of separately.

Okay. Yeah, they are orange, partially because I really like the color orange and partially because Rust. It's the same thing with my drawing tablet, which, as you can see here, is also very orange. I don't know, there's just something about that color. Since before I even adopted Rust, I'd been doing that.

Okay, I think we should maybe get started. Let's go ahead and see if I do this. If I haven't completely broken my setup, you should now see my screen, and my face up in the corner here somewhere. It's hard to point at it when I'm already in that corner, but you know what I mean. I do not have an orange iPhone. I have a lime-green Android. It's a Pixel.

Uh, excellent.

All right, then. I think we're just about ready to start. Let me get this all set up. Where's my water? Here's my water, so I don't die.

Uh, I also have… I'll show you. This is for comparison. This is a normal webcam, like a Logitech something-something StreamCam. So the difference in quality between that and this is fairly big. All right. Anyway, back to the screen. We'll actually do the thing. "Can you move it to the lower right / bottom?" No, because that tends to overlap more with my terminal command line, so I'm going to keep it top right. All right, I think we should start. So how about we start recording and kick it off.

Hi folks, welcome back to yet another Rust stream. This is going to be an impl, where we sit down for several hours and just hack out a bunch of Rust code that does something hopefully interesting. There's just a tiny bit of housekeeping before we start. Number one, this channel just hit 100,000 subscribers, which is really, really cool. It means YouTube sends me a plaque, so eventually I will have a plaque, which is fun. It doesn't really mean anything beyond that, except, I guess: we did it. We made a big YouTube channel of people who care about writing Rust code. So that's really fun. Thank you to those who subscribe, who make that be the reality. It's fun. We now have a big number next to a Rust channel.

The other thing is, I've created a discount code for Rust for Rustaceans on the No Starch website. It's YT100K, YouTube 100,000, which gives you I think 30% off the book, or 25% off the book, I don't remember, but go check it out. That might not be active anymore by the time you look at this afterwards, depending on how late you are to the videos. But for those who are watching live, and for those who watch shortly after it's published, you'll be able to use that code to get a discount on the book, to celebrate.

And related to this, just a reminder that I do also have GitHub Sponsors active. So if you like these streams and want me to do more of them, you can go sponsor there. And then hopefully that gives me room to do more of these, and maybe other kinds of streams, too. Who knows? Cool.

I think it's time we dive into the thing we're going to do today. It's been a little while since we did an impl, and partially that's because I wanted to find the right topic: the right thing for us to spend seven hours coding on. And then I came across this thing called the One Billion Row Challenge. This is something that was posted in the Java community back in early 2024. It was basically: here's a billion-row, essentially CSV, file of weather data. So it's basically city and temperature, and your goal is to produce a program that outputs the minimum, maximum, and mean temperature for every location, so for every weather station in the data set, and do so as quickly as you can. It was published for Java.

So specifically, all the infrastructure around it, all the testing, and all the results that were published were in Java. And if you look at the official repo for this (I'll put the link in the video description), you'll see down here that there's a ranking of the results for how quickly different people did it. There were a couple of rules for how you needed to do it in Java, what you could and couldn't do, which VMs you could use, and so on. But ultimately there's a results table, computed by the person who set up this challenge, and it was set up to be evaluated on a very specific kind of cloud instance: you got this many cores, and everyone got the same input file, but it was randomized. And then there are also results for what if you give it more cores, or what if you give it more kinds of locations in the billion rows, like more distinct keys, and so on. But ultimately this challenge ran for a while, and then it was done; there are no more additions being added, at least as far as I'm aware.

But obviously what happened is that people got excited about this and started wanting to do it in other languages. So over the course of 2024, in the repo's discussions page, there are people doing it in lots of other languages. And if you look at the comments on the right, you'll see one of the threads has 61 comments, and lo and behold, it's Rust. So this apparently got people excited in the Rust community too. Partially because people like these kinds of challenges, and partially because I think Rust people just like writing performant code and seeing how much they can push out of it. So there are lots of cool solutions in there. There was lots of iteration in the ecosystem, people trying to find out how they could really extract every last microsecond of time from this. I was not aware of any of this.

I found the One Billion Row Challenge, I don't know, I guess maybe a year ago, and I just put it in a log somewhere of "maybe I'll do this one day". And then I was looking around for a stream topic and I was like: wait, this is a great topic. How about we just do this? And one of the reasons I wanted to focus on it is that the challenge itself is really simple. In fact, where we're going to start today is just to write the stupid, naive implementation that does this, and see how long it takes. I think the Java version includes an obviously-correct-but-slow reference version, just to give people something they can compare their results against. And that one takes, I think, about two or three minutes to run.

It's single-threaded, super straightforward code. We'll write that in Rust too and see how fast that is. And then we'll iterate: profiling why it's slow, figuring out what we can do instead to make it faster, and seeing how much we can get that down. In the actual challenge, the winner, when run with the eight cores on the particular instance that's being used, runs in 1.5 seconds. Now, I also checked these out locally, just to see. The winner here: I took the winning solution that won in the most categories, so not just the one that was fastest on that particular instance, but the one that was highest ranked in eight cores, all cores, and the more-distinct-keys category, to avoid comparing against a solution hyper-optimized for one particular problem set. And you'll see that one of the problems you'll have is that Java has to compile stuff as it goes, so if I run it again, this completes in basically no time: 800 milliseconds. And that's because it gets to use all the cores on my computer and everything, so it'll be faster than the 1.5. But, as you'll see at the top here, this is not even the fastest version of the solution, because I'm running it on the standard JVM, not GraalVM, because I don't have that installed, and I don't want to install all the Java infrastructure for managing SDKs and JVMs and everything. But clearly it's possible to make this very fast. I don't know how close we'll get to this, but we'll profile our way there and see what we get to. Cool.

Let's go here. cargo new… now we've got to pick a name for this. We could just call it "billion row challenge", but that's no fun. What can we do instead? Let's do… "one billion rolls challenge"? So the logo for the official repo is the emoji for one, the emoji for a bee, and then the emoji for a race car, because "RC": row challenge. So we could do something stupid like "beeswax", because it's trying to wax things so that they can be fast, but it's "B" for billion. "Formula 1 wet thrust" is not good; don't like that at all. "brr": I like "brr" a lot. It goes brr. Okay. It is. That's great.

Okay. Now, we're also going to pretty strictly follow the rules of the challenge. I've skimmed through the Rust discussion thread, and I saw that there are a bunch of solutions that don't actually follow these rules, and I want us to actually follow them. Obviously the "use these versions of Java" rule doesn't apply to us, but the other things we want to follow. No external library dependencies may be used: nothing outside the standard library. Implementations must be provided as a single source file; I'm not too concerned about that one. Computation must happen at application runtime, so you can't do a build.rs that does all the compute so that your program completes in constant time. For input values: the format (we'll look at the exact file) is the station name, then I think a comma, then the temperature, then a newline, and then lots and lots of these. The station name is a non-null UTF-8 string of a minimum length of one character and a maximum length of 100 bytes. But notice that this is a UTF-8 string. It does not contain semicolon or newline; I guess semicolon is the delimiter, then, not a comma. And there's no guarantee about which UTF-8 characters are in there, no guarantee that they're all ASCII-compatible or anything. So we need to make sure we actually handle arbitrary Unicode names. The temperature values are non-null doubles between -99.9 and 99.9, always with one fractional digit. So we want to make sure we actually conform to those rules. Maximum of 10,000 unique station names. Line endings are newlines on all platforms. We're not allowed to rely on specifics of a given data set, so we can't hardcode the list of stations or anything. And the rounding of output values must be done using the semantics of IEEE 754, rounding direction roundTowardPositive. Okay, so those are the rules we need to follow.

"No libraries, but stdlib is okay, is very unfair between languages." Totally true. Part of where I'm coming from here is that I think in Rust this is doable without external dependencies, and so I would like to just follow that rule. It's true that for some languages that would be unreasonable, but for us, that's what we're going to go by. Okay.

But let's see what we get to. So here's an input file I prepared earlier; it's in the 1brc directory, and it's called measurements.txt. If we ask for the number of lines, you'll see it's 1 billion lines long. And if I ask for the first couple of lines, this is what it looks like: just semicolon-delimited columns, newline-separated rows.
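Roughly the shell commands being run on screen here (the sample rows below are the ones from the official challenge README):

```sh
wc -l measurements.txt     # -> 1000000000
head -n 3 measurements.txt
# Hamburg;12.0
# Bulawayo;8.9
# Palembang;38.8
```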

Now this is easy, right? Like, if we were to just do this with standard Rust, nothing fancy in here. We'll do let mut f = File::open("measurements.txt"), and this is an unwrap kind of program. Then I will bring in File, and then… in fact, I'm not even going to use a… no, I need a BufReader so that I can go by lines. So I will do BufReader::new(f), which means I don't need mut here. And then I will do for line in f.lines(), and we'll do something like a HashMap here. In fact, we could do with_capacity(10_000); we know there are never more than 10,000 stations. So let's actually… no, this is an optimization. We're not doing any optimizations in this round.

And for every line, we're going to do line.… oh, line is line.unwrap(). Wait, why is it unhappy about this? Oh, right, because I need to bring in BufRead. line here is a Result, because reading from the file might error. And then we're going to do: station and temperature is line.split_once(';'), again .unwrap(). We're going to not do any optimization. Okay, I'm so tempted to do optimizations already, but I'm going to try to resist it. temperature is going to be temperature.parse::<f64>().unwrap().

Then we're going to do stats.entry(station.to_string()) and then or_default. So stats is this… right, it doesn't know the type here. What I'm going to say the type is, is three f64s: the min, the mean, and the max. So stats here is going to be a mutable reference to one of those, and then we'll do stats.0 = stats.0.min(temperature), .1 for the max… and actually, to keep a running mean, we have to do this (and I know this could technically be a struct): we're going to += temperature, and += 1 for the count. We need or_default here so that we insert the station if it's not already in there.

And then after we're done with all the lines… oh, I think we have to print the stations in alphabetical order. Let me see if that's true. So, what does this print? Yeah, these are printed in alphabetical order, in this kind of hashmap syntax. So that means we're actually going to make this a BTreeMap. Again, we're not optimizing yet. And then we're going to do for (station, (min, sum, count, max)) in stats. Right, and this is actually going to be a String key. And for each one, we're going to print: an open curly bracket, then… what's the syntax here? It's just station=min/mean/max, where the mean is going to be sum divided by count. And then we need to make sure we match the comma-separated syntax. So we're actually going to say stats is stats.into_iter().peekable(), so that we can check whether it's the last one. Again, no optimizations, we're not allowed. We'll do a manual iterator here: while let Some(…) = stats.next(), and then if stats.peek().is_some(), we will print a comma and a space. And at the end we print the closing curly bracket. We need the double curly so that it doesn't think it's a format specifier.

We're not doing any optimizations here. There's no and_modify. We're not doing any sorting after the fact. All of those are optimizations. We're doing the stupid thing. And I guess we should put the path to the measurements file in here. So, build with --release.
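For readers following along, here's a minimal sketch of the naive version as dictated above. The exact code on stream differs slightly, and the min/max seed values and the one-decimal formatting reflect fixes that come a few minutes later:

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    let f = File::open("measurements.txt").unwrap();
    let f = BufReader::new(f);
    // station name -> (min, sum, count, max)
    let mut stats: BTreeMap<String, (f64, f64, u64, f64)> = BTreeMap::new();
    for line in f.lines() {
        let line = line.unwrap();
        let (station, temperature) = line.split_once(';').unwrap();
        let temperature: f64 = temperature.parse().unwrap();
        // NOTE: the seeds matter — min must start at MAX and max at MIN
        // (the default of 0.0 is the bug chat points out shortly).
        let stats = stats
            .entry(station.to_string())
            .or_insert((f64::MAX, 0.0, 0, f64::MIN));
        stats.0 = stats.0.min(temperature);
        stats.1 += temperature;
        stats.2 += 1;
        stats.3 = stats.3.max(temperature);
    }
    // BTreeMap iterates in key order, matching the required sorted output.
    print!("{{");
    let mut stats = stats.into_iter().peekable();
    while let Some((station, (min, sum, count, max))) = stats.next() {
        print!("{station}={min:.1}/{:.1}/{max:.1}", sum / count as f64);
        if stats.peek().is_some() {
            print!(", ");
        }
    }
    println!("}}");
}
```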

Oh, why is it unhappy with me? Because this needs to be this, and you can't divide this by this. That's because this has to be as f64. Actually, wasn't there recently an idiv or something? I feel like they added a… no, it's the other way around. Okay, this has to be mutable because it's an iterator. And now it should be happy with us. This doesn't need to be this anymore. All right. So, what happens if I now time this? Again, this is single-threaded, right? So if I do htop: we're using 100% of one core, which is what we would expect to happen. Let's see it here as well. And it's just going to run as fast as it can.

"Your min/max is thrown off by the default zero on arbitrary data." Oh, you're right, you're right. This would be wrong. So we actually do need to change that. But even so, that wouldn't really affect the runtime here. It's still going.

Now, I think when I ran the default Java version, it took me about three minutes. Well, we'll see if this is equivalent. "BufRead is a memory optimization." So, yes, but the reason I used BufReader here, and I think it doesn't count as an optimization, is because it gives me .lines(), which is a very idiomatic way to iterate over the lines of a file. "The min should start at f64::MAX." Yeah, you're right. "What is the time you put before cargo r?" time prints the total time it took to execute the command; it's sort of a shell built-in. All right, I guess we could fix this while we're at it, right? So this would be or_insert. This has to be f64::MAX, this can be zero, this is zero, and this is f64::MIN. Technically right.

Okay. So, it ran this. It took 98 seconds to run, and it seems to have printed something kind of reasonable. Let's see if we compare it here. So İzmir, for example, should be -37.1: -37.1. 17.9: we got 17.898… and so on. And 67.1: 67.1. Okay, so just from a random sampling with an n of one, this seems right. We'll add this in. We also need to make this print only to one decimal place. And actually the same with these, but it shouldn't matter, because these are always values taken straight out of the input, and the input always has exactly one fractional digit. But even so, we'll retain that. Oops.

Cool. Okay. So now we have a comparison point: about 100 seconds. This is already faster than the stupid Java version, but they're not really comparable, right? Because we get to compile, and we don't have a runtime and everything. But this is still very slow. Okay, so let's go ahead here and git add. I'm going to gitignore measurements.txt, because anything else would be insane. And then we're going to git add these: "stupid version". Now let's start to do some optimizations. We'll do the simple optimizations first, right?

So, simple optimizations here include things like: the BTreeMap is a little sad. We could have a HashMap instead, which is usually faster to access, and then just sort the keys at the end. That might give us a slight speed-up. The lines here is not terrible, right? lines reads stuff from the file into a buffer in memory, searches for the newlines in memory, and then yields the sub-slices accordingly. So that one's not too bad. The split_once here just searches for the first semicolon and then gives you the str slices on either side. This is probably pretty costly, because it means we're allocating a new String, and copying, for every station, even though most of them will already be in the map, so we didn't need to do the allocation. The min and max math here shouldn't be too bad. And this only happens once at the end, so I'm not too worried about it.

Okay. So there are some interesting things here, but what I really want us to do is have a look at perf record -g of cargo run --release and see what we get, because otherwise it can be tempting to just guess at where the slowness is. What? Oh, call-graph is maybe the default now. perf report -g, let's see what we get.

Right. Okay. So we get basically nothing useful out of perf. There are a couple of reasons for this, but the main one is that we're compiling in release mode, and in release mode you don't get any debug symbols. Okay, seems like a problem. Let's go ahead and fix that. So we're going to say profile.release: we're going to say lto = "full", even though that probably doesn't matter here; this is a linking optimization, so that you get better inlining across std and our crate, for example. We're going to say panic = "abort", so that we get rid of all the panic machinery that doesn't matter to us. We'll also say debug = true.
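As a sketch, the Cargo.toml additions being made here end up looking something like this (the lto value gets corrected to "fat" a moment later, since "full" isn't accepted, and codegen-units comes up right after):

```toml
[profile.release]
lto = "fat"        # link-time optimization: better inlining across std and our crate
panic = "abort"    # drop the unwinding machinery
debug = true       # keep debug info so perf can resolve symbols
codegen-units = 1  # commonly suggested; likely minor for a program this small
```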

And just to see how that affected things… oh, right, it has to be thin or fat. All right, fat. I just wanted to see what the impact of these is to begin with. It might be basically nothing, but just so that we've done the things that people are going to later be like, "Oh, you obviously need to do this." There's another one, codegen-units = 1, which I think is decently common. I don't think that would matter a whole lot here, because the program is very small; it's not clear to me we would end up using lots of codegen units in the first place anyway. But let's see; the previous time was 98 seconds. We have also enabled debug symbols; I forget whether Linux splits the debug info, so that might end up making the binary bigger, but the debug symbols shouldn't really affect the performance all that much. I think we're about to find out.

"Your cargo colors are weirdly colored; they are gray instead of green." I don't know what to tell you. They turned gray a few months ago, maybe. I'm not sure why; maybe it's my color scheme that sets something wrong somewhere. Oh yeah, target-cpu=native, that's another good one we should set. Okay, this ran in 90 seconds, so it did do something. Let's do target-cpu=native as well.

Oh, do I have to set that through RUSTFLAGS? I think I do. Isn't there a way to set it directly here? Can I do this? All right… oh, it's not stabilized. Do I really have to do it through cargo, target-cpu=native? Yeah, I think instead of setting it here, I can set it in .cargo/config.toml. I can set… I forget what the… aha. Yeah. I guess that makes me sad, but all right.

It's also a little unclear how much this will help us, because the standard library comes precompiled, and so I'm not sure we're going to get the optimizations we want for the standard-library part of the code, but let's see. Okay, so we're comparing to 90 seconds, and now this will be with target-cpu=native, to see how much of a difference that makes. For those who aren't familiar with this: normally, when Rust builds binaries, it tries to build them for a sort of generalized version of your CPU. So if you're on x86-64, for example, it'll build a binary that's runnable on most modern x86-64 CPUs. But my CPU specifically might support some newer instruction sets than the lagging-behind, general-purpose x86-64 target. So it might have some optimized instructions that, if we make use of them, will make the program faster; but also, if we make use of them, then the program will not run on other people's computers. And so it sort of depends on whether your goal is to make binary artifacts that are going to run on other computers, like if you're shipping a distribution. In my case, I'm not doing that; I'm just running it locally. So I'm fine with Rust taking advantage of those extra assembly instructions, or extra operations, that my CPU specifically supports. And that's what target-cpu=native will do.
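One way to set this persistently, as a sketch of what goes into the project's .cargo/config.toml here:

```toml
# .cargo/config.toml
[build]
rustflags = ["-C", "target-cpu=native"]
```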

92. Okay, great. So it's slower. Perfect. That's fine; we'll still leave it in there. Nightly features I think are okay. So, portable SIMD I think is the way we're going to get SIMD. And we could do build-std. I guess… all right, set nightly, just to get that out of the way. And then, sure, why not do build-std? Oh no, it's not that, it's -Z build-std. So now it's going to build the whole standard library from source as well. This is a nightly cargo feature. Uh oh, it's unhappy about that. Huh. "cargo target"… I believe you. Let's see. Yeah, so with target-cpu=native it should make use of all of the MMX and SSE and AVX extensions that my CPU supports. Why… this is one of the many reasons why build-std is not stabilized yet. It is unhappy. Yeah, I wonder why it is unhappy about this. All right, we might just ignore that for now. I don't think rebuilding std here is really going to make a huge difference to us, so we'll stick with our good old setup without build-std, and we can always do that later. "time will include the build time." True, but as you can see, the build time here is basically nothing.

But it's true. I mean, I'm happy to: I would usually either run cargo build --release first, or I just cut it after I see it start running and then start running it again. Yeah, I'm guessing this is still going to be about 90 seconds. We haven't really changed anything interesting.

Okay. So now, I'll git add Cargo.toml, and then we will do perf record -g cargo run --release. So now we're just going to do a perf record, which is going to record a bunch of performance statistics about our code and figure out where it's spending its time. We'll let that run for a little bit, and then we'll do perf report -g to see where time is now spent. A bunch of time in libc. Oh, why is it not giving me libc symbols? Oh, that's because… headers? No, libc… I forget what they're called on Arch.

I thought I already had these installed. But what's it called on Arch? What? What? Oh, it's because I need to… pacman.conf. I need to include… is it core-testing? No, it's not. Arch libc headers… there's like an alternative thing you can do to get the libc headers. Why can't I find it? There's a separate repo you need to enable. "Yeah, they're dev." No, it's not. They live in a separate repo, but where? Why can't I find it? This is annoying. I should have looked this up in advance. Sorry about that. No. No. Let's see if Google is better than Kagi at this. I don't think it's linux-headers, because I think I already have linux-headers. Yeah. I'm confused. I was pretty sure I already had these. libc-dev? It's not usually called that on Arch. "incremental = false" shouldn't matter for the release build here. Is it because I need to actually do this? Probably. No, that's the Linux headers. I'm very confused, because I'm pretty sure I've done this before. "Yeah, you see the debug symbols on your Arch setup." I think it's just something I don't have installed. Oh, unless it's the record step. Yeah, it's because I need call-graph dwarf, I think. Let's see if that helps. There we go. See how much more data it wrote. Cool.
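The perf invocations in play, roughly (the binary path assumes the crate name picked earlier; --call-graph dwarf is the fix that made the call graphs resolve):

```sh
perf record -g cargo run --release              # first attempt: frame-pointer unwinding
perf record --call-graph dwarf ./target/release/brr   # DWARF-based call graphs
perf report -g
```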

Yeah, that's fine. All right, now we're talking. Okay. So, you'll now see that there is our main. And if we go into main… no, that's not what I want. No, that's also not what I want. If you haven't looked at the output of perf report before: it shows you "self", which is the cost of this function excluding its children, and "children", which is the percent of execution time that went to this function and its children. So if function A calls function B, then the self cost of A is the time it took to execute A minus the time it took to execute B, and the children cost is the time it took to execute A including the time it took to execute B. There are a bunch of keys you can use here; in particular, plus lets you expand down into an entry. So if we go into main here, you'll see that a lot of the time is spent on the entry into the B-tree map, which is in turn spent on going into the B-tree, and specifically the compare operation for the BTreeMap. Which isn't terribly surprising, right? Most of what we're doing with every entry is a lookup into this map. We're spending a little bit of time on the parsing, and like 4% of time on our allocation and deallocation of strings: this is the allocation, and this is the deallocation. So we're spending almost 10% of our program just allocating and deallocating strings. So how about we try to do both of these at once?

We're going to go here, and we're going to switch this into a HashMap. Then we're going to match here instead: we're going to match on… actually, we're going to match on get_mut. And if we get a stats back, then we'll use stats. If we got None, then we will do this: we'll do stats.insert(station.to_string(), …). So the thinking here is: we're going to do a lookup without allocating a String, and this is going to use hashing and lookups rather than actual comparisons of the string key. And then: if it's there, we just return that entry; if it's not, we insert the entry. Let's see, is there a get… no, there's not. Okay, so I think the easiest way to get a mutable reference out of this is to do or_insert, like so, at least. Yeah, because insert returns an Option of the old value, whereas what we want here is a mutable reference into the stats. So that's the one that gets rid of the string allocation, and we've changed this to a HashMap. And now, here, there are a couple of ways we could do this: we could either sort all of the keys, or we could just move them into a BTreeMap. They end up being roughly equivalent, I think, but we can do: keys is stats.keys().collect(), then keys.sort(), and sort_unstable is fine. Then we will do… actually, this will be a little slower, because what we get out of this is just the keys, and then for each key we need to do a lookup in the map. I actually think that if we do BTreeMap::from_iter(stats) here, and just keep this the way it was, then we save all the lookups, right? Because we don't need to do a lookup in the BTreeMap, we don't need to look up in the HashMap; we just do the construction once.
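A sketch of the idea (the exact shape on stream differs): do the hot-path lookup with a borrowed key, so no String is allocated when the station already exists, and only allocate on first sight; then move everything into a BTreeMap once at the end for sorted output.

```rust
use std::collections::HashMap;

fn record(stats: &mut HashMap<String, (f64, f64, u64, f64)>, station: &str, t: f64) {
    if let Some(e) = stats.get_mut(station) {
        // Common case: a hash lookup on the borrowed &str, no allocation.
        e.0 = e.0.min(t);
        e.1 += t;
        e.2 += 1;
        e.3 = e.3.max(t);
    } else {
        // Miss path: allocates the owned key, but runs at most 10,000 times
        // over the whole billion-row input.
        stats.insert(station.to_string(), (t, t, 1, t));
    }
}

// At the end, one pass into a BTreeMap gives sorted iteration for printing:
// let sorted: std::collections::BTreeMap<_, _> = stats.into_iter().collect();
```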

And this doesn't need to be… okay. So let's do cargo b --release. Remember, what we're comparing to here is about 90 seconds. "Why not a vector of tuples, and then sort unstable by key?" Oh, I see: just turn the HashMap into a vector that includes the keys and the values. That's not bad. Let's try that. I just want to see how much first. So, we optimized out the string allocation and the string deallocation, and we moved away from string comparisons to just using the hashing. Based on our perf, this should be at least 10% faster. It was 90 seconds, and we should cut about 10% of the runtime, so we should be at no more than 80 seconds. And it's 52. Okay, cool. So, git diff; this one is now "fewer allocations and no partial cmp".

Okay. Let's try the proposal here, which was stats.into_iter().collect() into a Vec. I think this will basically not matter, and the reason for that is… right: this is the printing at the end. Why? Oh right, sort by |a, b| a.0.cmp(&b.0). Yeah, so I don't think this will be a lot faster, and the reason is that it's really just the printing, and the printing is printing very few records. "sort_by_key versus sort_by?" They end up being about the same; there's no huge difference between them. In fact, I think sort_by is supposed to be faster than sort_by_key. I believe I read that at some point; maybe it's actually in the standard library documentation. But this is an example of something where, if we profile here, it's not even going to show up, because we kill the program before it even does the print, right?

"What about the key being a slice instead of an allocated String?" So, it could be a slice for the purposes of equality comparison, but not for the purposes of ordering, right? You couldn't order these by their byte value, but we could keep them a… yeah. Okay. So, 53. So in fact it is slower, but at this scale, 52 versus 53 is just variance between executions. So I'm going to keep this the way it was, and we can come back to optimizing it if it turns out to be a problem. So, BTreeMap, because it's very straightforward, right? I'm going to keep it that way. But UTF-8 parsing is known to be pretty painful, and we're already paying for it with lines, because lines gives us a String.

But what we can do here is say this is going to be a Vec<u8>, so don't bother trying to parse the bytes here. And then, instead of lines, which requires you to parse as a string, we will do… where do I have it? git diff. Okay, great. I will do a split by newline. Now, this is already broken, but I'll explain why in a second. So now this will be a Vec<u8>. We'll split once by a literal semicolon… which doesn't exist on slices yet, which is fun. Do we have an rsplit? Yeah. So this now needs to be: fields is this, and in fact it's rsplitn of two, for the same reason that the newline is broken; this needs to be rsplitn(2, …). I'll get to it in a second. So the temperature is going to be fields.… does this yield them in backwards order? "It starts at the end of the slice; the last element is returned first." So does this… it returns them in forward order? Okay. So: station is fields.next().unwrap(), and temperature is fields.next().unwrap(), like so. And then it's unhappy because the n needs to come first. And this needs to be a this, oops. And now it's unhappy about this, because this is now a slice of u8s, and parse expects to operate on strings. So we'll do std::str::from_utf8(temperature).unwrap(), because we know that this is valid UTF-8; we've been told that's the case. And here, this is now to_vec() instead. So we're avoiding the UTF-8 parsing for everything except the temperature; only the temperature are we going to UTF-8 parse. And then, only down here, when we construct the BTreeMap, we need to map: we're going to get the key and the value, and map to String::from_utf8(key) and the value. And again, we're going to unwrap, because it's all valid Unicode.
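A sketch of the byte-level parsing being described: split on raw b'\n' and b';' bytes, and only decode UTF-8 for the temperature, which parse() needs.

```rust
fn parse_line(line: &[u8]) -> (&[u8], f64) {
    // rsplitn(2, ..) scans from the end of the line, so it yields the
    // temperature first, then everything before the last b';' (the station
    // name — which itself never contains a semicolon).
    let mut fields = line.rsplitn(2, |&b| b == b';');
    let temperature = fields.next().unwrap();
    let station = fields.next().unwrap();
    let temperature: f64 = std::str::from_utf8(temperature).unwrap().parse().unwrap();
    (station, temperature)
}

// Driving loop, assuming `input: &[u8]` holds the file's contents:
// for line in input.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
//     let (station, temperature) = parse_line(line);
//     // update the HashMap<Vec<u8>, _> entry as before
// }
```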

Yeah, I'll explain why this is broken in a second; I just want to start the run. Oh: "parse float: invalid". I think I messed this one up. I think maybe these are in fact reversed.

Great. Okay, so the problem here is that in the UTF-8 encoding of strings, anything that's ASCII just stays the same in UTF-8: it's a single byte, and it's just the one byte that is that character. So a newline, for example, is 0x0A, I think, in hex. And so if there's a 0x0A in the byte stream, you would assume that means newline, because the actual ASCII newline character is encoded in UTF-8 as hex 0A. The problem is that UTF-8 is a variable-length encoding. There's a kind of byte in the UTF-8 encoding that says "this is a multi-byte character" (character is the wrong word, but basically), and the problem you'd run into is: if you have that byte that says "this is a multi-byte character", and then a 0x0A, then that means a different character; it doesn't mean a newline. So if you just look for 0x0A in the byte stream, you'd have to look back to see whether the previous byte was such a prefix. The same applies to longer sequences: UTF-8 doesn't have to be two bytes; it can also be three or four, depending on what that first byte is. And any of those could contain a literal 0x0A, and it wouldn't be a newline; it would be some longer byte sequence representing a Unicode character that just happens to have the byte 0x0A in it. And so this is why this code would actually be broken, both for the newline (if we split by newline, we might split in the middle of a Unicode character) and for the semicolon, because the semicolon is also just a byte (two hex characters; I forget what they are). If we look for that byte, it might be that it appears inside a different Unicode character. Now, for this particular input, I don't think this actually occurs, but that doesn't matter: we've been told arbitrary UTF-8, and so that is what we have to deal with. "The continuation bytes always have their highest bit set." Is that true?

Let's go look at UTF-8 continuation bytes, on Wikipedia maybe. Yeah. So these are the different UTF-8 encodings. So we can have… right: this would be anything that's ASCII, the first 128 code points. Then, if the high bit of byte one is one, it means "I'm encoding parts of the code point in the subsequent bytes". Ah, but all of those subsequent bytes have to start with one-zero. Interesting. So they can't be ASCII. Okay. So that saves us from this problem; the problem does not occur. Nice. Okay, cool. Ignore me, we're fine. So this could have been a danger, but it's not. Good catch, chat.
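A quick sanity check of the property chat pointed out: in UTF-8, every byte of a multi-byte sequence has its high bit set (lead bytes are 0b11xxxxxx, continuation bytes 0b10xxxxxx), so ASCII delimiters like b'\n' (0x0A) and b';' (0x3B) can never appear inside another character.

```rust
fn main() {
    for ch in ['é', 'が', '😀'] {
        let mut buf = [0u8; 4];
        let bytes = ch.encode_utf8(&mut buf).as_bytes();
        // Every byte of a multi-byte encoding is >= 0x80, i.e. non-ASCII.
        assert!(bytes.iter().all(|&b| b >= 0x80));
        println!("{ch}: {bytes:02x?}");
    }
}
```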

All right. What's interesting, though, is that this didn't save us any time; we're still spending about 50 seconds. So not doing the Unicode decoding here didn't really save us anything. We still have to do the decoding here, but not doing it for the station names didn't make a difference, which is fascinating. Even so, I think we should keep it, because we're clearly doing strictly less work, right? Let's do: "lazy unicode". Oh, unchecked, you're right. Let's do unchecked. And the same thing down here. And then we don't need the unwrap. SAFETY: the README promised. We're in performance-hacking territory here; this is okay. All right, let's see. You know, when the README promises, that's all we need. We know we're in good hands.

So, the difference between the unchecked and the checked version here is that the checked version will walk the byte sequence and check that it's valid Unicode before returning a Result, which means it's still doing the Unicode decoding. So chat is totally right: this needs to be unchecked, which is really just saying "trust me, these bytes are UTF-8; you don't have to go check". And so now we should actually not be doing the Unicode validation, and we should see the speed-up.
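The swap, as a sketch:

```rust
fn parse_temp(temperature: &[u8]) -> f64 {
    // Checked version: walks the bytes to validate UTF-8 first.
    // std::str::from_utf8(temperature).unwrap().parse().unwrap()

    // Unchecked version: skips the validation pass entirely.
    // SAFETY: the data set's README promises valid UTF-8.
    unsafe { std::str::from_utf8_unchecked(temperature) }
        .parse()
        .unwrap()
}
```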

47 seconds. Okay, so we saved three full seconds on the walk here. Wow, that's wild. Such a speed-up, chat. We did it. No, I see no problems. Great. So we'll do a git commit -a --amend. And then someone said, "How about we get rid of overflow checks?" Let's see. I doubt this will make much of a difference, but, you know, if we're really going for it. Oh, trust me, we're going to get into byte searching with memchr and SIMD. We're going to get to all the good stuff, but we're starting slowly. It's a gentle introduction, you know. So now we're going to do… okay, so we're running without overflow checks. I'm guessing this will save… I don't think this will save anything, actually; I think this will be the same time. And yes, overflow checks should be off in release mode anyway. Yeah, exactly the same time. So we're going to leave this out, because it's off in release mode anyway. All right.

Okay. So let's now think about what we want to do next. Well, one thing that's bothering me here is this reading, because we're taking the file and reading it in smaller chunks, but we know we're going to walk the whole thing. So what we actually want to do is something called memory mapping, where you tell the kernel: hey, can you just make this whole file be present for me in memory? The way you would normally do this in Rust is through something like the memmap2 crate, which allows you to do exactly this: it lets you take a file you've opened in Linux and memory-map it into our address space. Now, this is technically an external dependency, and I said no external dependencies. So I'm going to disallow us from using it, even though it would save us a bunch of time, because it lets us talk about what memory maps are and how you use them. What we're actually going to do here is instead look at what happens when you do this yourself.

So, you'll see here there's an MmapAsRawDesc, which is really just implemented for raw fds; the thinking being that if you have a file descriptor to a file, that's all you need in order to tell the kernel to do this operation. So let's look at what they do when you do a map. What does MmapOptions::new do? Okay, that's unhelpful: it's just the default. Cool, thank you. And then it does map, which is the thing that we want. So it gets a raw descriptor and then does… oh boy. Okay, let's actually go look at the source here. Source, unix, the inner mapper's new. Okay, so you'll see it gets the length of the file, and then passes that in as the length here; file is going to be the file descriptor. Flags, we'll talk about a little bit later. And then the offset into the file you want to start reading from; that's going to be zero for us. Okay, the alignment here is based on the offset, which we don't care about. Adjusting the mmap params: let's go look at what they're doing over here, just to see if there's anything we want to keep.

Not worried about that; we're hyper-optimizing here, so we don't care about 32-bit targets. We're just going to say 64-bit targets, so this can't happen. Length is length plus alignment; our alignment here is always going to be zero, because our offset is zero, so that can stay the same. It's not going to be zero-sized. Okay, that's all it's really doing. And then you'll see it calls mmap here. mmap comes from libc. Now, libc I think is a dependency we have to be allowed to take, because it's libc; I'm going to claim libc is allowed. libc is also used by the standard library, so we're not actually taking an extra dependency here; because otherwise we would have to do raw syscalls, and that feels excessive. So this is the main thing that we need here. And I'm going to make this slightly nicer: a mmap function, and it's going to take a File, which is one of these, and it's going to return you a 'static u8 slice. Maybe not 'static; we'll find out. Oops. No. Oh, what did I do? I did something stupid; I don't understand what I did. Okay: fn mmap of a File returns you a &'static [u8].

Okay. libc::mmap. The address is going to be ptr::null_mut(), because we're not trying to map into a particular area of memory; we want the kernel to tell us where it mapped this. The length is going to be f.metadata().unwrap().len(). So that's the length here. The protection bits… so the protection bits, you'll see, up here it's desc.0. What's the second argument to map? Yeah, that's just down here: the file is going to be f.as_raw_fd(). With the protection bits, you can basically set whether the mapping of the file should be read-only, or allow writes, and so on. So let's go ahead and look at what they do for the protection bits. In our case, all we need is read access. But let's see what the default is for self.populate. Let me go look at the default they have for MmapOptions. They derive it, so populate is false. Okay. Oh, I closed that, didn't I? Okay. Map error… last OS error… where is the… ah, "if populate". Okay, so they set populate to zero. So we just want PROT_READ. And yeah, we want MAP_SHARED. We don't care about MAP_NORESERVE, and we don't care about MAP_POPULATE, because we don't want to pre-read the whole file. We do, however, want to do one more thing, which I'll show you in a second. Cool. And then the offset is going to be zero. So that's going to give us a mapping back.

This is going to panic otherwise. They have a from_raw_parts, which is really just the pointer with the offset added, and the length. So this, in our case, is just going to be the pointer, and we're going to do std::slice::from_raw_parts. Now, there are a bunch of reasons why this is broken, but let's have it be fun for now. Cool. This pointer is a pointer to a c_void. len here we're going to cast to a usize. This we're going to cast to a *const u8. And this is going to be io::Error::last_os_error(). Now, technically this isn't 'static; technically this reference only lives until this file is closed. So I think what I'll actually do is this. Yeah, yeah, yeah. Okay, fine, I'll elide the lifetimes. That's fine. Okay. All right.
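A sketch of the wrapper being written here, assuming Linux and the libc crate, and including the madvise hint that comes up next. The 'static return is the deliberate fudge discussed above; as chat points out shortly, the kernel's mapping isn't actually tied to the File at all.

```rust
use std::fs::File;
use std::os::fd::AsRawFd;

fn mmap(f: &File) -> &'static [u8] {
    let len = f.metadata().unwrap().len() as usize;
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(), // let the kernel pick the address
            len,
            libc::PROT_READ,      // read-only mapping
            libc::MAP_SHARED,
            f.as_raw_fd(),
            0,                    // offset: map from the start of the file
        )
    };
    if ptr == libc::MAP_FAILED {
        panic!("{}", std::io::Error::last_os_error());
    }
    // Tell the kernel we'll read front-to-back so it can prefetch from disk.
    unsafe {
        libc::madvise(ptr, len, libc::MADV_SEQUENTIAL);
    }
    unsafe { std::slice::from_raw_parts(ptr as *const u8, len) }
}
```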

Cool. Um, so now what we can do here is we'll still open this, but then we'll do a map is m map of f. Um, and there are a

couple other things you can do that are neat about a mm map, which is you can also tell the kernel how you're going to

use it. So, you'll see here, um, not on

use it. So, you'll see here, um, not on there, but on their mAP type um, how do they expose it in this library?

Yeah. So, you have this thing called Advice, uh, where you can tell the OS how you're intending to use this. And in particular, there's one thing that's really neat in here, which is the ability to tell it, uh, here. Where's their definition of Advice? Uh, here. Okay. So, you can tell it that you're going to read the file sequentially. And what that means is you tell the kernel: hey, for this file, when I read at a given byte offset, I'm going to keep reading in that direction, so feel free to fetch more of it from disk for me, um, in advance. And so we're going to tell it that we're going to do this, uh, the sequential reading.

And you do this through a separate call here called libc::madvise. Uh, we're not going to do an offset here, uh, because this applies to the whole file. Actually, why do they even pass an offset? Yeah, offset zero. All right, that's fine, because you can say: for this part of the file, I will read it sequentially, and stuff. Doesn't matter to us. Uh, in our case, we will just do that here as well: libc::madvise, uh, of pointer, and for length, len. Uh, we're going to madvise, let's see, is it in libc, I wonder, uh, MADV_SEQUENTIAL, which is 2. Why does the libc...

Okay.

Uh cool.

Like so. Uh, and then what we will actually do is this.
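The madvise call itself is small; a hedged sketch, assuming `ptr` and `len` are the mapping address and length from the mmap call above:

```rust
// MADV_SEQUENTIAL: "expect page references in sequential order", so the
// kernel can read ahead aggressively and drop pages behind us.
fn advise_sequential(ptr: *mut libc::c_void, len: usize) {
    // SAFETY: ptr/len describe a live mapping returned by mmap.
    let rc = unsafe { libc::madvise(ptr, len, libc::MADV_SEQUENTIAL) };
    debug_assert_eq!(rc, 0, "madvise failed: {}", std::io::Error::last_os_error());
}
```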

Um, all right. So now we get a map back. Uh, we now don't need the BufReader anymore. Um, and we could actually still have a BufReader over the slice if we really wanted to, um, but there's no real reason to do that. Uh, this is now just going to be over map. Uh, split is going to be by |&c| c == b'\n'. There's no unwrap anymore. Uh, and this needs to be a star. Uh, the other thing that's neat is now we don't even need to allocate vectors, but let's do that as a separate optimization path.
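For context, a minimal sketch of what the main loop converges to at this point (the closure form is an assumption, and this includes the empty-line break that gets added a moment later):

```rust
// Sketch: iterate the mapping directly. Each `line` is a &[u8] borrowed
// from the mmap, with no buffered reads and no per-line allocation.
fn process(map: &[u8]) {
    for line in map.split(|&c| c == b'\n') {
        if line.is_empty() {
            break; // the file's trailing '\n' yields one final empty slice
        }
        // ... split on b';' and aggregate min/max/sum/count ...
    }
}
```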

Um, okay, let's first see whether that's any faster. So remember, we're comparing to 48. So we'll run that. Um, all right, what is chat asking? What is the difference between mapping and reading the full file into a u8 vec? Isn't that basically the same? Um, it's similar, except (a) you don't need to keep a vector of the whole contents of the file, right? If you did that, you would need to read all billion rows into memory, which means allocating that much memory. You don't need to, um; if you instead just know that you're reading through the file sequentially, then the pages you've passed by can be dropped. Whereas with a vector, you're going to have to keep doing the allocate-a-vector-of-twice-the-size, copy-everything-over dance each time you cross that growth increment.

Ooh, interesting. Panic. Well, we'll look at that in a second. Um, so you don't have to do any of the grow-by-two, you don't have to do any of the copies. Um, you can deallocate more eagerly, and so you don't need quite as large a memory footprint. But also, you're giving a clue to the kernel about what you're going to do in the future, because the kernel now knows (a) that this is a memory mapping of this file, which it doesn't if you're just doing plain reads, and (b) with the sequential flag we're telling it where we're going to read in the future, so it gets to start loading things from disk in the background while our program is running. So we get, effectively, I/O parallelism. Um, so, like, mmap doesn't actually copy everything into memory; it's more that the kernel makes it look like that's the case, but the kernel owns that mapping. Um, and chat points out: closing the fd doesn't do anything to the mapping; you must unmap if you want it gone. Oh, even better. Okay, so this doesn't even need to be tied to anything. Cool. Uh, this now fails here.

Interesting.

Why?

Uh, I'm going to do this just to give us slightly more information about what's happening: uh, from_utf8_unchecked of line.

Yeah. And with mmap, someone also pointed out, um, you could even map a file that's larger than the amount of memory that you have, whereas if you read it into memory, that wouldn't work. Um, so, like, it's not allocated in kernel memory either. To be clear, it's memory mapped, so not all of it is stored in memory; the kernel makes it look that way by, uh, configuring the virtual memory system in the right way. Um, the way it actually works is it maps, uh, only certain pages at a time. Oh, let's see.

Bad line: empty line. Oh, it's because, um, there's probably a newline at the end of the last line. So here, uh, we'll do: if line is empty, then break. Chat: the actual challenge assumes the file is placed on a RAM disk, so it's already in memory. Uh, even then: if you're doing this with a BufReader, even if it's mounted on a RAM disk, you would still be copying all those bytes into your own memory, into a separate vector, right? Whereas with the mmap, the kernel will realize that these are the same and not have to do that. Um, 32 seconds. Okay, cool, this is definitely faster. Uh, what do we have here? So now we do git commit, uh: mmap, 32. Okay, that's a lot better. And remember, we're still running single-core. We haven't done any multi-core parallelism. One of the reasons for that is because, once you do that, it becomes much harder to figure out where you're spending your time, uh, because all the threads are doing things concurrently.

So this is on purpose; we're doing the single-core version first. Um, there's another optimization here, which is, uh... in fact, let's, um, uh, perf record, just to demonstrate that I'm data-driven here.

So we'll let this run for a little bit.

Perf report.

Let's see what perf report tells us. Um, okay. We're still in main. Uh, I'm still not getting libc symbols, which is very disturbing. Okay.

Uh, where is our time spent? Okay. We're spending a decent amount of time, 10% of our time now, parsing f64s. Um, 5% in splitting here, 2% in splitting here. So definitely some time spent on the splitting by newline and the splitting by semicolons. Um, hashmaps: uh, yeah, so this is doing the hashing, and then this is doing the lookups in the hashmap. So there's some time going there, but still only 70% is attributed to main. So

where's the rest going? This seems like we're losing some of our stack information. And so I'm going to do one more thing in our cargo config, uh, which is, um, uh, what's it called in the Rust configuration? Is it called emit frame pointer? No, it's called force-frame-pointers equals yes, I believe (the rustc flag is -C force-frame-pointers=yes).

So, we'll run that for a little bit.

We'll kill it.

Let's see if that gives us a more reasonable report here.

Aha. Now we have 99%. Okay. So, what did I do? Um, force frame pointers. Um, so when you build programs... uh, there's this thing called the frame pointer. Not when you build them, but when you run your program, there's something called the frame pointer. The frame pointer is a register in the CPU, um, that points to the, um... well, no. There is a register in the CPU that's the frame pointer, um, but, uh, let me change my definition here slightly. Maybe I'll do this with drawing, now that I think about it. I will do this with drawing; we haven't drawn in a while on streams, so let's do that. Um, okay.

So, uh, when you call a function, there is something called the stack. Ooh, that's definitely not what I want. Uh, give me a slightly less insane brush, please. No. Uh, why is my brush selector not working? Come back. Why? Okay, fine. I will do this. So, okay. All right. Settle down, everyone, settle down. Brush. I want my brush. Okay. I think this program may have some trouble. Okay, maybe I will not draw this then. I know this program hasn't been updated in a while, and it's showing. All right. Fine. We will not do it with drawing; um, I will try to explain instead.

So, um, when function A calls function B, uh, normally in execution, um, there's a bunch of stuff that happens in the prologue of function B. This is before the body of function B is executed: we go into what's known as the function prologue, then function B executes, and then we go into function B's epilogue. The prologue and the epilogue are code injected by the compiler, um, to do the things that are needed to call functions and return from functions at the CPU level. This is the calling convention: this is how, for example, a function knows where its arguments are located. Like, when A calls B and passes a bunch of arguments, where do those arguments go? The calling convention dictates that some of them go in CPU registers and some of them go on the stack, and they're placed there by the function prologue. The epilogue similarly dictates where the return value goes, and also, um, does the instruction that's needed to go back to A, if A called B; the ret instruction, for example. Now, as part of the function prologue and epilogue, um, there's an optional step.

Um, the optional step here is to put on the stack a pointer to the previous stack frame. So when A calls B, the question is: do you also push a pointer to where A's stack frame is, um, or do you not? Um, and this step is optional because, um, technically speaking, if you're at the top of B's stack, like B is currently executing, um, technically you can figure out where A's stack frame is through the debug info, because you know the size of all the arguments and you know where they're supposed to be passed. And so you can just say: my pointer into B is here, and I deduct all the information that I know should be on the stack according to the debug info, and therefore A's frame must start here.

Um, this is a bunch of work, though. Uh, and this tracking doesn't work perfectly. And so sometimes you don't realize where a stack frame starts, or it points to the wrong place. And this matters when you start unwinding the stack. And unwinding the stack is exactly what happens when you try to construct this kind of profile: when you record through perf, for example, it samples every now and again, and it checks your program: what is it currently executing? And that question of what it is currently executing is really: which stack of function calls is this thread currently inside of? In order to figure that out, it needs to walk the stack to determine which function call chain the program is currently in. Now, that walking is, uh, more difficult if you have to rely on the debug symbols, and it's much easier if you have the frame pointers.

Um, now, the reason people don't have frame pointers in, like, modern programs, or why they're often off by default, is because there was a period of time where CPUs had a limited number of registers. So every register was, um, precious. You wanted to use them for compute, not for things like frame pointers. Um, but in order to push the frame pointer onto the stack, you needed a register to keep the frame pointer in. And there's a bunch of details here that we can mostly ignore, but it comes back to the fact that there was a period of time, especially with 32-bit processors, where registers were precious, so we didn't want to use one for this. And because it's usually possible to unwind the stack without using this register for the frame pointer, well, then we're just not going to use it, and we'll use that register for compute instead.

There have been a bunch of experiments since then showing that on 64-bit processors, which have a lot more registers, you don't need to do this optimization. You can just always put the frame pointer in there, and then unwinding is always perfect, because the information is stored right there. You don't need any heuristics; you don't need the debug info. Um, but the defaults haven't really caught up. Now, there has been an effort to change the defaults so that we get frame pointers and always get perfect unwinding. Um, but Rust, for example, does not default to this being on. I think it probably should. There's actually an ongoing discussion on the Rust issue tracker about whether we should change the default here. And I think the default has now been changed for std but not yet for crates, or it's the other way around. Um, but this is the kind of thing that I'm guessing will probably change, because it makes debugging much better and it doesn't really make programs slower, at least not in the vast majority of cases. They don't have unlimited registers, but they have a lot more. Um, so the reason I turned on force frame pointers is so that now my stack trace doesn't have any gaps in it. Uh, and so I can actually look through this whole thing, and now you'll see the whole stack call chain is all perfect.

Okay. So, we see 10% of our time in parse. A tiny bit in max and min. Um, a decent amount in alloc here, hashing, uh, lookups. What's interesting to me here is there's nothing that shows that we have to do the allocations for each key. And I think the reason is because we only do that allocation if the entry is not already in there. So only in that case do we turn it into a vector, and we only do that for the number of unique keys. But we can now

do this... uh, nope, uh, do this, because these can be references into the mmap. There's nothing that needs to be owned about them. And so now this can actually go back to what it was, because we always have the owned version, like so. And so this is sort of a neat thing that makes the code nicer. Uh, and now this can be str::from_utf8, because it doesn't need to be an owned String anymore; we already have a reference. Um, now, this is something where I don't think this will actually make a difference.
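A minimal sketch of the shape this gives us, assuming the mapping outlives the hashmap (the names here are illustrative, not the stream's exact code):

```rust
use std::collections::HashMap;

// Keys borrow straight from the mmap'd bytes, so inserting a new station
// costs no Vec/String allocation; the lifetime 'm ties keys to the mapping.
fn count_stations<'m>(map: &'m [u8]) -> HashMap<&'m str, u64> {
    let mut counts: HashMap<&'m str, u64> = HashMap::new();
    for line in map.split(|&c| c == b'\n') {
        if line.is_empty() {
            break;
        }
        let semi = line
            .iter()
            .position(|&c| c == b';')
            .expect("every line has a ';'");
        let station = std::str::from_utf8(&line[..semi]).expect("UTF-8 station name");
        *counts.entry(station).or_insert(0) += 1;
    }
    counts
}
```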

Um, like, what was our previous run here? If we go back up, our previous run was 32. This will probably not make a difference, because we're still at 32 seconds, which is quite a long time, and most of the keys don't need to do this allocation. Um, same thing with the strings at the end: allocating those strings, we're only allocating, what, less than 10,000 strings? Because it's just the number of unique keys in the map. So this, I think, is not going to make a difference. But... yeah, effectively the same number. Still, as we optimize the rest of the code, this matters. So I'm just going to do it proactively, even though I know we should be data-driven. This is a premature optimization. But I'm going to do it anyway. Um, okay.

Uh, less allocation, please. Um, chat: doesn't this break the sequential read assumption? Um... that's an interesting question, actually. Yeah. No, you might be right. Let's undo that. Well, I don't know if you're right, but I agree that I want to measure it. Uh, so let's leave a note: maybe make the key owned, but measure, since we're breaking sequential reads.

It'll still be in the kernel page cache, but it means we're holding onto basically a random set of pages, because we happen to have keys pointing into them. Um, and we're going to be accessing them every time we look through the map. So the kernel is going to have to keep those pages in memory. There probably aren't that many, so it's not using that much memory. But given it doesn't even show up in our current allocations, I think let's keep it. Um, okay. So, uh, let's commit this: frame pointers, please.

Uh, okay. Uh, what do we have next? So in our perf report, what did we get? Where else were we spending our time? Well, in here, uh, main: we're spending a decent amount of time here parsing floating-point numbers. Um, and that's sad. We don't want to spend that much time parsing floating-point numbers. So, how about we do something better than that? Well, uh,

remember how they're floating-point numbers with a single decimal digit? Well, those are just integers shifted by a power of 10. So, how about, instead of parsing them as f64s, um, we parse them as, uh, i16s.

Um, so this is going to be a little interesting, because we have to skip the dot. Now, we could actually implement this parsing ourselves, uh, by saying: temperature is zero. Uh, and then we're going to do: sign is going to be, uh, temperature[0] is equal to minus... actually, we're going to do: temp is going to be temperature, copied, because why not? They're bytes. And now I'm trying to decide how I want to do this.

There's a stupid-simple way, which is you just branch on whether, uh, the sign is minus or not. Um, there's also a way in which you don't do a branch at all. But how about we do it the simple way first? So we say: if temperature[0] is a minus, then temp is temp with one skipped, else... which actually means we can do: skip is one, else zero. Uh, and then we're going to say temperature is zero. Uh, and then: we know it's always exactly three digits, so we're going to do for i in 0..3. Um, the reason we want to use explicit loop bounds here is because then the compiler can do loop unrolling, right? We tell it how many characters there are, so it doesn't have to count.

And then we will say, um, temperature times-equals 10, uh, right, because from the previous number we're at, we're shifting the previous number left. Uh, and then we will add, uh, temperature of... actually, for i in... we have to be slightly more specific than this, because we're potentially after the minus sign. Remember, the value is between -99.9 and 99.9. So we want the two, um, whole digits of the number, then we want to skip the decimal place, then we want the character after the decimal place, and we're going to just treat the whole thing as though it's a three-digit integer.

Um, so: plus-equals temperature of i. Uh, this one could also be no-bounds-checks, but that's fine; we're in release mode anyway.

Um, right.

Now, this is going to be i16::MAX, i16::MIN. Uh, this is now going to be i16, i16, i16. Uh, and in fact, I don't think this is right. This should be skip; this is like the offset. And we want here skip plus i. And this needs to be t, so that we don't alias the value. Uh, this is a u8, uh, which we're just going to cast. No, this is a character. So this is going to be minus b'0', right? We're going to subtract the ASCII value of zero from the ASCII value of the digit we're looking at, and that gives us the actual numerical value. Um, and then this whole thing we're going to cast to i16. This is now actually zero.

Uh, this is going to be of t, this is going to be of t, and this is going to be of t. Uh, but the accumulator here probably needs to be more than an i16. This probably needs to be an i32, because we have a lot of rows and we're adding them together so we can eventually take the mean. Um, this might overflow an i16, which only goes up to around 32,000. So, given the number of rows we have, I think realistically the accumulator needs to be bigger, um, which is fine. This can just be i32::from; similarly, this can be i16::from. Um, I tend to prefer using the From implementations rather than casts, because From, for things like this, is infallible. Uh, From means that, statically, when you look at the code, it is infallible, whereas a cast is allowed to be lossy, and this is not allowed to be lossy. Um, and then down here, um, the printing is going to be a little weirder now, because printing this will actually mean turning it back into a, uh, floating point.

i64. You think it needs to be an i64? Unsure. I guess we could do i64. I don't think we'll need i64, but all right, fine. Um,

now the printing is a little trickier, because we have these integer values that we need to print as decimals. Um, we could cast them back to floats and then print them. Um, which, given that the printing is cheap, maybe that's what we do. So, this would be, uh, min as f64 / 10. Uh, max as f64 / 10. Uh, and this has to be sum as f64, divided by 10, divided by the count.

Yeah.

Oh, right. And we're also missing the sign here: uh, if skip is equal to one, then t is minus t. This is where we're going to get into branch prediction problems, but we'll get into that later. Um, what did I mess up? Oh, this needs to have a colon. This needs to have a colon.

And... all right, let's see what we get here.

Len is 3, but the index is 3. Uh, interesting. In release mode. Interesting. Um, well, that's interesting.

Oh, okay. We can't actually guarantee this, because, um, it's not guaranteed to have two whole digits, right? 9.3 degrees is not 09.3 degrees; it's not zero-padded. Um, so, yeah. Okay.

Um, we could actually walk from the back here. I think walking from the back is actually going to be easier. In fact, it's going to help with the skipping, too. So, okay: t is going to be zero. Then we're going to do, uh, for i in 0 to 4... no... let's just do: for d in temperature. Let's see how the compiler does with this. So, we're going to say, um, we're going to walk from that side.

We're going to say: if d is a dot, then we're going to skip. And yes, I know the branch predictor will probably be sad. Um, and then we're going to add... we don't need the skip anymore. Uh, and this is not i; this is now d, for digit.

Um, and then this is going to be... uh, we need a mul. So: one small mul. Uh, mul times-equals 10. Right? So we walk from the right. We take the digit we have, we multiply it by a multiplier which starts at one, then becomes 10, then becomes 100. Um, so we take the first digit, multiply it by one, which is just the digit, and add that to t; then we go around again, and we're now at the next character going back. This needs to be .rev().

Um, in fact, this is going to be a match on d, where if we hit a dot, then we continue. If we hit a minus, uh, then we break, and we say t is minus t. Uh, and otherwise we do this. And this needs to be this. And now this all goes away. Middle out! Yeah, that's funny.
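Here's a minimal sketch of where that backward walk lands (modulo the exact integer widths discussed above):

```rust
// Walk the bytes right-to-left with a growing power-of-ten multiplier,
// skip the '.', and negate on '-'. "-12.3" parses to the integer -123.
fn parse_temperature(temperature: &[u8]) -> i16 {
    let mut t: i16 = 0;
    let mut mul: i16 = 1;
    for &d in temperature.iter().rev() {
        match d {
            b'.' => continue, // skip the decimal point
            b'-' => {
                // leading sign: negate and stop
                t = -t;
                break;
            }
            _ => {
                // ASCII digit to numeric value, shifted into place
                t += i16::from(d - b'0') * mul;
                mul *= 10;
            }
        }
    }
    t
}
```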

The other way here, right, would be to look for the dot, parse the whole part as an i64 using the standard Rust parsing logic, um, and then parse the fractional digit, which we know is always there. Um, I'm not convinced that's faster; it sort of depends on how complex the Rust parsing logic is. Um, and then it would be nice if we could tell the compiler what we know about these numbers, right? The temperature is always at least three digits. But let's see what we get with this. Uh, chat: all the numbers have a dot. Yes. Uh, we could match on the length of the slice, but that doesn't tell us whether there's a dash for a negative or a missing digit, right? So, uh, -9.3 is the same number of characters as 99.3.

Uh, cool. This now ran in 30 seconds, so not really any faster. Um, let's do a perf of that and see what we get. Let's see where this time is coming from. Obviously, a lot of it is still the hashmap, um, the split and the rsplit, but I'm not really seeing anything spent on parsing. So, the parsing time is now basically gone from here. And yet, we're no faster.

Interesting.

Yeah, like, 10% vanished, but we weren't 10% faster. That could just be luck of the draw, or the lack thereof, right? Oh, there's this iterator, which is the iteration from the back on temperature, but that adds up to, like, 1% of execution time. Um, this is also the kind of stuff you can pull out with tools like Hyperfine, right, which will run it multiple times and try to gather some, um, some statistics. Oh, it's because it's inlined, right? So, the cost is still there; it just doesn't show up as its own entry, um... yeah. Yeah. So, if we go to main...

No, here. What? No. I see: the self cost on main is what's gone up. Yeah. All right. Um, so you can also go here and then say, uh, annotate main, and then it will show you the actual assembly for this function, and which specific instructions are causing you this cost. Um,

cost. Um, now it should also give me the source annotation, but why is it not? Uh,

oh is what I want. That's entirely

unhelpful.

Why isn't it showing me the uh not line number source code view? Yes s

we go. Okay, so here we now have inline the uh all of the Rust code that's actually being called in here. And so if we now go up,

where's our set here? This is what I wanted. Okay, so if you squint between the lines here, you'll see that we're currently in main. It might be hard to see, because it has inlined everything, right? Um, but you'll see some calls that are familiar, like here's our i16::from and so on. Um, you'll see the if predicate, yeah, here: for d in temperature. And so you can see where all of the time, the percentage of time for each assembly instruction, comes from. Uh, and this comparison here on d is certainly where some of the time is coming from. We are, however, getting to the point where looking at the assembly instructions is sort of misleading, because it also matters which part of the CPU you end up actually spending the time in. Um, so

what I mean by that is: this jump is expensive, but why is a jump expensive? Like, what is the CPU doing that is expensive here? And this is where you get into things like branch misprediction and, um, prefetching, and all of that stuff. And what you can do is, uh, perf... oh, I never remember what this is called. It's like perf, uh, perf stat something. Perf stat. What's the one I'm looking for? Ah, maybe I can just run it with this and see what it gives me.

Yeah. So here you'll see it shows, um... we're not getting a line-by-line or function-by-function view here. Instead, we're getting information about how the CPU spent its time: where did the time go? And you'll see we run three instructions per cycle. Um, some amount of stalled cycles. Stalled cycles are when the CPU wanted to run something but couldn't, because the data wasn't available yet; it still had to be fetched from memory. Um, you'll also see, yes, about 8% of the time the CPU is just idle because it's waiting for things. Um, the number of branches per second: this is a lot of

branches, right? We're branching three billion times a second. Branches are expensive because, ideally, the CPU knows what the program is about to do: it has multiple compute units in there, and things that can access memory, and things that do floating-point operations, and ideally you keep all of those active at once. The way you do that is, instead of just executing the next instruction, you execute, like, the next 10 instructions all in parallel, because you're using different parts of the CPU. But when you

have a branch, so like an if-this-or-that, the CPU doesn't know which branch you're taking, so it doesn't know which code to run next. So what the CPU will usually do is try to guess which branch you're going to take, based on which branch you've taken in the past. Let's say in the past you took the A branch: it's going to start executing the code in the A branch. By the time the result of the branch is known (because, again, remember, it's running multiple things in parallel), if it mispredicted, so if it predicted A but you actually took B, it has to throw away all the stuff it did in the A branch and now execute the stuff in the B branch. So this is a misprediction. When you have mispredictions, you end up with exactly this problem: the CPU now needs to scrap a bunch of work it did, so those cycles were wasted, but it also might need to wait for things to be loaded in this other branch, because it didn't load them, because it didn't think it was going to run it. Um, and so this is where you have this thing called the

branch predictor, which tries to keep track of these branches and which way you take them. But if you have branches that are highly difficult to predict, things like "is there a minus sign at the beginning of this temperature", which is going to be roughly 50/50, then the branch predictor is often going to be wrong, and so we're often going to throw away those cycles. Uh, and you'll see here it mispredicted 2% of all branches. But that's not super helpful to us, because the issue isn't what percentage of all branches are mispredicted: if there is a branch on the critical path of the program, one we execute very frequently, then mispredicting that branch, even though it's only one branch, might cause a lot of stalls and a lot of wasted cycles. Um, and chat pointed out you can pass -d to get more stats out of here.

Uh, so we'll let that run for a while.

There we go. So now we get more stats. Um, you'll see here, this is, uh, how often we load stuff and are able to get it from the cache closest to the CPU, and how often we miss in that part of the CPU. Um, this we're going to ignore for now; this is stuff around the page table, and when you take faults, and all that. I think we can mostly ignore that right now. Um,

but the branch misprediction here is a little bit sad, and I think that mostly comes from this parsing. Um, the other thing we saw from our perf report was that we're spending a decent amount of our time down here in, uh, the split and the rsplit. So, the parsing out of the newlines and the semicolons. Um, so we kind of have to pick which one we want to do first. Do we want to try to speed up the stuff around the parsing, which, remember, was about 10%, and is probably still about 10%? It just doesn't show up as a sub-function. In fact, maybe let's make it a sub-function so it will show up. Uh, let's do that.

So: fn parse_temperature, uh, which takes a temperature, which is a slice of u8s, and gives you back an i16. And then we take all of this stuff; this goes away; now, uh, we stick that into there. This now returns t. Uh, and then we say here: let t = parse_temperature(temperature).

All right, let's see what we get out of this. Uh, so we'll let this run for a while. And now we'll do perf report; see what we get out. Yeah. Okay. So our parse_temperature, which is inlined (useful information), uh, is about 8% of execution time. So we didn't really save very much compared to the 10% that was floating-point parsing. We saved a little bit, but not all that much. And if we look inside of it, it doesn't really tell us where that time is going. Um,

although we could here... uh, no, that's fine. Okay, that's fine. Um, I now want to... oh, why won't... oh, because it's inlined. Yeah. Can I still... yeah, that's what I'm worried about. Um, we can here, if we really want to, do a, uh, #[inline(never)]. Um, this will make the program slower, but it will also allow us to introspect specifically the assembly of that function, to see where it is spending its time. Um, this might not be super useful, but let's see what we get out of it anyway.
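For reference, the shape of that change; a temporary profiling aid, not something to keep:

```rust
// Forbid inlining so parse_temperature keeps its own symbol in perf report
// and can be annotated in isolation (at some runtime cost).
#[inline(never)]
fn parse_temperature(temperature: &[u8]) -> i16 {
    let (mut t, mut mul) = (0i16, 1i16);
    for &d in temperature.iter().rev() {
        match d {
            b'.' => {}              // skip the decimal point
            b'-' => t = -t,         // sign is first, i.e. last in reverse
            _ => {
                t += i16::from(d - b'0') * mul;
                mul *= 10;
            }
        }
    }
    t
}
```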

Um, so if we go in here, find the main... find parse_temperature, and I want to annotate this with source. Uh, without source, maybe. Uh, with source. Um, so let's see where this time goes. So, parse_temperature:

uh, does add... so this is just the adding from indexing into the temperature array. Uh, then these, the PartialEq on the pointer... yeah, so this is the termination check for the loop: have we reached the end of temperature? Um, which I actually think we can omit entirely, because we could instead just loop over a particular set of indices directly. Um, so this we can probably eliminate entirely, because we already have an explicit break here when we hit... uh, oh, no, we wouldn't know, because there might not be a minus, in which case the way you know you're done is you hit the end (or the start) of the slice. But this feels like a thing that can be optimized. Um,

and then, yeah, this is the iteration from the back up here, right? Because we're decrementing the pointer, we're moving from the end. Um, and then here we have our jumps for matching on d. And then we have, yeah, the multiplication and the adding, which take some time, and then we have the jump back up. So, yeah, we have a decent number of these jumps, and branching instructions are unfortunate. This jump shouldn't even be needed, because we should be able to unroll this loop, at least in theory. Um, yeah, let me think about how we want to do this.

But the question still arises, right: do we want to spend our time optimizing parse_temperature first? I think we're going to do both, right? Do we want to spend our time optimizing parse_temperature, or do we want to spend it optimizing the, um, the extraction of the delimiters? Um, I think maybe let's switch gears for a second and go do the, uh, the delimiters. Yeah, let's do that.

Okay.

So, let's leave parse_temperature there. We'll come back to it later. Um, commit: temperature as int.

Um, and now let's figure out what we can do about this splitting. So, one of the things that's annoying about split is that it walks byte by byte and compares them using a closure. Um, that's fine if you don't care about performance, but if you do, this ends up being very, very slow. So what are ways in which we can do better? Well, there are a couple. One of them would be to use something called memchr. Memchr, um, is basically a super-optimized function for finding the first occurrence of a byte in some sequence of bytes. Uh, and it does this with, like, SIMD and everything, so it's super, super fast. Um, now, there are other ways to do it, such as basically implementing memchr ourselves. Um, I forget whether libc comes with memchr. Yeah, it does. Okay. So, let's do it with memchr first and we'll see where we get to. So, the way we're going to do

get to. So, the way we're going to do this is we're going to keep track of where we're at. Um, and we're going to keep track of that might actually be all

we need. Um, and then this is now going

we need. Um, and then this is now going to become a loop.

Uh, and inside of this loop, we're going to call memchr, which takes three things. It takes the slice we're trying to search. That's going to be map, but specifically map starting at at, right? Because that's where we're currently at; we don't want to search from the start of the file. Um, then, uh, the byte we want to look for; in our case that's going to be newline. Uh, and then n here is the length. Uh, so rest here is going to be map[at..]. And so here we can then say: this is going to be rest, and this is going to be rest.len(). Uh, and what it returns is a pointer to where that byte appears. So next is going to be... this is really next_newline.

Uh, and if we look at the documentation here for memchr, uh, you'll see, down here: it scans the initial n bytes of the memory area for the first instance of c; both c and the bytes of the memory area are interpreted as unsigned char. Uh, you can also search backwards from the end, which is not what we want to do here. It returns a pointer to the matching byte, or null if the byte does not occur. Uh, and so what we want to do here is look for the next newline. We want to say that the

line is, um... actually, there are a couple of ways we could do this. I think we're going to say, um, line is going to be... um, I'm trying to decide how I want to do this. Because if next_newline is null (it's a pointer, right, so we can just say is_null), um, if it's null, then there's sort of a loop inversion we're going to have to do here: the line is going to be all of rest. Uh, otherwise... but we're also going to have to remember to break after this, right? So there's slightly more to it than that. Um, the else case here is going to be... uh, actually, no, we can leave this as the break, because we know there's a newline after every line, which means we should always find a newline until we get to a line that's empty, and that makes us break. So we don't need to remember to break, since the next iteration will find an empty line.

Um, and in the else case, the line is going to be... actually, it's not even rest, right? It's rest up to... um, yeah. No, that's right. Um, and here, at is going to get line.len() + 1 added to it, skipping over the line we found and the newline. Okay. So: if we didn't find a newline character, the line is everything from where we're at to the end of the file.

Otherwise, it is from at until... um, and this is where it'll be slightly annoying, because, uh, next_newline gives us back a pointer. Uh, so in this case, actually, maybe we walk this by pointer instead of by index. Uh, I don't love that, but maybe that is what we do. But this will actually be, um... we have to compute the length using next_newline, uh, offset_from, and this needs to be as *const u8. Uh, we want the offset from, uh, rest.as_ptr(), right? So this is the distance between where memchr returned and where it started searching; that's the length of the line. Um, and then we want rest to be... uh, no, that's not right. It's the offset from... yeah, no, that's the offset we want, from here. Uh, but is this going to then include the... yeah, no. This is not going to include the newline character, because this gives us a pointer to the newline character, which we're ignoring. And that's why at is going to add both the length of the line and one. Um,

so then: why is this unsafe? Like, what about offset_from is unsafe? All right. Uh, safety: memchr always returns pointers in rest, which are valid. Um, it's unhappy here because this is supposed to be a... let's see. Um, and then: uh, safety: rest is valid for at least rest.len() bytes. And this has to be as_ptr() and as *const c_void. And really this should be c_int; we should really just import these, because we're going to use them in a bunch of places. Um, and then it's unhappy with me here

because what's returned from offset_from is an isize. Uh, and we happen to know that next_newline is always greater than the pointer we pass in, so we know this is positive, and we can just cast it as usize. And then, just as a sanity measure here, I'm going to add a check, um, because this feels very brittle, which is all the more fun. Uh, unchecked? Yeah. Yeah. Yeah. That's fine. Unsafe.
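Here's a hedged reconstruction of the loop as it stands now; the names follow the stream but details may differ, and it leans on the challenge's guarantee that the file ends with a newline:

```rust
fn for_each_line(map: &[u8], mut f: impl FnMut(&[u8])) {
    let mut at = 0usize;
    loop {
        let rest = &map[at..];
        // SAFETY: memchr only reads rest.len() bytes from the given pointer
        // and returns a pointer into that range, or null.
        let next_newline = unsafe {
            libc::memchr(
                rest.as_ptr() as *const libc::c_void,
                b'\n' as libc::c_int,
                rest.len(),
            )
        };
        let line = if next_newline.is_null() {
            rest // no newline left: the final (empty) remainder
        } else {
            // SAFETY: both pointers point into `rest`, and memchr's result
            // is never before the start, so the offset is non-negative.
            let len =
                unsafe { (next_newline as *const u8).offset_from(rest.as_ptr()) } as usize;
            &rest[..len]
        };
        if line.is_empty() {
            break; // the trailing '\n' leaves an empty final slice
        }
        f(line);
        at += line.len() + 1; // skip the line plus its '\n'
    }
}
```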

That seems pretty promising, right? I'm seeing one digit after the decimal, and I'm not seeing a newline inside the string. So, I think we got that right. Um, this only got rid of the newline scanning, right? So now we do the newline scanning with, um, memchr.

Let's see what we get out of this.

And so, in theory, this should be, what, 8% faster or something? Okay, 28. Hard to say: previously we were at 30. Um, so 10% of 30 is three, so it should be about 3 seconds faster, right? A little less than three; 8%, I think, was the search for newlines if we had eliminated it completely, which we didn't. So a saving of two seconds is actually about what we'd expect here. Let's do a second run, just to see it in action again. 28. Okay, so pretty consistently a few seconds faster. Um, so that's nice.

So, git diff. Uh, let's see. I want to get rid of... oh, this has to be that. Just wanted to see if I could get rid of some of these imports, but maybe not. Uh, commit: memchr newlines.

Okay, so that's one of the searches. The other is this look for the semicolon, right? The semicolon is also something we're trying to find very quickly. Um, now, here we could also use memchr, but we could also use SIMD, because that seems like fun. Why not? Um, so I think I am going to use SIMD here.

Um, what we're doing is: we know that a line is at most 100 bytes for the station name, plus one for the semicolon, plus at most five for the temperature (minus sign, two digits, decimal point, fractional digit). So 106 bytes, right? Uh, now, 106 bytes, if you count it in u8s, is, well, a length of 106. Now, if we go look at the stuff we have in the standard library for SIMD, you'll see

that u8x64 is the biggest we have here: a SIMD vector with 64 elements of type u8. Um, so maybe we can't use portable SIMD for a whole line. That would be pretty sad. We might then have to use the arch intrinsics, because I think there is a... oh no, I have to look at, uh, the intrinsics. Uh, I believe I want SIMD. Where's SIMD?

SIMD... that's unhelpful. What if I just look at the simd crate to see whether... yeah. Okay. Okay. Because I'm pretty sure it should be possible to do 128 u8s, right? Like, why does this only go to 64? That's pretty unfortunate.

Um, I think the instructions said, yeah, a maximum length of 100 bytes, not 100 characters. 100 bytes. Um, is 64 really the max for SIMD here? That's really sad. Um, what I was hoping to do was parse the entire line using a single SIMD vector. Yeah, max is 512 bits. Um, no, but they have... so there's u16x64. So shouldn't there then also be u8x128? And, to be clear, there are no lines that are actually that long in the generated data; it's more that the spec says they can be. And we might have to do two SIMD loads, which does make me a little bit sad,

but so be it. Okay. Um, so how does SIMD work? Well, actually, let's start with: what is SIMD? SIMD stands for single instruction, multiple data. Uh, the idea is that you run one CPU instruction and it computes over multiple elements of data at the same time. So this usually means that inside of the CPU there are multiple compute units, where you can give them, say, 64 elements, and it can compute over all 64 elements in parallel with one instruction, rather than doing it for element one, then element two, then element three, and so on. Um, the extreme version of SIMD is GPUs, which are basically lots and lots of very small cores that can all do the same operation at the same time, but on different subsets of the data.

And that's what we're looking for here, right? We really want to do an equality check against the semicolon: are any of these bytes a semicolon, and if so, tell me which one. Uh, and SIMD also has, for example, a bitwise AND between two very long lists of elements. Uh, and you have reduce operations; there are a couple of different reduce operations that take the result of one of those SIMD operations, like equality over two SIMD vectors, and give you one result back, like a u64 that could be, for example, a bit mask of which lanes were equal.

Um, so here's what we're going to do. Um, we're going to create a u8x64. So, uh, we want to call it a mask... it's not really a mask in SIMD terms, so let's just say delimiter. Um, the delimiter is a u8x64, uh, and I forget what the constructor is for this. Isn't there a... is it called spread? Scatter? There's one where you can give one value and say that should be the value for all of the lanes. Um, but I think it also just implements From... uh, maybe it shows in the description. It does not. Okay.

Uh, resize? No, that's not what I want. Where's my constructor? Splat: constructs a new SIMD vector with all elements set to the given value. That's what I'm after. Uh, so I want to splat a semicolon. So this will create a single vector of 64 u8s, all of which are set to the semicolon.

Uh, and I think what we'll do is: if the line length is greater than 64, uh, then we'll go to the slow path. That's going to be a todo. Um, but otherwise we'll take this, uh, we'll say that the delimiter is the splatted semicolon. Um, and then what we want to do is get the current line also as one of these, um, SIMD vectors. Uh, and then we want to do a concurrent, like a parallel, uh, equality between these two. So, SIMD,

uh, where's my... yeah, it's going to be unhappy about this, because the line might be shorter than 64 bytes, so we would have to pad it out. Um, but I think there is a way to make one that pads, right? Uh, load_or_default. Yeah, load_or_default is what I want. Uh, so I want: line is u8x64::load_or_default(line).

Uh, and then I want to do, uh, delimiter, and then I want to do eq... no, I want simd_eq of line. Yeah. Yeah. And I want to use, uh, #![feature(portable_simd)]. Yes. Yes. Portable SIMD, please. Thank

you. Um, so what does this give me back? Well, this gives me back a mask. And what is a mask? Well: we have 64 u8s here, and 64 u8s here, and what we get back is a mask, which stores the result of doing that SIMD operation across all of them. And in fact, if we go look at... where's my... yeah, SimdElement is the trait we're after. Um, you'll see that for u8s the mask type is an i8. So the result of doing elementwise, pairwise equality over u8s is that each comparison ends up as an i8. Um, and now what we want back from this is the offset at which the semicolon is. So how would we do that?

Well, um, what you can do is look for... uh, what is it called, zeros? Why can't I find the zeros? Uh, yeah. So what I want is the number of leading zeros. But I think maybe I want to do a reduce first. No, I don't want to reduce. I want...

Do I have AVX-512? That's a good question. Uh, it's not in cpuinfo. Um, uh, what's the test for AVX-512? There is a command, right? Yeah, this one. AVX2. Hooray.

So I should use that: um, .first_set, maybe. Yeah. Uh, so first_set finds the index of the first set lane in the mask. Um, which is what we're after, right? We did the equality, and now we want to find the first lane where the thing was set. Uh, and this is the index of the delimiter. Uh, this is an Option<usize>; it's an Option because it might never find the semicolon. Um, but that shouldn't be the case, right? Because, um, we're promised every line has a semicolon. And in fact, many of these expects could become... in fact, if we really want to go for it, right, we could do, uh, unwrap_unchecked. Is there an expect_unchecked? Because I want to write why, right? But here we can do: uh, safety: promised there is a ';' in every line.

Uh, we can always do the check later as well, to see, um, you know, if we use the safe version, how much it would cost us in terms of performance. Um, so the index here is which character is the semicolon. Um, and now what we get out of this is... in fact, let's do that here: station and temperature are going to come out of this. Um,

and so in the slow path, we're going to do this and give back station and temperature. Uh, and in the fast path, we're going to give back: um, the slice up to the index of the delimiter is the station, right? So that's going to be line up to the index of the delimiter. Uh, and the temperature is going to be from the index of the delimiter plus one to the end of the line.

Why is it mad at me?

It's mad at me because I forgot to write line here.

Uh oh.

Uh let's do this.

Is there a reason I didn't use rsplit_once? Yeah: there's no rsplit_once on slices. Or rather, there is, but it's unstable. Um, we would need to pull in the feature, uh, slice_split_once, because it hasn't been stabilized yet. Um, so I just used rsplitn instead. But I guess, in the interest of... we're using nightly anyway, um, so we might as well just enable it. Uh, slice... no, slice. It's just under split_once.

Um, and then we'll just do unwrap here. Which order does this return them in, though? Uh, it returns them in straight order. Okay, so this will be temperature and station... at which point, no: station, then temperature. So this just reduces like that. Cool.

Uh, and then, just to sanity-check ourselves, let's print, um, the station, and we'll do, uh, from_utf8. And because this is going to go away, I'll just use the safe version. Println: station and temperature. That seems quite promising.

Yeah, that looks real good. Okay, so now these print lines can go away. This branch, even though it is a branch, is one we'll never take, because we only take it for very long station names, which don't even happen in this data set. So this branch is an example of a branch that's basically free, because the branch predictor is going to be perfect for it.
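A hedged sketch of the SIMD splitter assembled here; the nightly portable-SIMD trait paths may vary, and the slow path is left as the todo discussed above:

```rust
#![feature(portable_simd)] // crate root; nightly-only portable SIMD
use std::simd::cmp::SimdPartialEq; // provides simd_eq (path may vary by nightly)
use std::simd::u8x64;

// All 64 lanes set to b';', comparable against a whole line at once.
const SEMI: u8x64 = u8x64::from_array([b';'; 64]);

fn split_semi(line: &[u8]) -> (&[u8], &[u8]) {
    if line.len() > 64 {
        // The spec allows 100-byte station names, but the generated data
        // never produces one, so this branch predicts perfectly.
        todo!("slow path for lines longer than 64 bytes");
    }
    // Missing lanes are filled with 0, which can never equal b';'.
    let v = u8x64::load_or_default(line);
    // One comparison across all 64 lanes; first_set() gives the match index.
    let idx = v.simd_eq(SEMI).first_set().expect("every line has a ';'");
    (&line[..idx], &line[idx + 1..])
}
```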

Uh, now let's see how much we ended up cutting. So let's do a guesstimate here. Uh, I believe this was about 5% of execution time. And we're at 28, so 5% is, what, 1.4. So, yeah, we're not going to save very much; a little over a second, right?

Oh. Not faster. Yeah, the print line isn't in there anymore. Try a second time. We're already building with target-cpu=native. 29. Interesting. So, this is slower. Why is it slower?

Um, all right. Let's do a perf, see what we get out of here. I also want to look at the actual assembly. Perf report. What does perf tell us now? Nope, that's not what I wanted. Uh, go into here: main.

Uh, I guess we should make this a function as well, so that we get the breakdown separately: um, uh, fn split_semi, which takes a &[u8] and gives you back a (&[u8], &[u8]). Um, split_semi of line.

Oh yeah, this splat we only need to construct once. Is this a const fn? Yeah. Okay, great. Uh, so let's go ahead and do: const SEMI: u8x64 = this, and then this should now be SEMI. I mean, the splat should be hoisted out by the compiler; I agree this feels like a red herring, but we can pull it out anyway.

27. All right. Perf record. Actually, 27 is about right. Maybe it didn't hoist it. Um, so we were at 28.5-ish, and we were expecting to save between 1 and 1.4 seconds. So that's about right.

Nope. I need to kill this, right? I'm not waiting for it to finish. Perf report. Let's see what we get out of here. The split_semi here is a very small amount of time, less than what we used to have. Oh, I guess the newline split is also not pulled out into its own function yet. Um, so let's split that out too, perhaps.

So we'll do: let line = next_line(...), which takes a mutable reference to our position and gives back a &[u8]. And here we do actually need lifetime annotations, because we don't want the returned slice to be tied to the mutable reference. perf record.

Let's stop and see what we get from perf report -g. Okay, what do we have down here? next_line and split_semi are now a combined 5%, and they used to be a combined 10%. So not a huge saving here. I'm guessing this is all memchr, and it's unclear whether we can do better than memchr.

We could try to SIMD this as well, right? Because the newline should also be within the same chunk. So maybe we do that; let's save it for a little later. To-do: SIMD. I'm pretty tempted to SIMD this one as well. There's also a fun thing where we might be able to SIMD the search for both delimiters at the same time.

So the trick here, I think, would be... oh, that's pretty tempting. Here's what I'm thinking: you take the next 64 bytes and subtract the value of newline from every lane. Now anything that's zero is your newline, and anything that's equal to the difference between newline and the delimiter, which we can compute as a constant, is the location of the semicolon. I wonder whether that's faster than doing each of these separately; maybe it's the same. It would mean you only need to load the SIMD vector once.
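A hedged sketch of the one-load idea (nightly `portable_simd`; this is the straightforward two-compare form rather than the subtract trick described above, but it shares the single load):

```rust
#![feature(portable_simd)]
use std::simd::{cmp::SimdPartialEq, u8x64};

// One 64-byte load, two lane masks. Each mask packs into a u64 bitmask, so
// the first hit of either delimiter is just a trailing_zeros() away.
fn delimiter_masks(chunk: [u8; 64]) -> (u64, u64) {
    let v = u8x64::from_array(chunk);
    let semis = v.simd_eq(u8x64::splat(b';')).to_bitmask();
    let newlines = v.simd_eq(u8x64::splat(b'\n')).to_bitmask();
    (semis, newlines)
}
```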

All right, I think we'll leave it like this for now and come back later.

Okay. So when we look at the perf report, there's sort of an obvious elephant in the room, right? The obvious elephant in the room is of course max and min. No, I'm joking; it's of course the HashMap. The HashMap here is taking almost all of the time. And even though there are relatively few entries (at most 10,000 different stations), this feels ridiculous. Surely we can do better. Especially because most of this time is going into hashing, and the hashing uses the default hasher from Rust, which tries to be collision resistant and everything, so it injects random data into the keying process as well. This is stuff that matters for real Rust programs, but it's not a cost we want to pay here. So there's a really interesting question of what to do. We could switch to a BTreeMap to avoid the hash, but then we're still doing equality comparisons. We're not doing Unicode comparisons, though, so it might actually be a decent amount faster, and B-trees allocate, I think, several elements per node, so you get some amortization there too. But let's first start with the obvious improvement, which is with_capacity(10_000).

So remember, what we had previously was 27. I don't think this will make a huge difference; I'd be very surprised, because we're only growing the hashmap early on. I guess initially we grow it quite a lot, but amortized over a billion rows I don't think this matters. I could be wrong, though. 25. Okay, so pre-allocation helps. "Pre-allocate map." It's not quite pre-allocation, but you know what I mean.

But I'm curious what we get if we instead make this a BTreeMap. I just want to see. That also means we still need this, even though it's kind of stupid, because you should be able to compare these very quickly now that the keys are byte slices; you don't need to do expensive UTF-8 character comparison. So we're competing against 25. Oh, UTF-8 comparison is also just a memcmp. Okay, so then this should probably still be slower.

Okay, that's a lot slower. Cool. All right, back to the HashMap. Okay, let's also make that change we were talking about a little while ago. No, we decided not to do that because we might keep the pages around. Yeah, I want to optimize the mapping here before we start playing around with raw pointers.

So the obvious next thing is the hashing algorithm. Currently we're using the SipHash hasher that comes with the standard library, which has randomization in it and so on. We want to avoid that. There are a couple of ways we could do this. We could bring in a crate like fxhash or ahash, but no external dependencies is the rule. We could write our own; there are things like Mersenne twisters, or rather, not Mersenne twisters, but we could just implement a hasher ourselves. There's also the stupid hasher, which just adds all the bytes together. Let's just see. Okay, I want a hasher.

DumbHasher. Sorry, I mean FastHasher. And this is going to be a FastHasher in there. It's being sad now. Why is it being sad? It's being sad because this needs to be that. Great.

Okay. So FastHasher needs a BuildHasher that creates the hasher, a FastHasherBuilder that returns a FastHasher. Fine. A u64 of state, and a FastHasherBuilder, because every time you need to hash something you need to make one of these; you need a builder that you can call to get a hasher. And the state starts at zero, because we don't care about randomness here; we don't care about defending against malicious hash collisions. So we're going to implement Hasher for FastHasher.

Yes. Yes. for b in bytes: self.0 += b, widened to an integer. And we actually want this to be a wrapping_add. Yeah. All right, self.0 is this.

So this is a very stupid hasher that just adds up the bytes of all of its inputs, and that's the hash. We could even do this with a SIMD add, but I think we might not need to.
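The plumbing being typed out here, as a compilable sketch (this is the deliberately stupid additive version; the mixing gets improved below):

```rust
use std::hash::{BuildHasher, Hasher};

// A hasher that just sums the input bytes; no randomness, no DoS resistance.
struct FastHasher(u64);

impl Hasher for FastHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.wrapping_add(b as u64);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// HashMap needs a BuildHasher to mint a fresh hasher for every key.
struct FastHasherBuilder;

impl BuildHasher for FastHasherBuilder {
    type Hasher = FastHasher;
    fn build_hasher(&self) -> FastHasher {
        FastHasher(0) // fixed seed: we explicitly don't care about randomness
    }
}
```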

Actually, I wonder... right, because the input here is the station names, which are just u8 bytes. We do know, however, that they are at most 100 characters. So we could SIMD here: to-do, SIMD if less than or equal to 64 bytes. Yeah, we can do a reduce_sum there. But let's just see what happens. No, the SIMD here would be to do the addition: if the hasher is just the addition of the bytes, you could do that addition with SIMD.

Now, this is obviously stupid, right? This is not usually how you would mix the hash, because you're much more likely to get collisions, and in fact you can see we do worse: we were at 25 and now we're at 35. I think it's because we end up with way more collisions. Remember, you take this number and then the hashmap implementation takes it modulo the number of buckets, so really we want to generate numbers that spread out a lot. But what we're actually doing is just adding numbers where many of the inputs are roughly the same length and many are ASCII, so you don't spread the value very well. This is why you often want to do what someone proposed in chat: a u64-by-u64 multiply, and then XOR the two 64-bit halves together. Okay, let's try that. Why not? So a u128, and we'll do self.0 dot... is there a split?

I guess it's just this as u64 times this as u64. And then we do chunks of size eight. Is there a... yeah, that's what I want: chunks of size eight. And because 8 bytes is 64 bits, we can actually just keep a u64; we don't need to keep a u128 in the state. So this becomes self.0. And in here we do... no, we do while let Some(chunk) = chunks.next(), because we need the remainder at the end. And then we do self.0 as u128 multiplied by the chunk, which is u64::from_ne_bytes(chunk). Mix the two halves together, and self.0 is equal to that.

And then for the remainder, I guess we'll just do the same, but those are individual u8s, so we can't easily turn them into a u64. Well, I guess we could, but... oh right, and the initial state can't be zero. There's some optimal choice here: the state should start with a prime or something, right? It should be co-prime with most of the numbers you might multiply by. Let's just do seven.

Why is it upset about this? Oh, right, .next(). And then for byte in the chunks' remainder. I suppose we could just wrapping_add these. "Just pick a randomly selected 64-bit string." Yeah, seven isn't exactly random. The FNV offset basis? Perfect, thank you, why not. I thought about 1337, but hey, let's use a value someone chose with a fair random roll. Yes, exactly.

And then why is it unhappy about this? Oh, why doesn't chunks_exact... yeah, as_chunks. That's what I want: chunks and remainder. Now we're talking. And then I want to do for chunk in chunks, because now each chunk is an array of fixed length. So now we can actually do this. Yeah, that's fine. Okay.

What did I do? Why is it upset with me? Ah, this needs to be this. And then for b in remainder. What do I want to do with the remainder? I guess I can just add the bytes in; that seems fine. We should probably do better, but... oh, it needs to be a const. Then we can just add them in. We could mix in the same way, I suppose, but that feels excessive for the remainder.
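Where the write path ends up, as a sketch (`as_chunks` was unstable behind `slice_as_chunks` at the time; the seed is the FNV 64-bit offset basis picked from chat):

```rust
const SEED: u64 = 0xcbf29ce484222325; // FNV offset basis, a "random enough" start

fn mix_in(state: &mut u64, bytes: &[u8]) {
    // Exact 8-byte chunks get the multiply-and-fold treatment...
    let (chunks, remainder) = bytes.as_chunks::<8>();
    for chunk in chunks {
        let m = (*state as u128).wrapping_mul(u64::from_ne_bytes(*chunk) as u128);
        *state = (m as u64) ^ ((m >> 64) as u64); // XOR the two 64-bit halves
    }
    // ...and the leftover bytes are just added in, as discussed.
    for &b in remainder {
        *state = state.wrapping_add(b as u64);
    }
}
```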

Let's see. Remember, 25 is what we're trying to beat; 25 is the default Rust hasher. There's an interesting question here of whether this should even be a hashmap, right? We know there are at most 10,000 stations, so we could just store it as a sorted vector, although then you'd do log(n) comparisons. That's not ideal. So, this is slower. Interesting.

Yeah, I think I want to SIMD this. Although that gets harder, because if we're trying to do it this way then that's more painful. I'm curious where that time is going. The default hasher almost certainly has a bunch of optimizations for vectorization and such, right? That's what we're competing with.

competing with. Uh why why is it being sad about this now? That

it's confusing me. Um,

what?

Why is it lost?

Why? It lost one of the stack frames in between. So, it's like splitting these,

between. So, it's like splitting these, which is sad. Equivalent key. It could

also be that we're producing a bunch more collisions. Like it's not

more collisions. Like it's not necessarily that it's the hashing.

If I annotate this... oh, it's been inlined; I can't. How about we say #[inline(never)] here? There we go. Now the stack trace is fine again. Why? Very strange. Yeah, so 80% is going there, which is roughly what it was before.

Less time is going to the hasher, which makes me think this really is more collisions. So what is happening to our hash here? write_length_prefix. Oh, right, it also hashes the length of the slice. That's interesting. I wonder if we could get away with emitting zero as the hash and just using the length.

But let's see what happens if we annotate this. So where is this time going? Okay, some prologue, and it loads the bytes. Why? Oh, this is loop unrolling. Interesting. I'm trying to figure it out. Yeah, it's vectorizing here. Interesting. Let's remove this and go here for a second.

I'm curious. Actually, there's one thing I want to test; I think someone suggested it too. What happens if we add ahash here, for example, and try to use AHasher? What's their builder? RandomState. I don't want random state... well, I don't actually care that it's random, and I guess it's fine because the randomness is created when you create the BuildHasher. I just want to see whether we're doing something stupid here. 16 seconds. Okay, so we are definitely doing something stupid.
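The comparison point, sketched (this assumes the external ahash crate, which the no-dependencies rule otherwise forbids; it's purely a measurement, and the value type is left generic):

```rust
use std::collections::HashMap;

// Swap the std SipHash for ahash's RandomState, keeping the pre-allocation.
fn ahash_map<V>() -> HashMap<Vec<u8>, V, ahash::RandomState> {
    HashMap::with_capacity_and_hasher(10_000, ahash::RandomState::new())
}
```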

But is it collisions or is it the hashing speed? We should be able to test this pretty easily, right? If we get rid of the remainder handling, how does that go? Just trying to narrow down where the time is going. So we were at 25 with the default and 16 with ahash. And with this implementation, what do we get? This is definitely more than 16 seconds. It's not going the way it's supposed to, given the name. 52.

Okay, what if I just do bytes[0]? Just hear me out here. It has to be collisions then, right? Oh, I've got to remove my #[inline(never)]. Lots of arrays of length less than eight? That could certainly be the case. Looking at the stations list, there are a bunch of names shorter than eight bytes, and those would end up only ever taking the remainder path.

And I guess we could do the remainder with SIMD too, right? We could just add those bytes with SIMD. But my worry is that if we just add them, we increase the number of collisions again. If collisions are indeed the problem, I guess we'll see what comes out here. If this is still slow, then clearly the problem is collisions, because we're basically computing nothing in the actual hasher now.

And like we could do a lot more than that. It sort of I guess we're making

that. It sort of I guess we're making the trade-off here, right? Between we

could use like a U8x64.

Arguably, we could use more than that, but we're expecting these to be short strings anyway. So U8x64 is going to

strings anyway. So U8x64 is going to capture every station name and we can just add them. The problem is then we're not getting the collision resistance because we're just adding them and that's how you end up with the

collisions in the first place.

112. Okay, yeah. So collisions here are very expensive. It sounds like doing more work to avoid collisions is probably worth it.

So if we do something like a [u8; 8], this will also be an eight-byte chunk. Then we'll do last.copy_from_slice of the remainder, and then chain the chunks: chunks.chain(once(&last)), and this becomes the remainder, but that's fine. Right. once of last.

Shouldn't we use a Vec instead of a hashmap? I don't think so, because if you use a Vec instead you're going to have to do log(n) string comparisons, so somewhere around six string comparisons per lookup. Although I guess it's at most 10,000 stations, right? It could be way fewer. In fact, I think in the default data set there are way fewer, and then a vector would actually probably be pretty fast.

Yeah, I think this will still be faster. Source, right? I can't... it won't let me do that. Isn't there a copy? I think I can do &mut last, make it a slice. What? There's definitely a way to do this. I'm so confused. "If you copy_from_slice into a slice, the length of the source must be the same as the destination." I mean, okay, we can slice it to the remainder's length; it just makes me sad, that feels unnecessary. I'm pretty sure there's a "copy as many as you can". Alternatively, is there a remainder-to-array? All right, fine, we'll keep it like that. And this is like: if remainder is empty...

No, you're allowed to slice to zero, I think. If the input is zero then the hash becomes zero. Uh oh. Yeah, that's true. That's a very good observation. But we could just do this, right? As long as it's not zero, it's fine. Unless, you could imagine, it accidentally becomes zero with the XORs. 21. Okay, that's better than 25, worse than 16.

All right. ptr::copy from remainder.as_ptr() to last.as_mut_ptr(), for remainder.len() bytes. But that should be what copy_from_slice compiles into, right?

"Make it branchless." Yeah. Where's the branch? The branch, I guess, is only really the termination case for the loop, which I think you need anyway because the byte input here is a slice of unknown length. Yeah, so this is still 21. That didn't really save us anything, and I'd be surprised if it did. Okay, well, 21 isn't bad. It could be that the mixing here is suboptimal; maybe rather than multiplying these together we should add this multiplied by some constant.

There's another thing we can do here, which is just to throw more memory at the problem. If we have twice the number of buckets, we should have fewer collisions. 21. Same as we already had; I mean, it was slightly better.

I still wonder if we could just SIMD this. Let me... okay, let's try. I just want to see. Oh, but then we get back to the collisions, right? But I kind of just want to see. So we go back up here and say the hasher is going to be: if bytes.len() is greater than 64, do the current stuff; otherwise... this ends up being pretty weird, but bear with me. What if this is a u8x64? I know, I know, but hear me out.

And this is zero. I know, I know, hear me out. This is a reduce... no, because that would make it a u8, which it doesn't need to be. reduce_product? What? No. A sum, probably? Okay, great. So that returns a u8, which is not what we want. We're going to have to do something here, but bear with me.

And then in this case we do self.0, a SIMD add... oh yeah, true, we could just cast to u64x8, which is the same size, and then reduce_product or reduce_sum. reduce_product, sure, why not? Why doesn't it let me do that? Yeah, it's not a primitive type; it has to be cast. u64x8 is not a SIMD type? Why? Lies, lies, lies. Oh, it only wants this. Okay, cool.

But here we do a sum across the loaded bytes. We don't need to do this; we just say self.0 is this. Is it not a SIMD sum? No, we're optimizing for less than 64 bytes, right? So u8x64. I'm trying to add all of them together concurrently. You see what I'm going for, right? Can I maybe just add them? Cool. This is just a to-do for now.

Now, this is going to be worse mixing. This will be a splat, and in fact, sure, this will be a const u8x64: u8x64::splat(42). Why not? Doesn't really matter. And this += this.

What I want to understand here: this will obviously be faster to compute. It does take all of the bytes into account, but it will have more collisions because we're not mixing as thoroughly. You think reduce_product is worse here? I guess we'll find out. This isn't 64 bits, it's 64 bytes; 64 bits would only be eight characters. Yeah, this is definitely worse. This is certainly more than 20 seconds.

Oh, you're totally right: it can't be reduce_product, because if any lane is zero, the result is zero. Now, to be clear, I don't think our goal should be to build the fastest hash function ever, right? Our goal should just be to get close enough to what we know is achievable with ahash. And we can try this variant as well. 40 seconds. So it really is the collisions, because this compute should be really fast.

So I genuinely think this old version we had... oops. This version does decent mixing, because we got to 21. What if we just... I mean, I know, birthday paradox and everything, right? Good. Yeah, in that SIMD loop we could also have seeded each lane differently so that you take the position of the character into account. All right.

A four-byte chunk case instead of eight, like this, and then this is a u32. I'm tempted... leave this one for now, because I think we can drive ourselves crazy trying to come up with a better hashing algorithm. Although, I suppose, a lot of our strings are around six characters, so they'd end up in the remainder; but that would just be a copy and then it would be a chunk. Yeah. Okay.

Then this goes away, and we'll say this is good enough for now. It's so tempting to do this with SIMD. All right, I'm going to leave the hashing for now; otherwise we'll spend literally all of our time on it. Where are we at here? So, the hasher.

Okay, someone said, "Please, please, please do the vector instead of the hashmap." So let's try that. We're going to make this a vector of these guys. with_capacity now goes back to 10,000. We're going to do a binary_search_by_key, and we want to search for the station, where the key extraction is this. If we get back Ok(i), then this is stats[i]. Otherwise we do stats.insert(i, ...). Maybe this becomes a VecDeque, but let's leave it as a Vec for now. So: insert at i, station.to_vec(), and then stats is &mut stats[i].

In fact, we could do get_unchecked_mut(i). It's unhappy about the station. Oh, this has to be a Vec<u8>. Yes. Oh, and this needs to take a reference to the key type or something. Okay, cool. This does nothing, so this is if let Err(i) = ..., to simplify it slightly. Oh no, we do need the i out in both cases: let i be the Ok(i), or the Err(i) insertion path. And then the stats are the pair's second field.

Let's see. So now we're trying to skip the hashing altogether. We're using a vector instead, a sorted vector of keys, and we do a binary search over that vector for the stations. My guess is that this will work pretty well if there are few stations, but with many stations it will be significantly slower.
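The sorted-Vec variant being assembled, as a sketch (Stats is a placeholder for the per-station aggregate used elsewhere):

```rust
#[derive(Default)]
struct Stats; // placeholder for min/max/sum/count

// Keep (name, stats) pairs sorted by name; binary-search every lookup and
// insert at the returned position to keep the vector sorted.
fn stats_for<'v>(stats: &'v mut Vec<(Vec<u8>, Stats)>, station: &[u8]) -> &'v mut Stats {
    match stats.binary_search_by(|(name, _)| name.as_slice().cmp(station)) {
        Ok(i) => &mut stats[i].1,
        Err(i) => {
            stats.insert(i, (station.to_vec(), Stats::default()));
            &mut stats[i].1
        }
    }
}
```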

Yeah, this is definitely slower. Remember, we're trying to beat 21. This is way slower. Rough. All right, back to the HashMap. But hey, we tried.

Okay. To-do: "this makes too many collisions." Yeah. Okay, what do we have next? Let's do a perf record, and then a perf report to see what we get.

Why? Okay, fine. Why is all the time going in here? Did I leave an #[inline(never)] in here somewhere? No. Did I not let it run long enough? Yeah, I think I didn't. Let's see what we get. It's all in the get_mut of the hashmap; the other things aren't even registering. Huh.

Okay, let's think. What can we do that's smarter here? We could use a trie, but a trie feels wrong here. Plus you'd need to traverse a tree whose depth is the length of the strings.

I know parallelism is the path to go down eventually, but I'm just trying to think, because remember, the ultimate thing we're comparing against is this, which... oh, why is that so slow now? Okay, yeah, which is 800 milliseconds, right? That's what we're comparing against. So why is this one so slow? Well, let's see.

If you go to src/test... no, src/main/java/dev... Oh boy. Where is it? Just tell me where it is. Okay, it's in src. src/main/java/dev/morling, CalculateAverage. Okay, this was basically the winning entry; let's see what they do.

Okay, "magic multiplier". Love that. The Unsafe. Okay, so obviously they run in parallel; let's ignore the parallelism here. They also do a memory map. That makes sense. This is all for splitting the map for parallelism; we'll get to that later.

A text builder, that's for printing the aggregates. So they allocate memory. Okay. I think they have a... find... what's this? Okay, they have two longs. Oh, I wonder if they do... All right, let me find their... that's for merging. That's the builder. No, I think this is just where they keep the stats, because they also use a TreeMap from string to aggregate. Yeah, look, they pack the aggregates. That's fine; that's just how they combine them.

What I want is their lookup aggregator. Okay, run. Here we go. So they get a segment; this is within one thread. They do a chunk. Okay, they split the chunk that each thread gets into three. Why? And as long as that's the case... why are they doing these?

Interesting. So this is probably about keeping the CPU busy: I think what they're doing here is relying on the fact that one core can have multiple outstanding memory loads at the same time. So you want to make the CPU start all three loads, and then do the compute for all three. It's not necessarily three specifically, but you want to start multiple memory loads and then do the compute for all of them; otherwise the CPU does one memory load, then the compute, then the next load, then the compute, and so on. I think that's why they're doing this. Why? What's this "word" thing?

"word", that's relatively unhelpful as a name. It takes an address and gets the long at that address. Okay, so they read a u64 for the first eight bytes of the name, then a u64 for the next eight bytes. So they read the first 16 characters of every station name.

What does separator() do? So, the XOR with the "comma pattern"... I think the pattern here is just 0x3B repeated, which is the semicolon, so they XOR the whole thing with semicolon bytes. This lets them do it in a single register rather than with SIMD, so you're scanning eight characters at a time for the separator. Then you subtract one from each byte and AND with 0x80s; this has to be related to the binary value of each character. Oh yeah, there. Yeah, this is SIMD within a register. Whether they did that on purpose or because they didn't have SIMD available is an interesting question. Okay, but what does this give you back? It gives you the mask, I guess. Yeah, and then they look for the number of trailing zeros. Okay, so we'll go back to here.
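The trick they're reading, transcribed to Rust as a sketch (0x3B is the semicolon; the classic zero-byte test reliably marks the first match, which is all trailing_zeros needs, though higher bytes can pick up borrow noise):

```rust
const SEMI_PATTERN: u64 = 0x3B3B3B3B3B3B3B3B; // b';' repeated across the word

// SIMD-within-a-register: the XOR turns a matching byte into 0x00, then the
// (x - 0x01..) & !x & 0x80.. trick sets the 0x80 bit at the first zero byte.
fn separator_mask(word: u64) -> u64 {
    let x = word ^ SEMI_PATTERN;
    x.wrapping_sub(0x0101010101010101) & !x & 0x8080808080808080
}

// If the mask is nonzero, the first ';' sits at byte index
// separator_mask(w).trailing_zeros() / 8.
```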

So this gives them the mask. And remember, word one and word four are the first eight bytes and the second eight bytes of the first chunk this thread is looking at; word two and word five are the same for chunk two; and word three and word six are the same for chunk three. Okay. And then they run find with the aggregates: chunk one, word one, word four, separator one, separator four.

So, okay, interesting. This is also something we could do, right? We could totally do this kind of unrolling, keeping the CPU's memory loads more active by doing all of the loads and then all of the computes. That's a really interesting insight. Okay, now let's look at find. What is their implementation of find? This feels like the main magic, right?

"small"... this is their fast-path check. What they're checking here is whether the separator is in the first 16 bytes, because separator one is a bitmask of... oh, that's totally what they're doing with the masking. They do an XOR, and they want to set everything to zero except the top bit of the matching byte. So what you end up with is a separator word with zeros all the way up to where the separator is. If both of these are zero, it means the delimiter isn't in the first 16 bytes.

Um, yeah. And so if it's not small, then they do then they skip 16 bytes and then they do something here. Let's

look at the the fast path.

Length here counts.

Length here counts the number of leading zeros. Pretty sure we saw that earlier.

zeros. Pretty sure we saw that earlier.

And then they mask, right? They need to mask with the separators so that you only keep the parts of the word that aren't after the separator. Interesting. What's wordMask here? Yeah, so depending on whether the separator is in the first word, they decide whether to keep anything from word two or not.

That's why, if you look at the wordMask table... let's see: 0 1 2 3 4 5 6 7. These are eight entries. If the separator is in word one, then you're using one of the indices where the mask is zero; otherwise your mask is this. And if your mask is zero, you're turning all the characters in the second word, the second set of eight bytes, into null bytes, so you're ignoring them. Otherwise the mask is minus one, which is all ones, so you AND with all ones and end up with the same bit pattern you had. That's clever.
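A simplified, hedged sketch of that mask-table idea (their real table is indexed by the separator's byte position; this two-entry version only captures the keep-or-zero decision):

```rust
const WORD_MASK: [u64; 2] = [0, u64::MAX];

// If the separator landed in word one, word two must contribute nothing; a
// table lookup turns that decision into an AND instead of a branch.
fn masked_word2(word2: u64, sep_in_word1: bool) -> u64 {
    word2 & WORD_MASK[usize::from(!sep_in_word1)]
}
```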

And then their hash is just XORing the first eight bytes with the second eight bytes. I'm curious what their mixing function is. That's funny. Yep, it's a manual hashmap. So they XOR the first eight bytes with the second eight bytes, multiply by a relatively random number, then XOR it with itself shifted right by 35, and return that value. That's great.
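Their mixer, transcribed as a sketch (the multiplier below is a placeholder odd constant, not the one in their source):

```rust
fn java_style_mix(word1: u64, word2: u64) -> u64 {
    const MAGIC: u64 = 0x9E3779B97F4A7C15; // placeholder; the entry uses its own
    let h = (word1 ^ word2).wrapping_mul(MAGIC);
    h ^ (h >> 35) // fold the high bits back down
}
```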

Yeah, this does smell a lot like they're trying really hard to avoid branches, right? They don't want to branch on things like: is it less than eight characters, is it less than 16 characters, and so on. Interesting. And 35 as a bit shift feels hyper-optimized, right?

And look at the way they parse the value as well; they get the digits out directly. I'll go back here.

Actually, we also need to look at the aggregates' find. find is just... what? I think they just assume that... okay, so every bucket includes the first 16 bytes of the name, and they conclude that the find is successful if the first 16 bytes match. So they're assuming the first 16 bytes are always unique. That feels sneaky. I don't know that I like that.

Oh wait, no. If the pointer is not equal to zero then return the pointer, otherwise... no, they don't handle collisions. They just straight up don't handle collisions: find takes the pointer to the buckets, adds an offset derived from the hash, and does check that the words are the same, but if they're not, it returns zero. And that can happen for one of two reasons: either the name isn't in the map yet, or you had a hash collision. They don't distinguish between the two.

Yeah. And the reason they kind of get away with this is because... oh, what's it called? What's the name of the generator? create_measurements, in Java. Yep. There's a fixed set of cities, so all you need is a hash function that guarantees these all get distinct hashes; you do perfect hashing. That feels like cheating. That I'm not willing to do.

I don't think it's against the rules, because there was nothing like... okay, so there are two parts to this, right? If you're merely slower when the hashes aren't perfect, that I feel would be okay. But this is just incorrect if the hashes aren't perfect. Yeah, I don't like that. I don't like that at all.

Okay, so that answers our hashing question. We could actually use the same hash function. It's funny, because for us it's not cheating: if there are conflicts, we still have conflict handling. So, all right. Mix. Where's our mix?

Nope. So interesting. I wonder... we still have this write_length_prefix that I kind of want to... why does that get called? I think I just want to make it a no-op, and then here.

So how do they avoid this being a branch? They avoid it because they just do reads into uninitialized memory, right? That must be the case, because imagine you're at the last row and you read 16 bytes. If the last row has a station with only, say, four letters in the name, the worst case is a four-letter name plus a semicolon plus, in theory, just three digits like 0.0, plus a newline; that's way less than 16. If you kept reading, you would read past the end of the memory map into unallocated memory, which is not okay.

So how do they avoid this problem?

Does Java just allow out of bounds reads?

Well, okay, there's a variant of this, right? Remember, we found that the cost of the write isn't really the problem. So how about we do this? We don't even need to check; we can just say...

What do they do? They mix... they XOR the first eight bytes. So how do I want to do this? I'm just trying to figure out if we can also make it branchless.

"One of the rules is not to rely on specifics of the input, so I don't think this is okay." That's what this does; I don't know what to tell you. Maybe they didn't read the code, right? But it certainly invites skepticism. The problem here is how to avoid branching when bytes could be less than 16.

I can't just map more memory, because this is a memory-mapped file: I don't control what comes at the end of the map, and I don't control the input file. And this is called often enough that a branch misprediction here would be sad. But let's maybe just do 16 and say, all right, maybe this is how we branch.

bytes.len Len is longer than 16 then do this.

Otherwise do last bytes.len

bytes.len copy from slice bytes with zeros and then let

left is U64 from any bytes and then Uh, I guess I can do I have to do it

this way? That feels wrong, right?

this way? That feels wrong, right?

Surely there's a and so on. Right. And right would be the same but

starting at uh eight 9 10

11 and then let h is left exor right multiply by this but they're long so they're i64s.

Yeah, that's fine. It's fine. Ah, I messed up my vim-fu. This should say three, and this should say three. Then I can go here, do this, and then do four, Ctrl-A. What? Why doesn't that increment all of them? Fine, I'll do it the manual way. Oh boy. Okay, fine: 8, 9, 10, 11, 12, 13, 14, and 15. self.0 = h as u64. Okay.

Yeah, and this is not doing anything. Oh, I'm not allowed to do that. Fine. "You can use try_into." Yeah, but I shouldn't need try_into here, because I know the length. Let's see.
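The variant being typed out here, as a compilable sketch (placeholder multiplier again; the copy into a zeroed buffer is exactly what keeps us from reading past the end of the mmap):

```rust
fn prefix_hash(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 16];
    let n = bytes.len().min(16);
    buf[..n].copy_from_slice(&bytes[..n]);
    // Two words covering the (zero-padded) first 16 bytes of the name.
    let left = u64::from_ne_bytes(buf[0..8].try_into().unwrap());
    let right = u64::from_ne_bytes(buf[8..16].try_into().unwrap());
    let h = (left ^ right).wrapping_mul(0x9E3779B97F4A7C15); // placeholder constant
    h ^ (h >> 35)
}
```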

"At this point, why even do the hashing? Just copy the city array." No, there's a distinction, right? The goal here is that this still works even if there are collisions. The difference is that this just gives us a hashing function that happens to be good; unlike the original, if there are collisions we will handle them correctly. Although, funnily enough, this is slower than what we had.

Oh, I bet this is slower because of the copying we're doing. But this also suggests that part of where their win comes from is not just the perfect hashing, but that their implementation can be a lot simpler because it can assume no collisions, or rather because it ignores them.

So I actually think I want to go back to just this, because it's correct and somewhat reasonable. Now, we had a conversation about whether this should be... let's, just because it's fun, do this times the constant they gave us, you know. Do a plus here. Let's see.

"There's no comment about collisions in this file." Lies. It probably depends on which implementation you look at. If this is also slower, that's funny, because it means we actually arrived at a pretty good hashing function; we're not doing too badly on collisions. Someone actually benchmarked the number of collisions, and ours is about the same. So there might be some other profiling to do. Yeah. Okay. 27.

So this is slower. I'm just going to keep what we had. And some of the performance difference definitely comes down to the map, right? They have a much simpler map than us, they're still a lot faster, and we're spending all our time in the map.

Ah, it's frustrating, because it means there's no trivial way for us to improve this. We could implement our own hashmap from scratch that deals with collisions, but we're unlikely to do better than hashbrown. Either you implement a map that just doesn't deal with collisions, or... for example, we're storing full Vec keys and doing indirect key comparisons. Ah, okay. Here's this. Okay, that's true. That's true. Okay.

Let me... sorry, okay, I'm going to remove that comment. This is a very good observation. What if... down here, we're going to have a struct which we'll call ArrayVec, because that's basically what it is. Actually, it's even better than that: this is a StrVec, and what it holds is a pointer (bear with me here) and a length, a usize. And then I want the key in the map to be a StrVec.

And the trick with a StrVec is that we're going to be really, really sneaky. pub fn new takes a slice of u8, the characters, and gives you back Self. What we want is to inline shorter strings, because that way we avoid the pointer chasing to the heap all the time. Or, it's not really the heap... no, it would be the heap, because we're keeping Vecs. So if we store these inline, we can pack them into the pointer and the length. Here's what I'm imagining: we pack these such that if the high bit of the length is set (oh, even better, we could maybe use the alignment of the pointer), then the string is inlined into the pointer and length fields; otherwise the pointer is a pointer to a Vec<u8>. I think maybe that way. So...

If s.len() is less than... how much can we fit in this? Well, actually, let's make this not a usize; this is a u64. A pointer is... actually, they're both pointer-sized, so let's assume we're on a 64-bit platform. Then we can keep eight bytes in the pointer and seven bytes in the length.

We can't keep eight bytes in the length, because we need somewhere to store whether the string is inlined or not. The reason I wanted this to be StrVec instead of ArrayVec is that we know there can't be null bytes in the string; that's also something they specified, the names are non-null Unicode bytes. As a result we can do something kind of cool. If the length is less than 16, then...

Branching here is a little bit sad, but I think it's going to be worth it; most of our strings are going to be shorter than 16. So that part is easy. Then we want a Self where the pointer... this is where we need to be a little bit careful. I think we can do this branchless by saying the pointer field starts as [0u8; 8], and then copy_from_slice from s. The length field is similar, eight bytes, but the first byte we need to use for something special. In fact we can pack an additional byte here, so only the first byte is needed; the copy length is s.len() minus 8, so min of that and eight.

We might need to branch here based on whether we need to use the length bytes or not. What I'm after, just to demonstrate without it being correct yet: length byte zero is going to be 0xFF, and then the pointer field is going to be u64::from_ne_bytes(...) as usize as a const pointer, and len the same. I'm just trying to show what I'm going for here: we're packing the characters from s directly into the pointer and length.

Otherwise, the value is going to be: v is s.to_vec().into_boxed_slice(), and then pointer is v.as_ptr() and length is v.len(). As long as this length is less than a u64 with the high bit set, which it will be, this representation will be distinct from the inline one. as_mut_ptr, sure, why not; this can also be mut, that's fine, and this can just stay a usize.

And that's fine. This can be this... this won't actually work quite as written, but there's one other thing we're going to have to do: if we're going to use this as the key, we'll run into trouble with get_mut, because get_mut expects that we implement Eq and Hash, but also that the type in the map can be borrowed as the key type we pass in. Luckily, that's something we can do.

We can impl... which is just: if self.length... and what are we after here? Eight bytes, right? No, it's a u64. Yeah. And really this shouldn't be that either; it should be... why am I blanking on what the high bit is? Eight.

Oh, I did forget to leak the box; you're right. No, we don't have to leak the box; actually, we can implement Drop here, which will have similar logic for restoring the box. Oh, you're right, but that does mean we have to do... it's not leak, it's Box::into_raw of this. What?

No, I want as_mut_ptr. Why is this unstable? Fine. Why as_mut_ptr? It's unhappy about this. Okay: if the high bit is set on the length, then we know it's one of these. Uh oh, we can't AsRef to a &[u8], can we? Because the two fields aren't contiguous. So we actually need to pack even more; we need this to be a length-and-pointer union.

All right, now it's going to be a union of length-and-pointer: either inlined, a [u8; 16], or heap, which is a this-and-this pair. Wait, am I forgetting the syntax for union? Oh, inlined is this, heap is... fine. Heap is this.

Actually, this kind of makes things easier, because it means combined is a single [u8; 16]: copy from s starting at index one, and combined[0] is going to be all Fs. That's fine. And then we do StrVec::new... no, Self with inlined.

How do you construct the union type again? Oh, right, you construct it with struct syntax: inlined is going to be combined. Otherwise, it's Self with heap as the (pointer, length) pair. Great.

So now we can do: if self.inlined[0]... in fact, we can just compare against 0xFF; we set it to 0xFF anyway, right? So we can just check that. And it's fine for this to be unsafe, thank you very much, because that byte is always initialized.

If that's the case, then we know it's inlined, so it's self.inlined, and this again is unsafe. It's not quite all of the bytes, though, because it's index one onwards. len is a to-do here, because we need to actually find the first null byte. Otherwise... otherwise this is easy.

Otherwise we know we're in the heap case, so we just do: let len = unsafe { self.heap.0 }, and pointer is self.heap.1, and this is std::slice::from_raw_parts. This whole thing is really just unsafe: from_raw_parts(pointer, len).

And then, interestingly, Drop is actually pretty easy: only if the tag is not 0xFF do we care, and in that case we can do Box::from_raw of the pointer and drop that.

That was pretty easy. This one's actually easy too, right? We could do this by just counting the trailing newlines... sorry, the trailing zeros. No, trailing zeros won't work, because you might have a character that ends with some zero bits. So I think this is just self.inlined...

...searched with libc's memchr, over self.inlined; n here is 16, and we're looking for zero.

And if the result is null, then it's the whole slice; otherwise the len is the end's offset from the start. And then it's this. Technically that minus one, I think. I love the "I think" there. "Safe."

Don't do this at home, kids.

Why is it unhappy? Oh, I guess technically this should be this. Okay, this can just be as_ptr() cast to *const c_void, and get rid of my reference. And what's wrong here? Now it's unhappy because that should be a c_void, which it's not; this should instead be a *const u8. This is the length, and it's fine to cast the difference to a usize because we know the end comes after the start. And this should be up to the length.
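Pulling this stretch together, a condensed and hedged sketch of the StrVec (assumes little-endian 64-bit; a repr(C) struct replaces the raw tuple so the tag byte reliably overlaps, an iterator search stands in for the libc memchr call, and the length is stored big-endian in anticipation of the endianness fix that comes up just below):

```rust
#[derive(Clone, Copy)]
#[repr(C)]
struct HeapRepr {
    len_be: u64,  // length stored big-endian so its first byte stays 0
    ptr: *mut u8, // pointer to the boxed bytes
}

union StrVec {
    inlined: [u8; 16], // inlined[0] == 0xFF tags the inline representation
    heap: HeapRepr,
}

impl StrVec {
    fn new(s: &[u8]) -> Self {
        if s.len() < 16 {
            let mut inlined = [0u8; 16];
            inlined[0] = 0xFF; // tag: this is an inline string
            inlined[1..1 + s.len()].copy_from_slice(s);
            StrVec { inlined }
        } else {
            let v: Box<[u8]> = s.into();
            let len_be = (v.len() as u64).to_be(); // keep byte 0 clear of 0xFF
            StrVec { heap: HeapRepr { len_be, ptr: Box::into_raw(v) as *mut u8 } }
        }
    }

    fn as_bytes(&self) -> &[u8] {
        unsafe {
            if self.inlined[0] == 0xFF {
                let s = &self.inlined[1..];
                // Station names contain no NULs, so the first 0 ends the name.
                let len = s.iter().position(|&b| b == 0).unwrap_or(s.len());
                &s[..len]
            } else {
                std::slice::from_raw_parts(self.heap.ptr, u64::from_be(self.heap.len_be) as usize)
            }
        }
    }
}

impl Drop for StrVec {
    fn drop(&mut self) {
        unsafe {
            if self.inlined[0] != 0xFF {
                let len = u64::from_be(self.heap.len_be) as usize;
                drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(self.heap.ptr, len)));
            }
        }
    }
}
```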

Cool. And it's unhappy because it needs Eq and Hash. We can implement PartialEq for StrVec, super easy: it's just <[u8] as PartialEq>::eq of self and other. And it's unhappy because we need to do this or this; I guess I can use as_ref to make it explicit, at which point I don't even need that: self.as_ref() == other.as_ref(). Then I can also impl Hash for StrVec, and similarly, hashing is just self.as_ref().hash(state). Then it complains about something else, which is, oh right, that we also need Eq. That's fine, impl that too.

"Is your system big-endian? Otherwise this will crash with strings of size 255." Why, do I have my endianness inverted somewhere? Oh yeah, I do, don't I? Because if the length is stored little-endian, then the tag byte of the length would be 0xFF if the length was 255. So, okay: tempting then to convert to big-endian here, and from big-endian there.

Yeah, let's do that. Okay. So I have all the traits, and now it's complaining because it expected a reference to a StrVec. But get_mut should be okay if the key type implements Borrow<Q>. So what we want is for the key type, in our case StrVec, to implement Borrow<[u8]>. What we're promising by implementing Borrow is basically the same as AsRef, except we're also promising that the implementations of Hash and Eq are equal to those of the underlying key type. And now get_mut works, which allows us to do lookups using plain slices. We don't have to construct a StrVec for lookups at all; we only construct one in the case where we have to insert, and thus allocate.
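And the trait plumbing that makes the slice lookup legal, sketched to match (as_bytes is the accessor from the sketch above; the stream routes this through AsRef):

```rust
use std::borrow::Borrow;
use std::hash::{Hash, Hasher};

impl PartialEq for StrVec {
    fn eq(&self, other: &Self) -> bool {
        self.as_bytes() == other.as_bytes()
    }
}
impl Eq for StrVec {}

impl Hash for StrVec {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.as_bytes().hash(state);
    }
}

// Borrow promises Hash and Eq agree with the borrowed form, which holds
// because all three delegate to the same byte slice. This is what lets
// HashMap<StrVec, V>::get_mut accept a plain &[u8].
impl Borrow<[u8]> for StrVec {
    fn borrow(&self) -> &[u8] {
        self.as_bytes()
    }
}
```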

And now this can be this, and this can be it, so we can actually keep referring to these. This can be this.

Watch this immediately blow up in my face. But I want to see what happens now. If I go down here... actually, I guess we'll see. Build. Run. We're trying to beat 21, and hoping it doesn't crash. Come on: be faster than 21 and don't crash. I don't think it's faster than 21.

The thinking here, right, is that we can inline the station string into the hashmap key, and that way the lookups should be faster. That's at least the hope. Now, this is slower, but why? Oh, it could just be because I haven't run it in a while, so the file needs to get warmer in my page cache. Maybe. But these names all look right.

And this town name looks... 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. So there are names longer than 16 bytes, and many that are shorter. 26. Actually, let's leave that running a little longer; I want more representative samples. That's probably a good perf report now.

How do we get out of here? I mean, it's all in the hashing. But why? It's driving me nuts that I can't... equivalent_key, find, match_tag. Yeah, basically none of the time is spent in StrVec. That's interesting. It makes me really curious why this optimization is not better.

We can do even better. Okay, here, let me just try one stupid thing. We're going to store the length in the first byte, because all that matters is that it's not zero; so we store the length plus one, so that it's non-zero even if the string length is zero.

And then: if it's equal to zero, it should be dropped... no, if it's not equal to zero, then the length is self dot... okay. I think I'm just going to do this. Okay, this is just very unsafe; don't do this at home. inlined[0], right, as a usize. And now I don't need this branch, and I can just say the slice is one to len minus one.

Yeah, it's all unsafe, I know. I know. Will it go brr? What do we think? I think it might go brr, because that code was executed on every dereference, and now it's just arithmetic over the pointers. 24. Okay, so we made it faster, but it's still a little slower than the keys being Vecs.

Why? Why, why, why? I mean, okay, what if we're stupid? What if we just do that instead? That way we'll never miss the branch. Maybe... I mean, we could even make it 100, right? 24. Okay, so that didn't really make a difference. I was just thinking of some of these places that have much longer names, like this one: 24 characters. Don't tell anyone.

Everything is fine. It's all good. Don't worry about it. Don't call the police. It's fine.

I'm just really curious why this is worse, because the branch predictor should be perfect at the is-it-inlined branch: all of them should be inlined. 25. Hashing and equality should be the same as for a plain slice, because they just do this. Oh, I bet it's because the read isn't aligned anymore. That could totally be.

Here's something interesting: what if we make it the last byte instead? So we say this, and then this, and then this. I just have a suspicion. And actually we don't even need the extra check here now, because you don't need to offset. You think I'm leaking memory? Why? I think this should be fine. I just want to see, because now the slice comparisons are going to be aligned, and that does make some CPU operations faster.

Well, still correct but not faster.

Box::from_raw.

Uh, that's true, but that shouldn't matter. The deallocation here should still be right.

Doesn't know anything about the length of the allocation. I don't think that matters.

Um, although, I mean, it could be that the deallocation just fails. I think it depends on whether the allocator requires the length, but it usually doesn't. It usually just requires the pointer, and to know the alignment of the pointer, which will be the same here. Um,

but I mean it's fine right? If you if you want us to, we can always do uh

do this... uh, can you even construct...?

Yeah, you can. Okay. So, uh, ptr... len... slice_from_raw_parts, this guy I need, and then slice pointer. But I don't think that should matter.

Yeah. All right. All right. All right.

const INLINE, uh... is a u8... is 32.

Last is INLINE minus one.

I hope you're all happy now.

Uh, but I'm still... So the reason I don't want to make this a bigger number is because you're bloating the size of the hashmap. Uh, and if you bloat the size of the hashmap, it means less of it fits in cache. Uh, which means you're more likely to miss in cache when accessing the hashmap. Uh, which in turn means that your lookups will be worse, because we're accessing the hashmap a lot here. And so it's almost like... I wonder if we should do less. Like, the reason I wanted 16 here was because that way we're not actually bloating the hashmap at all.

If we do 24, then we are bloating the hashmap, because that's larger than the size of these two combined.

Uh, and so you would end up with less hashmap fitting in cache, which makes people sad. Um, can we do better here? I'm just... with 16,

I don't quite understand why this is slower, because it should strictly mean you need to do less pointer chasing, which means less stuff has to pollute your cache.
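For concreteness, a quick check of the sizes being traded off here (not from the stream, just what the standard types give you):

```rust
// A Vec<u8> key is pointer + capacity + length (24 bytes on 64-bit),
// while a 16-byte inline buffer is no bigger than a key already was.
fn main() {
    // Vec's layout is three usizes in practice, though only the 24-byte
    // total on 64-bit targets is what matters for the argument above.
    assert_eq!(std::mem::size_of::<Vec<u8>>(), 3 * std::mem::size_of::<usize>());
    assert_eq!(std::mem::size_of::<[u8; 16]>(), 16);
}
```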

Yes. 24.

Feels a shame to throw it away, because I'm, like, pretty sure this should be faster.

Uh, here's what I'm going to do. I'm going to do inline(never) on this guy. And then I'm going to do a perf record. I just want to see... I just want to see where this, like... is there something we're doing that's real slow in there?

And also, why do we still lose stack frames? I'm still confused about that part, too. Sorry. Now, why do we still lose stack frames? Uh,

chain.

That's disturbing.

This certainly makes it look like there are not that many collisions because on collision is when you need to actually compare for equality, right?

Uh, no, you always need to compare for equality, because you don't know if the lookup was a miss.

Like, I don't even see calls to as_ref here.

Collisions causing slowdowns in lookups? I mean, yeah.

I'm just like, why?

We might have to look at the performance counters here. Oh, this looks better.

Uh, still a bunch of time that goes into this, like, broken stack trace. Actually, here's what we're going to do. Uh, get rid of this. Get rid of that. Um, and then I'm going to do perf, uh, stat.

Run that for a while.

Okay. And then we're going to git diff. All that's here is the inlining, right? Nothing else.

So, I will git stash.

Uh, and then I will cargo build, and then I will perf stat.

I'll let it run for a while. I will kill it, and then we'll run it again. So we make sure it runs warm with the binary as well.

And then I want to compare how we did here. So this is with the vec, and up here... I should really do this less stupidly, shouldn't I? Um,

heap.txt.

Oh okay.

Okay.

Stash pop, cargo b.

Just let this run for a while to warm up.

And then we do this. But this will be in line.

And then I want to do I want a diff heap and in line.

So we have heap on the left, inline on the right.

So inline has fewer stalled cycles, more instructions per cycle, fewer frontend cycles idle.

So why is it slower?

I don't understand branches. Ah, it has more branches and slightly more branch misses.

It runs fewer cycles though and it has fewer stalls.

has fewer L1 misses.

Oh, it has way more TLB misses. Why?

specifically for the data cache.

Why?

Why? Why? Why? Why? Why? This isn't any bigger than what we had before, right? Previously, the keys were vectors. Vectors hold a pointer, a capacity, and a length. So, previously it was even worse. It was 24 previously.

Oh, you're right. The numbers don't matter, because I, uh, I didn't run them for exactly the same duration of time. That's true.

Although, I was pretty close actually.

Look, this one ran for 8.88 seconds and this one for 8.802.

That's the difference between me pressing Ctrl-C for the two of them.

So, actually that was pretty good.

and like 1.9 million and 3 million.

I mean, okay, let's let's go over look here.

Yeah, like look here.

Lo... uh, DTLB loads. This is doing 2 million per second, and this is doing 320k per second. So this misses more. But this is doing way more DTLB data loads. Why? Interesting. We shouldn't be polluting the data cache anymore.

That's fascinating.

We also have fewer branch misses.

Actually, we do we do slightly more branches, but we have fewer misses.

Interesting. Okay, I have to think about that. Okay, I think we should take a quick break. Um, and then I'll come back and continue. But I need to eat something and get some water. Uh, so we'll do a tea break. Um, and then I'll be back shortly after.

So, I'll see you in like five minutes.

And we're back. I now have water. I've

eaten some amount of homemade pizza.

I'm happy. Um,

okay. I'm still trying to figure out what's going on here.

What I noticed is that we also have more page faults.

And I suspect this is related, because, like, DTLB loads and misses are... DTLB misses are us trying to access data that is no longer in the CPU's, um, translation lookaside buffers. They're basically, like... they've been paged out by the time we look at them.

Now, why would this be happening is the other interesting question here.

Like, why would we push things out of the TLB as a result of doing this?

And I don't see why, because if you look at a Rust vec, right? The Rust vec, uh, source is at the very least a pointer and a length.

And so if we're saying we're 16 bytes long, then we should be at most as big as a vector.

So we shouldn't be polluting with any more data. We're also no longer accessing the heap in a bunch of cases.

And so that should also be less pollution for the TLB in the first place.

Maybe something SIMD changed.

What if it's because of hash collisions?

Well, this shouldn't cause any more hash collisions though is the thing.

Yeah, we we might have to just do some I'm just like very Oh. Uh,

I should really have cargo-asm.

Is this going to make a mess of my terminal? We shall find out. Whoa.

What?

What?

Why is it doing that?

Oh, I need to give it main could not find function main. Oh, it's

okay. Fine. Can I just do main?

Can I can I not specify the the hash here?

Okay.

I don't understand why cargo-asm is being unhelpful, huh?

Weird.

Um, the other thing is, like, do I need repr(C) on this?

Well, that ran for longer and it has fewer page faults per second, but it does still have the high number

of data loads.

You know, I wonder... it could be that, um... I'm just trying to think whether Vec could have an optimized PartialEq or Hash, as in more optimized than slice, but that feels unlikely.

Um, the other option would be if hashbrown has an optimized implementation if the key is Vec<u8>.

That also feels unlikely.

I'm just the thing I'm really confused about is why are there more misses?

because we're not polluting like we're not we're not causing any more data to be

accessed like more distinct memory addresses to be accessed.

So why oh why oh why?

I mean, just as a sanity check here: the thing that changes is the size in bytes of the buckets inside the hashmap.

No, that does not change, because I'm setting the size here to be the same size as a vec, right? Like... well, okay, fine, a vec is 24, because it's length, capacity, and pointer. So, okay, fine, I'll make it 24. But we tried that, too, and it didn't really make a difference. And so the buckets should be exactly the same size.

Why?

Yeah. Still 25.

repr(C) did nothing.

And, I mean, okay, just to check that I'm... Yeah, it's 16. Okay. So it's not the representation that's messed up or something; it is 16 long. Uh,

maybe it sometimes tried to pre-read from the inline data, assuming it was a pointer.

It could do that with vec. With vec it could be prefetching, but, like, with the vec it always has to go to the heap, which means it has to always access more data.

Oh, you're saying maybe it prefetches based on... it tries to prefetch this because there might be a pointer there, and then there isn't, and so therefore we prefetch stuff that's causing page faults? That seems unlikely.

So there is this part but like the minus one shouldn't matter. This slicing

is also really just a subtraction of the length.

I mean, okay, we're in unsafe land, right? So, uh, slice::from_raw_parts(self.inlined.as_ptr(), len).

But, like, why would the sub-slicing matter here, right?

We could also add, like, an unlikely annotation on the path that loads, but the branch predictor should be...

Okay, that was slightly faster.

Just again, it's a sanity check, right?

Like 21 is the thing we're competing with.

What?

This is 26 now. Weren't we at 21? We were at 21, right? I'm not insane.

What?

That's taking 26 now.

So we So we are faster.

What?

I'm sure I saw 21 seconds when we were comparing to Ahash.

So this is faster, but that's unhelpful. Uh, yeah, executed in 21 when we added the stuff for the hasher.

Yeah. And then next we tried the vector stuff, and the vector stuff was way slower.

Sys. So that's 20 and 72.

Oh, we do have more sys time now. Don't know why we have more sys time now.

Well, okay. So, what size was 100,000? I

don't I think I kept that. No. Uh,

yeah, I kept 100,000.

Okay. Well,

me, I guess.

Uh, well, this is faster. Great. This is just inline strings.

And I'm fine with this being 24 to reduce the number of cache misses here. But let's... Okay. So, cargo b, release.

It's the room temperature.

You, uh, you could be right. Okay. So, let's do this now. Let's do, uh... the CPU is tired, yeah. Uh, I just wanted to see, comparing.

So, this one up here is with 16 and this one down here is with 24.

So with 16 we're at 10,000 page faults per second. With 24

we're at 12.

2.62 instructions per cycle. 2.62... 0.04 stalled.

We have a lot of frontend cycles idle and the branch prediction is slightly better with 16.

But marginal.

Why do we have so many frontend cycles idle? That means we're waiting a lot for memory to load.

Okay. I want to try I want to try something.

I want to try the same optimization that they made in the solution we looked at, which was, um, to process each third of the file in parallel.

Sorry, not in parallel: concurrently, because it's strictly not parallel, still one thread. Um,

so the way they do this is they take the file, they split it in three.

Okay, I think this is pretty easy, right? Um, because we can just do... um, we have, like, at one, at two, and at three.

Uh, and we say map.len / 3, 2 times map.len / 3.

Uh, and we need to be slightly careful here. Actually, I think what we really want to do is say something slightly different. Map one is... ah,

so what are we going to do if we split in the middle of a line? I think what we want to do is to say you own reading all the lines

that start after the position you're given.

Uh, or do we say you own every line starting from the one you're pointed at? I'm

basically trying to figure out like if I'm pointed in the middle of a line, do I process that line or does the the or is that line considered a part of the

previous chunk?

Which one's easier?

Feels easier to say I'm going to look for my first new line and then process after that. But that doesn't work for the first chunk. But maybe we just special-case the first chunk here.

Uh like do I want to seek back?

I guess the seek back is pretty cheap.

So we would do how would we do this? We

would say um like at three is um map until

here.

Uh, and then I want to It feels real stupid for this to be it rfind, but it varind.

Uh, and what I want to find is a new line.

What? Can't compare.

Okay.

I thought you had that directly on a slice. No.

Why don't we have find on this? Feels like it should be doable. Um, but we could memchr-search here as well. But let's do it the stupid way first. Um, so this will give me... I want our position, actually.

Um, and then I want at two to be... I guess these can both be like this.

That's fine.

Um, so at one starts from zero, at two starts from this new line unwrap + one.

And then we just need to make sure that they actually stop in time which we do by saying that uh

let map one is going to be map until at two.

We say map two is map from at to at three.

And we say map three is at three and onwards. And then we say uh

at two is zero and at three is zero.

And then it becomes line one, line two, line three, map one, map two, map three,

uh at one, at two, at three.

and then it'll be interesting to see. Um

I guess the way they had it was: if not line one is empty, and not line two is empty, and not line three is empty, then they do the sort of joint thing.

And then they did like if is empty.

Line two. Line two.

Line three. Line three.

Oh, why is it very unhappy? Found macro

line. Did I

uh if line one is empty and line two is empty and line three is empty then break.

Uh so if if all of them are non- empty then they did this.

I still don't know that this will matter.

Uh actually I could maybe imagine it matters actually now that I think about it because

it will mean that imagine the hashmap has been pulled out of the TLB.

Well, if we do all three hashmap lookups back to back, then we'll pull it into the TLB once, and then we'll get to reuse it, because the other accesses are very proximal to the first. Uh, and I guess I can then do station one, T1.

Station two, T2.

Station 3, T3. Just so I don't have to actually repeat these.
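A rough sketch of the three-cursor scheme being set up here, assuming the file is an mmap'd &[u8] of `name;temp\n` lines; the helper names are made up, and this reconstructs the idea rather than the on-stream code:

```rust
// Each chunk boundary is moved back to just past a newline, so every
// cursor owns whole lines; the main loop then interleaves one line from
// each chunk per iteration, so the three hashmap lookups happen back to
// back while the table is hot.
fn split_in_three(map: &[u8]) -> (&[u8], &[u8], &[u8]) {
    let split1 = map[..map.len() / 3]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    let split2 = map[..2 * map.len() / 3]
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (&map[..split1], &map[split1..split2], &map[split2..])
}

/// Returns the next line in `chunk`, advancing `at`; empty once exhausted.
fn next_line<'a>(chunk: &'a [u8], at: &mut usize) -> &'a [u8] {
    let rest = &chunk[*at..];
    match rest.iter().position(|&b| b == b'\n') {
        Some(i) => {
            *at += i + 1;
            &rest[..i]
        }
        None => {
            *at = chunk.len();
            rest
        }
    }
}

fn process(map: &[u8]) {
    let (m1, m2, m3) = split_in_three(map);
    let (mut at1, mut at2, mut at3) = (0, 0, 0);
    loop {
        let l1 = next_line(m1, &mut at1);
        let l2 = next_line(m2, &mut at2);
        let l3 = next_line(m3, &mut at3);
        if l1.is_empty() && l2.is_empty() && l3.is_empty() {
            break;
        }
        // handle_line(l1); handle_line(l2); handle_line(l3);
    }
}
```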

I'm very curious whether this makes a difference.

Doesn't feel like it has any right making a difference, but I don't even know what we're comparing to now. 23, 24, 25, somewhere around there.

Oh, did I undo the 16? Yeah, I did. Okay.

Yeah, this is for sure slower.

It's It's very much slower, actually.

Whoa.

Yeah, that that sure feels wrong.

But but why?

Oh, this is probably the index into map.

Well, that is interesting.

Yeah, I need the backtrace back here.

Yeah, all the ats should start at zero, because they all get a dedicated map that has the correct start index.

But why did it crash after 40 seconds?

That makes me feel like it crashes when it tries to print.

Cross that 237 145.

Map three.

Oh, I know why it's broken. It's because of the end condition. It's because we haven't written next line, um, such that, like, if at is already at the end, uh... past the end here.

Yeah, it's it's this part.

Uh, it's because... Yeah, we can fix this by doing, uh, this,

and then we don't add one in this case.

What? Oh, and this... But this still suggests that it's failing when it's trying to terminate the loop, which means it's already tried to run the whole thing, which means total execution here will be somewhere around 40 seconds.

Why?

I mean, the hope here, right, was that we would... The hope is that we would... Yeah. 40 seconds. Okay.

Do Do we end up with the same results?

Just sanity check here. I think so.

Yeah, looks the same. Um,

what were they doing in their preloading?

Um, like what does word do here?

It just gets the value at that address.

So they read the values at each of these addresses.

which should be pretty equivalent to what we're doing here, right? Like, what we're doing at the beginning of each of these is, um, we call next line... or, these are just constructing pointers,

and then we call ah we do start the search already.

So, this makes me wonder whether we could do a sort of prefetch hint to the compiler here, or to the CPU,

rather than do the... mem... The thing I'm worried about is that we end up stalling the CPU, because there are too many instructions between the load and the next load.

Um, and so the way we could do this, I think, is, um... one. Now, this won't actually work, but, um, we could maybe do hint::black_box to basically tell the compiler: hey, don't get rid of this access.

I don't think that'll make a difference, but we could try.

I wonder if the compiler isn't inlining those functions now.

Oh, you could be right.

You could be right. Uh, let's let that run first just to see.

That could totally be. And that would have this a similar problem of there would be too much code in between the two.

Uh, madvise sequential isn't correct anymore? I think it is right, because, as far as I understand, the point of, uh, madvise sequential is that the kernel shouldn't keep a page around anymore just because it's been read. So the intent being that as you stop reading a page, the kernel is okay to get rid of it. So things will disappear more rapidly.
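For reference, a minimal sketch of the madvise(SEQUENTIAL) setup being discussed, assuming the memmap2 crate; the exact crate and setup used on stream may differ:

```rust
use memmap2::{Advice, Mmap};

fn map_file(path: &str) -> std::io::Result<Mmap> {
    let file = std::fs::File::open(path)?;
    let map = unsafe { Mmap::map(&file)? };
    // Tell the kernel we read front to back, so it can read ahead
    // aggressively and drop pages once we are past them.
    map.advise(Advice::Sequential)?;
    Ok(map)
}
```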

Oh, uh, ah, okay, well, that means there is a load for them now. Uh, but let me try to see if I don't do that. The error... the reason here is because at can point at the end of the array, and so there's nothing to read. Um,

and the inlining might make a difference here. Um, I think we really need to look at the assembly here for what it generates.

I'm annoyed that um, cargo ASM doesn't work.

Although, perf will give us roughly the same thing, with some annotations.

No, this is still definitely slower.

Interesting.

What if we just did two instead of three?

Like, why is three the right number here?

Um, so the reason this is supposed to help is because, um, it allows us to make the CPU load, um, two regions of memory that we know we're about to access, at the same time, and therefore in parallel. So currently, like, if you don't do this, right, then imagine one CPU core is running: it wants to load the next address, and when that loads, then it will compute over that. But while it's doing that compute, in theory it could be loading the next thing we're about to process. But currently there's no IO overlap here. Or, actually, maybe the way to describe it is more: while it's loading, the CPU is able to load multiple things in parallel. And so currently, we're only giving it the ability to load one thing at a time.

With this way, we should be able to have the CPU load multiple things concurrently, and therefore, uh, not be stalling at every loop. Um, that was the thinking.

Uh, cargo asm, bin... cargo: there's no such thing.

Do I have the wrong cargo ASM?

That is the cargo ASM I have.

I mean, there's also cargo-show-asm, but now I'm curious. This was updated five days ago. This was updated six years ago. I have the wrong cargo-asm. Cargo uninstall cargo-asm. Cargo install cargo-show-asm. That's why cargo was, uh... I want main. Can I get main?

by name or by sequence number zero.

Okay. Uh what do we got here?

Feels like more instructions should be needed. Also, I want comment annotations to know where I'm at. Uh,

why is there panic machinery here?

Okay.

Oh boy.

Oh, this is all the hashtable stuff.

This is going to be less helpful than... I don't... I want to add --everything. What does that do?

I don't think that's better. Uh, we already have panic equal support.

Uh, I want to find... No. I want to find next line.

Why isn't it... This is not helpful.

What?

Uh, yeah. I need something that's way more scoped here. Uh, even if it got mangled, it should still have next line in the name.

I just want it to give me the assembly kind of starting here. All right. What do we got here?

Cargo ASM help.

Uh, keep labels. Keep blanks. Reduce labels. Keep blank lines.

Uh, That's not super helpful.

Include other called functions recursively up to count depth.

This defaults to zero.

Well, that is what I want, but it doesn't seem to be what it's doing. Uh,

Rust this workspace C1.

Like this is clearly emitting things that are much deeper than one deep.

I just want the assembly. Okay, we're

going to not do that this way. We're

going to instead do perf record.

I'll let it run and then annotate main report.

Uh annotate main please.

madvise, malloc, memset... give me the source annotations, please. Okay,

this is already where we're doing that... I want to find... that's the map entries... these... wait, no, this is the... this feels very jumbled.

Next new line is null.

This is the is empty call.

Okay, so this seems promising.

If line length is greater than 64, which it never is... so then we get down to the SIMD stuff.

SIMD.

So this is um this is split semi, right?

Yeah.

I'm wondering.

Oh wow.

Perf seg-faulted. Cool. Uh, annotate this for me again, please.

Yeah, we're not executing that line at all. We're instead taking the SIMD here. And so that is taking some amount of the execution of main.

What's confusing me is, like, okay, this is percentage of the time spent in main.

Oh, it's probably of the self cost of main, though. Is there a way to add a column here, uh, for... Yeah. T.

No, that's not what I want. Uh

ah P.

Yeah. So this is percentage of overall execution time.

Yeah. So these add up to basically nothing... is finding our... Oh, and then it seg-faults again. Okay, cool. Uh,

annotate, switch to global tab.

In fact, none of these are particularly hot, which makes sense because all the time is spent in the hashmap lookup.

Interesting. So, I don't think we can even... like, it could be that this makes a difference, but it doesn't... like, that we could make this make a difference, but given that all the time is going to the hashmap, it just doesn't. Um,

which is pretty frustrating.

Uh, a flame graph wouldn't help us here. It would give the same as what we get from perf report, which would really just tell us all the time is going to the hashmap. Um,

okay here's another maybe stupid idea.

Uh I am going to keep this.

Uh I'm re-entry.

Not going to keep this.

Um so one other option here that we could explore is to have a hashmap

per first character.

It might be stupid, but it might also not be stupid.

Let me just... let me just... Um, this should probably have way fewer, right? This should have, like, 64, uh... hasher.

This will only ever hash characters.

Uh fn write.

What is the implementation of hash for character?

What does it call?

U32.

I suppose that makes sense.

Uh, and so this is then just going to be 128 from u. Right.

And then we're just going to... No, I did something stupid again.

This is just going to be a to-do because we should always be hitting that one.

I just kind of want to see... uh, it can't be a linear array, because the... Well, I guess we could just do the first byte. Um, maybe we just do the first byte, because, uh, these are Unicode characters, right? So the first byte could be the umlaut. Uh, but fine, well, we can do that.

All right. I buy it. I buy it. All names are not ASCII.

Um, so we'll say that this is a... uh, then this changes slightly. This becomes an array of these.

Uh, yeah, I think we're going to... Okay, let's try this. This, of length 255.

Well, I guess 256. Uh, and it's going to be std::array::from_fn... hashmap.

So, we create one of these.

We say stats is stats.

This could actually end up being worse because we'll pollute the TLB more. But

this can probably now be 10,000.

Uh, why can't it be indexed by U8?

Well, that's interesting.

I don't understand why this needs to be... Shouldn't it be obvious that this is this? Why is the compiler unhappy?

Oh, because, uh, because this is a u8, but this is an array. So, this needs to be usize::from the u8. Yeah.

And then down here we actually need to do stats dot... va... stats dot iter, uh, dot flatten. Hm.
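A sketch of the pre-bucketing experiment being set up here: one map per possible first byte of the station name. Stats stands in for the per-station aggregate used elsewhere; this is a reconstruction, not the stream's exact code:

```rust
use std::collections::HashMap;

struct Stats; // placeholder for the real per-station aggregate

fn demo() {
    // 256 maps, one per possible first byte of the name.
    let mut stats: [HashMap<Vec<u8>, Stats>; 256] =
        std::array::from_fn(|_| HashMap::new());

    let station: &[u8] = b"Tromso";
    // Index by the first byte (names are non-empty); the u8 must be
    // widened to usize to index the array.
    let map = &mut stats[usize::from(station[0])];
    map.entry(station.to_vec()).or_insert(Stats);

    // For output, iterate every map and flatten into one view.
    let _all = stats.iter().flat_map(|m| m.iter());
}
```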

Uh, why 256 and not 255?

Um, because there are 256 possible values if you count zero. No.

Am I being stupid?

Yeah. Um

okay. So that's not faster.

Even though this should reduce collisions.

Yeah, we should be able to be way smaller than 10,000 here, right? We should be able to do... I mean, we don't actually know the distribution, but let's say we do a thousand. But even that I don't think should make a difference.

The hashing here is very upsetting. 26.

It doesn't really make a difference.

390.

Sure. I don't... even if I make this 300, right? I don't think this will really help. Not to mention the keys aren't uniformly distributed over the first byte.

Um, I guess this will be about the same, maybe slightly worse, actually. Um, it's slightly better, but overall this is still worse. Right.

Like, if I git stash... I can't type.

So we were at 24.

I think this is like I think this ends up being a wash.

You know, maybe the problem is that we've made the hashmap too big.

Okay, so there's a slight win to this, but, like, it's marginal.

If it was a reasonable optimization, the hashmap would probably implement something similar. Um, it can't necessarily, right? Because it doesn't know the nature of the distribution of your inputs.

Yeah, they're like about the same.

But it does make me wonder, like, what if we did this... like, just reduce the size of the hashmap, without the, um, pre-bucketing by the first byte.

Like does that buy us any time?

Doesn't seem like it.

It's about the same.

I'm just thinking, because if there are fewer buckets, then yes, you collide more, but you also use less of the cache. So you're more likely to hit in your cache rather than miss.

I mean, what if we made it even smaller, right? Like, what if we made this a thousand? I just want to see if there's, like, a meaningful... although I suppose we might then just end up with a hashmap resize. We won't necessarily know if it has an impact.

Oh, that is faster.

24.

I mean, what happens if I just... uh, how many unique stations are there in the input? Um, up to 10,000.

But for this particular generation, I'm not sure.

I mean, we could easily print it out, right? We have it above here. This many.

So that's one, two, three, four, five, six, seven times

10 20 30 40 50.

It's about 500.

Yeah. So that didn't really save any more. In fact, letting it dynamically size did the same as when we set it to a thousand, which sort of suggests we could set it to a thousand to save a couple of the initial resizes. But, um... and I think this could go away.

Okay.

It does really bother me that we can't optimize the hashmap out of this more. I

mean, we could, right? Like we could implement our own hashmap here with a better default strategy, but

what's the other one that was like this was one of the other ones that did really well.

I just want to see what they do for the map itself.

That's just... you see, they do the same thing of having multiple concurrent scanners. And I think this is to basically do preloading, although that didn't really matter for us. But it could be that it would matter if our hashmap wasn't such a bottleneck.

But what do they end up with here? They

use record. What does record do?

Final calculation for the index into the hash table.

They compute the hash as the XOR of the first eight bytes with the second eight bytes, which is similar to what the other thing did. And then they do hash-to-index, which is just a bit shift by the number of buckets.

Yeah, this also feels like hard coding, right? Like... there. This just... Okay. Check for collision,

because up here they just look up the existing result, and if they're equal, then it returns the existing result.

Okay. But what do they do on collisions if they get a collision?

Yeah. Then they probe for a different index.

But they don't do that here like in the lookup. They don't look at any other hashes.

Right? Am I being stupid? So here they're extracting word one and word two. They're computing the hash. Fine.

Then they turn the hash into an index, which is basically a bit shift to do a modulo. And then they look up in results, which is the table they have... like, it's the bucket list. And

if that entry isn't null then they check that the name is the same and then it

they return... ah, if it's not... I see, if it's not, then they go here. So

this is for insert and for lookup. I

see.

But presumably this is what Hash Brown does as well.

Yeah, I think the trick here is to basically do the same.

All right. How about this?

I think we need to we need to bring out some some heavier guns here. Um,

not that we haven't brought out heavy guns so far. Uh,

so we can either use the raw hashmap API to do this or we could have our hasher just

instead of all of this stuff.

We just say um that the hash is word one.

I feel like we've already been down this path, but we will do it again. So, damn me. Uh,

word one, word two. Actually, no, just word... word is this, which is 16.

Um, and then word dot copy_from_slice... bytes... bytes.len... and then word is...

Now we should be able to cast this into word as, uh, [u64; 2], right?

That should be I said we were going hard here. Uh

transmute that because these have the same memory layout.

uh, [u8; 16].

Uh, and then we're going to do word one XOR word two.

Whoa.

Why 'index 24 out of range for slice of length 16'? Oh... uh, dot min of 16.

What? 'Source slice length does not match.'

Um, I'm sure there's a version... there's, like, a memcpy thing that will copy at most the minimum of the two... bytes... ptr copy, from bytes as pointer, to word as mut pointer, of length bytes.len min 16.

Yes.

Now, part of what bothers me here is I'm pretty sure there's an extra call to our hashing, which is when you hash a slice in Rust, it will also include the length

of the collection as part of the hashing.

Right? And that's, I think, because of this extra thing. Like, uh... if we now do perf record... uh, '[u8; 16] and [u64; 2] do not have the same alignment.' That is true.

Um, I suppose I can bring out the big guns. Um, that's true.

Cuz in theory, we could go the other way, which I suppose is what we would have to do.

Um, I just want to see.

Yeah, exactly. Um,

I want to see what on earth is going on in here.

Now, we're spending more time in splitting. Why? Um,

yeah, we're spending almost no time in hashing now. We're spending way more time in equivalent_key. Equivalent_key, borrow, and as_ref.

Well, that's interesting.

Is that because we have more collisions? Hashbrown uses the top seven bits of the hash to group buckets. So you need to use a hash function that gives you good entropy in the top bits.

Uh yeah unclear.

Do we currently get that?

What I'm trying to figure out is whether this is like linear probing that's causing us to do way more calls to borrow.

Yeah, I think... All right, let's try the, um... Okay, so we'll do this as, uh, [u64; 2].

Um, and then we'll copy... uh, we'll do the transmute here,

uh, [u64; 2] to, uh, [u8; 16].

In fact, this just goes from a *mut u64 to a *mut u8.

That's the actual transmute, right?

Um, because this function I also want to be a no-op... hasher, prefix-free extras.

Um, and then someone was saying invert the bits. Or, for pointers, you can cast... that's true, this can be a dot cast, uh, u8. Um... no, not bit mix... bit mixing and finish at all. This is... Now I'm just trying to copy the exact same hashing function that this solution is using, and it is just a straight, uh, XOR between the first two words, plus... plus, um... oh, they are actually doing that. You're right.

Uh, they're doing Oops.

No, you're totally right. We should do the same.

Zero Yep.
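A sketch of the hash under discussion: XOR the first two 8-byte words of the key, with a rotate and multiply folded in so the top bits (which hashbrown's control bytes come from) see some entropy. This is a reconstruction under those assumptions, not the stream's exact hasher:

```rust
use std::hash::Hasher;

#[derive(Default)]
struct WordXorHasher {
    state: u64,
}

impl Hasher for WordXorHasher {
    // Intended for keys hashed with a single write of the key bytes.
    fn write(&mut self, bytes: &[u8]) {
        // Copy at most 16 bytes into an aligned buffer, then read two u64s;
        // this sidesteps the [u8; 16] vs [u64; 2] alignment problem.
        let mut word = [0u8; 16];
        let n = bytes.len().min(16);
        word[..n].copy_from_slice(&bytes[..n]);
        let w1 = u64::from_ne_bytes(word[..8].try_into().unwrap());
        let w2 = u64::from_ne_bytes(word[8..].try_into().unwrap());
        // Rotate + odd-constant multiply as a cheap mix; the constant is
        // an arbitrary well-known mixing constant, not from the stream.
        self.state = (w1 ^ w2).rotate_left(29).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    }

    fn finish(&self) -> u64 {
        self.state
    }
}
```

It would plug into the map via BuildHasherDefault, e.g. `HashMap::<K, V, std::hash::BuildHasherDefault<WordXorHasher>>::default()`.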

Now, I don't think this is going to... What I'm trying to decide here is whether the problem here is... 22.

Okay. I mean, that's a little faster.

I'm just trying to decide where the time is coming from here, because this ends up being a little bit misleading. Because if you look at main here now, if you look inside of here... well, okay, we're now spending more time in this because of the mixing and finish.

finish.

Um, it's interesting though, we're spending 40% of our time in hash finish.

Like genuinely 40% of our execution time is this.

Like not even the internals of the hashmap, just the computation of the hash.

That seems wild.

A third of execution in those three XORs with bit shifts.

Um, and then it's spending 50% of the time in get_mut, equivalent_key, tag_full, match_tag.

So match_tag here is looking for the... this is, like, internals of hashbrown, right? It's, like, deciding... like, grouping the buckets by the high bits of the hash, and then it's scanning all of those tags using SIMD stuff.

But like how is all the time going here?

I just don't understand.

What if Okay, just because I feel like I'm going nuts, it could be lying to us with sampling.

Uh, we could do... Oh, what? No. Stop whatever you're doing. I don't like this one bit.

Uh, okay. Let's do dash F, like, a thousand.

Oh, this sets the uh sampling frequency for the process.

So, I'm making it wake up less often.

I don't also don't understand why a bunch of this is getting disconnected from start.

I mean, they end up being the same. I'm

just Why calls This is the only thing that's weird. Um

why?

So we get hash_one, but we don't actually see... like, why is the fast hasher here inlined? I said inline(never). Oh, finish, right. I also want write. And I may not have compiled in between... The multi-threaded Java version is using all my cores. Um, I forget how many cores I have. Let's find out. I have 32 cores.

So we'll get a significant speedup from the parallelism. I just, like, in a way, don't care about the parallelism, because the parallelism is easy. Um,

this also doesn't make any sense.

Why is it... Why do I now see the other things take more time when slowing down the hashmap by setting it to inline(never)? All right.

What we got in here?

make_hash, hash_one.

Okay. 50% is spent in... Okay. So hash_one... now 7% is in finish. What? Perf is definitely lying to me. 7% is spent in finish. 50%.

So these add up to... Okay. Yeah. So most of it is spent here, and hash should just be calling... or, hash_one should just be calling hash and finish. So it calls hash, which calls hash slice, which calls our hasher, right? And if we annotate this, that's the assembly for it.

assembly for it.

Interesting.

So the XOR... We are now spending most of our time in the hashing, not in things like the traversal of the hashmap and stuff.

And we're spending a third of the hashing function, specifically, in the XOR. And... 23.

Now this is probably misleading on the perf side of things.

In the move.

Why in the move?

It's the right back into memory into self.

I don't think we're going to get this hashing a whole lot faster. Like, okay, we can rotate right here, for example, right? And this can be a rotate, right? Someone suggested that in chat, and that's all true. But, like, I don't think we're going to get a whole lot more performance optimization out of here. Um,

and, like, I think realistically we're not getting rid of this hashmap stuff.

So we might want to go to parallelism next. Um... next line, oops, uh, is mostly now memchr, and this one I think we could just replace with a single, um, SIMD search, sort of similar to what we did here for split semi.

And so that might be interesting to look at. Uh, parse temperature also feels like we could probably make faster.

You know, I think what was maybe throwing off our ability to figure out where the time was being spent was that we were inlining these, and therefore it just all got sort of wrapped up in the hashmap code. I think let's keep it that way.

And then let's see... what do I... So I think the comparison point here we have is, like, 24, 24.5 seconds. Um, let's try real quick to make the next line thing use SIMD.

Um... ah, but we can't necessarily... Yeah. Okay. Oh, what? Now it's 21.

Why?

I'm losing my mind here. 21 was what we had earlier. We haven't... We've just changed the code back, like, forward and back again.

Uh you can experiment with different number of lanes pretty easily. Yeah,

that's true. The CPU is cooler now. It

could be. All right. Well, I don't even know what to say anymore. Uh,

more efficient hashing. All right. Uh, let's do next line.

Uh, and do this.

So... how do we want to do this? We do this by saying we have semi and we have new line. New line is a newline. Um, and I think the fallback here is going to be, like, if we don't find a newline.

Um, so this is going to be... 'elements are read so long as they're in bounds.' Okay.

Otherwise, they're filled with default.

So, we can just pass rest here actually, which is pretty nice. Um,

and then we will find the index of that new line. But we're not guaranteed that there is one because there it might be

uh, because it might be that, uh... it might be that the delimiter is not in the first... uh... but we can only search 64 bytes.

Uh, so the search may have to fall back to memchr.

So otherwise we're going to do... it's not actually rest, it's rest, uh... it's actually rest starting at 16... right, because we... sorry, starting at 64, because we've already searched the first 64 with SIMD here. Um, minus 64.

Uh, and so if this also says ah, no, this isn't necessarily safe either because um,

down here we could be in a Oh, no. We

know there's supposed to be a newline. Uh, we know there must be a newline. So, uh, rest from 64 must be non-empty.

Um so rest rest is rest from 64 and onwards.

And this is now rest.

Uh, if we still don't get a new line.

Uh, this is not re-entrant anymore now, but that's maybe okay. Um,

I actually think I want the place that calls next line to be the one to update at.

Feels weird for it to be updated inside of here.

So I think we're just going to say at um so if next new line is null shouldn't be possible

because there should always be a new line whenever you call next new line.

Um, if that's not the case, the loop outside should have exited, which is unsafe, but fine. Um,

so this is now an offset from rest, right? Because that's what we passed in.

Um, and so we now want 64.

Oops.

Plus that length is... that is the... Yeah.

If this, then i; otherwise, this.

So this is now... I actually think maybe this just returns next new line, and it leaves it to the caller to figure out how to deal with that.

So this is now... Uh, so if... Oh boy. Uh, I think we need to keep track of... Oh, maybe not. Maybe we just say let line is map from at to new at, and then we say at is... I guess this is really new line at, right? Uh, and then we say at is new line at + one.

Uh a new line at here is what is that relative to is the next question.

Yeah. So this should really be at plus i, and this should be... which really we can just say here, that, up here. Oops.

Uh that new line at is going to be at plus this plus.

That's where the new line is at. And

then we want the line is from where we're at to where the new line it is at.

New at is where the new line is at plus one.

Yes.

Yes. And if that line is empty, which would happen if uh No, that doesn't work either because when we find the last new line in the

file, there's still stuff left, but there won't be another new line.

So, I don't think we ever hit this unless the file ends in a double new line, which it does not. Um,

so, so I do think we actually need to here detect that if rest is empty.

Yeah. Okay. So, this is just... uh, we can just make this be: while at is less than map.len.

Let's see it run.
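A sketch of the SIMD newline search with the memchr fallback, assuming nightly's portable SIMD (std::simd, whose load-or-default semantics match the docs being read above) and the memchr crate; names and structure are an approximation, not the on-stream code:

```rust
#![feature(portable_simd)]
use std::simd::cmp::SimdPartialEq;
use std::simd::Simd;

/// Returns the index of the next b'\n' in `rest`, if any.
fn find_newline(rest: &[u8]) -> Option<usize> {
    // Loads up to 64 bytes; elements past the end of `rest` are filled
    // with the default (0), which can never match b'\n'.
    let block: Simd<u8, 64> = Simd::load_or_default(rest);
    if let Some(i) = block.simd_eq(Simd::splat(b'\n')).first_set() {
        return Some(i);
    }
    // Line longer than 64 bytes: fall back to memchr on the remainder.
    memchr::memchr(b'\n', rest.get(64..)?).map(|i| 64 + i)
}
```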

Uh, I'm profiling my madness at this point. Yeah, you're not wrong.

Uh, Double new line breaks the promise of no null station name.

Uh, yeah, but there's no double new line.

23. Okay, we're back up to 23 now for some reason. Um, but okay, so it didn't get slower, but it also didn't get any faster.

Interesting.

So doing this, um, newline search with SIMD, compared to using memchr, doesn't really make a difference.

I do still feel like there should be a way to search for the new line and the semicolon at the same time.

Um, I also do wonder, like, do we ever get here? I don't think we do.

Right.

Yeah. Okay, great. We never get there.

So now we are just using the the SIMD for the lookup, but we need to have the fallback, which is fine.

But it isn't faster.

That's wild to me.

Um, all right. Well,

we SIMD. Why not? The parse temperature I'm also pretty sure we could optimize. Like, this feels real dumb, right?

Um, let's, um... SIMD newline search... for parse temperature, what could we do here? Well, we could use SIMD to search for the full stop, but that seems unnecessary.

Um, I really want to avoid the branch here and the termination condition for the loop.

So, okay, here's something stupid. We

don't need to do the subtraction.

We can just subtract after doing all the accumulation.

No real reason to accumulate... do this every time we, uh, every time we parse, I think.

But how do we get rid of the for loop? Well,

this Trying to think if there's a structure to the string here that we can make use of.

We know exactly the structure of the last three digits always.

We could swap the last two... uh, that's a good point. Only problem is, uh, we don't have mutable access to it. We could copy it out, and that would remove that branch. I

like that. Um, and I also think we could say t is and we can actually swap the can we swap the sign to the end?

No, because we're not guaranteed that it's there.

But what we can almost certainly do here is we can say, um, that this is, uh, a u8 of length five, right? It's at most five long. Right: -99.9, at most five long. Um, and then we could read...

Yeah. But if we do a copy from slice here, we kind of want it to be right-aligned, in order to know how to swap the last two.

Um, I don't think the subtracting zero cost us all that much here. Like

realistically, it's probably just going to run in parallel.

Uh, so there is something interesting here where, um... I have to look at the... excuse me for a second. I need to look at the ASCII tables. Uh, man ascii. Um, where is minus?

Minus is before all the numbers.

What about period? Period is after.

Okay. So here's something interesting you could do.

There's um actually Do numbers have copy sign? Oh, it doesn't. That's a shame. So

sign? Oh, it doesn't. That's a shame. So

on on um on uh F64 you have this thing called copy sign that let you copy over the sign of another thing which is branchless. Um

branchless. Um which we don't have here.

Um, but so, if we copy from slice... source... no, uh, temperature.len.

Then now we can do t, uh, of t.len minus one... uh, minus 2, is equal to t of t.len minus one.

So now we're removing uh No, we actually need to swap them, right?

Which is not going to let me, because I can't have two mutable borrows at the same time. Uh, but there is a get_disjoint_mut, where I can do t.len minus 1 and t.len minus 2, and then I can swap t1 and t2... put the dot at the end.

Uh, unwrap_unchecked. These are for sure non-overlapping.

Right. And this is t1 and t2. And now I can swap them. Uh, t is t.len.

Now we swapped them. Now we're

eliminating that branch. And then for the minus, how do we do the minus? What I was thinking was to do um

t 0 minus... minus this. So this will be zero if it's dash, and it will be... no, it'll be, um...

what am I saying... This I want to be an i8, actually. Um, it's going to be unhappy with... okay.

Uh, but there's no real reason why this couldn't be.

I guess our mmap here, it can just as well be i8, right?

There's nothing that says this can't be an i8 and an i... Oh, but this is going to be a u8, though.

But, like, a character is really kind of an i8. All right. Fine, fine, fine, fine, fine. Um,

what I'm trying to get at here is, uh, I guess I can just do as i8, um, minus this.

So now that will be zero and any other number will not be zero.

Sorry, any, uh, any other value has to be a number, and all numbers are up here, right? They're higher than this. So we either get zero, or we get one of these numbers.

And then I want to mask it, actually. Ooh. Ooh. Uh, let's make that be dot instead.

So now this will be minus one or some positive number.

And then I want to mask it with I want to mask out just the sign.

Is there a way for me to get uh what's the sign bit of an I8?

Right.

I want to mask out just the sign bit.

Oh, returns a number representing the sign of self.

Perfect. Yes, that is what I want.

Uh, so this will now be minus one for minus, and plus one for anything else.

So this is now sign and then now we can do this and this should now iterate

from zero.

Can we make it not have to branch?

Um, we can get loop unrolling here by... Yeah, we could skip sign. But if we skip sign, we're introducing a branch in the iteration. Trying to not do that.

How can we do that?

Um, let's look at the bit patterns here. So, what even are numbers?

three.

Okay, so that means it's one one zero.

I'm just trying to see if we can do a bit mask here to get that out of the way.

What I was thinking is is there a way we can unconditionally walk all of the numbers and just add them together?

And I think there might be.

So my thinking was: what if we just also add in the dot, and then just subtract it afterwards?

But this doesn't really help, because we still then need to check the length of the array. No, it does help, because then we can always just do it for the first three. And if we happen to include the dot, then we don't care. Ah, but we need to check whether we need to remove the dot or not.

Uh which we can do by uh are you sure it's a branch and not a pointer increment?

Well, but it would be a conditional pointer increment, right? Oh, I guess you can cast the bool to a one and use that to

decide to skip.

Yeah, it comes down to whether, um... it comes down to whether, if we do temperature... oops... what... temperature dot iter.

If, uh, you know, if t 0 is equal to b dash, then one, else zero. Right? I'm not worried about branching internally in skip; I'm worried about this specifically being a branch. Right.

The sign is an integer. Yes. Uh, but sign will be one or minus one, is the problem here. Yeah, this should be a conditional move.

That I agree it should be. I think it's interesting whether it would be Oh, s uh 0 minus sign is interesting.

Uh that doesn't help us because that just flips the sign.

Yeah, I agree. This this should maybe then be fine. Okay. So then we'll do this.

But now we still have an iterator that needs to check for termination.

What I was hoping for is for us to do a... Let me see if I can express what I'm after.

Uh, I want Here's what I want.

I want T skip plus zero.

times.

No.

If... if t... if tn... no... is tn minus skip equal to... Oh, this is hurting my brain far too much. Uh, is equal to four,

uh, 100, else...

No, it's actually a thousand.

Otherwise, 100.

I think you see where I'm going.

Um, and then I want t1 is mul times t of skip + 0, um, minus b'0'. And then mul / 10... skip + one... mul / 10...

And now this t3 needs to be conditional, because it could be that there are only two actual digits, in which case this is a dot.

So I think this needs to be: if, uh, if tn minus skip is equal to... three, then... uh, is equal to four, then one, else zero, times that. t is t1 plus t2 plus t3.

Right? Do you see what I'm after?

What's mul here? A mul here's... No, I need this to be an i16 from. This needs to be an i16 from.

And this needs to be an i16 from.

And then, uh, this stupidly needs to be an i16. And then this will then be s times that, right? And now here's what I think we should do: inline(never) this, cargo b.

Where did this... What is this EA thing? Go away. I don't want you. Uh, and now I want cargo asm... um, no.

Uh bin parse temperature.

All right. What do we got here?

Okay.

Okay.

I already see a conditional jump.

Why?

Why is there a conditional jump here?

What did I... How did I already mess this up? Oh, and I want, um, native.

Not that it shouldn't really matter here but huh?

There would be a single branch because of the minus sign.

No, I don't think that's true.

Not if this can be, um, a conditional move. But this is our, um... this is, uh... this copy, which feels like it can just be... Oh, this is definitely wrong. That

should be temperature. Um,

but this feels like it could be a copy, from temperature as pointer, to t as mut pointer, and it should just copy tn, and that should now be fine. Question mark.

Shouldn't be anything conditional in there. Maybe... great.

Okay.

Uh, well, we do a call to memcpy, which, like, okay, feels maybe excessive here, but all right. Um,

actually, do we even need the copy still?

We just did that to do the swap of the dot, but that may not be needed anymore, actually. Right? Because I think now we can, instead of doing that,

we can just do, uh, tn is still t.len.

Um, but this should now be uh No, because the dot.

Yeah, this should just not be skip. This should just be tn minus one, always.

But this one could be wrong because this could now need to skip the dot.

Ah, no. The first one is always easy, because it's the one after the sign. The last one is always easy, because it's always at the end. And so there's no conditional on it. It's only this one that is tricky. This is always just this. But t2 may not be there.

So T2 if it's there is going to be uh three from the end.

But we might have to null it out, because, um... yeah, we might need to null it out, because this ends up being a dot instead, or this ends up being the same digit as t1, right? This could end up being t equal to t1, in which case we null it out... the hundreds should be... but something's still wrong here.

This should just be this.

Um, and this divide by 10 here is also a problem.

Yeah, something's still not right.

This is always times 10. This is never times anything.

And the question here is: what is this multiplied by? And this is multiplied by 100 or by 10, and if that's 10, then this will be zero.

So, because this is the same check... so if mul is 10, then this should just go away. We'll still do it, but we'll multiply it by nothing.

Yeah.

Well, this might mean that we're just doing, like, unnecessary multiplies, which might itself come back to be expensive. Uh, yeah, this probably should have tests. I agree. Um, okay, what do we got here?

Uh, parse temperature... sign.

I mean, there's a jump right there on the sign.

See, this is the thing. I don't think this needs to branch. I think there's a way to just, uh... instead of using signum here, to just say, um... like, shouldn't it be possible here to... what happens here if we AND with this, right? Then we should only be keeping the sign bit, which means all the positive numbers become zero, and all the negative numbers, which we know is only dash, become minus one.

But we can't use that to multiply, because we need it to be minus one and one.

Um, I agree... Okay, I think we need some tests for this one.

Um, test, uh, parse temper... pt, I don't want to write anymore. Uh, assert_eq... um, so if I pass in, uh, 0.0, I expect it to become zero.

Uh, if I pass in 9.2, I expect it to be 92. If I pass in -9.2, I expect it to be -92. If I input 98.2, it should be 982. Uh, -98.2 should be -982. And I think that covers it.

Huh?

Can't compare.

Oh, parse_temperature.

Thanks.

Yeah, yeah, yeah. It's getting late. Uh,

okay. It's unhappy at 231 because this returns 736.

Okay.

uh sign.

Uh that sign is set at the eight.

Um oh, I'm using Okay. Yeah, this is incorrect.

One, two, three. Oh... one, two, three, four. One, two, three, four. I need to be less stupid about this.

What?

Yeah, that's kind of what I'm after.

That's literally what I'm asking for.

I mean like Okay, no.

Wait, am I being real stupid? Like... minus one is all ones.

Ah right. It counts. Yeah. Okay.

Okay.

Tease. Someone checked this.

Oh, not equal is nice cuz that becomes binary. That becomes a bool straight away.

Times 2 -1 So this becomes a one if they're not equal*

2 still a one minus one is zero if they're not equal which means it's

positive. message.

positive. message.

No, the problem is it becomes zero and I would need it to become Yeah, sorry. This has to be dash

Yeah, sorry. This has to be dash if it's not equal to dash.

If it's not equal to dash, that means it's a number. So, we want it to be a positive sign. We want it to be positive

positive sign. We want it to be positive one because we're going to multiply. So

this is going to be true. We're going to multiply by two. So that's going to be two minus one is one. Okay, great. My

brain is just broken. Um if it is equal to dash then it becomes zero because this is not equal. Time two is still zero. Minus one is minus one. So

still zero. Minus one is minus one. So

sign becomes minus one. Great. Yes.

Okay, cool. And that's why it passes all the tests. And now, if we run this, what do we get? What do we have in here? So, it still claims this is a test — but where does it jump to? It jumps to LBB104.

A panicking bounds check. Oh — what? Why does it... pretty sure we already... all right. Uh — I can assert here that the t length is greater than three. How do we feel about that? That seems much more promising. Yeah — and now the bounds checks went away. Uh, greater than three... no: greater than or equal to three. Yep. So now it's just that assert, which will never fire. Um — great.

So we have a conditional move, xor, conditional move, conditional move, compare with sete, xor, conditional move, compare. Yeah — take that. No branches.
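Putting those pieces together, here is a minimal sketch of the branchless parser being described — assuming the fixed `[-]d.d` / `[-]dd.d` input format; names and exact shape are illustrative, not the stream's code:

```rust
/// Minimal sketch of the branchless fixed-format temperature parser.
fn parse_temperature(t: &[u8]) -> i16 {
    // One up-front assert hands LLVM the length fact it needs to drop the
    // panicking bounds checks below; it never fires on well-formed input.
    assert!(t.len() >= 3);
    // (b != b'-') as i16 * 2 - 1: +1 for any digit, -1 for the dash.
    let sign = (t[0] != b'-') as i16 * 2 - 1;
    let d = &t[(t[0] == b'-') as usize..]; // digits and dot: "d.d" or "dd.d"
    let tenths = (d[d.len() - 1] - b'0') as i16; // last byte: always tenths
    let ones = (d[d.len() - 3] - b'0') as i16; // always right before the dot
    // d[0] is the tens digit only when there are two integer digits; the
    // flag "nulls it out" otherwise, so no branch is needed.
    let tens = (d[0] - b'0') as i16 * (d.len() == 4) as i16;
    sign * (tens * 100 + ones * 10 + tenths)
}

#[test]
fn pt() {
    assert_eq!(parse_temperature(b"0.0"), 0);
    assert_eq!(parse_temperature(b"9.2"), 92);
    assert_eq!(parse_temperature(b"-9.2"), -92);
    assert_eq!(parse_temperature(b"98.2"), 982);
    assert_eq!(parse_temperature(b"-98.2"), -982);
}
```

The `(d.len() == 4)` flag is the "null it out" step from above: when there is no tens digit, the multiply-by-zero erases whatever byte happens to sit at `d[0]`.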

Watch this: makes no difference. It could genuinely make zero difference, but we are calling it on every line. I guess compared to the hashing it's, like... yeah, I know we could use assert_unchecked, but it wouldn't make any difference, because all our time is in hashing. But I am still happy. Yeah — because parse_temperature was like 1% of the execution time. So if it had mattered, I would have been surprised; but it does make me happier that it's branchless. Um — yeah. Like, okay: 21.59.

"Is the hashing maybe doing overflow checking as well?" Um — is the hashing doing overflow checking? I mean, it could be. It could be. Oh — that's interesting. This looks much tidier now. Unclear why, but okay: we're spending 25% of our time in hashing, and 50% of our time in hashmap searching. I'm surprised to see the probe_seq here. That feels weird. Okay — we're ignoring hashing for right now. Fine.

Um — why is next_new_line... huh? I'm surprised this even shows up. But it is funny, though: nothing else shows up now. It's all hashmap and next_new_line. But I suspect the inlining is messing a little bit with this. Um — actually, you know what would be really interesting here, with branchless parse_temp, is to run this under — pull out Cachegrind — to see what it gives us. Oh — thanks, Cachegrind. Or is it Valgrind? I think it's Valgrind. But — yeah, seems like it doesn't like us using fancy instructions.

Yeah. Yeah. Oh — we're still not actually using the new parse_temperature. Good call. I don't think it's going to make a difference either, but — I mean, it is getting faster. And I'm still happy about that being branchless.

I do wonder — if I look at this, does it do a load, or a load-select... a mask up to — I see. So it does a mask up to the slice length. So it's only selecting the bytes that are within the length we're looking at, which will, most of the time, be all of them. Uh — because... what does mask-up-to do if that length is very long? Huh. Because most of the time that select is just going to be with a mask of all ones; it'll only be at the very end that it's not. I wonder — is there just a load?

So: const ALL — it's going to be of some type; I don't know yet. u8x64. Uh, no. I need to — what? What do they use to construct this slice mask? Splat — a splat of true. That's not const. What? Why? Why is this not const? All right, fine. Fine, if you're going to be like that. Uh oh. No — that still only takes const functions. Uh, fine. Fine. I just want to see if it makes a difference, is all.

Uh, right: thread_local, static... I don't — that's fine. Uh — I don't care. Once... fine, fine, fine, fine: it'll be a OnceLock. I just want to see if it works. Uh — yeah. And then this select — slice with mask. What? What? What else do you want? Or Default? Default seems fine. Uh — unsafe? Yes. Yes. Yes. It's unsafe. I understand. I understand — it's my fault if it breaks. Um — what? Sorry, I mean OnceLock; this is an indication that I'm getting tired. Uh — yeah. Yeah, that's fine too.

Is it Copy? u8x64... 64. Go away. Go away. All I want to see here is if rest.len() is greater than or equal to 64. Otherwise this — but actually, otherwise this... all right.

"This next_new_line suffering from unaligned reads?" Could very well be — certainly when we're trying to use SIMD here. In fact, there's a really interesting question, right, of: what if we aligned the pointer to the previous alignment boundary and then searched for the newline? We might have to go pretty far back — so far back that the u64 doesn't get us there, and then we'd need two SIMD loads. So maybe it's not worth it. Um — I couldn't tell you if that made a difference.

Right. Like — okay, it's: cargo build, time this, and take it to before.txt. It didn't write any of that to the file, before.txt. My computer is going, but not very much, because it's just a single core. Okay. And then git stash pop, cargo b. So, what we're trying to beat is 22.14. I mean, you tell me, right? 21.56, 22.14... 15.77, 15.88. These seem like not really any difference. 15.91.

The number is lower some of the time, but my guess is also that creating this is, like, not really something that matters, right? Like — did we actually see it show up? I guess we did. No, but like — okay, it's the load-select pointer that's showing up, right? And if I now do this, and then perf record, then run that for a bit. Yeah — I mean, I agree: less code is better if they're the same speed. It is true that creating that mask feels like it shouldn't matter. Whoa. Why does it feel like it finished faster when I was running another perf? That seems just incorrect.

Uh — what happened? Where did all my debug symbols go? What? Okay. What's happening? Why? I don't understand. What? But I didn't change anything. Maybe it's because we set it to be inlined. "Perf is tired." Yeah, I know, right? I mean — okay. Did I... are these different? No, those are the same. Okay, great. I'm, like, scared to open this now. Where did all the debug symbols go? I guess now we don't even see the newline search stick out anymore, because it's inlined.

My hope here was that we could avoid the — or, actually, because this shouldn't even need to be this; this should be, like, a 64. Is there really not just a plain load, like, without the select? Like — can I just unconditionally... a u8x64 — can I just unconditionally create one? There's from_array, which uses self's load. Yes — so there is a load. I'm being lied to. I want load. I don't understand. From array... from_array. There's a const from_array, and it calls load — which is not pub. Okay. I understand.

Okay. So then, what I should be able to do is: if let — rest as a &[u8; 64] — is equal to <[u8; 64]>::try_from(rest), then u8x64::from_array — just from... no, not from_slice — of rest. Okay: now I don't need the ALL mask anymore. And I don't think I should hit this until the end of the run. So, just to see that — okay: false. Why? Uh — it panics. Why can't I create that from a slice?

Uh — all right. This spammer has to get out of here. Report spam. Goodbye. Um — try... yeah, I think you're right: the TryFrom requires that it be the exact same size, but surely there's a... an array... What happens inside of this? Well, that's a way to do it. So we could just cast it to an array, I guess. I just feel like that shouldn't be necessary.

Like — okay, we can do: if rest.len() is greater than or equal to 64, then... rest as pointer... is there a cast-to-array? Yeah, there is. Pointer, cast, array... cast pointer... pointer cast array — isn't that what it's saying? Right. I don't need a bunch of these anymore. I don't need this. Don't need this. And don't need this. Uh — cast away. Oh: split_at... first_chunk. Some is rest. Uh — split_first_chunk. Yeah. Okay, fine. That's better. I agree.
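For reference, a sketch of roughly where this lands, using nightly's portable SIMD — the function name and the scalar tail are illustrative:

```rust
#![feature(portable_simd)] // nightly-only portable SIMD, as used on stream
use std::simd::{cmp::SimdPartialEq, u8x64};

/// While at least 64 bytes remain, `split_first_chunk` hands back a
/// `&[u8; 64]`, so the vector can be built with `from_array` and no
/// masked load-select is needed; the sub-64-byte tail falls back to a
/// plain scan.
fn find_newline(mut rest: &[u8]) -> Option<usize> {
    let mut offset = 0;
    while let Some((chunk, tail)) = rest.split_first_chunk::<64>() {
        let eq = u8x64::from_array(*chunk).simd_eq(u8x64::splat(b'\n'));
        if let Some(i) = eq.first_set() {
            return Some(offset + i);
        }
        offset += 64;
        rest = tail;
    }
    rest.iter().position(|&b| b == b'\n').map(|i| offset + i)
}
```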

This too. I'd be surprised if it was faster. Whoa — I think it just went. That's what I think. Yeah, I'm pretty sure it just said brrr. Let's see if it says brrr again. Watch this now... like, 25. Okay, now it's 21. But it burned for a second, for sure. Now it's way slower. Okay — I think it's just confused. Um, but we can go ahead and do — where's my assembly thingamajig here? Instead of parse_temperature, we could do next_new_line; see what we get out of that.

Let's... inline, please. Now, can you give it to me? Oh — never, sorry. That's what I meant. All right — what do we got in here? Oh, there's a lot of panic things we can get rid of here. Um — yeah. Okay. Um — this can go away, that's for sure. Don't need that here. Thank you. Nope — that's not what I was trying to do; I was trying to do this. Great. And then 10, 1 — what does 10, 1 do? What... one? Oh, I see. Okay: this is, like, whether it does SIMD or not. It seems fine, because this should be predictable. This is just whether we've gone halfway through what's left. Interesting.

I wonder. Yeah — okay. So this does a — this feels like we could optimize, right? Like, I agree this is nicer, but at the same time I think I want this to just be... I guess I'm just hoisting the bounds check, really, by doing this, right? But the difference is — so, I'm just trying to figure out — this is looking at mid less than length, which the branch predictor should be very good at, because we're basically always taking this branch. Okay, so this branch is probably fine.

Um — that's funny: this basically becomes a Norwegian word, "kortis", which means shortest. That's not what the instruction does, but, you know. Uh — if let Some is first_set: that's this one, which is a likely branch. So that's fine; the branch predictor should be good at that one too. And then there are the else branches, which is, like, the memchr and stuff. What jumps to 9 and 10? Oh — that's this, and the slice fail up here. So this can also be a get_unchecked, because there's no way it would otherwise fail. That looks a whole lot nicer. Okay.

So, this is now right. And we can go back to letting this be inlined. This is now — all right: I think it's time for multi-threading, is what I think. The "after", "before", cachegrind, heap, inline, perf.data, and vgcore files — let's get rid of these. Okay, let's now first and foremost do cargo b --release, and then we're going to run it under time, with output to /dev/null — we're just going to do the whole thing. I just wanted to gather a couple of things.

I think we're looking... yeah. Okay: 18. So, sometimes we do really well here with the current version. Why this is sometimes 18 is still very confusing to me. 20, 20 — so, variable even with a single core, which is still surprising. Okay. All right — it's just getting slower if I keep running it, so I'm not going to run it anymore.

"Linus: mount a RAM disk." I just want to see if that actually makes a difference here. What's the... yeah. Okay. Actually, how big is this, actually? Okay. How's my RAM? I mean, it's a little painful, but I guess: 15G. mkdir a temp ramdisk, mount a tmpfs there — yeah, that's fine. Copy measurements to temp... no. I'm just going to rm measurements and ln -s the temp ramdisk measurements here, just to see. Oh — there we go. Now we're using memory.

So it's down from 18 to 15. Yeah — doesn't make a huge difference, which I'm not terribly surprised by; I have a pretty decent disk. Um, so I'm going to move the temp ramdisk measurements back to here. No — I ran it twice. System time is lower, that's true. In fact, it's mostly the system time that's gone away. sudo umount the temp ramdisk, remove the temp ramdisk dir. Cool.

Multi-threading. Okay. Still has to be in one file, huh? All right, fine. Um — well, we actually kind of already wrote the multi-threading; we just skipped past it. The trick here is going to be that this is basically single-threaded. So we're going to go here, and then we're going to say: one — which takes a map of u8s and returns one of these guys. Uh — and it's going to return stats here. And we're going to go back up here, just to show: if I run one of map here — stats equal to this — then everything should be just the same. All of this is premature optimization.

Yeah, I'm not even going to bother waiting for it. Okay, so now we have all the stuff that needs to happen in one thread, in one thread. Now the question becomes: how do we start multiple threads? Well — what we want to do is figure out how many ways we're going to shard the computation: how many chunks do we want? And realistically here, the goal is probably something like the number of cores on the machine — which there is a way to get, which is thread... they added this, right? Like, CPU count? They definitely added... nope. I can't even do this. Didn't they add — is it threads? Oh — they called it available_parallelism. That's the one I'm after. Yeah — okay, great. So we're going to say: let threads = std::thread::available_parallelism(). Now, this "returns an estimate of the default amount of parallelism a program should use." Perfect. unwrap.

So this is the number of threads we want. And now we want to split the map into that many parts. Okay, easy enough. Um — we take our map, and we say: let sub_maps = 0..n_threads... How do I want to do this? I think the way I want to do this is slightly different, which is: let mut at = 0. Then I want to do: let mut sub_maps = a Vec — we can optimize this slightly later, but it might not be necessary. Let chunk_size = map.len() / n_threads. And then I want to do: for _ in 0..n_threads. And for each one, I'm going to... I might not even need the sub_maps; I can just spawn the threads eagerly. Um — so, this thread is going to start at `at`, and it's going to end at `at` plus chunk_size, with a min with map.len().

Um — but we have to adjust this slightly, because we want every thread to end at a newline boundary. Luckily, we have a function for that: next_new_line. So, if I give this a map that starts at end... starts at end, and is currently at zero... then: please, oh please, where is the next newline? And the next newline is what is actually going to be our end — plus one. Is that broken at the end of the file? Might be broken at the end of the file. Let's find out. Um — and so the actual map that this thread is going to be operating on is the map from that start to end. And now `at` is going to be end. Yeah — `at` is going to be equal to end; that's where the next thread is going to start.

Um, now then we'll do std::thread... So this is actually just going to be: let mut threads = a Vec. Uh — I think I want to do this slightly differently, which is: because the threads might finish slightly out of step with each other, and I want to start incorporating the results when the first one finishes, we're going to do an mpsc channel. Uh — it's going to be a sync_channel, with a bound of n_threads. tx, rx. NonZero — that's fine, thank you. Um — that's also fine, thank you.

Uh — "consider borrowing." I will consider it, yes. Um, now we're going to do a thread::spawn, with a move, in here. Uh — and inside of here, we're now going to do... actually, I think I want to do all of this inside a thread scope, because this gives me back a — yeah. And then in this scope is where I want to spawn threads. The reason I want to do this is because this map — we've tied its lifetime to the file, which isn't technically correct. Technically it's 'static, but this feels more appropriate. Um — so we will do this, and in here we will do scope.spawn. Inside of here we will run one of map, and we will do tx.send of one of map; tx is tx.clone(). And it's unhappy about me sending my map values. What? Ah — that's fine: unsafe impl Send for StrVec. There's nothing scary about StrVec as far as threading is concerned. Uh — so that's going to send the result of running one over the map. We're going to do that for all of the threads. Excellent.

Uh — now, what's the rule here for scope? Uh — let mut... that's going to go out here. Mut stats — everything is called stats now — ::new(). And here, we're now going to do: for one_stat in rx. We're going to drop our last tx here, so that this gets closed correctly. Um — all right. All right. All right. Chat is upset with me. Safety: "this is fine." No, I'm joking — it's effectively just a vector, which is fine to send across thread boundaries.

Um — and in here, what we're going to do is this, basically. So we're going to say: for... ah, how do I want to do that? I think I might actually allocate strings here. For (k, v) in one_stat — because the one_stat here is going to be dropped at the end of this loop, which means the keys, which we were previously just re-borrowing, we're now going to want to own. But I think that's fine; I don't think this is where our time is going to go. For each key-value in there: stats.entry... This now is going to be unsafe. Um — this will now become String::from that. So, entry of this — no: entry of k; we can just keep the ownership of k here. Uh — and then match on that, and this we can do: None — insert v. And this is the tricky one, where here we actually need to do Some... oh no, this actually does need to turn it into a String, because in this BTreeMap we are supposed to — I don't think it matters — in this BTreeMap we are supposed to keep things in Unicode sorted order, so we do actually need this to be like so. Um — this goes like this... this becomes to_vec, I guess; this could implement Into<Vec<_>>, actually, if we really wanted to make it do that, but I don't think I care — because, technically, it might already have an allocation we could reuse, but I don't care for now. Um — so, with the Some case here, we're going to have to say:

sum.get_mut... stat.0 is stat.0.min — right; I should really make these actual struct fields, I know — min of v.0. .1 plus-equals .1, .2 plus-equals .2, and .3. I'm sorry, chat. I know. Maybe we'll make it a type one day. Uh — great.

This is what the readme promised. And I think I can just do into_mut here, actually. Um — so we're just accumulating the things we get out of every thread into one map in the main thread. Uh — this is going to move the tx, which is fine. And we should be good to go. See? Multi-threading: easy.

Oh — well, it crashed really fast. Uh: "slice index starts at 12 but ends at 7." Great. Yeah, I figured this was probably broken. Um — okay. Let's do slight debugging here. Um — so, this is supposed to be start to end. Well... that sure looks wrong. Uh, okay: what is our chunk size? What did I... I did something wrong. Maybe that is our chunk size. It's not a completely unreasonable chunk size, right? Maybe I should ask what the number of threads is here. 32. Seems fine to me.

Okay. So, we find the next newline starting at end. So where is that newline? And — yeah, here I could probably GDB it pretty easily. But yes, that looks incorrect. That is incorrect, because this has to be end plus newline-at. Thank you very much.

Uh — "address boundary error." That's different. Um — these also don't overlap correctly. So, something's still wrong. Um — what do we have up here? 37 to 1144. Uh — yes, because we're not adjusting. There's definitely something we're not doing right here. Um — oh: I'm printing before I'm updating. You're right. So — here. Um, and then what I actually want to do is also print out — just to see that this isn't completely silly — a map from start to start + 10, and I also want a map from end minus 10 to end. Yeah, yeah, yeah: str::from_utf8. I know. I know. I know. unwrap.

Uh — well, that looks correct. All of these look like the start of a place: they all have an initial capital letter, and the ends all end with a newline. So that also looks right. So that means this starts at the boundary of a station name and ends including a newline — which is what we wanted. And then `at` gets set to end. So that all seems reasonable, is the thing I'm worried about.

Well — we executed very quickly. We just also hit a segfault. Okay, fine. Uh — let's do... nope. Well, that's fun: in memchr. In next_new_line, at 236. Okay — from 147. Oh, okay. Yeah — so we just failed to find... yeah. Okay: this is the thing I thought was going to cause us problems, right? So, the problem here is that we set `at` to the end of the previous chunk. But if that end is — if that end is the end of the file... That shouldn't matter; we should stop iterating early enough that that doesn't come up. No — that's not what I wanted.

Uh — okay. I want my debugging info back. Uh — yeah. And then here, I need the total map length. So, the total map length here is 13-something — ends with 0679. Yeah — so we're on the last iteration; we're about to give out the last chunk. And this end ends up being equal to the map length. And then we search for the next newline starting there.

Uh — but next_new_line is written so that it can't handle hitting the end of the file. We would get this: it would be empty; we would take this branch — I don't know why my syntax highlighting is now messed up — so this wouldn't find a newline. We'd go into here, and this says: we know there must be a newline, and so therefore this must be non-empty. So we skip 64, and that doesn't work — because this is intended to be called only when you know that there is a next newline. Which normally we know, but here we don't. So: end is, if end is equal to map.len(), then map.len(), else this. And that way we're branching outside of this function. This function is called on the critical path; this one is not. So we move the branch outside of there instead. Um — and then I think this should be good.
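A condensed sketch of the whole scheme as described — `one`, `next_newline`, and `Stat` are stand-ins for the stream's versions, and the last chunk is handled by giving the final thread everything that remains rather than by the exact end-of-map comparison used on stream:

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

#[derive(Clone, Copy)]
struct Stat { min: i16, max: i16, sum: i64, count: u32 }

// Stand-ins for the stream's single-threaded worker and newline search.
fn one(_chunk: &[u8]) -> BTreeMap<String, Stat> { BTreeMap::new() }
fn next_newline(rest: &[u8]) -> usize {
    rest.iter().position(|&b| b == b'\n').unwrap()
}

fn run(map: &[u8]) -> BTreeMap<String, Stat> {
    let n_threads = thread::available_parallelism().unwrap().get();
    let chunk_size = map.len() / n_threads;
    let (tx, rx) = mpsc::sync_channel(n_threads);
    let mut merged: BTreeMap<String, Stat> = BTreeMap::new();
    thread::scope(|s| {
        let mut at = 0;
        for i in 0..n_threads {
            let end = if i + 1 == n_threads {
                map.len() // last thread takes whatever remains
            } else {
                // snap the cut forward past the next newline so the
                // neighbor starts on a fresh line
                let cut = at + chunk_size;
                cut + next_newline(&map[cut..]) + 1
            };
            let chunk = &map[at..end];
            at = end;
            let tx = tx.clone();
            s.spawn(move || tx.send(one(chunk)).unwrap());
        }
        drop(tx); // close the channel so the receive loop terminates
        // merge each per-thread map as soon as that thread finishes
        for stats in rx {
            for (k, v) in stats {
                merged
                    .entry(k)
                    .and_modify(|e| {
                        e.min = e.min.min(v.min);
                        e.max = e.max.max(v.max);
                        e.sum += v.sum;
                        e.count += v.count;
                    })
                    .or_insert(v);
            }
        }
    });
    merged
}
```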

Okay. Away with the prints. All right: 1.08. 1.07. So that's pretty good, I suppose. Uh — now it's parallel. It's parallel now.

Um — okay. The Java one was 800 milliseconds, right? So now the question is: where's the time going? And — perf record. Yes. You tell me, perf: what did you see? No. No. Okay, great. Uh — okay. Okay. I see a little bit of drift here. Uh — 2% on parse_temperature. Uh — what do we need here? So, we're at 1 second. So 10% would be 100 millis. So we need to recover at least 10% to beat Java — although we do know the Java ones are cheating now. So, you know... I mean, it just comes down to the hashing again.

Oh — someone says they think we're summing the things for the averages wrong. Let's — I mean, that's an easy thing to fix, or to check. output.txt... and then let's just run one of the Java ones to get a correct.txt. That's — I tried to write "txt", I swear, chat. Uh — okay: so if I diff output.txt with — no — with correct.txt... Well, those are different. But how are they different? "No newline at end of file." Uh — how about now? Okay, they're different. How are they different? Uh — can I diff by word with regular diff? How about I just do that? Oh, yeah: 8.6, 8.5. Why does it end up being wrong?

Uh — all right. What are the rules? "Rounding of output values must be done using the semantics of IEEE 754 rounding-direction roundTowardPositive." Rounding direction: round towards positive. Round towards positive. Positive... round... f64::round. Round, round. Yeah. Okay. What's the... isn't there a round-towards-positive in Rust, too?

Uh — this is an issue in my code. When we accumulate — the accumulation down here shouldn't be the problem, because here we're just adding numbers. Um — and same thing here: we're just adding them. We're keeping the min and the max. Um — I actually wonder, for min: what does min turn into? I almost wonder if this could be, like, only-overwrite-if-smaller; I wonder if that would make a difference. But okay. Uh — but here is where we turn — we take the sum; we divide by 10 to get back our fractional... and I'm using stat instead of v for the averages. Oh. Ah — yeah, you're totally right. Good call.
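For the record, a sketch of the fixed average under one assumption about the rounding: the Java baseline's `Math.round(value * 10.0) / 10.0` rounds ties toward positive infinity, which `(x + 0.5).floor()` reproduces in Rust:

```rust
/// Sketch (assumed helper name): with sums kept in integer tenths, take the
/// mean of v's sum and count — not stat's — round half toward +infinity,
/// then scale back to degrees.
fn mean(sum: i64, count: u32) -> f64 {
    let tenths = sum as f64 / count as f64; // mean, still in tenths
    (tenths + 0.5).floor() / 10.0 // floor(x + 0.5) = round half up
}
```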

Let's try that again. And output... No problem. No problem. Good job, chat. Pat yourselves on the head. Uh — great. Oopsie. Um — you saved me. It's true. Uh — right.

But — yeah, we're still trapped in the hashmap conundrum here. Although — let's try slightly more sampling here. "How much data are you writing?" It lost 70% of its samples. Why? I guess it's recording a bunch of data, huh? Okay. Uh — this... oh, lord. Uh — "one." That is indeed the one we care about.

Yeah — like, okay: the time spent in "one" here is 99.5% of total samples. Sorry — not necessarily execution time. It is true that that means there is some amount of time in our printing, but, relatively speaking, nothing. Uh — and these are all essentially nothing.

Uh — yeah: one of the Java implementations we looked at is using perfect hashing. It assumes no collisions. Um — the other one handles collisions. Um — so that one's not cheating. Um — and we are using the same hash function they're using, but we're not using the same hashmap implementation. And so they could easily end up with all the values being in distinct buckets — whereas hashbrown is not, because it's doing other things with the hashes, right? Um — and, I mean, we are spending 40% of our time in rotate_right.

What if we — um, I didn't try inverting the bits of the hash, to make the high bits more impactful, because the rotating kind of does the same thing; the mixing here achieves a similar thing. Um — unsafe... oh, there's something else interesting. Um — what if we do this instead of a rotate? And I assume these are equivalent, right? There's... Oh — actually, my printing is really slow, now that I think about it. My printing down here is printing small bits at a time. It shouldn't be enough that it's the slow part, but — all right, let's: stdout is std::io::stdout. Uh — let stdout = stdout().

Uh — let writer = std::io::BufWriter::new(stdout). It's not println!... it's, uh — writer, mut writer, writer.write... uh — no, that's okay; ignore me. write! is what I'm after. Uh — and I want to use write... what? Oh, I also need to `use std::io::Write`, I think. Shouldn't make a difference to this, but — uh. Oops.
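What this is converging on is roughly the following sketch (entry shape and format string are illustrative):

```rust
use std::io::{self, BufWriter, Write};

/// Route the report through a BufWriter so we issue one big write instead
/// of many tiny locked writes to stdout.
fn print_report(entries: &[(String, f64, f64, f64)]) -> io::Result<()> {
    let stdout = io::stdout();
    let mut w = BufWriter::new(stdout.lock());
    for (name, min, mean, max) in entries {
        write!(w, "{name}={min:.1}/{mean:.1}/{max:.1}, ")?;
    }
    w.flush()
}
```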

But you'll see, we're still quite a lot slower now, now that this hash mixing is less good. What is the Java triple bit shift — is that rotate right? No — it just shifts, taking the sign bit along with it. So — and they were doing triple shifts here.

Uh — "performs a right funnel shift: concatenates self and the right-hand side, with self..." That could also be something useful here. But, like — what's the equivalent of the triple shift in Rust? I think it is the double shift. Like — what I'm confused by is that the hashing here — the Java hashing — hashes this way. So this is just a bit shift, but it's not a rotate. Now, we had this be a rotate — but a rotate is slower, because the bits on the right need to make it to the left. Um — but seemingly it's also making the hash function less good.

These are also u64s, so it shouldn't matter. Um — oh: for them, though, it's not. For them, this is an i64. toLong... We don't actually use this value for anything but zero, um — but I don't think that matters either. Right — so why is that sufficient for them, but not sufficient for us? And that feels to me like it's probably related to hashbrown's assumption that it gets the high bits. So, what if we took this whole thing and rotated right by 32? So that's still one fewer rotate than we used to have. 342. Okay — so that's slower.

Um — what's interesting, though, is that we have this funnel shift right. I wonder if we could use this: "performs — concatenates self and the right-hand side, with self making up the most significant half, then shifts the combined value right by n, and the least significant half is extracted to produce the result." Feels very hashy. Um — but okay: rotate right is what we were using when we got pretty good performance out of this, uh — presumably because it's mixing the bits better. Right? Like — this, I think, gave us — one point — what was it — 1.22. Oh, what? I think I want off this ride. Why? That is what we had, right? Yeah — that's what we had.

Why is this slower again now? It got slower, and I fixed the accumulator, maybe — but that seems insane. That shouldn't even matter; that's at the end, right? Like — okay: if I make this now be stat again, there's no way that makes it take two seconds less time. There's no way. Okay — good. Good. That would be wild.

I do, however, want to do this, I think — where this is going to be "print", and I want it as inline(never), just so it shows up in our profiles. Moving the file into RAM doesn't make the difference here. So — but, like, I don't think the printing is the thing. Uh — what's this? This 2% here is just leftovers. Okay.

Uh — slightly less silly print. Um, okay. Just — I want to do this sanity check with AHash again. So, if we now go back to here: we say AHash of this, and we say AHash of this, and we AHash this, with a capacity of a thousand... What? Okay — don't know why that was needed. Um — it could be that my computer is just getting sad.

But look: when we switch to AHash, we do see that these are taking up more time now, and hashing is taking up less. They're using, like, AES hashing and such. So there's definitely performance on the table there.

Actually, that makes me wonder if — go away again — if we go back to our old implementation here... and, I mean, I can do that with the copy instead, right? So: remainder.as_ptr()... last as mut pointer... and remainder.len(). Yeah — the sys time is very high. 341, 344. So that's about the same we got with AHash, actually — and the same we got previously. So it makes no meaningful difference.

...with our hasher — with our SIMD hasher. How does this break down now? Yeah — I mean, it's definitely worse. It's definitely worse because of the hash function, because with AHash these were almost up to 10%, right?

I'm curious. Okay, let's go ahead and do — nope. Uh — let's... okay, go away. Um — I want to just get this... I just want to see: if we now do, here — don't care about that — FxHasher implements Hash. We know that this is the one that gets called. Uh — okay: the hasher is going to hold a usize, in their setup. write is just a write of self.hash and the bytes. Okay — where's the write function?

This guy — which just calls write for u64s. So we want write_u64. You want this thing... uh — this can go away. So this now becomes — and finish, this goes away again — self.0 is write_u64 of self.0 and the value. Right? That's all it had, right? Yeah — and the bytes. And then, what's their finish? Their finish is just self.0 as u64. And then they use hash_word, and hash_word is just — they're just hashing a word here: u64s. Uh — so this just takes a u64... a wrapping_mul. And what's the thing they use for the 64-bit case? SEED64. Okay.

Uh — okay. What do they use for this reading? Oh — they use the byteorder crate. Oh, man, that's pretty annoying to bring in here. Um — but this is just from... I think this is actually pretty straightforward: it just reads a certain number of bytes at a time. If you look at the lib — um, native-endian; I don't want the tests, I want the implementation — which is up here, and I want its read_u64. Yeah — which is just this: from native-endian bytes, to this, to this. Pretty sure that's all it turns into.

Um — and then this is, like, an impl on u64, which we can pretty easily just say — we're just going to do this, and it's mut self.0... Nope: it is new hash — mut hash. Now: u32::from_ne_bytes of this. Okay. And this is the same, except it uses 16. Uh — this should also be four; this should be two. This is u16, and this is hash_word(&mut hash). Okay — so now I messed up somewhere. Where did I mess up? "Expected u64, found usize." Uh — because I'm not... uh, that's fine: I'm just going to make this hasher be u64. Uh — it's because I wasn't using target pointer width here, um — and so Rust isn't willing to assume that usize is u64. Now, what do we get?

Now what do we get?

Go worse.

We were at 344 and now it's worse.

Why did it get worse?

The user time improved. Yeah.

The broken stack traces are also surprising. We are now spending much

surprising. We are now spending much less time in hashing. Actually, it's

true.

Or rather, we're spending more time in these, which is promising.

interesting.

Um — okay. Let's just — just to see. Okay, so now it's around 1.20. Yeah — it's just because it doesn't have to load it into memory; the actual performance of the system hasn't changed, right? Um — but it should make it easier for us to compare, if I stash this. Actually, here's what I'm going to do. Uh — home, dev, stream... no. Uh — copy. What I want to do is copy target/release to — selfhash. Yeah, I'm about to use hyperfine right now. Uh — stash pop. Uh — I'll build this. I want to copy that to fxhash. Uh — and then I want to do hyperfine of selfhash and fx. What? Oh — dot-slash. Yeah. So — our hash is faster than FxHash. I mean, I do prefer our hashing function. Okay. Okay.

Um — so, the other thing I then want to see is: what if we do this, and this — and then we call that... then we call that simdhash. I'm hoping the stream doesn't lag as a result of me loading my CPU now. All right — the selfhash it is. Uh — and the AHash... I don't like it. Um — but okay: commit the hash.

Um — what's also frustrating here is that — with this hashing function — this is the part I find hard to understand, with this hashing function: what's happening? 100% CPU somewhere, doing what? What is it doing? What is taking 100% CPU?

Okay. Uh — what is it doing? I think my CPU is tired. Um — no; it could also be that, because of the RAM disk, the computer has less space for actual file caching, including perf.data, and so it gets sad, maybe. Um — but yeah, what surprises me is that now more time is spent in hashing, but we're still faster overall. That's the part I don't understand, because the only thing we changed was the hashing. So, if the hashing takes more time, then why is the overall program faster? Remember — like, when we were running with FxHash, for example, we could see these be like 5% and stuff. Why are they not that? Ow.

Even if it was fewer collisions — like, this is the overall time spent in... oh, it's the time spent in — yeah, it's the time spent in get_mut. Although this is now only 87%, which is kind of interesting. Where are my other percent? Hm. Where are my other percent? Where are my other percent, indeed? In fact — where's my main? perf is eating my time?

Yeah. Yeah — I mean, it's been a while since we ran perf stat. We can see what it says. And we do have quite a lot of branch misses now, which is disappointing. Uh — we're also taking a whole lot of branches, but it's hard to say how many of those are through the code that runs in hashbrown. And we're also doing fewer instructions per cycle than we were in the beginning. Huh.

Yeah. Let's do — um, oh: perf, dash-event, CPU branch misses. What? Why? Why is it unhappy with me? Where's my dash-e? No — that's for perf record. Uh — yeah, it's just "branch-misses." But what is this showing me? Yeah — this is showing me the percentage of branch misses, right? It is actually pretty interesting that that really is one.

Okay — this should not be inlined. Now, what I want to do is annotate this. Um — but I want to annotate it with — nope — with showing... I want to know... isn't there — now I'm misremembering my perf keyboard shortcuts, because there's a way to just show the source and not show the assembly. Uh — but maybe that's not in this view. "l"? No — "l": only available for source code lines. Now "s" — just "s" — shows the source code, but it also shows inlined source, which I guess is fine, and shows the assembly. But I'm looking for just the top-level source of this, which I know there's a view for, but I can't find it now. Um — this is also not in... oh, because it has inlined all the stuff. Okay.

Um — what I wanted to find was... yeah. I mean, that's the — but this also doesn't make sense, because if this is branch misses, clearly 36% of branch misses are not there. Right? Like — these aren't branches. So why... uh, no — I mean, switching between local and global here is basically nothing, because all the time is spent in "one." It could be where it realizes the miss — but, like, where is the branch misprediction from, right? Because what would be really useful is source annotation at this level, and not below. But I guess all of these things get inlined, right? So, in particular, get_mut gets inlined.

Um — okay, what if we do this? What if we say: update_stats, which gets a station, which is a &[u8], and a temperature, which is an i16. Uh — and it gets a stats, which is a mutable reference to this. And then we take this stuff and we move it in here. Uh — and then we, here, do update_stats of this. And then we, here, say inline(never). And then I want to do that on this. And I want to do that on this. And I want to do that on this. Uh — and then... no, I don't even want that yet.
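In sketch form — with the value still a tuple, as it is at this point, field order assumed, and the real StrVec keys and custom hasher elided:

```rust
use std::collections::HashMap;

// (min, max, sum, count) — the value is still a tuple at this point.
type Stat = (i16, i16, i64, u32);

/// Hoist the per-line map update into a named #[inline(never)] function
/// purely so it shows up as its own symbol in perf report; the attribute
/// comes off again once profiling is done.
#[inline(never)]
fn update_stats(station: &[u8], t: i16, stats: &mut HashMap<Vec<u8>, Stat>) {
    let e = stats
        .entry(station.to_vec())
        .or_insert((i16::MAX, i16::MIN, 0, 0));
    e.0 = e.0.min(t);
    e.1 = e.1.max(t);
    e.2 += t as i64;
    e.3 += 1;
}
```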

Okay — finally, now we at least see that the self time is going somewhere. Although why 3% is going to, seemingly, itself is a little unclear, right? Right? So this 3% of the time is spent in "one," but not in any of these functions — which itself is a little odd, right? Like — where is that time going? Like — this is now cycles again. Like — why "moving out of split_semi"?

So this is the — actually, that makes some amount of sense, right? So this is: sometimes we read into the next page, and so the OS needs to feed us the next page, and then we're just stuck in the read out of one of these pointers for a while. So that's not entirely unreasonable — like, that could be what this means. Uh — why is there a checked sub here?

Well — this can certainly be get_unchecked, but I doubt that it makes a difference. Yeah — that didn't make any difference. Although it did get rid of some of the noise in between: the checked sub goes away. Um — yeah. So: call split_semi, and then the move out of here. Although it's interesting that that doesn't happen to next_new_line, because next_new_line should have the same problem, right? We look for a newline, and in the process of looking for a newline we might have to traverse a page boundary and therefore wait, right? So why is the move here the slow part?

That's interesting, huh? Okay. And then we parse temperature — so, some amount of time goes to there. That's fine. But you'll see — even if I switch this to global... oh. Ah — this is 50% of the time — this is local, period — so it's 50% of the time that is spent in "one" that's not spent in children: it's this move. I see. But if we look overall, then basically none of this is costly; overall, this is — sometimes... this is 2% of execution time, though. So: moving out of 0x58, out of RSI. Yeah — I mean, it is just moving the results back. It's just wild to me that it's 2% of execution time, because station here is a slice, and temperature is a slice. So, like — okay: this is pointer-and-length, of two things. Why is it anything?

I'm just trying to see if there's anything to optimize here, but I don't think so. The best I can think of is: okay, if they get inlined, then we don't have to carry four return values — pointer plus length, pointer plus length — out of here. Um — so that could be... inlining makes it go away. Okay. Um — great. And if we go into the nested calls it makes, we get about 2% for next_new_line and 1.44 from split_semi. Um — I mean, this is just a single... oops.

oops. Uh where does the time locally go here?

Why do we even have a branch at the start of split semi? Oh, that's this.

But that branch should basically never be taken. This is where it would be

be taken. This is where it would be really useful to see the branch mispredictions because this is a branch, but we should never take it.

Um, I suppose we could here do um stood hint uh cold path uh cold path.

just a hint to the compiler. And in

fact, what just as a sanity check, right? We're we're never taking this

right? We're we're never taking this branch.

Good. Okay. So, we should hint to the compiler that this is a cold path just so that we've done it. It's unlikely to matter.

Okay.

So, and we I guess we could do the same for next new line up here, right? Like

uh this is a cold path and this is a cold path.
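A sketch of the hint, assuming nightly's `std::hint::cold_path` and with the SIMD search body simplified to a plain scan:

```rust
#![feature(cold_path)] // nightly-only hint
use std::hint::cold_path;

/// Mark the never-taken fallback arm of the delimiter search as cold so
/// LLVM lays it out out-of-line.
fn semi_at(line: &[u8]) -> usize {
    match line.iter().position(|&b| b == b';') {
        Some(i) => i,
        None => {
            cold_path(); // every well-formed line contains a ';'
            line.len()
        }
    }
}
```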

Uh — we don't actually want to assert that the line is less than 64 bytes, because theoretically, according to the input format, it could be longer. Um — but I am curious now, on split_semi: what do we end up with? Yeah — so this is all local, period, now, and so this is the SIMD... Oh — why is there a comparison at the end of split_semi? That is because this should not be that. This should be — what if we make this semi_at, and have it return a usize, which is where the semicolon is? Uh — and then this will be "index of delimiter," and this will be — uh, that's fine. And then, up here, this will be semi_at, and then we will do: station is line.get... And we can also assert that semi is less than line.len(). Um — and then we can say: station is line up until semi, and temperature is line from semi + 1 onwards — and we know that this is true.
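A sketch of that refactor (names assumed; the real `semi_at` is the SIMD search):

```rust
/// Return the delimiter's index instead of two subslices, so the function
/// doesn't move two pointer+length pairs back out through its return.
fn semi_at(line: &[u8]) -> usize {
    // stand-in for the stream's SIMD search
    line.iter().position(|&b| b == b';').unwrap_or(line.len())
}

/// The caller slices once, under a single assert that hoists the checks.
fn split_line(line: &[u8]) -> (&[u8], &[u8]) {
    let semi = semi_at(line);
    assert!(semi < line.len()); // lets the slicing below skip its checks
    (&line[..semi], &line[semi + 1..])
}
```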

"Can we trust the perf numbers?" I mean — unclear. Uh — perf, if it's using hardware counters, is usually pretty reliable. Um — well, we have definitely done something, because now this is down from 3% to 1%. Um — in fact, semi vanished from the call.

Oh — which makes sense, because we moved it up into "one." Uh — but if we now look at "one": where do we have our semi here? This is still doing an un... yeah: our assert isn't quite doing enough here, because the index is still doing a checked sub. Um — so, that's fine; let's: get_unchecked. And this also is an unsafe line.get_unchecked.

And — just to see whether we've made a meaningful difference here: I think it was 1.44. I guess we still have the binary around, right? Yeah. So, this is now... no: hyperfine. Why's it slower? Oh — because we have inline(never). So, that's not terribly surprising.

Let's keep going through these. We go back down to "one" here; look at the annotation. Now, that's more like it. So: we call semi, and then we just load, and then this is just an unchecked sub, which is fine. That's — um, the unchecked sub here is just the slicing we're doing to the length. Um — and then this here: yeah, this is the comparison we do to check that we should continue looping, which is fine. This is a branch that should be very easy to predict.

Uh — "shouldn't we use something other than the sequential mmap now that we're multi-threaded?" I don't think so, because sequential just means that the kernel doesn't keep pages around after you've iterated over them. And that's kind of the behavior we want, because we've fully separated where the threads walk the mapped file. So it's true, for each read, that no one will revisit that read location again. Um — and so I think this is actually still correct.
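For context, a sketch of the advice call, assuming the memmap2 crate as in the earlier setup:

```rust
use memmap2::{Advice, Mmap};

/// The sequential advice stays valid after going multi-threaded: each
/// thread streams exactly once over its own disjoint range, so no page is
/// ever revisited.
fn map_file(file: &std::fs::File) -> std::io::Result<Mmap> {
    let map = unsafe { Mmap::map(file)? };
    map.advise(Advice::Sequential)?; // kernel may drop pages behind the readers
    Ok(map)
}
```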

Okay — next_new_line. What do we have in here? Next_new_line has a compare. And that's the compare with 64, which is fine, because this should be very easy to predict. Uh — but why are there two? Like — I see why there's one. Oh — the other comes from inside of there. So, this is the one that checks that the — in our next_new_line, "one" is inside of here. Yeah. Okay, that's fine: that's the one that checks that this is less than the length of rest. Okay. And then we do the SIMD. Oh — and the other is the looking for first_set, um — which can also be false if we're not in the first 64 bytes. That should also be an unlikely branch. So that, I think, is also fine. Okay — so those two branches, I think, are okay. Cool.

So, now we're spending 91% of our time in the hashmap — in the damn hashing. Uh — okay, that's interesting, though. So, if I now go back and remove — in fact, I think I specifically want these to be inlined. Uh — and this can... yes, that's fine. Uh — and I want this also to be inlined. Uh — we'll keep update_stats around, because I do technically want to look at it.

Uh — oh, is it a problem that I changed these? Uh — I think I do actually need to rerun this if I want the source annotation. Uh — "first_chunk doesn't have as complex a check compared to split_first_chunk." Oh — that's interesting. Okay, so if I go — where's this? It's, like, here. First chunk. Uh — oh, that's true. I don't know that that will matter a whole lot, but fair enough. Um — okay. Let's see what we got here.

Let's annotate update_stats, to see what we get. Uh — and, I mean, a lot of this is going to be stuff that comes from hashbrown, mostly, because we have get_mut and insert here, right? So, there's a lot going on. Okay — so this is the hashing of the key, for get_mut. Still unclear why that load is annoying at all. Uh — and this is the finish of the hash, right? So this is effectively hash_one: hash_one first calls hash and then calls finish — um, as we see those all happening here, and then the end. Okay.

Um — this, I think, is the tag compare for the buckets — the groups of buckets inside hashbrown. So this, I don't think we can do very much about. Um — and then this is our deref of — inlined — this is our deref of StrVec. It is true: we have to do this check, but that should also be easily predictable. And, in fact, let's do a cold_path in here too, because it should generally be the case that these are inlined. Okay — it's interesting, because we might end up calling as_ref multiple times on the same StrVec, because we need to use it once to do the equality and once to do the hashing.

Um — and I think we can do better than that. There's one interesting optimization, for example, which is in the implementation of PartialEq here — um, where we can actually do: self's inlined/last is equal to other's inlined/last. And the reason we can do this is because the lengths have to be the same: if the lengths aren't the same, um — then the strings can't be the same.

Well — is that true in UTF-8? Are there ever strings that compare equal in Unicode, even if their bit patterns are different? Depends on the collation, right? Yeah — because then this is not safe, if you can have Unicode strings compare equal even if they are not the same bit pattern. Oh — that's true. I mean, we are just using u8s here anyway, so that would already be a problem. Although, in this case, I don't think it's a problem, because all the keys in this data set presumably have the same encoding of the same characters. So we can be — yeah, we can be slightly assuming of normalization here. And in that case, the lengths have to be equal — because if they are the same length, then either they're both inlined or they're both not inlined. In either case, this value should be the same. Um — and then here we could do — there's, like, an optimization we could do here too, right, which is: if they are both inline — which is likely to be the case — then it's sufficient to just check that the pointer and length are the same. You don't need to do this part. But I guess doing this part is — like, it's not a lot cheaper to compare them separately, I'm pretty sure. So, I think we will just do that.
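In sketch form, under assumptions about StrVec's layout — the stream's real type differs in detail; this one uses a single tag byte:

```rust
/// Hypothetical StrVec layout, for illustration only.
struct StrVec {
    last: u8,          // length if inline, sentinel (255) if spilled
    inlined: [u8; 24], // short station names stored inline
    spilled: Vec<u8>,  // used only when the name doesn't fit
}

impl StrVec {
    fn as_bytes(&self) -> &[u8] {
        if self.last != 255 {
            &self.inlined[..self.last as usize]
        } else {
            &self.spilled
        }
    }
}

impl PartialEq for StrVec {
    fn eq(&self, other: &Self) -> bool {
        // Unequal tags can never be byte-equal, and equal lengths mean both
        // sides use the same representation. This is bytewise equality, so
        // it assumes normalization-consistent keys, as this data set is.
        self.last == other.last && self.as_bytes() == other.as_bytes()
    }
}
```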

I'm confused. Why is it upset with me? What? These aren't borrows. Do I — do I have to do this? Okay. Um — okay. Okay, so that will, at least for string equality, I think often eliminate the need to do this conditional — although this is still a conditional jump, right? Uh — "there wasn't a space after the semicolon": that wasn't me; that was — um, here, look: if I do this, rustfmt will just undo it. If I put it in there, it will fix it for me.

me. Um, what else do we have down here? Um,

yeah, and this is and then this calls slice partial eek. That's what I was thinking that we can probably here.

In fact, we could if we want to be real fancy, we can do this.

But it probably doesn't matter. Um,

okay.

And then all of this stuff basically doesn't execute, which is also a little interesting. Like — why? Oh — it's because usually you can get away with just comparing the lengths, is my guess. So you don't actually need to walk the byte string, which is probably what a lot of this code is for. And, in fact, no time is spent here, right? Uh — and then we get — in fact — okay: but where does the code go, then? It really is, like, the hashing — and the hashing is only done for the key we pass in, so it's only done once. And then — yeah — then the shift-rights, the xors: all of this is just the hashing. And that's where the time goes. It doesn't go anywhere else. Interesting.

All right. Well — if we now get rid of the inline(never) for — I do wonder whether update_stats should be inline(never); it's a little unclear to me. Um — but let's see what happens now. Well, that seems better, right? Then we can do the warm-up of three or something. "Wouldn't it suffice to be OR instead of AND when the repr is exactly the same?"

Uh — no. Because if we have a pointer to a string, then the comparison actually needs to dereference that pointer, um — because we're not checking that the pointer is the same, right? Okay. Well — it's slightly faster. Not by a whole lot, but it is faster, even though the system time is longer. So that means it's even more faster. Um — okay. Uh — so, we'll save this one as "micro."

Um — what else do we got? Uh — we had one more thing that I wanted to look at, which was: can we — okay, here's a thought: what if this was a power of two? Just throwing that out there. It probably — it probably rounds to the nearest power of two, right? There's no way it just takes the user input — because if it's a power of two, then you can, like, do a bit shift instead of a modulo, and everything is happier. Nope. Okay — then we will go with 1,024. And, like — I mean, is that also faster? I think we need to stop soon. Yeah. Okay — so that's not faster. Cool. Just as a sanity check, you know. Uh — great. So: power of two.

Um and then uh um right the U size someone pointed out that we have

well this U size doesn't matter too much. I don't think we want to get rid

much. I don't think we want to get rid of that one. Um, but the value we store in the map,

where's my value? Okay, this. Okay,

fine. We'll make it a strruct. We'll

finally make it a strruct stat. And it

has a min which is an I16, a max which is an I16, a count which is a U32.

Okay, let me do these one at a time so that we change one thing at once. Uh, and it has a len, which is... what do I currently have it be? u32. Um, and then here... oh no, I have an i64 and a usize. That seems excessive, right? We have a sum, which is an i64, and a len, which is a usize. That's what I currently have.

Stat... um, stat count. Ah, let's keep it count.

Um, and then this is min. This is sum. This is count. This is max. Max, max. Min. So verbose. Sum, count.

Uh, this can derive Debug, Clone, and Copy.

That's fine.

Uh, somewhere else here.

Stat. This should be Stat. This should be or_default. And then we need to do... where is my Stat? Up here. Uh, impl Default for Stat. So min, sum, count, max. Then down here, this is now going to be min, min, max, count, sum.

All right, it's a struct now. I hope you're all happy. This better not change execution time.
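Putting that dictation together, the struct is roughly this; the integer widths get revisited just below (sum stays i64, count ends up u32), and the Default values here are an editorial assumption (start min high and max low so the first sample overwrites both):

```rust
#[derive(Debug, Clone, Copy)]
struct Stat {
    min: i16,
    max: i16,
    sum: i64,
    count: u32,
}

impl Default for Stat {
    fn default() -> Self {
        // Assumed initial values: any first observation replaces min/max.
        Stat { min: i16::MAX, max: i16::MIN, sum: 0, count: 0 }
    }
}
```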

Don't think it did.

Uh, statistical outliers were detected. Oh no. Yeah. Okay. I don't think this is faster. It better not be faster. Yeah. Okay. Great. Uh,

in and three can go away, and self.hash can go away, and brin is now brin, which is just br.

I'm glad we're all in agreement. Um, so it's a struct now. Uh, and then some people were upset with me, because this feels like...

Okay, it's a billion rows, so I don't think it can be an i32, because it can be a hundred billion, and it needs to be signed because they can be negative. count, though, uh, can be a u32, because there are at most a billion rows, and u32 goes to over four billion.

I know it's per city, but we don't know that all the measurements aren't for one city, right?

Uh, Bur U32.

So, I want Burr and Burr U32.

Maybe also the sum. I don't think the sum can be an i32. I mean, I agree in theory it probably can, but I don't think we're guaranteed that that is the case for this data set. It's a max of 10,000 cities, but it could be one city, is the problem.
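The worst-case arithmetic, spelled out:

```rust
// All 1e9 readings could hit one station at 99.9 degrees, stored as
// 999 tenths. That sum overflows i32 by far but fits comfortably in i64.
const MAX_ABS_SUM: i64 = 1_000_000_000 * 999; // = 999_000_000_000
const _: () = assert!(MAX_ABS_SUM > i32::MAX as i64);
```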

We did save eight bytes per entry, but, like... so it's slightly faster, I guess. Smaller entries.

Running average is actually pretty hard. Uh, the way you do it is basically this: you keep track of the sum and the count, and you compute the average at the end.
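A minimal sketch of that, assuming temperatures are stored as i16 tenths of a degree as elsewhere in this stream:

```rust
// No running average needed: accumulate sum and count per station,
// divide once at the end. The /10.0 converts tenths back to degrees.
fn mean(sum: i64, count: u32) -> f64 {
    (sum as f64 / count as f64) / 10.0
}
```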

Um, now there is one other thing that I want to see. So there was a really cool article by someone who tried this in Rust, and... now, here. No, here. No, here. Yeah, Ragnar, uh, or Rangnar, probably Norwegian, um, who did this challenge in Rust, um, and they've done a bunch of things, including, um, with flame graphs.

It's a really cool article. Um, but one of the things... I was just, like, skimming through, and I noticed that there is, uh, a "purging all branches, faster perfect hashing" bit. I mean, these numbers aren't comparable, right?

Because we don't know how many cores they have and what CPU they have and everything. Um... oh yeah, this is the... you don't actually need to do the minus b'0', because you can just do it at the end, which is fun. Um...
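Sketched for a fixed three-digit field (a hypothetical helper, not the article's code): fold the raw ASCII bytes first, then remove the combined offset in one subtraction:

```rust
// Deferred ASCII offset: with digit weights 100 + 10 + 1, the combined
// b'0' offset is 111 * b'0', subtracted once instead of per digit.
fn parse3(d: [u8; 3]) -> u16 {
    let raw = d[0] as u16 * 100 + d[1] as u16 * 10 + d[2] as u16;
    raw - 111 * b'0' as u16 // e.g. parse3(*b"123") == 123
}
```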

"Two max calls can be done using SIMD" is also pretty fun.

Um, but one of the things that I saw was, uh... I can't find it now. They've pointed out that, uh, it can be faster... now I can't find it... it can be faster to not run min and max, uh, but only do, like, overwrite-if-less-than and overwrite-if-greater-than.

Yeah, here: branchy min/max. So you make it branchy, um, and then these branches are almost never taken, because usually you're not changing the min/max, um, which I thought was interesting. And notice that this has a minus and this has a plus.

But, like, will this make a difference? I find it hard to believe, but let's see.

So this would be in... I don't think we care about it down here, but we care about it in the critical loop in one, which is... oh, it's confusing; the one is below the other ones here. So it's like: if t is less than stats.min, then stats.min is t; and if t is greater than stats.max, then stats.max is equal to t.

So this is now min/max.
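That is, sketched (matching the description above, not necessarily the exact code on screen):

```rust
// Branchy min/max: once min and max settle, these branches are almost
// never taken, so the predictor makes them nearly free, and we avoid
// the unconditional writes that a min()/max() pair would do.
fn update_minmax(min: &mut i16, max: &mut i16, t: i16) {
    if t < *min {
        *min = t;
    }
    if t > *max {
        *max = t;
    }
}
```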

Can min/max temperatures be more than 255? No, they're guaranteed to be between -99.9 and +99.9. Uh, but because we're removing the point and treating them as i16s, then yes, they can be more than 255. Uh, yeah, min/max is faster.

That is pretty funny.

But it's also surprising to me that this doesn't show up. I guess it's such small percentages, right? Um, that, like, that's what we're chasing now. Um, branchy min/max.

Um, I know I still have some inline(never)s, but those are there kind of on purpose. Um, so that we get to break out both the one function, uh, and the... uh, not update_stats, but, um, the print function. Uh, parse_temperature is now being sad, and get_mut is being sad.

And I guess now we're, like, looking for small wins, right? Uh, there's a whole bunch of code down here that never gets called, which is fine. Give me the source around it as well. And it's, like, all less than 1% now. Here it's, like, this stuff for the as_ref, which we need in order to call hash. Um...

Although, no, we don't... we could actually, for hash... uh, we don't need that test, which avoids the branch. No, we do need... yeah, we do need it, because, um, uh, in case it's longer; it just happens to never be. Uh, what about this stuff?

This jump: this is inside of hashbrown. So this we don't really get to do anything about. This is the hashing, which, like... if you have a better hash, call me. Uh, and then we have some, like, tiny 0.2s here.

This is what I mean. Like, I feel like the min/max didn't really show up as something that cost us time.

Oh yeah, this is the parsing. It is funny, because when you look at the parsing, right, like, it is true that these are small numbers, but they do kind of add up, right? So there's 0.22, 0.31, so we're at 0.5... 0.6, 0.7, 0.8, 0.9, 1.1... so this does end up being, uh, 1.1, 1.7, 1.8, 1.9, 2, 2.3. Yeah. So, like, there's the two, two-and-a-half percent that goes into parse_temperature. Uh,

which I think actually is what it said out here, and... yeah. Okay, great: 2.26. So, um, so that time does go in there. Is there anything we could do to that code?

Oh yeah, here's the stats stuff. I mean, you see there's, like, no time that goes in here, right? And I guess that's the observation: that this jump is basically never taken. Um, interesting that we don't even get any samples from this because it's so rare, which is actually a little surprising. But, um,

We have the exor as a compare.

Oh, we're doing these compares a couple of times.

I think we can probably simplify this because that compare is the same as that compare, which is the inverse of that compare.

And then this... we could remove the subtraction for, like, 0.1%, although we do it in each digit we parse. And so in theory we could do a SIMD subtract here, but given we're talking about three digits, it probably doesn't make a whole lot of sense.

conditional move.

And I wonder if we could avoid this multiply by zero and instead just conditionally take the value.

So, okay. Um, is_neg is t0 equal to b'-'. This is not is_neg... This, I feel like the compiler should be able to hoist, right? Uh, is_neg.

And then here, um, what are we trying to look for here? This is, like, let has_dd... This is also: if, um, if has_dd, then this, else zero. So that also feels like... I'm trying to figure out if we can do a, um... Oh, this can be an assert_unchecked.
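For reference, std::hint::assert_unchecked is the std API in question; a hedged usage sketch with a hypothetical invariant:

```rust
// assert_unchecked lets the optimizer assume the condition holds (and
// e.g. drop bounds checks); it is undefined behavior if the condition
// is ever false, hence the unsafe block.
fn first_byte(line: &[u8]) -> u8 {
    // Hypothetical invariant: callers never pass an empty line.
    unsafe { std::hint::assert_unchecked(!line.is_empty()) };
    line[0] // the bounds check here can now be optimized out
}
```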

What?

Um, yeah. Okay, so these minus b'0's also feel like we could get rid of, but doing so is actually a little bit annoying, because we're also multiplying them by 100, by 10, and by nothing.

skip is is_neg as usize. Is that true? Uh, yes, that's true.

And then let's now see if I quit this and do cargo asm. cargo asm par...

I might have to put inline(never) on that. Yeah. Okay. inline(never).

What do we have here now?

No, now we have some jumps in here.

We have a jump here. has_dd has now become a jump, which I don't really understand why. I guess it is more likely that it has double digits than not, so maybe the branch predictor is okay there. But this feels like it should just be a conditional move, right?

Um, we're also getting a jump here on the if has_dd. Yeah, that's what I was worried about.

That's why we pulled this mul trick of this being, um, if has_dd then 10 else 0, times... but this is really, I guess, one, which I think then just becomes i16::from(has_dd), right? And in fact, this... and I mean, this is getting kind of stupid, but it becomes, uh, i16::from(has_dd) times 90, plus 10. Right? So if we have double digits, the multiplier should be 90 + 10; otherwise it should be 0 + 10, which is 10. Um,
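Spelled out as a tiny sketch of the trick as stated:

```rust
// Branchless first-digit weight: bool -> 0/1, then scale. Two integer
// digits give 1 * 90 + 10 = 100; one digit gives 0 * 90 + 10 = 10.
// No if/else for the compiler to turn into a jump.
fn first_digit_weight(has_dd: bool) -> i16 {
    i16::from(has_dd) * 90 + 10
}
```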

We could do a SIMD here, it's true; we could do a load-select with the filter over the dot. But I think that got rid of the branches again. Yep, these are all now conditional moves. And just to check that it still... okay, it still does that.

And now, um, if we now copy this in, this is now, uh, parse... Why is this version slower? It should not be slower. Okay, they're basically the same.

But now, okay, let's go into one. Oh no, I'm losing my symbols again. We don't know why this happens. How about now? No. No. Don't lose my symbols. Oh no. I think... okay, this time it's okay. Uh,

so we were hoping that we shaved some of the time off from the parsing, but, uh, it doesn't really look like it, right? It was 2.7. No, 2... it was about 2.5 previously. So, we didn't make it any faster, but a lot of it is now on this one sub.

Why?

That doesn't make a whole lot of sense either.

Uh, t2. Yeah, it's probably misattributed here, but... yeah, now we're down in the hashing. This is an imul here, and it doesn't even show up.

Uh, let t2 is tn minus 4, minus skip, times 10. Uh, I don't think that works, because the length can be... well, negative numbers would then mess that up, right? That wouldn't work.

Uh, "plus one to select_unpredictable." Oh, std::hint::select_unpredictable. Okay, double.

There's also, uh, an interesting post by someone here who, like... they got it down to 0.145 seconds with the default data set, um, and that accepts all valid inputs, in C++. Uh, and in this post they go through... like, they have a bunch of ideas of things that they do. Um, I haven't actually read through everything here, but it seems like they have some sneaky things that we should, uh... sorry, not going to say we should, but that we could also make use of. Uh, but we're not actually going to do that.

Unsigned-int-overflow hashing. What? They just read them as u64s... like, just add them.

Uh, select_unpredictable "returns true_val or false_val depending on the value of condition, with a hint to the compiler that the condition is unlikely to be correctly predicted by the CPU branch predictor."

I see. I mean, sure. Uh, select_unpredictable, true and false. Sure. Um,
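A hedged usage sketch of that API, reusing the has_dd weight from earlier (the use site is illustrative, not the change made on stream):

```rust
use std::hint::select_unpredictable;

// select_unpredictable(cond, a, b) returns a or b while hinting that
// cond defies the branch predictor, nudging codegen toward a cmov.
fn first_digit_weight(has_dd: bool) -> i16 {
    select_unpredictable(has_dd, 100, 10)
}
```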

That is now the... oh, no. It's, uh, that one and this one. No. Uh, true, false.

Although I suppose we could do the same, uh, down here, for example, instead of this, like, sneaky-sneaky business.

Isn't parse at most 30 milliseconds of the runtime? Yes... well, it's at most 2% of samples, but you have to remember that, um, it adds to the sequential time of the program, in a sense, right? Because all the threads are running in parallel, and so if each one spends 2% less time there, then all of them get shorter. Um, but yes, I think this is, like, hyper-optimizing things that, uh, matter much less,

which is, uh... it's a good prompt that maybe this is not the place to spend our time, but

it's fun, you know.

Uh okay.

More more more.

Um, okay. I think we need to stop here, because I don't think there's anywhere we're going to get much further now, right? If we look at perf record, just to pull it up one last time...

Um, "love to see simply adding up the u64s with overflow for hashing." All right. All right, all right, all right. We'll try it. We'll try it. It's all I can give you. So

that would be, uh, self.0, and then it would be, uh, bytes chunks_exact... no, chunks... as_chunks, uh, of length eight. And then we do, uh, for chunk in chunks, uh, self.0 plus-equals chunk.

Uh, and then, well, it sort of depends on what addition you want to do here, right? Um, like, should this be a u32? Should this be a u64? The problem you run into is, uh, that many of our strings won't even be eight... or, some of our strings won't even be eight characters. But okay. So we'll do four, and then we'll do this, and this will be on the remainder, uh, u64 from_ne_bytes... uh, remainder, remainder... I can't spell... from_ne_bytes. Right.

Uh, so zero dot wrapping... so .0.wrapping_add. So self.0.wrapping_add, from... uh, what do you want to do with the rest?

Do it again, I guess. But no, that feels... uh, this is a remainder of four. I think we just do: for r in remainder, self.0 plus-equals r.

Uh, will there be at most one four-byte remainder chunk? Yeah, there'll be only one. At most one.
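Reconstructed, the experiment is roughly this (a hedged sketch of what was typed; it benchmarked slower than the existing mixer):

```rust
// Overflow-add hashing: fold the key in 4-byte chunks with wrapping
// adds, then add the leftover bytes one at a time.
struct AddHasher(u64);

impl AddHasher {
    fn write(&mut self, bytes: &[u8]) {
        let (chunks, remainder) = bytes.as_chunks::<4>();
        for chunk in chunks {
            self.0 = self.0.wrapping_add(u64::from(u32::from_ne_bytes(*chunk)));
        }
        for &r in remainder {
            self.0 = self.0.wrapping_add(u64::from(r));
        }
    }
}
```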

Um, and I mean, maybe we just ignore this remainder, but, uh... time... no, cargo b... "Watch this. Like, half the compute time." I don't think it will.

Nope. It's slower.

All right, but let's... all right. Let's not fully give up here yet. Let's try this, just so I don't get accused of not having tried it. That's even worse if we don't add the remainder. So, that's fun to see.

Um, okay. So what if we just do the four and we ignore the remainder? Uh, right. But there was a claim that maybe this should be the 2b bytes... the 2b, because the... something something... the high bits might matter more for hashbrown's, uh, grouping of buckets.

Nope. That's for sure slower. Uh, yeah, I don't think that's a winner.

Okay. Um, so what I was about to say was: if I do perf... perf record, and then perf report, and we look at the output... like, if we look... no, inside of here. Oh no. Oh no. Oh no. Okay, good. Um, like,

our time here realistically is spent in update_stats, right? And in update_stats, basically all of that time is spent here. So, can I change that? No, I can't. Okay. So, what do these add up to? 80... 92... 93. Okay, basically all of it. So all of update_stats right now, if this is to be believed, is in... just the hashing. Um,

and then there's a little bit in parse_temperature. It's already branchless, right? But it does have to do integer parsing, so it does have to do some, like, multiply-by-tens and whatnot. Um, so I think the hashing is the one to get at next. Um, and maybe that is writing our own hashmap, but the moment we do that, I feel like something is off. Um, just because I know hashbrown is so optimized here that, like, it just feels like something else is, uh, off. I think actually the hash function is probably the place to spend time instead. Um, and I mean, even if you buy, like, 10% here, then we're already quite a lot faster.

Um, yeah, interesting.

Anyway, okay, I think we should now actually stop, because I need to eat something and drink something and go to bed. Uh, but I hope that was interesting. I hope you enjoyed following along. I hope you learned something. Uh, even if we didn't end up faster than Java... you know, Java beat us. We have to admit when we've been beat. Um, although, how much code do we have? We also don't have too much code, like 300 lines. And how much code is this solution? 400 lines. Okay. So, we have fewer lines. Um, and also worse performance. So, you know, I'll take it. Um,

take it. Um, uh, I do. So, what I'll do is uh I will upload this to YouTube as usual. If you

watch this on YouTube, you already know this. Um, but I will also put a link

this. Um, but I will also put a link when I upload to the git repo of where I put Br, which will probably just be github.com/jonhoo.

github.com/jonhoo.

Um, and so if you want to then take a stab at making this faster (and remember, don't break the rules), then please do. No external dependencies, no breaking of the rules. Um, and let me know if you find a way to make this magically way faster. But in the meantime, thank you in chat for joining me. Thank you for watching. Uh, and I will see you the next time we do a stream, whatever it might be.

Thanks folks.
