Hartley McGuire - Active Record 8: Resilient by Default

By Ruby on Rails

Summary

## Key takeaways

- **Verify Connections Before Checkout**: Active Record pings database connections before use to discard bad ones and reconnect fresh, reducing the odds of query failures after idle periods. This pattern, from Rails 1.1, trades latency for resiliency by shrinking the failure window. [04:14], [06:22]
- **Rails 7.1 Defer Verification**: Rails 7.1 marks BEGIN queries retriable, skipping the verification ping for them and enabling recovery from early connection errors. This challenges the idea that query retries are always complicated. [08:19], [09:13]
- **Rails 7.2 Granular Checkouts**: Rails 7.2 refactors Active Record to check connections back into the pool after each query instead of pinning them for the whole request, allowing smaller pools at high concurrency. A follow-up fix skips verification if the last query succeeded within 2 seconds. [12:47], [14:05]
- **Auto-Retry SELECT Queries**: Rails 7.2 uses the Arel visitor to mark SELECT queries retriable unless they contain raw SQL literals or non-idempotent nodes. This makes most SELECTs recoverable from connection errors. [17:19], [18:18]
- **Rails 8.1 More Retriable Queries**: Rails 8.1 marks internal Active Record SELECTs like schema loads retriable and adds `retryable: true` to safe Arel.sql strings. Developers can log non-retriable queries to fix them. [20:43], [22:20]

Topics Covered

Verify Before Checkout Prevents Query Failures
Retriable Queries Skip Verification Pings
Granular Checkouts Reduce Connection Pools
Auto-Retry Most SELECT Queries
Rails 8.1 Maximizes Retriable Queries

Full Transcript

Thank you all so much.

I'm really so happy to be here at Rails World, especially because I really love the energy that the community brings to these conferences.

So thank you again for being here.

And of course, it's also so exciting to come to Rails World in particular because you get to hear about all the new features coming to Rails.

Just this morning, we heard about Markdown and Active Job continuations, multi-tenancy, and even a new structured event reporter.

All these cool new things that you can start using in your own applications.

However, my favorite part of Rails is the magic.

The changes to the underlying components where all you have to do is upgrade and your application will just work even better than before.

So today I want to talk about Active Record, and instead of focusing on new features, I want to talk about how Active Record has improved over the years to make applications more resilient than ever.

But before we get to that, let me first tell you a bit about myself.

My name is Hartley McGuire.

You may have seen this beautiful picture of me on GitHub.

When I made this slide, I was on the Rails Issues team, but as of Tuesday, I was actually added to the Committers team.

Thank you.

One of my favorite parts about being on the Rails team is helping new contributors contribute to Rails.

If you have questions about contributing or even just want help getting started, feel free to come find me sometime during the conference.

If I'm not watching a talk, I'll probably be at the Shopify booth because my day job is on the Ruby and Rails Infrastructure team at Shopify.

I've also heard that there's an interest in more Ruby and Rails blog content, so I've included a link to my blog.

I try to write about more intermediate to advanced topics, and I hope you find them as interesting as I do.

Before I joined the Ruby and Rails Infrastructure team earlier this year, I was on Shopify's KateSQL team.

KateSQL is our internal database as a service where we run MySQL on Kubernetes.

And if you're interested in hearing more about that, some of my former teammates gave a really great presentation at a conference, which I've also linked.

And the reason I mention being on this team is that it made me really conscious of the high uptime expectations that applications have of their database.

If an application at Shopify saw errors coming from the database, my team would probably hear about it.

And sometimes these errors were caused by real problems. There could be an outage in the cloud, a database writer could crash, or maybe a database was just a little overwhelmed.

Most of the time, though, these errors were transient, just flaky and infrequent.

Is it worth trying to fix these kinds of errors if they happen so infrequently?

If your application is internal, maybe you just slap some retries on it and you don't care.

But if your application is customer-facing, then the answer should be a resounding yes, because a single error page can be the difference between a customer completing a checkout or choosing your application over someone else's.

So how did our team respond to fixing these flaky errors?

Well, I told application teams that they should upgrade Rails because I knew that this is somewhere that the Rails magic really shines.

Over the last few versions, Active Record has become more and more resilient to exactly these kinds of issues.

If you're hesitant to upgrade your own application, know that you are missing out because Active Record 8 is the most resilient version of Active Record ever.

And today I want to explain how we got to this point.

So the first area of improvement I want to talk about is connection verification.

Let me start by explaining what that means.

Here we have a Rails application and its MySQL database.

When a request comes into the application that needs to query the database, Active Record will open a connection and verify that connection by pinging it.

And if the database responds, then the application can execute queries as normal as you would expect.

However, if Active Record pings the database and it gets a bad response or even no response, then the connection is discarded and Active Record will try to reconnect with a fresh connection.
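The ping, discard, and reconnect flow described here can be sketched in a few lines of plain Ruby. This is only a toy model: `ToyConnection` and `checkout_verified` are hypothetical names made up for illustration, not Active Record's actual classes or methods.

```ruby
# Toy stand-in for a database connection (an assumption for illustration,
# not Active Record's real adapter class).
class ToyConnection
  def initialize(alive: true)
    @alive = alive
  end

  # Pretend round trip to the database; true when the server answers.
  def ping
    @alive
  end

  def execute(sql)
    raise IOError, "connection lost" unless @alive
    "ok: #{sql}"
  end
end

# Verify before use: ping first, and swap in a fresh connection if the
# ping fails. `reconnect` is any callable that opens a new connection.
def checkout_verified(connection, reconnect:)
  return connection if connection.ping
  reconnect.call # the bad connection is discarded
end

stale = ToyConnection.new(alive: false)
conn = checkout_verified(stale, reconnect: -> { ToyConnection.new })
puts conn.execute("SELECT 1")
```

The query only ever runs on a connection that answered a ping a moment ago, which is the whole point of the pattern.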

So that's the how of verification.

But now let's talk about the why.

For that, let's go all the way back to before when this verify before checkout pattern was first implemented.

Now, I love reading through the history of things like this.

So I went to find the commit where this was first introduced and traced its origin back to 2005 for Rails 1.1.

Before that commit, Rails would not eagerly verify connections and instead would just immediately execute queries.

Only if that query failed would Active Record throw out the connection and try reconnecting.

And while this approach sounds simpler because you get to skip the whole dance of verification, the issue is that we now have a problem: a question which is hard to answer.

Now that we've reconnected to the database, can we run the original query again?

And the answer is it's complicated.

We'll come back to query retriability later.

For now, let's say that because retrying queries can be complicated, we'd prefer to just avoid asking the question in the first place.

And that's where verification before checkout helps.

If we think about the lifetime of a database connection, there may be long periods between queries, especially since database connections get reused between requests.

So if anything happens to the connection during that time between requests, the next query could fail.

But if we verify the connection right before it's used, we significantly reduce the odds of that query failing.

Now, the only window of time a connection issue could cause a problem is right between the verification and the query, making it much less likely the query will error.

So I also want to note that this verification is not free.

It does require an extra round trip to your database.

So if your application is physically near your database, such as in the same cloud region, that might be inexpensive, maybe one millisecond.

But if your database is physically far away from your application, like even US East and US Central, then you're adding 30 milliseconds or more to your request.

So since we're exchanging some latency for resiliency, it seems like a reasonable trade-off, but it definitely does have a downside.

As I mentioned before, this verify before checkout pattern is not anything new.

It was added to Rails 1.1, and it's remained mostly unchanged for 17 years.

The big improvement came in Rails 7.1, and this change was a really big one in a pull request titled Defer Verification of Database Connections.

In this pull request, Matthew did a really large refactor of the connection verification process, and it set the stage for really huge improvements in resiliency.

The big change was to challenge the assertion from earlier that it's complicated to know whether a query can be retried.

To explain this change, let's go back to the timeline of a connection, but this time let's look at an INSERT query wrapped in a transaction.

Just like before, the verification prevents connection issues during that idle period from causing our queries to error.

So the failure points are still the small periods of time in between queries.

And of course, if any of the queries themselves fail, we lose the entire transaction and request because we don't know if we can retry the queries.

Or do we?

While the INSERT and COMMIT fall very heavily under the "it's complicated" umbrella, the BEGIN is actually straightforward.

If a BEGIN were to fail and Active Record discarded the connection, it would be completely safe to just retry the BEGIN on a brand new connection.

So the foundational piece of Matthew's pull request is the ability to mark queries as retriable.

And now the number of failure points has decreased.

Previously, there were three failure points, but now we have the ability to recover from a failed BEGIN, so now there are only two.

And at this point, you may also notice that the verification ping is not actually helping us anymore.

Before, it made it less likely that the BEGIN query would fail, but now that BEGIN itself can recover, the extra round trip to the database can be removed.

So to be completely clear, the verification ping will still happen on many queries.

This optimization only happens if that first query on the connection is retriable.
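As a rough sketch of the idea, a retriable query can skip the ping entirely and rely on reconnect-and-retry instead. The names below (`FlakyConnection`, `execute_with_retry`, the `allow_retry:` keyword) are assumptions for illustration and not Active Record's internals.

```ruby
# Toy connection that can be told to fail its next query once
# (hypothetical, for illustration only).
class FlakyConnection
  attr_reader :log

  def initialize(fail_next: false)
    @fail_next = fail_next
    @log = []
  end

  def execute(sql)
    if @fail_next
      @fail_next = false
      raise IOError, "connection lost"
    end
    @log << sql
    :ok
  end
end

# Run a query without pinging first. On a connection error, reconnect and
# retry once, but only when the query was marked as safe to retry.
# Returns the connection that ended up running the query.
def execute_with_retry(connection, sql, allow_retry:, reconnect:)
  connection.execute(sql)
  connection
rescue IOError
  raise unless allow_retry
  fresh = reconnect.call
  fresh.execute(sql)
  fresh
end

broken = FlakyConnection.new(fail_next: true)
conn = execute_with_retry(broken, "BEGIN", allow_retry: true,
                          reconnect: -> { FlakyConnection.new })
conn.execute("INSERT INTO orders (id) VALUES (1)")
p conn.log
```

When the first query is a BEGIN, the happy path costs no extra round trip, and the unhappy path costs one reconnect rather than a failed request.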

Let me quickly summarize everything covered so far.

Active Record verifies database connections by pinging them before it uses them to reduce the likelihood of connection errors.

In Rails 7.1, Matthew introduced the ability to internally mark some queries as retriable, which unlocks things like skipping the verification ping and the ability to recover from connection errors much more often during a request.

To explain why that's such a huge improvement to resiliency, I want to introduce another concept, connection pinning.

Let's revisit our model from earlier to get a better understanding of how Active Record handles connections.

Just like before, we have a Rails application and its MySQL database, but this time let's peel back a few layers and zoom in a bit on the application's internals.

If you're running a web server like Puma, then your application will have multiple threads to process requests.

Additionally, Active Record maintains a shared pool of connections so that each request isn't slowed down by having to establish new ones.

Now, just like before, when a request comes into the application, it gets picked up by a thread.

And again, like before, when the request needs to run a query, it checks out a connection from the pool, and the pool will verify it.

And then this is the important part.

Now that the connection has been verified and given to the thread to run a query, that connection is now pinned to the thread.

So if another request is picked up by a different thread, it will get its own connection that's pinned, and the first connection doesn't get unpinned until that first request finishes, at which point the first connection goes back to the pool to be used by future requests.

To tie this back into Matthew's pull request, notice that the connection verification only happens when a thread first checks out a connection.

Once the connection is pinned, it doesn't get verified again.
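A small model can make the pinning behavior concrete. This is a sketch under stated assumptions: `ToyPool`, `connection_for`, and `release` are invented names, and real requests are stood in for by plain identifiers.

```ruby
# Toy pool that pins one connection per request until the request ends
# (hypothetical names; not Active Record's ConnectionPool API).
class ToyPool
  class Exhausted < StandardError; end

  def initialize(size)
    @idle = Array.new(size) { |i| "conn-#{i}" }
    @pinned = {} # request id => connection
  end

  # The first query of a request checks out a connection (this is where
  # verification would happen); later queries reuse the pinned one.
  def connection_for(request_id)
    @pinned[request_id] ||= begin
      raise Exhausted, "no idle connections" if @idle.empty?
      @idle.pop
    end
  end

  # Called when the request finishes: unpin and return to the pool.
  def release(request_id)
    conn = @pinned.delete(request_id)
    @idle.push(conn) if conn
  end
end

pool = ToyPool.new(2)
first = pool.connection_for(:request_a)
again = pool.connection_for(:request_a) # same pinned connection, no re-verify
other = pool.connection_for(:request_b) # a different thread gets its own
pool.release(:request_a)
```

Because verification would only run inside the first `connection_for` call of a request, a connection that breaks later in the request has no chance to recover in this model.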

This means the only opportunity to recover from a connection error is at the very beginning of a request, that first query.

With the addition of retriable queries, recovery can now happen at any point there's a retriable query.

This single verification per request model was how things worked through Rails 7.1 because connections are pinned for an entire request.

But that isn't quite the case for Rails 7.2 due to another really big change, this time in connection pinning.

The change ended up being split over many different pull requests, but this initial proof of concept links out to all the others.

As mentioned in the description here, unbounded connection pinning has a fundamental problem.

It generally requires that the connection pool of your application be larger than the maximum number of threads or fibers that your application uses.

So if you want to run at a higher concurrency than the number of connections you can open to your database, then you will start to run into issues.

To address this, Jean did the very difficult work of refactoring the entirety of Active Record to support granular connection checkouts, meaning that connections get checked back into the pool after they're used.

The difference can be subtle, so let's see it in action.

Like before, a request comes into our application and wants to run a query, so it checks out a connection, verifies it, and runs the query.

And just like before, the connection is pinned to the thread, but only until the query finishes, at which point the pin is removed and the connection is checked back into the pool.

And like I mentioned before, this has the potential to reduce the total number of connections to the application's database if requests spend large amounts of time doing things other than running queries.
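The contrast with pinning can be sketched with a block-scoped checkout. `GranularPool` is a hypothetical toy; its `with_connection` method only mirrors the shape of a check-out-then-check-in helper and is not the Active Record implementation.

```ruby
# Toy pool where a connection is held only for the duration of a block
# (an illustrative assumption, not Active Record's pool).
class GranularPool
  def initialize(size)
    @idle = Array.new(size) { |i| "conn-#{i}" }
  end

  def idle_count
    @idle.size
  end

  # Check out a connection, yield it, and always check it back in,
  # even if the block raises.
  def with_connection
    conn = @idle.pop
    raise "no idle connections" unless conn
    yield conn
  ensure
    @idle.push(conn) if conn
  end
end

# A pool of one can serve two interleaved "requests", because neither
# holds the connection while it is busy doing other work.
pool = GranularPool.new(1)
used = []
pool.with_connection { |c| used << [:request_a, c] }
pool.with_connection { |c| used << [:request_b, c] }
pool.with_connection { |c| used << [:request_a, c] }
p used
```

With pinning, the same workload would have needed two connections, one per in-flight request.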

However, these granular checkouts do have one small problem.

Each time a request checks out a connection to make a query, it now reverifies it.

And this is unfortunate because the additional pings will add latency to requests that have many queries.

And as I mentioned before, that latency could be quite significant if the physical distance between your database and your application is large.

This problem was addressed in another tweak to connection verification.

Matthew added a time component so that verification pings only happen if the last successful query was more than 2 seconds ago.

In practice, this prevents the multiple verification pings per request and restores the 7.1 behavior of only verifying the connection at the first query of a request.
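The time component can be modeled as a simple check against the last successful query. In this sketch `TimedVerifier` and its injectable clock are assumptions made for illustration; only the 2 second window comes from the talk.

```ruby
# Toy model of "only ping if the last successful query was more than
# 2 seconds ago" (hypothetical class, not Active Record's code).
class TimedVerifier
  VERIFY_AFTER = 2.0 # seconds; the window mentioned in the talk

  attr_reader :pings

  # The clock is injectable so the behavior can be exercised without sleeping.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @last_success = nil
    @pings = 0
  end

  def needs_verification?
    @last_success.nil? || (@clock.call - @last_success) > VERIFY_AFTER
  end

  def run_query
    @pings += 1 if needs_verification? # the extra round trip
    # ... the query itself would run here ...
    @last_success = @clock.call
  end
end

verifier = TimedVerifier.new
5.times { verifier.run_query } # a burst of queries within one request
puts verifier.pings
```

A burst of queries in one request pays for at most one ping, while a connection that has sat idle between requests is still checked before use.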

If your application is on a released version of Rails 7.2, I highly recommend upgrading to either 8.0.2 or at least the 7.2 stable branch on GitHub so that your application can take advantage of this fix.

Once again, I'll summarize everything that I've talked about.

When a connection is checked out from the pool, Active Record will pin it to the current thread or fiber, and this ensures that the same connection is used by subsequent queries on a request, and it also prevents verification from happening multiple times within a request.

In Rails 7.2, granular connection checkouts will check the connection back into the pool after it's done using it instead of holding it for the entire request.

And finally, Rails 8.0.2 includes another change to verification to ensure that the same connection doesn't get verified multiple times within a two second period.

And this fixes the small performance regression where granular checkouts could cause connection verification to happen multiple times and added some additional latency.

The next thing I want to talk about is the earlier topic, query retriability.

Because at this point, we have this really great foundation for improving the resiliency of Rails applications, but only if queries are marked retriable.

And at this point, the only query that's retriable is BEGIN.

So what else should Active Record automatically retry?

As a default, Active Record should only automatically retry queries that are idempotent, meaning that they can safely be run multiple times.

You may have heard this term in relation to HTTP methods, in that GET and HEAD requests are considered idempotent while POST and PATCH are not.

To map this concept to SQL: while INSERT, UPDATE, and DELETE are not idempotent, SELECT queries should be, right?

Unfortunately, even that assertion isn't always true because a SELECT statement could contain a function or a subquery that modifies the database in some way, and we don't want to automatically retry that.

If Active Record is going to assert that a query is retriable, it needs to have full control over how the query is constructed.

So speaking of query construction, how does Active Record do that?

When you call one of Active Record's query methods in Ruby, it builds a tree-like representation of that query out of Arel nodes.

And then it uses the visitor pattern, which means that the tree is traversed and each node gets translated into a string.

So to assert that queries are retriable, Active Record's visitor needs to be extended to report if some node that we don't want to retry is included in the query.

And this was the approach taken by Adriana in this pull request, which was released in Rails 7.2.

Let's see how this strategy can identify non-retriable queries.

So this time, our query will contain a raw SQL string that modifies the database.

Since a raw SQL string could contain literally any SQL, its inclusion in a query will prevent Active Record from making it retriable.

The Arel tree will look similar to the one before, but notice that the where has changed from an equality node to a SQL literal node, which is the Arel representation of a raw SQL string.

The visitor will now also track whether it has run into a node which should make the query not retriable.

So now the visitor visits each node to build the SQL string as before, but when it gets to the SQL literal node, it can then flag that the query should no longer be retried.

And since the majority of select queries won't include these raw SQL strings, this should make most select statements in applications retriable.
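The visitor idea can be illustrated with a tiny tree and a compiler that flips a flag when it meets a raw literal. The node structs and `ToyVisitor` below are hypothetical stand-ins; Arel's real node classes and visitor are considerably richer.

```ruby
# Minimal node types standing in for an Arel-style tree (assumed names).
Select = Struct.new(:table, :wheres)
Equality = Struct.new(:column, :value)
SqlLiteral = Struct.new(:raw)

# Walks the tree, builds the SQL string, and tracks whether every node
# it visited is safe to retry.
class ToyVisitor
  attr_reader :retryable

  def initialize
    @retryable = true
  end

  def compile(node)
    case node
    when Select
      conditions = node.wheres.map { |w| compile(w) }.join(" AND ")
      "SELECT * FROM #{node.table} WHERE #{conditions}"
    when Equality
      # Integer() keeps this toy safe; real quoting is out of scope here.
      "#{node.column} = #{Integer(node.value)}"
    when SqlLiteral
      @retryable = false # raw SQL could do anything, so stop retrying
      node.raw
    else
      raise ArgumentError, "unknown node: #{node.inspect}"
    end
  end
end

visitor = ToyVisitor.new
sql = visitor.compile(Select.new("users", [Equality.new("id", 1)]))
puts sql
puts visitor.retryable
```

The flag is a side product of the same traversal that builds the string, so deciding retriability adds no second pass over the tree.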

And making most SELECT statements retriable really unlocks the greater potential of the connection verification change.

While previously only BEGINs were a common retriable query, now the vast majority of queries in an application are retriable, and there are only a few left which could actually fail due to connection errors.

Hopefully at this point, you're fully on board the resiliency train.

Deferred connection verification and the introduction of retriable queries laid the foundation in Rails 7.1.

Granular connection checkouts and automatically retriable SELECT queries built on that foundation in Rails 7.2.

And Rails 8 is even better with yet another improvement to verification timeouts.

And the best thing for me is that this all happens for you for free without any changes to your application.

All you have to do is upgrade to Rails 8 because it's the most resilient version of Active Record ever.

Well, this is Rails World, so we're not talking about Active Record 8, we're actually talking about Active Record 8.1, of which the first beta was released about an hour ago.

And the truth is, Active Record 8.1 will be the most resilient version of Active Record ever.

You're probably thinking, with all these features I've already discussed, how can 8.1 be even more resilient than 8?

Let me explain.

When the KateSQL team upgraded our Rails application to 8.0, the number of connection errors decreased, but they didn't go away completely.

And this is somewhat expected because as we discussed earlier, INSERTs, UPDATEs, and DELETEs are not automatically retriable.

However, when we started investigating the errors, we saw that we were still getting connection errors from SELECT queries.

To investigate why this was happening, I wanted more information.

So the first thing I did was add the retriable status of a query to the sql.active_record Active Support notification.

And this allowed me to build a log subscriber in my application that would only log non-retriable queries so that I could identify which queries were not being marked retriable, but I expected to be.
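A subscriber along these lines might look like the sketch below. It is written as a plain Ruby method so it runs without Rails; the payload key `:allow_retry` is an assumption about how the retriable status is exposed and should be checked against the Rails version in use, and `log_non_retryable` is a made-up name.

```ruby
require "logger"
require "stringio"

# Logs only the queries that were NOT marked as safe to retry.
# `payload` mimics the hash carried by a sql.active_record event;
# the :allow_retry key is an assumption for this sketch.
def log_non_retryable(payload, logger)
  return if payload[:allow_retry]
  logger.warn("non-retryable query: #{payload[:sql]}")
end

# In a Rails app this could be wired up roughly like:
#   ActiveSupport::Notifications.subscribe("sql.active_record") do |event|
#     log_non_retryable(event.payload, Rails.logger)
#   end

output = StringIO.new
logger = Logger.new(output)
log_non_retryable({ sql: "SELECT 1 AS one FROM users LIMIT 1", allow_retry: false }, logger)
log_non_retryable({ sql: "SELECT * FROM users WHERE id = 1", allow_retry: true }, logger)
puts output.string
```

Filtering at the subscriber keeps the log short enough to read through by hand, which is what makes the remaining non-retriable queries easy to spot.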

And some of the non-retriable queries were internal to Active Record for loading schema data.

Similar to BEGIN, these just needed to be explicitly marked as retriable inside Active Record.

But other queries were coming from our application and looked like they should be automatically retriable.

When I investigated these, another pattern emerged.

Active Record itself uses SQL literals.

In this case, the 1 AS one in this query is a SQL literal.

And as we saw with the visitor before, when a SQL literal is used in a query, it makes the query non-retriable.

So how can we fix this?

Before I tell you, I want to include a disclaimer that what I'm about to show you is a sharp knife.

If you haven't heard this metaphor before, some of the features Rails provides are described as sharp knives, because while they provide a lot of power, they also come with a greater responsibility in how they are used.

In this case, the feature I'm going to show you should only be used if you know absolutely that a SQL string is safe, not from untrusted user input, and you want it to be included in retriable queries.

So this is what the 1 AS one SQL string looked like inside Active Record.

It uses Arel.sql, which is a helper for creating known safe SQL strings.

And again, this helper should only be used when the SQL string is known to be safe and not from untrusted input, because it skips things like sanitization and quoting; it just gets put directly in the query.

So to prevent this SQL string from making queries non-retriable, all I had to do was add `retryable: true` to it.

And if you have instances of known, safe SQL strings in your own applications that you want to be part of retriable queries, you can do the same thing as well.
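The effect of that flag can be modeled without Rails. In the sketch below, `ToyLiteral` and `query_retryable?` are hypothetical names; they only imitate how a literal marked as retryable stops vetoing the query, and the real call described in the talk is shown in a comment.

```ruby
# Toy literal node: unmarked raw SQL defaults to not retryable
# (illustrative only; not Arel's SqlLiteral class).
ToyLiteral = Struct.new(:raw, :retryable) do
  def initialize(raw, retryable: false)
    super(raw, retryable)
  end
end

# A query built from literals is retryable only if every literal opted in.
def query_retryable?(literals)
  literals.all?(&:retryable)
end

# In Active Record the opt-in described in the talk looks like:
#   Arel.sql("1 AS one", retryable: true)
# Only use it for strings that are known to be safe and idempotent.

marked = ToyLiteral.new("1 AS one", retryable: true)
unmarked = ToyLiteral.new("bump_counter()")

puts query_retryable?([marked])
puts query_retryable?([marked, unmarked])
```

Defaulting to not retryable means a forgotten flag costs some resiliency, while a wrongly added flag could rerun a query with side effects, which is why this is a sharp knife.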

This retryable keyword was added to Active Record as part of the automatically retriable SELECTs pull request, so it's available starting in Rails 7.2.

However, if you try to use it in your own application, there are a few places where it might not work, but this has also been fixed in Rails 8.1.

So for my application, I ended up finding and fixing all of the SELECT queries that weren't retriable but I thought should be, either by upstreaming fixes in Rails or with small adjustments to my own application code.

So now Rails 8.1 builds its resiliency by default even further, with even more queries that are automatically retried and by providing developers the tools to identify non-retriable queries in their own application that maybe should be.

And with that, I hope you're as excited as I am about Rails 8.1, not just for the new features, but also for the most resilient version of Active Record ever. Thank you.
