
Amazon's Secret to 85,000 Orders/Minute | Black Friday Architecture

By ByteMonk

Summary

Key Takeaways

  • **Walmart's 2011 Black Friday Crash**: In 2011, Walmart's website crashed on Black Friday because the largest retailer in the world simply couldn't handle the traffic. [00:00], [00:02]
  • **Databases Buckle First Under Load**: Databases have a hard limit on concurrent connections, typically 100-500; with thousands of users firing 5-20 queries each, queries queue up, taking seconds instead of milliseconds and causing timeouts. [01:28], [02:05]
  • **One Slow Query Triggers Cascade**: One slow query, like a complex search without indexing, holds connections open longer, starving other queries and cascading into systemwide failure. [02:20], [02:34]
  • **Load Balancers Amplify Failures**: Load balancers mark slow app servers as unhealthy and route traffic to remaining healthy ones, overloading them further in a vicious cycle. [03:36], [03:45]
  • **Cache Stampede Overwhelms DB**: Cache stampede happens when cache expires and thousands of requests simultaneously regenerate data, hitting the database all at once and overwhelming it. [04:01], [04:16]
  • **User Refreshes Fuel Death Loop**: Users refreshing pages create more requests, leading to more load, longer waits, and even more refreshes in a feedback loop that crashes systems in minutes. [06:50], [07:04]

Topics Covered

  • Databases Buckle First Under Load
  • Event Loops Block on CPU Work
  • Cache Stampedes Overwhelm Databases
  • Cascading Failures Create Feedback Loops
  • Read Replicas Offload Database Pressure

Full Transcript

In 2011, Walmart's website crashed on Black Friday. Walmart, the largest retailer in the world, simply couldn't handle the traffic. And it is not just a Walmart problem. Recently, AWS and Cloudflare have suffered massive outages. So the real question is: what actually happens when a website crashes under load? And how do companies like Amazon, who process over 12,000 orders per second during peak, keep their systems running? Let's break it down layer by layer.

A typical web application has several layers. At the front, you have a CDN, a content delivery network that serves static assets like images, CSS, and JavaScript from servers close to your users. Behind that, a load balancer distributes incoming traffic across multiple application servers. Those application servers run your actual code. Could be Node, Python, Go, Java, whatever your stack is. Your app server talks to a database, usually something like PostgreSQL or MySQL, for transactional data, and you have probably got a caching layer in there too, Redis or Memcached, to avoid hitting the database for every request. And then there are third-party services: payment processors, authentication providers, email services, analytics, and so on. On a normal day, traffic flows through the system smoothly. But on Black Friday, every single one of these layers becomes a potential failure point.

Let's trace through what actually breaks. The database is usually the first thing to buckle. Here is why: databases have a hard limit on concurrent connections, typically somewhere between 100 and 500 depending on your configuration and resources. When a user loads a page, your app might fire off 5, 10, maybe 20 database queries. Product info, user session, cart contents, inventory checks, pricing rules. Now multiply that by thousands of concurrent users. You're looking at tens of thousands of queries trying to squeeze through a few hundred connection slots. When you hit the connection limit, new queries start queuing. Queries that normally take 50 milliseconds now take 5 seconds because they are waiting in line. And if the queue grows faster than it drains: timeouts, errors, your app can't function.

But it gets worse. One slow query can cascade into a systemwide failure. Maybe it's a complex search without proper indexing. Maybe it's a reporting query that accidentally got triggered. That query holds its connection open longer, which means fewer connections for everything else. And this is why database connection pooling matters. This is why query optimization matters. And this is why read replicas exist: to spread the read load across multiple database instances.

Next layer up: your application servers. Whether you are running Node, Python, or anything else, your servers have finite capacity. They can only handle so many concurrent requests before they start choking. In Node, you have got the event loop. It's great at handling IO-bound operations. But if you are doing CPU-intensive work, image processing, complex calculations, JSON parsing on large payloads, you can block that event loop and grind everything to a halt. In a threaded model like Java or Go, each request typically gets its own thread. But each thread consumes memory. Spin up too many and you exhaust your RAM.

And here's the cascading effect again. If your database is slow, your app servers are stuck waiting on database responses. Requests pile up. Memory usage climbs. Response times balloon. Eventually, your load balancer starts seeing app servers as unhealthy. It routes traffic to the remaining healthy servers, which now have even more load, which makes them unhealthy. You see where this is going?
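The event-loop blocking described above can be sketched with Python's asyncio, which has the same single-threaded loop model as Node. This is a minimal illustration, not code from the video; `cpu_heavy`, `blocking_handler`, and `offloaded_handler` are hypothetical names.

```python
import asyncio
import concurrent.futures

def cpu_heavy(n: int) -> int:
    # CPU-bound work: whichever thread runs this is busy until it finishes
    return sum(i * i for i in range(n))

async def blocking_handler() -> int:
    # Runs on the event-loop thread itself: every other coroutine
    # (health checks, other requests) stalls until this returns.
    return cpu_heavy(1_000_000)

async def offloaded_handler() -> int:
    # Pushes the CPU work onto a worker, keeping the loop free to
    # serve other requests in the meantime.
    loop = asyncio.get_running_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return await loop.run_in_executor(pool, cpu_heavy, 1_000_000)

if __name__ == "__main__":
    print(asyncio.run(offloaded_handler()))
```

For pure-Python CPU work, a `ProcessPoolExecutor` is the stronger fix since threads still contend on the GIL; a thread pool is enough when the heavy work is in C-backed libraries like image processing.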

This is where caching is supposed to save you. The idea is simple. Instead of hitting the database for every request, you store frequently accessed data in memory: Redis, Memcached, even application-level caching. A well-cached system can handle orders of magnitude more traffic than one hitting the database for everything. But caching has failure modes too. Cache stampede, also called thundering herd: your cache expires and suddenly a thousand requests all try to regenerate the same data simultaneously. They all hit the database at once, and the database gets overwhelmed. Then there are cache invalidation bugs. You update a product price, but the old price is still cached. Now you're selling things at the wrong price on Black Friday. That's a bad day. Or the cache itself runs out of memory. It starts evicting entries. Hit rate drops. More requests fall through to the database. Database gets crushed. Caching isn't a magic bullet. It's another system that needs to be sized correctly and monitored. Let's move to the edge.
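One common defense against the stampede just described is "single-flight" regeneration: only one caller recomputes an expired entry while the rest wait for its result. Here is a minimal single-process sketch of that idea; production systems usually do this with a distributed lock in Redis or a library, and `loader` stands in for the expensive database query.

```python
import threading
import time

# In-process cache with per-key locks: when an entry expires, exactly one
# caller regenerates it while the others wait, instead of a thousand
# requests hitting the database at once.
_cache = {}          # key -> (value, expires_at)
_locks = {}          # key -> threading.Lock
_locks_guard = threading.Lock()

def get_or_regenerate(key, ttl, loader):
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                       # fresh hit, no locking needed
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        # Double-check: another thread may have regenerated while we waited
        entry = _cache.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = loader()                      # exactly one caller pays this cost
        _cache[key] = (value, time.monotonic() + ttl)
        return value
```

A variant of the same idea is to refresh popular keys slightly before they expire, so the stampede window never opens at all.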

Your CDN. A CDN caches static content, images, JavaScript, CSS, on servers distributed across the world. A user in Tokyo gets served from a Tokyo edge server, not your origin server in Virginia. This takes enormous pressure off your infrastructure. If 80% of your page weight is static assets and they are all served from the CDN, your origin only handles 20% of the bandwidth. But CDNs aren't invincible. In 2023, Fastly had a 52-minute global outage that took down thousands of websites during peak shopping. Your infrastructure can be perfect and you still go down because your CDN failed. And then there is cache configuration. If your CDN isn't caching what it should, or if cache headers are misconfigured, requests that should be served from the edge are hitting your origin. You are paying for a CDN but not getting the protection.

Let's talk about the things completely outside your control. Payment processors: Stripe, PayPal, whatever you're using, they are handling Black Friday traffic from thousands of merchants simultaneously. If they slow down, your checkout slows down. If they go down, you can't process orders. Period. Auth providers: if you're using an external identity service and it's having latency issues, your users can't log in. Analytics, tag managers, third-party scripts: if any of these are slow to load, they can block your page rendering. Target's Black Friday crash a few years back was partly caused by payment processor integration failures combined with unexpected mobile traffic patterns. Their infrastructure was optimized for 60% mobile traffic. They got 78%. The point is, your system is only as reliable as its weakest dependency.

Now, here is what makes Black Friday crashes so brutal. These layers don't fail in isolation. One bottleneck creates pressure on the next. Database slows down. App servers wait on the database. Load balancer sees app servers as slow. Requests queue up. Memory usage spikes. Servers start failing health checks. Traffic gets routed to fewer servers. And those servers get even more overloaded. And the whole time, users are refreshing. Every refresh is another request. More requests mean more load. More load means longer waits. And longer waits mean more refreshes. It's a feedback loop that can take your system from a little slow to completely down in minutes.

So how do companies like Amazon, Shopify, and Walmart actually survive this? It starts with load testing. Before Black Friday, they are simulating traffic. Not just normal traffic, but 2x, 5x, 10x expected load. They want to know exactly where the system breaks. There's a difference between load testing and stress testing, by the way. Load testing validates that your system handles expected traffic. Stress testing finds your breaking point, and you need both. Some companies go further with chaos engineering. Netflix famously runs Chaos Monkey, a tool that randomly kills production servers to make sure the system can handle failures. If your system can survive random servers dying, it can survive a traffic spike. Then there is capacity planning. They are analyzing last year's traffic patterns, projecting growth, and provisioning infrastructure ahead of time. Not reactive, proactive.
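The load-testing idea above can be sketched as a tiny harness: fire concurrent requests at a handler and report latency percentiles. This is a toy for illustration only; real runs use dedicated tools like k6, Locust, or JMeter against staging infrastructure, and `handler` here stands in for an HTTP endpoint.

```python
import concurrent.futures
import statistics
import time

def load_test(handler, total_requests: int, concurrency: int):
    # Fire total_requests calls at handler from `concurrency` workers
    # and summarize the observed latencies.
    def one_request(_):
        start = time.perf_counter()
        handler()
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(one_request, range(total_requests)))

    return {
        "p50": statistics.median(latencies),
        "p99": latencies[int(0.99 * (len(latencies) - 1))],
        "max": latencies[-1],
    }

if __name__ == "__main__":
    # Hypothetical handler simulating a 1 ms endpoint
    print(load_test(lambda: time.sleep(0.001), total_requests=200, concurrency=20))
```

Ramping `concurrency` up until p99 degrades is the stress-testing half: it finds the breaking point rather than validating expected load.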

Autoscaling helps, but it's not instant. Spinning up new servers takes time. If traffic spikes faster than you can scale, you're still in trouble. And that's why pre-warming infrastructure matters. And caching strategy: big retailers cache aggressively. Not just static assets, but API responses, database queries, even rendered HTML fragments. The less work your origin servers do, the more traffic they can handle.

Let's get practical. What can you actually do? Starting with the database layer, read replicas are your friend. Your primary database handles all the writes: new orders, inventory updates, user signups. But reads? Those can go to replica databases that sync from the primary. Most applications are read-heavy, so this alone can dramatically reduce load on your main database.
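The read-replica split above boils down to a routing decision per query. Here is a minimal sketch of that pattern; `primary` and `replicas` stand in for real connection pools (e.g. psycopg2 or SQLAlchemy engines), and the SELECT check is deliberately naive.

```python
import itertools

class ReadWriteRouter:
    # Route writes to the primary and spread reads across replicas
    # round-robin. Real routers must also keep reads inside a
    # transaction pinned to the primary and account for replication lag.
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_query(self, sql: str):
        # Plain SELECTs can go to any replica that has caught up;
        # everything else must hit the primary.
        if self._replicas and sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary
```

Many ORMs and proxies (e.g. ProxySQL for MySQL) implement exactly this split so application code never has to choose a connection by hand.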

Connection pooling is essential. Instead of opening a database connection for every request, which is expensive, you maintain a pool of connections that get reused. Tools like PgBouncer for PostgreSQL can handle thousands of application requests with just a few dozen actual database connections.
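The pooling idea can be shown in a few lines. This is a sketch, not a production pool: real deployments use PgBouncer or a driver-level pool, and `connect` here stands in for the expensive dial to the database.

```python
import contextlib
import queue

class ConnectionPool:
    # Open a fixed set of connections up front and hand them out for
    # reuse, instead of dialing the database once per request.
    def __init__(self, connect, size: int):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    @contextlib.contextmanager
    def acquire(self, timeout: float = 5.0):
        # Blocks (bounded by timeout) when every connection is in use,
        # rather than opening an unbounded number of new ones.
        conn = self._pool.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)   # return the connection for reuse
```

The bounded queue is also what protects the database: under overload, requests wait at the pool instead of exhausting the server's connection limit.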

Index your queries. A query that scans a million rows to find one record is going to be slow. The same query with a proper index returns instantly. Run your slow query log, find the offenders, and fix them before Black Friday, not during. And query optimization in general: are you fetching more data than you need? Are you making multiple round trips when one would do? These things don't matter much at low traffic. At high traffic, they are the difference between staying up and going down.

At the application layer, horizontal scaling behind a load balancer. Make sure your app is stateless so any server can handle any request. Use async processing for anything that doesn't need an immediate response.
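The indexing point above can be made concrete with a query plan. This sketch uses SQLite as a stand-in for PostgreSQL or MySQL (the table and index names are made up for illustration): without an index, the planner scans every row; with one, the same query becomes a direct seek.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
db.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 1000, 9.99) for i in range(10_000)],
)

query = "SELECT * FROM orders WHERE user_id = 42"

# Without an index, the planner must scan every row
before = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(before)   # a full scan, e.g. "SCAN orders" (wording varies by version)

# With an index on the filtered column, it becomes a seek
db.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
after = db.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]
print(after)    # a search using idx_orders_user
```

Postgres's `EXPLAIN ANALYZE` and MySQL's `EXPLAIN` give the same kind of before/after evidence, which is exactly what to check for every query in the slow query log.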

Cache aggressively at every layer. CDN for static assets. Redis for session data and frequent queries. Implement cache warming so you are not starting cold on Black Friday morning. A misconfigured CDN is almost worse than no CDN at all because you're paying for it but not getting the protection. Make sure your cache headers are correct. Use a multi-CDN strategy if you can afford it. If one goes down, traffic fails over to another.

And finally, third-party services. You can't control whether an integrated third party such as Stripe is having a bad day, but you can control how your system responds to it. So have fallbacks where possible. Can your checkout flow gracefully degrade if analytics fail to load? Can you queue your orders if payment processing is slow? Implement circuit breakers: if a service is failing, stop calling it temporarily instead of letting failures cascade. After a timeout, try again.
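The circuit-breaker behaviour just described can be sketched as follows. This is a minimal illustration, not a library API; real services typically use a battle-tested implementation (e.g. resilience4j on the JVM), and the thresholds here are arbitrary.

```python
import time

class CircuitBreaker:
    # After max_failures consecutive failures the circuit "opens" and
    # calls fail fast instead of hammering a struggling dependency.
    # After reset_timeout seconds, one trial call is allowed through.
    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                  # any success resets the count
        return result
```

Failing fast is the point: a checkout that immediately queues the order for later beats one that waits thirty seconds on a dead payment gateway while its own connections pile up.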

And finally, testing. Load test your system regularly, not just before Black Friday. You should know your breaking point at all times. Stress test to find your ceiling: how much traffic can you handle before response times degrade? Set up monitoring and alerting. Dashboards are great, but dashboards don't wake you up at 3:00 a.m. when something is wrong. You need alerts on error rates, response times, CPU usage, memory usage, database connections. You want to know about problems before your customers start tweeting about them. Now, if you want to go deeper, I teach all of these concepts in detail in my system design master course. Link is in the description if you want to check that out.

For a lot of teams, especially those running WordPress, that's a big ask. If your WordPress site can't handle a traffic spike on Black Friday, it's probably not a code problem, it's a hosting problem. That's why I want to tell you about Kinsta. They are sponsoring this video and offer managed hosting for WordPress. And here is why that matters if you're building anything serious. First, speed. Switching to Kinsta could make your site run up to 200% faster. We are talking about 37 data centers worldwide, over 300 CDN locations, and edge caching built right in. Your users notice, and so does Google. Second, security that's actually enterprise-grade: SOC 2 compliance, DDoS protection, isolated container technology, and an SLA-backed uptime of up to 99.99%. And when something does go wrong, you get real WordPress experts available 24/7. Not chatbots, actual humans who can solve problems in minutes. Their Black Friday offer runs until December 2nd. New customers get 6 months free on annual plans or 50% off the first 6 months on monthly. Plus, there's a 30-day money-back guarantee. And if you're worried about the migration, they handle the entire thing for you. No technical expertise required. Check the link in the description to get started.
