Amazon's Secret to 85,000 Orders/Minute | Black Friday Architecture
By ByteMonk
Summary
## Key takeaways
- **Walmart's 2011 Black Friday Crash**: In 2011, Walmart's website crashed on Black Friday because the largest retailer in the world simply couldn't handle the traffic. [00:00], [00:02]
- **Databases Buckle First Under Load**: Databases have a hard limit on concurrent connections, typically 100-500; with thousands of users firing 5-20 queries each, queries queue up, taking seconds instead of milliseconds and causing timeouts. [01:28], [02:05]
- **One Slow Query Triggers Cascade**: One slow query, like a complex search without indexing, holds connections open longer, starving other queries and cascading into systemwide failure. [02:20], [02:34]
- **Load Balancers Amplify Failures**: Load balancers mark slow app servers as unhealthy and route traffic to the remaining healthy ones, overloading them further in a vicious cycle. [03:36], [03:45]
- **Cache Stampede Overwhelms DB**: A cache stampede happens when a cache entry expires and thousands of requests simultaneously regenerate the same data, hitting the database all at once and overwhelming it. [04:01], [04:16]
- **User Refreshes Fuel Death Loop**: Users refreshing pages create more requests, leading to more load, longer waits, and even more refreshes in a feedback loop that crashes systems in minutes. [06:50], [07:04]
Topics Covered
- Databases Buckle First Under Load
- Event Loops Block on CPU Work
- Cache Stampedes Overwhelm Databases
- Cascading Failures Create Feedback Loops
- Read Replicas Offload Database Pressure
Full Transcript
In 2011, Walmart's website crashed on Black Friday. Walmart, the largest retailer in the world, simply couldn't handle the traffic. And it is not just a Walmart problem. Recently, AWS and Cloudflare have suffered massive outages. So, the real question is, what actually happens when a website crashes under load? And how do companies like Amazon, who process over 12,000 orders per second during peak, keep their systems running? Let's break it down layer by layer.
A typical web application has several layers. At the front, you have a CDN, a content delivery network that serves static assets like images, CSS, and JavaScript from servers close to your users. Behind that, a load balancer distributes incoming traffic across multiple application servers. Those application servers run your actual code. Could be Node, Python, Go, Java, whatever your stack is. Your app server talks to a database, usually something like PostgreSQL or MySQL, for transactional data, and you have probably got a caching layer in there too, Redis or Memcached, to avoid hitting the database for every request. And then there are third-party services: payment processors, authentication providers, email services, analytics, and so on. On a normal day, traffic flows through the system smoothly. But on Black Friday, every single one of these layers becomes a potential failure point.
Let's trace through what actually breaks. The database is usually the first thing to buckle. Here is why: databases have a hard limit on concurrent connections, typically somewhere between 100 and 500, depending on your configuration and resources. When a user loads a page, your app might fire off 5, 10, maybe 20 database queries. Product info, user session, cart contents, inventory checks, pricing rules. Now, multiply that by thousands of concurrent users. You're looking at tens of thousands of queries trying to squeeze through a few hundred connection slots. When you hit the connection limit, new queries start queuing. Queries that normally take 50 milliseconds now take 5 seconds because they are waiting in line. And if the queue grows faster than it drains: timeouts, errors, your app can't function. But it gets worse. One slow query can cascade into a systemwide failure. Maybe it's a complex search without proper indexing. Maybe it's a reporting query that accidentally got triggered. That query holds its connection open longer, which means fewer connections for everything else. And this is why database connection pooling matters. This is why query optimization matters. And this is why read replicas exist: to spread the read load across multiple database instances.
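The pooling idea can be sketched in a few lines of Python. This is illustrative, not a production pool: `create_conn` stands in for a real driver call such as `psycopg2.connect`, and a real deployment would use a tool like PgBouncer or SQLAlchemy's built-in pool instead.

```python
import queue
import contextlib

class ConnectionPool:
    """Minimal pool sketch: a fixed set of connections is created once
    and reused, instead of opening a new one per request."""
    def __init__(self, create_conn, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(create_conn())

    @contextlib.contextmanager
    def connection(self, timeout=1.0):
        # Blocks until a connection is free; queue.Empty on timeout is
        # exactly the "queries start queuing" failure mode described above.
        conn = self._pool.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._pool.put(conn)  # hand it back for the next request

# Stand-in for a real driver connection (swap in psycopg2, etc.)
created = []
pool = ConnectionPool(lambda: created.append(1) or object(), size=3)

for _ in range(100):              # 100 "requests"...
    with pool.connection() as conn:
        pass                      # ...would run their queries here
print(len(created))               # 3 — only three connections ever opened
```

A hundred simulated requests share three connections instead of opening a hundred, which is the whole point: the database sees a small, steady connection count no matter how bursty the traffic is.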
Next layer up: your application servers. Whether you are running Node, Python, or anything else, your servers have finite capacity. They can only handle so many concurrent requests before they start choking. In Node, you have got the event loop. It's great at handling IO-bound operations. But if you are doing CPU-intensive work, image processing, complex calculations, JSON parsing on large payloads, you can block that event loop and grind everything to a halt. In a threaded model like Java or Go, each request typically gets its own thread. But each thread consumes memory. Spin up too many and you exhaust your RAM. And here's the cascading effect again. If your database is slow, your app servers are stuck waiting on database responses. Requests pile up. Memory usage climbs, response times balloon. Eventually, your load balancer starts seeing app servers as unhealthy. It routes traffic to the remaining healthy servers, which now have even more load, which makes them unhealthy. You see where this is going?
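The event-loop point can be shown with a small Python/asyncio analogue (the same principle applies to Node): run the CPU-heavy work in a worker via `run_in_executor` so the loop stays free to serve other coroutines. `cpu_heavy` is a made-up stand-in for image processing or parsing a huge payload.

```python
import asyncio

def cpu_heavy(n):
    # Stand-in for image processing or parsing a huge JSON payload.
    total = 0
    for i in range(n):
        total += i * i
    return total

async def main():
    loop = asyncio.get_running_loop()
    # Calling cpu_heavy() inline here would block the event loop and
    # stall every other request. Offloading it to a worker thread
    # keeps the loop free for IO-bound work.
    heavy = loop.run_in_executor(None, cpu_heavy, 500_000)
    served = 0
    while not heavy.done():
        await asyncio.sleep(0.001)   # the loop is still serving other work
        served += 1
    return await heavy, served

result, served = asyncio.run(main())
print(result == sum(i * i for i in range(500_000)))  # True
```

The `served` counter only advances because the loop is not blocked; with the inline call, it would stay at zero until the computation finished.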
This is where caching is supposed to save you. The idea is simple. Instead of hitting the database for every request, you store frequently accessed data in memory: Redis, Memcached, even application-level caching. A well-cached system can handle orders of magnitude more traffic than one hitting the database for everything. But caching has failure modes too. Cache stampede, also called thundering herd: your cache expires and suddenly a thousand requests all try to regenerate the same data simultaneously. They all hit the database at once, and the database gets overwhelmed.
Then there are cache invalidation bugs. You update a product price, but the old price is still cached. Now you're selling things at the wrong price on Black Friday. That's a bad day. Or the cache itself runs out of memory. It starts evicting entries. Hit rate drops. More requests fall through to the database. Database gets crushed. Caching isn't a magic bullet. It's another system that needs to be sized correctly and monitored. Let's move to the edge.
Your CDN. A CDN caches static content, images, JavaScript, CSS, on servers distributed across the world. A user in Tokyo gets served from a Tokyo edge server, not your origin server in Virginia. This takes enormous pressure off your infrastructure. If 80% of your page weight is static assets and they are all served from the CDN, your origin only handles 20% of the bandwidth. But CDNs aren't invincible. In 2023, Fastly had a 52-minute global outage that took down thousands of websites during peak shopping. Your infrastructure can be perfect and you still go down because your CDN failed. And then there is cache configuration. If your CDN isn't caching what it should, or if cache headers are misconfigured, requests that should be served from the edge are hitting your origin. You are paying for a CDN but not getting the protection.
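Those cache headers are just HTTP `Cache-Control` values your origin emits. A hedged sketch of one reasonable policy (the helper and the specific TTLs are illustrative choices, not universal rules): fingerprinted static assets cache for a year, HTML caches briefly at the edge, personalized pages never cache.

```python
def cache_headers(path: str, personalized: bool = False) -> dict:
    """Illustrative policy for what the origin should tell the CDN."""
    if personalized:
        # Carts, account pages: never cache these at the edge.
        return {"Cache-Control": "private, no-store"}
    if path.endswith((".js", ".css", ".png", ".jpg", ".woff2")):
        # Fingerprinted assets: cache for a year, never revalidate.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # HTML: let the edge cache briefly, then serve stale while revalidating.
    return {"Cache-Control": "public, max-age=60, stale-while-revalidate=300"}

print(cache_headers("/static/app.9f8e.js"))
print(cache_headers("/cart", personalized=True))
```

Get this wrong in either direction and you pay: too aggressive and you serve stale prices, too conservative and every request falls through to the origin.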
Now let's talk about the things completely outside your control. Payment processors: Stripe, PayPal, whatever you're using, they are handling Black Friday traffic from thousands of merchants simultaneously. If they slow down, your checkout slows down. If they go down, you can't process orders. Period. Auth providers: if you're using an external identity service and it's having latency issues, your users can't log in. Analytics, tag managers, third-party scripts: if any of these are slow to load, they can block your page rendering. Target's Black Friday crash a few years back was partly caused by payment processor integration failures combined with unexpected mobile traffic patterns. Their infrastructure was optimized for 60% mobile traffic. They got 78%. The point is, your system is only as reliable as its weakest dependency.
Now, here is what makes Black Friday crashes so brutal. These layers don't fail in isolation. One bottleneck creates pressure on the next. Database slows down. App servers wait on the database. Load balancer sees app servers as slow. Requests queue up. Memory usage spikes. Servers start failing health checks. Traffic gets routed to fewer servers. And those servers get even more overloaded. And the whole time, users are refreshing. Every refresh is another request. More requests mean more load. More load means longer waits. And longer waits mean more refreshes. It's a feedback loop that can take your system from a little slow to completely down in minutes. So how do companies like Amazon, Shopify, and Walmart actually survive this? It starts with load testing.
Before Black Friday, they are simulating traffic. Not just normal traffic, but 2x, 5x, 10x expected load. They want to know exactly where the system breaks. There's a difference between load testing and stress testing, by the way: load testing validates that your system handles expected traffic; stress testing finds your breaking point. You need both.
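A load test at its core is just concurrent requests plus latency bookkeeping. A toy sketch of that shape (real teams would use a dedicated tool like k6, Locust, or JMeter; `simulated_request` is a stand-in for an HTTP call against a staging environment, never production payment providers):

```python
import concurrent.futures
import time

def simulated_request():
    # Stand-in for an HTTP call to your own staging endpoint.
    time.sleep(0.005)
    return 0.005

def load_test(concurrency, total_requests):
    """Fire requests with a fixed worker count and report throughput."""
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: simulated_request(),
                                  range(total_requests)))
    elapsed = time.monotonic() - start
    return {"requests_per_second": total_requests / elapsed,
            "p50_latency": sorted(latencies)[len(latencies) // 2]}

report = load_test(concurrency=20, total_requests=100)
print(report["requests_per_second"] > 0)  # True
```

Ramping `concurrency` up until `p50_latency` degrades is the stress-testing half: you keep going past expected load until you find the ceiling.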
Some companies go further with chaos engineering. Netflix famously runs Chaos Monkey, a tool that randomly kills production servers to make sure the system can handle failures. If your system can survive random servers dying, it can survive a traffic spike. Then there is capacity planning. They are analyzing last year's traffic patterns, projecting growth, and provisioning infrastructure ahead of time. Not reactive, proactive.
Autoscaling helps, but it's not instant. Spinning up new servers takes time. If traffic spikes faster than you can scale, you're still in trouble. And that's why pre-warming infrastructure matters. And caching strategy. Big retailers cache aggressively. Not just static assets, but API responses, database queries, even rendered HTML fragments. The less work your origin servers do, the more traffic they can handle.
Let's get practical. What can you actually do? Starting with the database layer, read replicas are your friend. Your primary database handles all the writes: new orders, inventory updates, user signups. But reads? Those can go to replica databases that sync from the primary. Most applications are read-heavy, so this alone can dramatically reduce load on your main database.
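The read/write split can be sketched as a tiny routing wrapper. The names (`primary-db`, `replica-1`) and the keyword-based write detection are purely illustrative; real ORMs and proxies do this routing for you.

```python
import random

class RoutingConnection:
    """Sketch of read/write splitting: writes go to the primary,
    reads are spread across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql):
        # A real version would run the query on `target`; here we just
        # return which server would have been chosen.
        target = self.primary if self._is_write(sql) else random.choice(self.replicas)
        return target, sql

    @staticmethod
    def _is_write(sql):
        return sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE"}

db = RoutingConnection("primary-db", ["replica-1", "replica-2"])
print(db.execute("INSERT INTO orders VALUES (1)")[0])             # primary-db
print(db.execute("SELECT * FROM products")[0])                    # a replica
```

One caveat the sketch glosses over: replicas lag slightly behind the primary, so a read that must see the write it just made (like showing the user their new order) should still go to the primary.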
Connection pooling is essential. Instead of opening a database connection for every request, which is expensive, you maintain a pool of connections that get reused. Tools like PgBouncer for PostgreSQL can handle thousands of application requests with just a few dozen actual database connections. Index your queries. A query that scans a million rows to find one record is going to be slow. The same query with a proper index returns instantly. Run your slow query log, find the offenders, and fix them before Black Friday, not during. And query optimization in general. Are you fetching more data than you need? Are you making multiple round trips when one would do? These things don't matter much at low traffic. At high traffic, they are the difference between staying up and going down. At the application layer, horizontal scaling behind a load balancer. Make sure your app is stateless so any server can handle any request. Use async processing for anything that doesn't need an immediate response.
Cache aggressively at every layer. CDN for static assets. Redis for session data and frequent queries. Implement cache warming so you are not starting cold on Black Friday morning. A misconfigured CDN is almost worse than no CDN at all, because you're paying for it but not getting the protection. Make sure your cache headers are correct. Use a multi-CDN strategy if you can afford it. If one goes down, traffic fails over to another.
And finally, third-party services. You can't control whether an integrated third party such as Stripe is having a bad day, but you can control how your system responds to it. So have fallbacks where possible. Can your checkout flow gracefully degrade if analytics fail to load? Can you queue your orders if payment processing is slow? Implement circuit breakers. If a service is failing, stop calling it temporarily instead of letting failures cascade. After a timeout, try again.
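A circuit breaker is small enough to sketch in full. This is an illustrative in-process version (libraries like resilience4j or pybreaker provide hardened implementations); `flaky` stands in for a failing payment-provider call, and the fallback here is the "queue the order" idea from above.

```python
import time

class CircuitBreaker:
    """Sketch: after `max_failures` consecutive errors, stop calling
    the dependency for `reset_after` seconds and use the fallback."""
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None
        self.clock = clock

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fail fast, don't cascade
            self.opened_at = None      # timeout passed: try again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit
        return result

def flaky():
    raise ConnectionError("payment provider down")

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
results = [breaker.call(flaky, lambda: "queued") for _ in range(5)]
print(results)                        # ['queued', 'queued', 'queued', 'queued', 'queued']
print(breaker.opened_at is not None)  # True: circuit is open, dependency rests
```

After two failures the breaker opens, so calls three through five never touch the failing provider at all. That's the point: your threads stop piling up behind a dead dependency.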
And finally, testing. Load test your system regularly, not just before Black Friday. You should know your breaking point at all times. Stress test to find your ceiling: how much traffic can you handle before response times degrade? Set up monitoring and alerting. Dashboards are great, but dashboards don't wake you up at 3:00 a.m. when something is wrong. You need alerts on error rates, response times, CPU usage, memory usage, database connections. You want to know about problems before your customers start tweeting about them. Now, if you want to go deeper, I teach all of these concepts in detail in my system design master course. Link is in the description if you want to check that out.
For a lot of teams, especially those running WordPress, that's a big ask. If your WordPress site can't handle a traffic spike on Black Friday, it's probably not a code problem, it's a hosting problem. That's why I want to tell you about Kinsta. They are sponsoring this video and offer managed hosting for WordPress. And here is why that matters if you're building anything serious. First, speed. Switching to Kinsta could make your site run up to 200% faster. We are talking about 37 data centers worldwide, over 300 CDN locations, and edge caching built right in. Your users notice, and so does Google. Second, security that's actually enterprise-grade. SOC 2 compliance, DDoS protection, isolated container technology, and up to a 99.99% uptime SLA. And when something does go wrong, you get real WordPress experts available 24/7. Not chatbots, actual humans who can solve problems in minutes. Their Black Friday offer runs until December 2nd. New customers get 6 months free on annual plans, or 50% off the first 6 months on monthly. Plus, there's a 30-day money-back guarantee. And if you're worried about the migration, they handle the entire thing for you. No technical expertise required. Check the link in the description to get started.