System Design Deep Dives
Building Scalable Infrastructure from First Principles

In the world of distributed systems, the gap between "it works on my laptop" and "it works at scale" is filled with hard-earned lessons about concurrency, consistency and clever architectural patterns. This collection of system design deep dives explores how real-world platforms, from YouTube's view counter processing billions of events to Cricbuzz serving millions during live matches, solve fundamental challenges that every senior engineer eventually faces. Whether you're designing a message broker on SQL, building a distributed task scheduler, or implementing flash sales that handle 100,000 concurrent buyers fighting over 1,000 items, the patterns here reveal a surprising truth: the most robust solutions often aren't the most exotic. They're the ones that deeply understand database transactions, leverage partitioning intelligently and embrace eventual consistency where appropriate. Each section breaks down a specific system, recent search history, text commentary, task scheduling, search engines, view counting, message queues and flash sales, showing not just what to build, but why each architectural decision matters when milliseconds and correctness both count.
Designing Recent Searches
Understanding the Problem Through Metrics
As engineers, we don't always receive explicit requirements. Instead, we get behavioral data that tells the real story:
50% of users tap on Search Bar within the first 5 seconds – This signals an urgent need for pre-loaded, instantly available data
30% of all searches happens through recent searches – This represents massive write volume that must be captured reliably
These metrics reveal a dual challenge: we need a system that's both extremely read-optimized for instant access and robust enough to handle high-velocity writes from one of the most frequently hit APIs in any application.
Key Architectural Considerations
Before diving into implementation, we need to answer fundamental questions that will shape our entire design:
Bounded vs. Unbounded Data: Do we store search history indefinitely, or is there a natural limit? Recent searches are inherently bounded – users care about their last 10-20 queries, not their entire history.
Latency Requirements: With 50% of users accessing search within 5 seconds, we're operating in the sub-100ms response time territory. Any database round-trip becomes a bottleneck.
Stale Data Tolerance: If a user searches on their phone, how quickly must it appear on their laptop? Cross-device consistency matters, but perfect real-time sync may not be worth the architectural complexity.
Data Freshness: The 30% conversion rate through recent searches tells us that stale data directly impacts user behavior. A search made seconds ago must appear immediately.
Storage Layer: Choosing the Right Database
For storing search queries, we need to evaluate our access patterns:
No relational requirements (just user ID and query text)
Massive write volume (search is one of the most hit APIs)
Always accessed at the user level (never cross-user queries)
High cardinality data (millions of users, each with unique search patterns)
These characteristics point us toward partitioned NoSQL databases like MongoDB or Elasticsearch. Partitioning by user ID ensures that all queries for a user live in the same shard, making retrieval efficient when needed.
The Core Design Challenge: Sync vs. Async Writes
This decision fundamentally shapes our architecture. Let's examine both paths:
Asynchronous Writes (via SQS/Kafka):
Pros: API response times remain fast, system is decoupled and extensible
Cons: Introduces delay between search action and persistence, eventual consistency across devices
Synchronous Writes:
Pros: Zero delay, immediate consistency
Cons: Database becomes a bottleneck if it can't handle the load, increased API latency
Given that 30% of searches use recent searches, any delay in persistence directly impacts user experience. A user who searches for "flights to Paris" and immediately taps the search bar again expects to see that query. This drives us toward synchronous writes, but with a critical optimization.
The Solution: Write-Through Caching with Redis
The breakthrough insight comes from recognizing that our NoSQL database has a fundamental limitation: records for the same user are scattered across the partition. Listing a user's recent searches requires scanning and sorting, which is expensive at scale.
Instead, we implement a dual-write strategy:
Synchronous write to Redis (in-memory store)
Synchronous write to MongoDB (persistent store)
Redis Data Structure
Key: user:{user_id}:recent_searches
Value: ["flights to paris", "best restaurants NYC", "weather tomorrow", ...]
Type: List (capped at 10 most recent)
This pre-computed structure delivers:
Sub-millisecond read latency (pure memory lookup, no computation)
Automatic cross-device consistency (all devices read from same Redis key)
Natural data expiry (LTRIM keeps only recent 10 queries)
The Write Path
When a user submits a search query:
POST /search?q=flights+to+paris
1. Write to Redis: LPUSH user:12345:recent_searches "flights to paris"
LTRIM user:12345:recent_searches 0 9
2. Write to MongoDB: INSERT {user_id: 12345, query: "flights to paris", timestamp: now()}
3. Return search results
Both writes happen synchronously, but Redis is so fast (sub-millisecond) that the added latency is negligible compared to the search operation itself.
Handling the Cold Start Problem
Here's where our initial metric becomes critical: 50% of users tap search within 5 seconds. If the Redis cache is empty (server restart, cache eviction, new device), fetching from MongoDB ruins the experience.
Solution: Pre-warming Strategy
When a user logs in or opens the app:
Check if
user:{user_id}:recent_searchesexists in RedisIf missing, asynchronously fetch last 10 queries from MongoDB
Populate Redis in background
Set TTL (e.g., 7 days) to auto-expire inactive users
This ensures that by the time most users tap search, their data is already loaded.
Transaction Outbox Pattern for Reliability
Even with synchronous writes, we need durability guarantees. What if Redis accepts the write but MongoDB fails?
We implement a transaction outbox pattern:
Write to MongoDB includes an
outboxcollection entryBackground worker reads outbox and ensures Redis sync
Outbox entry deleted only after confirming Redis update
This provides eventual consistency guarantees without sacrificing performance.
Redis Persistence: AOF Configuration
Redis is in-memory, but we need crash recovery. Append-Only File (AOF) persistence gives us the right balance:
appendonly yes
appendfsync everysec
This configuration:
Writes every operation to disk log
Syncs to disk every second (not every write)
Recovers data with at most 1 second of loss
Maintains performance (doesn't block on disk I/O)
Cost Optimization Strategies
Running Redis at scale gets expensive. We optimize through data lifecycle management:
1. Bounded Cache Size
Redis stores only 10 most recent queries per user
MongoDB stores complete history
90% of users never scroll past first 5 suggestions
2. Data Archival
Queries older than 6 months move to cold storage (S3)
Saves 70-80% on database costs
Historical queries rarely accessed (product decision: show with higher latency if needed)
3. TTL-Based Eviction
Inactive users (no search in 30 days) auto-expire from Redis
Re-warm on next login
Keeps Redis memory footprint manageable
Async Write Alternative (Extensibility Path)
While we chose synchronous writes for immediate consistency, an async architecture offers advantages for future features:
POST /search → Kafka Topic: "search_events"
↓
Consumer Group
↓ ↓
MongoDB Redis
↓
Analytics Pipeline (ML for personalization)
This pattern shines when you need to:
Feed search data to recommendation engines
Run real-time analytics on search trends
A/B test different ranking algorithms
The tradeoff is accepting a few hundred milliseconds of delay before searches appear in "recent" lists.

The Final Architecture
Our design delivers on all key requirements:
Sub-100ms read latency: Pure Redis lookups, no computation
Zero staleness: Synchronous writes ensure immediate consistency
Cross-device sync: All devices read from same Redis key
High write throughput: Redis handles 100K+ writes/sec per instance
Cost efficient: Bounded cache + archival keeps expenses predictable
Reliable: AOF persistence + outbox pattern prevent data loss
The result is a system where tapping the search bar feels instant, recent searches are always fresh and the architecture scales to millions of users searching billions of times per day.
Designing Cricbuzz's Text Commentary
The Scale of the Challenge
Cricbuzz is India's 8th most visited website. During a high-profile cricket match (IPL finals, India vs Pakistan), millions of fans simultaneously refresh their screens, desperately waiting for the next ball's commentary. A single match can generate:
10-15 million concurrent users during peak moments
Commentary updates every 20-30 seconds (each ball bowled)
Sustained traffic for 3-4 hours per match
Multiple matches running simultaneously during tournaments
Designing text commentary isn't just about storing messages, it's about delivering real-time updates to millions while keeping infrastructure costs reasonable.
Core Requirements
1. Real-time Updates: When a commentator types "SIX! Kohli sends it sailing over long-on," millions of users must see it within seconds.
2. Cost Efficiency: Serving millions of requests per second can bankrupt your cloud bill if not architected carefully.
3. Excellent User Experience: Commentary should feel live, load instantly and work seamlessly even during traffic spikes.
4. Realistic Durability: Commentary is typed by humans in a control room. If something fails, they can retype. We don't need bank-grade transaction guarantees.
Storage Architecture: Hot and Cold Data
Commentary follows a natural lifecycle that we can exploit for cost optimization:
Hot Data (Last 20-30 comments): Users actively refresh to see the latest updates. This needs to be lightning-fast.
Cold Data (Everything older): Users might scroll through match history, but with much lower frequency. Speed is less critical.
This insight drives our storage strategy:
Primary Storage: Partitioned NoSQL Database
We use a partitioned database (MongoDB, Cassandra) with schema:
{
match_id: "IND_vs_PAK_2025_final",
commentary_id: "10.3", // Over.Ball
timestamp: 1704067200,
text: "SIX! Kohli smashes it over long-on!",
commentator: "Harsha Bhogle"
}
Partitioning by match_id ensures all commentary for a match lives in the same shard, making retrieval efficient. The database serves as our source of truth and handles:
Long-term storage (entire match history)
Pagination for users scrolling through old commentary
Analytics and post-match analysis
However, hitting the database for every refresh from millions of users would be catastrophic for both latency and cost.
Caching Layer: Redis for Hot Commentary
The breakthrough realization: only the latest 20-30 comments matter for 95% of users. We aggressively cache these in Redis:
Key: match:{match_id}:live_commentary
Type: Sorted Set (scored by timestamp)
Value: [
{score: 1704067200, text: "10.6: SIX! Kohli finishes the over in style"},
{score: 1704067180, text: "10.5: DOT ball, good length from Shaheen"},
{score: 1704067160, text: "10.4: FOUR! Edge flies to third man"},
...
]
Why Sorted Set?
Automatic ordering by timestamp
ZRANGE operations to fetch latest N comments
O(log N) insertion performance
Built-in score-based querying
This architecture means:
Read path: 99% of requests hit Redis (sub-millisecond response)
Database load: Only pagination requests for historical commentary
Cost: Massive savings on database read IOPS
The Write Path: Commentator Control Room
Here's the crucial insight about durability: commentary is manually typed by humans sitting in a control room. If a write fails, they can simply hit "Save" again. This changes everything about our consistency requirements.
Direct Redis Updates
When a commentator submits new commentary:
POST /commentary
{
"match_id": "IND_vs_PAK_2025_final",
"text": "SIX! Kohli sends it sailing!",
"ball": "10.3"
}
1. Write directly to Redis:
ZADD match:IND_vs_PAK_2025_final:live_commentary 1704067200
'{"ball":"10.3", "text":"SIX! Kohli sends it sailing!"}'
2. Trim to keep only latest 30:
ZREMRANGEBYRANK match:IND_vs_PAK_2025_final:live_commentary 0 -31
3. Async write to MongoDB (fire and forget)
INSERT into commentary_collection {...}
4. Return 200 OK to commentator
Why this works:
Redis write completes in < 5ms
Commentator gets instant feedback
If MongoDB write fails, it's okay, commentator will notice commentary missing and retry
No complex distributed transactions needed
Manual Retry as Durability Strategy
Instead of complex retry logic, we embrace human oversight:
Commentator UI shows the last 10 comments posted
If a comment doesn't appear after clicking "Save," they click again
Simple, effective and matches actual operational workflow
Saves massive engineering complexity on automatic retry mechanisms
The Read Path: Short Polling for Live Updates
Real-time updates face a classic tradeoff: WebSockets vs. Polling.
WebSockets: Persistent connections, true push-based updates
Pros: Instant delivery, efficient after connection established
Cons: Millions of concurrent connections, complex infrastructure, expensive at scale
Short Polling: Clients request updates every N seconds
Pros: Stateless, cacheable, simpler infrastructure
Cons: Slight delay, more requests
For cricket commentary, short polling wins because:
Commentary updates every 20-30 seconds (each ball), not every millisecond
Polling every 5-10 seconds feels real-time for this use case
Redis handles millions of reads/second easily with replication
Much cheaper than maintaining millions of WebSocket connections
Optimized Polling Implementation
GET /api/match/{match_id}/commentary/latest?since=1704067180
Response:
{
"commentary": [
{"ball": "10.6", "text": "SIX! Kohli finishes the over", "ts": 1704067200},
{"ball": "10.5", "text": "DOT ball from Shaheen", "ts": 1704067190}
],
"next_poll": 5000 // milliseconds
}
Key optimizations:
sinceparameter: Only return new commentary, reducing payload sizenext_pollfield: Server controls polling frequency (can slow down during breaks)HTTP caching:
ETagheaders let CDN cache when no new commentary exists
CDN Edge Caching
During overs where no balls are bowled (drinks break, DRS review), millions of users still poll. We cache at the edge:
Cache-Control: public, max-age=3
ETag: "match_123_commentary_10.5"
When commentary doesn't change:
CDN returns 304 Not Modified
No request hits our origin servers
Massive cost savings during idle periods
Cost Optimization: Data Lifecycle Management
A single IPL season generates millions of commentary entries. Storing all of it in hot storage would be expensive. We implement aggressive archival:
Tiered Storage Strategy
Tier 1 - Redis (Live Match + 1 hour post-match)
Latest 30 comments per live match
Sub-millisecond access
Auto-expire 1 hour after match ends
Cost: ~$500/month for 50 concurrent matches
Tier 2 - MongoDB (Last 30 days)
Full commentary for recent matches
Used for "recent matches" section
SSD-backed, fast reads
Cost: ~$2000/month
Tier 3 - S3 Archive (Historical)
Matches older than 30 days
Compressed JSON files
Accessed only for "match archives" feature
Cost: ~$50/month for entire season
Archive Process (runs daily):
1. Query MongoDB for matches older than 30 days
2. Export commentary to JSON, gzip compress
3. Upload to S3: s3://cricbuzz-archives/2025/IPL/match_123.json.gz
4. Delete from MongoDB
5. Update match metadata: commentary_location = "s3://..."
This tiered approach reduces storage costs by 90% while maintaining excellent UX for active matches.
Handling Traffic Spikes: Read Scaling
When Kohli walks out to bat, traffic can spike 10x in seconds. Our architecture handles this through:
Redis Read Replicas
Redis Cluster:
- 1 Primary (writes)
- 5 Read Replicas (reads)
- Load balancer distributes GET requests
Each replica handles 100K requests/second, giving us 500K reads/sec capacity. Replication lag is < 10ms, acceptable for commentary that updates every 30 seconds.
Application-Level Load Balancing
API Servers (Auto-scaling group):
- Baseline: 20 instances
- Scale trigger: CPU > 60% for 30 seconds
- Scale up: Add 10 instances (takes ~60 seconds)
- Scale down: Remove instances when CPU < 30% for 5 minutes
Short polling gives us time to scale, unlike WebSockets where connections are already established, polling lets new instances immediately share load.

Putting It All Together: User Journey
User opens Cricbuzz app during IND vs PAK match:
App requests:
GET /api/match/123/commentary/latestLoad balancer routes to API server
API server queries Redis:
ZRANGE match:123:live_commentary -30 -1Redis returns latest 30 comments in 2ms
API server returns JSON to app in 15ms total
App displays commentary, sets 5-second polling timer
Commentator types "SIX! Kohli smashes it!":
Commentator hits "Save" in control room UI
Request hits:
POST /commentaryAPI writes to Redis in 3ms
Async worker writes to MongoDB (no waiting)
Return 200 OK to commentator in 5ms
Commentator sees confirmation, types next comment
Within 5 seconds:
Millions of users' apps poll for updates
Requests hit CDN edge (many cached if no new commentary)
New commentary detected: CDN forwards to origin
Origin serves from Redis (cache hit)
Users see "SIX! Kohli smashes it!" appear
After match ends:
Redis entries expire after 1 hour (auto-cleanup)
Full commentary remains in MongoDB for 30 days
Daily job archives to S3 after 30 days
Historical commentary still accessible via "Match Archive" at slower speed
The Architecture's Elegant Properties
This design delivers on every requirement:
Real-Time Feel: 5-second polling + Redis caching = commentary appears almost instantly
Cost Efficient:
Short polling cheaper than WebSockets at scale
Tiered storage saves 90% on historical data
CDN caching reduces origin traffic by 70%
Excellent UX:
Sub-20ms API response times
No loading spinners during polls
Graceful degradation if Redis slow (fallback to MongoDB)
Realistic Durability:
Manual retry by commentators (humans in the loop)
No over-engineered distributed transactions
Simpler codebase, fewer failure modes
The result: millions of cricket fans experience the thrill of live commentary, ball by ball, without Cricbuzz's infrastructure team experiencing the thrill of a crashing production system.
Designing a Super Simple Text-Based Search Engine
The Fundamental Problem
At its core, a search engine solves one problem: given an information need, find the most relevant documents from a corpus. Whether you're searching through JSON files, web pages, or text documents, the challenge remains the same, how do you quickly surface what matters?
The naive approach seems obvious: when someone searches for "pizza," scan through every document one by one and check if it contains the word "pizza." Return all matches. Done.
Except this is O(N) complexity and when your corpus contains millions of documents, this becomes impossibly slow. A search that takes 30 seconds isn't a search, it's a coffee break.
Making Search Efficient: The Inverted Index
The breakthrough that makes modern search possible is indexing. Instead of checking documents at search time, we pre-compute which documents contain which words.
Building the Inverted Index
Think of it like the index at the back of a textbook. Instead of reading the entire book to find mentions of "photosynthesis," you flip to the index and see: "photosynthesis: pages 47, 89, 132."
For our search engine, we build an inverted index:
Corpus:
doc1: "The best pizza in New York"
doc2: "New York style bagels"
doc3: "Chicago deep dish pizza"
Inverted Index:
"pizza" → [doc1, doc3]
"new" → [doc1, doc2]
"york" → [doc1, doc2]
"chicago" → [doc3]
"bagels" → [doc2]
"best" → [doc1]
...
Now when someone searches "pizza," we don't scan documents, we simply look up the word in our index and instantly retrieve [doc1, doc3]. This transforms our O(N) problem into an O(1) lookup.
The Indexing Process
For each document in corpus:
1. Tokenize: "The best pizza" → ["the", "best", "pizza"]
2. Normalize: ["the", "best", "pizza"] → ["best", "pizza"] (remove stopwords)
3. For each token:
- If token exists in index: append document ID
- If token doesn't exist: create new entry [document ID]
This preprocessing step is expensive (you process every document once), but it makes search instantaneous. The tradeoff is obvious: spend time once during indexing to save time on every search forever.
Beyond Simple Matching: Relevance Ranking
But wait, simply returning all documents that contain "pizza" isn't enough. When a user searches, they want the most relevant documents first, not just any document that happens to mention the word.
Weighted Scoring: Not All Fields Are Equal
A document where "pizza" appears in the title is probably more relevant than one where it's buried in a footnote. We assign different weights to different fields:
Document structure:
{
title: "Best Pizza Places in NYC",
description: "A comprehensive guide...",
tags: ["food", "pizza", "restaurants"],
body: "When it comes to pizza, New York..."
}
Scoring weights:
- title match: 10x
- tags match: 5x
- description match: 3x
- body match: 1x
When calculating relevance:
Query: "pizza"
doc1: "pizza" in title (10 points) + "pizza" in body (1 point) = 11 points
doc2: "pizza" in body twice (2 points) = 2 points
Result: Return doc1 first
This simple weighting dramatically improves result quality. Users want authoritative documents where the search term is central, not tangentially mentioned.
Handling Real-World Messiness: Fuzzy Search
Users make mistakes. Constantly. They type "pizzza" instead of "pizza," "resturant" instead of "restaurant." A search engine that only matches exact strings is frustratingly brittle.
Edit Distance and BK-Trees
Edit distance (Levenshtein distance) measures how many single-character edits (insertions, deletions, substitutions) separate two words:
"pizza" → "pizzza": 1 edit (insert 'z')
"pizza" → "piza": 1 edit (delete 'z')
"pizza" → "pitza": 1 edit (substitute 'z' → 't')
Naive fuzzy search would calculate edit distance against every word in our vocabulary, once again hitting O(N) complexity. Instead, we use a BK-tree (Burkhard-Keller tree), a data structure that organizes words by edit distance, allowing us to prune the search space.
When a user searches "pizzza," we:
Look up in BK-tree with tolerance of 1-2 edits
Find candidate words: ["pizza", "piazza"]
Search our inverted index for both candidates
Return combined results
This handles typos gracefully without sacrificing speed.
The Pronunciation Problem: Phonetic Search
Here's a fascinating insight from real user behavior: people remember how words sound, not how they're spelled. This leads to a specific pattern of "typos" that fuzzy search can't solve.
Consider someone searching for Arnold Schwarzenegger:
"Arnold Swarzeneger"
"Arnold Swazeneger"
"Arnold Schwarzeneger"
These aren't random typos, they're phonetic approximations. The user knows how to pronounce the name but can't remember the exact spelling. Standard fuzzy search fails because the edit distance is too large.
Phonetic Algorithms: Metaphone and Soundex
Instead of indexing the literal spelling, we index by phonetic encoding, how the word sounds.
Soundex (simpler, older):
"Schwarzenegger" → S-623
"Swarzeneger" → S-623
"Swazeneger" → S-623
Metaphone (more accurate):
"Schwarzenegger" → SHWRTSNKR
"Swarzeneger" → SWRTSNKR
"Swazeneger" → SWSNKR
When building our index, we create a parallel phonetic index:
Inverted Index:
"schwarzenegger" → [doc_arnold_bio]
Phonetic Index (Metaphone):
"SHWRTSNKR" → [doc_arnold_bio]
Now when someone searches "Swarzeneger," we:
Try exact match: fails
Convert to phonetic: "SWRTSNKR"
Search phonetic index: matches "SHWRTSNKR" (close enough)
Return Arnold Schwarzenegger results
This handles a whole class of spelling mistakes that edit-distance approaches miss entirely.
Synonym Expansion: Understanding Intent
Users express the same concept in different ways. Someone searching "home" might want documents about "houses." Someone searching "car" might want "automobile" results.
Building a Synonym Map
Synonym groups:
["home", "house", "residence", "dwelling"]
["car", "automobile", "vehicle"]
["happy", "joyful", "cheerful"]
At index time, we expand documents:
Original document: "Beautiful house for sale"
Indexed under:
- "house" → [doc1]
- "home" → [doc1] (synonym)
- "residence" → [doc1] (synonym)
- "dwelling" → [doc1] (synonym)
At search time, we expand queries:
User searches: "home"
Expanded query: "home OR house OR residence OR dwelling"
Results: All documents mentioning any synonym
This dramatically improves recall, the percentage of relevant documents we actually find.
Spell Correction: Did You Mean...?
Typos that are too far from any real word need correction before search. When someone searches "HOUPE," we want to suggest "HOUSE."
Building a Vocabulary-Based Corrector
1. Build vocabulary from corpus (all unique words)
2. For misspelled query:
- Calculate edit distance to all vocabulary words
- Find closest matches (edit distance ≤ 2)
- Rank by word frequency in corpus
3. Suggest top match: "Did you mean: house?"
Example:
Query: "HOUPE"
Candidates (within 2 edits):
- "house" (edit distance: 2, frequency: 15,000)
- "coupe" (edit distance: 2, frequency: 500)
- "hopped" (edit distance: 2, frequency: 200)
Suggestion: "Did you mean: house?"
Frequency matters because common words are more likely to be intended than rare ones.
Query Segmentation: The Missing Space Problem
Users forget spaces. They search "mcdonalds" when they mean "mc donalds," or "newyork" instead of "new york." This breaks our indexing because we indexed "new" and "york" separately.
N-gram Based Segmentation
We pre-compute common multi-word phrases in our corpus:
Bigrams (2-word phrases):
"new york" → appears 50,000 times
"san francisco" → appears 30,000 times
"mc donalds" → appears 5,000 times
When we receive "newyork," we:
Try exact match: fails
Attempt segmentation:
"n ewyork" (unlikely)
"ne wyork" (unlikely)
"new york" (found in bigram index, high frequency!)
Search for "new york" instead
This requires maintaining a dictionary of common phrases, but it handles a surprisingly common user error pattern.
Putting It All Together: The Search Pipeline
When a user types a query and hits enter, here's what happens:
Query: "pizzza in newyork"
1. Tokenization: ["pizzza", "in", "newyork"]
2. Stopword removal: ["pizzza", "newyork"]
3. Spell correction:
- "pizzza" → "pizza" (edit distance: 1)
- "newyork" → "new york" (segmentation)
4. Synonym expansion:
- "pizza" → ["pizza", "pie"] (if synonyms exist)
5. Phonetic fallback (if needed):
- Convert to Metaphone codes
6. Index lookup:
- Fetch document IDs for each term
- "pizza" → [doc1, doc3, doc7, doc12]
- "new" → [doc1, doc2, doc5]
- "york" → [doc1, doc2, doc5]
7. Intersection & Scoring:
- Documents containing all terms: [doc1]
- Apply field weights (title, body, etc.)
- Calculate relevance scores
8. Ranking & Return:
- Sort by relevance score
- Return top N results
The entire pipeline completes in milliseconds because the expensive work (indexing) happened ahead of time.
Why This Scales
The beauty of this architecture is that search time is independent of corpus size:
Index lookup: O(1) hash table access
BK-tree fuzzy search: O(log N) with pruning
Phonetic lookup: O(1) after encoding
Synonym expansion: O(k) where k is number of synonyms (small constant)
Whether you have 1,000 documents or 1,000,000,000 documents, individual searches remain fast because you're not scanning documents, you're looking up pre-computed indexes.
The tradeoff is storage: your inverted index can be 30-50% the size of your corpus. But disk is cheap and user patience is not.
From Simple to Sophisticated
This "super simple" search engine actually incorporates sophisticated techniques:
Inverted indexing transforms O(N) scans into O(1) lookups
Weighted scoring surfaces the most relevant results first
Fuzzy search handles typos gracefully
Phonetic search solves the "I know how it sounds" problem
Synonym expansion understands semantic similarity
Spell correction prevents dead-end searches
Query segmentation handles missing spaces
Each technique addresses a real pattern of user behavior, making search feel effortless even though complex machinery runs behind the scenes. The result: users type what they mean (however they mean it) and they get what they want. That's the magic of a well-designed search engine.
Designing a Distributed Task Scheduler
The Problem We're Solving
Task scheduling is everywhere in modern systems. You need to send a reminder email in 2 hours. Run a data pipeline every night at 3 AM. Process payment reconciliation every Monday morning. Archive old records on the first of every month.
Building a distributed task scheduler means solving a deceptively simple problem: execute tasks at specific times, reliably, at scale. Systems like Dkron and AWS CloudWatch Events exist because this problem is harder than it looks.
Core Requirements
Our scheduler must support two fundamental scheduling patterns:
1. Fixed-Time Tasks: Execute once at a specific moment
"Send password reset email at 2:45 PM today"
"Start the sale at midnight on Black Friday"
"Generate monthly report at 12:01 AM on the 1st"
2. Recurring Tasks (CRON): Execute on a schedule
"Back up database every day at 2 AM"
"Send weekly newsletter every Monday at 9 AM"
"Check payment status every 15 minutes"
Critical Constraint: 30-Second SLA
If a task is scheduled for 10:01:00 AM, it must start executing by 10:01:30 AM. This isn't about how long the task takes to complete, it's about how quickly we pick it up and start processing. This SLA shapes every architectural decision we make.
Granularity: Minute-Level
We don't support second-level precision. Tasks scheduled for "10:01:15 AM" and "10:01:45 AM" will both be treated as "10:01:00 AM." This simplification drastically reduces complexity without sacrificing most real-world use cases.
Architectural Components: Store, Pick, Execute
Every task scheduler breaks down into three fundamental operations:
Store: Persist task definitions and scheduling information
Pick: Identify which tasks are ready to execute
Execute: Actually run the task logic
The naive approach puts all three in a single process. But this doesn't scale and here's why.
Why Separate Pickers from Executors?
Imagine you're running a pizza restaurant. You could have one person who takes orders, makes pizzas and delivers them. But you quickly realize this doesn't work:
Taking orders is fast (seconds)
Making pizzas is slower (minutes)
Delivering is slowest (20+ minutes)
If the same person does everything, they're blocked on deliveries while orders pile up. You separate roles: order-taker, cook, delivery driver. Each scales independently.
Task scheduling has the same dynamics:
Picking Tasks (Fast)
Query database: "which tasks are due now?"
Mark task as "picked" so others don't grab it
Put task on a queue for execution
Time: < 100ms per task
Executing Tasks (Slow)
Send HTTP request to external API (might take 5-10 seconds)
Process large dataset (might take minutes)
Wait for third-party service response
Time: seconds to minutes per task
If the same machine does both, picking stops while execution runs. Tasks accumulate, missing their SLA. By separating concerns:
Pickers: Lightweight, high-throughput machines that continuously scan for due tasks
Executors: Bulky machines that do the heavy work, scaled proportionally to load
You might run 3 picker instances but 50 executor instances, matching the different performance characteristics.
Storage: The Tasks Table
We use a relational database (MySQL, PostgreSQL) to store task definitions. The schema captures everything we need:
Table: tasks
+---------------+------------------+
| id | INT PRIMARY KEY |
| task | TEXT | ← Task payload/definition
| scheduled_at | TIMESTAMP | ← When to execute (for fixed tasks)
| picked_at | TIMESTAMP | ← When a picker claimed it
| started_at | TIMESTAMP | ← When execution began
| ended_at | TIMESTAMP | ← When execution completed
+---------------+------------------+
Why Relational?
ACID transactions for atomic picking (critical for preventing duplicate execution)
Rich querying with WHERE clauses and indexes
No sharding needed for typical workloads (millions of tasks)
Simple operational model
Let's trace through the lifecycle of a task:
Task Registration (user schedules a task):
INSERT INTO tasks (task, scheduled_at)
VALUES ('send_email:user_123', '2026-01-05 10:01:00');
At this point: scheduled_at is set, all other timestamps are NULL.
The Picking Process: Finding Ready Tasks
Pickers run on a continuous loop, typically every few seconds, looking for tasks that are due. The query is deceptively simple but critical:
SELECT * FROM tasks
WHERE scheduled_at <= (NOW() + INTERVAL 2 MINUTE)
AND picked_at IS NULL
ORDER BY scheduled_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
Let's break down each clause:
scheduled_at <= (NOW() + INTERVAL 2 MINUTE)
This looks 2 minutes into the future, not just at tasks due right now. Why? It gives us a buffer to handle the entire pick-and-queue workflow. If it's 10:00:00 AM, we fetch tasks scheduled up to 10:02:00 AM.
This headroom is crucial for meeting our 30-second SLA. By the time we pick, queue and an executor pulls from the queue, those "future" tasks will be due.
AND picked_at IS NULL
Only fetch tasks that haven't been claimed yet. Once a picker sets picked_at, that task is off-limits to other pickers. This prevents duplicate execution.
ORDER BY scheduled_at ASC
Process tasks in chronological order. Tasks scheduled for 10:01:00 AM should execute before tasks scheduled for 10:05:00 AM. FIFO semantics match user expectations.
LIMIT 10
Batch size for processing. Too small (1) means excessive database round trips. Too large (1000) means one slow picker holds a huge batch while others sit idle. 10-50 is a sweet spot.
FOR UPDATE SKIP LOCKED
This is the magic that makes distributed picking safe:
FOR UPDATE: Lock the selected rows for the duration of the transactionSKIP LOCKED: If another picker already locked some rows, skip them instead of waiting
This means multiple pickers can run simultaneously without blocking each other. Picker 1 might grab tasks 1-10, Picker 2 grabs 11-20, all without conflicts.
Atomic Pick-and-Mark
After selecting tasks, we must atomically mark them as picked:
UPDATE tasks
SET picked_at = NOW()
WHERE id IN (101, 102, 103, 104, 105, 106, 107, 108, 109, 110);
This happens in the same transaction as the SELECT. The transaction flow:
BEGIN TRANSACTION;
1. SELECT ... FOR UPDATE SKIP LOCKED
2. UPDATE tasks SET picked_at = NOW() WHERE id IN (...)
COMMIT;
If the picker crashes between SELECT and UPDATE, the transaction rolls back and the tasks remain unpicked. No orphaned state.
Once committed, picked_at is set and those tasks won't be selected by other pickers.
Pushing to the Execution Queue
After marking tasks as picked, we push them to a message queue (SQS, RabbitMQ):
For each task in batch:
queue.push({
task_id: 101,
payload: "send_email:user_123",
scheduled_at: "2026-01-05 10:01:00"
})
Why a Message Queue?
The queue decouples pickers from executors:
Buffering: If executors are momentarily overwhelmed, tasks wait in the queue rather than backing up in the database
Load balancing: Multiple executors pull from the same queue, automatically distributing work
Retry semantics: If an executor crashes mid-task, the message becomes visible again for retry (though we're designing without retries, the queue still provides crash recovery)
At this point, the picker's job is done. It moves on to the next batch.
The Execution Phase
Executors are worker processes that pull tasks from the queue and actually run them:
Executor process (continuous loop):
1. msg = queue.pull() // Blocks until task available
2. task = fetch_from_db(msg.task_id)
3. UPDATE tasks SET started_at = NOW() WHERE id = task.id
4. execute_task_logic(task.payload)
5. UPDATE tasks SET ended_at = NOW() WHERE id = task.id
6. queue.acknowledge(msg) // Remove from queue
Step-by-Step Breakdown:
1. Pull from Queue The executor blocks on queue.pull(), waiting for work. When a message arrives, it gets exclusive access (queue guarantees no other executor sees this message).
2. Fetch Full Task Details The queue message is minimal (task ID, payload). We fetch the full record from the database for any additional context.
3. Mark Execution Start Set started_at = NOW(). This is important for monitoring, we can track how long tasks take and identify stuck jobs.
4. Execute the Task This is where the actual work happens. For a "send email" task, this means calling your email service API. For a "process data" task, this means running your computation.
5. Mark Completion Set ended_at = NOW(). The task lifecycle is complete.
6. Acknowledge the Message Tell the queue "I'm done with this message, remove it." If we crash before acknowledgment, the message becomes visible again for another executor.
Meeting the 30-Second SLA
Our SLA states: tasks scheduled for 10:01:00 AM must start executing by 10:01:30 AM. Let's trace the timing:
10:00:00 AM: Task scheduled for 10:01:00 AM
10:00:04 AM: Picker runs, sees task (it's within 2-minute window)
10:00:04 AM: Picker marks picked_at, pushes to queue (< 100ms)
10:00:05 AM: Task sitting in queue
10:01:00 AM: Task is now officially "due"
10:01:01 AM: Executor pulls from queue, marks started_at
10:01:01 AM: Execution begins
Total time from scheduled to started: 1 second, well under our 30-second budget.
What if executors are overwhelmed?
10:00:04 AM: Task pushed to queue
10:00:05-10:01:20 AM: Task waits in queue (executors busy)
10:01:21 AM: Executor finally pulls task, marks started_at
Total time: 21 seconds, still meeting SLA.
The 2-minute look-ahead in picking gives us breathing room. Even if there's a 1-minute queue delay, we're fine.
Handling Recurring Tasks: The CRON Challenge
Fixed-time tasks are straightforward: schedule once, execute once, done. Recurring tasks are trickier.
Consider a task with CRON schedule 0 2 * * * (daily at 2 AM). After execution, we need to schedule the next occurrence.
Data Model for Recurring Tasks
Table: tasks
+---------------+------------------+
| id | INT |
| task | TEXT |
| schedule | TEXT | ← CRON expression: "0 2 * * *"
| scheduled_at | TIMESTAMP | ← Next execution time
| picked_at | TIMESTAMP |
| started_at | TIMESTAMP |
| ended_at | TIMESTAMP |
+---------------+------------------+
The schedule field contains the CRON expression. When a recurring task completes:
Executor (after task completes):
1. Parse CRON: "0 2 * * *"
2. Calculate next run: next_2am()
3. UPDATE tasks
SET scheduled_at = '2026-01-06 02:00:00',
picked_at = NULL,
started_at = NULL,
ended_at = NULL
WHERE id = 42
Critical: We reset picked_at, started_at and ended_at to NULL. This makes the task "look new" to pickers. On the next cycle, pickers will see scheduled_at = 2026-01-06 02:00:00 and picked_at IS NULL, treating it as a fresh task to schedule.
This "reset and reschedule" pattern means recurring tasks never get deleted, they just keep updating their scheduled_at timestamp.
When to Calculate Next Execution
Two approaches:
Option 1: After Picking When a picker grabs a recurring task, immediately schedule the next occurrence:
-- Picker marks current as picked
UPDATE tasks SET picked_at = NOW() WHERE id = 42;
-- Picker inserts next occurrence
INSERT INTO tasks (task, schedule, scheduled_at)
VALUES ('backup_db', '0 2 * * *', '2026-01-06 02:00:00');
Pros: Next run is scheduled immediately, no gap Cons: Creates duplicate rows, table grows over time
Option 2: After Execution Wait until the task completes, then update the same row:
-- Executor completes task
UPDATE tasks
SET scheduled_at = '2026-01-06 02:00:00',
picked_at = NULL, started_at = NULL, ended_at = NULL
WHERE id = 42;
Pros: Single row per recurring task, cleaner schema Cons: Slight gap between completion and next scheduling
For our design, Option 2 is better. It keeps the table compact and works naturally with our minute-level granularity.
Predictive Scaling: Handling Burst Traffic
One of the hardest operational challenges is traffic bursts. At 9:00 AM, 10,000 tasks might be scheduled simultaneously (daily reports, morning emails, etc.).
If we reactively scale executors based on queue depth, we're already behind. By the time new instances spin up (60-90 seconds), tasks are already missing their SLA.
Solution: Predictive Scaling
The orchestrator is a separate component that monitors the database and proactively scales:
Orchestrator (runs every minute):
1. Query DB:
SELECT COUNT(*) FROM tasks
WHERE scheduled_at BETWEEN NOW() AND NOW() + INTERVAL 10 MINUTE
AND picked_at IS NULL
2. If count > threshold:
- Scale up executors *before* the burst hits
- Add instances that will be ready when tasks start flowing
3. Monitor key metrics:
- Tasks pulled per minute
- Average task wait time
- Average time to completion
4. If metrics degrade:
- Scale up further
5. If queue is empty for 5+ minutes:
- Scale down to baseline
Example Scenario:
8:55 AM: Orchestrator queries DB
Sees 5,000 tasks scheduled for 9:00-9:10 AM
Current executors: 10 (can handle ~200 tasks/min)
Needed capacity: ~500 tasks/min
8:56 AM: Orchestrator triggers scale-up to 30 executors
8:57-8:59 AM: New executors provisioning
9:00 AM: Burst hits, 30 executors ready to pull tasks
9:00-9:10 AM: All 5,000 tasks executed within SLA
9:15 AM: Orchestrator sees queue empty, scales back to 15
9:20 AM: Scales back to baseline 10
By looking 10 minutes ahead, we're always prepared for what's coming.
Key Metrics for Monitoring
The orchestrator and operations team track these SLA-critical metrics:
1. Tasks Pulled Per Minute How many tasks are pickers successfully claiming? This should match expected throughput.
2. Task Wait Time Time from scheduled_at to started_at. Our SLA requires this to be < 30 seconds.
3. Average Time to Completion Time from started_at to ended_at. Helps identify slow tasks that need optimization.
4. Picker Lag Are pickers keeping up with the schedule? If pickers are slow, tasks accumulate.
5. Queue Depth How many tasks are waiting for execution? Spikes indicate need for more executors.
These metrics feed into alerting and auto-scaling decisions.
Why No Retries?
Our design explicitly excludes automatic retries. This is a product decision, not a technical limitation. Here's why it makes sense:
1. Idempotency is Hard Retrying means potentially executing the same task twice. For many workloads (sending emails, charging payments, posting to social media), duplicate execution is worse than no execution.
2. Failure Signals Issues If a task fails, it often indicates a bug in task logic or a downstream service outage. Retrying just papers over the problem. Better to fail loudly, page an engineer and fix the root cause.
3. Simpler State Machine Without retries, task states are simple: pending → picked → executing → complete. Retries introduce exponential complexity: retry counts, backoff timers, max retry limits, dead letter queues.
Alternative: Manual Retry Failed tasks stay in the database with started_at set but ended_at NULL. A monitoring dashboard shows these failures. Engineers can:
Investigate the failure
Fix the bug
Reset the task:
UPDATE tasks SET picked_at = NULL, started_at = NULL WHERE id = 42Let the system pick it up again naturally
This puts humans in the loop for exception handling, which is often appropriate for scheduled tasks.
Putting It All Together
Let's trace a complete task lifecycle:
Monday, 9:00 AM: User schedules a task
INSERT INTO tasks (task, scheduled_at)
VALUES ('send_newsletter', '2026-01-05 10:00:00');
Monday, 9:58 AM: Picker scans database
SELECT * FROM tasks
WHERE scheduled_at <= '2026-01-05 10:00:00'
AND picked_at IS NULL
ORDER BY scheduled_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
Finds our newsletter task (and 9 others).
Monday, 9:58 AM: Picker marks and queues
UPDATE tasks SET picked_at = '2026-01-05 09:58:00'
WHERE id = 42;
queue.push({task_id: 42, payload: 'send_newsletter'})
Monday, 9:59 AM: Task waits in queue
Monday, 10:00:01 AM: Executor pulls task
msg = queue.pull()
Monday, 10:00:01 AM: Executor starts execution
UPDATE tasks SET started_at = '2026-01-05 10:00:01'
WHERE id = 42;
Monday, 10:00:01-10:00:15 AM: Executor sends newsletter Calls email service API, waits for response...
Monday, 10:00:15 AM: Task completes
UPDATE tasks SET ended_at = '2026-01-05 10:00:15'
WHERE id = 42;
queue.acknowledge(msg)
Total time from scheduled to started: 1 second, crushing our 30-second SLA.

The Architecture's Elegant Properties
This design delivers on every requirement:
30-Second SLA: Look-ahead picking + queue buffering + predictive scaling ensures tasks start promptly
Minute-Level Granularity: Simplified model without second-level complexity
Fixed & Recurring Tasks: Single schema handles both with simple reset logic
Distributed Execution: Pickers and executors scale independently
High Throughput: Multiple pickers work concurrently without conflicts via SKIP LOCKED
Operational Simplicity: Relational database, standard message queue, no exotic components
The result: a production-grade task scheduler that handles millions of tasks per day while maintaining strict timing guarantees. Whether you're scheduling a single password reset email or orchestrating thousands of nightly data pipelines, the architecture scales gracefully from day one to day ten thousand.
Designing a Message Broker on Top of a Relational Database
The Surprising Choice: Why Build on SQL?
When you think "message broker," you probably think RabbitMQ, Kafka, or SQS, specialized systems built from the ground up for messaging. So why would anyone build a message broker on top of a relational database?
The answer lies in operational simplicity. Many organizations already run MySQL or PostgreSQL with robust backup, monitoring and operational expertise. Adding RabbitMQ means learning new tools, managing new infrastructure and dealing with new failure modes. Sometimes, especially at smaller scale, leveraging what you already have is the smart engineering decision.
The question becomes: can a relational database provide the guarantees a message broker needs? Let's find out.
Core Requirements
Our broker must satisfy four fundamental properties:
1. FIFO Ordering: Messages must be consumed in the order they were produced. If Producer sends Message A then Message B, consumers must process A before B.
2. Consume-Once Semantics: When a consumer reads a message, it should be deleted (or marked as consumed). No other consumer should see it.
3. High Throughput: The system must handle thousands of messages per second without becoming a bottleneck.
4. Exclusive Consumption: Once Consumer 1 picks a message, Consumer 2 cannot pick the same message. No duplicate processing.
These requirements mirror exactly what we built for the task scheduler's picker and that's no accident. Message consumption is fundamentally the same problem as task picking.
The Schema: Simple Yet Powerful
Table: messages
+------------------+------------------+
| id | INT AUTO_INC PK |
| msg | TEXT |
| created_at | TIMESTAMP |
| picked_at | TIMESTAMP |
| deleted_at | TIMESTAMP |
| receipt_handle | VARCHAR(UUID) |
+------------------+------------------+
Let's understand each field's purpose:
id: Auto-incrementing primary key. Provides natural FIFO ordering, lower IDs were inserted first.
msg: The actual message payload. Could be JSON, serialized protobuf, or plain text. The broker doesn't care about content.
created_at: When the producer inserted the message. Useful for monitoring and debugging.
picked_at: When a consumer claimed the message. NULL means available for consumption.
deleted_at: When the message was deleted after successful processing. NULL means not yet deleted.
receipt_handle: A unique UUID given to the consumer who picked the message. Used to ensure only that consumer can delete it, prevents one consumer from accidentally deleting another's message.
The Producer Path: Publishing Messages
Publishing is delightfully simple:
INSERT INTO messages (msg, created_at)
VALUES ('{"user_id": 123, "action": "send_email"}', NOW());
That's it. No complex protocol, no broker-specific client libraries. Just a standard SQL INSERT.
Throughput Considerations:
For high-throughput scenarios, batch inserts dramatically improve performance:
INSERT INTO messages (msg, created_at) VALUES
('message_1', NOW()),
('message_2', NOW()),
('message_3', NOW()),
...
('message_100', NOW());
A single batch INSERT of 100 messages is 50x faster than 100 individual INSERTs. Most database drivers support this natively.
The Consumer Path: Reading Messages
This is where the magic happens. Consumers need to atomically claim messages without conflicts. The query looks familiar if you read the task scheduler section:
SELECT * FROM messages
WHERE picked_at IS NULL
AND deleted_at IS NULL
ORDER BY created_at ASC
LIMIT n
FOR UPDATE SKIP LOCKED;
Let's break down each clause:
WHERE picked_at IS NULL
Only fetch messages that haven't been claimed yet. Once a consumer sets picked_at, the message becomes invisible to this query.
AND deleted_at IS NULL
Skip messages that have already been processed and deleted. This is our tombstone mechanism.
ORDER BY created_at ASC
FIFO guarantee. Process messages in the order they were created. We could also ORDER BY id ASC since IDs are auto-incrementing, which is slightly faster (no need to consult the timestamp).
LIMIT n
Batch size. Instead of fetching one message at a time (inefficient), grab 10-50 at once. Reduces database round trips.
FOR UPDATE SKIP LOCKED
The critical piece that makes concurrent consumption safe:
FOR UPDATE: Locks the selected rows for the duration of the transactionSKIP LOCKED: If another consumer already locked some rows, skip them instead of waiting
This means Consumer 1 might grab messages 1-10, Consumer 2 simultaneously grabs 11-20 and Consumer 3 grabs 21-30. All without blocking each other.
Claiming Messages: The Atomic Pick
After selecting messages, we must mark them as picked in the same transaction:
BEGIN TRANSACTION;
-- Select messages
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY created_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;
-- Mark as picked with unique receipt handle
UPDATE messages
SET picked_at = NOW(),
receipt_handle = UUID()
WHERE id IN (101, 102, 103, 104, 105, 106, 107, 108, 109, 110);
COMMIT;
After the COMMIT, those 10 messages have:
picked_atset to current timestampreceipt_handlecontaining a unique UUID like"550e8400-e29b-41d4-a716-446655440000"
The consumer receives both the messages and their receipt handles. This is crucial for the delete operation.
The Broker API: HTTP Endpoints
Our broker exposes a simple HTTP API:
GET /messages/q?n=10
Request:
GET /messages/q?n=10
Response:
{
"messages": [
{
"id": 101,
"msg": "{\"user_id\": 123, \"action\": \"send_email\"}",
"receipt_handle": "550e8400-e29b-41d4-a716-446655440000"
},
{
"id": 102,
"msg": "{\"user_id\": 456, \"action\": \"process_payment\"}",
"receipt_handle": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
}
]
}
The consumer gets up to 10 messages (whatever was available) along with unique receipt handles.
DELETE /message/q/uuid1
After successfully processing a message, the consumer deletes it:
Request:
DELETE /message/q/550e8400-e29b-41d4-a716-446655440000
Implementation:
UPDATE messages
SET deleted_at = NOW()
WHERE receipt_handle = '550e8400-e29b-41d4-a716-446655440000';
Notice we don't actually DELETE the row, we set deleted_at. This is a soft delete pattern.
Why Soft Delete?
Hard deletes (actual DELETE statements) cause problems:
Fragmentation: Constantly deleting and inserting creates holes in the table
Replication lag: Deletes must replicate to all database replicas
Audit trail: With soft deletes, you can see message history
The tradeoff is eventual cleanup (discussed below).
PUT /message/q
Producer publishes a new message:
Request:
PUT /message/q
Content-Type: application/json
{
"msg": "{\"user_id\": 789, \"action\": \"generate_report\"}"
}
Implementation:
INSERT INTO messages (msg, created_at)
VALUES ('{\"user_id\": 789, \"action\": \"generate_report\"}', NOW());
Simple, synchronous publish. For higher throughput, batch multiple messages.
Ensuring Exclusive Consumption
The receipt handle is the key to preventing one consumer from deleting another's messages. Consider this scenario:
10:00:00 - Consumer 1 picks message 101, receipt_handle = "uuid-A"
10:00:01 - Consumer 2 picks message 102, receipt_handle = "uuid-B"
10:00:02 - Consumer 1 tries to delete message 102 (by mistake)
Consumer 1's delete request:
UPDATE messages
SET deleted_at = NOW()
WHERE receipt_handle = 'uuid-A' AND id = 102;
This UPDATE affects 0 rows because message 102 has receipt_handle = 'uuid-B', not 'uuid-A'. The delete fails silently (or we return an error), protecting Consumer 2's work.
This pattern prevents accidentally:
Deleting messages you didn't pick
Deleting messages another consumer is processing
Race conditions where multiple consumers claim the same message
Visibility Timeout: Handling Crashed Consumers
What happens if a consumer crashes after picking messages but before deleting them?
10:00:00 - Consumer 1 picks messages 101-110
10:00:05 - Consumer 1 crashes (server dies, network partition, etc.)
Those 10 messages are now stuck: picked_at is set, so no other consumer will claim them, but they're not deleted because Consumer 1 never finished.
Solution: Visibility Timeout
After a message is picked, it should become visible again if not deleted within a timeout period (typically 10 minutes):
CRON job (runs every minute):
UPDATE messages
SET picked_at = NULL
WHERE picked_at < (NOW() - INTERVAL 10 MINUTE)
AND deleted_at IS NULL;
This finds messages that were picked more than 10 minutes ago but not deleted and resets them. They become available for other consumers to pick.
The CRON job essentially says: "If you picked a message but didn't delete it in 10 minutes, we assume you crashed and make it available again."
Important: This means messages might be delivered twice if a consumer is slow but not crashed. This is acceptable in most systems, we trade perfect once-delivery for resilience.
Scaling Throughput: Multiple Consumers
The beauty of FOR UPDATE SKIP LOCKED is that it makes horizontal scaling trivial:
Consumer 1 (polls every 4 seconds):
10:00:00 - Picks messages 1-10
10:00:04 - Picks messages 41-50
10:00:08 - Picks messages 81-90
Consumer 2 (polls every 4 seconds):
10:00:00 - Picks messages 11-20
10:00:04 - Picks messages 51-60
10:00:08 - Picks messages 91-100
Consumer 3 (polls every 4 seconds):
10:00:00 - Picks messages 21-30
10:00:04 - Picks messages 61-70
10:00:08 - Picks messages 101-110
Each consumer independently queries the database. The locking mechanism ensures they never pick the same messages. No coordination required, no leader election, no distributed consensus.
Throughput Calculation:
Each consumer processes 10 messages per batch
Processing takes ~1 second per message
Total batch time: ~10 seconds
Consumer polls every 4 seconds, processes for ~10 seconds, then polls again
Each consumer: ~60 messages/minute
With 10 consumers: ~600 messages/minute = 10 messages/second
To increase throughput, simply add more consumers. The database becomes the bottleneck only at very high scale (10,000+ messages/sec), at which point you'd migrate to a specialized broker.
The Broker API Layer
The broker sits between consumers and the database, providing a clean abstraction:
┌───────────┐
│Consumer 1 │───┐
└───────────┘ │
├──→ ┌────────────┐ ┌────────┐
┌───────────┐ │ │ Broker API │─────→│ MySQL │
│Consumer 2 │───┤ └────────────┘ └────────┘
└───────────┘ │
│
┌───────────┐ │
│Consumer 3 │───┘
└───────────┘
The Broker API exposes:
GET /messages/q?n=10→ Consumers pull messagesDEL /message/q/uuid1→ Consumers delete after processingPUT /message/q→ Producers publish messages
Why not let consumers query directly?
The broker layer provides:
Authentication: Only authorized clients can publish/consume
Rate limiting: Prevent abusive consumers from overwhelming the database
Metrics: Track messages published, consumed, latency
Abstraction: Swap MySQL for PostgreSQL without changing consumer code
Garbage Collection: Cleaning Up Old Messages
Soft deletes mean rows accumulate forever. A table with 1 billion messages (even deleted ones) becomes slow to query and wastes disk space.
Solution: Background Cleanup
-- Runs daily via CRON
DELETE FROM messages
WHERE deleted_at < (NOW() - INTERVAL 7 DAYS)
LIMIT 10000;
This hard-deletes messages that were soft-deleted more than 7 days ago. We keep a week of history for debugging, then purge.
Why LIMIT 10000?
Deleting millions of rows in one transaction locks the table and causes replication lag. By limiting to 10,000 rows, we delete in small batches:
Each batch takes <1 second
Doesn't block ongoing message consumption
Runs repeatedly until all old messages are purged
You can also use partitioning strategies where older data lives in separate partitions that can be dropped entirely (instant cleanup).
Handling FIFO Strictly
Our ORDER BY created_at ASC provides approximate FIFO, but what if two messages have the same timestamp (down to the second)?
For strict ordering, use ORDER BY id ASC instead:
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC -- Guaranteed order since IDs are sequential
LIMIT 10
FOR UPDATE SKIP LOCKED;
Since id is auto-incrementing, lower IDs were inserted first. This gives perfect FIFO ordering.
Trade-off: Ordering by id (integer) is faster than created_at (timestamp) anyway, so this is strictly better for our use case.
Monitoring and Observability
Key metrics to track:
1. Messages Published Per Minute
SELECT COUNT(*) FROM messages
WHERE created_at >= (NOW() - INTERVAL 1 MINUTE);
2. Messages Pending (Not Picked)
SELECT COUNT(*) FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL;
If this number grows unbounded, consumers can't keep up. Scale up.
3. Messages Stuck (Picked But Not Deleted)
SELECT COUNT(*) FROM messages
WHERE picked_at IS NOT NULL
AND deleted_at IS NULL
AND picked_at < (NOW() - INTERVAL 5 MINUTE);
High numbers indicate consumer crashes or slow processing.
4. Average Time to Deletion
SELECT AVG(TIMESTAMPDIFF(SECOND, picked_at, deleted_at)) AS avg_seconds
FROM messages
WHERE deleted_at >= (NOW() - INTERVAL 1 HOUR);
Shows how long consumers take to process messages. Spikes indicate performance issues.
When This Design Works (and When It Doesn't)
This design excels when:
You already run MySQL/PostgreSQL and want to avoid new infrastructure
Throughput is moderate (< 10,000 messages/sec)
You need strong consistency guarantees (ACID transactions)
Operational simplicity is a priority
You want built-in durability (database backups cover messages too)
You should use a specialized broker when:
You need 100,000+ messages/sec throughput
You need advanced routing (pub/sub, topic exchanges, fanout)
You want message replays (Kafka-style log retention)
You need multi-datacenter replication with low latency
Database becomes a bottleneck for your other workloads
Real-World Usage: AWS SQS Comparison
Interestingly, this design closely mimics how AWS SQS works internally (though they don't publicly confirm implementation details):
| Feature | Our SQL Broker | AWS SQS |
| Visibility timeout | Cron resets picked_at | Built-in |
| Receipt handle | UUID column | Opaque token |
| Soft deletes | deleted_at column | Internal tombstones |
| FIFO | ORDER BY id ASC | FIFO queues |
| Throughput | 1K-10K msg/sec | 3K-10K msg/sec (standard) |
SQS is more polished and scalable, but the core concepts are identical. If you understand this SQL implementation, you understand how SQS works under the hood.
Complete Example: End-to-End Flow
Producer publishes 3 messages:
INSERT INTO messages (msg, created_at) VALUES
('Message A', '2026-01-05 10:00:00'),
('Message B', '2026-01-05 10:00:01'),
('Message C', '2026-01-05 10:00:02');
Consumer 1 pulls batch:
BEGIN;
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC LIMIT 2
FOR UPDATE SKIP LOCKED;
-- Returns: Message A (id=1), Message B (id=2)
UPDATE messages
SET picked_at = NOW(), receipt_handle = UUID()
WHERE id IN (1, 2);
COMMIT;
-- Consumer receives: [{id:1, msg:"A", handle:"uuid-1"}, {id:2, msg:"B", handle:"uuid-2"}]
Consumer 2 pulls batch (simultaneously):
BEGIN;
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC LIMIT 2
FOR UPDATE SKIP LOCKED;
-- Returns: Message C (id=3) [Messages 1-2 are locked by Consumer 1, skipped]
UPDATE messages
SET picked_at = NOW(), receipt_handle = UUID()
WHERE id = 3;
COMMIT;
-- Consumer receives: [{id:3, msg:"C", handle:"uuid-3"}]
Consumer 1 finishes processing, deletes:
UPDATE messages SET deleted_at = NOW()
WHERE receipt_handle = 'uuid-1';
UPDATE messages SET deleted_at = NOW()
WHERE receipt_handle = 'uuid-2';
Consumer 2 crashes before deleting Message C:
(Consumer 2 dies, message 3 stuck with picked_at set)
10 minutes later, visibility timeout CRON runs:
UPDATE messages
SET picked_at = NULL
WHERE picked_at < (NOW() - INTERVAL 10 MINUTE)
AND deleted_at IS NULL;
-- Message C becomes available again
Consumer 3 picks up Message C:
-- Same SELECT/UPDATE flow as before
-- Message C gets new receipt handle "uuid-4"
-- Consumer 3 processes and deletes successfully
The Elegant Simplicity
Building a message broker on SQL demonstrates a powerful principle: you don't always need specialized infrastructure. With the right patterns, atomic transactions, row-level locking, smart indexing, a relational database can provide message broker semantics.
The result is a system that's:
Simple to understand (just SQL queries)
Easy to operate (leverage existing database skills)
Reliable (ACID transactions prevent data loss)
Scalable (to thousands of messages per second)
For many applications, that's more than enough. And when you outgrow it, the concepts you learned, visibility timeouts, receipt handles, FIFO ordering, idempotent deletes, transfer directly to systems like RabbitMQ, SQS, or Kafka.
Sometimes the best solution isn't the most sophisticated technology, it's the one that solves your problem with the tools you already have.
Designing YouTube's Views Counter: Counting at Planetary Scale
The Deceptive Simplicity of "Just Count"
Counting views on YouTube sounds trivial. Every time someone watches a video, increment a counter. Done, right?
Except YouTube serves 1 billion hours of video every single day. That's roughly 100,000+ video views starting every second. Counting at this scale while filtering out bots, detecting fraud and maintaining accuracy across the globe is one of the most challenging distributed systems problems in existence.
This isn't just an engineering exercise, view counts directly impact creator revenue, recommendation algorithms and trending calculations. Getting it wrong means creators lose money and fraudulent videos go viral.
Core Requirements
1. Count Every Legitimate View When a user watches a video, that view should be reflected in the counter, eventually. We need high accuracy without losing data.
2. Filter Out Invalid Views Not every play counts as a view:
Bot traffic (automated scraping)
Repeated refreshes within seconds
Views from the same IP in suspicious patterns
Views shorter than some threshold (e.g., < 10 seconds)
Self-views from the video creator
These filtering rules live in a Rules Engine and evolve constantly as new fraud patterns emerge.
3. Near Real-Time Updates While we can tolerate some delay (eventual consistency), view counts shouldn't lag hours behind reality. Users expect to see counts update within minutes.
4. Scale to Billions of Events 100,000 views/second sustained, with spikes during viral moments (a popular livestream might generate 1 million concurrent views).
The Naive Approach (and Why It Fails)
The simplest design: every video play sends a request directly to a database to increment a counter.
UPDATE videos
SET view_count = view_count + 1
WHERE video_id = 'dQw4w9WgXcQ';
Why this doesn't work:
1. Database Hotspots Popular videos (viral hits, live events) would receive millions of concurrent writes to the same row. This creates contention, all writes wait in line for the lock on that row.
Databases can handle ~10,000 writes/sec to different rows. But 1 million writes/sec to the same row? The database grinds to a halt.
2. No Filtering Logic We're counting every view without applying fraud detection rules. Bots would inflate counts instantly.
3. Cross-Region Latency Users are globally distributed. Forcing every view to synchronously update a central database adds 100-300ms latency to video playback. Unacceptable for user experience.
We need a fundamentally different architecture.
The Architecture: Stream Processing at Scale
The breakthrough insight: counting is an aggregation problem, not a transactional problem. We don't need to know the exact count at every millisecond, we need to eventually arrive at the correct count across billions of events.
This leads us to a streaming architecture:
┌─────────┐
│ User 1 │──┐
└─────────┘ │
├──→ ┌────────────┐ ┌─────────┐
┌─────────┐ │ │ Watchtime │─────→│ Kafka │
│ User 2 │──┤ │ Service │ │(Partitioned)
└─────────┘ │ └────────────┘ └─────────┘
│ │
┌─────────┐ │ │
│ User 3 │──┘ ↓
└─────────┘ ┌────────────────┐
│ Event Filters │
│ (Consumers) │
└────────────────┘
│
↓
┌────────────────┐
│ Counters │
│(Batch & Count) │
└────────────────┘
│
↓
┌────────────────┐
│ Count DB │
│ (Aggregated) │
└────────────────┘
Let's trace through this step by step.
Step 1: Capturing Watch Events
When a user watches a video, the client (web browser, mobile app) sends a "heartbeat" to the Watchtime Service:
POST /watchtime
{
"video_id": "dQw4w9WgXcQ",
"user_id": "user_12345",
"session_id": "sess_abc",
"timestamp": "2026-01-05T10:23:45Z",
"watch_duration": 15, // seconds watched so far
"device": "mobile",
"ip": "203.0.113.42"
}
These heartbeats happen every 10-30 seconds during playback. If a user watches a 3-minute video, they might send 6-10 heartbeat events.
Why not wait until the video ends?
Users close tabs, their connections drop, they navigate away. If we only count "complete views," we'd miss 70%+ of actual viewing. Heartbeats give us fine-grained data on actual watch time.
Step 2: Publishing to Kafka
The Watchtime Service immediately publishes each event to Kafka:
Kafka Topic: "watch_events"
Partitions: 100 (to parallelize processing)
Event stream:
A, A, A, B, C, A, D, B, A, B, A, A, D, C, B, A, ...
Each letter represents a different video. Notice "A" appears frequently, it's a viral video getting massive views.
Why Kafka? Why not process directly?
1. Decoupling The Watchtime Service's job is to accept events quickly (low latency for users). Processing and filtering can happen asynchronously. Kafka acts as a buffer.
2. Durability Kafka persists events to disk. If a downstream consumer crashes, events aren't lost, they're replayed.
3. Parallelism Kafka's partitioning lets us run 100 parallel consumers, each processing a subset of events. This is crucial for handling 100,000 events/second.
4. Replayability If we discover a bug in our counting logic, we can reprocess historical events from Kafka to fix counts.
Kafka Partitioning Strategy
Kafka uses partitioning to distribute events across multiple consumers. The partition key determines which partition an event goes to:
Partition = hash(video_id) % num_partitions
Example with 4 partitions:
- Video A: hash("A") % 4 = 0 → Partition 0
- Video B: hash("B") % 4 = 1 → Partition 1
- Video C: hash("C") % 4 = 2 → Partition 2
- Video D: hash("D") % 4 = 3 → Partition 3
Critical insight: All events for the same video go to the same partition.
This means:
Partition 0 gets ALL events for video A
Partition 1 gets ALL events for video B
And so on...
Why partition by video_id?
Because counting requires aggregation. To count views for video A, we need all video A events processed by the same consumer. Partitioning by video_id ensures this locality.
If we partitioned randomly, video A events would scatter across all partitions and we'd need complex coordination to aggregate counts (expensive and error-prone).
Step 3: Event Filtering (Rules Engine)
Each Kafka partition is consumed by a Event Filter service. These consumers apply fraud detection rules:
Event Filter (Consumer Group, 100 instances):
For each event from Kafka:
1. Check Rules DB: Does this event pass validation?
- Is user_id legitimate? (not a bot)
- Has this user_id watched this video recently? (prevent spam refreshes)
- Did they watch for minimum duration? (e.g., >= 10 seconds)
- Is IP address flagged? (known bot farm)
2. If PASS: Forward to "valid_events" Kafka topic
3. If FAIL: Drop event (log for analysis)
Example Rules:
Rule 1: Same user watching same video within 5 minutes → Only count first view
Rule 2: Watch duration < 10 seconds → Don't count
Rule 3: More than 10 views from same IP in 1 minute → Flag as bot
Rule 4: Creator watching their own video → Don't count
The Rules DB is a separate database containing these rules, which can be updated without redeploying code. This is crucial because fraud patterns evolve daily.
Why re-ingest into Kafka after filtering?
Because filtering is just one stage. We want to decouple filtering from counting. If counting logic changes, we don't need to re-filter, we just reprocess from the "valid_events" topic.
This pattern is called staged stream processing: each stage reads from one topic, transforms data and writes to the next topic.
Step 4: Counting in Parallel (The Counter Service)
Now we have a stream of valid events in Kafka. The final stage is counting:
Counter Service (Consumer Group, 100 instances):
For each partition:
1. Batch events by video_id over a time window (e.g., 10 seconds)
2. Aggregate: video_id → count
Example batch:
{
"video_A": 1523 events,
"video_B": 847 events,
"video_C": 3291 events
}
3. Write aggregated counts to Count DB
Why batch instead of individual writes?
Writing every single event to the database means 100,000 writes/second. Batching reduces this to ~1,000 writes/second (100 partitions × 10 batches/second each).
Each batch aggregates potentially thousands of events for the same video:
Before batching:
- video_A viewed (write to DB)
- video_A viewed (write to DB)
- video_A viewed (write to DB)
... 1,523 database writes
After batching:
- video_A: +1523 views (single write to DB)
This reduces database load by 1000x while maintaining the same accuracy.
Parallelism: Why Count on Many Machines?
With 100 Kafka partitions and 100 Counter Service instances, we achieve massive parallelism:
Partition 0 → Counter Instance 0 → Processes videos {A, E, I, M, ...}
Partition 1 → Counter Instance 1 → Processes videos {B, F, J, N, ...}
Partition 2 → Counter Instance 2 → Processes videos {C, G, K, O, ...}
...
Partition 99 → Counter Instance 99 → Processes videos {D, H, L, P, ...}
Each instance independently:
Reads events from its assigned partition
Batches events per video_id
Writes aggregated counts to the database
No coordination needed. Because we partitioned by video_id, each video's events are processed by exactly one instance. No conflicts, no distributed locks, no race conditions.
Parallelism breakdown:
100,000 events/sec ÷ 100 partitions = 1,000 events/sec per partition
Each Counter instance handles 1,000 events/sec (easily manageable)
Processing time per event: ~1ms (filtering + batching)
This architecture scales horizontally: need more throughput? Add more partitions and instances.
The Count Database: Aggregation Storage
The Count DB stores aggregated view counts per video:
Table: video_counts
video_id | view_count
------------------|-----------
dQw4w9WgXcQ | 1,234,567,890
jNQXAC9IVRw | 543,210,987
...
Counter services issue simple UPDATE statements:
UPDATE video_counts
SET view_count = view_count + 1523
WHERE video_id = 'dQw4w9WgXcQ';
Because we batched 1,523 events into a single update, database writes are manageable even for viral videos.
Avoiding huge writes:
A live event with 1 million concurrent viewers would generate massive count updates if we batched for too long. Solution: adaptive batching:
If batch size for video_id > 10,000 events:
- Flush batch immediately (don't wait for time window)
- Write to database
- Start new batch
This prevents any single batch from accumulating huge counts, keeping write sizes reasonable.
Why Re-ingest into Kafka? (Deep Dive)
The architecture has two Kafka stages: raw events → Kafka → filtered events → Kafka → counting. Why the middle Kafka topic?
1. Fault Tolerance
If the Counter service crashes, filtered events are safely stored in Kafka. When it recovers, it resumes from the last committed offset. No data loss.
2. Reprocessing
Imagine we discover a bug: we undercounted views for videos watched on mobile devices. To fix:
Deploy corrected Counter code
Reset Kafka offset to 24 hours ago
Reprocess filtered events
Corrected counts populate the database
Without Kafka persistence, we'd lose historical events and couldn't fix the mistake.
3. Multiple Consumers
Other systems might need filtered events:
Analytics pipeline: Tracks watch patterns for recommendations
Revenue calculator: Determines ad revenue based on valid views
Trending algorithm: Identifies videos gaining views rapidly
All these consumers read from the same "valid_events" topic. Kafka naturally supports multiple consumer groups without additional load on the upstream filter.
4. Backpressure Handling
If the Count DB is slow or temporarily unavailable, filtered events accumulate in Kafka rather than backing up into the Watchtime Service. This isolates failures, users aren't impacted by downstream problems.
Handling Hot Videos: Partition Skew
Partitioning by video_id creates a problem: viral videos cause partition skew.
Normal day:
Partition 0: 1,000 events/sec (various videos)
Partition 1: 1,000 events/sec
...
Viral video "A" goes live:
Partition 0 (contains video A): 50,000 events/sec
Partition 1: 1,000 events/sec
...
Partition 0's consumer is overwhelmed while others sit idle. The system's overall capacity is 100,000 events/sec, but one partition is the bottleneck.
Solution: Sub-partitioning by session_id
Instead of partitioning purely by video_id, we use a composite key:
Partition = hash(video_id + session_id) % num_partitions
For video A with 50,000 concurrent viewers:
50,000 different session_ids
Events distributed across all 100 partitions
Each partition handles 500 events/sec for video A
Then during counting, we aggregate across partitions for the same video_id. This adds complexity (need coordination) but prevents hotspots.
Trade-off: Simple partitioning (video_id only) is easier to implement but suffers from skew. Composite partitioning (video_id + session_id) distributes load evenly but requires multi-partition aggregation.
YouTube likely uses the composite approach for resilience during viral events.
The Rules Database: Dynamic Fraud Detection
The Rules DB stores filter logic that evolves without code deploys:
Table: filter_rules
rule_id | rule_type | parameters
--------|--------------------|-----------
1 | min_watch_duration | {"seconds": 10}
2 | repeat_view_window | {"minutes": 5}
3 | ip_rate_limit | {"views_per_minute": 10}
4 | bot_ip_blocklist | {"ips": ["1.2.3.4", ...]}
Event Filters query this database periodically (cached locally to avoid database load):
Every 60 seconds:
rules = fetch_rules_from_db()
cache_locally(rules)
For each event:
for rule in cached_rules:
if not rule.apply(event):
drop_event()
break
Why not hardcode rules?
Fraud patterns change daily. New bot networks appear, old ones get shut down. Hardcoding means deploying new code every time rules change, slow and risky.
With a Rules DB, analysts can add/modify rules via a UI and changes propagate to all filters within 60 seconds. No deployment required.
Putting It All Together: Complete Flow
User watches "Never Gonna Give You Up" for 30 seconds:
1. 10:00:00 - Client sends heartbeat #1
POST /watchtime {"video_id": "dQw4w9WgXcQ", "duration": 10s, ...}
2. 10:00:00 - Watchtime Service publishes to Kafka
Topic: "watch_events", Partition: 42 (hash of video_id)
3. 10:00:01 - Event Filter (Consumer 42) reads event
- Checks Rules DB: min_watch_duration = 10s ✓
- Not a repeat view ✓
- IP not rate-limited ✓
- Passes all filters
- Publishes to "valid_events" topic
4. 10:00:10 - Client sends heartbeat #2 (20s watched)
5. 10:00:20 - Client sends heartbeat #3 (30s watched)
6. 10:00:10 - Counter Service (Consumer 42) batches events
- Batch window closes (10 seconds elapsed)
- Aggregates: {"dQw4w9WgXcQ": +3 views} (3 heartbeats)
7. 10:00:10 - Counter writes to Count DB
UPDATE video_counts SET view_count = view_count + 3
WHERE video_id = 'dQw4w9WgXcQ'
8. 10:00:11 - View count incremented: 1,234,567,890 → 1,234,567,893
From the user's watch action to the database update: 10 seconds. This near-real-time pipeline processes billions of events daily with high accuracy.

Monitoring and Observability
Key metrics for this system:
1. Event Throughput
Events published to Kafka per second
Events processed by Filters per second
Events written to Count DB per second
2. Partition Lag How far behind is each consumer from the latest events in its partition? High lag indicates processing can't keep up.
3. Filter Drop Rate Percentage of events dropped by Rules Engine. Sudden spikes might indicate a bot attack or a bug in filtering logic.
4. Database Write Latency Time to execute COUNT updates. Spikes indicate database contention or need for scaling.
5. Count Accuracy Periodically compare counts from the streaming pipeline against a batch recount job. Discrepancies indicate bugs.
The Elegance of Stream Processing
This architecture demonstrates core principles of distributed stream processing:
Decoupling: Each stage (ingestion, filtering, counting) operates independently. Failures are isolated.
Parallelism: Kafka partitioning enables massive horizontal scaling without coordination overhead.
Batching: Aggregating events before writing to the database reduces load by 1000x.
Idempotency: Reprocessing events produces the same counts. Enables fault tolerance and bug fixes.
Eventual Consistency: View counts update within seconds, not instantly. Users accept slight delays for a system that scales.
The result: YouTube counts billions of views per day, filters sophisticated fraud and keeps creators' revenue accurate, all while serving those 100,000 views per second that never stop flowing.
Designing Flash Sales: The High-Stakes Race for Limited Inventory
The Chaos of Scarcity
It's 12:00 PM. Your company drops 1,000 iPhone 15 units at 50% off. Within seconds, 100,000 eager buyers hit "Add to Cart" simultaneously. Only 1,000 will succeed. The rest will see the dreaded "Out of Stock" message.
This is a flash sale, a short time window where a fixed set of items is sold to whoever can claim them first. It's e-commerce at its most brutal: high traffic, limited inventory, zero tolerance for overselling. Get it wrong and you either disappoint customers or sell items you don't have (leading to refunds, angry emails and chargebacks).
The technical challenge isn't just handling traffic, it's ensuring exactly N items sell, no more, no less, even when 100,000 people try to buy simultaneously.
Real-World Examples
Flash sales power some of the internet's biggest traffic spikes:
Ticket booking sites (BookMyShow, Ticketmaster): Concert tickets for a popular artist sell out in minutes
IRCTC (Indian Railways): Train tickets during festival season face millions of concurrent bookings
Hotel booking: Limited discounted rooms during promotional periods
Shopify flash sales: Fashion brands drop limited edition items
YouTube live streams: Limited merchandise tied to live events
The pattern is universal: fixed inventory + time pressure + massive concurrent demand = need for bulletproof concurrency control.
Core Requirements
1. Fixed Inventory You have exactly 1,000 iPhones. No more magically appear. No less should be reserved.
2. Short Time Window Sale starts at 12:00 PM, ends at 12:15 PM (or when inventory depletes). High-intensity traffic concentrated in minutes.
3. High Throughput 100,000 users hitting your servers simultaneously. System must handle the load without crashing.
4. No Overselling The cardinal sin. If 1,000 items exist, exactly 1,000 orders should succeed. Selling 1,001 means you promised something you can't deliver.
5. Fair Distribution Users who click first (within milliseconds) should have first claim. No backdoor reservations, no favoritism.
6. Handle Payment Failures A user might add an item to cart, but payment fails. That item should become available again for others.
The Mental Model: The Real-World Store
Imagine a physical store with 1,000 iPhones on shelves:
Phase 0: Prepare the Stock
- Store owner stocks shelves with 1,000 iPhones
- Each phone is a physical unit on the shelf
Phase 1: Let Them In
- Doors open at 12:00 PM
- Customers rush in
- Each person physically grabs a phone from the shelf
- Once grabbed, no one else can take that specific unit
Phase 2: Payment
- Customer takes phone to checkout counter
- Payment succeeds → phone is theirs, they leave
- Payment fails → phone goes back on shelf for someone else
Our database design mimics this physical reality. The units table represents physical inventory units. Picking an item is like grabbing it from the shelf. Payment confirms ownership.
The Data Model: Units Table
Instead of tracking "quantity available," we model individual units:
Table: units
id | item_id | picked_at | picked_by | purchased_by
----|---------|------------|-----------|-------------
1 | 720 | NULL | NULL | NULL
2 | 720 | NULL | NULL | NULL
3 | 720 | NULL | NULL | NULL
...
1000| 720 | NULL | NULL | NULL
If you're selling 1,000 iPhones (item_id = 720), you have 1,000 rows in this table. Each row is a specific, sellable unit.
Why one row per unit instead of a counter?
Consider the alternative:
-- BAD: Using a counter
UPDATE items
SET quantity_available = quantity_available - 1
WHERE item_id = 720 AND quantity_available > 0;
This seems simpler, but it has fatal problems:
Problem 1: Race Conditions
Time | User A | User B
--------|-------------------------------------|-------------------------------------
10:00:00| SELECT quantity (reads: 1) |
10:00:01| | SELECT quantity (reads: 1)
10:00:02| UPDATE quantity = 0 (success) |
10:00:03| | UPDATE quantity = -1 (OVERSOLD!)
Even with WHERE quantity_available > 0, concurrent UPDATEs can pass the check simultaneously if not properly locked.
Problem 2: Contention
All 100,000 concurrent requests try to UPDATE the same row. The database serializes these updates (they wait in line for the row lock), creating massive contention. Throughput collapses to a few hundred requests per second.
The Units Model Solves Both:
With individual unit rows, locking happens at the unit level, not the item level. User A locks unit #1, User B locks unit #2, User C locks unit #3, all simultaneously, no waiting. This is the key to high throughput.
Phase 0: Preparing the Stock
Before the sale starts, the store owner populates the units table:
-- Selling 10,000 iPhones (item_id = 720)
INSERT INTO units (item_id, picked_at, picked_by, purchased_by)
SELECT 720, NULL, NULL, NULL
FROM generate_series(1, 10000);
This creates 10,000 rows, each representing a physical iPhone unit. All start with NULL values, unpicked, unclaimed, available.
At 11:59 AM (one minute before sale):
Database state:
- 10,000 rows for item_id = 720
- All: picked_at = NULL, picked_by = NULL, purchased_by = NULL
- Ready for battle
Phase 1: Let Them In (The Pick Operation)
At 12:00:00 PM, doors open. Users flood in, clicking "Add to Cart."
User 1023 clicks "Add to Cart":
BEGIN TRANSACTION;
-- Find one available unit
SELECT * FROM units
WHERE item_id = 720
AND picked_at IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- Returns: id = 1, item_id = 720, picked_at = NULL, ...
-- Claim it atomically
UPDATE units
SET picked_at = NOW(),
picked_by = 1023
WHERE id = 1;
COMMIT;
Breaking Down the Query:
WHERE picked_at IS NULL
Only fetch units that haven't been picked yet. Once picked_at is set, the unit is "in someone's cart" and unavailable.
ORDER BY id
Process units in a predictable order. Without this, you might leave gaps (units 1, 500, 900 picked while 2-499 sit idle). Ordering ensures sequential consumption.
LIMIT 1
Grab only one unit at a time. If a user wants 5 iPhones, run this query 5 times (or use LIMIT 5 but still process one transaction).
FOR UPDATE SKIP LOCKED
The magic that enables concurrency:
FOR UPDATE: Locks the selected row(s) for the duration of the transaction
SKIP LOCKED: If a row is already locked by another transaction, skip it and try the next row
What SKIP LOCKED does:
12:00:00.000 - User A's query: Tries to lock unit #1 → SUCCESS, locks unit #1
12:00:00.001 - User B's query: Tries to lock unit #1 → LOCKED, skips to unit #2 → SUCCESS
12:00:00.002 - User C's query: Tries unit #1 → LOCKED, tries unit #2 → LOCKED, tries unit #3 → SUCCESS
12:00:00.003 - User D's query: Tries units #1, #2, #3 → ALL LOCKED, tries unit #4 → SUCCESS
Each user gets a different unit without waiting. This non-blocking behavior is crucial for high throughput.
Without SKIP LOCKED:
12:00:00.000 - User A locks unit #1
12:00:00.001 - User B tries unit #1 → BLOCKS, waits for User A
12:00:00.002 - User C tries unit #1 → BLOCKS, waits for User B (who's waiting for User A)
12:00:00.003 - User D tries unit #1 → BLOCKS, waits for User C...
All users serialize on the first unit. Throughput drops to single-digit transactions per second. Flash sale becomes flash crash.
After the UPDATE:
Database state (unit #1):
id: 1
item_id: 720
picked_at: 2026-01-05 12:00:00
picked_by: 1023
purchased_by: NULL
Unit #1 is now "in User 1023's cart." No other user can pick it because picked_at IS NOT NULL.
Critical Insight: One Transaction for Pick + Update
Notice the BEGIN TRANSACTION and COMMIT wrapping both the SELECT and UPDATE. This is essential for correctness:
-- WRONG: Two separate transactions
SELECT * FROM units WHERE ... FOR UPDATE SKIP LOCKED;
-- (Transaction commits here)
UPDATE units SET picked_at = NOW() WHERE id = 1;
-- (Another transaction)
If the SELECT and UPDATE are separate transactions, another user could swoop in between them:
User A: SELECT → gets unit #1 (transaction ends)
User B: SELECT → gets unit #1 (before User A updates!)
User A: UPDATE → claims unit #1
User B: UPDATE → also claims unit #1 (DUPLICATE CLAIM!)
The correct pattern:
BEGIN TRANSACTION;
SELECT ... FOR UPDATE SKIP LOCKED; -- Locks the row
UPDATE ...; -- Updates while still locked
COMMIT; -- Releases lock
The lock acquired by FOR UPDATE persists until COMMIT. No one can interfere during that window.
Phase 2: Payment Processing
User 1023 has unit #1 in their cart. Now they proceed to checkout.
Payment Gateway Interaction:
1. User clicks "Pay Now"
2. Frontend calls: POST /api/payment
{
"user_id": 1023,
"unit_id": 1,
"amount": 599.99
}
3. Backend initiates payment with Stripe/PayPal/Razorpay:
POST https://api.stripe.com/v1/charges
{
"amount": 59999, // cents
"currency": "usd",
"source": "tok_visa",
"metadata": {"unit_id": 1, "user_id": 1023}
}
4. Payment gateway processes (takes 2-5 seconds)
5. Payment gateway sends webhook to your server:
POST https://yoursite.com/webhook/payment
{
"event": "payment_intent.succeeded",
"unit_id": 1,
"user_id": 1023
}
On Successful Payment (Webhook Handler):
UPDATE units
SET purchased_by = 1023,
purchased_at = NOW()
WHERE id = 1;
-- Also create order record, send confirmation email, etc.
Database state (unit #1) after success:
id: 1
item_id: 720
picked_at: 2026-01-05 12:00:00
picked_by: 1023
purchased_by: 1023
purchased_at: 2026-01-05 12:00:15
Unit #1 is now sold. The user owns it. The transaction is complete.
Handling Payment Failures
But what if payment fails? Card declined, insufficient funds, payment timeout?
On Unsuccessful Payment (Webhook Handler):
UPDATE units
SET picked_at = NULL,
picked_by = NULL
WHERE id = 1;
Database state (unit #1) after failure:
id: 1
item_id: 720
picked_at: NULL ← Back to available
picked_by: NULL ← No longer claimed
purchased_by: NULL
Unit #1 is now back on the shelf. The next user's SELECT query will find it and can pick it.
This is how inventory recirculates. If 100 people add to cart but only 50 complete payment, those 50 failed units become available for the next wave of buyers.
Critical Constraint: No Distributed Transaction
Notice a subtle but important detail:
You CANNOT wrap "add to cart" and "payment" in a single database transaction.
Why not? Because payment involves an external service (Stripe, PayPal) and takes seconds. Holding a database transaction open for seconds creates massive problems:
-- WRONG: Never do this
BEGIN TRANSACTION;
UPDATE units SET picked_at = NOW() WHERE id = 1; -- Locks row
-- Call payment API (takes 3-5 seconds)
-- ...waiting...
-- Still locked, other users can't access this row
UPDATE units SET purchased_by = 1023 WHERE id = 1;
COMMIT;
During those 3-5 seconds, the row is locked. If 10,000 users are paying simultaneously, you have 10,000 long-running transactions, each holding locks. The database chokes.
The correct pattern:
-- Transaction 1: Add to cart (fast, <100ms)
BEGIN;
SELECT ... FOR UPDATE SKIP LOCKED;
UPDATE units SET picked_at = NOW();
COMMIT;
-- External API call: Payment (slow, 2-5 seconds)
POST /stripe/charge
-- Transaction 2: Confirm purchase (fast, <100ms)
-- Triggered by webhook after payment completes
BEGIN;
UPDATE units SET purchased_by = ? WHERE id = ?;
COMMIT;
Transactions are short-lived (milliseconds). Long-running operations (payment) happen outside transactions. This keeps the database responsive.
The Expiration Problem: Abandoned Carts
User 1023 adds to cart but never completes payment. Maybe they got distracted, changed their mind, or their app crashed. Unit #1 is stuck in limbo: picked_at is set, but purchased_by is NULL.
Without intervention, this unit would be unavailable forever. We need a cart expiration mechanism:
-- CRON job runs every minute
UPDATE units
SET picked_at = NULL,
picked_by = NULL
WHERE picked_at < (NOW() - INTERVAL '12 minutes')
AND purchased_by IS NULL;
Logic:
If a unit was picked more than 12 minutes ago and hasn't been purchased, reset it. This gives users 12 minutes to complete payment, reasonable for a checkout flow.
Why 12 minutes for a 15-minute flash sale?
You want most of the sale window available for picking, not tied up in expired carts. 12 minutes gives 3 minutes for payment, then inventory recirculates.
Trade-off:
Too short (e.g., 2 minutes): Users don't have enough time, abandoned carts recirculate quickly but frustrate legitimate buyers
Too long (e.g., 30 minutes): Inventory gets locked up by abandoned carts, reducing availability
12 minutes is a sweet spot for most flash sales.
Preventing Disappointment vs. Underselling
This expiration mechanism creates a philosophical tension:
Option 1: Disappoint Some Customers
You could expire carts aggressively (say, 2 minutes). This means:
More inventory recirculates during the sale
More people get a chance to buy
But some legitimate buyers get timed out while filling in payment details
Option 2: Undersell
You could expire carts slowly (say, 30 minutes). This means:
Fewer people get timed out (better experience for those who succeed)
But many units get "locked" by abandoned carts
If 50% of users abandon carts, you only sell 500 of your 1,000 units during the sale window
You have to deal with the business impact of underselling (lost revenue, unfulfilled demand)
Most companies choose Option 1 (shorter expiration) because underselling is worse than temporary disappointment. You can always extend the sale or restock, but underselling means leaving money on the table.
Why This Approach Scales
Non-Blocking Concurrency
SKIP LOCKED means 10,000 users can pick units simultaneously. Each gets a different unit, no waiting, no contention.
Row-Level Locking
Locks are on individual units, not the whole table or item. User A locking unit #1 doesn't block User B from locking unit #2. This is fundamentally different from locking at the item level (where all users would block each other).
Short Transactions
Transactions last milliseconds (SELECT + UPDATE). Fast transactions mean locks are held briefly, maximizing throughput.
Database Handles Concurrency
We're not writing custom locking logic in application code (prone to bugs). The database's transaction engine handles all concurrency control, battle-tested over decades.
Scalability Math:
- Each transaction: ~10ms (SELECT + UPDATE)
- Database can handle: 100 concurrent transactions
- Throughput: 100 transactions / 10ms = 10,000 transactions/second
- For 1,000 units: Depletes in ~0.1 seconds
Real-world (with payment overhead):
- Pick operation: 10ms
- Payment initiation: 50ms
- Total per user: 60ms
- With 100 concurrent connections: 1,666 picks/second
- For 1,000 units: Depletes in ~0.6 seconds
Even under heavy load, inventory depletes in under a second if demand is high enough.
Similar Systems Using This Pattern
Ticket Booking (BookMyShow, Ticketmaster)
Table: seats
seat_id | show_id | picked_at | picked_by | purchased_by
A1 | 2501 | NULL | NULL | NULL
A2 | 2501 | NULL | NULL | NULL
...
Each seat is a unit. When you select seat A1, it gets picked_at set. If you don't pay within 10 minutes, it's released.
Hotel Booking
Table: room_nights
id | room_id | date | picked_at | picked_by | purchased_by
1 | 305 | 2026-01-10 | NULL | NULL | NULL
2 | 305 | 2026-01-11 | NULL | NULL | NULL
...
Each (room, date) combination is a unit. Booking room 305 for Jan 10-11 picks two units.
IRCTC Train Booking
Table: berths
berth_id | train_id | date | picked_at | picked_by | purchased_by
1A-UB | 12345 | 2026-01-15 | NULL | NULL | NULL
1A-LB | 12345 | 2026-01-15 | NULL | NULL | NULL
...
Each berth is a unit. High contention during festival bookings handled gracefully by SKIP LOCKED.
Shopify Flash Sales
Shopify's inventory system uses a variant of this pattern for limited edition drops. When Supreme releases 500 hoodies, each hoodie is a unit in their inventory system.
The Complete Flow: A Successful Purchase
12:00:00.000 - User 1023 clicks "Add to Cart":
BEGIN;
SELECT * FROM units WHERE item_id = 720 AND picked_at IS NULL
ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- Returns: unit #1
UPDATE units SET picked_at = NOW(), picked_by = 1023 WHERE id = 1;
COMMIT;
12:00:00.050 - User shown "Item Added, Proceed to Checkout":
Frontend: "You have 12 minutes to complete purchase"
12:00:15.000 - User clicks "Pay Now":
POST /api/payment
→ Stripe API call initiated
12:00:18.000 - Stripe processes payment:
Stripe → Webhook → POST /yoursite/webhook/payment
{"event": "payment_intent.succeeded", "unit_id": 1, "user_id": 1023}
12:00:18.100 - Webhook handler updates database:
UPDATE units
SET purchased_by = 1023, purchased_at = NOW()
WHERE id = 1;
12:00:18.200 - Order confirmation email sent:
"Congratulations! Your iPhone 15 order is confirmed."
Total time from click to confirmation: 18 seconds. The unit was locked for the entire period, preventing double-selling.
The Elegance of Simplicity
This flash sale design is beautifully simple:
No distributed locks: Database transactions handle everything
No Redis reservations: Everything in the primary database
No complex state machines: Three states (available, picked, purchased)
No coordination between servers: Each app server independently queries the database
Yet it handles the hardest problem in e-commerce: selling exactly N items to the first N buyers when 100,000 people compete simultaneously.
The secret isn't exotic technology, it's understanding how database locking works and leveraging FOR UPDATE SKIP LOCKED to turn concurrent chaos into orderly, fair distribution. Sometimes the best solutions aren't the most complex; they're the ones that use simple tools exactly right.
Additional Reading Materials
Read these books:
Introduction to information retrieval
Book by Christopher D. Manning
That's all for now folks. See you in the next blog!



