Skip to main content

Command Palette

Search for a command to run...

System Design Deep Dives

Building Scalable Infrastructure from First Principles

Published
69 min read
System Design Deep Dives

In the world of distributed systems, the gap between "it works on my laptop" and "it works at scale" is filled with hard-earned lessons about concurrency, consistency and clever architectural patterns. This collection of system design deep dives explores how real-world platforms, from YouTube's view counter processing billions of events to Cricbuzz serving millions during live matches, solve fundamental challenges that every senior engineer eventually faces. Whether you're designing a message broker on SQL, building a distributed task scheduler, or implementing flash sales that handle 100,000 concurrent buyers fighting over 1,000 items, the patterns here reveal a surprising truth: the most robust solutions often aren't the most exotic. They're the ones that deeply understand database transactions, leverage partitioning intelligently and embrace eventual consistency where appropriate. Each section breaks down a specific system, recent search history, text commentary, task scheduling, search engines, view counting, message queues and flash sales, showing not just what to build, but why each architectural decision matters when milliseconds and correctness both count.

Designing Recent Searches

Understanding the Problem Through Metrics

As engineers, we don't always receive explicit requirements. Instead, we get behavioral data that tells the real story:

  • 50% of users tap on Search Bar within the first 5 seconds – This signals an urgent need for pre-loaded, instantly available data

  • 30% of all searches happens through recent searches – This represents massive write volume that must be captured reliably

These metrics reveal a dual challenge: we need a system that's both extremely read-optimized for instant access and robust enough to handle high-velocity writes from one of the most frequently hit APIs in any application.

Key Architectural Considerations

Before diving into implementation, we need to answer fundamental questions that will shape our entire design:

Bounded vs. Unbounded Data: Do we store search history indefinitely, or is there a natural limit? Recent searches are inherently bounded – users care about their last 10-20 queries, not their entire history.

Latency Requirements: With 50% of users accessing search within 5 seconds, we're operating in the sub-100ms response time territory. Any database round-trip becomes a bottleneck.

Stale Data Tolerance: If a user searches on their phone, how quickly must it appear on their laptop? Cross-device consistency matters, but perfect real-time sync may not be worth the architectural complexity.

Data Freshness: The 30% conversion rate through recent searches tells us that stale data directly impacts user behavior. A search made seconds ago must appear immediately.

Storage Layer: Choosing the Right Database

For storing search queries, we need to evaluate our access patterns:

  • No relational requirements (just user ID and query text)

  • Massive write volume (search is one of the most hit APIs)

  • Always accessed at the user level (never cross-user queries)

  • High cardinality data (millions of users, each with unique search patterns)

These characteristics point us toward partitioned NoSQL databases like MongoDB or Elasticsearch. Partitioning by user ID ensures that all queries for a user live in the same shard, making retrieval efficient when needed.

The Core Design Challenge: Sync vs. Async Writes

This decision fundamentally shapes our architecture. Let's examine both paths:

Asynchronous Writes (via SQS/Kafka):

  • Pros: API response times remain fast, system is decoupled and extensible

  • Cons: Introduces delay between search action and persistence, eventual consistency across devices

Synchronous Writes:

  • Pros: Zero delay, immediate consistency

  • Cons: Database becomes a bottleneck if it can't handle the load, increased API latency

Given that 30% of searches use recent searches, any delay in persistence directly impacts user experience. A user who searches for "flights to Paris" and immediately taps the search bar again expects to see that query. This drives us toward synchronous writes, but with a critical optimization.

The Solution: Write-Through Caching with Redis

The breakthrough insight comes from recognizing that our NoSQL database has a fundamental limitation: records for the same user are scattered across the partition. Listing a user's recent searches requires scanning and sorting, which is expensive at scale.

Instead, we implement a dual-write strategy:

  1. Synchronous write to Redis (in-memory store)

  2. Synchronous write to MongoDB (persistent store)

Redis Data Structure

Key: user:{user_id}:recent_searches
Value: ["flights to paris", "best restaurants NYC", "weather tomorrow", ...]
Type: List (capped at 10 most recent)

This pre-computed structure delivers:

  • Sub-millisecond read latency (pure memory lookup, no computation)

  • Automatic cross-device consistency (all devices read from same Redis key)

  • Natural data expiry (LTRIM keeps only recent 10 queries)

The Write Path

When a user submits a search query:

POST /search?q=flights+to+paris

1. Write to Redis: LPUSH user:12345:recent_searches "flights to paris"
                   LTRIM user:12345:recent_searches 0 9
2. Write to MongoDB: INSERT {user_id: 12345, query: "flights to paris", timestamp: now()}
3. Return search results

Both writes happen synchronously, but Redis is so fast (sub-millisecond) that the added latency is negligible compared to the search operation itself.

Handling the Cold Start Problem

Here's where our initial metric becomes critical: 50% of users tap search within 5 seconds. If the Redis cache is empty (server restart, cache eviction, new device), fetching from MongoDB ruins the experience.

Solution: Pre-warming Strategy

When a user logs in or opens the app:

  1. Check if user:{user_id}:recent_searches exists in Redis

  2. If missing, asynchronously fetch last 10 queries from MongoDB

  3. Populate Redis in background

  4. Set TTL (e.g., 7 days) to auto-expire inactive users

This ensures that by the time most users tap search, their data is already loaded.

Transaction Outbox Pattern for Reliability

Even with synchronous writes, we need durability guarantees. What if Redis accepts the write but MongoDB fails?

We implement a transaction outbox pattern:

  1. Write to MongoDB includes an outbox collection entry

  2. Background worker reads outbox and ensures Redis sync

  3. Outbox entry deleted only after confirming Redis update

This provides eventual consistency guarantees without sacrificing performance.

Redis Persistence: AOF Configuration

Redis is in-memory, but we need crash recovery. Append-Only File (AOF) persistence gives us the right balance:

appendonly yes
appendfsync everysec

This configuration:

  • Writes every operation to disk log

  • Syncs to disk every second (not every write)

  • Recovers data with at most 1 second of loss

  • Maintains performance (doesn't block on disk I/O)

Cost Optimization Strategies

Running Redis at scale gets expensive. We optimize through data lifecycle management:

1. Bounded Cache Size

  • Redis stores only 10 most recent queries per user

  • MongoDB stores complete history

  • 90% of users never scroll past first 5 suggestions

2. Data Archival

  • Queries older than 6 months move to cold storage (S3)

  • Saves 70-80% on database costs

  • Historical queries rarely accessed (product decision: show with higher latency if needed)

3. TTL-Based Eviction

  • Inactive users (no search in 30 days) auto-expire from Redis

  • Re-warm on next login

  • Keeps Redis memory footprint manageable

Async Write Alternative (Extensibility Path)

While we chose synchronous writes for immediate consistency, an async architecture offers advantages for future features:

POST /search → Kafka Topic: "search_events"
              ↓
         Consumer Group
         ↓           ↓
      MongoDB    Redis
                  ↓
         Analytics Pipeline (ML for personalization)

This pattern shines when you need to:

  • Feed search data to recommendation engines

  • Run real-time analytics on search trends

  • A/B test different ranking algorithms

The tradeoff is accepting a few hundred milliseconds of delay before searches appear in "recent" lists.

The Final Architecture

Our design delivers on all key requirements:

  • Sub-100ms read latency: Pure Redis lookups, no computation

  • Zero staleness: Synchronous writes ensure immediate consistency

  • Cross-device sync: All devices read from same Redis key

  • High write throughput: Redis handles 100K+ writes/sec per instance

  • Cost efficient: Bounded cache + archival keeps expenses predictable

  • Reliable: AOF persistence + outbox pattern prevent data loss

The result is a system where tapping the search bar feels instant, recent searches are always fresh and the architecture scales to millions of users searching billions of times per day.

Designing Cricbuzz's Text Commentary

The Scale of the Challenge

Cricbuzz is India's 8th most visited website. During a high-profile cricket match (IPL finals, India vs Pakistan), millions of fans simultaneously refresh their screens, desperately waiting for the next ball's commentary. A single match can generate:

  • 10-15 million concurrent users during peak moments

  • Commentary updates every 20-30 seconds (each ball bowled)

  • Sustained traffic for 3-4 hours per match

  • Multiple matches running simultaneously during tournaments

Designing text commentary isn't just about storing messages, it's about delivering real-time updates to millions while keeping infrastructure costs reasonable.

Core Requirements

1. Real-time Updates: When a commentator types "SIX! Kohli sends it sailing over long-on," millions of users must see it within seconds.

2. Cost Efficiency: Serving millions of requests per second can bankrupt your cloud bill if not architected carefully.

3. Excellent User Experience: Commentary should feel live, load instantly and work seamlessly even during traffic spikes.

4. Realistic Durability: Commentary is typed by humans in a control room. If something fails, they can retype. We don't need bank-grade transaction guarantees.

Storage Architecture: Hot and Cold Data

Commentary follows a natural lifecycle that we can exploit for cost optimization:

Hot Data (Last 20-30 comments): Users actively refresh to see the latest updates. This needs to be lightning-fast.

Cold Data (Everything older): Users might scroll through match history, but with much lower frequency. Speed is less critical.

This insight drives our storage strategy:

Primary Storage: Partitioned NoSQL Database

We use a partitioned database (MongoDB, Cassandra) with schema:

{
  match_id: "IND_vs_PAK_2025_final",
  commentary_id: "10.3",  // Over.Ball
  timestamp: 1704067200,
  text: "SIX! Kohli smashes it over long-on!",
  commentator: "Harsha Bhogle"
}

Partitioning by match_id ensures all commentary for a match lives in the same shard, making retrieval efficient. The database serves as our source of truth and handles:

  • Long-term storage (entire match history)

  • Pagination for users scrolling through old commentary

  • Analytics and post-match analysis

However, hitting the database for every refresh from millions of users would be catastrophic for both latency and cost.

Caching Layer: Redis for Hot Commentary

The breakthrough realization: only the latest 20-30 comments matter for 95% of users. We aggressively cache these in Redis:

Key: match:{match_id}:live_commentary
Type: Sorted Set (scored by timestamp)
Value: [
  {score: 1704067200, text: "10.6: SIX! Kohli finishes the over in style"},
  {score: 1704067180, text: "10.5: DOT ball, good length from Shaheen"},
  {score: 1704067160, text: "10.4: FOUR! Edge flies to third man"},
  ...
]

Why Sorted Set?

  • Automatic ordering by timestamp

  • ZRANGE operations to fetch latest N comments

  • O(log N) insertion performance

  • Built-in score-based querying

This architecture means:

  • Read path: 99% of requests hit Redis (sub-millisecond response)

  • Database load: Only pagination requests for historical commentary

  • Cost: Massive savings on database read IOPS

The Write Path: Commentator Control Room

Here's the crucial insight about durability: commentary is manually typed by humans sitting in a control room. If a write fails, they can simply hit "Save" again. This changes everything about our consistency requirements.

Direct Redis Updates

When a commentator submits new commentary:

POST /commentary
{
  "match_id": "IND_vs_PAK_2025_final",
  "text": "SIX! Kohli sends it sailing!",
  "ball": "10.3"
}

1. Write directly to Redis:
   ZADD match:IND_vs_PAK_2025_final:live_commentary 1704067200 
   '{"ball":"10.3", "text":"SIX! Kohli sends it sailing!"}'

2. Trim to keep only latest 30:
   ZREMRANGEBYRANK match:IND_vs_PAK_2025_final:live_commentary 0 -31

3. Async write to MongoDB (fire and forget)
   INSERT into commentary_collection {...}

4. Return 200 OK to commentator

Why this works:

  • Redis write completes in < 5ms

  • Commentator gets instant feedback

  • If MongoDB write fails, it's okay, commentator will notice commentary missing and retry

  • No complex distributed transactions needed

Manual Retry as Durability Strategy

Instead of complex retry logic, we embrace human oversight:

  • Commentator UI shows the last 10 comments posted

  • If a comment doesn't appear after clicking "Save," they click again

  • Simple, effective and matches actual operational workflow

  • Saves massive engineering complexity on automatic retry mechanisms

The Read Path: Short Polling for Live Updates

Real-time updates face a classic tradeoff: WebSockets vs. Polling.

WebSockets: Persistent connections, true push-based updates

  • Pros: Instant delivery, efficient after connection established

  • Cons: Millions of concurrent connections, complex infrastructure, expensive at scale

Short Polling: Clients request updates every N seconds

  • Pros: Stateless, cacheable, simpler infrastructure

  • Cons: Slight delay, more requests

For cricket commentary, short polling wins because:

  1. Commentary updates every 20-30 seconds (each ball), not every millisecond

  2. Polling every 5-10 seconds feels real-time for this use case

  3. Redis handles millions of reads/second easily with replication

  4. Much cheaper than maintaining millions of WebSocket connections

Optimized Polling Implementation

GET /api/match/{match_id}/commentary/latest?since=1704067180

Response:
{
  "commentary": [
    {"ball": "10.6", "text": "SIX! Kohli finishes the over", "ts": 1704067200},
    {"ball": "10.5", "text": "DOT ball from Shaheen", "ts": 1704067190}
  ],
  "next_poll": 5000  // milliseconds
}

Key optimizations:

  • since parameter: Only return new commentary, reducing payload size

  • next_poll field: Server controls polling frequency (can slow down during breaks)

  • HTTP caching: ETag headers let CDN cache when no new commentary exists

CDN Edge Caching

During overs where no balls are bowled (drinks break, DRS review), millions of users still poll. We cache at the edge:

Cache-Control: public, max-age=3
ETag: "match_123_commentary_10.5"

When commentary doesn't change:

  • CDN returns 304 Not Modified

  • No request hits our origin servers

  • Massive cost savings during idle periods

Cost Optimization: Data Lifecycle Management

A single IPL season generates millions of commentary entries. Storing all of it in hot storage would be expensive. We implement aggressive archival:

Tiered Storage Strategy

Tier 1 - Redis (Live Match + 1 hour post-match)

  • Latest 30 comments per live match

  • Sub-millisecond access

  • Auto-expire 1 hour after match ends

  • Cost: ~$500/month for 50 concurrent matches

Tier 2 - MongoDB (Last 30 days)

  • Full commentary for recent matches

  • Used for "recent matches" section

  • SSD-backed, fast reads

  • Cost: ~$2000/month

Tier 3 - S3 Archive (Historical)

  • Matches older than 30 days

  • Compressed JSON files

  • Accessed only for "match archives" feature

  • Cost: ~$50/month for entire season

Archive Process (runs daily):

1. Query MongoDB for matches older than 30 days
2. Export commentary to JSON, gzip compress
3. Upload to S3: s3://cricbuzz-archives/2025/IPL/match_123.json.gz
4. Delete from MongoDB
5. Update match metadata: commentary_location = "s3://..."

This tiered approach reduces storage costs by 90% while maintaining excellent UX for active matches.

Handling Traffic Spikes: Read Scaling

When Kohli walks out to bat, traffic can spike 10x in seconds. Our architecture handles this through:

Redis Read Replicas

Redis Cluster:
- 1 Primary (writes)
- 5 Read Replicas (reads)
- Load balancer distributes GET requests

Each replica handles 100K requests/second, giving us 500K reads/sec capacity. Replication lag is < 10ms, acceptable for commentary that updates every 30 seconds.

Application-Level Load Balancing

API Servers (Auto-scaling group):
- Baseline: 20 instances
- Scale trigger: CPU > 60% for 30 seconds
- Scale up: Add 10 instances (takes ~60 seconds)
- Scale down: Remove instances when CPU < 30% for 5 minutes

Short polling gives us time to scale, unlike WebSockets where connections are already established, polling lets new instances immediately share load.

Putting It All Together: User Journey

User opens Cricbuzz app during IND vs PAK match:

  1. App requests: GET /api/match/123/commentary/latest

  2. Load balancer routes to API server

  3. API server queries Redis: ZRANGE match:123:live_commentary -30 -1

  4. Redis returns latest 30 comments in 2ms

  5. API server returns JSON to app in 15ms total

  6. App displays commentary, sets 5-second polling timer

Commentator types "SIX! Kohli smashes it!":

  1. Commentator hits "Save" in control room UI

  2. Request hits: POST /commentary

  3. API writes to Redis in 3ms

  4. Async worker writes to MongoDB (no waiting)

  5. Return 200 OK to commentator in 5ms

  6. Commentator sees confirmation, types next comment

Within 5 seconds:

  1. Millions of users' apps poll for updates

  2. Requests hit CDN edge (many cached if no new commentary)

  3. New commentary detected: CDN forwards to origin

  4. Origin serves from Redis (cache hit)

  5. Users see "SIX! Kohli smashes it!" appear

After match ends:

  1. Redis entries expire after 1 hour (auto-cleanup)

  2. Full commentary remains in MongoDB for 30 days

  3. Daily job archives to S3 after 30 days

  4. Historical commentary still accessible via "Match Archive" at slower speed

The Architecture's Elegant Properties

This design delivers on every requirement:

Real-Time Feel: 5-second polling + Redis caching = commentary appears almost instantly

Cost Efficient:

  • Short polling cheaper than WebSockets at scale

  • Tiered storage saves 90% on historical data

  • CDN caching reduces origin traffic by 70%

Excellent UX:

  • Sub-20ms API response times

  • No loading spinners during polls

  • Graceful degradation if Redis slow (fallback to MongoDB)

Realistic Durability:

  • Manual retry by commentators (humans in the loop)

  • No over-engineered distributed transactions

  • Simpler codebase, fewer failure modes

The result: millions of cricket fans experience the thrill of live commentary, ball by ball, without Cricbuzz's infrastructure team experiencing the thrill of a crashing production system.

Designing a Super Simple Text-Based Search Engine

The Fundamental Problem

At its core, a search engine solves one problem: given an information need, find the most relevant documents from a corpus. Whether you're searching through JSON files, web pages, or text documents, the challenge remains the same, how do you quickly surface what matters?

The naive approach seems obvious: when someone searches for "pizza," scan through every document one by one and check if it contains the word "pizza." Return all matches. Done.

Except this is O(N) complexity and when your corpus contains millions of documents, this becomes impossibly slow. A search that takes 30 seconds isn't a search, it's a coffee break.

Making Search Efficient: The Inverted Index

The breakthrough that makes modern search possible is indexing. Instead of checking documents at search time, we pre-compute which documents contain which words.

Building the Inverted Index

Think of it like the index at the back of a textbook. Instead of reading the entire book to find mentions of "photosynthesis," you flip to the index and see: "photosynthesis: pages 47, 89, 132."

For our search engine, we build an inverted index:

Corpus:
doc1: "The best pizza in New York"
doc2: "New York style bagels"  
doc3: "Chicago deep dish pizza"

Inverted Index:
"pizza"   → [doc1, doc3]
"new"     → [doc1, doc2]
"york"    → [doc1, doc2]
"chicago" → [doc3]
"bagels"  → [doc2]
"best"    → [doc1]
...

Now when someone searches "pizza," we don't scan documents, we simply look up the word in our index and instantly retrieve [doc1, doc3]. This transforms our O(N) problem into an O(1) lookup.

The Indexing Process

For each document in corpus:
  1. Tokenize: "The best pizza" → ["the", "best", "pizza"]
  2. Normalize: ["the", "best", "pizza"] → ["best", "pizza"] (remove stopwords)
  3. For each token:
     - If token exists in index: append document ID
     - If token doesn't exist: create new entry [document ID]

This preprocessing step is expensive (you process every document once), but it makes search instantaneous. The tradeoff is obvious: spend time once during indexing to save time on every search forever.

Beyond Simple Matching: Relevance Ranking

But wait, simply returning all documents that contain "pizza" isn't enough. When a user searches, they want the most relevant documents first, not just any document that happens to mention the word.

Weighted Scoring: Not All Fields Are Equal

A document where "pizza" appears in the title is probably more relevant than one where it's buried in a footnote. We assign different weights to different fields:

Document structure:
{
  title: "Best Pizza Places in NYC",
  description: "A comprehensive guide...",
  tags: ["food", "pizza", "restaurants"],
  body: "When it comes to pizza, New York..."
}

Scoring weights:
- title match: 10x
- tags match: 5x  
- description match: 3x
- body match: 1x

When calculating relevance:

Query: "pizza"

doc1: "pizza" in title (10 points) + "pizza" in body (1 point) = 11 points
doc2: "pizza" in body twice (2 points) = 2 points

Result: Return doc1 first

This simple weighting dramatically improves result quality. Users want authoritative documents where the search term is central, not tangentially mentioned.

Users make mistakes. Constantly. They type "pizzza" instead of "pizza," "resturant" instead of "restaurant." A search engine that only matches exact strings is frustratingly brittle.

Edit Distance and BK-Trees

Edit distance (Levenshtein distance) measures how many single-character edits (insertions, deletions, substitutions) separate two words:

"pizza" → "pizzza": 1 edit (insert 'z')
"pizza" → "piza": 1 edit (delete 'z')  
"pizza" → "pitza": 1 edit (substitute 'z' → 't')

Naive fuzzy search would calculate edit distance against every word in our vocabulary, once again hitting O(N) complexity. Instead, we use a BK-tree (Burkhard-Keller tree), a data structure that organizes words by edit distance, allowing us to prune the search space.

When a user searches "pizzza," we:

  1. Look up in BK-tree with tolerance of 1-2 edits

  2. Find candidate words: ["pizza", "piazza"]

  3. Search our inverted index for both candidates

  4. Return combined results

This handles typos gracefully without sacrificing speed.

Here's a fascinating insight from real user behavior: people remember how words sound, not how they're spelled. This leads to a specific pattern of "typos" that fuzzy search can't solve.

Consider someone searching for Arnold Schwarzenegger:

  • "Arnold Swarzeneger"

  • "Arnold Swazeneger"

  • "Arnold Schwarzeneger"

These aren't random typos, they're phonetic approximations. The user knows how to pronounce the name but can't remember the exact spelling. Standard fuzzy search fails because the edit distance is too large.

Phonetic Algorithms: Metaphone and Soundex

Instead of indexing the literal spelling, we index by phonetic encoding, how the word sounds.

Soundex (simpler, older):

"Schwarzenegger" → S-623
"Swarzeneger"    → S-623  
"Swazeneger"     → S-623

Metaphone (more accurate):

"Schwarzenegger" → SHWRTSNKR
"Swarzeneger"    → SWRTSNKR
"Swazeneger"     → SWSNKR

When building our index, we create a parallel phonetic index:

Inverted Index:
"schwarzenegger" → [doc_arnold_bio]

Phonetic Index (Metaphone):
"SHWRTSNKR" → [doc_arnold_bio]

Now when someone searches "Swarzeneger," we:

  1. Try exact match: fails

  2. Convert to phonetic: "SWRTSNKR"

  3. Search phonetic index: matches "SHWRTSNKR" (close enough)

  4. Return Arnold Schwarzenegger results

This handles a whole class of spelling mistakes that edit-distance approaches miss entirely.

Synonym Expansion: Understanding Intent

Users express the same concept in different ways. Someone searching "home" might want documents about "houses." Someone searching "car" might want "automobile" results.

Building a Synonym Map

Synonym groups:
["home", "house", "residence", "dwelling"]
["car", "automobile", "vehicle"]
["happy", "joyful", "cheerful"]

At index time, we expand documents:

Original document: "Beautiful house for sale"

Indexed under:
- "house" → [doc1]
- "home" → [doc1]      (synonym)
- "residence" → [doc1]  (synonym)
- "dwelling" → [doc1]   (synonym)

At search time, we expand queries:

User searches: "home"

Expanded query: "home OR house OR residence OR dwelling"

Results: All documents mentioning any synonym

This dramatically improves recall, the percentage of relevant documents we actually find.

Spell Correction: Did You Mean...?

Typos that are too far from any real word need correction before search. When someone searches "HOUPE," we want to suggest "HOUSE."

Building a Vocabulary-Based Corrector

1. Build vocabulary from corpus (all unique words)
2. For misspelled query:
   - Calculate edit distance to all vocabulary words
   - Find closest matches (edit distance ≤ 2)
   - Rank by word frequency in corpus
3. Suggest top match: "Did you mean: house?"

Example:

Query: "HOUPE"

Candidates (within 2 edits):
- "house" (edit distance: 2, frequency: 15,000)
- "coupe" (edit distance: 2, frequency: 500)  
- "hopped" (edit distance: 2, frequency: 200)

Suggestion: "Did you mean: house?"

Frequency matters because common words are more likely to be intended than rare ones.

Query Segmentation: The Missing Space Problem

Users forget spaces. They search "mcdonalds" when they mean "mc donalds," or "newyork" instead of "new york." This breaks our indexing because we indexed "new" and "york" separately.

N-gram Based Segmentation

We pre-compute common multi-word phrases in our corpus:

Bigrams (2-word phrases):
"new york" → appears 50,000 times
"san francisco" → appears 30,000 times
"mc donalds" → appears 5,000 times

When we receive "newyork," we:

  1. Try exact match: fails

  2. Attempt segmentation:

    • "n ewyork" (unlikely)

    • "ne wyork" (unlikely)

    • "new york" (found in bigram index, high frequency!)

  3. Search for "new york" instead

This requires maintaining a dictionary of common phrases, but it handles a surprisingly common user error pattern.

Putting It All Together: The Search Pipeline

When a user types a query and hits enter, here's what happens:

Query: "pizzza in newyork"

1. Tokenization: ["pizzza", "in", "newyork"]

2. Stopword removal: ["pizzza", "newyork"]

3. Spell correction:
   - "pizzza" → "pizza" (edit distance: 1)
   - "newyork" → "new york" (segmentation)

4. Synonym expansion:
   - "pizza" → ["pizza", "pie"] (if synonyms exist)

5. Phonetic fallback (if needed):
   - Convert to Metaphone codes

6. Index lookup:
   - Fetch document IDs for each term
   - "pizza" → [doc1, doc3, doc7, doc12]
   - "new" → [doc1, doc2, doc5]
   - "york" → [doc1, doc2, doc5]

7. Intersection & Scoring:
   - Documents containing all terms: [doc1]
   - Apply field weights (title, body, etc.)
   - Calculate relevance scores

8. Ranking & Return:
   - Sort by relevance score
   - Return top N results

The entire pipeline completes in milliseconds because the expensive work (indexing) happened ahead of time.

Why This Scales

The beauty of this architecture is that search time is independent of corpus size:

  • Index lookup: O(1) hash table access

  • BK-tree fuzzy search: O(log N) with pruning

  • Phonetic lookup: O(1) after encoding

  • Synonym expansion: O(k) where k is number of synonyms (small constant)

Whether you have 1,000 documents or 1,000,000,000 documents, individual searches remain fast because you're not scanning documents, you're looking up pre-computed indexes.

The tradeoff is storage: your inverted index can be 30-50% the size of your corpus. But disk is cheap and user patience is not.

From Simple to Sophisticated

This "super simple" search engine actually incorporates sophisticated techniques:

  • Inverted indexing transforms O(N) scans into O(1) lookups

  • Weighted scoring surfaces the most relevant results first

  • Fuzzy search handles typos gracefully

  • Phonetic search solves the "I know how it sounds" problem

  • Synonym expansion understands semantic similarity

  • Spell correction prevents dead-end searches

  • Query segmentation handles missing spaces

Each technique addresses a real pattern of user behavior, making search feel effortless even though complex machinery runs behind the scenes. The result: users type what they mean (however they mean it) and they get what they want. That's the magic of a well-designed search engine.

Designing a Distributed Task Scheduler

The Problem We're Solving

Task scheduling is everywhere in modern systems. You need to send a reminder email in 2 hours. Run a data pipeline every night at 3 AM. Process payment reconciliation every Monday morning. Archive old records on the first of every month.

Building a distributed task scheduler means solving a deceptively simple problem: execute tasks at specific times, reliably, at scale. Systems like Dkron and AWS CloudWatch Events exist because this problem is harder than it looks.

Core Requirements

Our scheduler must support two fundamental scheduling patterns:

1. Fixed-Time Tasks: Execute once at a specific moment

  • "Send password reset email at 2:45 PM today"

  • "Start the sale at midnight on Black Friday"

  • "Generate monthly report at 12:01 AM on the 1st"

2. Recurring Tasks (CRON): Execute on a schedule

  • "Back up database every day at 2 AM"

  • "Send weekly newsletter every Monday at 9 AM"

  • "Check payment status every 15 minutes"

Critical Constraint: 30-Second SLA

If a task is scheduled for 10:01:00 AM, it must start executing by 10:01:30 AM. This isn't about how long the task takes to complete, it's about how quickly we pick it up and start processing. This SLA shapes every architectural decision we make.

Granularity: Minute-Level

We don't support second-level precision. Tasks scheduled for "10:01:15 AM" and "10:01:45 AM" will both be treated as "10:01:00 AM." This simplification drastically reduces complexity without sacrificing most real-world use cases.

Architectural Components: Store, Pick, Execute

Every task scheduler breaks down into three fundamental operations:

  1. Store: Persist task definitions and scheduling information

  2. Pick: Identify which tasks are ready to execute

  3. Execute: Actually run the task logic

The naive approach puts all three in a single process. But this doesn't scale and here's why.

Why Separate Pickers from Executors?

Imagine you're running a pizza restaurant. You could have one person who takes orders, makes pizzas and delivers them. But you quickly realize this doesn't work:

  • Taking orders is fast (seconds)

  • Making pizzas is slower (minutes)

  • Delivering is slowest (20+ minutes)

If the same person does everything, they're blocked on deliveries while orders pile up. You separate roles: order-taker, cook, delivery driver. Each scales independently.

Task scheduling has the same dynamics:

Picking Tasks (Fast)

  • Query database: "which tasks are due now?"

  • Mark task as "picked" so others don't grab it

  • Put task on a queue for execution

  • Time: < 100ms per task

Executing Tasks (Slow)

  • Send HTTP request to external API (might take 5-10 seconds)

  • Process large dataset (might take minutes)

  • Wait for third-party service response

  • Time: seconds to minutes per task

If the same machine does both, picking stops while execution runs. Tasks accumulate, missing their SLA. By separating concerns:

  • Pickers: Lightweight, high-throughput machines that continuously scan for due tasks

  • Executors: Bulky machines that do the heavy work, scaled proportionally to load

You might run 3 picker instances but 50 executor instances, matching the different performance characteristics.

Storage: The Tasks Table

We use a relational database (MySQL, PostgreSQL) to store task definitions. The schema captures everything we need:

Table: tasks
+---------------+------------------+
| id            | INT PRIMARY KEY  |
| task          | TEXT             | ← Task payload/definition
| scheduled_at  | TIMESTAMP        | ← When to execute (for fixed tasks)
| picked_at     | TIMESTAMP        | ← When a picker claimed it
| started_at    | TIMESTAMP        | ← When execution began
| ended_at      | TIMESTAMP        | ← When execution completed
+---------------+------------------+

Why Relational?

  • ACID transactions for atomic picking (critical for preventing duplicate execution)

  • Rich querying with WHERE clauses and indexes

  • No sharding needed for typical workloads (millions of tasks)

  • Simple operational model

Let's trace through the lifecycle of a task:

Task Registration (user schedules a task):

INSERT INTO tasks (task, scheduled_at) 
VALUES ('send_email:user_123', '2026-01-05 10:01:00');

At this point: scheduled_at is set, all other timestamps are NULL.

The Picking Process: Finding Ready Tasks

Pickers run on a continuous loop, typically every few seconds, looking for tasks that are due. The query is deceptively simple but critical:

SELECT * FROM tasks
WHERE scheduled_at <= (NOW() + INTERVAL 2 MINUTE)
  AND picked_at IS NULL
ORDER BY scheduled_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;

Let's break down each clause:

scheduled_at <= (NOW() + INTERVAL 2 MINUTE)

This looks 2 minutes into the future, not just at tasks due right now. Why? It gives us a buffer to handle the entire pick-and-queue workflow. If it's 10:00:00 AM, we fetch tasks scheduled up to 10:02:00 AM.

This headroom is crucial for meeting our 30-second SLA. By the time we pick, queue and an executor pulls from the queue, those "future" tasks will be due.

AND picked_at IS NULL

Only fetch tasks that haven't been claimed yet. Once a picker sets picked_at, that task is off-limits to other pickers. This prevents duplicate execution.

ORDER BY scheduled_at ASC

Process tasks in chronological order. Tasks scheduled for 10:01:00 AM should execute before tasks scheduled for 10:05:00 AM. FIFO semantics match user expectations.

LIMIT 10

Batch size for processing. Too small (1) means excessive database round trips. Too large (1000) means one slow picker holds a huge batch while others sit idle. 10-50 is a sweet spot.

FOR UPDATE SKIP LOCKED

This is the magic that makes distributed picking safe:

  • FOR UPDATE: Lock the selected rows for the duration of the transaction

  • SKIP LOCKED: If another picker already locked some rows, skip them instead of waiting

This means multiple pickers can run simultaneously without blocking each other. Picker 1 might grab tasks 1-10, Picker 2 grabs 11-20, all without conflicts.

Atomic Pick-and-Mark

After selecting tasks, we must atomically mark them as picked:

UPDATE tasks 
SET picked_at = NOW()
WHERE id IN (101, 102, 103, 104, 105, 106, 107, 108, 109, 110);

This happens in the same transaction as the SELECT. The transaction flow:

BEGIN TRANSACTION;
  1. SELECT ... FOR UPDATE SKIP LOCKED
  2. UPDATE tasks SET picked_at = NOW() WHERE id IN (...)
COMMIT;

If the picker crashes between SELECT and UPDATE, the transaction rolls back and the tasks remain unpicked. No orphaned state.

Once committed, picked_at is set and those tasks won't be selected by other pickers.

Pushing to the Execution Queue

After marking tasks as picked, we push them to a message queue (SQS, RabbitMQ):

For each task in batch:
  queue.push({
    task_id: 101,
    payload: "send_email:user_123",
    scheduled_at: "2026-01-05 10:01:00"
  })

Why a Message Queue?

The queue decouples pickers from executors:

  • Buffering: If executors are momentarily overwhelmed, tasks wait in the queue rather than backing up in the database

  • Load balancing: Multiple executors pull from the same queue, automatically distributing work

  • Retry semantics: If an executor crashes mid-task, the message becomes visible again for retry (though we're designing without retries, the queue still provides crash recovery)

At this point, the picker's job is done. It moves on to the next batch.

The Execution Phase

Executors are worker processes that pull tasks from the queue and actually run them:

Executor process (continuous loop):
  1. msg = queue.pull()  // Blocks until task available
  2. task = fetch_from_db(msg.task_id)
  3. UPDATE tasks SET started_at = NOW() WHERE id = task.id
  4. execute_task_logic(task.payload)
  5. UPDATE tasks SET ended_at = NOW() WHERE id = task.id
  6. queue.acknowledge(msg)  // Remove from queue

Step-by-Step Breakdown:

1. Pull from Queue The executor blocks on queue.pull(), waiting for work. When a message arrives, it gets exclusive access (queue guarantees no other executor sees this message).

2. Fetch Full Task Details The queue message is minimal (task ID, payload). We fetch the full record from the database for any additional context.

3. Mark Execution Start Set started_at = NOW(). This is important for monitoring, we can track how long tasks take and identify stuck jobs.

4. Execute the Task This is where the actual work happens. For a "send email" task, this means calling your email service API. For a "process data" task, this means running your computation.

5. Mark Completion Set ended_at = NOW(). The task lifecycle is complete.

6. Acknowledge the Message Tell the queue "I'm done with this message, remove it." If we crash before acknowledgment, the message becomes visible again for another executor.

Meeting the 30-Second SLA

Our SLA states: tasks scheduled for 10:01:00 AM must start executing by 10:01:30 AM. Let's trace the timing:

10:00:00 AM: Task scheduled for 10:01:00 AM
10:00:04 AM: Picker runs, sees task (it's within 2-minute window)
10:00:04 AM: Picker marks picked_at, pushes to queue (< 100ms)
10:00:05 AM: Task sitting in queue
10:01:00 AM: Task is now officially "due"
10:01:01 AM: Executor pulls from queue, marks started_at
10:01:01 AM: Execution begins

Total time from scheduled to started: 1 second, well under our 30-second budget.

What if executors are overwhelmed?

10:00:04 AM: Task pushed to queue
10:00:05-10:01:20 AM: Task waits in queue (executors busy)
10:01:21 AM: Executor finally pulls task, marks started_at

Total time: 21 seconds, still meeting SLA.

The 2-minute look-ahead in picking gives us breathing room. Even if there's a 1-minute queue delay, we're fine.

Handling Recurring Tasks: The CRON Challenge

Fixed-time tasks are straightforward: schedule once, execute once, done. Recurring tasks are trickier.

Consider a task with CRON schedule 0 2 * * * (daily at 2 AM). After execution, we need to schedule the next occurrence.

Data Model for Recurring Tasks

Table: tasks
+---------------+------------------+
| id            | INT              |
| task          | TEXT             |
| schedule      | TEXT             | ← CRON expression: "0 2 * * *"
| scheduled_at  | TIMESTAMP        | ← Next execution time
| picked_at     | TIMESTAMP        |
| started_at    | TIMESTAMP        |
| ended_at      | TIMESTAMP        |
+---------------+------------------+

The schedule field contains the CRON expression. When a recurring task completes:

Executor (after task completes):
  1. Parse CRON: "0 2 * * *"
  2. Calculate next run: next_2am()
  3. UPDATE tasks 
     SET scheduled_at = '2026-01-06 02:00:00',
         picked_at = NULL,
         started_at = NULL,
         ended_at = NULL
     WHERE id = 42

Critical: We reset picked_at, started_at and ended_at to NULL. This makes the task "look new" to pickers. On the next cycle, pickers will see scheduled_at = 2026-01-06 02:00:00 and picked_at IS NULL, treating it as a fresh task to schedule.

This "reset and reschedule" pattern means recurring tasks never get deleted, they just keep updating their scheduled_at timestamp.

When to Calculate Next Execution

Two approaches:

Option 1: After Picking When a picker grabs a recurring task, immediately schedule the next occurrence:

-- Picker marks current as picked
UPDATE tasks SET picked_at = NOW() WHERE id = 42;

-- Picker inserts next occurrence
INSERT INTO tasks (task, schedule, scheduled_at)
VALUES ('backup_db', '0 2 * * *', '2026-01-06 02:00:00');

Pros: Next run is scheduled immediately, no gap Cons: Creates duplicate rows, table grows over time

Option 2: After Execution Wait until the task completes, then update the same row:

-- Executor completes task
UPDATE tasks 
SET scheduled_at = '2026-01-06 02:00:00',
    picked_at = NULL, started_at = NULL, ended_at = NULL
WHERE id = 42;

Pros: Single row per recurring task, cleaner schema Cons: Slight gap between completion and next scheduling

For our design, Option 2 is better. It keeps the table compact and works naturally with our minute-level granularity.

Predictive Scaling: Handling Burst Traffic

One of the hardest operational challenges is traffic bursts. At 9:00 AM, 10,000 tasks might be scheduled simultaneously (daily reports, morning emails, etc.).

If we reactively scale executors based on queue depth, we're already behind. By the time new instances spin up (60-90 seconds), tasks are already missing their SLA.

Solution: Predictive Scaling

The orchestrator is a separate component that monitors the database and proactively scales:

Orchestrator (runs every minute):
  1. Query DB: 
     SELECT COUNT(*) FROM tasks 
     WHERE scheduled_at BETWEEN NOW() AND NOW() + INTERVAL 10 MINUTE
       AND picked_at IS NULL

  2. If count > threshold:
     - Scale up executors *before* the burst hits
     - Add instances that will be ready when tasks start flowing

  3. Monitor key metrics:
     - Tasks pulled per minute
     - Average task wait time
     - Average time to completion

  4. If metrics degrade:
     - Scale up further

  5. If queue is empty for 5+ minutes:
     - Scale down to baseline

Example Scenario:

8:55 AM: Orchestrator queries DB
        Sees 5,000 tasks scheduled for 9:00-9:10 AM
        Current executors: 10 (can handle ~200 tasks/min)
        Needed capacity: ~500 tasks/min

8:56 AM: Orchestrator triggers scale-up to 30 executors
8:57-8:59 AM: New executors provisioning
9:00 AM: Burst hits, 30 executors ready to pull tasks
9:00-9:10 AM: All 5,000 tasks executed within SLA
9:15 AM: Orchestrator sees queue empty, scales back to 15
9:20 AM: Scales back to baseline 10

By looking 10 minutes ahead, we're always prepared for what's coming.

Key Metrics for Monitoring

The orchestrator and operations team track these SLA-critical metrics:

1. Tasks Pulled Per Minute How many tasks are pickers successfully claiming? This should match expected throughput.

2. Task Wait Time Time from scheduled_at to started_at. Our SLA requires this to be < 30 seconds.

3. Average Time to Completion Time from started_at to ended_at. Helps identify slow tasks that need optimization.

4. Picker Lag Are pickers keeping up with the schedule? If pickers are slow, tasks accumulate.

5. Queue Depth How many tasks are waiting for execution? Spikes indicate need for more executors.

These metrics feed into alerting and auto-scaling decisions.

Why No Retries?

Our design explicitly excludes automatic retries. This is a product decision, not a technical limitation. Here's why it makes sense:

1. Idempotency is Hard Retrying means potentially executing the same task twice. For many workloads (sending emails, charging payments, posting to social media), duplicate execution is worse than no execution.

2. Failure Signals Issues If a task fails, it often indicates a bug in task logic or a downstream service outage. Retrying just papers over the problem. Better to fail loudly, page an engineer and fix the root cause.

3. Simpler State Machine Without retries, task states are simple: pending → picked → executing → complete. Retries introduce exponential complexity: retry counts, backoff timers, max retry limits, dead letter queues.

Alternative: Manual Retry Failed tasks stay in the database with started_at set but ended_at NULL. A monitoring dashboard shows these failures. Engineers can:

  • Investigate the failure

  • Fix the bug

  • Reset the task: UPDATE tasks SET picked_at = NULL, started_at = NULL WHERE id = 42

  • Let the system pick it up again naturally

This puts humans in the loop for exception handling, which is often appropriate for scheduled tasks.

Putting It All Together

Let's trace a complete task lifecycle:

Monday, 9:00 AM: User schedules a task

INSERT INTO tasks (task, scheduled_at)
VALUES ('send_newsletter', '2026-01-05 10:00:00');

Monday, 9:58 AM: Picker scans database

SELECT * FROM tasks
WHERE scheduled_at <= '2026-01-05 10:00:00'
  AND picked_at IS NULL
ORDER BY scheduled_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;

Finds our newsletter task (and 9 others).

Monday, 9:58 AM: Picker marks and queues

UPDATE tasks SET picked_at = '2026-01-05 09:58:00' 
WHERE id = 42;
queue.push({task_id: 42, payload: 'send_newsletter'})

Monday, 9:59 AM: Task waits in queue

Monday, 10:00:01 AM: Executor pulls task

msg = queue.pull()

Monday, 10:00:01 AM: Executor starts execution

UPDATE tasks SET started_at = '2026-01-05 10:00:01' 
WHERE id = 42;

Monday, 10:00:01-10:00:15 AM: Executor sends newsletter Calls email service API, waits for response...

Monday, 10:00:15 AM: Task completes

UPDATE tasks SET ended_at = '2026-01-05 10:00:15' 
WHERE id = 42;
queue.acknowledge(msg)

Total time from scheduled to started: 1 second, crushing our 30-second SLA.

The Architecture's Elegant Properties

This design delivers on every requirement:

30-Second SLA: Look-ahead picking + queue buffering + predictive scaling ensures tasks start promptly

Minute-Level Granularity: Simplified model without second-level complexity

Fixed & Recurring Tasks: Single schema handles both with simple reset logic

Distributed Execution: Pickers and executors scale independently

High Throughput: Multiple pickers work concurrently without conflicts via SKIP LOCKED

Operational Simplicity: Relational database, standard message queue, no exotic components

The result: a production-grade task scheduler that handles millions of tasks per day while maintaining strict timing guarantees. Whether you're scheduling a single password reset email or orchestrating thousands of nightly data pipelines, the architecture scales gracefully from day one to day ten thousand.

Designing a Message Broker on Top of a Relational Database

The Surprising Choice: Why Build on SQL?

When you think "message broker," you probably think RabbitMQ, Kafka, or SQS, specialized systems built from the ground up for messaging. So why would anyone build a message broker on top of a relational database?

The answer lies in operational simplicity. Many organizations already run MySQL or PostgreSQL with robust backup, monitoring and operational expertise. Adding RabbitMQ means learning new tools, managing new infrastructure and dealing with new failure modes. Sometimes, especially at smaller scale, leveraging what you already have is the smart engineering decision.

The question becomes: can a relational database provide the guarantees a message broker needs? Let's find out.

Core Requirements

Our broker must satisfy four fundamental properties:

1. FIFO Ordering: Messages must be consumed in the order they were produced. If Producer sends Message A then Message B, consumers must process A before B.

2. Consume-Once Semantics: When a consumer reads a message, it should be deleted (or marked as consumed). No other consumer should see it.

3. High Throughput: The system must handle thousands of messages per second without becoming a bottleneck.

4. Exclusive Consumption: Once Consumer 1 picks a message, Consumer 2 cannot pick the same message. No duplicate processing.

These requirements mirror exactly what we built for the task scheduler's picker and that's no accident. Message consumption is fundamentally the same problem as task picking.

The Schema: Simple Yet Powerful

Table: messages
+------------------+------------------+
| id               | INT AUTO_INC PK  |
| msg              | TEXT             |
| created_at       | TIMESTAMP        |
| picked_at        | TIMESTAMP        |
| deleted_at       | TIMESTAMP        |
| receipt_handle   | VARCHAR(UUID)    |
+------------------+------------------+

Let's understand each field's purpose:

id: Auto-incrementing primary key. Provides natural FIFO ordering, lower IDs were inserted first.

msg: The actual message payload. Could be JSON, serialized protobuf, or plain text. The broker doesn't care about content.

created_at: When the producer inserted the message. Useful for monitoring and debugging.

picked_at: When a consumer claimed the message. NULL means available for consumption.

deleted_at: When the message was deleted after successful processing. NULL means not yet deleted.

receipt_handle: A unique UUID given to the consumer who picked the message. Used to ensure only that consumer can delete it, prevents one consumer from accidentally deleting another's message.

The Producer Path: Publishing Messages

Publishing is delightfully simple:

INSERT INTO messages (msg, created_at) 
VALUES ('{"user_id": 123, "action": "send_email"}', NOW());

That's it. No complex protocol, no broker-specific client libraries. Just a standard SQL INSERT.

Throughput Considerations:

For high-throughput scenarios, batch inserts dramatically improve performance:

INSERT INTO messages (msg, created_at) VALUES
  ('message_1', NOW()),
  ('message_2', NOW()),
  ('message_3', NOW()),
  ...
  ('message_100', NOW());

A single batch INSERT of 100 messages is 50x faster than 100 individual INSERTs. Most database drivers support this natively.

The Consumer Path: Reading Messages

This is where the magic happens. Consumers need to atomically claim messages without conflicts. The query looks familiar if you read the task scheduler section:

SELECT * FROM messages
WHERE picked_at IS NULL
  AND deleted_at IS NULL
ORDER BY created_at ASC
LIMIT n
FOR UPDATE SKIP LOCKED;

Let's break down each clause:

WHERE picked_at IS NULL

Only fetch messages that haven't been claimed yet. Once a consumer sets picked_at, the message becomes invisible to this query.

AND deleted_at IS NULL

Skip messages that have already been processed and deleted. This is our tombstone mechanism.

ORDER BY created_at ASC

FIFO guarantee. Process messages in the order they were created. We could also ORDER BY id ASC since IDs are auto-incrementing, which is slightly faster (no need to consult the timestamp).

LIMIT n

Batch size. Instead of fetching one message at a time (inefficient), grab 10-50 at once. Reduces database round trips.

FOR UPDATE SKIP LOCKED

The critical piece that makes concurrent consumption safe:

  • FOR UPDATE: Locks the selected rows for the duration of the transaction

  • SKIP LOCKED: If another consumer already locked some rows, skip them instead of waiting

This means Consumer 1 might grab messages 1-10, Consumer 2 simultaneously grabs 11-20 and Consumer 3 grabs 21-30. All without blocking each other.

Claiming Messages: The Atomic Pick

After selecting messages, we must mark them as picked in the same transaction:

BEGIN TRANSACTION;

-- Select messages
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY created_at ASC
LIMIT 10
FOR UPDATE SKIP LOCKED;

-- Mark as picked with unique receipt handle
UPDATE messages
SET picked_at = NOW(),
    receipt_handle = UUID()
WHERE id IN (101, 102, 103, 104, 105, 106, 107, 108, 109, 110);

COMMIT;

After the COMMIT, those 10 messages have:

  • picked_at set to current timestamp

  • receipt_handle containing a unique UUID like "550e8400-e29b-41d4-a716-446655440000"

The consumer receives both the messages and their receipt handles. This is crucial for the delete operation.

The Broker API: HTTP Endpoints

Our broker exposes a simple HTTP API:

GET /messages/q?n=10

Request:

GET /messages/q?n=10

Response:

{
  "messages": [
    {
      "id": 101,
      "msg": "{\"user_id\": 123, \"action\": \"send_email\"}",
      "receipt_handle": "550e8400-e29b-41d4-a716-446655440000"
    },
    {
      "id": 102,
      "msg": "{\"user_id\": 456, \"action\": \"process_payment\"}",
      "receipt_handle": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"
    }
  ]
}

The consumer gets up to 10 messages (whatever was available) along with unique receipt handles.

DELETE /message/q/uuid1

After successfully processing a message, the consumer deletes it:

Request:

DELETE /message/q/550e8400-e29b-41d4-a716-446655440000

Implementation:

UPDATE messages
SET deleted_at = NOW()
WHERE receipt_handle = '550e8400-e29b-41d4-a716-446655440000';

Notice we don't actually DELETE the row, we set deleted_at. This is a soft delete pattern.

Why Soft Delete?

Hard deletes (actual DELETE statements) cause problems:

  • Fragmentation: Constantly deleting and inserting creates holes in the table

  • Replication lag: Deletes must replicate to all database replicas

  • Audit trail: With soft deletes, you can see message history

The tradeoff is eventual cleanup (discussed below).

PUT /message/q

Producer publishes a new message:

Request:

PUT /message/q
Content-Type: application/json

{
  "msg": "{\"user_id\": 789, \"action\": \"generate_report\"}"
}

Implementation:

INSERT INTO messages (msg, created_at)
VALUES ('{\"user_id\": 789, \"action\": \"generate_report\"}', NOW());

Simple, synchronous publish. For higher throughput, batch multiple messages.

Ensuring Exclusive Consumption

The receipt handle is the key to preventing one consumer from deleting another's messages. Consider this scenario:

10:00:00 - Consumer 1 picks message 101, receipt_handle = "uuid-A"
10:00:01 - Consumer 2 picks message 102, receipt_handle = "uuid-B"
10:00:02 - Consumer 1 tries to delete message 102 (by mistake)

Consumer 1's delete request:

UPDATE messages
SET deleted_at = NOW()
WHERE receipt_handle = 'uuid-A' AND id = 102;

This UPDATE affects 0 rows because message 102 has receipt_handle = 'uuid-B', not 'uuid-A'. The delete fails silently (or we return an error), protecting Consumer 2's work.

This pattern prevents accidentally:

  • Deleting messages you didn't pick

  • Deleting messages another consumer is processing

  • Race conditions where multiple consumers claim the same message

Visibility Timeout: Handling Crashed Consumers

What happens if a consumer crashes after picking messages but before deleting them?

10:00:00 - Consumer 1 picks messages 101-110
10:00:05 - Consumer 1 crashes (server dies, network partition, etc.)

Those 10 messages are now stuck: picked_at is set, so no other consumer will claim them, but they're not deleted because Consumer 1 never finished.

Solution: Visibility Timeout

After a message is picked, it should become visible again if not deleted within a timeout period (typically 10 minutes):

CRON job (runs every minute):

UPDATE messages
SET picked_at = NULL
WHERE picked_at < (NOW() - INTERVAL 10 MINUTE)
  AND deleted_at IS NULL;

This finds messages that were picked more than 10 minutes ago but not deleted and resets them. They become available for other consumers to pick.

The CRON job essentially says: "If you picked a message but didn't delete it in 10 minutes, we assume you crashed and make it available again."

Important: This means messages might be delivered twice if a consumer is slow but not crashed. This is acceptable in most systems, we trade perfect once-delivery for resilience.

Scaling Throughput: Multiple Consumers

The beauty of FOR UPDATE SKIP LOCKED is that it makes horizontal scaling trivial:

Consumer 1 (polls every 4 seconds):
  10:00:00 - Picks messages 1-10
  10:00:04 - Picks messages 41-50
  10:00:08 - Picks messages 81-90

Consumer 2 (polls every 4 seconds):
  10:00:00 - Picks messages 11-20
  10:00:04 - Picks messages 51-60
  10:00:08 - Picks messages 91-100

Consumer 3 (polls every 4 seconds):
  10:00:00 - Picks messages 21-30
  10:00:04 - Picks messages 61-70
  10:00:08 - Picks messages 101-110

Each consumer independently queries the database. The locking mechanism ensures they never pick the same messages. No coordination required, no leader election, no distributed consensus.

Throughput Calculation:

  • Each consumer processes 10 messages per batch

  • Processing takes ~1 second per message

  • Total batch time: ~10 seconds

  • Consumer polls every 4 seconds, processes for ~10 seconds, then polls again

  • Each consumer: ~60 messages/minute

  • With 10 consumers: ~600 messages/minute = 10 messages/second

To increase throughput, simply add more consumers. The database becomes the bottleneck only at very high scale (10,000+ messages/sec), at which point you'd migrate to a specialized broker.

The Broker API Layer

The broker sits between consumers and the database, providing a clean abstraction:

┌───────────┐
│Consumer 1 │───┐
└───────────┘   │
                ├──→ ┌────────────┐      ┌────────┐
┌───────────┐   │    │ Broker API │─────→│ MySQL  │
│Consumer 2 │───┤    └────────────┘      └────────┘
└───────────┘   │
                │
┌───────────┐   │
│Consumer 3 │───┘
└───────────┘

The Broker API exposes:

  • GET /messages/q?n=10 → Consumers pull messages

  • DEL /message/q/uuid1 → Consumers delete after processing

  • PUT /message/q → Producers publish messages

Why not let consumers query directly?

The broker layer provides:

  • Authentication: Only authorized clients can publish/consume

  • Rate limiting: Prevent abusive consumers from overwhelming the database

  • Metrics: Track messages published, consumed, latency

  • Abstraction: Swap MySQL for PostgreSQL without changing consumer code

Garbage Collection: Cleaning Up Old Messages

Soft deletes mean rows accumulate forever. A table with 1 billion messages (even deleted ones) becomes slow to query and wastes disk space.

Solution: Background Cleanup

-- Runs daily via CRON
DELETE FROM messages
WHERE deleted_at < (NOW() - INTERVAL 7 DAYS)
LIMIT 10000;

This hard-deletes messages that were soft-deleted more than 7 days ago. We keep a week of history for debugging, then purge.

Why LIMIT 10000?

Deleting millions of rows in one transaction locks the table and causes replication lag. By limiting to 10,000 rows, we delete in small batches:

  • Each batch takes <1 second

  • Doesn't block ongoing message consumption

  • Runs repeatedly until all old messages are purged

You can also use partitioning strategies where older data lives in separate partitions that can be dropped entirely (instant cleanup).

Handling FIFO Strictly

Our ORDER BY created_at ASC provides approximate FIFO, but what if two messages have the same timestamp (down to the second)?

For strict ordering, use ORDER BY id ASC instead:

SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC  -- Guaranteed order since IDs are sequential
LIMIT 10
FOR UPDATE SKIP LOCKED;

Since id is auto-incrementing, lower IDs were inserted first. This gives perfect FIFO ordering.

Trade-off: Ordering by id (integer) is faster than created_at (timestamp) anyway, so this is strictly better for our use case.

Monitoring and Observability

Key metrics to track:

1. Messages Published Per Minute

SELECT COUNT(*) FROM messages
WHERE created_at >= (NOW() - INTERVAL 1 MINUTE);

2. Messages Pending (Not Picked)

SELECT COUNT(*) FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL;

If this number grows unbounded, consumers can't keep up. Scale up.

3. Messages Stuck (Picked But Not Deleted)

SELECT COUNT(*) FROM messages
WHERE picked_at IS NOT NULL
  AND deleted_at IS NULL
  AND picked_at < (NOW() - INTERVAL 5 MINUTE);

High numbers indicate consumer crashes or slow processing.

4. Average Time to Deletion

SELECT AVG(TIMESTAMPDIFF(SECOND, picked_at, deleted_at)) AS avg_seconds
FROM messages
WHERE deleted_at >= (NOW() - INTERVAL 1 HOUR);

Shows how long consumers take to process messages. Spikes indicate performance issues.

When This Design Works (and When It Doesn't)

This design excels when:

  • You already run MySQL/PostgreSQL and want to avoid new infrastructure

  • Throughput is moderate (< 10,000 messages/sec)

  • You need strong consistency guarantees (ACID transactions)

  • Operational simplicity is a priority

  • You want built-in durability (database backups cover messages too)

You should use a specialized broker when:

  • You need 100,000+ messages/sec throughput

  • You need advanced routing (pub/sub, topic exchanges, fanout)

  • You want message replays (Kafka-style log retention)

  • You need multi-datacenter replication with low latency

  • Database becomes a bottleneck for your other workloads

Real-World Usage: AWS SQS Comparison

Interestingly, this design closely mimics how AWS SQS works internally (though they don't publicly confirm implementation details):

FeatureOur SQL BrokerAWS SQS
Visibility timeoutCron resets picked_atBuilt-in
Receipt handleUUID columnOpaque token
Soft deletesdeleted_at columnInternal tombstones
FIFOORDER BY id ASCFIFO queues
Throughput1K-10K msg/sec3K-10K msg/sec (standard)

SQS is more polished and scalable, but the core concepts are identical. If you understand this SQL implementation, you understand how SQS works under the hood.

Complete Example: End-to-End Flow

Producer publishes 3 messages:

INSERT INTO messages (msg, created_at) VALUES
  ('Message A', '2026-01-05 10:00:00'),
  ('Message B', '2026-01-05 10:00:01'),
  ('Message C', '2026-01-05 10:00:02');

Consumer 1 pulls batch:

BEGIN;
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC LIMIT 2
FOR UPDATE SKIP LOCKED;
-- Returns: Message A (id=1), Message B (id=2)

UPDATE messages
SET picked_at = NOW(), receipt_handle = UUID()
WHERE id IN (1, 2);
COMMIT;
-- Consumer receives: [{id:1, msg:"A", handle:"uuid-1"}, {id:2, msg:"B", handle:"uuid-2"}]

Consumer 2 pulls batch (simultaneously):

BEGIN;
SELECT * FROM messages
WHERE picked_at IS NULL AND deleted_at IS NULL
ORDER BY id ASC LIMIT 2
FOR UPDATE SKIP LOCKED;
-- Returns: Message C (id=3) [Messages 1-2 are locked by Consumer 1, skipped]

UPDATE messages
SET picked_at = NOW(), receipt_handle = UUID()
WHERE id = 3;
COMMIT;
-- Consumer receives: [{id:3, msg:"C", handle:"uuid-3"}]

Consumer 1 finishes processing, deletes:

UPDATE messages SET deleted_at = NOW() 
WHERE receipt_handle = 'uuid-1';

UPDATE messages SET deleted_at = NOW() 
WHERE receipt_handle = 'uuid-2';

Consumer 2 crashes before deleting Message C:

(Consumer 2 dies, message 3 stuck with picked_at set)

10 minutes later, visibility timeout CRON runs:

UPDATE messages
SET picked_at = NULL
WHERE picked_at < (NOW() - INTERVAL 10 MINUTE)
  AND deleted_at IS NULL;
-- Message C becomes available again

Consumer 3 picks up Message C:

-- Same SELECT/UPDATE flow as before
-- Message C gets new receipt handle "uuid-4"
-- Consumer 3 processes and deletes successfully

The Elegant Simplicity

Building a message broker on SQL demonstrates a powerful principle: you don't always need specialized infrastructure. With the right patterns, atomic transactions, row-level locking, smart indexing, a relational database can provide message broker semantics.

The result is a system that's:

  • Simple to understand (just SQL queries)

  • Easy to operate (leverage existing database skills)

  • Reliable (ACID transactions prevent data loss)

  • Scalable (to thousands of messages per second)

For many applications, that's more than enough. And when you outgrow it, the concepts you learned, visibility timeouts, receipt handles, FIFO ordering, idempotent deletes, transfer directly to systems like RabbitMQ, SQS, or Kafka.

Sometimes the best solution isn't the most sophisticated technology, it's the one that solves your problem with the tools you already have.

Designing YouTube's Views Counter: Counting at Planetary Scale

The Deceptive Simplicity of "Just Count"

Counting views on YouTube sounds trivial. Every time someone watches a video, increment a counter. Done, right?

Except YouTube serves 1 billion hours of video every single day. That's roughly 100,000+ video views starting every second. Counting at this scale while filtering out bots, detecting fraud and maintaining accuracy across the globe is one of the most challenging distributed systems problems in existence.

This isn't just an engineering exercise, view counts directly impact creator revenue, recommendation algorithms and trending calculations. Getting it wrong means creators lose money and fraudulent videos go viral.

Core Requirements

1. Count Every Legitimate View When a user watches a video, that view should be reflected in the counter, eventually. We need high accuracy without losing data.

2. Filter Out Invalid Views Not every play counts as a view:

  • Bot traffic (automated scraping)

  • Repeated refreshes within seconds

  • Views from the same IP in suspicious patterns

  • Views shorter than some threshold (e.g., < 10 seconds)

  • Self-views from the video creator

These filtering rules live in a Rules Engine and evolve constantly as new fraud patterns emerge.

3. Near Real-Time Updates While we can tolerate some delay (eventual consistency), view counts shouldn't lag hours behind reality. Users expect to see counts update within minutes.

4. Scale to Billions of Events 100,000 views/second sustained, with spikes during viral moments (a popular livestream might generate 1 million concurrent views).

The Naive Approach (and Why It Fails)

The simplest design: every video play sends a request directly to a database to increment a counter.

UPDATE videos
SET view_count = view_count + 1
WHERE video_id = 'dQw4w9WgXcQ';

Why this doesn't work:

1. Database Hotspots Popular videos (viral hits, live events) would receive millions of concurrent writes to the same row. This creates contention, all writes wait in line for the lock on that row.

Databases can handle ~10,000 writes/sec to different rows. But 1 million writes/sec to the same row? The database grinds to a halt.

2. No Filtering Logic We're counting every view without applying fraud detection rules. Bots would inflate counts instantly.

3. Cross-Region Latency Users are globally distributed. Forcing every view to synchronously update a central database adds 100-300ms latency to video playback. Unacceptable for user experience.

We need a fundamentally different architecture.

The Architecture: Stream Processing at Scale

The breakthrough insight: counting is an aggregation problem, not a transactional problem. We don't need to know the exact count at every millisecond, we need to eventually arrive at the correct count across billions of events.

This leads us to a streaming architecture:

┌─────────┐
│ User 1  │──┐
└─────────┘  │
             ├──→ ┌────────────┐      ┌─────────┐
┌─────────┐  │    │ Watchtime  │─────→│  Kafka  │
│ User 2  │──┤    │   Service  │      │(Partitioned)
└─────────┘  │    └────────────┘      └─────────┘
             │                             │
┌─────────┐  │                             │
│ User 3  │──┘                             ↓
└─────────┘                        ┌────────────────┐
                                   │ Event Filters  │
                                   │  (Consumers)   │
                                   └────────────────┘
                                           │
                                           ↓
                                   ┌────────────────┐
                                   │   Counters     │
                                   │(Batch & Count) │
                                   └────────────────┘
                                           │
                                           ↓
                                   ┌────────────────┐
                                   │   Count DB     │
                                   │ (Aggregated)   │
                                   └────────────────┘

Let's trace through this step by step.

Step 1: Capturing Watch Events

When a user watches a video, the client (web browser, mobile app) sends a "heartbeat" to the Watchtime Service:

POST /watchtime
{
  "video_id": "dQw4w9WgXcQ",
  "user_id": "user_12345",
  "session_id": "sess_abc",
  "timestamp": "2026-01-05T10:23:45Z",
  "watch_duration": 15,  // seconds watched so far
  "device": "mobile",
  "ip": "203.0.113.42"
}

These heartbeats happen every 10-30 seconds during playback. If a user watches a 3-minute video, they might send 6-10 heartbeat events.

Why not wait until the video ends?

Users close tabs, their connections drop, they navigate away. If we only count "complete views," we'd miss 70%+ of actual viewing. Heartbeats give us fine-grained data on actual watch time.

Step 2: Publishing to Kafka

The Watchtime Service immediately publishes each event to Kafka:

Kafka Topic: "watch_events"
Partitions: 100 (to parallelize processing)

Event stream:
A, A, A, B, C, A, D, B, A, B, A, A, D, C, B, A, ...

Each letter represents a different video. Notice "A" appears frequently, it's a viral video getting massive views.

Why Kafka? Why not process directly?

1. Decoupling The Watchtime Service's job is to accept events quickly (low latency for users). Processing and filtering can happen asynchronously. Kafka acts as a buffer.

2. Durability Kafka persists events to disk. If a downstream consumer crashes, events aren't lost, they're replayed.

3. Parallelism Kafka's partitioning lets us run 100 parallel consumers, each processing a subset of events. This is crucial for handling 100,000 events/second.

4. Replayability If we discover a bug in our counting logic, we can reprocess historical events from Kafka to fix counts.

Kafka Partitioning Strategy

Kafka uses partitioning to distribute events across multiple consumers. The partition key determines which partition an event goes to:

Partition = hash(video_id) % num_partitions

Example with 4 partitions:
- Video A: hash("A") % 4 = 0 → Partition 0
- Video B: hash("B") % 4 = 1 → Partition 1
- Video C: hash("C") % 4 = 2 → Partition 2
- Video D: hash("D") % 4 = 3 → Partition 3

Critical insight: All events for the same video go to the same partition.

This means:

  • Partition 0 gets ALL events for video A

  • Partition 1 gets ALL events for video B

  • And so on...

Why partition by video_id?

Because counting requires aggregation. To count views for video A, we need all video A events processed by the same consumer. Partitioning by video_id ensures this locality.

If we partitioned randomly, video A events would scatter across all partitions and we'd need complex coordination to aggregate counts (expensive and error-prone).

Step 3: Event Filtering (Rules Engine)

Each Kafka partition is consumed by a Event Filter service. These consumers apply fraud detection rules:

Event Filter (Consumer Group, 100 instances):

For each event from Kafka:
  1. Check Rules DB: Does this event pass validation?
     - Is user_id legitimate? (not a bot)
     - Has this user_id watched this video recently? (prevent spam refreshes)
     - Did they watch for minimum duration? (e.g., >= 10 seconds)
     - Is IP address flagged? (known bot farm)

  2. If PASS: Forward to "valid_events" Kafka topic
  3. If FAIL: Drop event (log for analysis)

Example Rules:

Rule 1: Same user watching same video within 5 minutes → Only count first view
Rule 2: Watch duration < 10 seconds → Don't count
Rule 3: More than 10 views from same IP in 1 minute → Flag as bot
Rule 4: Creator watching their own video → Don't count

The Rules DB is a separate database containing these rules, which can be updated without redeploying code. This is crucial because fraud patterns evolve daily.

Why re-ingest into Kafka after filtering?

Because filtering is just one stage. We want to decouple filtering from counting. If counting logic changes, we don't need to re-filter, we just reprocess from the "valid_events" topic.

This pattern is called staged stream processing: each stage reads from one topic, transforms data and writes to the next topic.

Step 4: Counting in Parallel (The Counter Service)

Now we have a stream of valid events in Kafka. The final stage is counting:

Counter Service (Consumer Group, 100 instances):

For each partition:
  1. Batch events by video_id over a time window (e.g., 10 seconds)
  2. Aggregate: video_id → count
     Example batch:
     {
       "video_A": 1523 events,
       "video_B": 847 events,
       "video_C": 3291 events
     }
  3. Write aggregated counts to Count DB

Why batch instead of individual writes?

Writing every single event to the database means 100,000 writes/second. Batching reduces this to ~1,000 writes/second (100 partitions × 10 batches/second each).

Each batch aggregates potentially thousands of events for the same video:

Before batching:
- video_A viewed (write to DB)
- video_A viewed (write to DB)
- video_A viewed (write to DB)
... 1,523 database writes

After batching:
- video_A: +1523 views (single write to DB)

This reduces database load by 1000x while maintaining the same accuracy.

Parallelism: Why Count on Many Machines?

With 100 Kafka partitions and 100 Counter Service instances, we achieve massive parallelism:

Partition 0 → Counter Instance 0 → Processes videos {A, E, I, M, ...}
Partition 1 → Counter Instance 1 → Processes videos {B, F, J, N, ...}
Partition 2 → Counter Instance 2 → Processes videos {C, G, K, O, ...}
...
Partition 99 → Counter Instance 99 → Processes videos {D, H, L, P, ...}

Each instance independently:

  • Reads events from its assigned partition

  • Batches events per video_id

  • Writes aggregated counts to the database

No coordination needed. Because we partitioned by video_id, each video's events are processed by exactly one instance. No conflicts, no distributed locks, no race conditions.

Parallelism breakdown:

  • 100,000 events/sec ÷ 100 partitions = 1,000 events/sec per partition

  • Each Counter instance handles 1,000 events/sec (easily manageable)

  • Processing time per event: ~1ms (filtering + batching)

This architecture scales horizontally: need more throughput? Add more partitions and instances.

The Count Database: Aggregation Storage

The Count DB stores aggregated view counts per video:

Table: video_counts

video_id          | view_count
------------------|-----------
dQw4w9WgXcQ       | 1,234,567,890
jNQXAC9IVRw      | 543,210,987
...

Counter services issue simple UPDATE statements:

UPDATE video_counts
SET view_count = view_count + 1523
WHERE video_id = 'dQw4w9WgXcQ';

Because we batched 1,523 events into a single update, database writes are manageable even for viral videos.

Avoiding huge writes:

A live event with 1 million concurrent viewers would generate massive count updates if we batched for too long. Solution: adaptive batching:

If batch size for video_id > 10,000 events:
  - Flush batch immediately (don't wait for time window)
  - Write to database
  - Start new batch

This prevents any single batch from accumulating huge counts, keeping write sizes reasonable.

Why Re-ingest into Kafka? (Deep Dive)

The architecture has two Kafka stages: raw events → Kafka → filtered events → Kafka → counting. Why the middle Kafka topic?

1. Fault Tolerance

If the Counter service crashes, filtered events are safely stored in Kafka. When it recovers, it resumes from the last committed offset. No data loss.

2. Reprocessing

Imagine we discover a bug: we undercounted views for videos watched on mobile devices. To fix:

  • Deploy corrected Counter code

  • Reset Kafka offset to 24 hours ago

  • Reprocess filtered events

  • Corrected counts populate the database

Without Kafka persistence, we'd lose historical events and couldn't fix the mistake.

3. Multiple Consumers

Other systems might need filtered events:

  • Analytics pipeline: Tracks watch patterns for recommendations

  • Revenue calculator: Determines ad revenue based on valid views

  • Trending algorithm: Identifies videos gaining views rapidly

All these consumers read from the same "valid_events" topic. Kafka naturally supports multiple consumer groups without additional load on the upstream filter.

4. Backpressure Handling

If the Count DB is slow or temporarily unavailable, filtered events accumulate in Kafka rather than backing up into the Watchtime Service. This isolates failures, users aren't impacted by downstream problems.

Handling Hot Videos: Partition Skew

Partitioning by video_id creates a problem: viral videos cause partition skew.

Normal day:
Partition 0: 1,000 events/sec (various videos)
Partition 1: 1,000 events/sec
...

Viral video "A" goes live:
Partition 0 (contains video A): 50,000 events/sec
Partition 1: 1,000 events/sec
...

Partition 0's consumer is overwhelmed while others sit idle. The system's overall capacity is 100,000 events/sec, but one partition is the bottleneck.

Solution: Sub-partitioning by session_id

Instead of partitioning purely by video_id, we use a composite key:

Partition = hash(video_id + session_id) % num_partitions

For video A with 50,000 concurrent viewers:

  • 50,000 different session_ids

  • Events distributed across all 100 partitions

  • Each partition handles 500 events/sec for video A

Then during counting, we aggregate across partitions for the same video_id. This adds complexity (need coordination) but prevents hotspots.

Trade-off: Simple partitioning (video_id only) is easier to implement but suffers from skew. Composite partitioning (video_id + session_id) distributes load evenly but requires multi-partition aggregation.

YouTube likely uses the composite approach for resilience during viral events.

The Rules Database: Dynamic Fraud Detection

The Rules DB stores filter logic that evolves without code deploys:

Table: filter_rules

rule_id | rule_type          | parameters
--------|--------------------|-----------
1       | min_watch_duration | {"seconds": 10}
2       | repeat_view_window | {"minutes": 5}
3       | ip_rate_limit      | {"views_per_minute": 10}
4       | bot_ip_blocklist   | {"ips": ["1.2.3.4", ...]}

Event Filters query this database periodically (cached locally to avoid database load):

Every 60 seconds:
  rules = fetch_rules_from_db()
  cache_locally(rules)

For each event:
  for rule in cached_rules:
    if not rule.apply(event):
      drop_event()
      break

Why not hardcode rules?

Fraud patterns change daily. New bot networks appear, old ones get shut down. Hardcoding means deploying new code every time rules change, slow and risky.

With a Rules DB, analysts can add/modify rules via a UI and changes propagate to all filters within 60 seconds. No deployment required.

Putting It All Together: Complete Flow

User watches "Never Gonna Give You Up" for 30 seconds:

1. 10:00:00 - Client sends heartbeat #1
   POST /watchtime {"video_id": "dQw4w9WgXcQ", "duration": 10s, ...}

2. 10:00:00 - Watchtime Service publishes to Kafka
   Topic: "watch_events", Partition: 42 (hash of video_id)

3. 10:00:01 - Event Filter (Consumer 42) reads event
   - Checks Rules DB: min_watch_duration = 10s ✓
   - Not a repeat view ✓
   - IP not rate-limited ✓
   - Passes all filters
   - Publishes to "valid_events" topic

4. 10:00:10 - Client sends heartbeat #2 (20s watched)
5. 10:00:20 - Client sends heartbeat #3 (30s watched)

6. 10:00:10 - Counter Service (Consumer 42) batches events
   - Batch window closes (10 seconds elapsed)
   - Aggregates: {"dQw4w9WgXcQ": +3 views}  (3 heartbeats)

7. 10:00:10 - Counter writes to Count DB
   UPDATE video_counts SET view_count = view_count + 3
   WHERE video_id = 'dQw4w9WgXcQ'

8. 10:00:11 - View count incremented: 1,234,567,890 → 1,234,567,893

From the user's watch action to the database update: 10 seconds. This near-real-time pipeline processes billions of events daily with high accuracy.

Monitoring and Observability

Key metrics for this system:

1. Event Throughput

  • Events published to Kafka per second

  • Events processed by Filters per second

  • Events written to Count DB per second

2. Partition Lag How far behind is each consumer from the latest events in its partition? High lag indicates processing can't keep up.

3. Filter Drop Rate Percentage of events dropped by Rules Engine. Sudden spikes might indicate a bot attack or a bug in filtering logic.

4. Database Write Latency Time to execute COUNT updates. Spikes indicate database contention or need for scaling.

5. Count Accuracy Periodically compare counts from the streaming pipeline against a batch recount job. Discrepancies indicate bugs.

The Elegance of Stream Processing

This architecture demonstrates core principles of distributed stream processing:

Decoupling: Each stage (ingestion, filtering, counting) operates independently. Failures are isolated.

Parallelism: Kafka partitioning enables massive horizontal scaling without coordination overhead.

Batching: Aggregating events before writing to the database reduces load by 1000x.

Idempotency: Reprocessing events produces the same counts. Enables fault tolerance and bug fixes.

Eventual Consistency: View counts update within seconds, not instantly. Users accept slight delays for a system that scales.

The result: YouTube counts billions of views per day, filters sophisticated fraud and keeps creators' revenue accurate, all while serving those 100,000 views per second that never stop flowing.

Designing Flash Sales: The High-Stakes Race for Limited Inventory

The Chaos of Scarcity

It's 12:00 PM. Your company drops 1,000 iPhone 15 units at 50% off. Within seconds, 100,000 eager buyers hit "Add to Cart" simultaneously. Only 1,000 will succeed. The rest will see the dreaded "Out of Stock" message.

This is a flash sale, a short time window where a fixed set of items is sold to whoever can claim them first. It's e-commerce at its most brutal: high traffic, limited inventory, zero tolerance for overselling. Get it wrong and you either disappoint customers or sell items you don't have (leading to refunds, angry emails and chargebacks).

The technical challenge isn't just handling traffic, it's ensuring exactly N items sell, no more, no less, even when 100,000 people try to buy simultaneously.

Real-World Examples

Flash sales power some of the internet's biggest traffic spikes:

  • Ticket booking sites (BookMyShow, Ticketmaster): Concert tickets for a popular artist sell out in minutes

  • IRCTC (Indian Railways): Train tickets during festival season face millions of concurrent bookings

  • Hotel booking: Limited discounted rooms during promotional periods

  • Shopify flash sales: Fashion brands drop limited edition items

  • YouTube live streams: Limited merchandise tied to live events

The pattern is universal: fixed inventory + time pressure + massive concurrent demand = need for bulletproof concurrency control.

Core Requirements

1. Fixed Inventory You have exactly 1,000 iPhones. No more magically appear. No less should be reserved.

2. Short Time Window Sale starts at 12:00 PM, ends at 12:15 PM (or when inventory depletes). High-intensity traffic concentrated in minutes.

3. High Throughput 100,000 users hitting your servers simultaneously. System must handle the load without crashing.

4. No Overselling The cardinal sin. If 1,000 items exist, exactly 1,000 orders should succeed. Selling 1,001 means you promised something you can't deliver.

5. Fair Distribution Users who click first (within milliseconds) should have first claim. No backdoor reservations, no favoritism.

6. Handle Payment Failures A user might add an item to cart, but payment fails. That item should become available again for others.

The Mental Model: The Real-World Store

Imagine a physical store with 1,000 iPhones on shelves:

Phase 0: Prepare the Stock
- Store owner stocks shelves with 1,000 iPhones
- Each phone is a physical unit on the shelf

Phase 1: Let Them In
- Doors open at 12:00 PM
- Customers rush in
- Each person physically grabs a phone from the shelf
- Once grabbed, no one else can take that specific unit

Phase 2: Payment
- Customer takes phone to checkout counter
- Payment succeeds → phone is theirs, they leave
- Payment fails → phone goes back on shelf for someone else

Our database design mimics this physical reality. The units table represents physical inventory units. Picking an item is like grabbing it from the shelf. Payment confirms ownership.

The Data Model: Units Table

Instead of tracking "quantity available," we model individual units:

Table: units

id  | item_id | picked_at  | picked_by | purchased_by
----|---------|------------|-----------|-------------
1   | 720     | NULL       | NULL      | NULL
2   | 720     | NULL       | NULL      | NULL
3   | 720     | NULL       | NULL      | NULL
...
1000| 720     | NULL       | NULL      | NULL

If you're selling 1,000 iPhones (item_id = 720), you have 1,000 rows in this table. Each row is a specific, sellable unit.

Why one row per unit instead of a counter?

Consider the alternative:

-- BAD: Using a counter
UPDATE items
SET quantity_available = quantity_available - 1
WHERE item_id = 720 AND quantity_available > 0;

This seems simpler, but it has fatal problems:

Problem 1: Race Conditions

Time    | User A                              | User B
--------|-------------------------------------|-------------------------------------
10:00:00| SELECT quantity (reads: 1)          |
10:00:01|                                     | SELECT quantity (reads: 1)
10:00:02| UPDATE quantity = 0 (success)       |
10:00:03|                                     | UPDATE quantity = -1 (OVERSOLD!)

Even with WHERE quantity_available > 0, concurrent UPDATEs can pass the check simultaneously if not properly locked.

Problem 2: Contention

All 100,000 concurrent requests try to UPDATE the same row. The database serializes these updates (they wait in line for the row lock), creating massive contention. Throughput collapses to a few hundred requests per second.

The Units Model Solves Both:

With individual unit rows, locking happens at the unit level, not the item level. User A locks unit #1, User B locks unit #2, User C locks unit #3, all simultaneously, no waiting. This is the key to high throughput.

Phase 0: Preparing the Stock

Before the sale starts, the store owner populates the units table:

-- Selling 10,000 iPhones (item_id = 720)
INSERT INTO units (item_id, picked_at, picked_by, purchased_by)
SELECT 720, NULL, NULL, NULL
FROM generate_series(1, 10000);

This creates 10,000 rows, each representing a physical iPhone unit. All start with NULL values, unpicked, unclaimed, available.

At 11:59 AM (one minute before sale):

Database state:
- 10,000 rows for item_id = 720
- All: picked_at = NULL, picked_by = NULL, purchased_by = NULL
- Ready for battle

Phase 1: Let Them In (The Pick Operation)

At 12:00:00 PM, doors open. Users flood in, clicking "Add to Cart."

User 1023 clicks "Add to Cart":

BEGIN TRANSACTION;

-- Find one available unit
SELECT * FROM units
WHERE item_id = 720
  AND picked_at IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Returns: id = 1, item_id = 720, picked_at = NULL, ...

-- Claim it atomically
UPDATE units
SET picked_at = NOW(),
    picked_by = 1023
WHERE id = 1;

COMMIT;

Breaking Down the Query:

WHERE picked_at IS NULL

Only fetch units that haven't been picked yet. Once picked_at is set, the unit is "in someone's cart" and unavailable.

ORDER BY id

Process units in a predictable order. Without this, you might leave gaps (units 1, 500, 900 picked while 2-499 sit idle). Ordering ensures sequential consumption.

LIMIT 1

Grab only one unit at a time. If a user wants 5 iPhones, run this query 5 times (or use LIMIT 5 but still process one transaction).

FOR UPDATE SKIP LOCKED

The magic that enables concurrency:

  • FOR UPDATE: Locks the selected row(s) for the duration of the transaction

  • SKIP LOCKED: If a row is already locked by another transaction, skip it and try the next row

What SKIP LOCKED does:

12:00:00.000 - User A's query: Tries to lock unit #1 → SUCCESS, locks unit #1
12:00:00.001 - User B's query: Tries to lock unit #1 → LOCKED, skips to unit #2 → SUCCESS
12:00:00.002 - User C's query: Tries unit #1 → LOCKED, tries unit #2 → LOCKED, tries unit #3 → SUCCESS
12:00:00.003 - User D's query: Tries units #1, #2, #3 → ALL LOCKED, tries unit #4 → SUCCESS

Each user gets a different unit without waiting. This non-blocking behavior is crucial for high throughput.

Without SKIP LOCKED:

12:00:00.000 - User A locks unit #1
12:00:00.001 - User B tries unit #1 → BLOCKS, waits for User A
12:00:00.002 - User C tries unit #1 → BLOCKS, waits for User B (who's waiting for User A)
12:00:00.003 - User D tries unit #1 → BLOCKS, waits for User C...

All users serialize on the first unit. Throughput drops to single-digit transactions per second. Flash sale becomes flash crash.

After the UPDATE:

Database state (unit #1):
id: 1
item_id: 720
picked_at: 2026-01-05 12:00:00
picked_by: 1023
purchased_by: NULL

Unit #1 is now "in User 1023's cart." No other user can pick it because picked_at IS NOT NULL.

Critical Insight: One Transaction for Pick + Update

Notice the BEGIN TRANSACTION and COMMIT wrapping both the SELECT and UPDATE. This is essential for correctness:

-- WRONG: Two separate transactions
SELECT * FROM units WHERE ... FOR UPDATE SKIP LOCKED;
-- (Transaction commits here)
UPDATE units SET picked_at = NOW() WHERE id = 1;
-- (Another transaction)

If the SELECT and UPDATE are separate transactions, another user could swoop in between them:

User A: SELECT → gets unit #1 (transaction ends)
User B: SELECT → gets unit #1 (before User A updates!) 
User A: UPDATE → claims unit #1
User B: UPDATE → also claims unit #1 (DUPLICATE CLAIM!)

The correct pattern:

BEGIN TRANSACTION;
  SELECT ... FOR UPDATE SKIP LOCKED;  -- Locks the row
  UPDATE ...;                         -- Updates while still locked
COMMIT;                              -- Releases lock

The lock acquired by FOR UPDATE persists until COMMIT. No one can interfere during that window.

Phase 2: Payment Processing

User 1023 has unit #1 in their cart. Now they proceed to checkout.

Payment Gateway Interaction:

1. User clicks "Pay Now"
2. Frontend calls: POST /api/payment
   {
     "user_id": 1023,
     "unit_id": 1,
     "amount": 599.99
   }

3. Backend initiates payment with Stripe/PayPal/Razorpay:
   POST https://api.stripe.com/v1/charges
   {
     "amount": 59999,  // cents
     "currency": "usd",
     "source": "tok_visa",
     "metadata": {"unit_id": 1, "user_id": 1023}
   }

4. Payment gateway processes (takes 2-5 seconds)

5. Payment gateway sends webhook to your server:
   POST https://yoursite.com/webhook/payment
   {
     "event": "payment_intent.succeeded",
     "unit_id": 1,
     "user_id": 1023
   }

On Successful Payment (Webhook Handler):

UPDATE units
SET purchased_by = 1023,
    purchased_at = NOW()
WHERE id = 1;

-- Also create order record, send confirmation email, etc.

Database state (unit #1) after success:

id: 1
item_id: 720
picked_at: 2026-01-05 12:00:00
picked_by: 1023
purchased_by: 1023
purchased_at: 2026-01-05 12:00:15

Unit #1 is now sold. The user owns it. The transaction is complete.

Handling Payment Failures

But what if payment fails? Card declined, insufficient funds, payment timeout?

On Unsuccessful Payment (Webhook Handler):

UPDATE units
SET picked_at = NULL,
    picked_by = NULL
WHERE id = 1;

Database state (unit #1) after failure:

id: 1
item_id: 720
picked_at: NULL      ← Back to available
picked_by: NULL      ← No longer claimed
purchased_by: NULL

Unit #1 is now back on the shelf. The next user's SELECT query will find it and can pick it.

This is how inventory recirculates. If 100 people add to cart but only 50 complete payment, those 50 failed units become available for the next wave of buyers.

Critical Constraint: No Distributed Transaction

Notice a subtle but important detail:

You CANNOT wrap "add to cart" and "payment" in a single database transaction.

Why not? Because payment involves an external service (Stripe, PayPal) and takes seconds. Holding a database transaction open for seconds creates massive problems:

-- WRONG: Never do this
BEGIN TRANSACTION;
  UPDATE units SET picked_at = NOW() WHERE id = 1;  -- Locks row
  -- Call payment API (takes 3-5 seconds)
  -- ...waiting...
  -- Still locked, other users can't access this row
  UPDATE units SET purchased_by = 1023 WHERE id = 1;
COMMIT;

During those 3-5 seconds, the row is locked. If 10,000 users are paying simultaneously, you have 10,000 long-running transactions, each holding locks. The database chokes.

The correct pattern:

-- Transaction 1: Add to cart (fast, <100ms)
BEGIN;
  SELECT ... FOR UPDATE SKIP LOCKED;
  UPDATE units SET picked_at = NOW();
COMMIT;

-- External API call: Payment (slow, 2-5 seconds)
POST /stripe/charge

-- Transaction 2: Confirm purchase (fast, <100ms)
-- Triggered by webhook after payment completes
BEGIN;
  UPDATE units SET purchased_by = ? WHERE id = ?;
COMMIT;

Transactions are short-lived (milliseconds). Long-running operations (payment) happen outside transactions. This keeps the database responsive.

The Expiration Problem: Abandoned Carts

User 1023 adds to cart but never completes payment. Maybe they got distracted, changed their mind, or their app crashed. Unit #1 is stuck in limbo: picked_at is set, but purchased_by is NULL.

Without intervention, this unit would be unavailable forever. We need a cart expiration mechanism:

-- CRON job runs every minute
UPDATE units
SET picked_at = NULL,
    picked_by = NULL
WHERE picked_at < (NOW() - INTERVAL '12 minutes')
  AND purchased_by IS NULL;

Logic:

If a unit was picked more than 12 minutes ago and hasn't been purchased, reset it. This gives users 12 minutes to complete payment, reasonable for a checkout flow.

Why 12 minutes for a 15-minute flash sale?

You want most of the sale window available for picking, not tied up in expired carts. 12 minutes gives 3 minutes for payment, then inventory recirculates.

Trade-off:

  • Too short (e.g., 2 minutes): Users don't have enough time, abandoned carts recirculate quickly but frustrate legitimate buyers

  • Too long (e.g., 30 minutes): Inventory gets locked up by abandoned carts, reducing availability

12 minutes is a sweet spot for most flash sales.

Preventing Disappointment vs. Underselling

This expiration mechanism creates a philosophical tension:

Option 1: Disappoint Some Customers

You could expire carts aggressively (say, 2 minutes). This means:

  • More inventory recirculates during the sale

  • More people get a chance to buy

  • But some legitimate buyers get timed out while filling in payment details

Option 2: Undersell

You could expire carts slowly (say, 30 minutes). This means:

  • Fewer people get timed out (better experience for those who succeed)

  • But many units get "locked" by abandoned carts

  • If 50% of users abandon carts, you only sell 500 of your 1,000 units during the sale window

  • You have to deal with the business impact of underselling (lost revenue, unfulfilled demand)

Most companies choose Option 1 (shorter expiration) because underselling is worse than temporary disappointment. You can always extend the sale or restock, but underselling means leaving money on the table.

Why This Approach Scales

Non-Blocking Concurrency

SKIP LOCKED means 10,000 users can pick units simultaneously. Each gets a different unit, no waiting, no contention.

Row-Level Locking

Locks are on individual units, not the whole table or item. User A locking unit #1 doesn't block User B from locking unit #2. This is fundamentally different from locking at the item level (where all users would block each other).

Short Transactions

Transactions last milliseconds (SELECT + UPDATE). Fast transactions mean locks are held briefly, maximizing throughput.

Database Handles Concurrency

We're not writing custom locking logic in application code (prone to bugs). The database's transaction engine handles all concurrency control, battle-tested over decades.

Scalability Math:

- Each transaction: ~10ms (SELECT + UPDATE)
- Database can handle: 100 concurrent transactions
- Throughput: 100 transactions / 10ms = 10,000 transactions/second
- For 1,000 units: Depletes in ~0.1 seconds

Real-world (with payment overhead):
- Pick operation: 10ms
- Payment initiation: 50ms
- Total per user: 60ms
- With 100 concurrent connections: 1,666 picks/second
- For 1,000 units: Depletes in ~0.6 seconds

Even under heavy load, inventory depletes in under a second if demand is high enough.

Similar Systems Using This Pattern

Ticket Booking (BookMyShow, Ticketmaster)

Table: seats
seat_id | show_id | picked_at | picked_by | purchased_by
A1      | 2501    | NULL      | NULL      | NULL
A2      | 2501    | NULL      | NULL      | NULL
...

Each seat is a unit. When you select seat A1, it gets picked_at set. If you don't pay within 10 minutes, it's released.

Hotel Booking

Table: room_nights
id | room_id | date       | picked_at | picked_by | purchased_by
1  | 305     | 2026-01-10 | NULL      | NULL      | NULL
2  | 305     | 2026-01-11 | NULL      | NULL      | NULL
...

Each (room, date) combination is a unit. Booking room 305 for Jan 10-11 picks two units.

IRCTC Train Booking

Table: berths
berth_id | train_id | date       | picked_at | picked_by | purchased_by
1A-UB    | 12345    | 2026-01-15 | NULL      | NULL      | NULL
1A-LB    | 12345    | 2026-01-15 | NULL      | NULL      | NULL
...

Each berth is a unit. High contention during festival bookings handled gracefully by SKIP LOCKED.

Shopify Flash Sales

Shopify's inventory system uses a variant of this pattern for limited edition drops. When Supreme releases 500 hoodies, each hoodie is a unit in their inventory system.

The Complete Flow: A Successful Purchase

12:00:00.000 - User 1023 clicks "Add to Cart":

BEGIN;
SELECT * FROM units WHERE item_id = 720 AND picked_at IS NULL
ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- Returns: unit #1

UPDATE units SET picked_at = NOW(), picked_by = 1023 WHERE id = 1;
COMMIT;

12:00:00.050 - User shown "Item Added, Proceed to Checkout":

Frontend: "You have 12 minutes to complete purchase"

12:00:15.000 - User clicks "Pay Now":

POST /api/payment
→ Stripe API call initiated

12:00:18.000 - Stripe processes payment:

Stripe → Webhook → POST /yoursite/webhook/payment
{"event": "payment_intent.succeeded", "unit_id": 1, "user_id": 1023}

12:00:18.100 - Webhook handler updates database:

UPDATE units
SET purchased_by = 1023, purchased_at = NOW()
WHERE id = 1;

12:00:18.200 - Order confirmation email sent:

"Congratulations! Your iPhone 15 order is confirmed."

Total time from click to confirmation: 18 seconds. The unit was locked for the entire period, preventing double-selling.

The Elegance of Simplicity

This flash sale design is beautifully simple:

  • No distributed locks: Database transactions handle everything

  • No Redis reservations: Everything in the primary database

  • No complex state machines: Three states (available, picked, purchased)

  • No coordination between servers: Each app server independently queries the database

Yet it handles the hardest problem in e-commerce: selling exactly N items to the first N buyers when 100,000 people compete simultaneously.

The secret isn't exotic technology, it's understanding how database locking works and leveraging FOR UPDATE SKIP LOCKED to turn concurrent chaos into orderly, fair distribution. Sometimes the best solutions aren't the most complex; they're the ones that use simple tools exactly right.

Additional Reading Materials

Read these books:

  1. Introduction to information retrieval

    Book by Christopher D. Manning

That's all for now folks. See you in the next blog!

System Design

Part 8 of 9

In this series, we'll start learning the system design fundamental concepts that will help us become better software engineers. We'll understand real life case studies where certain decisions are more optimal than others, given the specific context.

Up next

Building Algorithmic Systems

From predicting driver locations in real-time to syncing files across devices, from counting ad impressions to mapping social connections, modern applications demand solutions to computational problems that seem deceptively simple at first glance. Ho...