Building Storage Engines
Modern Speed & Scalability. Battle-Tested Patterns.

Modern applications demand speed and scalability, but traditional databases don't always deliver. In this deep dive, we'll explore three unconventional approaches: using Change Data Capture to build analytics pipelines without application events, designing a 1TB word dictionary with only object storage and clever indexing, and building Bitcask, the log-structured key-value store that powers Uber's infrastructure with O(1) operations. These aren't theoretical concepts, they're battle-tested patterns from companies operating at massive scale.
ETL and Tiered Storage
The Problem: When Your Data Store Doesn't Match Your Analytical Needs
Imagine you're running a multi-user blogging platform with MongoDB as your transactional database. Your application is thriving, but now the CEO wants detailed reports, statistics, and dashboards. There's just one catch: your insights team only knows SQL, not MongoDB's query language.

Your first instinct might be to instrument your API layer to emit events to Kafka, process them through workers, and populate a SQL database. But this approach comes with significant challenges:
Events are application-centric, not data-centric. They carry the context of what happened in your application, but they might not contain all the data needed for analytics. Not every database change triggers an event, and worse, what happens when a database change succeeds but the event emission fails? You've just introduced a consistency problem that could quietly corrupt your analytics over time.
Enter Change Data Capture (CDC)
Instead of relying on application events, CDC takes a fundamentally different approach: it treats your database as the source of truth. Tools like Airbyte and Debezium tap directly into your database's commit log or binary log, capturing every change as it happens at the storage layer.
Here's what makes CDC powerful:
It's comprehensive. Every insert, update, or delete on your 'Blogs' table generates a CDC event. You're not dependent on developers remembering to emit events or application logic executing correctly.
It's transformable. You can optionally transform data as it flows through the CDC pipeline, adapting it to your analytics requirements.
It's flexible. You can route changes to various sinks like Kafka, SQS, or directly to your target database. For complex transformations, you might choose Kafka as an intermediate sink and build custom processing pipelines.
CDC has become the backbone of multi-tiered storage architectures, where hot transactional data lives in one database while historical or analytical data migrates to specialized stores optimized for different query patterns.

If you've never implemented CDC, I highly recommend experimenting with Airbyte. It's become an essential tool in modern data architecture discussions.
Designing a Distributed Cache
Requirements
We need to build a caching system that delivers high throughput and low latency at scale. The basic operations are straightforward (GET, PUT, DELETE) but we also need to support time-to-live (TTL) for automatic key expiration. The real challenge isn't implementing these operations on a single node; it's making them work seamlessly across a distributed system.
Single-Node Cache: Building the Foundation
Data Structure and Storage
At its core, a cache maps keys to objects. Each object has a type (list, set, string, etc.) that determines which operations it supports:
object struct {
type list
data
}
These objects live in a hash table, giving us O(1) access times:
{
key1 -> Object { type: list }
key2 -> Object { type: set }
}
Communication Protocols
Your cache server needs to communicate with clients. You have three main options:
Raw TCP with a custom protocol (like Redis's RESP protocol) offers the best performance. You maintain connection pools, avoiding the overhead of TCP's three-way handshake on every request.
HTTP is familiar and easy to debug, but comes with more overhead.
gRPC strikes a balance, offering efficient serialization and connection pooling with good tooling support.
The smart approach? Abstract your implementation so you can support all three. Different clients may have different needs.
Eviction: What Happens When the Cache is Full?
Your cache will eventually fill up. You need an eviction policy to make room for new entries. The choice depends on your access patterns:
LRU (Least Recently Used) works well for CDNs or news sites where recent content matters most. LFU (Least Frequently Used) suits scenarios like Wikipedia where some content remains perpetually popular. Random eviction is simpler but less intelligent.
Both LRU and LFU can be implemented in constant time using clever data structures; look into the details if you're interested in the implementation.
How do you know the cache is full? You can either cap the number of keys in your hash table or track the cumulative size of all values. When you hit that limit, trigger eviction.
Handling TTLs
Each key gets an absolute expiration timestamp. A background cleanup process periodically scans for expired keys and frees them.
But how do you efficiently find keys to clean up? You have two approaches:
Priority Queue: Maintain a min-heap ordered by expiration time. This is simple and consistent, but requires an additional data structure.
Sampling (Redis's approach): Combine lazy deletion (clean up on access) with periodic sampling. Pick 20 random keys, free the expired ones, and repeat until fewer than 25% of sampled keys are expired. This is probabilistically effective without the overhead of a priority queue.
Concurrency
When multiple clients try to update or delete the same key simultaneously, you need a strategy:
Pessimistic locking uses mutexes or semaphores to ensure only one operation proceeds at a time. It's safe but can create bottlenecks.
Optimistic locking uses compare-and-swap operations. Updates are conditional: "update value to 20 where current value is 10." If another operation changed the value first, yours fails and you can retry.
Single-threaded execution (like Redis's core) sidesteps concurrency issues entirely by processing one command at a time, though you lose parallelism.
Vertical Scaling
Before going distributed, you can scale vertically by adding more RAM, CPU, and network bandwidth to a single node. This is often the simplest first step.
Going Distributed
A single node eventually hits its limits. A distributed cache is multiple cache servers working together as a cohesive system. This introduces new challenges: routing, availability, reliability, and scaling.
Routing Strategies
Hash-based routing uses modulo arithmetic:
i = hash(key) % n. Key k1 goes to nodehash(k1) % 3.Range-based routing partitions the keyspace:
cache 1 handles [a-j]
cache 2 handles [k-p]
cache 3 handles [q-z]
The problem with simple hash-based routing? When you add or remove a node, the number changes, and suddenly 50% or more of your keys hash to different nodes. All that data needs to move.
Consistent Hashing: Minimizing Data Movement
Consistent hashing solves the data movement problem by arranging nodes in a ring. Each node "owns" a range of the hash space. When you add cache node 5, you might split cache node 1's range, but you only need to move data from node 1 to node 5. The rest of the cluster is unaffected.

Adding a node: Minimal data transfer from the node whose range is being split.
Removing a node: Graceful shutdown redistributes its data to neighbors, or in an abrupt failure, the data is simply lost until it's requested again and cached from the source.
Note that virtual nodes (placing each physical node at multiple points on the ring) work well for stateless systems but complicate stateful systems like caches because they make data movement less predictable.
High Availability and Reliability
Replicas provide redundancy but introduce the possibility of stale reads.
Standby nodes can take over if a primary fails.
Write-ahead logging helps recover state after crashes, though it adds overhead.
The trade-offs here depend on how critical cache availability is versus the cost of a cache miss (fetching from the slower underlying data store).
Distributed Hash Tables (DHTs)
Under the hood, a distributed cache is essentially a Distributed Hash Table that maps keys to values across nodes, optimized for minimal data movement during membership changes.
DHTs power more than just caches. They're used in distributed file systems, DNS servers, instant messaging platforms, and famously, BitTorrent's peer discovery mechanism.
Fun Exercise: How would you use MySQL as a cache server with the same throughput as Redis? What would you need to optimize? Think about data structures, eviction policies, and whether you'd need to bypass MySQL's query optimizer entirely.
Designing a Word Dictionary without a Database
Requirements and Constraints
We're building a word dictionary service with some interesting constraints. The dictionary contains 170,000 words with a total size of 1TB. Words and their meanings are updated weekly through a changelog. Our lookups are always for singular words, with no repetitive entries. The system needs to be scalable, both in storage (portable) and API servers, though response time isn't our primary concern. Here's the catch: we cannot use a traditional database.
The Storage Challenge
Without MySQL, PostgreSQL, or MongoDB to lean on, we need to get creative. Let's explore using object storage like S3 or a network-attached file system as our data store.
Approach 1: One File Per Word
The simplest idea: store each word in its own file with a logical directory structure.
s3://word-dictionary/a/apple.txt
s3://word-dictionary/a/america.txt
s3://word-dictionary/a/automatic.txt
...
s3://word-dictionary/z/zoo.txt
Lookup becomes straightforward: construct the S3 path from the word, fetch the file, return the contents.
But this breaks a critical requirement: portability. With 170,000 separate files scattered across S3, you can't easily move your dictionary around. Downloading or backing up means 170,000 individual operations. That's a deal-breaker.
Approach 2: Single File Dictionary
Let's consolidate everything into one file: data.dat.
a, first letter in english alphabet
abandon, to leave and never return to
ability, power or skill to do something
able, having the power that is needed to do something
...
zone, an area that is different from other areas
zoo, a place where many kinds of animals are kept
We'll use CSV format, simple and universal. The file will be 1TB, but it's now portable: one file to download, one file to backup.
The lookup problem: How do you find "zebra" without reading through the entire 1TB file? Linear scanning is impossibly slow.
Making Lookups Fast: Indexing
The universal strategy for faster lookups is indexing. Let's create a separate index file that maps words to their byte offsets in the data file.
index.dat:
a: 0:127
abandon: 127:130
ability: 257:100
able: 357:150
...
zone: 1023795:100
zoo: 1023895:196
Each entry contains the word, its starting byte offset, and the length of the entry.
How big is this index? With 171,476 words and an average word length of 4.7 characters, each index entry is roughly 15.7 bytes (word + separators + offset + length). That's about 2.6 MB total, small enough to fit comfortably in memory.
The Lookup Flow
On boot, the API server downloads and loads
index.datinto memoryWhen a request arrives for a word, the server looks it up in the in-memory index
The index returns the byte offset and length
The server makes a range request to S3 to fetch just that portion of
data.datThe meaning is parsed and returned to the user
This is fast. The index lookup is O(log n) with binary search or O(1) with a hash map, and we only fetch the exact bytes we need from S3.
Handling Weekly Updates
Updates arrive as changelogs. Here's the procedure:
Spin up a temporary server
Download the current dictionary locally
Merge the changelog with the existing data in O(n) time
Create new
data.datandindex.datfiles locallyUpload them to S3
Why is merge O(n)? Both the dictionary and changelog are sorted alphabetically. Merging two sorted lists is linear time, a classic algorithm.
The Transition Problem
If we simply overwrite the existing files at the same S3 paths, we create a race condition:
s3://word-dictionary/data.dat ← updated
s3://word-dictionary/index.dat ← updated
But your API servers have already loaded the old index into memory. When index.dat points to offset 1024 for "apple," but you're reading from the new data.dat, you'll get garbage data. Users will see corrupted responses during the transition window.
How do we smoothly transition without downtime or corrupted reads?
Solution: Versioned Paths with Metadata
Upload new versions to separate paths and use a metadata file to coordinate the transition:
s3://word-dictionary/001/data.dat
s3://word-dictionary/001/index.dat
s3://word-dictionary/002/data.dat
s3://word-dictionary/002/index.dat
s3://word-dictionary/meta.json
meta.json:
{
"index": "002/index.dat",
"data": "002/data.dat",
"version": "002"
}
The flow:
Old servers continue serving from version 001
You upload the new data and index to version 002
You update
meta.jsonto point to version 002New servers (or restarted servers) fetch
meta.jsonand load version 002Gradually roll over your server fleet
Once all servers are on version 002, you can delete version 001
This gives you zero-downtime deployments with no data corruption.
Alternative Transition Strategies
Periodic refresh: Servers periodically re-check S3 and reload the index if it's changed. Simple, but introduces eventual consistency.
Reactive (Pub/Sub): When you upload a new version, publish a message to SNS or a similar service. Servers subscribe and reload on notification. More complex but lower latency.
The versioned path approach is the most robust because it doesn't require servers to reload mid-operation, they simply start fresh with the correct version.
Improving Portability: The Single-File Format
Two files (index.dat and data.dat) are better than 170,000, but we can do even better. Let's merge them into a single file.
The challenge: How do you know where the index ends and the data begins?
You could use a separator, but parsing for separators in a 1TB file is expensive and fragile. The better solution is a fixed-length header.
File Format with Header
[HEADER - fixed 256 bytes]
[INDEX - variable size]
[DATA - variable size]
The header stores:
Offset where index starts (byte 256)
Offset where data starts
Total number of words
Version number
Checksum for integrity
The new flow:
API server downloads the first 256 bytes (the header)
Header tells the server where the index starts and how long it is
Server downloads just the index portion (still only ~2.6 MB)
Server starts serving requests using range reads into the data section
Now you have true portability: one file containing everything, with efficient random access.
Real-World Applications
This pattern, storing structured data in flat files with indexing, appears in many systems beyond dictionaries:
Multi-tiered storage: Companies often keep recent orders in MySQL for fast queries, but archive historical orders to S3. By using indexed flat files on S3, you can still query old data without keeping terabytes in an expensive database.
Log aggregation: Systems like Elasticsearch use similar techniques, storing indexed segments as immutable files.
Embedded databases: SQLite's file format uses these principles, a single portable file with internal indexing structures.
The key insight is that databases aren't magic. They're implementations of indexing, caching, and storage strategies that you can replicate when your requirements call for it. In our case, the constraint against traditional databases forced us to reinvent these patterns, and in doing so, we gained portability and cost efficiency that might have been harder to achieve with a conventional setup.
Designing a Log-Structured Key-Value Store
Requirements: Speed and Persistence
We need a key-value store that's superfast for reads, writes, and deletes while guaranteeing persistence. The challenge is achieving high throughput without sacrificing durability or query performance.
The Core Idea: Log-Structured Storage
Traditional databases perform random writes, seeking across the disk to update records in place. This is slow. Log-structured storage takes a different approach: all data is stored in append-only files with sequential writes. There are no random updates, everything is written sequentially to the end of a log.
The benefit? Sequential writes are orders of magnitude faster than random writes on spinning disks. We're talking about potential gains of 5,000x on HDDs. Even on SSDs, where the gap is smaller, sequential writes still offer significant performance advantages.
The Simplest Design: A Single Log File
Let's start with the most basic implementation, a single file containing key-value pairs:
K1: V1
K2: V2
K3: V3
...
K1: V1' ← updated value for K1
K2: V2' ← updated value for K2
PUT(k, v): Append the key-value pair to the end of the file. This is a lightning-fast sequential write.
DELETE(k): Here's a clever trick, a delete is just a special PUT operation. We write PUT(k, -1) or another tombstone marker. This keeps all operations as simple appends.
Entry Format: Handling Variable-Length Data
Keys and values have variable lengths, so we can't just "read until newline." We need to know exactly how many bytes to read. The solution is to prefix each entry with metadata:
[KS2] [VS2] [K] [V]
KS2 (4 bytes): Key size
VS2 (4 bytes): Value size
K (KS2 bytes): The key
V (VS2 bytes): The value
When reading an entry, we first read 8 bytes to get the key and value sizes, then read exactly KS2 + VS2 bytes for the data.
Ensuring Integrity: CRC and Timestamps
What happens if the machine crashes while writing a value? We need to detect corrupt entries. The solution is a CRC (Cyclic Redundancy Check), a checksum that's calculated and written first:
[CRC] [TS] [KS2] [VS2] [K] [V]
CRC: Checksum for integrity verification
TS: Timestamp to resolve conflicts (useful for distributed scenarios)
When reading, we verify the CRC. If it doesn't match, we know the entry is corrupt and can ignore it.
Making Reads Fast: In-Memory Indexing
Appending to a log makes writes fast, but what about reads? Scanning through gigabytes of data to find one key is unacceptable.
The solution: Maintain an in-memory hash table that maps keys to their file offsets.
In-memory index:
K1 → {file_id: 3, offset: 1024, size: 256, timestamp: ...}
K2 → {file_id: 3, offset: 1280, size: 128, timestamp: ...}
GET(k) flow:
Look up the key in the in-memory hash table (O(1))
Seek to the offset in the file
Read the entry
Return the value
This gives us O(1) reads, a hash table lookup followed by a single disk seek.
The trade-off: All keys must fit in memory. This is the fundamental limitation of this design.
Handling File Growth: Rotation and Immutability
As you keep appending, your log file grows unbounded. The solution is file rotation, when the active file reaches a certain size (say, 1GB), we close it and make it immutable. All new writes go to a fresh active file.
Now we have multiple files, but only one is "active" for writes:
[File 1] [File 2] [File 3] [File 4 - ACTIVE]
└─────────────── Immutable ──────────┘
The in-memory index now tracks which file contains each key.
Optimization: Merge and Compaction
With multiple immutable files, we accumulate duplicates and deleted entries. A key might be updated multiple times, leaving stale versions scattered across files. Tombstones from deletions take up space.
Merge and compaction solves this:
Select several immutable files
Read through them sequentially
For each key, keep only the latest version
Skip entries marked as deleted (tombstones)
Write the compacted data to new files
Atomically update the in-memory index to point to the new offsets
Delete the old files
This reclaims space and reduces the number of files, improving read efficiency.
Why Atomic Index Updates Matter
During compaction, file offsets change. If you update the index incrementally while reads are happening, a reader might look up an old offset in a file that's been deleted or modified. By updating the index atomically (all at once, perhaps by swapping pointers), you ensure reads always see a consistent view.
What We Just Built: Bitcask
This design is Bitcask, an embedded database originally created by Basho Technologies. It's one of the most efficient key-value stores ever designed for write-heavy workloads.
Bitcask's Strengths
O(1) reads, writes, and deletes: Hash table lookups and sequential I/O
High throughput, low latency: Sequential writes saturate disk I/O bandwidth
Simple and reliable: Append-only logs are easy to reason about and recover from crashes
Easy backups: Just copy the immutable files
Bitcask's Limitation
All keys must fit in memory. If you have a billion keys, even with small metadata (say, 50 bytes per key), that's 50GB of RAM just for the index.
Real-World Usage: Riak
Bitcask isn't a toy. Uber used it in production as the storage engine for Riak, a distributed database. Each node in a Riak cluster runs an instance of Bitcask.
Bitcask's predictable performance and simplicity make it ideal for scenarios where:
Write throughput is critical
Keys are manageable in memory
You need single-digit millisecond latencies
Durability is non-negotiable

The Proxy Pattern in Riak
As shown in the diagram, clients connect to Riak through a proxy. The proxy routes requests to the appropriate Riak node (using consistent hashing or another partitioning scheme). Each Riak node stores a subset of keys in its local Bitcask instance.
This architecture combines Bitcask's local performance with Riak's distributed coordination, giving horizontal scalability while retaining Bitcask's speed at each node.
The lesson: Sometimes the fastest database design isn't the most complex. By embracing simplicity, append-only logs, in-memory indexes, and straightforward compaction, Bitcask achieves performance that more sophisticated systems struggle to match. It's a reminder that understanding your workload and constraints can lead to elegantly simple solutions.
That's all for now folks. See you in the next blog!



