A distributed key-value store sits at the core of nearly every large-scale system - from DynamoDB to Cassandra to etcd. The design forces you to make hard choices between consistency and availability, pick a partitioning scheme, and build a replication strategy that actually works during network partitions. This walkthrough covers the full design with real tradeoffs, not textbook idealism.
A distributed key-value store is the most fundamental building block in distributed systems. Every large tech company has built or heavily customized one: Amazon has DynamoDB, Meta has RocksDB/ZippyDB, Apple has FoundationDB. The reason is simple - when you need sub-10ms lookups at millions of QPS, you need something purpose-built.
The critical decision upfront: favor availability over strong consistency (AP in CAP terms). Most key-value store use cases (caching, session storage, user profiles) tolerate brief staleness but cannot tolerate downtime. Strong consistency (CP) is the right choice for coordination services like etcd/ZooKeeper, but not here.
Start from the throughput requirements and work backward to infrastructure.
Bottom line: ~50 storage nodes with 64GB RAM and 4TB SSD each. This is a medium-sized cluster - very manageable.
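The ~50-node bottom line can be sanity-checked with a back-of-envelope calculation. The workload figures below (1M QPS aggregate, 20K QPS per node, a 20TB logical dataset) are illustrative assumptions, not numbers from the text:

```python
# Back-of-envelope cluster sizing; all workload numbers are assumptions.
peak_qps = 1_000_000        # assumed aggregate read+write throughput
qps_per_node = 20_000       # assumed per-node capacity for an LSM store on SSD
nodes_for_throughput = peak_qps // qps_per_node           # 50 nodes

dataset_tb = 20             # assumed logical dataset size
replication = 3
disk_per_node_tb = 4
raw_tb = dataset_tb * replication                         # 60 TB with replicas
# Keep ~50% disk headroom for compaction and growth (ceil division).
nodes_for_storage = -(-raw_tb * 2 // disk_per_node_tb)    # 30 nodes

nodes = max(nodes_for_throughput, nodes_for_storage)
print(nodes)  # 50 -> the cluster is throughput-bound, not storage-bound
```

Under these assumptions the cluster is throughput-bound, which is typical for a cache-like KV workload; a colder, larger dataset would flip it to storage-bound.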
Keep the API dead simple. A key-value store has three operations. Don't over-design this.
- `PUT /kv/{key}` - write a value (optional `?consistency=quorum|one`)
- `GET /kv/{key}?consistency=quorum` - read a value
- `DELETE /kv/{key}?consistency=quorum` - delete a key
High-Level Architecture
The architecture is a ring of storage nodes with a coordinator layer in front. Every request touches exactly 3 nodes (the replication factor).
1. Client sends PUT to any node. That node becomes the coordinator for this request.
2. Coordinator hashes the key using consistent hashing to find the primary node responsible for this key.
3. Coordinator forwards the write to the primary node and its 2 clockwise neighbors on the ring (replicas).
4. If `consistency=quorum`, coordinator waits for 2 of 3 nodes to acknowledge. If `consistency=one`, it returns after the first ack.
5. The remaining replica(s) receive the write asynchronously.
1. Client sends GET to any node (coordinator).
2. Coordinator hashes the key, identifies the 3 responsible nodes.
3. If `consistency=quorum`, coordinator reads from 2 of 3 nodes and returns the value with the highest version/timestamp.
4. If the 2 reads disagree, coordinator triggers a read-repair: sends the newer value to the stale node.
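The write and read paths above can be sketched with an in-process toy cluster. Everything here is an illustrative stand-in: the node names, the md5-mod placement (a real system uses the consistent-hashing ring), and dict-backed stores instead of RPCs; a real coordinator also issues per-replica requests in parallel and returns as soon as the quorum is met.

```python
import hashlib

NODES = ["n0", "n1", "n2", "n3", "n4"]
stores = {n: {} for n in NODES}   # each node's local storage, in-process

def replicas(key, n=3):
    """Hash the key to a starting node; the next n-1 nodes are replicas.
    (Stand-in for the consistent-hashing ring.)"""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(n)]

def put(key, value, version, consistency="quorum"):
    required = 2 if consistency == "quorum" else 1
    acks = 0
    for node in replicas(key):        # sent in parallel in a real coordinator
        stores[node][key] = (version, value)
        acks += 1
    return acks >= required           # real code returns at `required` acks

def get(key, consistency="quorum"):
    reps = replicas(key)
    count = 2 if consistency == "quorum" else 1
    results = [(node, stores[node].get(key)) for node in reps[:count]]
    # Highest (version, value) tuple wins among the responses.
    winner = max((r for _, r in results if r is not None), default=None)
    for node, r in results:           # read-repair: push newer value to stale node
        if winner is not None and r != winner:
            stores[node][key] = winner
    return winner
```

A quorum read after a partial write both returns the newest version and repairs the replica that missed it, which is exactly the read-repair step in the flow above.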

Detailed Component Design
Three components are worth going deep on: the consistent hashing ring, the storage engine, and the failure detection system.
Each physical node gets 150 virtual nodes (vnodes) placed at random positions on a 2^128 hash ring (using MD5 or MurmurHash3). When a key arrives, hash it and walk clockwise to find the first vnode - that's the primary. The next 2 distinct physical nodes clockwise are the replicas.
Why 150 vnodes? Fewer vnodes cause uneven load distribution. More vnodes increase memory overhead and slow down ring recalculation. 150 is the sweet spot - Cassandra uses 256 by default, DynamoDB uses ~30 with more sophisticated placement. At 50 nodes × 150 vnodes = 7,500 ring entries, the ring trivially fits in memory.
When a node joins, it takes ownership of its vnodes' ranges. Existing nodes stream data for those ranges to the new node. When a node leaves, its ranges are redistributed to clockwise neighbors. No data movement for unaffected ranges.
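The vnode ring can be sketched in a few lines using a sorted list and binary search. This is a minimal illustration (MD5 for the hash, a handful of vnodes, and it assumes the cluster has at least `n` physical nodes); production rings also track token ownership for streaming on join/leave.

```python
import bisect
import hashlib

class Ring:
    """Consistent-hashing ring with virtual nodes (sketch)."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, physical_node) pairs

    def _hash(self, s):
        # 128-bit hash position on the ring
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def replicas(self, key, n=3):
        """Walk clockwise from the key's hash; collect the first n
        DISTINCT physical nodes (assumes >= n nodes in the cluster)."""
        idx = bisect.bisect(self.ring, (self._hash(key),))
        out = []
        i = idx
        while len(out) < n:
            node = self.ring[i % len(self.ring)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out
```

Because vnode positions depend only on node names, every coordinator computes the same replica set for a key with no central lookup.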
1. Append to write-ahead log (WAL) on disk - sequential write, fast.
2. Insert into in-memory MemTable (a skip list).
3. When MemTable fills (64MB default), flush to an immutable SSTable on disk.
4. Background compaction merges SSTables to reclaim space and maintain read performance.
This is the LSM-tree pattern. It's write-optimized because all writes are sequential (WAL append + MemTable insert). Reads check MemTable first, then SSTables with a Bloom filter to skip irrelevant files.
Why RocksDB over B-tree databases (PostgreSQL, InnoDB)? B-trees require random I/O for writes, which limits write throughput on SSDs to ~10K ops/sec. LSM trees sustain 50K+ writes/sec on the same hardware because every write is sequential.
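The write and read paths of the LSM pattern can be sketched in memory. Lists and dicts stand in for on-disk files, the 64MB flush threshold becomes an entry count, and compaction (step 4) is omitted; this is an illustration of the control flow, not an engine.

```python
class LSMStore:
    """Minimal in-memory sketch of the LSM write/read path."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.wal = []        # stands in for the sequential write-ahead log
        self.memtable = {}   # a skip list in a real engine
        self.sstables = []   # immutable sorted runs, newest first

    def put(self, key, value):
        self.wal.append((key, value))      # 1. sequential WAL append
        self.memtable[key] = value         # 2. in-memory insert
        if len(self.memtable) >= self.flush_threshold:
            # 3. flush: freeze the memtable as a sorted, immutable SSTable
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable, self.wal = {}, []

    def get(self, key):
        if key in self.memtable:           # newest data first
            return self.memtable[key]
        for sst in self.sstables:          # newest to oldest; Bloom filters
            if key in sst:                 # skip irrelevant files in practice
                return sst[key]
        return None
```

Note how a re-written key is served from the memtable even though an older copy still sits in an SSTable - that shadowing is what compaction later cleans up.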
Every node gossips with 3 random peers every second, exchanging heartbeat counters and membership info. Use the Phi Accrual failure detector (same as Cassandra) instead of a fixed timeout. It calculates a "suspicion level" (phi) based on heartbeat arrival patterns. When phi > 8 (configurable), the node is marked as suspected-down.
Why not a centralized health checker (like ZooKeeper)? It becomes a single point of failure and doesn't scale past ~1,000 nodes. Gossip scales to tens of thousands of nodes with O(log N) convergence time.
When a node is marked down, its vnodes' ranges are temporarily handled by the next nodes clockwise (hinted handoff). Writes destined for the dead node are stored as "hints" on the neighbor. When the node recovers, hints are replayed. If it doesn't recover within 3 hours, a full replacement is triggered.

Data Model & Database Design
This is a key-value store, so the "data model" is deceptively simple. The complexity lives in how data is organized on disk and replicated across nodes.
RocksDB stores records as sorted key-value pairs. The sort key is (key_bytes, version_desc), so the newest version of a key always sorts first. Old versions are garbage-collected during compaction.
Consistent hashing on the key determines which 3 nodes store the data. No range partitioning - pure hash partitioning. This means load spreads uniformly across the ring, but range scans over keys are not supported.
Each key lives on 3 nodes (replication factor = 3). Writes go to all 3; reads go to 1, 2, or 3 depending on consistency level.
When replicas disagree, use Last-Write-Wins (LWW) with the timestamp field. This is simple and works for 99% of use cases. The tradeoff: if two clients update the same key simultaneously on different nodes, one write silently wins.
For use cases where silent data loss is unacceptable, support vector clocks as an opt-in. Vector clocks detect conflicts and return both versions to the client, who resolves it (this is what DynamoDB does with its "shopping cart" example). But vector clocks add metadata overhead (~20 bytes per node in the cluster per key) and push complexity to the client.
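The vector-clock comparison that distinguishes a true conflict from a stale read fits in a few lines. A clock is a dict of node-name to counter; this is a generic sketch of the standard algorithm, not any particular store's implementation.

```python
def vc_compare(a, b):
    """Compare two vector clocks (dicts of node -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent' (a real conflict)."""
    keys = set(a) | set(b)
    a_less = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    a_more = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    if a_less and a_more:
        return "concurrent"   # neither happened-before the other
    if a_less:
        return "before"
    if a_more:
        return "after"
    return "equal"

def vc_merge(a, b):
    """Element-wise max: the clock the client writes back after resolving."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
```

When `vc_compare` returns `"concurrent"`, the store hands both versions to the client, which resolves them and writes back under the merged clock - exactly the push-to-client complexity the paragraph above describes.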
Store ttl_expiry in the record. A background process on each node scans for expired keys every 10 seconds and writes tombstones. Tombstones are garbage-collected after 72 hours (enough time for all replicas to see the delete via anti-entropy).

Deep Dives
Deep Dive 1: Handling Network Partitions
Problem: Your 50-node cluster splits into two groups of 25 due to a network partition. Both sides are running, both sides receive client traffic. What happens?
Approach: This is where the CAP theorem stops being theoretical. Since we chose AP (availability + partition tolerance), both sides continue serving reads and writes. Keys that are replicated across both sides of the partition will diverge.
When the partition heals, replicas sync up via anti-entropy (Merkle tree comparison). Conflicting writes are resolved by LWW - the write with the higher timestamp wins. This means some writes during the partition are silently discarded.
Tradeoff: If you need zero data loss during partitions, you'd need to switch to CP mode - reject writes to keys that can't reach a quorum. This is what etcd/ZooKeeper do. The cost is availability: during a partition, some keys become unwritable. For a general-purpose KV store, the AP choice is almost always correct.
Deep Dive 2: Hot Keys
Problem: A single key (e.g., a viral tweet's like count) gets 50K reads/sec. It lives on 3 nodes, but even with quorum reads hitting 2 of 3, each node handles ~33K reads/sec for this one key. That saturates the node.
Approach: Client-side caching with short TTL (1-5 seconds). The client library caches hot keys locally and serves reads from cache. This eliminates 95%+ of hot-key traffic before it hits the cluster.
For writes, use a different strategy: buffer writes in a per-node counter and flush periodically. Instead of 50K increment operations, do 50 batch increments of 1,000 each. This is the approach Facebook uses for like counts.
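The write-batching idea can be sketched as a small client-side buffer. Names and the batch size are illustrative, and a plain dict stands in for the KV store's increment endpoint; a real client would also flush on a timer so small counters don't sit in the buffer forever.

```python
from collections import defaultdict

class BufferedCounter:
    """Coalesce per-key increments client-side: one batched add per
    `batch_size` increments instead of one RPC per increment."""

    def __init__(self, store, batch_size=1000):
        self.store = store            # dict-like stand-in for the KV store
        self.batch_size = batch_size
        self.pending = defaultdict(int)
        self.flushes = 0              # batched writes actually sent

    def incr(self, key, n=1):
        self.pending[key] += n
        if self.pending[key] >= self.batch_size:
            self.flush(key)

    def flush(self, key):
        delta, self.pending[key] = self.pending[key], 0
        if delta:
            # one read-modify-write stands in for one network call
            self.store[key] = self.store.get(key, 0) + delta
            self.flushes += 1
```

With a batch size of 1,000, the 50K increments from the problem statement become 50 writes to the cluster - the 1000x reduction the paragraph above describes.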
Tradeoff: Both approaches sacrifice freshness. A 5-second client cache means reads can be 5 seconds stale. Batched writes mean the count is eventually consistent. For social media counters, nobody notices. For bank balances, you'd need a different approach (strong consistency + single-leader replication for that key).
Deep Dive 3: Data Repair and Anti-Entropy
Problem: Over time, replicas drift. A node was down for 30 minutes, missed 100K writes, came back up. Hinted handoff covers writes that arrived during downtime, but what about writes that were dropped? How do you detect and fix divergence?
Approach: Merkle trees. Each node maintains a Merkle tree over its key ranges. Every 10 minutes, replicas exchange Merkle tree roots. If roots match, data is consistent. If not, they walk down the tree to find exactly which key ranges diverged, then stream only the differing keys.
This is extremely efficient: comparing two Merkle trees for 1 billion keys requires exchanging only ~20 tree levels, a few KB each. The actual data transfer is proportional to the number of divergent keys, not the total dataset.
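The tree walk is the whole trick: identical subtree hashes prune the comparison, so only divergent branches are descended. A minimal sketch, assuming a power-of-two number of leaves and SHA-256 (real systems hash key *ranges*, not individual keys):

```python
import hashlib

def _h(b):
    return hashlib.sha256(b).hexdigest()

def build(leaves):
    """Build a Merkle tree over a power-of-two number of leaves.
    Returns levels[0] = leaf hashes ... levels[-1] = [root]."""
    level = [_h(x.encode()) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(t1, t2, level=None, idx=0):
    """Walk down from the roots, descending only into subtrees whose
    hashes differ; return the indexes of divergent leaves."""
    if level is None:
        level = len(t1) - 1
    if t1[level][idx] == t2[level][idx]:
        return []                          # identical subtree: prune
    if level == 0:
        return [idx]
    return (diff(t1, t2, level - 1, 2 * idx)
            + diff(t1, t2, level - 1, 2 * idx + 1))
```

If the roots match, `diff` returns immediately; one divergent leaf costs a single root-to-leaf walk, which is why sync traffic scales with divergence rather than dataset size.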
Tradeoff: Merkle trees use ~1% of total storage for tree nodes. Rebuilding the tree after a compaction takes a few seconds. The 10-minute sync interval means divergence can persist for up to 10 minutes. For tighter consistency, reduce the interval - but this increases network overhead.
Key Trade-offs:
- Availability over strong consistency (AP): tolerates brief staleness, never downtime - the wrong choice for coordination workloads like etcd/ZooKeeper.
- Last-Write-Wins over vector clocks: simple conflict resolution, but concurrent writes can silently lose data.
- LSM storage over B-trees: far higher write throughput, at the cost of background compaction and multi-file reads.
- Gossip failure detection over a centralized health checker: no single point of failure and scales to huge clusters, at the cost of O(log N) convergence rather than instant detection.