A chat system is fundamentally a real-time message routing problem. The hard parts are maintaining millions of persistent WebSocket connections, guaranteeing message delivery even when recipients are offline, ordering messages correctly in group chats, and keeping presence status accurate across a distributed fleet of servers.
A chat system routes messages between users in real time. The core tension is between delivery speed (users expect messages in under 200ms) and reliability (losing a message is unacceptable).
500M DAU, each sending 40 messages/day on average.
That works out to 20 billion messages per day, roughly 230K messages per second on average and several times that at peak. A single database cannot handle this write rate; you need a distributed message store from day one.
This is why WhatsApp uses a write-optimized store, not PostgreSQL. You need something like Cassandra or a custom LSM-tree based system.
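These numbers can be sanity-checked with quick arithmetic. The ~1 KB average message size and 2x peak factor below are assumptions, not figures from the requirements:

```python
# Back-of-envelope capacity estimate for the chat system.
# Assumptions (not from the requirements): ~1 KB stored per message,
# peak traffic about 2x the daily average.

DAU = 500_000_000
MSGS_PER_USER_PER_DAY = 40
SECONDS_PER_DAY = 86_400
BYTES_PER_MSG = 1_000  # assumed average stored size per message

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY            # 20 billion/day
avg_msgs_per_sec = msgs_per_day / SECONDS_PER_DAY     # ~231K/sec average
peak_msgs_per_sec = avg_msgs_per_sec * 2              # assumed 2x peak factor
storage_per_day_tb = msgs_per_day * BYTES_PER_MSG / 1e12  # ~20 TB/day

print(f"messages/day: {msgs_per_day:,}")
print(f"avg messages/sec: {avg_msgs_per_sec:,.0f}")
print(f"storage/day: {storage_per_day_tb:.0f} TB")
```

The ~20 TB/day figure is what drives the storage discussion below.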
The chat system uses two protocols: WebSocket for real-time messaging and REST for everything else. This split is intentional - WebSocket gives you bidirectional streaming for messages, while REST handles stateless operations like creating groups or fetching history.
The client_msg_id is critical - it is a client-generated UUID for idempotency. If the client retries a send (e.g., after a network blip), the server deduplicates on this ID.
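The dedup check can be sketched as follows. The in-memory dict is illustrative only; in production the seen-set would be a TTL'd entry in Redis or a conditional write in the message store:

```python
import uuid

class MessageIngest:
    """Deduplicates retried sends on the client-generated client_msg_id."""

    def __init__(self):
        # client_msg_id -> server-assigned message_id. In production this
        # lives in Redis (with a TTL) or the message store, not in memory.
        self._seen: dict[str, int] = {}
        self._next_id = 0

    def accept(self, client_msg_id: str, payload: str) -> int:
        # A retry of an already-accepted message returns the same
        # server message_id instead of creating a duplicate.
        if client_msg_id in self._seen:
            return self._seen[client_msg_id]
        self._next_id += 1
        self._seen[client_msg_id] = self._next_id
        # ... publish (message_id, payload) to the message queue here ...
        return self._next_id

ingest = MessageIngest()
cid = str(uuid.uuid4())
first = ingest.accept(cid, "hello")
retry = ingest.accept(cid, "hello")  # network blip, client retries
assert first == retry  # deduplicated: same server message_id
```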
/api/v1/conversations/{conversation_id}/messages?before={cursor}&limit=50
/api/v1/groups
/api/v1/users/{user_id}/presence
High-Level Architecture
Follow a message from sender to recipient to understand the architecture.
1. Sender's phone sends a message over its WebSocket connection to a Gateway Server.
2. The Gateway Server authenticates the sender, assigns a server-side message_id and timestamp, and publishes the message to a Message Queue (Kafka).
3. A Message Router service consumes from Kafka. It looks up which Gateway Server the recipient is connected to (via the Session Service backed by Redis).
4. If the recipient is online: the Router pushes the message to the recipient's Gateway Server, which delivers it over the recipient's WebSocket.
5. If the recipient is offline: the Router writes the message to an Offline Message Store and triggers a push notification via APNs/FCM.
6. In parallel, a Storage Consumer writes the message to the persistent Message Store (Cassandra) for history.
Group messages fan out: the Router reads the group membership, then delivers to each online member individually. For large groups, this fan-out is the primary scaling bottleneck.
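The routing decision, including the group fan-out, can be sketched as below. The service classes are simple stand-ins (in production: the Redis-backed Session Service, gateway RPC, a Cassandra offline store, and APNs/FCM); their names are illustrative, not from a real library:

```python
# Illustrative routing logic for steps 3-5 above, including group fan-out.

class Sessions:
    def __init__(self, online): self.online = online          # user -> gateway
    def lookup(self, user_id): return self.online.get(user_id)

class Gateways:
    def __init__(self): self.delivered = []
    def push(self, gateway_id, user_id, msg): self.delivered.append(user_id)

class OfflineStore:
    def __init__(self): self.stored = []
    def append(self, user_id, msg): self.stored.append(user_id)

class Push:
    def __init__(self): self.notified = []
    def notify(self, user_id, msg): self.notified.append(user_id)

def route_message(msg, sessions, gateways, offline_store, push):
    # Group messages fan out to every member; 1:1 messages have one recipient.
    recipients = msg.get("group_members") or [msg["to"]]
    for user_id in recipients:
        if user_id == msg["from"]:
            continue  # don't echo back to the sender
        gateway_id = sessions.lookup(user_id)        # step 3: Session Service
        if gateway_id is not None:
            gateways.push(gateway_id, user_id, msg)  # step 4: online delivery
        else:
            offline_store.append(user_id, msg)       # step 5: offline store
            push.notify(user_id, msg)                #         + push notification

sessions = Sessions({"bob": "gw-1"})  # bob is online, carol is offline
gateways, offline_store, push = Gateways(), OfflineStore(), Push()
route_message({"from": "alice", "group_members": ["alice", "bob", "carol"]},
              sessions, gateways, offline_store, push)
assert gateways.delivered == ["bob"]
assert offline_store.stored == ["carol"]
```

Note that the loop body runs once per member: this per-recipient work is exactly the fan-out cost that dominates in large groups.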

Detailed Component Design
Three components deserve a close look: the Gateway Server, the Message Router, and the Presence Service.

Data Model & Database Design
Use Cassandra for the message store, not PostgreSQL. At 20 TB/day, you need a database built for high write throughput and horizontal scaling. Cassandra delivers both.
This partition design means all messages in a conversation live on the same partition, sorted by time. Fetching the last 50 messages is a single-partition range scan - extremely fast in Cassandra.
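The history read from the API section maps onto this layout as cursor pagination over the clustering key. A sketch, modeling one conversation partition as a list sorted ascending by message_id:

```python
import bisect

def fetch_messages(partition, before=None, limit=50):
    """Fetch up to `limit` messages older than the `before` cursor.

    `partition` models one Cassandra partition: all messages for a single
    conversation_id, sorted ascending by message_id (time-ordered).
    Returns the page newest-first plus the cursor for the next page.
    """
    ids = [m["message_id"] for m in partition]
    end = bisect.bisect_left(ids, before) if before is not None else len(partition)
    page = list(reversed(partition[max(0, end - limit):end]))  # newest first
    next_cursor = page[-1]["message_id"] if page else None
    return page, next_cursor

convo = [{"message_id": i, "text": f"msg {i}"} for i in range(1, 201)]
page1, cursor = fetch_messages(convo, limit=50)
assert page1[0]["message_id"] == 200 and cursor == 151
page2, cursor = fetch_messages(convo, before=cursor, limit=50)
assert page2[0]["message_id"] == 150
```

In Cassandra itself this is a single-partition range query with a LIMIT, which is why the read stays fast regardless of conversation length.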
This powers the inbox view: "show me my conversations sorted by most recent activity." One partition scan per user.
Why Cassandra for messages but PostgreSQL for groups/users? Messages are a massive append-only write stream that partitions cleanly by conversation, which is exactly Cassandra's strength. Group membership and user profiles are comparatively small, relational, read-mostly data that benefit from transactions, joins, and flexible queries.

Deep Dives
Deep Dive 1: Message Ordering in Group Chats
Problem: Two users send messages to a group at the same time. Different members may see them in different orders, breaking conversation coherence.
Approach: Assign each message a globally ordered ID at the point of ingestion. Use a Snowflake-style ID: 41 bits for timestamp (ms precision) + 10 bits for machine ID + 12 bits for sequence number. This gives you ~4096 messages/ms per machine with guaranteed uniqueness and rough time ordering.
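The ID layout above can be sketched directly in code (the custom epoch value is an assumption; any fixed epoch works as long as all machines agree on it):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit IDs: 41 bits timestamp (ms) | 10 bits machine ID | 12 bits sequence."""

    EPOCH = 1_600_000_000_000  # assumed custom epoch, in ms

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                if self.seq == 0:
                    # 4096 IDs exhausted this millisecond; spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.seq

gen = SnowflakeGenerator(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b                   # IDs are roughly time-ordered
assert (a >> 12) & 0x3FF == 7  # machine ID is recoverable from the bits
```

Because the timestamp occupies the high bits, sorting IDs numerically approximates sorting messages by send time, which is what "rough time ordering" buys you.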
For strict ordering within a group, route all messages for a given group_id to the same Kafka partition (use group_id as the partition key). Kafka guarantees ordering within a partition. The tradeoff: a very active group becomes a hot partition. If one group produces 10K messages/sec, that single partition becomes a bottleneck.
Mitigation: most groups are small and low-traffic. For the rare mega-group (thousands of members, high activity), consider a dedicated Kafka topic with multiple partitions and a sub-ordering scheme within the group.
Deep Dive 2: Offline Message Delivery
Problem: User B is offline when user A sends a message. When B comes back online, they need to receive all missed messages, in order, without duplicates.
Approach: When the Message Router cannot find B in the Session Service, it writes the message to an Offline Message Store (a Cassandra table partitioned by recipient_id, clustered by message_id). It also triggers a push notification.
When B reconnects, the Gateway Server queries the Offline Store for all messages with message_id > B's last_seen_message_id. These are delivered over the WebSocket, and the Offline Store entries are tombstoned.
The subtle part: B might reconnect to a different Gateway Server than before. The new Gateway must know B's last_seen_message_id. Store this in the Session Service (Redis) alongside the gateway mapping. On disconnect, persist the last acknowledged message_id.
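A minimal sketch of the reconnect drain, with the offline store modeled as a dict keyed by recipient_id (in production: the Cassandra table, with tombstones or TTLs instead of list clearing):

```python
def drain_offline(user_id, last_seen_message_id, offline_store):
    """On reconnect, deliver all missed messages in order, without duplicates.

    `offline_store` models the Cassandra table: recipient_id -> messages
    clustered (sorted) by message_id.
    """
    pending = [m for m in offline_store.get(user_id, [])
               if m["message_id"] > last_seen_message_id]
    pending.sort(key=lambda m: m["message_id"])  # clustering order
    delivered = []
    for msg in pending:
        delivered.append(msg)                     # push over the WebSocket
        last_seen_message_id = msg["message_id"]  # each ack advances the cursor
    # Tombstone the drained entries (in Cassandra: DELETE or TTL).
    offline_store[user_id] = []
    return delivered, last_seen_message_id

store = {"user_b": [{"message_id": i} for i in (5, 3, 8)]}
msgs, new_cursor = drain_offline("user_b", last_seen_message_id=3,
                                 offline_store=store)
assert [m["message_id"] for m in msgs] == [5, 8]  # message 3 already seen
assert new_cursor == 8  # persisted back to the Session Service on disconnect
```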
Tradeoff: Cassandra tombstones accumulate and slow down reads over time. Run compaction regularly, and consider TTLing offline messages (e.g., 30 days) with a separate cold-storage archive for older undelivered messages.
Deep Dive 3: Scaling the WebSocket Layer
Problem: 100M concurrent connections across 200 servers (roughly 500K connections per server). How do you handle server failures, deployments, and rebalancing without dropping messages?
Approach: Graceful shutdown is essential. When a Gateway Server is being drained (for deployment or scaling down), it sends a "reconnect" signal to all connected clients. Clients reconnect to a different server via the load balancer. The new server registers the updated mapping in the Session Service.
During the reconnection window (a few seconds), messages for these users go to the Offline Store. Once the client reconnects and pulls from the Offline Store, there is no message loss.
For server crashes (ungraceful), the Session Service entries have a TTL (e.g., 90 seconds). When the TTL expires, the Router treats the user as offline. The client's built-in reconnection logic (exponential backoff) establishes a new connection to a healthy server.
Tradeoff: the TTL creates a window where messages might be routed to a dead server. Keep the TTL short (60-90 seconds) and have the Router fall back to the Offline Store on delivery failure.
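The fallback path can be sketched as follows; the class names are illustrative stand-ins for the real services, not from any library:

```python
def deliver_with_fallback(user_id, msg, sessions, gateways, offline_store, push):
    """Router delivery that tolerates a stale session entry in the TTL window."""
    gateway_id = sessions.lookup(user_id)
    if gateway_id is not None:
        try:
            gateways.push(gateway_id, user_id, msg)
            return "delivered"
        except ConnectionError:
            # The gateway is dead but its session entry hasn't expired yet:
            # evict the stale mapping and treat the user as offline
            # instead of dropping the message.
            sessions.evict(user_id)
    offline_store.append(user_id, msg)
    push.notify(user_id, msg)
    return "stored_offline"

class Sessions:
    def __init__(self, mapping): self.mapping = dict(mapping)
    def lookup(self, user_id): return self.mapping.get(user_id)
    def evict(self, user_id): self.mapping.pop(user_id, None)

class DeadGateways:
    def push(self, gateway_id, user_id, msg):
        raise ConnectionError("gateway is down")

class OfflineStore:
    def __init__(self): self.msgs = []
    def append(self, user_id, msg): self.msgs.append((user_id, msg))

class Push:
    def notify(self, user_id, msg): pass

sessions = Sessions({"bob": "gw-1"})  # stale entry: gw-1 already crashed
store = OfflineStore()
result = deliver_with_fallback("bob", {"text": "hi"}, sessions,
                               DeadGateways(), store, Push())
assert result == "stored_offline"
assert sessions.lookup("bob") is None  # stale mapping evicted
assert len(store.msgs) == 1            # message preserved, not lost
```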