A web crawler is deceptively simple - fetch pages, extract links, repeat - but at scale it becomes one of the hardest distributed systems problems. The real challenges are politeness (not DDoSing websites), deduplication (the web is full of duplicate content), and managing a frontier of billions of URLs without losing progress or wasting resources.
A web crawler fetches pages, extracts links, and repeats - but doing this at internet scale is where it gets interesting. The core tension is between crawl speed and politeness: you want to discover billions of pages quickly without hammering any single server.
Target: 100 million unique URLs crawled per day.
Key takeaway: the bottleneck is network bandwidth and politeness constraints, not compute. Most of the time your crawler is waiting on DNS or HTTP responses.
A web crawler is mostly an internal system, but it needs management APIs for operations. Keep them simple.
- `/api/v1/crawler/seeds`
- `/api/v1/crawler/status`
- `/api/v1/crawler/control`
- `/api/v1/crawler/pages/{url_hash}`

Note: the crawler itself doesn't serve external traffic. These APIs are for your ops team to monitor and control the crawl. Authentication via internal service mesh is sufficient - no need for API keys.

High-Level Architecture
The architecture follows a producer-consumer pattern with the URL Frontier as the central coordination point.
1. Seed URLs enter the URL Frontier, which is a priority queue backed by Redis sorted sets. Each URL gets a priority score based on domain authority, freshness, and crawl depth.
2. Crawler Workers (the consumers) pull URLs from the frontier in priority order. Before fetching, each worker checks the politeness enforcer - a per-domain rate limiter that ensures no domain gets hit more than once per second.
3. The worker resolves DNS (hitting a local DNS cache first - DNS resolution is surprisingly expensive at scale), then fetches the page via HTTP.
4. The raw HTML goes to the Content Extractor, which pulls out text, metadata, and outgoing links. Raw HTML is written to blob storage (S3). Structured metadata goes to the metadata database (PostgreSQL).
5. Extracted URLs pass through the URL Deduplicator (Bloom filter) before being added back to the frontier. This prevents re-crawling pages we've already seen.
6. A separate robots.txt service caches and serves robots.txt rules. Workers check it before every fetch. Cache TTL is 24 hours per domain.
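Step 1's priority score can be sketched as a weighted combination of the three signals the frontier uses. The weights below are illustrative assumptions, not tuned values; the convention is that a lower score means higher priority, so the frontier's pop-min operation returns the most important URL first.

```python
def priority_score(domain_authority: float, hours_since_crawl: float, depth: int) -> float:
    # Lower score = higher priority (matches popping the minimum from the frontier).
    authority = -10.0 * domain_authority             # well-known domains float up
    staleness = -min(hours_since_crawl, 168) / 24.0  # staler pages float up, capped at a week
    depth_penalty = 2.0 * depth                      # deep links sink
    return authority + staleness + depth_penalty
```

With this shape, a high-authority domain outranks a low-authority one at equal depth, and a page not seen for days outranks one just crawled.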

Detailed Component Design
Three components deserve deep attention: the URL Frontier, the Downloader, and the Politeness Enforcer.
The frontier is the heart of the crawler. Use Redis sorted sets - store each URL with a score such that a lower score means higher priority, and ZPOPMIN pops the highest-priority URL in O(log N).
Why Redis over Kafka here? Kafka is great for append-only streams, but a crawler frontier needs priority ordering and deduplication. Redis sorted sets give you both. Shard the frontier across 10-20 Redis instances using consistent hashing on the URL's domain. This naturally groups same-domain URLs on the same shard, which simplifies politeness enforcement.
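The domain-to-shard mapping can be sketched as below. This uses simple hash-mod rather than a true consistent-hashing ring (which would limit key movement when shards are added or removed); `NUM_SHARDS` is an assumed value within the 10-20 range the text suggests.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 16  # assumption; the text suggests 10-20 Redis instances

def frontier_shard(url: str, num_shards: int = NUM_SHARDS) -> int:
    # Hash the domain, not the full URL, so every URL for a domain lands on
    # the same shard - politeness state then never needs cross-shard reads.
    domain = urlparse(url).netloc.lower()
    digest = hashlib.sha1(domain.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```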
The Bloom filter for dedup sits in front of the frontier. At 1 billion URLs with a 0.01% false positive rate, it's ~2.4GB - fits in RAM. Use RedisBloom or a standalone in-memory filter. A false positive means the filter claims we've seen a URL we haven't, so we occasionally skip a valid URL - acceptable. Bloom filters never produce false negatives, so a URL already in the set is always caught, and we never waste resources re-crawling known duplicates.
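The ~2.4GB figure falls out of the standard Bloom filter sizing formulas, which are worth being able to derive:

```python
import math

def bloom_params(n: int, p: float) -> tuple[float, float]:
    """Optimal Bloom filter size (bits) and hash count for n items at FP rate p."""
    m = -n * math.log(p) / (math.log(2) ** 2)  # total bits
    k = (m / n) * math.log(2)                  # optimal number of hash functions
    return m, k

m, k = bloom_params(1_000_000_000, 0.0001)  # 1B URLs, 0.01% false positives
gigabytes = m / 8 / 1e9                      # ~2.4 GB, using ~13 hash functions
```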
Use async I/O (aiohttp in Python, or better yet, a Go worker pool with goroutines). The 100M-per-day target averages ~1,200 fetches/sec; provisioning ~10K/sec of capacity covers retries, re-crawls, and traffic peaks. A single worker machine with 1,000 concurrent connections can sustain ~500 fetches/sec, limited mainly by network latency, so ~20 worker machines covers 10K URLs/sec.
Every worker maintains a local DNS cache. Without it, you'd be doing thousands of DNS lookups per second, and public DNS resolvers will rate-limit you. Cache TTL of 5 minutes is a good balance between freshness and performance.
Timeout configuration matters: 10 seconds for connection, 30 seconds for full response. Anything slower than that is likely a broken server, and you're wasting a connection slot.
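The concurrency and timeout discipline above can be sketched with stdlib asyncio. The `fetch` function here is a stub standing in for a real HTTP client (aiohttp's `session.get` in production); the semaphore caps concurrent connections per machine and `wait_for` enforces the total-response deadline.

```python
import asyncio

CONNECT_TIMEOUT = 10  # seconds to establish a connection
TOTAL_TIMEOUT = 30    # seconds for the complete response

async def fetch(url: str) -> str:
    # Stub standing in for an HTTP GET; swap in aiohttp in production.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

async def crawl_batch(urls, concurrency: int = 1000):
    sem = asyncio.Semaphore(concurrency)  # caps concurrent connections per machine

    async def bounded(url):
        async with sem:
            # A server slower than TOTAL_TIMEOUT raises, freeing the
            # connection slot instead of holding it indefinitely.
            return await asyncio.wait_for(fetch(url), timeout=TOTAL_TIMEOUT)

    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(crawl_batch([f"https://example.com/{i}" for i in range(5)]))
```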
This is the component interviewers care about most. Implement it as a token bucket per domain: each domain gets 1 token per second. When a worker wants to fetch from example.com, it consumes a token. No token available? The URL goes back to the frontier with a delayed score.
Store token buckets in the same Redis shard as the domain's URLs. This avoids cross-shard coordination for politeness checks. The enforcer also caches robots.txt rules per domain (24h TTL) and rejects disallowed paths before they even reach the downloader.
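A minimal in-process version of the per-domain token bucket looks like this (in production the state lives in Redis alongside the domain's URLs, as described above):

```python
import time

class DomainTokenBucket:
    # Refills at `rate` tokens/sec up to `capacity`; defaults give 1 req/sec/domain.
    def __init__(self, rate: float = 1.0, capacity: float = 1.0):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_consume(self) -> bool:
        now = time.monotonic()
        # Lazily refill based on elapsed time since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller re-queues the URL with a delayed priority score
```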

Data Model & Database Design
Use two storage systems: PostgreSQL for structured metadata, S3 (or equivalent blob storage) for raw HTML. Do not try to put HTML content in a relational database - it'll destroy your query performance.
Why not MongoDB for HTML? S3 is roughly 10x cheaper for write-heavy blob workloads, handles concurrent writes without locking, and you never query HTML by anything other than its key.
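A sketch of the metadata side of this split - SQLite stands in for PostgreSQL here, and the column names are illustrative assumptions, not taken from the source. Note that the raw HTML is represented only by its blob-storage key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url_hash     TEXT PRIMARY KEY,  -- also the lookup key for the pages API
        url          TEXT NOT NULL,
        domain       TEXT NOT NULL,
        status_code  INTEGER,
        content_hash TEXT,              -- enables content-level dedup checks
        s3_key       TEXT,              -- raw HTML lives in blob storage, not here
        fetched_at   TEXT
    )
""")
conn.execute("CREATE INDEX idx_pages_domain ON pages (domain)")
```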

Deep Dives
Deep Dive 1: URL Deduplication at Scale
Problem: The web is full of duplicate URLs. Same page behind http/https, www/no-www, trailing slashes, query parameter ordering, session IDs in URLs. Naive string comparison misses all of these.
Approach: Three-layer dedup. Layer 1: URL normalization - canonicalize scheme and host, strip trailing slashes, sort query parameters, drop session IDs. Layer 2: the Bloom filter check on the normalized URL. Layer 3: content fingerprinting - hash the fetched page's extracted text to catch distinct URLs serving identical content.
Tradeoff: Layer 3 requires actually fetching the page, so it doesn't save bandwidth - but it prevents storing and indexing duplicate content. Skip it if storage cost isn't a concern.
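The URL-normalization layer can be sketched with stdlib `urllib.parse`. The session-parameter denylist is an illustrative assumption; real crawlers learn these patterns per site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}  # illustrative list

def normalize_url(url: str) -> str:
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    path = path.rstrip("/") or "/"
    # Sort params and drop session IDs so ordering and tracking noise
    # don't make the same page look like many different URLs.
    params = sorted((k, v) for k, v in parse_qsl(query) if k.lower() not in SESSION_PARAMS)
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))
```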
Deep Dive 2: Politeness Without Killing Throughput
Problem: Politeness at 1 req/sec/domain means a single popular domain takes forever to crawl. But you have millions of domains, so overall throughput can still be high - if you interleave correctly.
Approach: Domain-sharded frontier. Group URLs by domain in the frontier. Each worker pulls from multiple domains in round-robin, never hitting the same domain twice in a row. The politeness delay becomes invisible because while you're waiting on domain A's cooldown, you're fetching from domains B through Z.
The failure mode to watch for: "domain concentration" - when your frontier has 10M URLs from one domain and 100 URLs each from 1,000 others. The single domain becomes a bottleneck. Solution: cap per-domain frontier size (e.g., 50K URLs max). Excess URLs spill to disk and get re-queued later.
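The interleaving-plus-cap idea can be sketched as a generator over per-domain queues. The disk-spill step is elided; this just shows the ordering guarantee and the cap.

```python
from collections import deque

PER_DOMAIN_CAP = 50_000  # excess URLs would spill to disk and re-queue later

def interleave_by_domain(frontier_by_domain: dict[str, list[str]]):
    # Round-robin across domains: no domain is fetched twice in a row
    # as long as more than one domain still has URLs pending.
    queues = {d: deque(urls[:PER_DOMAIN_CAP]) for d, urls in frontier_by_domain.items()}
    while queues:
        for domain in list(queues):
            yield queues[domain].popleft()
            if not queues[domain]:
                del queues[domain]
```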
Deep Dive 3: Recovering from Crawler Failures
Problem: A worker dies mid-crawl holding 500 URLs. Those URLs are marked 'in_progress' in the frontier. Nobody else will pick them up.
Approach: Lease-based ownership. When a worker pulls a URL, it gets a lease (timestamp + TTL, typically 5 minutes). If the worker doesn't report completion before the lease expires, a background sweeper resets the URL to 'queued'. Workers send heartbeats to extend leases for slow-but-alive fetches.
This is the same pattern DynamoDB uses for its lock client, and it's battle-tested. The key insight: treat "in_progress" as a temporary state with automatic expiry, not a permanent flag.
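A minimal in-memory model of the lease lifecycle (checkout, heartbeat, complete, sweep); in the real system this state lives in the frontier store, and `now` parameters exist here only to make the logic testable:

```python
import time

LEASE_TTL = 300  # 5 minutes, per the text

class LeasedFrontier:
    def __init__(self):
        self.queued = []   # URLs ready to hand out
        self.leases = {}   # url -> lease expiry (epoch seconds)

    def checkout(self, now: float = None) -> str:
        now = time.time() if now is None else now
        url = self.queued.pop(0)
        self.leases[url] = now + LEASE_TTL  # 'in_progress' with automatic expiry
        return url

    def heartbeat(self, url: str, now: float = None) -> None:
        now = time.time() if now is None else now
        self.leases[url] = now + LEASE_TTL  # slow-but-alive worker extends its lease

    def complete(self, url: str) -> None:
        self.leases.pop(url, None)

    def sweep(self, now: float = None) -> list[str]:
        # Background sweeper: anything past its expiry belonged to a dead worker.
        now = time.time() if now is None else now
        expired = [u for u, exp in self.leases.items() if exp < now]
        for u in expired:
            del self.leases[u]
            self.queued.append(u)  # reset to 'queued' for someone else to pick up
        return expired
```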
Key Trade-offs: Redis over Kafka for the frontier buys priority ordering and dedup at the cost of Kafka's durability guarantees; the Bloom filter trades rare false positives (skipped valid URLs) for a ~2.4GB in-RAM seen-set; per-domain politeness caps single-domain throughput, recovered by interleaving across millions of domains; lease TTLs trade recovery speed (shorter is faster) against the risk of double-crawling slow fetches (longer is safer).