~/articles/design-web-crawler
◆◆◆Advancedasked at Googleasked at Microsoftasked at Amazon

Design a Web Crawler (Googlebot)

Crawl billions of URLs at petabyte scale. URL frontier, politeness, deduplication, 4xx/dead-link handling, and the realities of indexing the web.

// DEPTH
the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Googlebot has indexed over 100 billion pages. To do that, it continuously fetches pages across the entire public web, extracts every link it finds, and follows those links — recursively, at global scale. The service behind this is a web crawler: an automated system that starts from a seed list of URLs, downloads the HTML at each one, pulls out new links, and adds them to a queue to be fetched next. Repeat indefinitely. It is, at its core, a breadth-first graph traversal over the largest graph humans have ever built.

The naive version is straightforward enough to write in an afternoon. The production version — the one powering Googlebot, Bingbot, or the Internet Archive's Wayback Machine — operates at 1 billion fetches per day, across thousands of worker machines, against a frontier of roughly a trillion known URLs. That gap between "works on my laptop" and "works at web scale" is exactly what makes this a classic system-design problem.

Two tensions drive most of the design decisions. First, politeness vs. throughput: to hit 12,000 requests per second you need massive parallelism, but every single website on the internet expects you not to hammer it. Doing both simultaneously requires careful per-host scheduling that gets harder as you add workers. Second, deduplication at impossible scale: you cannot store a trillion URLs in a hash set — that's 100 TB of RAM — so you need a different data structure entirely, and the tradeoffs of that structure shape the whole system. These two constraints ripple through every component.

Functional requirements

  • Given a seed list of URLs, fetch them.
  • Parse HTML, extract URLs, repeat.
  • Store fetched pages.
  • Honor robots.txt and Crawl-Delay.
  • Skip duplicates (URL and content).
  • Detect and skip traps (calendar pages, infinite parameter spaces).

Non-functional

  • 1B URLs/day at scale.
  • Be polite — don't DoS individual sites.
  • Distributed across thousands of workers.
  • Re-crawl popular pages frequently, long-tail rarely.

Capacity

DimensionEstimateHow we got there
URL frontier (planning horizon)~1T URLs~4B pages publicly indexed per worldwidewebsize.com; ~1T accounts for the full discoverable frontier
Average page size~100 KBHTTP Archive 2024 median HTML doc is ~32–33 KB; 100 KB accounts for large pages and response overhead
Daily fresh fetches1B/dayTarget crawl rate
Fetch throughput~12k req/s average1B ÷ 86,400 s ≈ 12k/s
Storage~100 TB/day raw HTML100 KB × 1B pages
Inbound bandwidth~1.2 GB/s download12k req/s × 100 KB/page; needs fat pipes and many crawler hosts

Takeaway: 12k req/s sustained, 1.2 GB/s of inbound bandwidth, and 100 TB/day of raw HTML — this is a problem for hundreds of crawler hosts with fat network pipes, not a single beefy server.

Building up to the design

A crawler is a great "the naive version runs, but is rude or wrong" question. Each piece in the final diagram earns its place as the answer to a specific failure mode — walk the path and the architecture is obvious.

V1: BFS from a single seed

queue = [seed]
seen = set()
while queue:
    url = queue.pop(0)
    if url in seen: continue
    seen.add(url)
    html = fetch(url)
    save(url, html)
    queue.extend(extract_urls(html))

This works. You can point it at a small website tonight and it'll index the whole thing by morning. The problem isn't correctness — it's manners. A FIFO queue naturally concentrates on whichever host dominates the link graph. Within minutes you're firing thousands of requests at a single server, getting your IP banned, and more importantly being a bad citizen. We fix that next.

V2: Per-host queues with politeness delays

Instead of one queue, maintain one queue per host. A worker picks a host, fetches one URL, waits for the Crawl-Delay directive from robots.txt (commonly 1–10 s; there is no RFC-standard default), then moves on. No single site gets more than one request per delay window.

That fixes the rudeness problem but introduces a bottleneck: with one process, even if you manage 1k active hosts each with a 1s delay, you're capped at 1k fetches/sec. The web has billions of hosts, and you need 12k/sec. You need parallelism.

V3: Many workers, sharded by host

Spread the per-host queues across many worker processes. Hash the hostname to assign a queue to a specific worker — so all wikipedia.org URLs go to worker 42, every time. Politeness is now purely local to each worker; no cross-worker coordination required, and DNS lookups and robots.txt entries are warm in that worker's cache. Theoretical ceiling: 1,000 workers × 1,000 hosts each × 1 fetch/sec = 1M fetches/sec.

What breaks now is memory. The seen set of URLs can't stay in a Python set. At 1T URLs × 100 bytes each, that's 100 TB of RAM — which is why the next step is the most interesting data-structure decision in the whole system.

V4: Bloom filter for URL dedup

A Bloom filter answers "have I seen this URL?" without storing the URLs themselves. The answer is either "definitely not seen" or "probably seen." At 1T URLs and ~9.6 bits per entry (the optimal for 1% false-positive rate), the whole filter takes about 1.2 TB of RAM — sharded across nodes, that's manageable. Each check is a handful of hash evaluations, no disk I/O, effectively O(1).

The "probably seen" case (a false positive) means we skip a URL we've actually never fetched. That's acceptable — a small fraction of new URLs are never discovered. What can't happen with a standard Bloom filter is a false negative: if we've seen a URL, the filter will always say so. That's the property that makes it safe for dedup.

For the rare "probably seen" result that needs confirmation, we fall back to a sharded persistent store (Redis cluster or RocksDB).

V5: Content-hash dedup + canonicalization

Even with URL dedup working, the same page is often reachable from five different URLs: canonical, mobile, with tracking parameters, www vs. non-www. We'd fetch and index the same content multiple times. The fix is to hash the fetched page body. If we've stored this hash before, we just record the URL → content_hash mapping and skip re-indexing. We also canonicalize URLs before queuing: strip tracking params, normalize trailing slashes, follow <link rel=canonical>.

V6: Priority + freshness

The New York Times homepage changes hourly. A random forum post from 2009 never changes. A naive crawler treats both identically, burning crawl budget on stale pages while missing breaking news. Score each URL by importance (PageRank-like inbound-link signal) and observed change rate, and let a priority queue feed the per-host back-queues so the most important and most stale URLs always go first.

V7: Robots, traps, DNS, production hardening

The finishing pieces: fetch each host's robots.txt once per day and gate every URL against it. Detect trap pages (calendar generators, infinite parameter spaces) using per-host depth limits and URL-pattern quotas. Cache DNS aggressively so each unique hostname costs only one resolver query rather than one per fetched URL. Add backpressure so a slow indexer pauses the frontier rather than causing drops.

That's the full design — V3 through V7 combined.

flowchart LR
    V1[V1: BFS<br/>impolite, single host] --> V2[V2: per-host queues<br/>polite, slow]
    V2 --> V3[V3: + sharded workers<br/>1M/sec]
    V3 --> V4[V4: + Bloom filter<br/>1T URLs feasible]
    V4 --> V5[V5: + content dedup<br/>no double work]
    V5 --> V6[V6: + priority + freshness<br/>matters first]
    V6 --> V7[V7: + robots + DNS + traps<br/>production Googlebot]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

High-level architecture

flowchart TD
    SEEDS[Seed URLs] --> FRONT[URL Frontier]

    subgraph Crawl Workers
    FRONT --> FETCH[Fetcher]
    FETCH --> PARSE[Parser]
    PARSE --> EXTRACT[URL Extractor]
    end

    FETCH --> PSTORE[(Page Store / S3)]
    PARSE --> CSTORE[(Content Store)]
    EXTRACT --> DEDUP[URL Dedup]
    DEDUP --> FRONT

    PARSE --> INDEXER[Indexer]
    INDEXER --> SEARCH[(Search Index)]

    ROBOTS[Robots Cache] --> FETCH
    DNS[DNS Cache] --> FETCH

    style FRONT fill:#ff6b1a,color:#0a0a0f
    style FETCH fill:#15803d,color:#fff
    style DEDUP fill:#a855f7,color:#fff

The URL frontier

The URL frontier is the heart of the crawler — the data structure that decides what gets fetched next and when. Get it wrong and you'll either hammer individual hosts or starve your workers.

A naïve FIFO queue has an obvious failure mode: Stack Overflow posts thousands of self-referencing links, so a FIFO queue spends all its time there while every other host waits. The canonical solution is the Mercator-style two-tier design:

flowchart TD
    URL[New URL discovered] --> PRIORITY[Priority Queue<br/>by importance score]
    PRIORITY --> Q1["Queue 1<br/>'all URLs from host A'"]
    PRIORITY --> Q2["Queue 2<br/>'all URLs from host B'"]
    PRIORITY --> QN[Queue N]
    Q1 --> SCHED[Politeness Scheduler]
    Q2 --> SCHED
    QN --> SCHED
    SCHED -->|"one URL per host<br/>respecting Crawl-Delay"| WORKER[Worker]
    style PRIORITY fill:#ff6b1a,color:#0a0a0f
    style SCHED fill:#15803d,color:#fff

Front queues order URLs by importance score — a PageRank-like signal that puts the NYT homepage ahead of a rarely-linked corner page. Back queues are per-host, and only the scheduler touches them. The scheduler's job is simple: pick the next URL such that no host is contacted faster than its Crawl-Delay. It serializes access to each back-queue, so even if a thousand workers are running, no host ever sees parallel requests.

Politeness

Per host, a production crawler:

  • Respects robots.txt — fetched and cached per origin (TTL 24h).
  • Respects the Crawl-Delay directive when set (there is no RFC-standard default; common convention is 1–10 s; Google ignores this directive and manages its own crawl rate through other signals).
  • Avoids multiple parallel connections to the same host simultaneously.
  • Honors 429 Too Many Requests and the Retry-After value in the response.
  • Sends a recognizable User-Agent string (Googlebot, Bingbot) so site owners can identify and reach you.

The key insight is that politeness enforcement belongs at the scheduler layer, not at the worker level. Workers can't be trusted to know what other workers are doing; only the scheduler has the full per-host picture.

Deduplication: have we seen this URL before?

URL-level dedup

The numbers tell the story. For 1T URLs, an exact set requires ~100 TB of RAM — clearly impossible. A Bloom filter is the right structure:

  • 1T URLs × ~9.6 bits/URL (1% false positive rate, optimal formula: −ln(0.01)/ln(2)² ≈ 9.585 bits) = 9.6 trillion bits ÷ 8 ≈ 1.2 TB.
  • A membership check is O(k) hash evaluations, effectively O(1) — no disk I/O.
  • A false positive (says "seen" when it isn't) means we skip that URL. We miss a small fraction of new pages. Acceptable.
  • A false negative (says "new" when it was already seen) cannot happen with a standard Bloom filter. That's the property that makes it safe: we might under-crawl, but we never over-crawl.

Why not a hash set or Cassandra? A hash set at 1T entries runs to ~100 TB of RAM. A Cassandra point-lookup adds 1–5 ms of network latency; at 12k req/s that's fine as a confirmation step, but not as the fast-path check. The practical pattern combines both: Bloom filter as a fast pre-filter (answers in RAM), Cassandra or RocksDB only for the small fraction of "probably seen" cases that need confirmation.

In a distributed crawler, the Bloom filter is sharded by hash(url) % N; each worker owns its shard and checks locally with no network hop.

Content-level dedup

Many URLs serve the same content: mirrors, query-string variants, www vs. non-www, mobile vs. desktop. Computing an exact content hash (MD5, SHA-1) catches byte-identical copies, but most duplicate content differs slightly — a different ad, a different footer timestamp, a different header. For that you need SimHash, a locality-sensitive hash where similar documents produce fingerprints with small Hamming distance. A threshold of ≤3 bits different is a common cutoff. If a near-duplicate already exists in the fingerprint store, record the URL → existing content_hash mapping and skip re-indexing.

Fetching

A worker pulls a URL, connects via HTTP (with keep-alive and HTTP/2 multiplexing where the server supports it), follows redirects up to a configured limit, and respects timeouts at both connection and total-response granularity.

A few specifics matter at scale:

  • DNS caching — every unique hostname requires a DNS resolution. Without a local cache, the volume of unique hostnames per second can quickly exhaust or anger your resolvers. Cache DNS results for each host's published TTL so that fetching thousands of pages from the same host costs only one lookup.
  • Conditional fetches — send If-Modified-Since / If-None-Match on re-crawl requests so unchanged pages return a lightweight 304 Not Modified with no body; this is the mechanism the Re-crawl scheduling section exploits to keep high-frequency checks cheap.
  • Headless browser for JS-heavy sites — JavaScript-rendered content is invisible to a basic HTTP fetcher. Google does headless rendering for a subset of pages, but it's 10–50× more expensive than a plain HTTP fetch, so it's applied selectively.

Parsing & URL extraction

def parse(html, base_url):
    dom = parse_html(html)
    text = extract_text(dom)
    links = []
    for a in dom.select('a'):
        href = a.get('href')
        absolute = urljoin(base_url, href)
        if same_scheme_host(absolute) or follow_external():
            links.append(absolute)
    return text, links

Three things go wrong often enough to plan for:

  • Malformed HTML — the real web is not well-formed XML. Use a forgiving parser (libxml2, BeautifulSoup) that handles unclosed tags and attribute soup.
  • Infinite link generators?page=1, ?page=2, ... ?page=99999. Enforce per-host depth limits and URL-pattern budgets ("we've already seen 10k URLs matching ?page=* from this host; stop").
  • JavaScript-only links — without rendering, you miss most modern single-page apps. This is a conscious trade-off between coverage and cost.

Re-crawl scheduling

The web changes unevenly. Some pages change every minute (live sports scores), some change yearly (about pages). A crawler that re-fetches everything on a uniform schedule wastes crawl budget on static pages and misses breaking news. The solution is to treat re-crawl scheduling as a prediction problem.

freshness_score = f(content_change_rate, importance, last_crawled)

Pages with a high freshness_score go back in the priority queue sooner. Three signals feed the score, each pulling in a different direction.

Change rate is measured as a rolling window of "was content different on this visit vs. last?" A page that changed on 8 of the last 10 visits gets scheduled every hour; one that changed once in 20 visits waits a month. Importance — a PageRank-equivalent score derived from inbound links and traffic signals — provides a correction for pages that change slowly but matter a lot. A missed update on the NYT homepage is more costly than a missed update on an obscure blog, so important pages receive extra crawl budget regardless of their change rate. Finally, conditional fetching makes the whole scheme affordable: sending If-Modified-Since / If-None-Match headers means a 304 Not Modified response costs just a DNS lookup, TCP handshake, and a round-trip of headers — no page body transferred. This makes re-crawl cheap enough to check far more frequently than naive bandwidth math would suggest.

Crawl budget awareness: Google allocates a per-site crawl budget based on site authority and server health. A slow site gets fewer visits; a fast, authoritative site gets more. If a site returns 5xx errors, the budget shrinks automatically — the crawler backs off rather than hammering a broken server.

Storage

DataStore
Raw HTMLObject storage (compressed)
Parsed textCompressed columnar store (Parquet on S3, Bigtable)
URL frontier (active)Distributed queue (Kafka or custom)
URL bloom filterSharded (Redis cluster or RocksDB)
Robots.txt cachePer-host KV store
IndexInverted index for search (Lucene-derived)

A petabyte-scale crawler stores raw HTML for ~30 days so it can be reprocessed if a parser bug is discovered. Long-term, only parsed text and the index survive.

Distributed coordination

How do you split work across 10,000 worker hosts?

Option 1: shard by host hash. Each worker owns a slice of hostnames. All wikipedia.org URLs go to worker 42, every time. Politeness enforcement is purely local — no cross-worker coordination needed. DNS lookups and robots.txt cache hits are also local, so the common case involves zero network hops outside of the actual fetch. The trade-off is hot shards: if reddit.com is 5% of the frontier, the worker assigned to it runs 5× busier than average. Mitigate with consistent hashing and virtual nodes, or by capping per-host parallelism and overflowing to a shared pool.

Option 2: central task queue (Kafka or SQS-style). Workers pull from a shared queue. Load balancing is trivial — no hot-shard problem. But politeness state becomes shared: two workers could independently pick up reddit.com URLs and fire simultaneous requests. You'd need a distributed lock or a "host lease" table to ensure only one worker holds the token for a given host at a time.

Real systems use a hybrid: shard-by-host for the common case, with a centralized priority feed (a Kafka topic of high-priority URLs that any worker can steal) for breaking bottlenecks. Worker assignment is tracked by a coordinator (ZooKeeper-style or a custom Raft service), so if a worker dies its host shards are reassigned.

flowchart LR
    COORD[Coordinator<br/>Raft / ZooKeeper] -->|"assign host shards"| W1[Worker 1<br/>hosts A-M]
    COORD -->|"assign host shards"| W2[Worker 2<br/>hosts N-Z]
    HPFEED[High-priority<br/>Kafka feed] -->|"any worker steals"| W1
    HPFEED -->|"any worker steals"| W2
    W1 -->|"fetch + dedup"| STORE[(Object Store)]
    W2 -->|"fetch + dedup"| STORE
    style COORD fill:#ff6b1a,color:#0a0a0f
    style HPFEED fill:#ffaa00,color:#0a0a0f
    style STORE fill:#0e7490,color:#fff

Failure modes

Worker dies mid-fetch

The URL goes back to the frontier via a visibility timeout — the same pattern SQS uses. Another worker picks it up and retries. No URL is lost as long as the frontier itself is durable.

Site is on fire / 5xx

Exponential backoff per host. After N consecutive failures, mark the host as "paused for 1 hour" and stop dequeueing its URLs. This prevents the crawler from turning a struggling site's outage into a sustained DDoS.

Trap pages (bot honeypots)

Some sites generate infinite unique URLs to detect and ban bots: calendar pages that link to next-month-next-month-next-month forever, or pages with ?color=red&size=M&sort=price&page=1 in every permutation. Detect by:

  • Per-host depth limits.
  • URL pattern budgets ("we've already queued 1M URLs from this host today; pause it").
  • Content similarity — if 100k URLs return near-identical content, blacklist the URL pattern.

A 404 Not Found is a transient signal — the page may return. Log it, increment a failure counter, and retry after an exponential backoff. After N consecutive 404s, remove the URL from the re-crawl queue and send a deletion signal to the indexer so the page is dropped from search results. A 410 Gone is permanent: remove immediately without retry. Both are fundamentally different from 5xx errors, which indicate a server-side problem, not a missing resource — never apply the same backoff logic to client-error and server-error status codes.

Storage / index falls behind

Backpressure — slow down the frontier dequeue when the downstream indexer's queue is filling. It's better to pause the crawler for a few minutes than to drop fetched pages or overwhelm a parsing service.

Politeness, ethics, and law

A real crawler:

  • Respects robots.txt rigorously. Ignoring it is the fastest way to get blocked and potentially sued.
  • Identifies itself with a User-Agent string and a contact email address.
  • Backs off when it sees high response times, 5xx errors, or 429 Too Many Requests.
  • Avoids logged-in pages and paywalled content unless the site has explicitly authorized access.
  • Handles deletion requests — some jurisdictions (EU GDPR, CCPA) require honoring requests to remove personal data from crawled content.

Things to discuss in an interview

  • URL frontier: priority + politeness + per-host queues.
  • Bloom filter for URL dedup at trillion-URL scale.
  • Bandwidth & politeness budget — show you understand DDoS risk.
  • Re-crawl scheduling as a prediction problem.
  • Failure handling — retries, backoff, traps.

Things you should now be able to answer

  • Why does a naive FIFO frontier violate politeness?
  • Why is a Bloom filter the right structure for "URL seen?" at this scale?
  • How do you schedule re-crawls of changing pages?
  • What's the difference between URL dedup and content dedup?
  • How does a crawler avoid being trapped by infinite-URL generators?

Further reading

  • "Mercator: A Scalable, Extensible Web Crawler" (Heydon & Najork, 1999)
  • Google's web crawl & index architecture — various papers (e.g., "Web Search for a Planet")
  • Common Crawl — open dataset of billions of pages
// FAQ

Frequently asked questions

Why is a Bloom filter used for URL deduplication instead of a hash set or database?

At 1 trillion URLs, an exact hash set requires roughly 100 TB of RAM — clearly infeasible. A Bloom filter stores the same trillion URLs in about 1.2 TB of RAM using 9.6 bits per URL at a 1% false-positive rate, and each membership check is O(1) with no disk I/O. False positives (skipping a URL never seen) are acceptable; false negatives are impossible with a standard Bloom filter, making it safe for deduplication.

What is a two-tier URL frontier and why does a web crawler need it?

The Mercator-style two-tier frontier uses front queues ordered by importance score (a PageRank-like signal) feeding into per-host back queues. A politeness scheduler drains the back queues one URL per host per Crawl-Delay window, ensuring no single site receives parallel requests regardless of how many workers are running. A naive FIFO queue fails because link-heavy hosts like Stack Overflow crowd out every other site.

What is the difference between URL deduplication and content deduplication in a web crawler?

URL deduplication checks whether a specific URL has been fetched before, using a Bloom filter as a fast in-RAM pre-filter. Content deduplication checks whether the fetched page body is a duplicate of already-indexed content, using an exact content hash for byte-identical copies and SimHash (with a threshold of 3 or fewer differing bits) for near-duplicates like mobile variants, mirrors, or pages with different ad footers.

When should a web crawler use headless browser rendering instead of a plain HTTP fetch?

JavaScript rendering should be applied selectively to JS-heavy pages only, because it is 10 to 50 times more expensive than a plain HTTP fetch. Google uses this approach for a subset of pages; the default path is a plain HTTP fetch, and full rendering is reserved for cases where JavaScript-only content would otherwise be invisible to the crawler.

How does a production web crawler handle a site that is returning 5xx errors without DDoSing it further?

The crawler applies exponential backoff per host and, after N consecutive failures, marks the host as paused for one hour and stops dequeueing its URLs. This prevents a crawler from turning a struggling site's outage into a sustained DDoS attack. Worker crashes are handled separately via a visibility-timeout queue so that in-flight URLs are re-queued automatically without being lost.

// RELATED

You may also like