Design a Distributed File System (GFS / HDFS)
Store petabyte files across thousands of commodity machines for high-throughput batch reads. The single-master + chunkservers design, replication, and append-heavy workloads.
The problem
Google's 2003 paper on the Google File System describes a cluster of 1,000 commodity machines where hardware failures were not edge cases — they were the expected daily operating condition. The engineers needed to store petabytes of data in files tens of gigabytes to terabytes large, and stream through a 100 GB file at hundreds of MB/s from thousands of machines simultaneously. Nothing in the standard Unix or NFS toolkit came close. GFS, and its open-source sibling HDFS (Hadoop Distributed File System), were the answer — and they remain the foundational blueprint for every large-scale batch storage system built since.
At its core, a distributed file system (DFS) does two things: it splits enormous files into fixed-size chunks and spreads those chunks across a fleet of commodity storage nodes, and it maintains a central map of where every chunk lives so clients can find and read data without a broadcast search. The actual file bytes never pass through a central coordinator — once a client knows which node holds the chunk it needs, it streams data directly at full network speed.
The engineering tension that makes this interesting is the collision between two opposing forces. You want a single consistent view of the filesystem — every client sees the same files, the same directory structure, the same data. But you also want the metadata catalog to remain fast and available even as individual nodes crash, and you want writes to be cheap even when you have three replicas to update. GFS resolves this by relaxing consistency: it guarantees that concurrent appends land atomically at some offset, but does not guarantee that all replicas are byte-for-byte identical at every instant. That trade-off — defined but not necessarily consistent regions — is what lets the system sustain hundreds of gigabytes per second of aggregate throughput without distributed transactions.
The second tension is the single-master problem. A central metadata node is simple and fast, but it is also a potential single point of failure and a throughput bottleneck. GFS solves the throughput half by keeping the master entirely off the data path — clients consult it once per file open, then stream data directly between themselves and chunkservers. HDFS HA (with ZooKeeper-based failover) solves the availability half. Understanding how those two design choices reinforce each other is the heart of any DFS interview question.
Functional requirements
- Store files of arbitrary size (GBs to TBs) in a hierarchical namespace.
create,delete,open,read,write,append— POSIX-like but not POSIX-complete.- Support concurrent reads from many clients and concurrent appends from many producers (the multi-producer log pattern).
- Files are mostly immutable after the first write pass.
Non-functional requirements
- Throughput over latency — optimized for streaming reads, not random I/O.
- Fault tolerance — commodity hardware fails routinely; the system must self-heal without operator intervention.
- Durability — data must survive multiple simultaneous failures.
- Scalability — petabytes of data, thousands of nodes.
- Single-digit MB/s per disk; aggregate throughput across all disks/nodes in the thousands of MB/s.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Cluster nodes | 10,000 DataNodes | Baseline cluster size |
| Raw storage | 200 PB | 10,000 × 20 TB per node |
| User data (after replication) | ~67 PB | 200 PB ÷ 3 (replication factor 3) |
| Total chunks | ~1.05 billion | 67 PB ÷ 64 MB per chunk |
| Chunks per DataNode | ~105,000 | 1.05 B ÷ 10,000 nodes |
| Metadata per chunk | ~64 B | chunk handle (8 B) + file-to-chunk entry (≈ 16 B) + 3 replica locations (24 B) + version/flags (8 B) ≈ 56 B, rounded to 64 B |
| Master RAM (chunk map) | ~67 GB | 1.05 B × 64 B; fits in a single server with 128 GB RAM |
| Heartbeat rate | ~3,333 heartbeats/s | 10,000 nodes × (1 ÷ 3 s) — easily handled; each is a tiny UDP/TCP message |
| Aggregate read throughput | 500 GB/s | 10,000 concurrent MapReduce tasks × 50 MB/s each; 10,000 nodes × 50 MB/s = 500 GB/s ✓ |
The capacity math confirms two important properties: the master's entire metadata fits comfortably in RAM (fast lookups, no disk I/O on the hot path), and read throughput scales linearly with nodes because clients stream directly from chunkservers.
Building up to the design
V1: One NFS mount on a big machine
All clients mount the same NAS/NFS export. Works fine up to a few TB and a handful of clients. The appeal is real: zero custom code, POSIX semantics, and you can be up and running in an afternoon.
The problem is that a single storage server is the throughput ceiling. At 500 MB/s disk bandwidth, 1,000 MapReduce tasks will queue on that one machine. And the box is a single point of failure — its unavailability means the whole cluster stalls.
V2: Shard files across N servers (Lustre-style)
Stripe file blocks across multiple storage servers. A metadata server (MDS) holds the namespace; object storage servers (OSS) hold the data. Now clients read from multiple disks in parallel — throughput scales with how many storage servers you add.
But chunk sizes are small here (typical block sizes are 4–16 KB). For huge files, the metadata explosion is enormous: each chunk requires a metadata lookup, and the MDS becomes the new bottleneck. POSIX consistency under concurrent appends is also expensive to maintain at that granularity.
V3: Giant chunks + relax consistency (GFS)
Make chunks huge — 64 MB. This reduces metadata by roughly 16,000× compared to 4 KB blocks. For a 10,000-node cluster with 67 PB of usable data, that means about 1 billion chunk records in master RAM — roughly 67 GB — rather than petabytes of unaddressable metadata. Relax the consistency model: instead of strict POSIX, allow "defined but not consistent" regions and at-least-once record appends. Batch consumers can handle duplicates or padding. The master stays comfortably in RAM, read paths become simple streaming operations, and concurrent writers can append without distributed locks.
The remaining gap: the master is still a single process. If it crashes, the cluster is unavailable until it restarts. At exabyte scale, even 64 MB chunks eventually produce more metadata than one machine can hold.
V4: Master HA + federation (HDFS NameNode HA)
Add a standby master that replays the same operation log as the active master. ZooKeeper handles leader election — if the active master fails, the standby is fenced and promoted in seconds. For true scale-out, split the namespace across multiple independent masters (HDFS federation), each owning a subtree.
flowchart LR
V1["V1: Single NFS<br/>~10 GB, 1 client"] --> V2["V2: Striped MDS+OSS<br/>TB scale, slow metadata"]
V2 --> V3["V3: GFS — 64 MB chunks<br/>PB scale, relaxed consistency"]
V3 --> V4["V4: HA master + federation<br/>EB scale, no single SPOF"]
style V1 fill:#0e7490,color:#fff
style V3 fill:#ff6b1a,color:#0a0a0f
style V4 fill:#a855f7,color:#fff
Everything from here dives into V3–V4: the design you'd actually discuss in a senior interview.
High-level architecture
flowchart TD
subgraph Control Plane
MASTER[Metadata Master<br/>namespace tree<br/>chunk-to-location map<br/>chunk version table]
STANDBY[Standby Master<br/>replays op log]
JNS[(Journal / Edit Log<br/>shared NFS or QJM)]
ZK[ZooKeeper<br/>leader election]
MASTER <-->|"write-ahead log"| JNS
STANDBY <-->|"tail log"| JNS
ZK -.fencing.-> STANDBY
end
subgraph Data Plane
CS1[Chunkserver / DataNode 1<br/>Rack A]
CS2[Chunkserver / DataNode 2<br/>Rack A]
CS3[Chunkserver / DataNode 3<br/>Rack B]
end
CL[Client] -->|"1 metadata RPC"| MASTER
MASTER -->|"2 chunk locations"| CL
CL -->|"3 data stream"| CS1
CS1 -->|"4 pipeline"| CS2
CS2 -->|"5 pipeline"| CS3
MASTER -.heartbeat + blockreport.-> CS1
MASTER -.heartbeat + blockreport.-> CS2
MASTER -.heartbeat + blockreport.-> CS3
style MASTER fill:#ff6b1a,color:#0a0a0f
style STANDBY fill:#ffaa00,color:#0a0a0f
style JNS fill:#0e7490,color:#fff
style ZK fill:#a855f7,color:#fff
style CS1 fill:#15803d,color:#fff
style CS2 fill:#15803d,color:#fff
style CS3 fill:#0e7490,color:#fff
style CL fill:#ff2e88,color:#fff
The key insight: the master is on the control plane, not the data plane. After the client learns chunk locations from the master, all further communication is directly between the client and chunkservers. The master never touches file bytes.
The metadata master: what it holds in memory
The master maintains three critical data structures, all in RAM:
- Namespace tree — the file and directory hierarchy; a prefix-compressed in-memory tree, protected by reader-writer locks per namespace region.
- File → chunk handle list — mapping from each file to its ordered list of 64 MB chunks (chunk handles are 64-bit opaque IDs).
- Chunk handle → replica locations — mapping from each chunk handle to the set of chunkservers that currently hold a valid replica.
The replica locations are not persisted to disk by the master. On startup or after a crash, the master polls every chunkserver for its block report, which reconstructs this map in a few seconds. Only the namespace tree and file-to-chunk mapping need to survive restarts, and they do so via the operation log.
The operation log is the source of truth for all metadata mutations. The master flushes it to durable storage (local disk, replicated across a quorum of journal nodes in HDFS's Quorum Journal Manager) before acknowledging any client operation. Periodically, the master compacts the log into a checkpoint (a serialized in-memory snapshot) so log replay on restart is bounded.
What happens when the master restarts?
This is worth tracing once, because it reveals why not persisting replica locations is actually fine. The master loads its last checkpoint, replays the operation log tail since that checkpoint to recover the namespace and file-to-chunk map, then asks every chunkserver to report what it holds. Within a few seconds the chunk-location map is fully reconstructed — the chunkservers are the authoritative source.
sequenceDiagram
participant M as Master (restarting)
participant D as Disk (local)
participant CS as Chunkservers (all)
M->>D: Load FSImage checkpoint
D-->>M: namespace + file-to-chunk map (as of last checkpoint)
M->>D: Replay op log tail since checkpoint
D-->>M: recent mutations applied
Note over M: Namespace fully recovered.<br/>Chunk locations unknown.
M->>CS: RequestBlockReports (broadcast)
CS-->>M: block reports (chunk handle, version, size)
Note over M: Chunk-location map rebuilt.<br/>Master accepts client traffic.
Why 64 MB chunks?
| Chunk size | Metadata entries (67 PB usable) | Master RAM | Connect overhead per chunk |
|---|---|---|---|
| 4 KB (ext4 block) | 1.68 × 10^13 | ~1.07 PB | High (many RPCs) |
| 4 MB | 1.68 × 10^10 | ~1.07 TB | Medium |
| 64 MB | 1.05 × 10^9 | ~67 GB | Low (streaming inside chunk) |
| 512 MB | 1.31 × 10^8 | ~8.4 GB | Very low (but wastes space on small files) |
64–128 MB is the sweet spot. The metadata fits comfortably in one machine's RAM. Clients amortize TCP setup cost across tens of seconds of streaming. Chunkservers can dedicate a persistent TCP connection per client for the chunk's lifetime, eliminating per-block setup. The only real downside is internal fragmentation on small files, which GFS mitigates by storing multiple small files in the same chunk, and HDFS handles with a separate small-file path.
Read path
sequenceDiagram
participant C as Client
participant M as Master
participant CS as Chunkserver (closest replica)
C->>M: OpenFile("hdfs://path/to/file")
M-->>C: file length, chunk handles [0..N], version numbers
Note over C: Client caches this for ~60 s
C->>M: GetChunkLocations(handle=42)
M-->>C: [ChunkserverA, ChunkserverB, ChunkserverC]
Note over C: Client picks closest replica<br/>(rack-aware network topology)
C->>CS: ReadChunk(handle=42, offset=0, length=64MB)
CS-->>C: data stream (no master involvement)
Note over C: Client checks checksum<br/>per 64 KB sub-chunk<br/>on-the-fly
The master is involved only in the first step — thereafter the client streams data at full network speed. On a 10 Gbps NIC, a single chunkserver can sustain ~1 GB/s of reads. Clients cache chunk locations for roughly 60 seconds; stale locations are resolved by re-querying the master.
Checksum verification: each chunkserver stores a checksum (e.g., CRC-32C) per 64 KB sub-block. On read, the chunkserver verifies the checksum and, on mismatch, returns an error so the client can retry from another replica. The master is notified; if all replicas are corrupt, the data is lost (rare in practice with 3× replication).
Write / append path
There are two write types: random overwrites (rare) and record appends (the dominant pattern for log-structured workloads).
Lease and primary election
Before any mutation, the master grants a lease to one of the chunk's replicas, designating it the primary. The lease is typically 60 seconds and is renewable. Only one primary exists at a time, eliminating split-brain during writes.
sequenceDiagram
participant C as Client
participant M as Master
participant P as Primary Chunkserver
participant S1 as Secondary 1
participant S2 as Secondary 2
C->>M: GrantLease(chunk=42)
M-->>C: Primary=ChunkserverA, Secondaries=[B, C], version=7
Note over C,S2: Phase 1: Push data to all replicas (pipelined)
C->>P: PushData(chunk=42, data=<bytes>)
P->>S1: PipelineData(data=<bytes>)
S1->>S2: PipelineData(data=<bytes>)
S2-->>S1: ack
S1-->>P: ack
P-->>C: ack (data in all replica buffers)
Note over C,S2: Phase 2: Client sends write command to Primary
C->>P: WriteChunk(chunk=42, offset=X, len=Y)
P->>P: assign sequence number, apply mutation
P->>S1: ApplyMutation(seq=1, offset=X)
P->>S2: ApplyMutation(seq=1, offset=X)
S1-->>P: ack
S2-->>P: ack
P-->>C: success (write complete)
The two-phase design (push data first, then send write command) decouples network topology from mutation ordering. Data flows along a pipeline (client → primary → secondary1 → secondary2) to maximize bandwidth utilization — each chunkserver forwards data to the next while writing to disk. The write command then flows only to the primary, which serializes mutation order.
Record append
Record append is the "append to distributed log" primitive. The client specifies data; the primary picks an offset (atomically, at the current end of the chunk) and applies the mutation. If any secondary fails to acknowledge, the primary asks the client to retry — this means the same record may appear more than once at different offsets, or the failed secondary may have a gap filled with padding. Consumers must be designed to handle duplicates.
This is the at-least-once consistency model. It is intentional: it keeps the implementation simple and avoids distributed transactions, which are expensive. Applications that need exactly-once semantics do their own deduplication (e.g., checking a sequence ID).
Consistency model
GFS/HDFS do not guarantee POSIX consistency. The model is:
| Operation | Outcome |
|---|---|
| Successful serial write | Defined — all replicas identical, consistent with the write |
| Successful concurrent writes | Consistent but undefined — all replicas identical, but the bytes may be an interleaving of the writers' data |
| Successful record append | Defined (at the primary) — but replicas may have padding between appends |
| Failed write / partial replica failure | Inconsistent — replicas may differ; client retries lead to duplicates |
"Defined" means all clients that read that region will see the same data. "Consistent but undefined" means they agree on bytes but those bytes may not match any single writer's intent. Applications deal with this by using record append (which gives a defined offset) rather than overlapping random writes.
Replication and rack-aware placement
Default replication factor is 3. The placement policy:
- First replica: on the same node as the writer (or a lightly loaded node if the writer has no storage).
- Second replica: on a different rack from the first.
- Third replica: on a different node in the same rack as the second.
This gives: 2 copies on the remote rack (one of which survives even if the whole local rack fails) and 1 local copy for fast reads. A full rack switch failure — which takes out all nodes in the rack — still leaves two copies alive.
flowchart LR
subgraph Rack A
N1["Node 1<br/>(writer, replica 1)"]
N2["Node 2"]
end
subgraph Rack B
N3["Node 3<br/>(replica 2)"]
N4["Node 4<br/>(replica 3)"]
end
N1 -->|"cross-rack"| N3
N3 -->|"intra-rack"| N4
style N1 fill:#ff6b1a,color:#0a0a0f
style N2 fill:#15803d,color:#fff
style N3 fill:#15803d,color:#fff
style N4 fill:#0e7490,color:#fff
The cross-rack hop ensures that no single rack switch failure loses more than one copy. Both replica 2 and replica 3 live on Rack B, so a Rack A failure still leaves two copies alive.
Heartbeats and re-replication
Every chunkserver sends a heartbeat to the master every 3 seconds (HDFS default dfs.heartbeat.interval). If heartbeats stop, the NameNode declares the DataNode dead after approximately 10.5 minutes at default settings — the formula is 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, using defaults of 300 s and 3 s respectively. Once declared dead, the master immediately identifies every chunk that was on that server. Chunks that now have fewer than 3 live replicas are queued for re-replication: the master instructs a chunkserver that holds a valid replica to copy it to a healthy chunkserver elsewhere, respecting the rack-awareness policy.
Re-replication is rate-limited to avoid consuming all network bandwidth during a large failure event (e.g., an entire rack going dark). Priority is given to chunks with the fewest remaining replicas — a chunk sitting at replication factor 1 jumps ahead of one at factor 2.
Chunkservers also send periodic block reports listing all chunks they hold. The master uses these to detect stale replicas: a chunkserver that was partitioned and missed writes may hold chunks at an old version number. On reconnection, its block report is compared to the master's version table; stale chunks are garbage-collected.
How re-replication actually flows
When a chunkserver disappears, here is what the master does, step by step:
flowchart TD
HB[Chunkserver heartbeats stop] --> MARK[Master marks node dead<br/>after N missed heartbeats]
MARK --> SCAN[Scan chunk map for affected chunks]
SCAN --> PRIO{Replicas remaining?}
PRIO -->|"1 replica — critical"| HIGH[Highest priority queue]
PRIO -->|"2 replicas — degraded"| MED[Normal priority queue]
HIGH --> REPL[Master instructs healthy chunkserver<br/>to copy from surviving replica]
MED --> REPL
REPL --> RACK[Respects rack-aware placement policy]
RACK --> DONE[Replication factor restored]
style MARK fill:#ff6b1a,color:#0a0a0f
style HIGH fill:#ff2e88,color:#fff
style DONE fill:#15803d,color:#fff
Master HA: operation log, checkpoints, and standby
A single master that crashes means the cluster is unavailable until it restarts. Three mechanisms protect against this:
Operation log: every metadata mutation (create file, allocate chunk, increment version) is written to the op log and flushed to durable storage before the operation is acknowledged. In HDFS's Quorum Journal Manager (QJM), this log is written to a quorum (typically 3) of journal nodes; the edit is accepted once a majority acknowledges.
Checkpoint (FSImage): periodically, the master serializes its entire in-memory namespace to a checkpoint file. On restart, it loads the checkpoint and replays only the op log tail since the checkpoint. This bounds restart time regardless of how long the cluster has been running.
Standby NameNode (HDFS HA): a hot standby continuously tails the journal and applies edits to its own in-memory state. DataNodes send block reports to both the active and standby. On failover, ZooKeeper detects the active NameNode failure (after the ZooKeeper session expires — ha.zookeeper.session-timeout.ms, which defaulted to 5 s in older Hadoop releases but was raised to 10 s in Hadoop 2.8.5 / 3.1.1+ via HADOOP-15449, and is often tuned higher still in production), fences the old active (via STONITH or SSH kill), and promotes the standby. Detection and promotion typically completes within several seconds at the default timeout, but total failover time depends heavily on fencing latency and configuration.
The original GFS paper describes a simpler "shadow master" that lags the primary slightly and provides read-only access during failover. HDFS HA's ZooKeeper-based scheme provides fully automated warm failover with configurable fencing.
Storage choices
| Component | What it stores | Technology choice | Why |
|---|---|---|---|
| Master namespace + chunk map | In-memory data structures | JVM heap (HDFS), C++ process (GFS) | Nanosecond access; full state fits in RAM |
| Operation log | Sequential append of metadata mutations | Local SSD + QJM quorum | Durability; low write latency; crash-safe |
| Checkpoint (FSImage) | Snapshot of master state | Local disk + HDFS itself | Fast restart; backed by the system it manages |
| Chunk data | Raw 64 MB binary files | Local disks on chunkservers | Direct I/O; no filesystem overhead |
| Chunk checksums | 32-bit CRC per 64 KB sub-block | Stored alongside chunk on chunkserver | Detect silent corruption without master involvement |
Failure modes
Chunkserver failure
A chunkserver crashes or becomes unreachable. With HDFS defaults (dfs.namenode.heartbeat.recheck-interval = 5 min, dfs.heartbeat.interval = 3 s), a DataNode is declared dead after about 10.5 minutes (2 × 300 s + 10 × 3 s). Once declared dead, the master re-replicates all affected chunks from surviving replicas to other healthy chunkservers. If only 1 replica remains, that chunk is in critical state and gets highest re-replication priority.
The risk to acknowledge: if three chunkservers holding the same chunk fail simultaneously before re-replication completes, data is permanently lost. At 3× replication, the probability is low but non-zero in large clusters — some operators run 4× or 5× for critical data.
Master failure
Response (HDFS HA): ZooKeeper session expires (the ha.zookeeper.session-timeout.ms parameter — 5 s in older releases, raised to 10 s in Hadoop 2.8.5/3.1.1+ and often tuned higher in production), ZooKeeper elects the standby, standby fences the old active, standby promotes itself. In-flight client operations fail with an exception; the client retries and finds the new active.
Writes that were acknowledged by the master but not yet replicated to chunkservers may need to be re-done if the chunk's version rolled back. The primary chunkserver holding the lease will reject stale writes when it contacts the new master.
Stale replica detection
Each chunk has a version number stored by the master. When a lease is granted, the master increments the version and tells all healthy replicas. A chunkserver that was partitioned will have the old version number. On reconnection, its block report will show the old version; the master marks those chunks stale and re-replicates from up-to-date replicas.
Hotspot on a chunk
A single very popular chunk (e.g., a shared configuration file read by every MapReduce task at startup) can overwhelm the chunkserver holding it. The mitigations layer nicely: the master can selectively increase the replication factor for the hot chunk, clients can read from a random replica rather than always the same one, and at the application layer you can cache the data in memory or route through a caching tier.
Scaling the metadata: HDFS federation
At exabyte scale (billions of files), the namespace alone may exceed a single NameNode's RAM. HDFS federation partitions the namespace into namespace volumes, each owned by an independent NameNode. DataNodes register with all NameNodes and serve blocks from any namespace.
flowchart TD
NN1[NameNode 1<br/>/user /tmp /logs]
NN2[NameNode 2<br/>/data /warehouse /archive]
NN3[NameNode 3<br/>/ml /checkpoints /scratch]
subgraph "DataNodes — each registers with ALL NameNodes"
DN1[DataNode A]
DN2[DataNode B]
DN3[DataNode C]
end
DN1 -.block report.-> NN1
DN1 -.block report.-> NN2
DN1 -.block report.-> NN3
DN2 -.block report.-> NN1
DN2 -.block report.-> NN2
DN2 -.block report.-> NN3
DN3 -.block report.-> NN1
DN3 -.block report.-> NN2
DN3 -.block report.-> NN3
style NN1 fill:#ff6b1a,color:#0a0a0f
style NN2 fill:#ff6b1a,color:#0a0a0f
style NN3 fill:#ff6b1a,color:#0a0a0f
style DN1 fill:#15803d,color:#fff
style DN2 fill:#15803d,color:#fff
style DN3 fill:#0e7490,color:#fff
Clients use a ViewFileSystem (a client-side mount table) to route paths to the correct NameNode, transparent to applications.
DFS vs object storage
A distributed file system is often compared to object storage (e.g., Amazon S3). They solve overlapping but distinct problems.
| Dimension | DFS (GFS / HDFS) | Object storage (S3-style) |
|---|---|---|
| Access model | Streaming append + sequential read | PUT / GET whole objects |
| Latency | 1–10 ms (co-located compute) | 10–100 ms (remote API) |
| Throughput | Very high (direct disk-to-disk) | Depends on parallelism |
| Consistency | Relaxed (defined / at-least-once) | Eventual → now often strong per-object |
| Namespace | Hierarchical (directory tree) | Flat (bucket + key prefix) |
| Append | First-class (record append) | Not supported; requires re-PUT |
| Cost | High capex (owned hardware) | Low opex (managed, pay-per-GB) |
| Best for | Co-located MapReduce / Spark batch | Data lake cold storage, archival |
Modern data stacks often use both: HDFS for hot working sets (Spark shuffle, training data cache) and S3 for archival and cross-team sharing. See Design an Object Storage System for the S3-style architecture.
Things to discuss in an interview
Chunk size and the metadata budget. Jumping from 4 KB blocks to 64 MB chunks cuts the number of metadata entries by roughly 16,000×, letting the entire chunk map fit in one server's RAM. Larger chunks also mean clients amortize TCP connection overhead across tens of seconds of streaming instead of paying it per block.
Master off the data path. The client contacts the master once per file open to retrieve chunk locations, then streams bytes directly from chunkservers. This is the reason a single master can serve a cluster of 10,000 nodes — it never touches file data.
- Consistency trade-off of at-least-once appends: Consumers must handle duplicates or padding gaps at failed-secondary offsets. The pay-off is avoiding distributed transactions, which are expensive and complex to implement correctly.
- Rack-aware placement and correlated failures: With the default 3-replica policy (1 replica local, 2 on the remote rack), losing the entire local rack leaves two copies; losing the remote rack leaves one. Either way the cluster has surviving replicas and can re-replicate.
- ZooKeeper's role in HDFS HA: Leader election and fencing — it prevents split-brain where two NameNodes simultaneously believe they are active and issue conflicting mutations.
- Chunk-location recovery after a master crash: The master never persists replica locations. On restart it broadcasts a block-report request; DataNodes respond within seconds and the map is rebuilt entirely from their replies.
- DFS vs object storage: Choose a DFS for co-located compute, append-heavy workloads, and 1–10 ms streaming latency. Choose object storage for managed archival, pay-per-GB cost, and cross-region access where you do not control the hardware.
Things you should now be able to answer
- Why does increasing chunk size from 4 KB to 64 MB reduce master memory by ~16,000×?
- Why is the master never in the data path for reads?
- What happens to re-replication priority when a chunk drops to 1 replica?
- What is the GFS consistency guarantee for concurrent record appends?
- How does HDFS NameNode HA prevent split-brain between active and standby?
- What is a chunk version number and how does it detect stale replicas?
- How does HDFS federation scale the namespace beyond a single NameNode's RAM?
Further reading
- Ghemawat, Gobioff, Leung. "The Google File System." SOSP 2003. — the canonical paper; dense, accurate, worth reading in full.
- Shvachko et al. "The Hadoop Distributed File System." MSST 2010. — HDFS design, differences from GFS.
- HDFS Architecture Guide — hadoop.apache.org — current HDFS HA and federation documentation.
- Design an Object Storage System — S3-style architecture; contrast with this article.
- Consistent Hashing — used in some DFS implementations for distributing blocks across nodes.
- Database Replication — replication concepts (sync vs async, durability) that apply equally to chunk replication.
Frequently asked questions
▸Why does GFS use 64 MB chunks instead of the 4 KB blocks common in local filesystems?
At 64 MB per chunk, a 67 PB usable cluster (10,000 nodes × 20 TB, 3× replication) produces roughly 1.05 billion metadata entries requiring about 67 GB of master RAM — something that fits in a single 128 GB server. At 4 KB blocks, the same dataset generates about 16.75 trillion entries needing over 1 PB of RAM, which is not addressable by any single machine. Larger chunks also let clients amortize TCP connection setup across tens of seconds of streaming rather than paying that cost per block.
▸What consistency guarantee does GFS provide for concurrent record appends?
GFS guarantees at-least-once, defined appends: each successful record append lands atomically at some offset chosen by the primary, and all clients reading that offset will see the same bytes. However, if a secondary fails to acknowledge, the primary instructs the client to retry, so the same record may appear at multiple offsets and failed secondaries may have padding gaps. Consumers must handle duplicates, typically by checking a sequence ID.
▸How does rack-aware replica placement protect against correlated hardware failures?
With the default 3-replica policy, the first copy goes on the writer's local rack, the second goes on a different rack, and the third goes on a different node within that second rack. This means a complete failure of the local rack still leaves two copies alive on the remote rack, and a complete failure of the remote rack still leaves one copy on the local rack — in either case the cluster has surviving replicas and can re-replicate to restore full fault tolerance.
▸How does the master recover its chunk-to-location map after a crash, if it never persists that map to disk?
It does not persist replica locations at all. On restart, the master loads its last FSImage checkpoint, replays the operation log tail since that checkpoint to recover the namespace and file-to-chunk mapping, then broadcasts a block report request to every chunkserver. Within seconds the chunkservers respond with the chunks they hold, and the master rebuilds the location map entirely from those reports — making chunkservers the authoritative source of replica location truth.
▸When should you choose a distributed file system like HDFS over object storage like S3?
Choose HDFS when compute runs co-located with storage, latency must be 1-10 ms rather than 10-100 ms, and the workload is append-heavy streaming — for example, Spark shuffle or MapReduce batch jobs reading 100 GB files at hundreds of MB/s. Choose object storage when you need managed, pay-per-GB archival, cross-region access, or you do not control the hardware; object storage does not support first-class record appends and requires a full re-PUT to update an object.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.