Design an Object Storage Service (S3)
Store arbitrary blobs with HTTP GET/PUT at exabyte scale and 11 nines of durability. Metadata vs data separation, erasure coding, and self-healing.
The problem
Amazon S3 stores over 350 trillion objects as of 2024, and every week that number grows by tens of billions more. The basic pitch sounds simple: give it a file, get back a URL; call it again, get the file. But under that interface lies one of the most demanding durability problems in distributed systems — keep arbitrary blobs alive, byte-for-byte, through years of continuous disk failures, silent corruption, and hardware generation cycles, at a cost-per-gigabyte that is commercially viable.
Object storage is the substrate that the rest of cloud infrastructure sits on. Database backups, model checkpoints, log archives, video segments, static website assets — everything that needs to outlive the machine that created it ends up as a blob in a bucket. The API products it models most directly are Amazon S3, Google Cloud Storage, and Azure Blob Storage, each of which stores exabytes and serves millions of requests per second. Understanding how they work is understanding how cloud-native architectures store their most valuable data.
The design has two hard constraints pulling in opposite directions. First, durability: commodity hard drives fail at roughly 1–2% per year. With 200,000 drives at 2 EB, that's ~8 drive deaths per day, every day, indefinitely. The system must survive these failures without losing a single byte — and the SLO for that is eleven nines (99.999999999%), meaning fewer than one object lost per 100 billion stored per year. Second, scale economics: achieving that durability via three full copies of every object costs 3× in storage. At exabyte scale, the difference between 1.4× and 3× storage overhead is measured in hundreds of millions of dollars annually, which is why the choice between replication and erasure coding is not academic.
Layered on top of those two constraints is the metadata problem: tracking the physical location of trillions of objects, serving that lookup consistently at 500k GET/sec, and keeping the namespace strongly consistent so a client that just wrote a file can immediately read it back — all without a single metadata bottleneck. That is the system the interview question "design S3" is really asking you to build.
Functional requirements
PUT /bucket/key— upload an object (arbitrary bytes).GET /bucket/key— download an object; range GET (Range: bytes=0-1048575) for partial reads.DELETE /bucket/key— mark deleted (GC async).LIST /bucket?prefix=logs/2026/— list objects matching a prefix.- Multipart upload: initiate → PUT each part → complete. Required for objects > ~100 MB.
- Presigned URLs: generate a time-limited, HMAC-signed URL that authorizes a single GET or PUT without requiring API credentials.
- Bucket-level policies: IAM, ACLs, versioning toggle, lifecycle rules, encryption at rest.
Non-functional requirements
- Durability: 99.999999999% (11 nines) per-object per year.
- Availability: 99.99% for GET; slightly lower for PUT is acceptable.
- Consistency: strong read-after-write for new PUTs and for overwrites/deletes (this was not always the case — see the consistency section).
- Latency: time-to-first-byte for GET < 100 ms p50 at single-digit MB/s; throughput scales with parallelism.
- Scale: exabytes of data, trillions of objects. Single objects from 1 byte to ~53.7 TB (10,000 parts × 5 GiB/part = 48.8 TiB ≈ 53.7 TB decimal; S3 raised the practical limit beyond 5 TB via multipart upload in 2022).
- Security: TLS in transit, server-side encryption at rest, access logs, IAM integration.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Total objects | 10 trillion | Scale anchor |
| Average object size | 200 KB | Heavy bimodal — most objects tiny; a minority of large video/model files dominates byte count |
| Total data stored | ~2 EB | 10T × 200 KB |
| Metadata per object | ~200 bytes | Key, bucket, size, etag, checksum, location pointers, timestamps, version |
| Total metadata | ~2 PB | 10T × 200 B |
| Metadata nodes | ~20k nodes at ~100 GB/node | 2 PB ÷ 20k nodes — manageable per node |
| Write throughput | 50k PUT/sec peak | Scale anchor |
| Read throughput | 500k GET/sec peak | 10:1 read:write ratio — reads dominate |
| Drive count | ~200k drives | 2 EB ÷ 10 TB/drive |
| Drive failures (AFR ~1.5%) | ~3,000/year ≈ 8/day | 200k × 1.5%; Backblaze 2024 fleet AFR ~1.57% (individual models: 0.5%–5%+) |
| 3× replication overhead | 3× storage cost | Achieves 11 nines with fast enough recovery, but expensive at exabyte scale |
| EC storage overhead (RS 10+4) | 1.4× | 14 shards ÷ 10 data shards; tolerates any 4 simultaneous shard failures; reconstruct from any 10 survivors |
| EC shard placement | 5+5+4 across 3 AZs | Losing the AZ with 4 shards is within RS(10,4) tolerance; losing an AZ with 5 shards exceeds it — see erasure coding section |
Takeaway: At 2 EB raw data and 500k GET/sec, erasure coding's 1.4× overhead versus replication's 3× is worth hundreds of millions of dollars at scale — but adds reconstruction cost on degraded reads and CPU overhead on writes.
Object model and API
The object model is deliberately simple:
- Bucket: a flat namespace container with an owner, region, and policy.
- Key: a UTF-8 string up to ~1 KB. Keys are not a real filesystem hierarchy — slashes in keys are just characters. The "folder" experience in a UI is a client-side illusion produced by grouping objects with a common prefix and using delimiter-aware LIST.
- Object: an immutable blob identified by
(bucket, key). Overwriting means writing a new object that replaces the mapping in the metadata store; the old chunks are GC'd. - Version: when versioning is enabled, overwrite preserves the old object under a version ID; delete creates a delete marker.
# Upload
PUT /my-bucket/logs/2026/05/26/app.log.gz HTTP/1.1
Host: storage.example.com
Content-Length: 42943
Content-Type: application/gzip
x-content-sha256: 8a3c...
<binary body>
→ 200 OK
ETag: "d41d8cd98f00b204e9800998ecf8427e"
# Range download (e.g., seek to byte 1 MB)
GET /my-bucket/video/clip.mp4 HTTP/1.1
Range: bytes=1048576-2097151
→ 206 Partial Content
Content-Range: bytes 1048576-2097151/104857600
Building up to the design
V1: One server, one disk
Start with the simplest thing that could work: a single HTTP server writes incoming bytes to a local directory and tracks them in a SQLite table of (bucket, key, filepath, size, etag). PUT writes to disk, GET reads from disk, zero operational complexity.
The problem is obvious — one disk has an annual failure rate around 1%, so you will lose data. One server can't handle more than a few hundred MB/s. Nothing is redundant.
V2: Replicate to N disks across N machines
Store three copies of every object on three different machines, ideally different racks. The metadata table now stores three location rows per object. Now you survive single-node failures: with AFR around 1% and three independent copies, the chance all three fail within the same recovery window becomes negligible.
But 3× storage cost is brutal at exabyte scale. You still have a single metadata database — that's a single point of failure. And LIST queries over a flat namespace start slowing down as the table grows into the billions.
V3: Separate the metadata and data planes, shard both
This is the architectural leap that makes everything else possible. Split the system into two independent fleets:
- Metadata service: sharded by
hash(bucket, key). Each shard is a replicated KV store. Strongly consistent. Stores location pointers, not bytes. - Storage nodes: raw disk capacity. Store chunks (numbered opaque files). They don't know about buckets or keys — they just store and serve blobs by chunk ID.
flowchart LR
subgraph "Metadata Plane"
MS1[(Metadata Shard 1)]
MS2[(Metadata Shard 2)]
MS3[(Metadata Shard 3)]
end
subgraph "Data Plane"
SN1[Storage Node A]
SN2[Storage Node B]
SN3[Storage Node C]
SN4[Storage Node D]
end
API[API Layer] -->|"lookup: bucket+key → shard locations"| MS1
API -->|"fetch chunk by ID"| SN1
API -->|"fetch chunk by ID"| SN2
MS1 <-.->|"Raft replication"| MS2
MS2 <-.->|"Raft replication"| MS3
style API fill:#ff6b1a,color:#fff
style MS1 fill:#0e7490,color:#fff
style MS2 fill:#0e7490,color:#fff
style MS3 fill:#0e7490,color:#fff
style SN1 fill:#ffaa00,color:#0a0a0f
style SN2 fill:#ffaa00,color:#0a0a0f
style SN3 fill:#ffaa00,color:#0a0a0f
style SN4 fill:#ffaa00,color:#0a0a0f
Each plane now scales independently. Metadata can live on SSDs optimized for small random reads; storage nodes use cheap high-density HDDs. But 3× replication is still expensive — and you've introduced a new problem: a customer with a huge bucket can saturate a single metadata shard.
V4: Erasure coding
Replace 3× replication with Reed-Solomon erasure coding. The API layer chunks large objects, encodes each chunk into (k data shards + m parity shards), and writes each shard to a distinct storage node. For a 10+4 scheme: split into 10 data shards, compute 4 parity shards, write all 14 to 14 different nodes across at least 3 AZs.
Storage overhead drops from 3× to 14/10 = 1.4×. You can survive any 4 simultaneous shard failures. The wrinkle: for small objects (< 1 MB), erasure coding overhead outweighs the benefit — so keep 3× replication for tiny objects and use EC for large ones. And when a shard is unavailable, serving a GET now requires reading 10 surviving shards and decoding on the fly, which adds latency and CPU load.
V5: Production system
Layer on top: background self-healing, checksum-based scrubbing for bit-rot, presigned URLs for CDN offload, lifecycle tiering, multipart upload for large objects, and a garbage collection worker for deleted or overwritten chunks.
flowchart LR
V1["V1: 1 server, 1 disk\nno durability"] --> V2["V2: 3× replication\n3× cost, metadata SPOF"]
V2 --> V3["V3: Separated planes\nsharded metadata"]
V3 --> V4["V4: Erasure coding\n1.4× overhead, survive 4 shard failures"]
V4 --> V5["V5: Self-healing + GC\n+ tiering + multipart\n+ CDN offload"]
style V1 fill:#0e7490,color:#fff
style V2 fill:#15803d,color:#fff
style V4 fill:#ff6b1a,color:#0a0a0f
style V5 fill:#a855f7,color:#fff
Full architecture
flowchart TD
CLIENT[Client / SDK] --> GW[API Gateway\nAuth + TLS termination]
GW --> ROUTER[Request Router]
ROUTER -->|"PUT"| WPATH[Write Handler]
ROUTER -->|"GET"| RPATH[Read Handler]
ROUTER -->|"LIST"| LPATH[List Handler]
WPATH --> CHUNKER[Chunker +\nErasure Encoder]
CHUNKER --> PLACER[Placement Service\nchooses storage nodes]
PLACER --> SN1[Storage Node\nAZ-1]
PLACER --> SN2[Storage Node\nAZ-2]
PLACER --> SN3[Storage Node\nAZ-3]
CHUNKER --> MS[(Metadata Service\nsharded by hash bucket+key)]
RPATH --> MS
LPATH --> MS
MS -->|"shard locations"| FETCHER[Shard Fetcher\n+ Decoder]
FETCHER --> SN1 & SN2 & SN3
GC[GC Worker] --> MS
GC --> SN1 & SN2 & SN3
HEAL[Self-Healer] --> MS
HEAL --> SN1 & SN2 & SN3
SCRUB[Scrubber] --> SN1 & SN2 & SN3
style GW fill:#ff6b1a,color:#fff
style MS fill:#0e7490,color:#fff
style PLACER fill:#15803d,color:#fff
style CHUNKER fill:#ffaa00,color:#0a0a0f
style HEAL fill:#a855f7,color:#fff
style SCRUB fill:#ff2e88,color:#fff
Write path (PUT)
sequenceDiagram
participant C as Client
participant API as API Layer
participant MS as Metadata Service
participant PL as Placement Service
participant SN as Storage Nodes
C->>API: PUT /bucket/key with bytes
API->>API: Verify auth, compute checksum
API->>PL: Request N node placements across failure domains
PL-->>API: 14 placements across AZ1, AZ2, AZ3
API->>API: Split into chunks, erasure-code each chunk 10+4
API->>SN: Write 14 shards in parallel
SN-->>API: 14× ack with shard checksums
API->>MS: Write metadata record — key, size, etag, shard_locations
MS-->>API: Committed (durable)
API->>C: 200 OK, ETag
The metadata write is the commit point. If the API crashes between writing shards and committing metadata, those orphaned shards are cleaned up by the GC worker scanning for unreferenced chunk IDs. This is why metadata must be strongly consistent and durable before the success response is sent — a partial write where the API returned success but metadata wasn't committed would look like a lost object to the client.
Small objects (below a configurable threshold, often 1 MB) skip erasure coding and use 3× replication instead. EC overhead simply isn't worth it for tiny blobs.
Large objects (above ~100 MB) use multipart upload: the client splits into parts, PUTs each part independently with its own shard fan-out, then sends a CompleteMultipartUpload call that stitches part metadata into a single logical object record. No bytes are copied across storage nodes during assembly — only metadata is updated.
Read path (GET)
sequenceDiagram
participant C as Client
participant API as API Layer
participant MS as Metadata Service
participant SN as Storage Nodes
C->>API: GET /bucket/key
API->>MS: Lookup (bucket, key) → shard_locations[], size, checksum
MS-->>API: Metadata record
API->>SN: Fetch k shards in parallel (fan-out to 10 of 14 nodes)
SN-->>API: Shard data
API->>API: Decode shards → reconstruct original bytes
API->>API: Verify checksum
API->>C: Stream bytes (206 for Range GET)
For a range GET (Range: bytes=X-Y), the API maps the byte range to specific chunks and reads only the relevant shards rather than fetching all chunks. This is critical for video seeking and partial reads of large archives.
If fewer than all 14 nodes respond — some are slow, some have failed — the API waits for at least k=10 shards, then decodes. These degraded reads are slower but still succeed. Reconstruction happens transparently; the client sees the same bytes either way.
Erasure coding deep dive
Reed-Solomon is the standard algorithm here. Given an object or chunk, split it into k equally sized data fragments. The algorithm generates m additional parity fragments such that any k of the k+m fragments are sufficient to reconstruct the original.
Example: RS(10, 4)
Input: 100 MB object
k=10: 10 × 10 MB data shards
m=4: 4 × 10 MB parity shards (computed from data shards)
Total: 14 × 10 MB written = 140 MB stored
Overhead: 1.4× (vs 300 MB for 3× replication)
Failure tolerance: survive any 4 shard losses simultaneously.
Placement: spread 14 shards across ≥ 3 AZs (e.g., 5+5+4).
Caution: RS(10,4) tolerates exactly 4 simultaneous failures.
Losing the AZ with 4 shards is within tolerance; losing an AZ
with 5 shards exceeds it (5 > 4). To guarantee any-single-AZ
tolerance you need ≤ 4 shards per AZ — either use 4 AZs, or
accept that losing a majority AZ triggers urgent self-healing
before another failure. In practice systems also target
node-level distribution within each AZ.
EC is compute-intensive — writes require encoding and degraded reads require decoding. SIMD-accelerated implementations (e.g., Intel ISA-L, used by most production systems) achieve multiple GB/s per core, so throughput is rarely the bottleneck. The cost that actually matters is per-byte overhead for tiny objects: for a 1 KB blob, the EC framing and parity computation outweighs any storage savings, which is why small objects stay on 3× replication.
| Scheme | Storage overhead | Failure tolerance | Reconstruction cost | Best for |
|---|---|---|---|---|
| 3× replication | 3× | 2 failures | Zero (read any copy) | Small objects, metadata |
| RS(6, 3) | 1.5× | 3 failures | Medium (need 6 shards) | Mid-size objects |
| RS(10, 4) | 1.4× | 4 failures | Higher (need 10 shards) | Large objects, cost-critical |
| RS(14, 4) | 1.29× | 4 failures | Higher (need 14 shards) | Maximum storage efficiency |
Metadata service
The metadata service is the system's brain. It stores the mapping from logical name to physical location and is the source of truth for existence, authorization, and consistency.
Schema (per object record, simplified):
bucket_id: ULID
key: bytes (up to 1 KB)
version_id: ULID (null if versioning disabled)
size: uint64
etag: hex(MD5 of content) -- MD5 for single-part; complex formula for multipart
checksum_alg: sha256 | crc32c
created_at: timestamp
storage_class: STANDARD | IA | GLACIER
shard_map: [{node_id, shard_index, checksum}, ...] -- one entry per shard
delete_marker: bool
Sharding by hash(bucket_id || key) distributes objects evenly across shards, even within a single bucket. The downside shows up at LIST time: a LIST bucket?prefix=X must scatter across all shards that could contain keys with that prefix — which is why listing is fundamentally more expensive than getting.
A single customer with trillions of objects in one bucket can put pressure on multiple shards via prefix. The mitigation is virtual sharding within a bucket: add a random prefix salt to the physical shard key, and reconstruct the logical view at query time. It complicates LIST but keeps write load even.
Each metadata shard uses a consensus protocol (Raft or Paxos) for replication, guaranteeing strong read-after-write. Once PUT metadata commits, any subsequent GET from any node in the shard sees the new object. This was not always true — Amazon S3 operated on an eventually consistent model for overwrites and deletes until December 2020, when it switched to strong consistency. New systems should design for strong consistency from day one.
Durability: self-healing and scrubbing
Durability is not a property you achieve once at write time — it is a continuous background process.
Self-healing
stateDiagram-v2
[*] --> Healthy: object written with all N shards
Healthy --> Degraded: disk/node failure (shard lost)
Degraded --> Healing: self-healer detects < N shards
Healing --> Healthy: reconstruct missing shard from k survivors, write to new node
Degraded --> Critical: additional failures before healing completes
Critical --> Healing: escalate priority
Healing --> DataLoss: fewer than k shards survive (extremely rare)
When the health monitor detects a storage node failure, the self-healer immediately queries the metadata service for all objects with shards on that node. It reconstructs each missing shard from the surviving k shards and writes the new shard to a healthy node in the same failure domain. The metadata record is updated atomically.
The window between when a shard is lost and when a replacement is written is the period of elevated risk. A second simultaneous failure during that window could push a 10+4 object past its 4-failure tolerance. So healing speed directly determines your effective durability: the tighter that window, the shorter the exposure.
Scrubbing
Disk bit-rot — hardware corruption of stored bits with no filesystem error raised — is real and measurable in large fleets. Scrubbers periodically read every shard, verify its checksum, and compare against the metadata-stored checksum. Silent corruption is detected and the shard is rebuilt from its peers. Without scrubbing, objects that are never read can silently lose durability over years — the degradation is invisible until a client finally requests the object.
A large deployment scrubs every shard on a rolling schedule over weeks to months. Priority is raised for shards on older or higher-error-rate drives.
Garbage collection
When an object is deleted or overwritten, the metadata record is removed (or marked with a delete marker), but the shard data on storage nodes is not immediately freed. Deleting data is an expensive random-write operation that can interfere with the read path.
Instead, a GC worker runs in the background:
- Scans the metadata service for delete markers and completed multipart-upload tombstones older than a configurable delay.
- Cross-references shard map records against active object references.
- Issues delete commands to storage nodes for unreferenced shards.
- Removes the tombstoned metadata record.
The delay — typically minutes to hours — is intentional. It provides a safety window against bugs, enables soft-delete and undelete capability, and decouples delete latency from what the client experiences.
Consistency model (the one interviewers trip candidates on)
S3 today provides strong read-after-write consistency for all operations: a successful PUT is immediately visible to all subsequent GETs from any client within that bucket's region. Overwrites and DELETEs are also strongly consistent.
Before December 2020, S3 offered strong consistency for new object PUTs (first write to a key), but only eventual consistency for overwrites and DELETEs — a GET shortly after an overwrite could return the old version.
The architectural reason: S3 had a caching layer in its metadata subsystem optimized for high availability. Writes could flow through one part of that cache infrastructure while reads queried another, producing stale reads after overwrites. The migration to strong consistency didn't remove the cache — instead AWS added a "witness" component that tracks every object change and acts as a read barrier. Before serving a cached read, the server asks the witness whether the cache is fresh. If the witness confirms freshness, the cached value is returned; otherwise the persistence tier is queried directly. This cache coherence protocol, similar in spirit to CPU cache protocols, achieved strong consistency without sacrificing performance or availability — a non-trivial distributed systems engineering effort. (Architectural details are published in Werner Vogels' April 2021 blog post "Diving Deep on S3 Consistency.")
When designing a new system, always start with strong consistency. Relaxing it later is possible; fixing applications that silently depended on it is not.
Failure modes
| Failure | Impact | Mitigation |
|---|---|---|
| Single disk failure | 1 shard unavailable per object on that disk | Self-healer reconstructs within minutes; degraded reads still work |
| Single node failure | All objects with shards on that node degraded | Same as disk; priority healing |
| Silent bit-rot | Data silently wrong, not detected until read | Checksum on every shard; periodic scrubbing detects and repairs |
| AZ failure with 4 shards (RS 10+4, 5+5+4 placement) | Within tolerance — exactly at the m=4 boundary; degraded reads still work | Cross-AZ placement capped at m shards per AZ, or expedited self-healing |
| AZ failure with 5 shards (RS 10+4, majority AZ) | Exceeds RS(10,4) tolerance — reads require urgent reconstruction; data not lost if healing starts immediately | Either use ≤ 4 shards per AZ (needs 4 AZs) or accept brief degraded window + fast healer |
| Two simultaneous AZ failures | Below k shards remaining — data loss possible | Multi-region replication as second layer for critical data |
| Hot object (viral file) | Thundering herd on a small set of storage nodes | Presigned URLs + CDN; storage nodes are not in the hot read path for popular objects |
| Large multipart upload aborted | Orphaned part data on storage nodes | Lifecycle rules expire incomplete uploads after N days; GC reclaims parts |
| Metadata shard failure | All objects on that shard temporarily unavailable | Raft replication: automatic failover to a replica; recovery time < 30s |
| Correlated hardware failure (bad batch of drives) | Many simultaneous disk failures faster than healing | Physical diversity (drive vendors, firmware versions); monitoring AFR per drive model |
Multipart upload
Objects larger than ~100 MB should use multipart upload. The client splits the file, uploads parts in parallel, and the server assembles them in metadata — no bytes are physically stitched together on disk.
sequenceDiagram
participant C as Client
participant API as API Layer
participant SN as Storage Nodes
participant MS as Metadata Service
C->>API: POST /bucket/key?uploads
API-->>C: UploadId = "abc123"
par Upload parts in parallel
C->>API: PUT /bucket/key?partNumber=1&uploadId=abc123
API->>SN: Shard fan-out, part 1 → AZ1, AZ2, AZ3
SN-->>API: ack
API-->>C: ETag part 1
and
C->>API: PUT /bucket/key?partNumber=2&uploadId=abc123
API->>SN: Shard fan-out, part 2 → AZ1, AZ2, AZ3
SN-->>API: ack
API-->>C: ETag part 2
end
C->>API: "POST /bucket/key?uploadId=abc123 {parts: [1,2,...]}"
API->>MS: Write single object record linking all part metadata
MS-->>API: Committed
API-->>C: 200 OK, final ETag
The protocol:
- Initiate:
POST /bucket/key?uploads→ server returnsUploadId. - Upload parts:
PUT /bucket/key?partNumber=N&uploadId=X— each part is written and erasure-coded independently. Parts must be at least 5 MiB and at most 5 GiB (except the last part, which has no minimum). Up to 10,000 parts per upload, for a maximum object size of ~48.8 TiB (≈53.7 TB in decimal). - Complete:
POST /bucket/key?uploadId=Xwith an ordered list of(partNumber, ETag)pairs → server stitches the part metadata into a single object record. Client gets back the final object ETag. - Abort:
DELETE /bucket/key?uploadId=X→ server discards parts.
If the upload is interrupted, the client lists already-uploaded parts and resumes from the first missing or incomplete part — without re-uploading anything that already landed.
Presigned URLs and CDN offload
For popular objects, the storage service is not the right egress path. Routing a viral image through the API and storage nodes at 10M req/s would overwhelm even a large fleet.
Presigned URLs delegate authorization:
HMAC-sign(bucket, key, expiry, allowed_method, caller_identity)
→ URL valid for GET (or PUT) until expiry
The client hands this URL to a CDN (or directly to an end-user). The CDN caches the object at edge. The storage API only serves cache misses, dramatically reducing hot-object load.
For PUT, presigned URLs allow direct client-to-storage uploads without routing through the application server — critical for user-generated content pipelines.
Storage classes and lifecycle tiering
Not all data needs the same treatment. Storing log archives from three years ago on high-performance, redundant SSDs alongside today's active objects wastes money.
| Storage class | Typical use | Access pattern | Retrieval latency | Cost relative to Standard |
|---|---|---|---|---|
| Standard | Active, frequently accessed objects | Any time | Milliseconds | 1× |
| Infrequent Access (IA) | Backups, older data | Monthly | Milliseconds | ~0.5× storage, higher GET cost |
| Glacier Instant Retrieval | Archive data needing occasional immediate access | Quarterly | Milliseconds | ~0.2× storage |
| Glacier Flexible Retrieval | Disaster recovery archives | 1–2× per year | 1–5 min (expedited) / 3–5 hr (standard) / 5–12 hr (bulk) | ~0.1× storage |
| Glacier Deep Archive | Long-term compliance, rarely accessed | Less than yearly | 12–48 hours | ~0.05× storage |
Lifecycle rules let the client define policies ("transition objects in prefix logs/ older than 30 days to IA; delete after 365 days"). The lifecycle engine runs asynchronously, updating metadata and potentially moving data across storage tiers. Tiering to Glacier typically means the shard placement moves to cheaper, slower, potentially offline media, with the metadata record updated to reflect the new location and retrieval process.
Scaling the metadata service
At 10 trillion objects, the metadata service is itself a major distributed system.
Hash-based sharding of (bucket_id, key) distributes load evenly. Each shard is a Raft group of 3–5 replicas. A routing layer maps a request to the correct shard.
LIST operations (LIST bucket?prefix=X) must scatter across potentially all shards, since objects matching the prefix may be anywhere. For very large buckets, list operations are expensive and are typically rate-limited. Advanced designs maintain secondary indexes per-bucket for prefix queries, at the cost of extra write amplification.
A read cache in front of the metadata store can serve hot-key lookups with microsecond latency, but cache invalidation on overwrites must be synchronous to maintain strong consistency — which largely defeats the latency benefit for write-heavy keys. Use a short TTL (< 1s) for the cache, accept the occasional stale read window as a practical trade-off, or skip the cache layer entirely for consistency.
Things to discuss in an interview
- Why separate metadata and data planes? Different performance profiles, scaling needs, and consistency requirements. Metadata must be strongly consistent and highly available for small reads; data is bulk bytes on cheap disks.
- Erasure coding vs replication: storage cost vs computation cost vs recovery complexity. Know the arithmetic (1.4× vs 3×).
- The commit point in the write path: why metadata write is last, and why orphaned shards need GC.
- LIST vs GET: why listing is fundamentally more expensive (scatter-gather across metadata shards).
- Consistency history of real systems: S3's eventual-to-strong consistency transition is a real design lesson about cache invalidation at scale.
- Self-healing and scrubbing: durability is a continuous property, not a one-time write property.
- Presigned URLs: how CDN offload works; separation of authorization from serving.
Things you should now be able to answer
- Why doesn't object storage use a hierarchical filesystem internally, even though it shows folders in the UI?
- What happens to a GET request if one of the 14 EC shards is on a failed node?
- Why is the metadata write the last step in a PUT, not the first?
- What is bit-rot, and how does a scrubber catch it without relying on the client to re-read the object?
- Why is a LIST query more expensive than a GET query, at the metadata layer?
- What changed in December 2020 for S3 consistency, and why did it take years to implement?
Further reading
- "Amazon S3 Update – Strong Read-After-Write Consistency" — AWS News Blog, December 1, 2020 (announcement)
- "Diving Deep on S3 Consistency" — Werner Vogels / All Things Distributed, April 2021 (architectural deep-dive: witness protocol, cache coherence design)
- "Erasure Coding in Windows Azure Storage" — USENIX ATC 2012 (one of the clearest public explanations of EC in production blob storage)
- "Ceph: A Scalable, High-Performance Distributed File System" — OSDI 2006 (foundational for understanding object/block storage architecture)
- "Haystack: Finding a needle in Facebook's photo storage" — OSDI 2010 (a real production object store, deep on small-object efficiency)
- "FoundationDB: A Distributed Unbounded Ordered Key-Value Store" — SIGMOD 2021 (relevant for understanding how the metadata KV layer can be built)
Frequently asked questions
▸Why does object storage use erasure coding instead of 3x replication for durability?
Reed-Solomon erasure coding (RS 10+4) achieves 11 nines of durability at only 1.4x storage overhead versus 3x for full replication. At 2 EB raw data, that difference is hundreds of millions of dollars annually. The trade-off is CPU overhead on writes and reconstruction cost on degraded reads.
▸What is the commit point in an object storage write path, and why does it matter?
The metadata write is the commit point. The API layer writes all 14 erasure-coded shards to storage nodes first, then writes the metadata record. If the API crashes after shards land but before metadata commits, a GC worker cleans up those orphaned shards — the object is never visible to clients, so no partial write can appear as a successful PUT.
▸When did Amazon S3 switch to strong read-after-write consistency, and how did AWS achieve it?
S3 switched from eventual consistency for overwrites and deletes to strong read-after-write consistency in December 2020. AWS achieved this by adding a witness component that acts as a read barrier before cached metadata is served — if the witness cannot confirm the cache is fresh, the persistence tier is queried directly, similar in spirit to CPU cache coherence protocols.
▸What is the shard placement risk with RS(10,4) across 3 AZs?
RS(10,4) tolerates exactly 4 simultaneous shard failures. With a 5+5+4 placement across 3 AZs, losing the AZ holding 4 shards stays within tolerance, but losing an AZ holding 5 shards exceeds the 4-failure limit. To guarantee any-single-AZ tolerance you need either 4 AZs or a cap of at most 4 shards per AZ.
▸Why is a LIST operation more expensive than a GET in object storage?
Metadata is sharded by hash(bucket_id, key), which distributes objects evenly but means objects sharing a common prefix can sit on any shard. A LIST with a prefix filter must scatter across all shards that could hold matching keys and gather results, whereas a GET resolves to exactly one shard via the same hash.
You may also like
Design an LLM Observability Platform
Build the distributed tracing backbone for non-deterministic, multi-step LLM applications — capturing every prompt, completion, token count, and dollar cost across chains, retrievals, and tool calls so you can debug a failed agent run and account for every cent.
Design an LLM Gateway (AI Gateway & Model Router)
A single proxy control plane in front of OpenAI, Anthropic, Google, and open models — routing ~65 trillion tokens a month with automatic failover, semantic caching, per-team budget enforcement, and streaming SSE passthrough, all under 50 ms of added latency.
Design an LLM Fine-Tuning Platform
Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.