~/articles/design-object-storage

◆◆◆Advancedasked at Amazonasked at Googleasked at Microsoft

Design an Object Storage Service (S3)

Q: Why does object storage use erasure coding instead of 3x replication for durability?

Reed-Solomon erasure coding (RS 10+4) achieves 11 nines of durability at only 1.4x storage overhead versus 3x for full replication. At 2 EB raw data, that difference is hundreds of millions of dollars annually. The trade-off is CPU overhead on writes and reconstruction cost on degraded reads.

Q: What is the commit point in an object storage write path, and why does it matter?

The metadata write is the commit point. The API layer writes all 14 erasure-coded shards to storage nodes first, then writes the metadata record. If the API crashes after shards land but before metadata commits, a GC worker cleans up those orphaned shards — the object is never visible to clients, so no partial write can appear as a successful PUT.

Q: When did Amazon S3 switch to strong read-after-write consistency, and how did AWS achieve it?

S3 switched from eventual consistency for overwrites and deletes to strong read-after-write consistency in December 2020. AWS achieved this by adding a witness component that acts as a read barrier before cached metadata is served — if the witness cannot confirm the cache is fresh, the persistence tier is queried directly, similar in spirit to CPU cache coherence protocols.

Q: What is the shard placement risk with RS(10,4) across 3 AZs?

RS(10,4) tolerates exactly 4 simultaneous shard failures. With a 5+5+4 placement across 3 AZs, losing the AZ holding 4 shards stays within tolerance, but losing an AZ holding 5 shards exceeds the 4-failure limit. To guarantee any-single-AZ tolerance you need either 4 AZs or a cap of at most 4 shards per AZ.

Q: Why is a LIST operation more expensive than a GET in object storage?

Metadata is sharded by hash(bucket_id, key), which distributes objects evenly but means objects sharing a common prefix can sit on any shard. A LIST with a prefix filter must scatter across all shards that could hold matching keys and gather results, whereas a GET resolves to exactly one shard via the same hash.

Store arbitrary blobs with HTTP GET/PUT at exabyte scale and 11 nines of durability. Metadata vs data separation, erasure coding, and self-healing.

24 min read2026-05-26Ironclad Academy

#interview #storage #distributed-systems #durability

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Amazon S3 stores over 350 trillion objects as of 2024, and every week that number grows by tens of billions more. The basic pitch sounds simple: give it a file, get back a URL; call it again, get the file. But under that interface lies one of the most demanding durability problems in distributed systems — keep arbitrary blobs alive, byte-for-byte, through years of continuous disk failures, silent corruption, and hardware generation cycles, at a cost-per-gigabyte that is commercially viable.

Object storage is the substrate that the rest of cloud infrastructure sits on. Database backups, model checkpoints, log archives, video segments, static website assets — everything that needs to outlive the machine that created it ends up as a blob in a bucket. The API products it models most directly are Amazon S3, Google Cloud Storage, and Azure Blob Storage, each of which stores exabytes and serves millions of requests per second. Understanding how they work is understanding how cloud-native architectures store their most valuable data.

The design has two hard constraints pulling in opposite directions. First, durability: commodity hard drives fail at roughly 1–2% per year. With 200,000 drives at 2 EB, that's ~8 drive deaths per day, every day, indefinitely. The system must survive these failures without losing a single byte — and the SLO for that is eleven nines (99.999999999%), meaning fewer than one object lost per 100 billion stored per year. Second, scale economics: achieving that durability via three full copies of every object costs 3× in storage. At exabyte scale, the difference between 1.4× and 3× storage overhead is measured in hundreds of millions of dollars annually, which is why the choice between replication and erasure coding is not academic.

Layered on top of those two constraints is the metadata problem: tracking the physical location of trillions of objects, serving that lookup consistently at 500k GET/sec, and keeping the namespace strongly consistent so a client that just wrote a file can immediately read it back — all without a single metadata bottleneck. That is the system the interview question "design S3" is really asking you to build.

Functional requirements

PUT /bucket/key — upload an object (arbitrary bytes).
GET /bucket/key — download an object; range GET (Range: bytes=0-1048575) for partial reads.
DELETE /bucket/key — mark deleted (GC async).
LIST /bucket?prefix=logs/2026/ — list objects matching a prefix.
Multipart upload: initiate → PUT each part → complete. Required for objects > ~100 MB.
Presigned URLs: generate a time-limited, HMAC-signed URL that authorizes a single GET or PUT without requiring API credentials.
Bucket-level policies: IAM, ACLs, versioning toggle, lifecycle rules, encryption at rest.

Non-functional requirements

Durability: 99.999999999% (11 nines) per-object per year.
Availability: 99.99% for GET; slightly lower for PUT is acceptable.
Consistency: strong read-after-write for new PUTs and for overwrites/deletes (this was not always the case — see the consistency section).
Latency: time-to-first-byte for GET < 100 ms p50 at single-digit MB/s; throughput scales with parallelism.
Scale: exabytes of data, trillions of objects. Single objects from 1 byte to ~53.7 TB (10,000 parts × 5 GiB/part = 48.8 TiB ≈ 53.7 TB decimal; S3 raised the practical limit beyond 5 TB via multipart upload in 2022).
Security: TLS in transit, server-side encryption at rest, access logs, IAM integration.

Capacity estimation

Dimension	Estimate	How we got there
Total objects	10 trillion	Scale anchor
Average object size	200 KB	Heavy bimodal — most objects tiny; a minority of large video/model files dominates byte count
Total data stored	~2 EB	`10T × 200 KB`
Metadata per object	~200 bytes	Key, bucket, size, etag, checksum, location pointers, timestamps, version
Total metadata	~2 PB	`10T × 200 B`
Metadata nodes	~20k nodes at ~100 GB/node	`2 PB ÷ 20k nodes` — manageable per node
Write throughput	50k PUT/sec peak	Scale anchor
Read throughput	500k GET/sec peak	10:1 read:write ratio — reads dominate
Drive count	~200k drives	`2 EB ÷ 10 TB/drive`
Drive failures (AFR ~1.5%)	~3,000/year ≈ 8/day	`200k × 1.5%`; Backblaze 2024 fleet AFR ~1.57% (individual models: 0.5%–5%+)
3× replication overhead	3× storage cost	Achieves 11 nines with fast enough recovery, but expensive at exabyte scale
EC storage overhead (RS 10+4)	1.4×	`14 shards ÷ 10 data shards`; tolerates any 4 simultaneous shard failures; reconstruct from any 10 survivors
EC shard placement	5+5+4 across 3 AZs	Losing the AZ with 4 shards is within RS(10,4) tolerance; losing an AZ with 5 shards exceeds it — see erasure coding section

Takeaway: At 2 EB raw data and 500k GET/sec, erasure coding's 1.4× overhead versus replication's 3× is worth hundreds of millions of dollars at scale — but adds reconstruction cost on degraded reads and CPU overhead on writes.

Object model and API

The object model is deliberately simple:

Bucket: a flat namespace container with an owner, region, and policy.
Key: a UTF-8 string up to ~1 KB. Keys are not a real filesystem hierarchy — slashes in keys are just characters. The "folder" experience in a UI is a client-side illusion produced by grouping objects with a common prefix and using delimiter-aware LIST.
Object: an immutable blob identified by (bucket, key). Overwriting means writing a new object that replaces the mapping in the metadata store; the old chunks are GC'd.
Version: when versioning is enabled, overwrite preserves the old object under a version ID; delete creates a delete marker.

# Upload
PUT /my-bucket/logs/2026/05/26/app.log.gz HTTP/1.1
Host: storage.example.com
Content-Length: 42943
Content-Type: application/gzip
x-content-sha256: 8a3c...

<binary body>

→ 200 OK
ETag: "d41d8cd98f00b204e9800998ecf8427e"

# Range download (e.g., seek to byte 1 MB)
GET /my-bucket/video/clip.mp4 HTTP/1.1
Range: bytes=1048576-2097151

→ 206 Partial Content
Content-Range: bytes 1048576-2097151/104857600

Building up to the design

V1: One server, one disk

Start with the simplest thing that could work: a single HTTP server writes incoming bytes to a local directory and tracks them in a SQLite table of (bucket, key, filepath, size, etag). PUT writes to disk, GET reads from disk, zero operational complexity.

The problem is obvious — one disk has an annual failure rate around 1%, so you will lose data. One server can't handle more than a few hundred MB/s. Nothing is redundant.

V2: Replicate to N disks across N machines

Store three copies of every object on three different machines, ideally different racks. The metadata table now stores three location rows per object. Now you survive single-node failures: with AFR around 1% and three independent copies, the chance all three fail within the same recovery window becomes negligible.

But 3× storage cost is brutal at exabyte scale. You still have a single metadata database — that's a single point of failure. And LIST queries over a flat namespace start slowing down as the table grows into the billions.

V3: Separate the metadata and data planes, shard both

This is the architectural leap that makes everything else possible. Split the system into two independent fleets:

Metadata service: sharded by hash(bucket, key). Each shard is a replicated KV store. Strongly consistent. Stores location pointers, not bytes.
Storage nodes: raw disk capacity. Store chunks (numbered opaque files). They don't know about buckets or keys — they just store and serve blobs by chunk ID.

flowchart LR
    subgraph "Metadata Plane"
        MS1[(Metadata Shard 1)]
        MS2[(Metadata Shard 2)]
        MS3[(Metadata Shard 3)]
    end
    subgraph "Data Plane"
        SN1[Storage Node A]
        SN2[Storage Node B]
        SN3[Storage Node C]
        SN4[Storage Node D]
    end
    API[API Layer] -->|"lookup: bucket+key → shard locations"| MS1
    API -->|"fetch chunk by ID"| SN1
    API -->|"fetch chunk by ID"| SN2
    MS1 <-.->|"Raft replication"| MS2
    MS2 <-.->|"Raft replication"| MS3
    style API fill:#ff6b1a,color:#fff
    style MS1 fill:#0e7490,color:#fff
    style MS2 fill:#0e7490,color:#fff
    style MS3 fill:#0e7490,color:#fff
    style SN1 fill:#ffaa00,color:#0a0a0f
    style SN2 fill:#ffaa00,color:#0a0a0f
    style SN3 fill:#ffaa00,color:#0a0a0f
    style SN4 fill:#ffaa00,color:#0a0a0f

Each plane now scales independently. Metadata can live on SSDs optimized for small random reads; storage nodes use cheap high-density HDDs. But 3× replication is still expensive — and you've introduced a new problem: a customer with a huge bucket can saturate a single metadata shard.

V4: Erasure coding

Replace 3× replication with Reed-Solomon erasure coding. The API layer chunks large objects, encodes each chunk into (k data shards + m parity shards), and writes each shard to a distinct storage node. For a 10+4 scheme: split into 10 data shards, compute 4 parity shards, write all 14 to 14 different nodes across at least 3 AZs.

Storage overhead drops from 3× to 14/10 = 1.4×. You can survive any 4 simultaneous shard failures. The wrinkle: for small objects (< 1 MB), erasure coding overhead outweighs the benefit — so keep 3× replication for tiny objects and use EC for large ones. And when a shard is unavailable, serving a GET now requires reading 10 surviving shards and decoding on the fly, which adds latency and CPU load.

V5: Production system

Layer on top: background self-healing, checksum-based scrubbing for bit-rot, presigned URLs for CDN offload, lifecycle tiering, multipart upload for large objects, and a garbage collection worker for deleted or overwritten chunks.

flowchart LR
    V1["V1: 1 server, 1 disk\nno durability"] --> V2["V2: 3× replication\n3× cost, metadata SPOF"]
    V2 --> V3["V3: Separated planes\nsharded metadata"]
    V3 --> V4["V4: Erasure coding\n1.4× overhead, survive 4 shard failures"]
    V4 --> V5["V5: Self-healing + GC\n+ tiering + multipart\n+ CDN offload"]
    style V1 fill:#0e7490,color:#fff
    style V2 fill:#15803d,color:#fff
    style V4 fill:#ff6b1a,color:#0a0a0f
    style V5 fill:#a855f7,color:#fff

Full architecture

flowchart TD
    CLIENT[Client / SDK] --> GW[API Gateway\nAuth + TLS termination]
    GW --> ROUTER[Request Router]
    ROUTER -->|"PUT"| WPATH[Write Handler]
    ROUTER -->|"GET"| RPATH[Read Handler]
    ROUTER -->|"LIST"| LPATH[List Handler]

    WPATH --> CHUNKER[Chunker +\nErasure Encoder]
    CHUNKER --> PLACER[Placement Service\nchooses storage nodes]
    PLACER --> SN1[Storage Node\nAZ-1]
    PLACER --> SN2[Storage Node\nAZ-2]
    PLACER --> SN3[Storage Node\nAZ-3]
    CHUNKER --> MS[(Metadata Service\nsharded by hash bucket+key)]

    RPATH --> MS
    LPATH --> MS
    MS -->|"shard locations"| FETCHER[Shard Fetcher\n+ Decoder]
    FETCHER --> SN1 & SN2 & SN3

    GC[GC Worker] --> MS
    GC --> SN1 & SN2 & SN3

    HEAL[Self-Healer] --> MS
    HEAL --> SN1 & SN2 & SN3
    SCRUB[Scrubber] --> SN1 & SN2 & SN3

    style GW fill:#ff6b1a,color:#fff
    style MS fill:#0e7490,color:#fff
    style PLACER fill:#15803d,color:#fff
    style CHUNKER fill:#ffaa00,color:#0a0a0f
    style HEAL fill:#a855f7,color:#fff
    style SCRUB fill:#ff2e88,color:#fff

Write path (PUT)

sequenceDiagram
    participant C as Client
    participant API as API Layer
    participant MS as Metadata Service
    participant PL as Placement Service
    participant SN as Storage Nodes

    C->>API: PUT /bucket/key with bytes
    API->>API: Verify auth, compute checksum
    API->>PL: Request N node placements across failure domains
    PL-->>API: 14 placements across AZ1, AZ2, AZ3
    API->>API: Split into chunks, erasure-code each chunk 10+4
    API->>SN: Write 14 shards in parallel
    SN-->>API: 14× ack with shard checksums
    API->>MS: Write metadata record — key, size, etag, shard_locations
    MS-->>API: Committed (durable)
    API->>C: 200 OK, ETag

The metadata write is the commit point. If the API crashes between writing shards and committing metadata, those orphaned shards are cleaned up by the GC worker scanning for unreferenced chunk IDs. This is why metadata must be strongly consistent and durable before the success response is sent — a partial write where the API returned success but metadata wasn't committed would look like a lost object to the client.

Small objects (below a configurable threshold, often 1 MB) skip erasure coding and use 3× replication instead. EC overhead simply isn't worth it for tiny blobs.

Large objects (above ~100 MB) use multipart upload: the client splits into parts, PUTs each part independently with its own shard fan-out, then sends a CompleteMultipartUpload call that stitches part metadata into a single logical object record. No bytes are copied across storage nodes during assembly — only metadata is updated.

Read path (GET)

sequenceDiagram
    participant C as Client
    participant API as API Layer
    participant MS as Metadata Service
    participant SN as Storage Nodes

    C->>API: GET /bucket/key
    API->>MS: Lookup (bucket, key) → shard_locations[], size, checksum
    MS-->>API: Metadata record
    API->>SN: Fetch k shards in parallel (fan-out to 10 of 14 nodes)
    SN-->>API: Shard data
    API->>API: Decode shards → reconstruct original bytes
    API->>API: Verify checksum
    API->>C: Stream bytes (206 for Range GET)

For a range GET (Range: bytes=X-Y), the API maps the byte range to specific chunks and reads only the relevant shards rather than fetching all chunks. This is critical for video seeking and partial reads of large archives.

If fewer than all 14 nodes respond — some are slow, some have failed — the API waits for at least k=10 shards, then decodes. These degraded reads are slower but still succeed. Reconstruction happens transparently; the client sees the same bytes either way.

Erasure coding deep dive

Reed-Solomon is the standard algorithm here. Given an object or chunk, split it into k equally sized data fragments. The algorithm generates m additional parity fragments such that any k of the k+m fragments are sufficient to reconstruct the original.

Example: RS(10, 4)
Input:  100 MB object
k=10:   10 × 10 MB data shards
m=4:    4 × 10 MB parity shards (computed from data shards)
Total:  14 × 10 MB written = 140 MB stored
Overhead: 1.4× (vs 300 MB for 3× replication)

Failure tolerance: survive any 4 shard losses simultaneously.
Placement:  spread 14 shards across ≥ 3 AZs (e.g., 5+5+4).
            Caution: RS(10,4) tolerates exactly 4 simultaneous failures.
            Losing the AZ with 4 shards is within tolerance; losing an AZ
            with 5 shards exceeds it (5 > 4). To guarantee any-single-AZ
            tolerance you need ≤ 4 shards per AZ — either use 4 AZs, or
            accept that losing a majority AZ triggers urgent self-healing
            before another failure. In practice systems also target
            node-level distribution within each AZ.

EC is compute-intensive — writes require encoding and degraded reads require decoding. SIMD-accelerated implementations (e.g., Intel ISA-L, used by most production systems) achieve multiple GB/s per core, so throughput is rarely the bottleneck. The cost that actually matters is per-byte overhead for tiny objects: for a 1 KB blob, the EC framing and parity computation outweighs any storage savings, which is why small objects stay on 3× replication.

Scheme	Storage overhead	Failure tolerance	Reconstruction cost	Best for
3× replication	3×	2 failures	Zero (read any copy)	Small objects, metadata
RS(6, 3)	1.5×	3 failures	Medium (need 6 shards)	Mid-size objects
RS(10, 4)	1.4×	4 failures	Higher (need 10 shards)	Large objects, cost-critical
RS(14, 4)	1.29×	4 failures	Higher (need 14 shards)	Maximum storage efficiency

Metadata service

The metadata service is the system's brain. It stores the mapping from logical name to physical location and is the source of truth for existence, authorization, and consistency.

Schema (per object record, simplified):

bucket_id:    ULID
key:          bytes (up to 1 KB)
version_id:   ULID (null if versioning disabled)
size:         uint64
etag:         hex(MD5 of content)   -- MD5 for single-part; complex formula for multipart
checksum_alg: sha256 | crc32c
created_at:   timestamp
storage_class: STANDARD | IA | GLACIER
shard_map:    [{node_id, shard_index, checksum}, ...]  -- one entry per shard
delete_marker: bool

Sharding by hash(bucket_id || key) distributes objects evenly across shards, even within a single bucket. The downside shows up at LIST time: a LIST bucket?prefix=X must scatter across all shards that could contain keys with that prefix — which is why listing is fundamentally more expensive than getting.

A single customer with trillions of objects in one bucket can put pressure on multiple shards via prefix. The mitigation is virtual sharding within a bucket: add a random prefix salt to the physical shard key, and reconstruct the logical view at query time. It complicates LIST but keeps write load even.

Each metadata shard uses a consensus protocol (Raft or Paxos) for replication, guaranteeing strong read-after-write. Once PUT metadata commits, any subsequent GET from any node in the shard sees the new object. This was not always true — Amazon S3 operated on an eventually consistent model for overwrites and deletes until December 2020, when it switched to strong consistency. New systems should design for strong consistency from day one.

Durability: self-healing and scrubbing

Durability is not a property you achieve once at write time — it is a continuous background process.

Self-healing

stateDiagram-v2
    [*] --> Healthy: object written with all N shards
    Healthy --> Degraded: disk/node failure (shard lost)
    Degraded --> Healing: self-healer detects < N shards
    Healing --> Healthy: reconstruct missing shard from k survivors, write to new node
    Degraded --> Critical: additional failures before healing completes
    Critical --> Healing: escalate priority
    Healing --> DataLoss: fewer than k shards survive (extremely rare)

When the health monitor detects a storage node failure, the self-healer immediately queries the metadata service for all objects with shards on that node. It reconstructs each missing shard from the surviving k shards and writes the new shard to a healthy node in the same failure domain. The metadata record is updated atomically.

The window between when a shard is lost and when a replacement is written is the period of elevated risk. A second simultaneous failure during that window could push a 10+4 object past its 4-failure tolerance. So healing speed directly determines your effective durability: the tighter that window, the shorter the exposure.

Scrubbing

Disk bit-rot — hardware corruption of stored bits with no filesystem error raised — is real and measurable in large fleets. Scrubbers periodically read every shard, verify its checksum, and compare against the metadata-stored checksum. Silent corruption is detected and the shard is rebuilt from its peers. Without scrubbing, objects that are never read can silently lose durability over years — the degradation is invisible until a client finally requests the object.

A large deployment scrubs every shard on a rolling schedule over weeks to months. Priority is raised for shards on older or higher-error-rate drives.

Garbage collection

When an object is deleted or overwritten, the metadata record is removed (or marked with a delete marker), but the shard data on storage nodes is not immediately freed. Deleting data is an expensive random-write operation that can interfere with the read path.

Instead, a GC worker runs in the background:

Scans the metadata service for delete markers and completed multipart-upload tombstones older than a configurable delay.
Cross-references shard map records against active object references.
Issues delete commands to storage nodes for unreferenced shards.
Removes the tombstoned metadata record.

The delay — typically minutes to hours — is intentional. It provides a safety window against bugs, enables soft-delete and undelete capability, and decouples delete latency from what the client experiences.

Consistency model (the one interviewers trip candidates on)

S3 today provides strong read-after-write consistency for all operations: a successful PUT is immediately visible to all subsequent GETs from any client within that bucket's region. Overwrites and DELETEs are also strongly consistent.

Before December 2020, S3 offered strong consistency for new object PUTs (first write to a key), but only eventual consistency for overwrites and DELETEs — a GET shortly after an overwrite could return the old version.

The architectural reason: S3 had a caching layer in its metadata subsystem optimized for high availability. Writes could flow through one part of that cache infrastructure while reads queried another, producing stale reads after overwrites. The migration to strong consistency didn't remove the cache — instead AWS added a "witness" component that tracks every object change and acts as a read barrier. Before serving a cached read, the server asks the witness whether the cache is fresh. If the witness confirms freshness, the cached value is returned; otherwise the persistence tier is queried directly. This cache coherence protocol, similar in spirit to CPU cache protocols, achieved strong consistency without sacrificing performance or availability — a non-trivial distributed systems engineering effort. (Architectural details are published in Werner Vogels' April 2021 blog post "Diving Deep on S3 Consistency.")

When designing a new system, always start with strong consistency. Relaxing it later is possible; fixing applications that silently depended on it is not.

Failure modes

Failure	Impact	Mitigation
Single disk failure	1 shard unavailable per object on that disk	Self-healer reconstructs within minutes; degraded reads still work
Single node failure	All objects with shards on that node degraded	Same as disk; priority healing
Silent bit-rot	Data silently wrong, not detected until read	Checksum on every shard; periodic scrubbing detects and repairs
AZ failure with 4 shards (RS 10+4, 5+5+4 placement)	Within tolerance — exactly at the m=4 boundary; degraded reads still work	Cross-AZ placement capped at m shards per AZ, or expedited self-healing
AZ failure with 5 shards (RS 10+4, majority AZ)	Exceeds RS(10,4) tolerance — reads require urgent reconstruction; data not lost if healing starts immediately	Either use ≤ 4 shards per AZ (needs 4 AZs) or accept brief degraded window + fast healer
Two simultaneous AZ failures	Below k shards remaining — data loss possible	Multi-region replication as second layer for critical data
Hot object (viral file)	Thundering herd on a small set of storage nodes	Presigned URLs + CDN; storage nodes are not in the hot read path for popular objects
Large multipart upload aborted	Orphaned part data on storage nodes	Lifecycle rules expire incomplete uploads after N days; GC reclaims parts
Metadata shard failure	All objects on that shard temporarily unavailable	Raft replication: automatic failover to a replica; recovery time < 30s
Correlated hardware failure (bad batch of drives)	Many simultaneous disk failures faster than healing	Physical diversity (drive vendors, firmware versions); monitoring AFR per drive model

Multipart upload

Objects larger than ~100 MB should use multipart upload. The client splits the file, uploads parts in parallel, and the server assembles them in metadata — no bytes are physically stitched together on disk.

sequenceDiagram
    participant C as Client
    participant API as API Layer
    participant SN as Storage Nodes
    participant MS as Metadata Service

    C->>API: POST /bucket/key?uploads
    API-->>C: UploadId = "abc123"
    par Upload parts in parallel
        C->>API: PUT /bucket/key?partNumber=1&uploadId=abc123
        API->>SN: Shard fan-out, part 1 → AZ1, AZ2, AZ3
        SN-->>API: ack
        API-->>C: ETag part 1
    and
        C->>API: PUT /bucket/key?partNumber=2&uploadId=abc123
        API->>SN: Shard fan-out, part 2 → AZ1, AZ2, AZ3
        SN-->>API: ack
        API-->>C: ETag part 2
    end
    C->>API: "POST /bucket/key?uploadId=abc123 {parts: [1,2,...]}"
    API->>MS: Write single object record linking all part metadata
    MS-->>API: Committed
    API-->>C: 200 OK, final ETag

The protocol:

Initiate: POST /bucket/key?uploads → server returns UploadId.
Upload parts: PUT /bucket/key?partNumber=N&uploadId=X — each part is written and erasure-coded independently. Parts must be at least 5 MiB and at most 5 GiB (except the last part, which has no minimum). Up to 10,000 parts per upload, for a maximum object size of ~48.8 TiB (≈53.7 TB in decimal).
Complete: POST /bucket/key?uploadId=X with an ordered list of (partNumber, ETag) pairs → server stitches the part metadata into a single object record. Client gets back the final object ETag.
Abort: DELETE /bucket/key?uploadId=X → server discards parts.

If the upload is interrupted, the client lists already-uploaded parts and resumes from the first missing or incomplete part — without re-uploading anything that already landed.

Presigned URLs and CDN offload

For popular objects, the storage service is not the right egress path. Routing a viral image through the API and storage nodes at 10M req/s would overwhelm even a large fleet.

Presigned URLs delegate authorization:

HMAC-sign(bucket, key, expiry, allowed_method, caller_identity)
→ URL valid for GET (or PUT) until expiry

The client hands this URL to a CDN (or directly to an end-user). The CDN caches the object at edge. The storage API only serves cache misses, dramatically reducing hot-object load.

For PUT, presigned URLs allow direct client-to-storage uploads without routing through the application server — critical for user-generated content pipelines.

Storage classes and lifecycle tiering

Not all data needs the same treatment. Storing log archives from three years ago on high-performance, redundant SSDs alongside today's active objects wastes money.

Storage class	Typical use	Access pattern	Retrieval latency	Cost relative to Standard
Standard	Active, frequently accessed objects	Any time	Milliseconds	1×
Infrequent Access (IA)	Backups, older data	Monthly	Milliseconds	~0.5× storage, higher GET cost
Glacier Instant Retrieval	Archive data needing occasional immediate access	Quarterly	Milliseconds	~0.2× storage
Glacier Flexible Retrieval	Disaster recovery archives	1–2× per year	1–5 min (expedited) / 3–5 hr (standard) / 5–12 hr (bulk)	~0.1× storage
Glacier Deep Archive	Long-term compliance, rarely accessed	Less than yearly	12–48 hours	~0.05× storage

Lifecycle rules let the client define policies ("transition objects in prefix logs/ older than 30 days to IA; delete after 365 days"). The lifecycle engine runs asynchronously, updating metadata and potentially moving data across storage tiers. Tiering to Glacier typically means the shard placement moves to cheaper, slower, potentially offline media, with the metadata record updated to reflect the new location and retrieval process.

Scaling the metadata service

At 10 trillion objects, the metadata service is itself a major distributed system.

Hash-based sharding of (bucket_id, key) distributes load evenly. Each shard is a Raft group of 3–5 replicas. A routing layer maps a request to the correct shard.

LIST operations (LIST bucket?prefix=X) must scatter across potentially all shards, since objects matching the prefix may be anywhere. For very large buckets, list operations are expensive and are typically rate-limited. Advanced designs maintain secondary indexes per-bucket for prefix queries, at the cost of extra write amplification.

A read cache in front of the metadata store can serve hot-key lookups with microsecond latency, but cache invalidation on overwrites must be synchronous to maintain strong consistency — which largely defeats the latency benefit for write-heavy keys. Use a short TTL (< 1s) for the cache, accept the occasional stale read window as a practical trade-off, or skip the cache layer entirely for consistency.

Things to discuss in an interview

Why separate metadata and data planes? Different performance profiles, scaling needs, and consistency requirements. Metadata must be strongly consistent and highly available for small reads; data is bulk bytes on cheap disks.
Erasure coding vs replication: storage cost vs computation cost vs recovery complexity. Know the arithmetic (1.4× vs 3×).
The commit point in the write path: why metadata write is last, and why orphaned shards need GC.
LIST vs GET: why listing is fundamentally more expensive (scatter-gather across metadata shards).
Consistency history of real systems: S3's eventual-to-strong consistency transition is a real design lesson about cache invalidation at scale.
Self-healing and scrubbing: durability is a continuous property, not a one-time write property.
Presigned URLs: how CDN offload works; separation of authorization from serving.

Things you should now be able to answer

Why doesn't object storage use a hierarchical filesystem internally, even though it shows folders in the UI?
What happens to a GET request if one of the 14 EC shards is on a failed node?
Why is the metadata write the last step in a PUT, not the first?
What is bit-rot, and how does a scrubber catch it without relying on the client to re-read the object?
Why is a LIST query more expensive than a GET query, at the metadata layer?
What changed in December 2020 for S3 consistency, and why did it take years to implement?

Frequently asked questions

▸Why does object storage use erasure coding instead of 3x replication for durability?