MODULE 06 / 12 · crash course


◆Beginner

Storage Systems — Files, Blocks, and Objects

Block storage, file systems, object storage (S3), CDNs, and the cost/durability/throughput trade-offs that decide where your data lives.

15 min read · 2026-01-20 · Ironclad Academy

#storage #files #object-storage #s3 #fundamentals

The previous module covered databases — structured storage with indexes and queries. This module covers everything else you'll need to store: user uploads, video, backups, logs, ML datasets, and the bytes that don't fit neatly into rows. The decisions here are about durability, throughput, and cost, not query plans.

The three storage abstractions

Almost every storage product is one of three things:

flowchart TD
    S[Storage abstractions] --> BLK[Block<br/>raw byte ranges]
    S --> FS[File<br/>hierarchy + metadata]
    S --> OBJ[Object<br/>key → blob + metadata]
    BLK --> EBS[AWS EBS<br/>Local SSD<br/>iSCSI]
    FS --> EFS[AWS EFS / NFS<br/>Local FS<br/>Lustre, GPFS]
    OBJ --> S3[S3, GCS, Azure Blob<br/>R2, Backblaze B2]
    style BLK fill:#ff6b1a,color:#0a0a0f
    style FS fill:#0e7490,color:#fff
    style OBJ fill:#15803d,color:#fff

Each layer is built on the one below: file systems sit on block devices; object stores often sit on giant cluster file systems. Picking the right abstraction is the first storage decision you make.

Block storage

A "block" is a fixed-size chunk (typically 4 KB) addressed by an offset. Block storage is what your operating system mounts as a "disk." It has no semantic awareness of files — just numbered blocks you can read or write.

Property	Block storage
Granularity	4 KB blocks
Addressing	Offset on a device
Latency	0.1–1 ms (local SSD)
Throughput	500 MB/s – 10 GB/s
Sharing	Usually one host at a time
Examples	AWS EBS, Azure Managed Disks, GCP PD, on-prem SAN

Block storage is what you want when software needs fast random reads and writes under its own control. Postgres, MySQL, and most storage engines manage their own page layout inside a handful of large files, and they typically run on a local file system (ext4, XFS) created on a block volume rather than on a network share or an object store; a few engines can also be pointed at a raw device. VM root disks live here too. The trade-off is that the volume typically attaches to a single host, resizing can be done online but takes time to complete, and snapshots are point-in-time copies whose exact semantics vary by provider.

File systems

A file system organizes blocks into a tree of directories and files, with metadata (permissions, timestamps, size). On a single machine, every laptop has one. Across machines, the choice depends on what scale and access patterns you need:

Type	Examples	Use case
Local	ext4, XFS, APFS, NTFS	Default per-host
Network	NFS, SMB, AWS EFS, Azure Files	Shared across many hosts
Distributed/parallel	Lustre, GPFS, CephFS	HPC, ML training
In-cluster	HDFS	Hadoop ecosystem

Reach for a file system when legacy software expects "a directory of files": many ML training frameworks, build artifact pipelines, and render farms assume POSIX semantics (locking, partial overwrites). Two things will bite you, though. First, lots of small files are the file system's worst nightmare: every file carries metadata overhead, so 10 million 1 KB files are dramatically worse than 1,000 10 MB files even though both add up to 10 GB. Second, network file systems cache aggressively on clients, which means NFS writes from one machine may not be visible to another for a surprising amount of time (often several seconds with default attribute caching).

Object storage

A flat namespace of key → blob. There are no real directories — the slashes in photos/2024/cat.jpg are part of the key, not actual path components. Each object has bytes, metadata (Content-Type, custom headers), and a version.

flowchart LR
    C[Client] -->|"PUT photos/cat.jpg"| API[HTTPS API]
    API --> META[(Metadata<br/>service)]
    API --> SHARDS[(Object shards<br/>spread over thousands<br/>of disks)]
    SHARDS -.replicate.-> SHARDS
    style META fill:#ff6b1a,color:#0a0a0f
    style SHARDS fill:#0e7490,color:#fff

Property	Object storage
Granularity	Whole objects (no partial overwrites)
Addressing	URL: `s3://bucket/key`
Latency	30–200 ms first byte
Throughput	Effectively unlimited (parallel)
Durability	11 nines (99.999999999%) for S3 standard
Cost	$0.021–0.023 / GB-month for S3 Standard (volume tiers)
Examples	S3, GCS, Azure Blob, Cloudflare R2, Backblaze B2

Object storage is the right home for anything that's "an asset": user uploads, images, video, backups, ML datasets, static-site files, log archives, build artifacts. It is not the right tool for anything that needs random partial writes (you can't seek into an S3 object and change a few bytes), single-byte low-latency access, or POSIX semantics.
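To make the flat-namespace point concrete, here is a minimal in-memory sketch. `ToyObjectStore` and its method names are hypothetical and not any provider's API; the point is only that "directories" are an illusion created by prefix and delimiter filtering over plain string keys.

```python
from typing import Dict, List, Optional


class ToyObjectStore:
    """In-memory stand-in for a bucket: a flat dict of key -> bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # A PUT always replaces the whole object; there is no partial overwrite.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = "", delimiter: Optional[str] = None) -> List[str]:
        """Return keys under `prefix`. With a delimiter, deeper keys are rolled
        up into one 'common prefix' entry, which is how folder-style listings
        are emulated on top of a flat namespace."""
        results = set()
        for key in self._objects:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                results.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                results.add(key)
        return sorted(results)


store = ToyObjectStore()
store.put("photos/2024/cat.jpg", b"meow")
store.put("photos/2024/dog.jpg", b"woof")
store.put("photos/2025/fox.jpg", b"yip")
store.put("readme.txt", b"hello")

print(store.list_keys(prefix="photos/", delimiter="/"))
```

Nothing in the store knows that `photos/` is a folder. Deleting every key under a prefix makes the "folder" vanish, because it never existed as an object.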

Choosing between the three

The decision usually comes down to three questions:

flowchart TD
    Q1{"Does the software<br/>need low-latency<br/>random I/O?"}
    Q1 -->|yes| BLK["Block storage<br/>(EBS, local SSD)"]
    Q1 -->|no| Q2{"Does it need<br/>POSIX directories<br/>or shared mounts?"}
    Q2 -->|yes| FS["File system<br/>(EFS, NFS, Lustre)"]
    Q2 -->|no| OBJ["Object storage<br/>(S3, GCS — default choice)"]
    style BLK fill:#ff6b1a,color:#0a0a0f
    style FS fill:#0e7490,color:#fff
    style OBJ fill:#15803d,color:#fff

When in doubt, object storage. It scales horizontally by default, costs the least per GB, and requires zero operational work to grow.
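The decision tree above reduces to two ordered checks. The function name and return labels below are made up for illustration:

```python
def choose_storage(needs_low_latency_random_io: bool,
                   needs_posix_or_shared_mount: bool) -> str:
    """Walk the two questions in order; object storage is the fallthrough."""
    if needs_low_latency_random_io:
        return "block"
    if needs_posix_or_shared_mount:
        return "file"
    return "object"


print(choose_storage(True, False))   # a database volume
print(choose_storage(False, True))   # a shared render-farm mount
print(choose_storage(False, False))  # user uploads
```

The order matters: a workload that needs both random I/O and a mount point is answered by the first question, which matches the diagram.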

Why object storage won

For the vast majority of "where do I put this?" decisions in modern systems, the answer is object storage, and the reasons all trace back to one property: horizontal scaling without operational pain.

flowchart TD
    A[Old way: NAS / SAN] -->|"capacity = buy bigger box"| B[Vertical scaling]
    B --> C[Hits a ceiling]
    C --> D[Migration project]
    E[Object storage] -->|"capacity = automatic"| F[Pay-as-you-go]
    F --> G[Effectively infinite]
    style A fill:#ff2e88,color:#fff
    style G fill:#15803d,color:#fff

With a NAS, you buy a box, fill it, buy a bigger box, and migrate. With S3-class storage, you upload a byte and S3 finds room for it, automatically spreading the data across multiple Availability Zones so that losing an entire AZ still leaves your data intact. The 11-nines durability figure (99.999999999%) is a design target rather than a measured guarantee; it rests on S3 storing each object redundantly across at least three AZs, and across multiple physical devices within each.

Beyond durability, you also get throughput that scales by adding parallel uploads, lifecycle policies that automatically move objects to cheaper tiers as they age, versioning (once enabled on the bucket) so accidental deletes are recoverable, and simple static-website hosting and CDN integration, so a bucket can serve a static site with little more than a configuration change.

The S3 (object storage) data model

Five concepts show up in every cloud's object store:

Concept	What
Bucket	Top-level namespace, globally unique
Key	Object name within the bucket — looks like a path but isn't
Object	The bytes plus metadata
Version	An immutable snapshot of an object at a point in time
Storage class	Hot / warm / cold tier with different cost/latency

Storage classes

Class	Latency	$/GB-mo	Min duration	Use case
S3 Standard	~30 ms	~$0.023	none	Hot, frequent access
S3 IA (Infrequent Access)	~30 ms	~$0.0125	30 days	Backups, last-30-day logs
S3 Glacier Instant	~30 ms	~$0.004	90 days	Archival but quickly retrievable
S3 Glacier Flexible	minutes–hours	~$0.0036	90 days	Compliance archive
S3 Glacier Deep Archive	12–48 hours	~$0.00099	180 days	Long-term hold

The most important number here isn't the storage price — it's the retrieval cost. Glacier retrievals can cost more than the storage savings if you read more often than expected. Use IA and Glacier for data you'll truly almost never read; then the economics work in your favor.
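A rough way to check whether a colder class pays off is to add the expected retrieval charge to the storage charge. The storage prices below come from the table above. The $0.01/GB retrieval fee for IA is an assumption for illustration only, and the model ignores per-request fees and minimum-duration charges, so check current pricing before relying on it.

```python
STANDARD_PRICE = 0.023      # $/GB-month, from the table above
IA_PRICE = 0.0125           # $/GB-month, from the table above
IA_RETRIEVAL_PRICE = 0.01   # $/GB retrieved: ASSUMED for illustration


def monthly_cost(stored_gb: float, storage_price: float,
                 retrieval_price: float, fraction_read: float) -> float:
    """Storage charge plus retrieval charge for one month.

    fraction_read is the share of the stored bytes read back per month
    (1.0 means every byte is read once; 2.0 means twice).
    """
    storage = stored_gb * storage_price
    retrieval = stored_gb * fraction_read * retrieval_price
    return storage + retrieval


def break_even_read_fraction(hot_price: float, cold_price: float,
                             cold_retrieval_price: float) -> float:
    """Read fraction at which the colder class stops being cheaper."""
    return (hot_price - cold_price) / cold_retrieval_price


standard = monthly_cost(1000, STANDARD_PRICE, 0.0, 0.1)
ia_rarely_read = monthly_cost(1000, IA_PRICE, IA_RETRIEVAL_PRICE, 0.1)
ia_read_twice = monthly_cost(1000, IA_PRICE, IA_RETRIEVAL_PRICE, 2.0)
print(standard, ia_rarely_read, ia_read_twice)
print(break_even_read_fraction(STANDARD_PRICE, IA_PRICE, IA_RETRIEVAL_PRICE))
```

Under these assumed prices IA stays cheaper until roughly the whole data set is read back once a month. With colder classes the retrieval fee is larger relative to the storage saving, so the break-even arrives sooner.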

Multipart upload (the way to upload anything > 100 MB)

A single S3 PUT works for objects up to 5 GB, but one large request is fragile over a flaky network: a dropped connection means starting over. Above 5 GB multipart upload is required, and it is worth using from roughly 100 MB upward: split the object into parts, upload them in parallel, and tell S3 to assemble the final object. Multipart raises the practical object size cap to the AWS maximum of 5 TB per object.

sequenceDiagram
    participant C as Client
    participant S as S3
    C->>S: CreateMultipartUpload<br/>(bucket, key) → UploadId
    par parallel uploads
    C->>S: UploadPart 1, partNumber=1 → ETag
    C->>S: UploadPart 2, partNumber=2 → ETag
    C->>S: UploadPart N, partNumber=N → ETag
    end
    C->>S: CompleteMultipartUpload<br/>(UploadId, [parts]) → final ETag

A part that fails can be retried independently — you don't re-upload the other 50. Uploading many parts concurrently saturates your available bandwidth.
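Before any bytes move, the client has to decide where each part starts and ends. The sketch below plans part boundaries only; it does not talk to S3. The 5 MiB minimum part size and 10,000-part ceiling are S3's documented multipart limits, while the 64 MiB default target and the function name are arbitrary choices for illustration.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one upload


def plan_parts(object_size: int, target_part_size: int = 64 * MIB):
    """Return a list of (part_number, offset, length) tuples covering the object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(target_part_size, MIN_PART_SIZE)
    # Grow the part size if the target would need more than MAX_PARTS parts.
    smallest_allowed = (object_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    part_number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts


six_gib = plan_parts(6 * 1024 * MIB)
print(len(six_gib), six_gib[0], six_gib[-1])
```

Each tuple is an independent unit of work: a worker reads `length` bytes at `offset`, uploads them under `part_number`, and records the returned ETag. A failed part is retried using only its own tuple.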

One operational detail that catches teams off guard: set a lifecycle rule to abort incomplete multipart uploads after N days. A client that starts an upload and crashes leaves orphaned parts in your bucket, and you pay for them silently.
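A rule of that kind might look like the following. The structure follows the S3 lifecycle configuration format; the rule ID and the 7-day window are placeholders, not recommendations.

```python
import json

ABORT_AFTER_DAYS = 7  # assumption: choose an N longer than your slowest legitimate upload

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix applies the rule to the whole bucket
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": ABORT_AFTER_DAYS
            },
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, a dict of this shape is what you would pass as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`. The call is left out here so the snippet runs without credentials.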

Pre-signed URLs (don't proxy your bytes through your app)

A common architectural trap: every user upload flows through your API server on the way to S3. You've effectively doubled your bandwidth costs and added your own server as a bottleneck and failure point.

The fix is a pre-signed URL: your API generates a time-limited signature that authorizes the client to PUT directly to S3, then hands that URL to the client.

flowchart LR
    C[Client] -->|"1. request upload URL"| API[Your API]
    API -->|"2. generate signed URL"| API
    API -->|"3. return URL"| C
    C -->|"4. PUT bytes directly"| S3[(S3)]
    style API fill:#ff6b1a,color:#0a0a0f
    style S3 fill:#15803d,color:#fff

The same pattern works for downloads: generate a signed URL valid for 5 minutes, return it to the client, and let the client fetch from S3 directly. Your app servers never touch the bytes. This is how most modern file-upload features are built: your API coordinates the handoff but doesn't carry the payload.
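The idea behind a signed URL fits in a few lines: the server signs the method, path, and expiry with a secret the client never sees, and the storage side recomputes that signature before accepting the request. The sketch below is a simplified HMAC scheme for illustration only. It is not the AWS Signature Version 4 algorithm that S3 actually uses, and the host name, secret, and function names are all hypothetical. In practice you would call your SDK's presign helper instead.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"  # hypothetical; never leaves the server


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(method: str, path: str, ttl_seconds: int,
            now: Optional[float] = None) -> str:
    """Build a URL that authorizes one method on one path until it expires."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires,
                       "signature": _signature(method, path, expires)})
    return f"https://storage.example.com{path}?{query}"


def verify(method: str, url: str, now: Optional[float] = None) -> bool:
    """What the storage side does: check expiry, then recompute the signature."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _signature(method, parsed.path, expires)
    return hmac.compare_digest(signature, expected)


url = presign("PUT", "/uploads/cat.jpg", ttl_seconds=300, now=1000)
print(verify("PUT", url, now=1100))   # inside the window, same method and path
print(verify("PUT", url, now=1301))   # expired
print(verify("GET", url, now=1100))   # wrong method
```

Because the method and path are part of the signed message, a URL issued for uploading one key cannot be reused to read it or to overwrite a different key.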

Consistency in object storage

Older S3 (pre-2020) was eventually consistent on overwrites and listings — a read immediately after a write could return the old version. That produced genuinely surprising bugs for teams that assumed stronger guarantees.

Modern S3, GCS, and Azure Blob are all strongly consistent for read-after-write and read-after-delete. List operations are strongly consistent too. You no longer need to build retry-and-compare logic for single-region operations.

What's still not strongly consistent: cross-region replication. Replicating a bucket to another region is asynchronous and lags by seconds to minutes. Don't read from a replica immediately after writing to the primary and expect to see the new version.

Throughput characteristics

Pattern	Approx. throughput
Single-thread sequential PUT (1 MB parts)	~50–100 MB/s
100 parallel PUTs to same bucket	~5–10 GB/s
Single-thread GET	~80–120 MB/s
1000 parallel GETs from same prefix	~100 GB/s
Cross-region replication	seconds–minutes lag

One subtlety worth knowing: behind the scenes, S3 partitions a bucket's key space by prefix. If all your keys start with 2026/01/15/..., they all land in the same partition and you can saturate it. Modern S3 splits hot partitions automatically as it detects sustained traffic, but the documented baseline is about 3,500 PUTs (and 5,500 GETs) per second per prefix, and a sudden spike can be throttled with 503 SlowDown before the split happens. At very high request rates you still want diverse prefixes: prepend a hash, reverse a timestamp, anything to spread the writes across partitions.
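One way to "prepend a hash" is to derive a short shard prefix from the natural key. The function name and the shard count of 16 are arbitrary choices for illustration:

```python
import hashlib


def spread_key(natural_key: str, shards: int = 16) -> str:
    """Prefix the key with a stable shard derived from its own hash.

    The same natural key always maps to the same shard, so a reader who
    knows the natural key can recompute the full object key.
    """
    digest = hashlib.sha256(natural_key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{natural_key}"


keys = [spread_key(f"2026/01/15/event-{i}.json") for i in range(1000)]
print(keys[:3])
print(len({k.split("/", 1)[0] for k in keys}), "distinct prefixes")
```

The cost is that "list everything for 2026/01/15" now means one listing per shard, so this only pays off when write rate matters more than cheap date-range listing.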

Content delivery networks (CDNs)

Object storage gives you durable, cheap storage. A CDN gives you fast delivery by caching your content at edge locations around the world, typically hundreds of them for the large providers.

flowchart LR
    U1[User Tokyo] --> E1[CDN edge: Tokyo]
    U2[User Mumbai] --> E2[CDN edge: Mumbai]
    U3[User Sao Paulo] --> E3[CDN edge: Sao Paulo]
    E1 -.miss.-> O[(S3 origin: us-east-1)]
    E2 -.miss.-> O
    E3 -.miss.-> O
    style E1 fill:#ff6b1a,color:#0a0a0f
    style E2 fill:#ff6b1a,color:#0a0a0f
    style E3 fill:#ff6b1a,color:#0a0a0f
    style O fill:#0e7490,color:#fff

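The math: take a user whose round trip to your origin is 200 ms. A new HTTPS connection costs one round trip for the TCP handshake and at least one more for TLS before the request is even sent, then another before the response starts arriving, so a 200 KB image takes well over half a second. From a CDN edge 5 ms away, each of those round trips shrinks by 200 / 5 = 40×. For any content that's cacheable and served globally, putting a CDN in front of S3 is almost always the right call.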
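A back-of-the-envelope model of that comparison, assuming "N ms away" means round-trip time and that a fresh HTTPS connection costs one round trip for TCP, one for a TLS 1.3 handshake, and one for the request and first response byte. Bandwidth, TCP slow start, and connection reuse are all ignored.

```python
def time_to_first_byte_ms(rtt_ms: float, tls_round_trips: int = 1) -> float:
    """Latency before the first response byte on a brand-new connection."""
    tcp_handshake = 1 * rtt_ms
    tls_handshake = tls_round_trips * rtt_ms  # 1 for TLS 1.3, 2 for TLS 1.2
    request_response = 1 * rtt_ms
    return tcp_handshake + tls_handshake + request_response


origin_ms = time_to_first_byte_ms(200)
edge_ms = time_to_first_byte_ms(5)
print(origin_ms, edge_ms, origin_ms / edge_ms)  # 600.0 15.0 40.0
```

Because every term scales with the round-trip time, the speedup in this model is simply the ratio of the two round-trip times.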

Set your Cache-Control headers deliberately:

public, max-age=31536000, immutable for content-hashed assets like /static/main.a8c7.js — never changes, cache forever.
public, max-age=300, s-maxage=3600 for HTML you re-render hourly — short browser TTL, longer CDN TTL.
no-store for personalized HTML like account dashboards — don't let the CDN serve one user's data to another.

The more advanced trick is stale-while-revalidate: once an entry expires, the cache serves the stale copy instantly, fetches a fresh one in the background, and swaps it in for the next request. Users always get a fast response, at the cost of occasionally seeing content that is one refresh behind.
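The cache's decision can be written as a function of the entry's age. This sketch assumes a header such as `Cache-Control: max-age=300, stale-while-revalidate=60`; the function name and return labels are invented for illustration.

```python
def cache_decision(age_s: int, max_age_s: int, swr_s: int) -> str:
    """Decide how a cache should treat an entry of the given age (seconds)."""
    if age_s < max_age_s:
        return "serve-fresh"
    if age_s < max_age_s + swr_s:
        # Stale, but inside the stale-while-revalidate window: answer from
        # cache right away and refresh in the background.
        return "serve-stale-and-revalidate"
    # Too old for either window: the request waits on the origin.
    return "fetch-before-serving"


for age in (100, 330, 400):
    print(age, cache_decision(age, max_age_s=300, swr_s=60))
```

Only the third case makes a user wait on the origin, which is why a generous revalidation window keeps tail latency low for frequently requested content.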

Tiered storage in practice

Real systems don't pick one storage tier and live there — they chain tiers by access frequency:

flowchart LR
    HOT[Hot: in-memory] -->|"<1ms"| WARM[Warm: SSD / DB]
    WARM -->|"1-50ms"| COLD[Cold: object storage]
    COLD -->|"100ms+"| ARCH[Archive: Glacier]
    style HOT fill:#ff2e88,color:#fff
    style WARM fill:#ff6b1a,color:#0a0a0f
    style COLD fill:#0e7490,color:#fff
    style ARCH fill:#a855f7,color:#fff

A chat product such as Slack is a clean illustration of this ladder (the tier boundaries here are illustrative, not a description of any one company's internals). The last hour of messages lives in an in-memory store like Redis (hot). The last 30 days live in a database such as MySQL or Postgres (warm). Older messages move to object storage with just a metadata pointer in the DB (cold). Beyond the compliance window, data is deleted or dropped to Glacier (archive). Each tier is roughly 10× cheaper and 10× slower than the one above. Designing the lifecycle (when to move data, and on what trigger) is its own discipline worth thinking through for any long-lived data store.
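The ladder above amounts to an age-based routing rule. The one-hour and 30-day thresholds follow the example; the 365-day boundary stands in for "the compliance window", which the example does not specify, so treat it as a placeholder.

```python
from datetime import timedelta

# (upper age limit, tier) pairs, checked from hottest to coldest.
TIERS = [
    (timedelta(hours=1), "hot: in-memory"),
    (timedelta(days=30), "warm: database"),
    (timedelta(days=365), "cold: object storage"),  # ASSUMED compliance window
]


def tier_for(age: timedelta) -> str:
    """Return the tier a record of the given age should live in."""
    for limit, tier in TIERS:
        if age <= limit:
            return tier
    return "archive"


for age in (timedelta(minutes=5), timedelta(days=2),
            timedelta(days=200), timedelta(days=800)):
    print(age, "->", tier_for(age))
```

In a real system a read path would consult the tiers in this same order, while a background job moves records down as they cross each threshold.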

File system specifics worth knowing

If you end up working with actual files (not blobs in S3), a few pitfalls to keep in mind:

inode exhaustion: a file system can run out of inodes before it runs out of disk bytes. df -i shows your inode usage. This is the failure mode when you store millions of tiny files — each one consumes an inode, and once they're gone the disk appears "full" even though there's plenty of space.

fsync semantics: until you call fsync(), "written" data sits in the OS page cache and can be lost in a crash. Databases call fsync() after each commit — that's the durability cost you pay, and it's why database writes feel slower than they "should."

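Atomic rename: on POSIX, rename() is atomic within a single file system. This is the primitive behind nearly every "atomic write" you'll see: write to a temp file, fsync it, rename it over the target. Because the rename is atomic, the target is never visible in a half-written state: readers see either the complete old file or the complete new one. For crash safety, also fsync the containing directory so the rename itself is durable.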

Hard vs symbolic links: hard links share an inode; symlinks are pointers to a path. Backup and copy tools handle them differently: GNU tar records the second name of a hard-linked file as a link inside the archive by default, while rsync copies each name as an independent file unless you pass -H (--hard-links).
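The write-temp, fsync, rename recipe from the atomic rename note above can be sketched with the standard library. `atomic_write` is an illustrative helper name. The temp file is created in the target's own directory so the rename never crosses file systems.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # force the bytes out of the page cache
        os.replace(tmp_path, path)   # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # fsync the directory so the rename itself survives a crash.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "config.json")
atomic_write(target, b'{"version": 1}')
atomic_write(target, b'{"version": 2}')
with open(target, "rb") as f:
    print(f.read())
```

One caveat: `mkstemp` creates the file with owner-only permissions, so a real implementation may need to copy the original file's mode before the rename.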

Block storage in cloud

When provisioning EBS-class block storage, you're picking on three axes: IOPS, throughput, and cost.

Type	Use case	IOPS	Throughput	$/GB-mo
gp3 (general SSD)	Boot disks, most apps	3,000–16,000	125–1,000 MB/s	~$0.08
io2 (provisioned SSD)	High-performance DBs	up to 256,000	up to 4 GB/s	~$0.125 + IOPS
st1 (throughput HDD)	Big sequential workloads (logs)	low	up to 500 MB/s	~$0.045
sc1 (cold HDD)	Archival of files needing FS	very low	up to 250 MB/s	~$0.015

Latency is single-digit milliseconds across all SSD options (network-attached EBS typically 1–3 ms average). HDD is roughly 10× slower for random I/O — not a problem for sequential log writes, but disqualifying for a database.

Anti-patterns we see all the time

Storing user uploads on your app server's local disk. The disk fills up, and the moment you scale horizontally only one server has the file. Use object storage from day one — the migration later is painful.

Storing video bytes in Postgres bytea columns. This bloats the database, slows backups to a crawl, and destroys the buffer pool's effectiveness. Store video in S3; store the URL in Postgres.

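Listing a million-key bucket on every request. S3 LIST returns at most 1,000 keys per call, so enumerating N keys takes about N / 1,000 sequential round trips, and LIST requests are both rate-limited and billed. Either index your keys in DynamoDB or Postgres so you can do point lookups, or design your key schema so prefix listing always returns a small result.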

Caching S3 GETs in Redis. Your CDN already does that, with global distribution and hardware purpose-built for it. Cache the parsed result of an expensive computation, not raw bytes that the CDN will handle better.

Worked example: design photo upload

Spec: 100M users, average 5 photos uploaded per day, 500 KB per photo, 90-day hot retention, indefinite cold retention.

Writes:
  100M × 5 = 500M uploads/day
  500M / 86400 ≈ 5,800 uploads/sec average
  Peak ~3× = 17,000 uploads/sec

Bandwidth (writes):
  5,800 × 500 KB = 2.9 GB/sec average
  17,000 × 500 KB = 8.5 GB/sec peak

Storage:
  500M × 500 KB = 250 TB/day
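  90 days hot   = 22.5 PB hot
  At $0.023/GB-month (first-50-TB list price; cheaper volume tiers apply at this scale) ≈ $520K/month for hot
  Same 22.5 PB at IA's $0.0125/GB-month ≈ $280K/month, so each 90-day cohort that ages into IA saves ~$240K/month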

Architecture:
  Client → pre-signed PUT → S3 (hot)
  Metadata → Postgres (user_id, photo_id, key, created_at, content_type)
  CDN in front of S3 for reads
  Lifecycle rule: after 90 days, transition to S3 IA
  Lifecycle rule: after 1 year, transition to Glacier IR

That sketch tells you the system has one storage class for live serving, two for older data, and the application layer (Postgres) holds only metadata. The bytes never pass through your app servers.
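The arithmetic in the worked example can be reproduced in a few lines, which makes it easy to change an input and watch the totals move. Units are decimal (1 KB = 1,000 bytes), and the flat $/GB prices are the same simplification the example uses.

```python
USERS = 100_000_000
PHOTOS_PER_USER_PER_DAY = 5
PHOTO_BYTES = 500_000          # 500 KB
HOT_DAYS = 90
PEAK_FACTOR = 3
STANDARD_PRICE = 0.023         # $/GB-month
IA_PRICE = 0.0125              # $/GB-month

uploads_per_day = USERS * PHOTOS_PER_USER_PER_DAY
avg_uploads_per_sec = uploads_per_day / 86_400
peak_uploads_per_sec = avg_uploads_per_sec * PEAK_FACTOR

avg_bandwidth_gb_s = avg_uploads_per_sec * PHOTO_BYTES / 1e9
peak_bandwidth_gb_s = peak_uploads_per_sec * PHOTO_BYTES / 1e9

daily_tb = uploads_per_day * PHOTO_BYTES / 1e12
hot_pb = daily_tb * HOT_DAYS / 1000
hot_gb = hot_pb * 1e6

hot_cost = hot_gb * STANDARD_PRICE   # the 90-day window kept in Standard
ia_cost = hot_gb * IA_PRICE          # the same volume once it sits in IA

print(f"uploads/sec  avg {avg_uploads_per_sec:,.0f}  peak {peak_uploads_per_sec:,.0f}")
print(f"bandwidth    avg {avg_bandwidth_gb_s:.1f} GB/s  peak {peak_bandwidth_gb_s:.1f} GB/s")
print(f"storage      {daily_tb:.0f} TB/day, {hot_pb:.1f} PB hot")
print(f"cost/month   Standard ${hot_cost:,.0f}  IA ${ia_cost:,.0f}  "
      f"difference ${hot_cost - ia_cost:,.0f}")
```

The exact peak figures come out slightly above the rounded ones in the sketch (about 17,400 uploads/sec and 8.7 GB/sec), because the sketch rounds before multiplying.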

Things you should now be able to answer

Why is object storage almost always better than NFS for "user uploads"?
A user uploads a 6 GB video. What's wrong with a single S3 PUT? What do you do instead?
Your app needs to deliver a private file to a user. How do you avoid proxying it through your server?
A bucket is suddenly returning 503 SlowDown errors. What might be wrong with your key naming?
Why does S3 have 11 nines of durability but only 99.99% availability?
A startup stores all user-uploaded video in Postgres bytea. What three things will go wrong first?

→ Next: Caching

// KEY TAKEAWAYS

▸Object storage wins by default: it scales horizontally without ops work, costs $0.021–0.023/GB-month (S3 Standard, volume tiers), and delivers 11-nines durability by replicating across at least three AZs.
▸Never proxy upload bytes through your app server — generate a pre-signed URL and let the client PUT directly to S3.
▸S3 Glacier retrieval costs can exceed the storage savings; use cold tiers only for data you will almost never read.
▸Lots of small files are a file system's worst nightmare — millions of 1 KB files exhaust inodes and metadata performance long before disk space runs out.
▸In a tiered storage ladder (Redis hot, Postgres warm, S3 cold, Glacier archive), each tier is roughly 10x cheaper and 10x slower than the one above. Design the lifecycle triggers, not just the tiers.

// FAQ

Frequently asked questions

▸What is object storage and how does it differ from a file system?

Object storage is a flat namespace mapping a key to a blob plus metadata, with no real directory hierarchy — the slashes in a key like photos/2024/cat.jpg are just part of the key string. Unlike a file system, it has no partial-write support, no POSIX locking, and exposes data over an HTTPS API rather than a mount point. The trade-off is that you get 11-nines durability, effectively unlimited parallel throughput, and pay-as-you-go capacity with zero operational overhead.

▸When should I use block storage instead of object storage?

Use block storage when software needs low-latency random I/O. Databases like Postgres and MySQL keep their data files on a file system backed by a block volume and depend on its low latency (sub-millisecond on local SSD, low single-digit milliseconds on network-attached volumes) and predictable throughput, up to 10 GB/s at the high end. Block storage (e.g., AWS EBS gp3) typically attaches to a single host and lets the storage engine control how its pages are laid out and flushed. If the workload does not need random partial writes or single-digit-millisecond latency, object storage is the cheaper default.

▸What is S3 Glacier Deep Archive and when does it make economic sense?

S3 Glacier Deep Archive is the coldest storage class at roughly $0.00099 per GB-month, with a minimum storage duration of 180 days and retrieval times of 12 to 48 hours. The economics only work when you will almost never read the data, because retrieval costs can exceed storage savings if access is more frequent than expected. It is the right tier for compliance archives and long-term holds where the data must be retained but is rarely if ever accessed.

▸What is a pre-signed URL and why should I use it for file uploads?

A pre-signed URL is a time-limited, signed request your API generates that authorizes a client to PUT an object directly to S3 without routing bytes through your application server. Without it, every upload doubles your bandwidth cost and adds your own server as a bottleneck and single point of failure. The pattern (client requests a URL from your API, API returns the signed URL, client uploads directly to S3) is how most modern file-upload features are built.

▸How does S3 handle high request rates, and what key naming strategy avoids throttling?

S3 partitions a bucket's key space by prefix, so keys that all share the same prefix (such as 2026/01/15/...) land in the same partition and can saturate it above roughly 3,500 PUTs per second per prefix. To spread load, prepend a hash to the key or reverse a timestamp so writes are distributed across many partitions. Modern S3 auto-partitions when it detects traffic concentration, but deliberately diverse prefixes remain important at very high request rates.
