Storage Systems — Files, Blocks, and Objects
Block storage, file systems, object storage (S3), CDNs, and the cost/durability/throughput trade-offs that decide where your data lives.
The previous module covered databases — structured storage with indexes and queries. This module covers everything else you'll need to store: user uploads, video, backups, logs, ML datasets, and the bytes that don't fit neatly into rows. The decisions here are about durability, throughput, and cost, not query plans.
The three storage abstractions
Almost every storage product is one of three things:
flowchart TD
S[Storage abstractions] --> BLK[Block<br/>raw byte ranges]
S --> FS[File<br/>hierarchy + metadata]
S --> OBJ[Object<br/>key → blob + metadata]
BLK --> EBS[AWS EBS<br/>Local SSD<br/>iSCSI]
FS --> EFS[AWS EFS / NFS<br/>Local FS<br/>Lustre, GPFS]
OBJ --> S3[S3, GCS, Azure Blob<br/>R2, Backblaze B2]
style BLK fill:#ff6b1a,color:#0a0a0f
style FS fill:#0e7490,color:#fff
style OBJ fill:#15803d,color:#fff
Each layer is built on the one below: file systems sit on block devices; object stores often sit on giant cluster file systems. Picking the right abstraction is the first storage decision you make.
Block storage
A "block" is a fixed-size chunk (typically 4 KB) addressed by an offset. Block storage is what your operating system mounts as a "disk." It has no semantic awareness of files — just numbered blocks you can read or write.
| Property | Block storage |
|---|---|
| Granularity | 4 KB blocks |
| Addressing | Offset on a device |
| Latency | 0.1–1 ms (local SSD) |
| Throughput | 500 MB/s – 10 GB/s |
| Sharing | Usually one host at a time |
| Examples | AWS EBS, Azure Managed Disks, GCP PD, on-prem SAN |
Block storage is what you want when software needs fast random reads and writes under its own control. Databases such as Postgres and MySQL normally keep their data files on a local file system (ext4, XFS) that sits on a block volume, and they manage page layout, caching, and write ordering themselves for high throughput and predictable latency. VM root disks live here too. The trade-offs: a volume typically attaches to a single host, growing it is usually possible online but takes time and a file-system resize, and snapshots are point-in-time copies whose exact semantics vary by provider.
File systems
A file system organizes blocks into a tree of directories and files, with metadata (permissions, timestamps, size). Every laptop and server has a local one. Across machines, the choice depends on the scale and access patterns you need:
| Type | Examples | Use case |
|---|---|---|
| Local | ext4, XFS, APFS, NTFS | Default per-host |
| Network | NFS, SMB, AWS EFS, Azure Files | Shared across many hosts |
| Distributed/parallel | Lustre, GPFS, CephFS | HPC, ML training |
| In-cluster | HDFS | Hadoop ecosystem |
Reach for a file system when legacy software expects "a directory of files" — most ML training frameworks, build artifact pipelines, and render farms need POSIX semantics (locking, partial overwrites). Two things will bite you, though. First, lots of small files are the file system's worst nightmare: every file has metadata overhead, and 10 million 1 KB files is dramatically worse than 1,000 10 MB files. Second, shared file systems cache aggressively on clients, which means NFS writes from one machine may not be visible to another for a surprising amount of time.
Object storage
A flat namespace of key → blob. There are no real directories — the slashes in photos/2024/cat.jpg are part of the key, not actual path components. Each object has bytes, metadata (Content-Type, custom headers), and a version.
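To make the "no real directories" point concrete, here is a minimal in-memory sketch. A plain Python dict stands in for a bucket, and `list_keys` is a hypothetical helper (not an S3 API call) that shows how a prefix plus a delimiter can produce a folder-like listing from flat keys.

```python
# Toy model: a bucket is just a flat mapping of key -> bytes.
bucket = {
    "photos/2024/cat.jpg": b"...",
    "photos/2024/dog.jpg": b"...",
    "photos/2025/bird.jpg": b"...",
    "readme.txt": b"...",
}


def list_keys(bucket, prefix="", delimiter="/"):
    """Return (keys, common_prefixes) found directly under `prefix`.

    Keys that contain another delimiter after the prefix are rolled up into a
    "common prefix", which is what makes a flat namespace look like folders.
    """
    keys, common = [], set()
    for key in sorted(bucket):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common)


print(list_keys(bucket))                    # (['readme.txt'], ['photos/'])
print(list_keys(bucket, prefix="photos/"))  # ([], ['photos/2024/', 'photos/2025/'])
```

The "folders" exist only in the listing logic; renaming one would mean rewriting every key that shares the prefix.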
flowchart LR
C[Client] -->|"PUT photos/cat.jpg"| API[HTTPS API]
API --> META[(Metadata<br/>service)]
API --> SHARDS[(Object shards<br/>spread over thousands<br/>of disks)]
SHARDS -.replicate.-> SHARDS
style META fill:#ff6b1a,color:#0a0a0f
style SHARDS fill:#0e7490,color:#fff
| Property | Object storage |
|---|---|
| Granularity | Whole objects (no partial overwrites) |
| Addressing | URL: s3://bucket/key |
| Latency | 30–200 ms first byte |
| Throughput | Effectively unlimited (parallel) |
| Durability | 11 nines (99.999999999%) for S3 standard |
| Cost | $0.021–0.023 / GB-month for S3 Standard (volume tiers) |
| Examples | S3, GCS, Azure Blob, Cloudflare R2, Backblaze B2 |
Object storage is the right home for anything that's "an asset": user uploads, images, video, backups, ML datasets, static-site files, log archives, build artifacts. It is not the right tool for anything that needs random partial writes (you can't seek into an S3 object and change a few bytes; you rewrite the whole object), sub-millisecond reads of small byte ranges, or POSIX semantics.
Choosing between the three
The decision usually comes down to two questions, asked in order:
flowchart TD
Q1{"Does the software<br/>need low-latency<br/>random I/O?"}
Q1 -->|yes| BLK["Block storage<br/>(EBS, local SSD)"]
Q1 -->|no| Q2{"Does it need<br/>POSIX directories<br/>or shared mounts?"}
Q2 -->|yes| FS["File system<br/>(EFS, NFS, Lustre)"]
Q2 -->|no| OBJ["Object storage<br/>(S3, GCS — default choice)"]
style BLK fill:#ff6b1a,color:#0a0a0f
style FS fill:#0e7490,color:#fff
style OBJ fill:#15803d,color:#fff
When in doubt, choose object storage. It scales horizontally by default, is usually the cheapest per GB for durable storage, and needs no operational work to grow.
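The decision tree can be written down as a few lines of code. This is only a sketch of the two questions in the order the flowchart asks them; the function and parameter names are made up for illustration.

```python
def choose_storage(needs_low_latency_random_io: bool,
                   needs_posix_or_shared_mount: bool) -> str:
    """Walk the two questions in order; object storage is the fallback."""
    if needs_low_latency_random_io:
        return "block"
    if needs_posix_or_shared_mount:
        return "file"
    return "object"


print(choose_storage(True, False))    # database volume
print(choose_storage(False, True))    # shared training dataset mount
print(choose_storage(False, False))   # user uploads
```

Note that the first question wins: a workload that needs both random I/O and a POSIX tree still starts from a block volume, with a local file system on top.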
Why object storage won
For 90% of "where do I put this?" decisions in modern systems, the answer is object storage, and the reasons all trace back to one property: horizontal scaling without operational pain.
flowchart TD
A[Old way: NAS / SAN] -->|"capacity = buy bigger box"| B[Vertical scaling]
B --> C[Hits a ceiling]
C --> D[Migration project]
E[Object storage] -->|"capacity = automatic"| F[Pay-as-you-go]
F --> G[Effectively infinite]
style A fill:#ff2e88,color:#fff
style G fill:#15803d,color:#fff
With a NAS, you buy a box, fill it, buy a bigger box, and migrate. With S3-class storage, you upload a byte and the service finds room for it, storing it redundantly across multiple Availability Zones so that losing an entire AZ still leaves your data intact. The 11-nines durability figure (99.999999999%) is a design target rather than just a marketing number: S3 Standard keeps each object on multiple devices across at least three AZs and regularly verifies and repairs that redundancy.
Beyond durability, you also get throughput that scales by adding parallel uploads, lifecycle policies that automatically move objects to cheaper tiers as they age, built-in versioning so accidental deletes are recoverable, and native CDN integration so a bucket can become a static website with a single config flag.
The S3 (object storage) data model
Five concepts show up in every cloud's object store:
| Concept | What |
|---|---|
| Bucket | Top-level namespace, globally unique |
| Key | Object name within the bucket — looks like a path but isn't |
| Object | The bytes plus metadata |
| Version | An immutable snapshot of an object at a point in time |
| Storage class | Hot / warm / cold tier with different cost/latency |
Storage classes
| Class | Latency | $/GB-mo | Min duration | Use case |
|---|---|---|---|---|
| S3 Standard | ~30 ms | ~$0.023 | none | Hot, frequent access |
| S3 IA (Infrequent Access) | ~30 ms | ~$0.0125 | 30 days | Backups, last-30-day logs |
| S3 Glacier Instant | ~30 ms | ~$0.004 | 90 days | Archival but quickly retrievable |
| S3 Glacier Flexible | minutes–hours | ~$0.0036 | 90 days | Compliance archive |
| S3 Glacier Deep Archive | 12–48 hours | ~$0.00099 | 180 days | Long-term hold |
The most important number here isn't the storage price — it's the retrieval cost. Glacier retrievals can cost more than the storage savings if you read more often than expected. Use IA and Glacier for data you'll truly almost never read; then the economics work in your favor.
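A rough way to check whether a colder class pays off is to compute the read volume at which the retrieval fee cancels the storage saving. The sketch below uses the storage prices from the table; the $0.01/GB retrieval fee for IA is an assumed figure for illustration, and request fees and minimum-duration charges are ignored.

```python
# Storage prices come from the table above; the retrieval fee is an assumption.
STANDARD = {"storage_price": 0.023, "retrieval_price": 0.0}
IA = {"storage_price": 0.0125, "retrieval_price": 0.01}


def monthly_cost(gb_stored, gb_read, storage_price, retrieval_price):
    """Monthly bill in dollars for storing gb_stored and reading gb_read."""
    return gb_stored * storage_price + gb_read * retrieval_price


def break_even_read_fraction(hot, cold):
    """Fraction of the stored data read per month at which both classes cost the same."""
    saving_per_gb = hot["storage_price"] - cold["storage_price"]
    extra_read_fee_per_gb = cold["retrieval_price"] - hot["retrieval_price"]
    return saving_per_gb / extra_read_fee_per_gb


print(monthly_cost(1000, 100, **STANDARD))
print(monthly_cost(1000, 100, **IA))
print(break_even_read_fraction(STANDARD, IA))
```

Under these assumed prices IA stays cheaper until you read roughly the whole dataset every month. The same function with an archive class's higher retrieval fee gives a much lower break-even point, which is where the surprise bills come from.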
Multipart upload (the way to upload anything > 100 MB)
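A single S3 PUT works for objects up to 5 GB, but a failure partway through means starting over, which makes large single PUTs fragile on a flaky network. Above 5 GB multipart upload is required, and AWS recommends it from about 100 MB: split the object into parts (up to 10,000, each at least 5 MB except the last), upload them in parallel, and tell S3 to assemble the final object. Multipart raises the object size cap to S3's per-object maximum, long documented as 5 TB.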
sequenceDiagram
participant C as Client
participant S as S3
C->>S: CreateMultipartUpload<br/>(bucket, key) → UploadId
par parallel uploads
C->>S: UploadPart 1, partNumber=1 → ETag
C->>S: UploadPart 2, partNumber=2 → ETag
C->>S: UploadPart N, partNumber=N → ETag
end
C->>S: CompleteMultipartUpload<br/>(UploadId, [parts]) → final ETag
A part that fails can be retried independently — you don't re-upload the other 50. Uploading many parts concurrently saturates your available bandwidth.
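The splitting step can be sketched without any SDK. The planner below is a hypothetical helper (real SDKs do this for you) that respects the two S3 limits that matter here: at most 10,000 parts, and at least 5 MiB per part except the last. The 64 MiB default part size is an arbitrary choice.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB     # every part except the last must be at least 5 MiB
MAX_PARTS = 10_000     # S3 allows at most 10,000 parts per upload


def plan_parts(object_size: int, part_size: int = 64 * MIB):
    """Return a list of (part_number, offset, length) tuples covering the object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    # Grow the part size if the requested one would need more than MAX_PARTS parts.
    part_size = max(part_size, MIN_PART, math.ceil(object_size / MAX_PARTS))
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


parts = plan_parts(6 * 1024 * MIB)   # a 6 GiB video
print(len(parts), parts[0], parts[-1])
```

Each tuple maps to one UploadPart call that can be retried on its own; the part numbers and returned ETags are what CompleteMultipartUpload needs at the end.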
One operational detail that catches teams off guard: set a lifecycle rule to abort incomplete multipart uploads after N days. A client that starts an upload and crashes leaves orphaned parts in your bucket, and you pay for them silently.
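As a sketch, such a rule could look like the configuration below, written as a Python dict in the shape the S3 lifecycle API accepts (for example through boto3's `put_bucket_lifecycle_configuration`). The rule ID and the 7-day window are arbitrary choices for illustration.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},                # empty prefix: applies to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With a rule like this in place, parts from uploads that were never completed are cleaned up automatically instead of accruing storage charges.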
Pre-signed URLs (don't proxy your bytes through your app)
A common architectural trap: every user upload flows through your API server on the way to S3. You've effectively doubled your bandwidth costs and added your own server as a bottleneck and failure point.
The fix is a pre-signed URL: your API generates a time-limited signature that authorizes the client to PUT directly to S3, then hands that URL to the client.
flowchart LR
C[Client] -->|"1. request upload URL"| API[Your API]
API -->|"2. generate signed URL"| API
API -->|"3. return URL"| C
C -->|"4. PUT bytes directly"| S3[(S3)]
style API fill:#ff6b1a,color:#0a0a0f
style S3 fill:#15803d,color:#fff
The same pattern works for downloads: generate a signed URL valid for 5 minutes, return it to the client, and let the client fetch from S3 directly. Your app servers never touch the bytes. This is how most modern file-upload features are built: your API coordinates the handoff but doesn't carry the payload.
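To show the idea behind a signed URL, here is a deliberately simplified toy scheme built on an HMAC. It is not AWS Signature Version 4, and the host name, secret, and function names are all hypothetical; in practice you would let the SDK do this (boto3 exposes `generate_presigned_url`). The sketch only demonstrates the two properties that matter: the URL is bound to one method and key, and it expires.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"   # hypothetical; known to the API and the store, never to clients


def _signature(method: str, path: str, expires: int) -> str:
    message = f"{method}\n{path}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method: str, path: str, ttl_seconds: int = 300, now=None) -> str:
    """API side: authorize `method` on `path` until now + ttl_seconds."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(method, path, expires)})
    return f"https://storage.example.com{path}?{query}"


def verify_url(method: str, url: str, now=None) -> bool:
    """Storage side: accept only an unexpired URL whose signature matches."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    return hmac.compare_digest(sig, _signature(method, parsed.path, expires))


url = sign_url("PUT", "/uploads/cat.jpg", now=1000)
print(verify_url("PUT", url, now=1100))   # inside the 5-minute window
print(verify_url("GET", url, now=1100))   # wrong method
print(verify_url("PUT", url, now=2000))   # expired
```

Because the signature covers the method, path, and expiry, a client cannot reuse the URL for a different key or after the window closes, which is why the API can hand it out without proxying the bytes.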
Consistency in object storage
Before December 2020, S3 was eventually consistent for overwrites, deletes, and listings: a read immediately after an overwrite could return the old version. That produced genuinely surprising bugs for teams that assumed stronger guarantees.
Modern S3, GCS, and Azure Blob are all strongly consistent for read-after-write and read-after-delete. List operations are strongly consistent too. You no longer need to build retry-and-compare logic for single-region operations.
What's still not strongly consistent: cross-region replication. Replicating a bucket to another region is asynchronous and lags by seconds to minutes. Don't read from a replica immediately after writing to the primary and expect to see the new version.
Throughput characteristics
| Pattern | Approx. throughput |
|---|---|
| Single-thread sequential PUT (1 MB parts) | ~50–100 MB/s |
| 100 parallel PUTs to same bucket | ~5–10 GB/s |
| Single-thread GET | ~80–120 MB/s |
| 1000 parallel GETs from same prefix | ~100 GB/s |
| Cross-region replication | seconds–minutes lag |
One subtlety worth knowing: behind the scenes, S3 partitions a bucket's index by key prefix. If all your keys start with 2026/01/15/..., they all land on the same partition and you can saturate it. Modern S3 splits hot prefixes automatically as it detects sustained traffic, but that takes time, so during a sudden ramp to very high request rates (above roughly 3,500 PUTs or 5,500 GETs per second per prefix) you can still see 503 SlowDown responses. Diverse prefixes help: prepend a hash, reverse a timestamp, anything that spreads the writes across partitions.
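One way to get diverse prefixes is to prepend a short hash of the key. The helper below is a hypothetical sketch; the choice of SHA-256 and of two hex characters (256 possible prefixes) is arbitrary.

```python
import hashlib


def spread_key(key: str, prefix_chars: int = 2) -> str:
    """Prepend a short, deterministic hash so similar keys get different prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}/{key}"


for name in ("2026/01/15/a.jpg", "2026/01/15/b.jpg", "2026/01/15/c.jpg"):
    print(spread_key(name))
```

The cost of this design is that you can no longer list "everything from 2026/01/15" with a single prefix query, so it pairs naturally with keeping an index of keys in a database.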
Content delivery networks (CDNs)
Object storage gives you durable, cheap storage. A CDN gives you fast delivery by caching your content at dozens of edge locations around the world.
flowchart LR
U1[User Tokyo] --> E1[CDN edge: Tokyo]
U2[User Mumbai] --> E2[CDN edge: Mumbai]
U3[User Sao Paulo] --> E3[CDN edge: Sao Paulo]
E1 -.miss.-> O[(S3 origin: us-east-1)]
E2 -.miss.-> O
E3 -.miss.-> O
style E1 fill:#ff6b1a,color:#0a0a0f
style E2 fill:#ff6b1a,color:#0a0a0f
style E3 fill:#ff6b1a,color:#0a0a0f
style O fill:#0e7490,color:#fff
The math: for a user whose round trip to your origin is 200 ms, a cold HTTPS fetch of a 200 KB image spends at least two round trips (about 400 ms) on the TCP and TLS handshakes before the request is even sent, then more round trips on the request and the transfer. From a CDN edge 5 ms away, every one of those round trips is 40× shorter. For any content that's cacheable and served globally, putting a CDN in front of S3 is almost always the right call.
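A back-of-the-envelope model of this comparison is sketched below. It assumes a cold connection that needs two setup round trips (TCP plus a TLS 1.3 handshake), one round trip for the request, and a 50 Mbps client link; it ignores DNS, TCP slow start, and server time, so treat the outputs as rough estimates.

```python
def time_to_fetch_ms(rtt_ms: float, size_kb: float, bandwidth_mbps: float,
                     setup_round_trips: int = 2) -> float:
    """Rough fetch time: connection setup + one request round trip + transfer."""
    # size in kilobits divided by link speed in kilobits per second, in ms
    transfer_ms = (size_kb * 8) / (bandwidth_mbps * 1000) * 1000
    return (setup_round_trips + 1) * rtt_ms + transfer_ms


origin = time_to_fetch_ms(rtt_ms=200, size_kb=200, bandwidth_mbps=50)
edge = time_to_fetch_ms(rtt_ms=5, size_kb=200, bandwidth_mbps=50)
print(f"origin: {origin:.0f} ms, edge: {edge:.0f} ms, ratio: {origin / edge:.1f}x")
```

With these assumed numbers the round trips dominate the origin fetch while the transfer dominates the edge fetch, so the end-to-end gain is about an order of magnitude rather than the full ratio of the two round-trip times.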
Set your Cache-Control headers deliberately:
- public, max-age=31536000, immutable (for content-hashed assets like /static/main.a8c7.js): never changes, cache forever.
- public, max-age=300, s-maxage=3600 (for HTML you re-render hourly): short browser TTL, longer CDN TTL.
- no-store (for personalized HTML like account dashboards): don't let the CDN serve one user's data to another.
The more advanced trick is stale-while-revalidate: serve the cached version instantly, then fetch a fresh copy in the background and swap it in for the next request. Users always see a fast response; freshness comes for free.
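The behavior can be sketched as a tiny cache with three states: fresh, stale but still servable, and too stale. The class below is an illustrative model, not how any particular CDN is implemented; the clock is injected so the behavior is easy to follow, and the "background" refresh is done inline for simplicity. It corresponds to a header like `Cache-Control: max-age=300, stale-while-revalidate=60`.

```python
class SWRCache:
    """Tiny stale-while-revalidate cache with an injectable clock."""

    def __init__(self, fetch, max_age, stale_while_revalidate, clock):
        self.fetch = fetch                  # callable: key -> value (the origin)
        self.max_age = max_age
        self.swr = stale_while_revalidate
        self.clock = clock
        self.entries = {}                   # key -> (value, stored_at)

    def _store(self, key):
        value = self.fetch(key)
        self.entries[key] = (value, self.clock())
        return value

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return self._store(key)         # miss: must wait for the origin
        value, stored_at = entry
        age = self.clock() - stored_at
        if age <= self.max_age:
            return value                    # fresh hit
        if age <= self.max_age + self.swr:
            self._store(key)                # a real CDN would refresh in the background
            return value                    # serve the stale copy right away
        return self._store(key)             # too stale: fetch synchronously


now = [0]
calls = []


def origin(key):
    calls.append(key)
    return f"v{len(calls)}"


cache = SWRCache(origin, max_age=300, stale_while_revalidate=60, clock=lambda: now[0])
print(cache.get("/index.html"))   # first request goes to the origin
now[0] = 330
print(cache.get("/index.html"))   # stale copy served, refresh triggered
print(cache.get("/index.html"))   # refreshed copy
```

The request at t=330 still returns the old value, and only the following request sees the refreshed one, which is exactly the trade the directive makes: latency first, freshness one request later.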
Tiered storage in practice
Real systems don't pick one storage tier and live there — they chain tiers by access frequency:
flowchart LR
HOT[Hot: in-memory] -->|"<1ms"| WARM[Warm: SSD / DB]
WARM -->|"1-50ms"| COLD[Cold: object storage]
COLD -->|"100ms+"| ARCH[Archive: Glacier]
style HOT fill:#ff2e88,color:#fff
style WARM fill:#ff6b1a,color:#0a0a0f
style COLD fill:#0e7490,color:#fff
style ARCH fill:#a855f7,color:#fff
A chat product in the style of Slack is a useful illustration of this ladder (the specifics here are illustrative, not a description of Slack's actual architecture). The last hour of messages lives in an in-memory store like Redis (hot). The last 30 days live in a database such as Postgres or MySQL (warm). Older messages move to object storage, with just a metadata pointer left in the DB (cold). Beyond the compliance window, data is deleted or moved to an archive class like Glacier (archive). As a rule of thumb, each tier is roughly 10× cheaper and 10× slower than the one above. Designing the lifecycle (when to move data, and on what trigger) is its own discipline, worth thinking through for any long-lived data store.
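The ladder can be expressed as a lookup by data age. This is a minimal sketch: the one-hour and 30-day thresholds follow the example above, while the 365-day compliance window and the tier labels are assumptions you would tune per workload.

```python
from datetime import timedelta

# (upper age limit, tier) pairs, checked from hottest to coldest.
TIERS = [
    (timedelta(hours=1), "hot: in-memory cache"),
    (timedelta(days=30), "warm: database"),
    (timedelta(days=365), "cold: object storage"),   # assumed compliance window
]


def tier_for(age: timedelta) -> str:
    """Return the tier where data of the given age should live."""
    for limit, tier in TIERS:
        if age <= limit:
            return tier
    return "archive: delete or move to an archive class"


print(tier_for(timedelta(minutes=5)))
print(tier_for(timedelta(days=7)))
print(tier_for(timedelta(days=200)))
print(tier_for(timedelta(days=800)))
```

In a real system the same table would usually drive a background job or a lifecycle policy rather than being evaluated on every read.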
File system specifics worth knowing
If you end up working with actual files (not blobs in S3), a few pitfalls to keep in mind:
inode exhaustion: a file system can run out of inodes before it runs out of disk bytes. df -i shows your inode usage. This is the failure mode when you store millions of tiny files — each one consumes an inode, and once they're gone the disk appears "full" even though there's plenty of space.
fsync semantics: until you call fsync(), "written" data sits in the OS page cache and can be lost in a crash or power failure. Databases fsync their write-ahead log at commit time (often batching several commits into one call). That is the durability cost you pay, and it's why database writes feel slower than they "should."
Atomic rename: on POSIX, rename() is atomic within a single file system. This is the primitive behind nearly every "atomic write" you'll see: write to a temp file in the same directory, fsync it, then rename it over the target. Readers see either the complete old file or the complete new one, never a half-written state. For the rename itself to survive a crash, fsync the containing directory as well.
Hard vs symbolic links: hard links share an inode; symlinks are pointers to a path. Backup and copy tools handle them differently: GNU tar stores additional hard links to a file as links by default (--hard-dereference stores full copies instead), while rsync copies each hard link as an independent file unless you pass -H.
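Two of these pitfalls translate directly into code. The sketch below assumes a POSIX system (directory fsync and `os.statvfs` are not available on Windows); `atomic_write` and `inode_usage` are hypothetical helper names.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same file system for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # force file contents to stable storage
        os.replace(tmp_path, path)      # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def inode_usage(path: str = "/"):
    """Fraction of inodes in use (what df -i reports), or None if not reported."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None
    return 1 - stats.f_ffree / stats.f_files


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "config.json")
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as handle:
        print(handle.read())
    print(inode_usage(workdir))
```

The two fsync calls are what make the write durable; dropping them keeps the rename atomic for concurrent readers but gives no guarantee about what is on disk after a power loss.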
Block storage in cloud
When provisioning EBS-class block storage, you're picking on three axes: IOPS, throughput, and cost.
| Type | Use case | IOPS | Throughput | $/GB-mo |
|---|---|---|---|---|
| gp3 (general SSD) | Boot disks, most apps | 3,000–16,000 | 125–1,000 MB/s | ~$0.08 |
| io2 (provisioned SSD) | High-performance DBs | up to 256,000 | up to 4 GB/s | ~$0.125 + IOPS |
| st1 (throughput HDD) | Big sequential workloads (logs) | low | up to 500 MB/s | ~$0.045 |
| sc1 (cold HDD) | Archival of files needing FS | very low | up to 250 MB/s | ~$0.015 |
Latency is single-digit milliseconds across all SSD options (network-attached EBS typically 1–3 ms average). HDD is roughly 10× slower for random I/O — not a problem for sequential log writes, but disqualifying for a database.
Anti-patterns we see all the time
Storing user uploads on your app server's local disk. The disk fills up, and the moment you scale horizontally only one server has the file. Use object storage from day one — the migration later is painful.
Storing video bytes in Postgres bytea columns. This bloats the database, slows backups to a crawl, and destroys the buffer pool's effectiveness. Store video in S3; store the URL in Postgres.
Listing a million-key bucket on every request. S3 LIST returns at most 1,000 keys per call, so enumerating a bucket takes O(N) paginated requests, and those requests are rate-limited. Either index your keys in DynamoDB or Postgres so you can do point lookups, or design your key schema so prefix listing always returns a small result.
Caching S3 GETs in Redis. Your CDN already does that, with global distribution and hardware purpose-built for it. Cache the parsed result of an expensive computation, not raw bytes that the CDN will handle better.
Worked example: design photo upload
Spec: 100M users, each uploading an average of 5 photos per day, 500 KB per photo, 90-day hot retention, indefinite cold retention.
Writes:
100M × 5 = 500M uploads/day
500M / 86400 ≈ 5,800 uploads/sec average
Peak ~3× = 17,000 uploads/sec
Bandwidth (writes):
5,800 × 500 KB = 2.9 GB/sec average
17,000 × 500 KB = 8.5 GB/sec peak
Storage:
500M × 500 KB = 250 TB/day
90 days hot = 22.5 PB hot
At $0.023/GB-month (first-50-TB tier rate; higher-volume tiers are slightly cheaper) ≈ $520K/month for hot
The same 22.5 PB in IA at $0.0125/GB-month ≈ $280K/month → savings ~$240K/month
Architecture:
Client → pre-signed PUT → S3 (hot)
Metadata → Postgres (user_id, photo_id, key, created_at, content_type)
CDN in front of S3 for reads
Lifecycle rule: after 90 days, transition to S3 IA
Lifecycle rule: after 1 year, transition to Glacier IR
That sketch tells you the system has one storage class for live serving, two for older data, and the application layer (Postgres) holds only metadata. The bytes never pass through your app servers.
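The estimate is easy to reproduce and re-run with different assumptions. The script below recomputes the numbers from the spec, using decimal units (1 GB = 10^9 bytes) and the list prices quoted earlier; the 3× peak factor is the same assumption as in the sketch.

```python
USERS = 100_000_000
PHOTOS_PER_USER_PER_DAY = 5
PHOTO_BYTES = 500_000          # 500 KB, decimal units
PEAK_FACTOR = 3                # assumed peak-to-average ratio
HOT_DAYS = 90
STANDARD_PER_GB = 0.023        # $/GB-month
IA_PER_GB = 0.0125             # $/GB-month

uploads_per_day = USERS * PHOTOS_PER_USER_PER_DAY
uploads_per_sec = uploads_per_day / 86_400
write_gb_per_sec = uploads_per_sec * PHOTO_BYTES / 1e9
tb_per_day = uploads_per_day * PHOTO_BYTES / 1e12
hot_gb = tb_per_day * HOT_DAYS * 1_000
hot_cost = hot_gb * STANDARD_PER_GB
ia_saving = hot_gb * (STANDARD_PER_GB - IA_PER_GB)

print(f"{uploads_per_sec:,.0f} uploads/s average, {uploads_per_sec * PEAK_FACTOR:,.0f} peak")
print(f"{write_gb_per_sec:.1f} GB/s average write bandwidth")
print(f"{tb_per_day:,.0f} TB/day, {hot_gb / 1e6:.1f} PB over {HOT_DAYS} days")
print(f"${hot_cost:,.0f}/month in Standard, ${ia_saving:,.0f}/month saved per 90 days of data moved to IA")
```

Changing a single constant, such as the photo size or the retention window, shows immediately which inputs dominate the bill.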
Things you should now be able to answer
- Why is object storage almost always better than NFS for "user uploads"?
- A user uploads a 6 GB video. What's wrong with a single S3 PUT? What do you do instead?
- Your app needs to deliver a private file to a user. How do you avoid proxying it through your server?
- A bucket is suddenly returning 503 SlowDown errors. What might be wrong with your key naming?
- Why does S3 have 11 nines of durability but only 99.99% availability?
- A startup stores all user-uploaded video in Postgres bytea. What three things will go wrong first?
Frequently asked questions
What is object storage and how does it differ from a file system?
Object storage is a flat namespace mapping a key to a blob plus metadata, with no real directory hierarchy — the slashes in a key like photos/2024/cat.jpg are just part of the key string. Unlike a file system, it has no partial-write support, no POSIX locking, and exposes data over an HTTPS API rather than a mount point. The trade-off is that you get 11-nines durability, effectively unlimited parallel throughput, and pay-as-you-go capacity with zero operational overhead.
When should I use block storage instead of object storage?
Use block storage when software needs low-latency random I/O. Databases like Postgres and MySQL keep their data files on a file system backed by a block volume and depend on its low, predictable latency (roughly 0.1–1 ms on local SSD, a few milliseconds on network-attached volumes). Block storage (e.g., AWS EBS gp3) attaches to a single host and gives the storage engine fine-grained control over what is written where. If the workload does not need random partial writes or single-digit-millisecond latency, object storage is the cheaper default.
What is S3 Glacier Deep Archive and when does it make economic sense?
S3 Glacier Deep Archive is the coldest storage class at roughly $0.00099 per GB-month, with a minimum storage duration of 180 days and retrieval times of 12 to 48 hours. The economics only work when you will almost never read the data, because retrieval costs can exceed storage savings if access is more frequent than expected. It is the right tier for compliance archives and long-term holds where the data must be retained but is rarely if ever accessed.
What is a pre-signed URL and why should I use it for file uploads?
A pre-signed URL is a time-limited, signed request your API generates that authorizes a client to PUT an object directly to S3 without routing bytes through your application server. Without it, every upload doubles your bandwidth cost and adds your own server as a bottleneck and single point of failure. The pattern (client requests a URL from your API, API returns the signed URL, client uploads directly to S3) is how most modern file-upload features are built.
How does S3 handle high request rates, and what key naming strategy avoids throttling?
S3 partitions a bucket by key prefix, so keys that all share the same prefix (such as 2026/01/15/...) land on the same partition and can saturate it above roughly 3,500 PUTs per second per prefix. To spread load, prepend a hash to the key or reverse a timestamp so writes are distributed across many partitions. Modern S3 splits hot prefixes automatically when it detects sustained traffic concentration, but deliberately diverse prefixes remain important at very high request rates.