Advanced · asked at Google, Dropbox, Microsoft, Amazon

Design Google Drive / Dropbox

File sync that works on every device, blob storage, deduplication, conflict resolution, and how to do all of it efficiently.

17 min read · 2026-02-16 · Ironclad Academy

#interview #storage #sync #file-system #distributed-systems

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Google Drive, Dropbox, and iCloud Drive all solve the same core problem: you save a file on one device and it appears — correctly, completely, quickly — on every other device you own. The product looks deceptively simple. The engineering underneath is not.

At the surface level, a file-sync service is just a place to put files and get them back. At scale — 1 billion users, 100 billion files, petabytes of data spread across phones, laptops, and a fleet of datacenters — three tensions make this hard. First, bandwidth efficiency: re-uploading a 1 GB video every time a user trims two seconds from it is neither fast nor cheap. The system has to detect and transmit only the bytes that actually changed. Second, sync correctness: a user edits the same file on their laptop and their phone while one is offline. When both devices reconnect, neither version can be silently discarded. Third, metadata scale: file hierarchy, version history, sharing permissions, and sync cursors for billions of files have to be durably stored, strongly consistent, and queryable in milliseconds.

These tensions are why the interview question exists. Getting from "upload a file to S3" to "Dropbox/Drive at scale" is a journey through chunk-level content addressing, delta sync protocols, conflict resolution strategies, and global fan-out — each step forced by a specific failure mode of the previous approach.

The central insight that unlocks most of the design: a file is not a single blob. It is a list of chunk hashes. Store each unique chunk once, globally, and the system gets cross-user deduplication, within-user delta sync, and cheap versioning almost for free. Everything else — the sync protocol, the conflict resolution, the garbage collection — is scaffolding around that one idea.

Functional requirements

Upload, download, share files.
Sync changes across multiple devices.
Versioning — recover deleted files, see file history.
Sharing — give other users read/write access.
Offline editing → sync when back online.

Non-functional

1B users, 100B files.
Files up to 5 GB (free tier).
p99 sync detection < 30 seconds when both devices online.
Durable forever (until user deletes).
Reasonable bandwidth — only re-upload what changed.

Capacity

Dimension	Estimate	How we got there
Total files	100B	Given
Avg file size	500 KB	Heavy long tail of small files; videos and big files exist
Total storage	~50 PB	`100B × 500 KB`
Upload rate	~12k/sec	`1B uploads/day ÷ 86,400 s`
Download rate	~120k/sec	`10B downloads/day ÷ 86,400 s`
Edit rate	~6k/sec	`500M edits/day ÷ 86,400 s`

Takeaway: At ~50 PB raw, the storage bill is manageable — the bottleneck is sync bandwidth; cross-user dedup (30–50%+) is the highest-leverage knob.
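The estimates in the table can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the daily volumes are the ones from the table, decimal units are assumed (1 PB = 10^15 bytes), and the dedup range is the 30–50% figure quoted above rather than a measured value.

```python
SECONDS_PER_DAY = 86_400

total_files = 100e9        # 100B files (given)
avg_file_bytes = 500e3     # 500 KB average, decimal units assumed

total_storage_pb = total_files * avg_file_bytes / 1e15
uploads_per_sec = 1e9 / SECONDS_PER_DAY      # 1B uploads/day
downloads_per_sec = 10e9 / SECONDS_PER_DAY   # 10B downloads/day
edits_per_sec = 500e6 / SECONDS_PER_DAY      # 500M edits/day

# Stored bytes if dedup removes 50% (best case quoted) or 30% (worst case quoted).
stored_pb_range = (total_storage_pb * 0.5, total_storage_pb * 0.7)

print(f"raw storage: ~{total_storage_pb:.0f} PB")
print(f"uploads:     ~{uploads_per_sec:,.0f}/s")
print(f"downloads:   ~{downloads_per_sec:,.0f}/s")
print(f"edits:       ~{edits_per_sec:,.0f}/s")
print(f"after dedup: ~{stored_pb_range[0]:.0f}-{stored_pb_range[1]:.0f} PB")
```

The table rounds these figures up (for example, roughly 116k downloads per second becomes ~120k), which is fine at this level of precision.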

The challenge is not the storage cost (S3 makes that boring) — it's the sync protocol efficiency.

Building up to the design

Drive/Dropbox starts looking like "S3 with a UI" and gets harder every minute. Walking from "upload to a server" to the actual production design makes the trade-offs visible.

V1: Upload the whole file every time

Client → POST /files (multipart upload) → server saves to S3 → record in DB.

This works as a file backup. For small files and occasional saves it's fine. But the moment someone edits a 1 GB video — adding a title card, trimming two seconds — they re-upload the whole gigabyte. On a flaky home connection, that upload restarts from zero on every drop. Bandwidth becomes the dominant cost for both the user and you.

V2: Chunk the file, upload chunks

Split the file into 4 MB chunks and upload each independently. A failure mid-upload means resuming from the last successful chunk, not from scratch. Big files become uploadable on bad networks.

The problem that surfaces next: if a million users each have a copy of the same office_template.docx and all of them upload it in chunks, you still store the same bytes a million times.

V3: Content-addressed chunks (dedup)

Hash each chunk (SHA-256). Before uploading, ask the server: "do you already have hash X?" If yes, skip the upload entirely — just record that this file version uses chunk X. The file itself becomes metadata: report.pdf → [hash1, hash2, hash3, …].

That one change yields significant storage savings from dedup (commonly cited at 30–50%+ at scale). Rename a 50 MB video: no chunks re-upload, just a metadata pointer update. A million users sharing the same template doc: stored once.

The new problem is edits. Insert one byte at the start of a file, and fixed-size 4 MB chunking shifts every subsequent boundary — you lose all dedup downstream of the insertion. You need chunk boundaries that are stable under edits, which means content-defined chunking. More on that shortly.

V4: Sync protocol — push changes to other devices

When a file changes on one device, the user's other devices need to know about it. Polling every few seconds works but burns battery and hammers the server. A better path: each device holds an open connection (WebSocket or long-poll) to a notification channel. The server emits an event when a file the user owns changes; the client wakes up, pulls the new metadata, and fetches any missing chunks. The result is a "save on laptop, phone sees it in under 30 seconds" experience.

Offline edits are where it gets interesting. Two devices change the same file while one is offline. When the offline device reconnects, the server holds a newer version from the other device. Now what?

V5: Versioning + conflict resolution

Every change creates a new version. A version is just a list of chunk hashes — cheap pointer storage, not a copy of the file. When a client uploads a change against a stale base version, the server detects the conflict and saves both as a "conflicted copy" (Dropbox style) or merges if it recognizes the format (Google Docs style). The user gets to choose the winner, and neither version is ever silently discarded.

Full version history is a natural byproduct. "Restore deleted file" becomes "promote version N-1 back to current."

V6: Sharing and ACLs

A file is owned by a user; a share is a grant of the form (file_id, granted_to_user_id, perm). Change notifications fan out to anyone with read access. Permission checks live at the metadata layer: chunks themselves are content-addressed and global, so access is enforced by per-file ACLs when a client fetches the metadata that lists them.

V7: Production Drive

V3 (chunks + dedup) + V4 (sync) + V5 (versions) + V6 (sharing) + chunk storage in S3/GCS, metadata in a globally-replicated SQL store (Spanner-like), per-chunk encryption, lifecycle policies to cold storage for old versions, and a notification fabric that scales to 1B+ devices.

flowchart LR
    V1[V1: whole file upload<br/>1 GB resends] --> V2[V2: + chunked upload<br/>resumable]
    V2 --> V3[V3: + content-hash dedup<br/>30–50%+ savings]
    V3 --> V4[V4: + sync protocol<br/>multi-device]
    V4 --> V5[V5: + versions + conflicts<br/>data safety]
    V5 --> V6[V6: + sharing/ACLs<br/>collaboration]
    V6 --> V7[V7: production Drive]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V7 fill:#a855f7,color:#fff

Architecture

flowchart TD
    USER[User devices] -->|API| APIGW[API Gateway]
    APIGW --> META[Metadata Service]
    APIGW --> UP[Upload Service]
    APIGW --> NOTIF[Notification Service]

    META --> METADB[(Metadata DB<br/>Spanner / sharded SQL)]

    UP --> CHUNKER[Chunker]
    CHUNKER --> DEDUP[Dedup Service]
    DEDUP --> CHUNKDB[(Chunk Index<br/>hash → location)]
    DEDUP --> BLOB[(Blob Storage<br/>S3 / GCS)]

    META --> KAFKA[Kafka events]
    KAFKA --> NOTIF

    NOTIF --> PUSH[Push to other devices]

    style META fill:#ff6b1a,color:#0a0a0f
    style DEDUP fill:#15803d,color:#fff
    style BLOB fill:#0e7490,color:#fff

Key insight: chunk and deduplicate

A file is not stored as a single blob. It's split into ~4 MB chunks, each chunk stored once (deduplicated by hash), and the file metadata records which chunks make it up.

flowchart TD
    F[File: report.pdf<br/>20 MB] --> SPLIT[Split into 5 chunks]
    SPLIT --> H1[chunk1 hash=abc...<br/>~4MB]
    SPLIT --> H2[chunk2 hash=def...<br/>~4MB]
    SPLIT --> H3[chunk3 hash=ghi...<br/>~4MB]
    SPLIT --> H4[chunk4 hash=jkl...<br/>~4MB]
    SPLIT --> H5[chunk5 hash=mno...<br/>~4MB]
    H1 --> S[(Blob storage)]
    H2 --> S
    H3 --> S
    H4 --> S
    H5 --> S
    style SPLIT fill:#ff6b1a,color:#0a0a0f

Three things fall out of this for free:

Cross-user dedup: 1M users with the same office_template.docx store it once.
Within-user dedup: rename a 50 MB video and no bytes re-upload — just the metadata pointer changes.
Delta sync: edit the last page of a 100-page PDF and only the affected chunk(s) go over the wire.

Dedup at scale yields substantial savings — commonly cited in the 30–50%+ range depending on workload, with cross-user dedup of common file types (templates, system libraries, media files) being the dominant driver.
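A toy in-memory store makes the three properties concrete. This is a minimal sketch, not how any production system is built: `ChunkStore`, `put_file`, and `get_file` are hypothetical names, chunks are fixed-size, and the demo uses 4 KB chunks so the numbers stay small.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # ~4 MB, as in the text


class ChunkStore:
    """Toy content-addressed store: SHA-256 hex digest -> chunk bytes."""

    def __init__(self):
        self.blobs = {}
        self.bytes_stored = 0

    def put_file(self, data, chunk_size=CHUNK_SIZE):
        """Store any unseen chunks and return the file's manifest (ordered hashes)."""
        manifest = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blobs:          # dedup: each unique chunk stored once
                self.blobs[digest] = chunk
                self.bytes_stored += len(chunk)
            manifest.append(digest)
        return manifest

    def get_file(self, manifest):
        return b"".join(self.blobs[digest] for digest in manifest)


store = ChunkStore()
template = b"".join(bytes([i]) * 4096 for i in range(4))   # four distinct 4 KB chunks

alice = store.put_file(template, chunk_size=4096)
bob = store.put_file(template, chunk_size=4096)            # cross-user dedup: nothing new stored
renamed = list(alice)                                      # a rename only copies the manifest
edited = template[:-1] + b"!"                              # touch the last chunk only
alice_v2 = store.put_file(edited, chunk_size=4096)         # delta: one new chunk stored

print(store.bytes_stored)   # bytes physically stored across all of the above
```

Two users, a rename, and an edit together cost five chunks of storage instead of sixteen, and every version remains fully reconstructable from its manifest.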

The chunking algorithm

For maximum dedup, use content-defined chunking (CDC). A rolling hash — typically Rabin fingerprinting — slides over the byte stream and marks a chunk boundary whenever the hash value matches a pattern (e.g., the low 13 bits are zero). Because boundaries are determined by local content rather than fixed offsets, inserting bytes near the start of a file only changes the chunk or two around the insertion point; all downstream chunks are untouched and still hit the dedup cache.

Rabin fingerprinting is computationally cheap: each new hash is derived from the previous one by shifting out the oldest byte and shifting in the newest — O(1) per byte. Average chunk size is controlled by the bitmask: 13 bits → ~8 KB average, 22 bits → ~4 MB average.
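The edit-stability claim can be checked with a small experiment. The sketch below swaps in a plain polynomial rolling hash (Rabin-Karp style) for true Rabin fingerprinting, uses an 11-bit mask (about 2 KB average chunks) to keep the demo fast, and omits the minimum and maximum chunk sizes a real chunker would enforce. All constants and function names are assumptions for illustration.

```python
import hashlib
import random

WINDOW = 48                      # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 61) - 1
MASK = (1 << 11) - 1             # cut when the low 11 bits are zero: ~2 KB average


def cdc_chunks(data):
    """Split data at content-defined boundaries (no min/max size, for clarity)."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)                  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD    # drop the oldest byte, O(1)
        h = (h * BASE + byte) % MOD                   # add the newest byte, O(1)
        if i + 1 >= WINDOW and (h & MASK) == 0:       # boundary depends only on the window
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=2048):
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(100_000))
edited = b"8 bytes!" + original                       # insert at the very start

cdc_before, cdc_after = digests(cdc_chunks(original)), digests(cdc_chunks(edited))
fix_before, fix_after = digests(fixed_chunks(original)), digests(fixed_chunks(edited))
print(f"CDC:   {len(cdc_before & cdc_after)} of {len(cdc_before)} chunks reused")
print(f"fixed: {len(fix_before & fix_after)} of {len(fix_before)} chunks reused")
```

Because a boundary depends only on the last `WINDOW` bytes, every original chunk except the first reappears unchanged after the insertion, while the fixed-size split loses its alignment for the whole file.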

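Fixed-size 4 MB chunks are the simpler alternative. They are easier to implement and handle appends and in-place overwrites well, but any insertion or deletion shifts every downstream boundary and defeats dedup for the rest of the file. They are still a good starting point for an interview answer.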

Upload flow

Here is the full upload sequence, from client to blob store. Notice the "which hashes do you need?" probe that gates how much actually goes over the wire.

sequenceDiagram
    participant Client
    participant Upload Svc
    participant Dedup
    participant Blob
    participant Meta
    Client->>Client: split into chunks
    Client->>Client: hash each chunk
    Client->>Upload Svc: list of (chunk_hash, size)
    Upload Svc->>Dedup: which hashes are new?
    Dedup-->>Upload Svc: chunks 2 and 4 are new
    Upload Svc-->>Client: signed URLs for chunks 2 & 4
    Client->>Blob: PUT chunk 2
    Client->>Blob: PUT chunk 4
    Client->>Upload Svc: complete
    Upload Svc->>Meta: register file -> [hash1, hash2, ...]
    Meta-->>Upload Svc: file_id, version
    Upload Svc-->>Client: 201 Created

For a small edit on a 1 GB file, only one or two chunks are "new", so the client uploads a few megabytes (one or two ~4 MB chunks) instead of a gigabyte.
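The probe-then-upload sequence can be sketched with an in-memory stand-in for the Upload Service, Dedup, and Blob store. `UploadServer`, `missing`, `put_chunk`, `commit`, and `upload` are hypothetical names, and the direct method calls stand in for HTTP requests and signed-URL PUTs.

```python
import hashlib


class UploadServer:
    """In-memory stand-in for the Upload Service + Dedup + Blob store."""

    def __init__(self):
        self.chunks = {}     # hash -> bytes
        self.files = {}      # name -> list of versions, each an ordered hash list

    def missing(self, hashes):
        """The probe: which of these hashes does the server not have yet?"""
        return [h for h in dict.fromkeys(hashes) if h not in self.chunks]

    def put_chunk(self, chunk_hash, data):
        if hashlib.sha256(data).hexdigest() != chunk_hash:
            raise ValueError("chunk does not match its hash")
        self.chunks[chunk_hash] = data

    def commit(self, name, hashes):
        if any(h not in self.chunks for h in hashes):
            raise ValueError("commit references chunks that were never uploaded")
        self.files.setdefault(name, []).append(list(hashes))
        return len(self.files[name])          # 1-based version number


def upload(server, name, data, chunk_size):
    """Client side: hash locally, probe, send only what is missing, then commit."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_hash = dict(zip(hashes, chunks))
    bytes_sent = 0
    for h in server.missing(hashes):
        server.put_chunk(h, by_hash[h])
        bytes_sent += len(by_hash[h])
    return server.commit(name, hashes), bytes_sent


server = UploadServer()
v1 = b"".join(bytes([i]) * 1024 for i in range(5))     # five distinct 1 KB chunks
v2 = bytearray(v1)
v2[1024] = 0xFF                                        # touch chunk 2
v2[3 * 1024] = 0xFF                                    # touch chunk 4

first = upload(server, "report.pdf", v1, chunk_size=1024)
second = upload(server, "report.pdf", bytes(v2), chunk_size=1024)
print(first, second)
```

The same loop gives resumability for free: if the client crashes halfway, rerunning `upload` probes again and sends only the chunks that never arrived.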

Download flow

sequenceDiagram
    participant Client
    participant Download API
    participant Meta
    participant CDN
    participant Blob
    Client->>Download API: GET /files/123
    Download API->>Meta: file_id 123
    Meta-->>Download API: chunks: [hash1, ...], permissions
    Download API-->>Client: list of (chunk_hash, signed_url)
    Client->>CDN: GET hash1
    CDN-->>Client: chunk
    Client->>Client: assemble file

CDN serves frequently-accessed chunks fast. For private files, signed URLs expire after a short window so the CDN can't serve them to unauthenticated requests.
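One common way to build expiring signed URLs is an HMAC over the chunk hash and an expiry time. The sketch below assumes a made-up URL layout (`/chunks/<hash>?expires=...&sig=...`) and a hard-coded demo key; real CDNs each define their own signing format, so treat this only as an illustration of the idea.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlparse

SECRET = b"demo-signing-key"   # hypothetical; a real key would live in a secret manager


def _signature(chunk_hash, expires_at):
    message = f"{chunk_hash}:{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_chunk_url(chunk_hash, expires_at):
    """Issued by the Download API after the permission check passes."""
    sig = _signature(chunk_hash, expires_at)
    return f"/chunks/{chunk_hash}?expires={expires_at}&sig={sig}"


def verify_chunk_url(url, now):
    """Run at the edge: reject expired or tampered URLs without calling the origin."""
    parsed = urlparse(url)
    chunk_hash = parsed.path.rsplit("/", 1)[-1]
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    if now > expires_at:
        return False
    return hmac.compare_digest(sig, _signature(chunk_hash, expires_at))


url = sign_chunk_url("abc123", expires_at=1_000)
print(verify_chunk_url(url, now=999))                               # inside the window
print(verify_chunk_url(url, now=1_001))                             # expired
print(verify_chunk_url(url.replace("abc123", "def456"), now=999))   # tampered path
```

The signature binds the URL to one chunk and one expiry, so a leaked link stops working once the window closes and cannot be rewritten to fetch a different chunk.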

Metadata schema

CREATE TABLE files (
  file_id     BIGINT PRIMARY KEY,
  owner_id    BIGINT,
  parent_id   BIGINT,         -- folder hierarchy
  name        TEXT,
  size        BIGINT,
  current_ver BIGINT
);

CREATE TABLE versions (
  file_id     BIGINT,
  version     BIGINT,
  chunk_list  TEXT[],         -- ordered chunk hashes (SHA-256 hex strings)
  created_at  TIMESTAMPTZ,
  created_by  BIGINT,
  PRIMARY KEY (file_id, version)   -- assumed: one row per file version
);

CREATE TABLE chunks (
  hash        TEXT PRIMARY KEY,
  size        INTEGER,
  blob_url    TEXT,
  refcount    INTEGER          -- for GC
);

CREATE TABLE permissions (
  file_id     BIGINT,
  user_id     BIGINT,
  level       TEXT,            -- view, comment, edit, owner
  PRIMARY KEY (file_id, user_id)   -- assumed: one grant per file and user
);

Sync protocol

Each client keeps a local index: (file_id, last_seen_version). When it wants to catch up, it asks:

GET /sync?cursor=123456
→ {
   "changes": [
     {"file_id": 1, "op": "update", "version": 5, "chunks": [...]},
     {"file_id": 7, "op": "delete"}
   ],
   "cursor": "123500"
}

Two ways to deliver the initial notification:

Polling: client asks every N seconds. Simple, but burns battery and server capacity.
Long-polling / WebSocket: server pushes when something changes. Lower latency, more complex to operate.

Most production systems use a hybrid — WebSocket for active sessions, long-poll for background reconnects — and fall back gracefully when either side drops.

Cursor design matters. The cursor token is not a wall-clock timestamp; clocks skew, and you'd get race conditions. Better options:

Cursor type	Properties	Trade-off
Monotonic DB sequence (e.g. Postgres `SEQUENCE`)	Gaps are possible; guaranteed order on a single shard	Simple; breaks across shards
Logical clock / Lamport timestamp	Causally ordered, no gap risk	Requires global coordination or shard-level timestamps merged at read
Hybrid logical clock (HLC)	Wall-clock monotonicity + logical ordering	Used in CockroachDB and similar systems (Spanner itself relies on TrueTime); good fit for geo-distributed metadata

For a single-region metadata store, a monotonic integer sequence per shard works. For global deployments (Google Workspace), Spanner's TrueTime-backed commit timestamps serve as cursors with bounded uncertainty (typically about 1–7 ms per the Spanner paper).
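For the single-shard case, a cursor over an append-only change log is enough to show the contract of GET /sync. This is a sketch under that assumption: `ChangeLog`, `append`, and `sync` are illustrative names, and the cursor is simply the last sequence number the client has seen.

```python
class ChangeLog:
    """Append-only per-user change log; the cursor is the last sequence number seen."""

    def __init__(self):
        self._entries = []                     # index i holds the change with seq i + 1

    def append(self, change):
        self._entries.append(change)
        return len(self._entries)              # monotonic sequence number

    def sync(self, cursor, limit=100):
        """Return changes after `cursor` plus the cursor to send next time."""
        cursor = max(cursor, 0)
        page = self._entries[cursor:cursor + limit]
        return {"changes": page, "cursor": cursor + len(page)}


log = ChangeLog()
log.append({"file_id": 1, "op": "update", "version": 5})
log.append({"file_id": 7, "op": "delete"})
log.append({"file_id": 1, "op": "update", "version": 6})

first_page = log.sync(cursor=0, limit=2)
second_page = log.sync(cursor=first_page["cursor"])
caught_up = log.sync(cursor=second_page["cursor"])
print(first_page["cursor"], second_page["cursor"], caught_up["changes"])
```

Because the cursor comes from the log position and not from a clock, a client that replays the same cursor gets the same page, and nothing is skipped when two writes land in the same millisecond.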

Fan-out at scale. A file shared with 10,000 users means one metadata write fans out to 10,000 notification subscriptions. If you do that fan-out synchronously on the write path, your write latency grows linearly with subscriber count. The fix: write to metadata DB → emit one Kafka event per file change → notification workers expand to per-subscriber push asynchronously. Write latency stays flat; fan-out becomes a background throughput problem.
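The split between a constant-cost write path and a background fan-out can be sketched with a queue and a worker thread. Here `queue.Queue` stands in for the Kafka topic, appending to a list stands in for a WebSocket push, and the subscriber map and function names are hypothetical.

```python
import queue
import threading

events = queue.Queue()                 # stands in for the Kafka topic
subscribers = {123: ["alice-phone", "bob-laptop", "carol-tablet"]}   # hypothetical
delivered = []                         # stands in for pushes to devices


def commit_change(file_id, version):
    """Write path: one metadata write (omitted here) plus exactly one event."""
    events.put({"file_id": file_id, "version": version})


def notification_worker():
    """Background path: expand each event into per-subscriber pushes."""
    while True:
        event = events.get()
        try:
            if event is None:          # shutdown signal
                return
            for device in subscribers.get(event["file_id"], []):
                delivered.append((device, event["file_id"], event["version"]))
        finally:
            events.task_done()


worker = threading.Thread(target=notification_worker, daemon=True)
worker.start()
commit_change(123, version=5)          # cost does not depend on the subscriber count
events.join()                          # demo only: wait for the fan-out to finish
events.put(None)
worker.join()
print(delivered)
```

The writer does the same amount of work for 3 subscribers or 10,000; scaling the fan-out then means adding workers, not slowing down saves.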

Conflict resolution

User A edits report.pdf on their laptop. While offline, they also edit the same file on their phone. Both devices come online with conflicting changes.

How conflict is detected. The client sends its change with a base_version — the version it read before editing. The server compares base_version against current_version. If they differ, there's a conflict.

Four approaches, ranging from simplest to most powerful:

Last write wins: simple, but silently discards one user's work. Acceptable only for key-value stores where the semantics are understood.
Conflicted copies: both versions are saved, as report.pdf and report (Conflicted Copy 2026-06-02).pdf. Dropbox's classic approach. No data is ever lost; the user resolves it manually.
3-way merge: diff the base against both branches, apply non-overlapping changes automatically, flag the overlapping hunks for the user. Works well for text; it does not apply to opaque binary files.
Operational Transform (OT) / CRDT: merge character-level edits semantically, as Google Docs does. OT as typically deployed relies on a central server to order and transform concurrent operations; CRDTs are peer-to-peer-friendly but add metadata overhead and data-model complexity. Powerful, but a deep rabbit hole.

How Dropbox and Drive actually differ. Dropbox treats every file as an opaque blob — conflict always yields a conflicted copy. Google Docs treats the document as a structured data type and runs OT on every keystroke, so conflicts are invisible to users during live collaboration. Drive for regular binary files (.zip, .pdf) behaves like Dropbox.

For the interview, start with conflicted copies — safe and simple to explain. Mention 3-way merge for text as an upgrade, and name OT/CRDT as "how real-time collaborative editors work" without necessarily deriving it from scratch.
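The base_version check and the conflicted-copy fallback fit in a few lines. This is a minimal sketch assuming an in-memory file table; `FileRecord`, `conflicted_name`, and `upload` are hypothetical names, and the copy's filename follows the pattern used in the example above.

```python
from datetime import date


class FileRecord:
    def __init__(self, chunks):
        self.versions = [chunks]               # versions[i] holds version i + 1

    @property
    def current_version(self):
        return len(self.versions)


def conflicted_name(name, day):
    """report.pdf -> report (Conflicted Copy 2026-06-02).pdf"""
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem:                    # no extension, or a dotfile
        return f"{name} (Conflicted Copy {day.isoformat()})"
    return f"{stem} (Conflicted Copy {day.isoformat()}).{ext}"


def upload(files, name, base_version, chunks, today):
    """Return (name written, version). A stale base_version never overwrites."""
    record = files[name]
    if base_version == record.current_version:
        record.versions.append(chunks)         # fast-forward: no conflict
        return name, record.current_version
    copy_name = conflicted_name(name, today)   # stale base: keep both
    files[copy_name] = FileRecord(chunks)
    return copy_name, 1


files = {"report.pdf": FileRecord(["h1", "h2"])}
today = date(2026, 6, 2)
laptop = upload(files, "report.pdf", 1, ["h1", "h2-laptop"], today)
phone = upload(files, "report.pdf", 1, ["h1", "h2-phone"], today)   # edited offline
print(laptop, phone)
```

Both edits survive: the laptop's upload becomes version 2, and the phone's upload lands next to it as a separate file for the user to reconcile.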

One failure mode worth mentioning. If the server commits a new version but crashes before acknowledging the client, the client retries the same upload with the same base_version. Without an idempotency key on the upload request (a client-generated UUID works), the server cannot tell the retry from a new edit: base_version is now stale, so it records a duplicate version or a spurious conflicted copy. The fix is cheap: store the key with the committed version and check for it before writing.
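A sketch of the idempotency check, assuming the key is stored alongside the committed version (in a real metadata store both would be written in one transaction). `VersionStore` and `commit` are hypothetical names.

```python
import uuid


class VersionStore:
    def __init__(self):
        self.versions = []        # committed chunk lists
        self.seen_keys = {}       # idempotency key -> version number it produced

    def commit(self, idempotency_key, chunks):
        """Commit once per key; a retry returns the original result."""
        if idempotency_key in self.seen_keys:
            return self.seen_keys[idempotency_key]
        self.versions.append(chunks)
        version = len(self.versions)
        self.seen_keys[idempotency_key] = version
        return version


store = VersionStore()
key = str(uuid.uuid4())                        # generated once by the client per upload
first_try = store.commit(key, ["h1", "h2"])    # server commits, then the ack is lost
retry = store.commit(key, ["h1", "h2"])        # client retries with the same key
print(first_try, retry, len(store.versions))
```

The retry is answered from the key table, so the client sees the version it would have seen had the first acknowledgement arrived.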

Sharing and permissions

flowchart LR
    FILE["File 'budget.xlsx'"]
    OWNER["Owner: Alice"] --> FILE
    SHARE["Share to Bob: 'edit'"] --> FILE
    SHARE2["Share to all employees: 'view'"] --> FILE
    LINK["Link share: 'view, anyone with link'"] --> FILE

Permission resolution walks up the folder tree (folder permissions inherit), checks direct grants, then checks group memberships. That can be slow for deeply nested hierarchies, so cache the effective permissions per (file, user) pair and invalidate on any ACL change.
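The resolution walk and its cache can be sketched as follows. The folder tree, grants, and group names are made-up sample data, and the rule that the strongest inherited level wins is an assumption; the text does not specify how competing grants combine.

```python
LEVELS = ["view", "comment", "edit", "owner"]          # weakest to strongest

parent = {"budget.xlsx": "finance", "finance": "root", "root": None}
grants = {
    ("budget.xlsx", "bob"): "edit",                    # direct grant
    ("finance", "bob"): "view",                        # inherited from the folder
    ("root", "group:employees"): "view",               # group grant
}
groups = {"carol": ["group:employees"]}
cache = {}                                             # (node, user) -> effective level


def effective_permission(node, user):
    """Walk up the tree, checking direct and group grants at every level."""
    key = (node, user)
    if key in cache:
        return cache[key]
    principals = [user] + groups.get(user, [])
    best = None
    current = node
    while current is not None:
        for principal in principals:
            level = grants.get((current, principal))
            if level and (best is None or LEVELS.index(level) > LEVELS.index(best)):
                best = level
        current = parent[current]
    cache[key] = best
    return best


def on_acl_change():
    """Coarse invalidation: drop everything (a real system would be more targeted)."""
    cache.clear()


print(effective_permission("budget.xlsx", "bob"))
print(effective_permission("budget.xlsx", "carol"))
print(effective_permission("budget.xlsx", "dave"))
```

Caching per (file, user) turns repeated tree walks into dictionary lookups; the price is that every ACL change has to invalidate the affected entries before the next read.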

Blob storage — the big tier

Each chunk lives in S3 / GCS / Azure Blob. These services are designed for 11-nines durability (99.999999999%), multi-region replication, petabyte scale at low cost (~$0.02/GB-month for hot storage, less for cold), and simple HTTP access.

At Drive or Dropbox scale, you negotiate enterprise pricing — and eventually you may build your own. Dropbox famously moved ~90% of its user data off S3 to Magic Pocket, its own exabyte-scale blob store, by October 2015 (publicly announced March 2016). The migration saved ~$74.6M in infrastructure costs over two years, per their 2018 S-1 filing.

CDN at the edge

Frequently-accessed public chunks go through a CDN. For private files, clients fetch chunks through signed URLs that expire after a short window, and the edge checks the signature before serving. This means a popular shared document can be served from edge nodes around the world without hitting your origin for every download, while access control stays enforced.

Garbage collection

A chunk is potentially referenced by many files and many versions. When does it actually get deleted?

Reference counting: each chunk has a refcount (how many live file versions point to it). When a version is deleted, decrement the refcount for each of its chunks. When refcount reaches zero, the chunk is eligible for deletion.

Don't delete in real time. The upload flow goes: client sends chunk hashes → server marks them "in-flight" → client uploads to blob store → server atomically increments refcount. If you delete a chunk the instant its refcount hits zero, a concurrent re-upload of the same hash can race in: the server says "already have it" (dedup skip), but the blob is gone, and the new file version points at missing data. Instead, move chunks whose refcount hits zero into a "pending delete" state with a tombstone timestamp, clear the tombstone if the chunk is referenced again, and run a nightly GC sweep that only removes blobs whose tombstone is older than 24 hours and whose refcount is still zero. A grace window far longer than any upload makes the race vanishingly unlikely.
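The tombstone scheme can be sketched with three small operations. `ChunkIndex`, `reference`, `release`, and `sweep` are hypothetical names, timestamps are plain seconds passed in by the caller, and the in-memory sets stand in for the chunk index and the blob store.

```python
GRACE_SECONDS = 24 * 3600


class ChunkIndex:
    def __init__(self):
        self.refcount = {}     # hash -> number of live versions using the chunk
        self.tombstone = {}    # hash -> time its refcount reached zero
        self.blobs = set()     # hashes physically present in blob storage

    def reference(self, chunk_hash):
        """A version starts using the chunk (fresh upload or dedup skip)."""
        self.blobs.add(chunk_hash)                 # no-op if the blob already exists
        self.refcount[chunk_hash] = self.refcount.get(chunk_hash, 0) + 1
        self.tombstone.pop(chunk_hash, None)       # resurrect a pending delete

    def release(self, chunk_hash, now):
        """A version that used the chunk was deleted."""
        self.refcount[chunk_hash] -= 1
        if self.refcount[chunk_hash] == 0:
            self.tombstone[chunk_hash] = now       # pending delete, not deleted

    def sweep(self, now, grace=GRACE_SECONDS):
        """Nightly GC: remove only old tombstones that are still unreferenced."""
        doomed = [h for h, t in self.tombstone.items()
                  if now - t > grace and self.refcount[h] == 0]
        for h in doomed:
            self.blobs.discard(h)
            del self.tombstone[h]
            del self.refcount[h]
        return doomed


index = ChunkIndex()
index.reference("a")
index.release("a", now=0)                  # refcount hits zero: tombstoned
early = index.sweep(now=3600)              # too young to delete
index.reference("a")                       # re-upload dedups against it in time
late = index.sweep(now=10 * GRACE_SECONDS)
print(early, late, "a" in index.blobs)
```

The chunk survives because the re-reference cleared its tombstone before any sweep was allowed to act on it.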

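Epoch-based alternative. Rather than refcounting, some systems do mark-and-sweep: periodically scan all live versions, collect all referenced hashes (a Bloom filter works well here to test membership at scale), then scan blob storage and delete anything not in the set, skipping blobs written after the scan began. A Bloom filter's false positives only let some garbage survive until a later sweep; it never reports a live chunk as absent, so live data is safe. The correctness argument is simpler than refcounting, and the approach suits exabyte-scale erasure-coded blob stores like Magic Pocket.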
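A compact mark-and-sweep with a small Bloom filter shows why false positives are harmless here. The filter size, hash construction, and names (`Bloom`, `mark_and_sweep`) are assumptions chosen for brevity, and the sketch ignores blobs written while the scan is running.

```python
import hashlib


class Bloom:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def mark_and_sweep(live_versions, blob_store):
    """Mark every referenced hash, then delete blobs the filter has never seen."""
    live = Bloom()
    for chunk_list in live_versions:               # mark phase
        for chunk_hash in chunk_list:
            live.add(chunk_hash)
    deleted = [h for h in list(blob_store) if h not in live]   # sweep phase
    for h in deleted:
        del blob_store[h]
    return deleted


blob_store = {"h1": b"...", "h2": b"...", "h3": b"...", "orphan": b"..."}
live_versions = [["h1", "h2"], ["h2", "h3"]]
deleted = mark_and_sweep(live_versions, blob_store)
print(deleted, sorted(blob_store))
```

A false positive would only make the filter claim an orphan is live, so the worst case is garbage kept for one more cycle; a referenced chunk can never fail the membership test.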

Sync notification path — end to end

It helps to see the full path from a save on one device to the notification landing on another. Here is what happens after the upload completes:

sequenceDiagram
    participant DeviceA as Device A (editor)
    participant MetaSvc as Metadata Service
    participant Kafka
    participant NotifWorker as Notification Worker
    participant DeviceB as Device B (subscriber)
    DeviceA->>MetaSvc: upload complete, register new version
    MetaSvc->>Kafka: emit FileChanged event
    Kafka->>NotifWorker: deliver event
    NotifWorker->>NotifWorker: expand to per-subscriber list
    NotifWorker->>DeviceB: push notification (WebSocket / long-poll)
    DeviceB->>MetaSvc: GET /sync?cursor=...
    MetaSvc-->>DeviceB: changed chunks for file_id=123
    DeviceB->>CDN: fetch missing chunks

The metadata write and the Kafka emit are synchronous; everything after that is async. Device B usually sees the change within a few seconds on an active WebSocket connection, and within the 30-second p99 target even on a polling fallback.

Edge cases

Large file upload interrupted. The client resumes from the last successful chunk; chunks already uploaded are skipped because the server already has their hashes. In effect this is a multipart upload with per-chunk checkpoints.

Slow or metered network. The client throttles upload bandwidth and backs off when on cellular. User-configurable.

Bandwidth tax for popular shared files. A viral shared document downloaded 10M times would blow up the egress bill if every request hit origin. CDN caching serves nearly all of those requests from the edge, so origin egress stays small.

Mobile vs. desktop. Desktop clients do full sync — mirror the folder locally. Mobile clients do lazy sync — download a file on tap, not proactively. Different access patterns, different APIs, same chunk + metadata backend.

Things to discuss in an interview

Chunked upload with dedup — the killer feature that makes everything else possible.
Hash-based content addressing: content → hash → location, and why this enables dedup and delta sync.
Sync protocol with cursor / version numbers and the tradeoffs between cursor types.
Conflict resolution strategy, especially binary vs. text files.
Garbage collection of orphan chunks, and why real-time deletion is unsafe.
Permission resolution — folder inheritance and caching effective permissions.

Things you should now be able to answer

Why store files as chunks instead of single blobs?
What's content-defined chunking and why does it matter for edits?
How do you sync only the changed bytes when a user edits a file?
What happens when two devices edit the same file offline?
Why does the upload start with a "do you have these hashes?" probe?

Frequently asked questions

▸Why does Google Drive store files as chunks instead of single blobs?

Splitting a file into roughly 4 MB chunks and addressing each by its SHA-256 hash enables three things at once: cross-user deduplication (a million users sharing the same template store it once), within-user delta sync (only the changed chunks go over the wire on edits), and resumable uploads (a failure resumes from the last successful chunk, not from scratch). These benefits fall out of a single mechanism, which is why the chunk-based design is the load-bearing decision the rest of the architecture builds on.

▸What is content-defined chunking and why does it matter?

Content-defined chunking (CDC) uses a rolling hash — typically Rabin fingerprinting — that slides over the byte stream and cuts a boundary whenever the hash matches a pattern, such as the low 13 bits being zero. Because boundaries depend on local content rather than fixed offsets, inserting bytes near the start of a file shifts only the one or two chunks around the insertion; all downstream chunks are unchanged and still hit the dedup cache. Fixed-size chunking, by contrast, shifts every chunk boundary downstream of any insertion, destroying dedup for the tail of the file.

▸How does Google Drive detect and deliver a file change to other devices within 30 seconds?

When a file is saved, the Metadata Service emits a FileChanged event to Kafka, and Notification Workers expand that into per-subscriber pushes over WebSocket (for active sessions) or long-poll (for background reconnects). The receiving device wakes up, calls GET /sync with its cursor to get the changed chunk list, then fetches only the missing chunks from the CDN. The metadata write and Kafka emit are synchronous; everything after that is async, and the design targets p99 sync detection under 30 seconds.

▸How does the system handle conflicts when two devices edit the same file offline?

Each upload includes a base_version — the version the client read before editing. If base_version differs from current_version on the server, a conflict is detected. The safest approach, used by Dropbox for opaque files, saves both as separate copies (e.g., report.pdf and report (Conflicted Copy 2026-06-02).pdf) so no data is ever lost. Google Docs handles real-time collaboration differently by running Operational Transform on every keystroke, but Drive still treats binary files like .zip and .pdf as opaque blobs and falls back to conflicted copies.

▸How did Dropbox reduce storage costs at exabyte scale, and by how much?