Designing Pastebin
A simple service for sharing text snippets — and a surprisingly rich design problem. Storage strategy, expiry, syntax highlighting, abuse prevention.
The problem
Pastebin.com launched in 2002 and turned a dumb-simple idea into a developer institution: dump any text into a box, click a button, get a short link to share anywhere. GitHub Gist, dpaste, and dozens of clones followed. The use cases are mundane — sharing a stack trace in a chat, dropping a config snippet in a ticket, posting a log file too big for Slack — yet the system is a near-perfect proxy for a family of "write once, read many" services you'll build throughout your career.
The core loop is tiny: POST /pastes accepts a text blob up to 10 MB and returns a short, unguessable URL like paste.io/Xy7Ab2c. GET /Xy7Ab2c returns the content. That's it. No accounts required, no collaboration, no real-time editing. The simplicity is the point.
What makes it interesting under the hood is the payload. A URL shortener stores a 200-byte redirect row and serves it billions of times. A pastebin stores blobs that can reach 10 MB — and those blobs have to live somewhere. Stuff them in a database and you immediately wreck WAL performance, replication lag, and storage cost. Pull them into object storage and now every read is a network hop. The design question is really about where bytes live and how reads stay fast despite an order-of-magnitude heavier payload than a URL shortener.
The second tension is key generation. IDs need to be short (for shareability), unguessable (for privacy), and collision-resistant at millions of writes per day. Sequential counters are efficient but enumerable — anyone can walk your dataset by incrementing a number. Random generation is private but risks retry overhead at high write rates. Threading that needle, plus building a CDN-fronted read path that absorbs viral traffic spikes, is what makes this an interview staple.
Requirements
Functional
- User pastes arbitrary text (up to, say, 10 MB).
- Gets a unique short URL like paste.io/Xy7Ab2c.
- Anyone with the URL reads the paste.
- Optional: custom alias (paste.io/my-deploy-log).
- Optional: expiry (1 hour, 1 day, 1 week, never).
- Optional: syntax highlighting.
- Optional: private (unguessable URL) vs public (listed).
- Optional: edit / delete (uncommon — most pastes are write-once).
Non-functional
- Reads dominate writes (~10:1 to 100:1).
- Latency: read p99 < 200ms.
- Durability: pastes shouldn't disappear (durable storage).
- Available: 99.9%+.
- Cost: scales to millions of pastes/day cheaply.
Out of scope
- Auth (most pastebins are anonymous).
- Real-time collaboration (that's a different product).
- Comments / forking / versioning (GitHub Gist territory).
Back-of-the-envelope
Let's say 10M pastes/day, average 10 KB per paste.
| Dimension | Estimate | How we got there |
|---|---|---|
| Write throughput (avg) | ~120 writes/sec | 10M ÷ 86,400 s |
| Write throughput (peak) | ~500/sec | ~4× avg burst headroom |
| Read throughput (avg) | ~12k reads/sec | 120 × 100 (100:1 read ratio) |
| Read throughput (peak) | ~50k/sec | ~4× avg burst headroom |
| Storage per day | 100 GB/day | 10M × 10 KB |
| Storage per year | ~36 TB/year | 100 GB × 365 |
| Storage over 5 years | ~180 TB | 36 TB × 5 |
That's enough to need object storage (S3-like) instead of stuffing all of it in a database.
Takeaway: 36 TB/year at 12k avg reads/sec — object storage for blobs is non-negotiable; a CDN absorbing 90%+ of reads keeps the origin at a manageable ~1,200 reads/sec.
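The table's arithmetic can be reproduced in a few lines. This is a minimal sketch using the same inputs as above (10M pastes/day, 10 KB average, 100:1 reads, 4× peak factor, 90% CDN hit ratio); the constant and function names are illustrative.

```python
# Back-of-the-envelope inputs taken from the estimates above.
PASTES_PER_DAY = 10_000_000
AVG_PASTE_BYTES = 10_000          # 10 KB, decimal units
READS_PER_WRITE = 100
PEAK_FACTOR = 4
CDN_HIT_RATIO = 0.90
SECONDS_PER_DAY = 86_400


def estimate() -> dict[str, float]:
    """Derive throughput and storage figures from the inputs."""
    writes_per_sec = PASTES_PER_DAY / SECONDS_PER_DAY
    reads_per_sec = writes_per_sec * READS_PER_WRITE
    gb_per_day = PASTES_PER_DAY * AVG_PASTE_BYTES / 1e9
    return {
        "writes_per_sec": writes_per_sec,
        "peak_writes_per_sec": writes_per_sec * PEAK_FACTOR,
        "reads_per_sec": reads_per_sec,
        "origin_reads_per_sec": reads_per_sec * (1 - CDN_HIT_RATIO),
        "gb_per_day": gb_per_day,
        "tb_per_year": gb_per_day * 365 / 1000,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name:>22}: {value:,.1f}")
```

The exact quotients come out slightly below the table's rounded figures (about 116 writes/sec rather than ~120), which is the usual level of precision for this kind of estimate.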
Building up to the design
Pastebin is the most "obviously easy then surprisingly hard" question in the catalog. Walking from the dumbest version up makes every later decision feel obvious.
V1: One server, one Postgres table
CREATE TABLE pastes (id TEXT PRIMARY KEY, content TEXT, created_at TIMESTAMP);
POST /pastes: insert row, return id. GET /:id: select row, return content. Done in 30 lines. Works beautifully for a small team sharing config snippets.
The problem reveals itself at scale. A 10 MB blob in Postgres bloats the WAL, slows replication, and hurts every query in the table. At ~10M pastes/day × ~10KB avg, you're growing 100 GB/day in a table you're also serving reads from. The database becomes the cost driver and the latency outlier — two things you really want it not to be.
V2: Split metadata and blob
Keep metadata in Postgres (id, owner, expiry, language); put the content in S3 keyed by paste/{id}. The DB row is tiny again.
This buys you 10× cheaper storage and S3 durability for free. But now every read is a network hop to S3 — in-region TTFB for S3 Standard small objects is typically in the 20–100ms range, with tail latency reaching 200ms under load. And popular pastes (front-page Hacker News link) hammer the same S3 object repeatedly.
V3: Cache + CDN
Pastes are immutable until they expire — a CDN dream. Set Cache-Control: public, max-age=3600. Put Redis in front of S3 for the origin path.
read → CDN edge → (miss) → API → Redis → (miss) → S3
Now 90%+ of reads are served at CDN edges in under 30ms. The origin sees almost nothing. A paste going viral becomes a non-event.
The next bottleneck shifts to ID generation. Random base62 with a UNIQUE retry is fine until you're at thousands of writes/sec — collision retries add unpredictable latency. And sequential IDs leak privacy ("paste #42 is right after paste #41").
V4: ID generation strategy
Pick one of:
- Random base62 + retry — simplest, fine up to ~1k writes/sec.
- KGS (Key Generation Service) — pre-generate a pool of unused random keys; app servers reserve batches; zero collisions on the hot path.
- Snowflake/KSUID — time-sortable, longer.
For most pastebins, random base62 wins on simplicity; switch to KGS when write latency must be predictable.
V5: Tiny pastes inline, expiry, abuse
Four production-quality refinements:
- Inline pastes < 8KB in a Postgres content_inline column (skip the S3 round-trip for ~70% of pastes).
- Compression (gzip) before upload: 3–10× storage win for code/log content.
- Expiry via S3 lifecycle rules + nightly metadata sweep + lazy 404 in between.
- Abuse: rate limits, regex scanning for leaked credentials, DMCA flow.
V6: Production
Everything wired together: CDN → API (multi-instance, LB-fronted) → Postgres (metadata, with expires_at partial index) → S3 (blobs, with lifecycle rules) → Redis (origin cache) → background workers for expiry, view counts, abuse scanning.
flowchart LR
V1[V1: blob in Postgres<br/>WAL meltdown] --> V2[V2: + S3<br/>cheap, slow reads]
V2 --> V3[V3: + CDN + Redis<br/>90% deflection]
V3 --> V4[V4: ID strategy<br/>scale + privacy]
V4 --> V5[V5: inline + gzip + expiry<br/>cost + UX]
V5 --> V6[V6: production pastebin]
style V1 fill:#0e7490,color:#fff
style V3 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V6 fill:#a855f7,color:#fff
High-level design
flowchart LR
U[User] --> CDN
CDN --> LB[Load Balancer]
LB --> API[API Servers]
API -->|metadata| DB[(Postgres)]
API -->|paste content| OBJ[(S3 / Object Storage)]
API --> CACHE[(Redis<br/>hot pastes)]
BG[Background workers] -->|expire pastes| DB
BG --> OBJ
style API fill:#ff6b1a,color:#0a0a0f
style OBJ fill:#15803d,color:#fff
style CACHE fill:#0e7490,color:#fff
Two storage tiers:
- Database (Postgres): metadata only — paste ID, owner, created_at, expires_at, language, MIME type, byte size. Small rows. Fast indexes.
- Object storage (S3): the actual text blob. Cheap, durable, scales infinitely.
Why split? Putting 10 MB blobs in Postgres bloats the WAL, kills replication, and hurts every query. Object storage is built for blobs at 1/10 the cost.
Generating the short ID
This is the same problem as in a URL shortener. Five candidates:
1. Counter + Base62
A central counter (SELECT nextval('paste_id_seq')) encoded in base62. Yields short, sequential IDs.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def to_base62(n: int) -> str:
    s = []
    while n:
        n, r = divmod(n, 62)
        s.append(ALPHABET[r])
    return ''.join(reversed(s)) or '0'
Short and deterministic with no collisions. The downside is that IDs are enumerable — paste #7 is right next to paste #6, which lets anyone walk your entire dataset sequentially. Bad for privacy.
2. Random + Bloom filter
Generate a 7-char random base62 ID. Query the Bloom filter: a "definitely not a member" result means the ID is safe to use immediately; a "probably a member" result (which may be a false positive) triggers a DB confirmation lookup before retrying. Because Bloom filters have no false negatives, a negative result is conclusive.
Unguessable, and avoids the DB round-trip for the common (new ID) case. The tradeoff: more code, and the filter must be large enough to keep the false-positive rate below ~0.1% to avoid excessive DB fallback reads. Filter state must also be persisted or rebuilt after a restart.
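To make the sizing constraint concrete, here is a minimal sketch of a Bloom filter together with the standard sizing formulas (bits = −n·ln p / (ln 2)², hash count = (bits / n)·ln 2). The class and function names are illustrative, and the double-hashing scheme over SHA-256 is one common choice, not something the design above prescribes.

```python
import hashlib
import math


def bloom_parameters(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a target false-positive rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    def __init__(self, n_items: int, fp_rate: float = 0.001) -> None:
        self.bits, self.hashes = bloom_parameters(n_items, fp_rate)
        self.array = bytearray((self.bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive all positions from two halves of one digest.
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.bits for i in range(self.hashes))

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.array[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False is conclusive; True may be a false positive.
        return all(
            self.array[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

At a 0.1% false-positive target the formula asks for roughly 14.4 bits per stored ID, so a filter covering a few billion IDs needs several gigabytes of memory. That is the "more code, more state" cost noted above.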
3. UUID truncated
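Take 8 hex chars of a UUID. That is only 16⁸ ≈ 4.3 billion values, far fewer than 7 chars of base62, so the per-insert collision chance starts tiny but climbs as the table fills (at 10M pastes/day the space is about 10% full after six weeks). Handle the duplicate-key error by retrying. Simple and unguessable, but better suited to lower volumes or a longer prefix.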
4. Snowflake / KSUID
Time-sortable IDs. Useful for indexing and ordering, but longer: a 64-bit Snowflake ID is up to 19 decimal digits, and a KSUID is 27 base62 characters.
5. Key Generation Service (KGS)
Pre-generate a large pool of unique random keys offline and stash them in an "available keys" table. The application server grabs a chunk (say, 1000 keys at a time), marks them as used atomically, and hands them out at write time.
flowchart LR
KGS[KGS Worker] -->|"pre-generate"| AVAIL[(available_keys<br/>1B+ rows)]
API[API Server] -->|"reserve batch"| AVAIL
AVAIL -->|"1000 keys"| API
API -->|"local cache"| MEM[in-memory queue]
API -->|"on paste"| USED[(used_keys)]
style KGS fill:#ff6b1a,color:#0a0a0f
style AVAIL fill:#15803d,color:#fff
The write path never blocks on key generation, and there are no collision retries or central counter contention at insert time. The cost is an extra service to run, and a process restart drops whatever keys it had cached but hadn't yet handed out. That's fine in practice — 6 chars of base62 yields 62⁶ ≈ 56.8 billion keys and 7 chars ≈ 3.5 trillion; losing a few hundred on a crash is rounding error. Critical: the reserve_batch operation must use SELECT ... FOR UPDATE SKIP LOCKED (or equivalent) so two API servers cannot receive overlapping key batches from concurrent reservation requests.
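A minimal sketch of the application-side half of a KGS follows. The `KeyPool` class, the `reserve_batch` callable, and the `key_value` column are hypothetical names. The SQL string is one possible Postgres statement for the reservation step and is not executed here; it deletes reserved rows instead of flagging them as used, which is a simplification of the flow in the diagram.

```python
from collections import deque
from typing import Callable

# Assumed Postgres statement for the reservation step (not run in this sketch).
# SKIP LOCKED lets concurrent callers pick disjoint rows instead of blocking.
RESERVE_BATCH_SQL = """
DELETE FROM available_keys
WHERE key_value IN (
    SELECT key_value FROM available_keys
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
RETURNING key_value;
"""


class KeyPool:
    """Hands out pre-generated keys from a local in-memory batch."""

    def __init__(
        self,
        reserve_batch: Callable[[int], list[str]],
        batch_size: int = 1000,
    ) -> None:
        self._reserve_batch = reserve_batch
        self._batch_size = batch_size
        self._local: deque[str] = deque()

    def next_key(self) -> str:
        # Refill only when the local batch is empty, so most calls touch
        # nothing but process memory.
        if not self._local:
            batch = self._reserve_batch(self._batch_size)
            if not batch:
                raise RuntimeError("key pool exhausted")
            self._local.extend(batch)
        return self._local.popleft()
```

In production, `reserve_batch` would run the reservation statement inside a transaction; for tests, any function returning a list of unique strings works.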
For most pastebins, the right choice is random base62 with collision retry. Privacy matters, the collision probability is negligible at our scale, and there's no central counter to bottleneck. Switch to KGS when you need predictable write latency or offline auditing of key entropy.
7 chars of base62 = 62⁷ ≈ 3.5 trillion IDs. At 10M/day, ~1000 years before space exhaustion. Plenty.
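The recommended default, random base62 with collision retry, might look like the sketch below. It uses an in-memory SQLite table as a stand-in for Postgres so the example is self-contained; the function names and the retry limit of 5 are assumptions for illustration.

```python
import secrets
import sqlite3

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def random_id(length: int = 7) -> str:
    # secrets draws from the OS CSPRNG, so IDs are not predictable.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def reserve_random(conn: sqlite3.Connection, max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        candidate = random_id()
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO pastes(id) VALUES (?)", (candidate,))
            return candidate
        except sqlite3.IntegrityError:
            continue  # primary-key collision: draw a fresh ID
    raise RuntimeError("could not reserve a unique ID")


def collision_probability(existing: int, length: int = 7) -> float:
    # Chance that one fresh random ID lands on an already-used one.
    return existing / (62 ** length)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pastes (id TEXT PRIMARY KEY)")
new_id = reserve_random(conn)
```

With a full year of unexpired pastes at 10M/day (about 3.65 billion rows), `collision_probability` gives roughly 0.1%, so about one insert in a thousand needs a single retry.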
Custom aliases
When the user requests paste.io/my-snippet, the alias shares the same key space as auto-generated IDs. Insert path:
import re

ALIAS_RE = re.compile(r'[A-Za-z0-9_-]{4,32}')

def reserve(alias: str | None) -> str:
    if alias:
        # fullmatch rejects a trailing newline that '^...$' with match() lets through
        if not ALIAS_RE.fullmatch(alias):
            raise BadAlias
        # unique insert; raises on collision
        db.insert("INSERT INTO pastes(id) VALUES (%s)", alias)
        return alias
    return reserve_random()
Two gotchas: reserve some prefixes (api, admin, login) so users can't squat on routes; and collisions with auto-generated IDs are fine — both live in the same table, both go through UNIQUE on id.
Storing the content
Order matters: write content first, then metadata.
sequenceDiagram
participant C as Client
participant API as API Server
participant S3 as S3
participant DB as Postgres
C->>API: POST /pastes (content)
API->>API: generate ID, compress
API->>S3: PUT paste/{id} (gzipped blob)
S3-->>API: 200 OK
API->>DB: INSERT metadata row (id, s3_key, expires_at, ...)
DB-->>API: 200 OK
API-->>C: 201 Created — paste.io/Xy7Ab2c
If the API crashes between the two steps with metadata written first, you get a DB row pointing to a nonexistent S3 object — every subsequent read returns a 500 or 404. That's bad. If the crash happens after S3 but before the DB write, you get an orphaned blob — wasteful but harmless. A nightly orphan-scrub job can reclaim S3 objects whose IDs have no matching metadata row.
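The orphan-scrub job mentioned above can be sketched as a pure function over two listings. The function name, the one-hour grace period, and the input shapes are assumptions; the grace period is there so the job does not delete a blob whose metadata INSERT is still in flight.

```python
from datetime import datetime, timedelta


def find_orphans(
    blobs: dict[str, datetime],   # S3 key -> last-modified time
    metadata_ids: set[str],       # IDs present in the pastes table
    now: datetime,
    grace: timedelta = timedelta(hours=1),
) -> list[str]:
    """Return blob keys with no metadata row that are older than the grace period."""
    orphans = []
    for key, modified in blobs.items():
        paste_id = key.removeprefix("paste/")
        # Skip recent blobs: their metadata row may not be written yet.
        if paste_id not in metadata_ids and now - modified > grace:
            orphans.append(key)
    return sorted(orphans)
```

The caller would feed it an object listing and the set of known IDs, then delete the returned keys.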
Lifecycle: S3 lifecycle rules auto-delete blobs older than X days, filtered by prefix or object tag. Store blobs under tier-namespaced prefixes (paste/never/, paste/1d/, paste/1wk/) or tag objects at upload time; then create one rule per tier. This is cheaper than a dedicated worker process, but lifecycle runs at most once per day — suitable for coarse expiry tiers (days/weeks) but it cannot guarantee sub-day precision.
Compressing
Pastebin text is usually code or logs — both compress well (3–10×). Compress before upload:
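import gzip

content_gz = gzip.compress(content.encode("utf-8"))
# Assumes s3 is a boto3 S3 client and BUCKET is a placeholder bucket name
s3.put_object(Bucket=BUCKET, Key=key, Body=content_gz, ContentEncoding="gzip")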
Taking 5× as a midpoint of that range, S3 stores about one fifth of the bytes; reads decompress in the browser or in the API. On 36 TB/year, that saving is real money.
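To get a feel for the ratio, the sketch below compresses a synthetic log and checks that it round-trips. The sample text is invented and far more repetitive than most real pastes, so treat the measured ratio as an upper-end illustration, not a benchmark.

```python
import gzip


def compression_ratio(text: str) -> float:
    """Return original size divided by gzip-compressed size."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    # Round-trip check: decompression must return the original bytes.
    if gzip.decompress(packed) != raw:
        raise ValueError("gzip round-trip mismatch")
    return len(raw) / len(packed)


# Synthetic, highly repetitive log lines (illustrative only).
sample_log = "".join(
    f"2024-01-01T00:00:{i % 60:02d}Z INFO request id={i} status=200 path=/api/v1/items\n"
    for i in range(2000)
)
ratio = compression_ratio(sample_log)
```

Running the same function over a sample of real pastes is a cheap way to validate the 3–10× assumption before committing to it in capacity planning.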
Inline vs blob
Small pastes (< 8 KB) can live in the database itself in the content_inline TEXT column, skipping the S3 round-trip entirely.
if len(content.encode("utf-8")) < 8000:  # threshold is in bytes, not characters
    save_inline(content)  # in Postgres
else:
    save_blob(content)    # in S3
70%+ of pastes are tiny in practice. For those, a Postgres round-trip is faster than an S3 call, so inline storage is a meaningful latency win at no extra cost.
The read path
sequenceDiagram
participant U as User
participant CDN as CDN Edge
participant API as API Server
participant R as Redis
participant DB as Postgres
participant S3 as S3
U->>CDN: GET /Xy7Ab2c
alt CDN hit
CDN-->>U: 200 OK (cached content)
else CDN miss
CDN->>API: forward request
API->>R: GET paste:Xy7Ab2c
alt Redis hit
R-->>API: content
else Redis miss
API->>DB: SELECT content_inline, s3_key, expires_at WHERE id = 'Xy7Ab2c'
DB-->>API: metadata row
alt content_inline IS NOT NULL (small paste, ~70% of reads)
API->>R: SET paste:Xy7Ab2c (TTL 3600)
else content_inline IS NULL (large paste — fetch from S3)
API->>S3: GET paste/Xy7Ab2c
S3-->>API: blob
API->>R: SET paste:Xy7Ab2c (TTL 3600)
end
end
API-->>CDN: 200 OK + Cache-Control headers
CDN-->>U: 200 OK (now cached at edge)
end
Three caches, each catching a different slice:
- CDN: pastes are immutable until they expire. Set Cache-Control: public, max-age=3600. The CDN handles 90%+ of reads.
- Redis / Memcached: for the API's view, cache content for hot pastes.
- Browser: same Cache-Control headers; the browser caches naturally.
Hot pastes (something just went viral on Hacker News) live in CDN edges and never touch your backend after the first read. Cold pastes go all the way to S3.
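One detail worth sketching is how the Cache-Control value interacts with expiry. The function below assumes (this is not stated above) that a paste expiring in less than an hour should get a shorter max-age, so an edge never serves it after it has expired. The function name and the private/no-store handling are illustrative choices.

```python
from datetime import datetime
from typing import Optional

DEFAULT_MAX_AGE = 3600  # seconds, matching the header used above


def cache_control(
    expires_at: Optional[datetime],
    now: datetime,
    is_public: bool = True,
) -> str:
    """Build a Cache-Control value that never outlives the paste."""
    if expires_at is None:
        max_age = DEFAULT_MAX_AGE
    else:
        remaining = int((expires_at - now).total_seconds())
        max_age = max(0, min(DEFAULT_MAX_AGE, remaining))
    if max_age == 0:
        return "no-store"  # already expired: nothing should cache this
    scope = "public" if is_public else "private"
    return f"{scope}, max-age={max_age}"
```

The same capped value can be reused as the Redis TTL so that all three cache layers age out together.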
Sizing the cache (80/20 rule)
Assume 20% of pastes drive 80% of reads. At 10M pastes/day × 10 KB avg, each day adds 100 GB of new content; the hot 20% of that is ~20 GB. Cache a week of hot pastes and you're at roughly 140 GB of memory, split across a Memcached/Redis ring.
daily writes = 10M × 10 KB = 100 GB
hot working set 20% = 20 GB / day
7-day hot window ≈ 140 GB cache
shard 8× node ≈ 18 GB / node
Eviction is LRU — hot pastes stay; cold ones fall out to S3. Don't write-through on the create path; just let the cache populate on first read. If you go write-through for cross-region replica freshness, make sure the write fans out to all replicas atomically so a follow-up read can't see a stale empty slot. Use consistent hashing on paste ID across cache nodes so one node dying only invalidates 1/N of the cache. See consistent hashing.
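The "one node dying only invalidates 1/N" property can be seen in a small hash ring. This is a minimal sketch with virtual nodes; the class name, the 100 virtual nodes per server, and the use of SHA-256 are assumptions, and real client libraries add weighting and replication on top.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, paste_id: str) -> str:
        # First virtual node clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect_left(self._ring, (self._hash(paste_id), ""))
        return self._ring[index % len(self._ring)][1]
```

Removing a node only remaps the paste IDs that lived on it; every other ID keeps its cache node, so the rest of the cache stays warm.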
Expiry
Option 1: Worker scan
Background job runs nightly: SELECT id FROM pastes WHERE expires_at < NOW() LIMIT 10000. Delete each.
Simple and easy to reason about. The cost is a scan-heavy operation that requires an index on expires_at.
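A batched version of that sweep might look like the following sketch. It uses SQLite as a stand-in for Postgres so it runs on its own, passes the current time in as a parameter instead of calling NOW(), and deletes one batch per transaction to keep locks short. The function name is illustrative.

```python
import sqlite3


def sweep_expired(conn: sqlite3.Connection, now: str, batch_size: int = 10_000) -> int:
    """Delete expired rows in batches; returns the number of rows removed."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM pastes WHERE expires_at < ? LIMIT ?",
            (now, batch_size),
        ).fetchall()
        if not rows:
            return total
        with conn:  # one short transaction per batch
            conn.executemany("DELETE FROM pastes WHERE id = ?", rows)
        total += len(rows)
```

Rows with a NULL expires_at never match the comparison, so pastes set to "never" are left alone without a special case.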
Option 2: S3 lifecycle rules
Object storage handles expiry natively via lifecycle rules that filter by key prefix or object tag. To support multiple expiry tiers (1 h, 1 d, 1 wk, never), store blobs under tier-namespaced prefixes (paste/1h/{id}, paste/1d/{id}, etc.) or tag each object at upload time (expiry=1d), then create one lifecycle rule per tier. S3 does not support per-object arbitrary expiry dates — the rules operate at the prefix/tag level.
Zero application code for the actual deletion is a big win. The downside is coarse timing: lifecycle expiration is specified in whole days, S3 rounds the expiry time up to the next midnight UTC, and the actual removal happens asynchronously after that, so a "1 hour" tier cannot be enforced by lifecycle rules alone. Metadata still needs a separate cleanup sweep, and you need prefix/tag discipline from day one.
Option 3: Lazy delete on read
Don't delete eagerly. When a paste is read past its expiry, return 404 and queue async deletion. Minimal write load, but storage grows and abandoned pastes linger longer than users expect.
In practice, you combine all three: S3 lifecycle for blobs + nightly metadata cleanup + lazy-404 for the moments in between.
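The lazy-404 piece is the smallest of the three. A minimal sketch, assuming the API already has the metadata row in hand; the `deletion_queue` deque is a stand-in for whatever job queue does the async delete.

```python
from collections import deque
from datetime import datetime
from typing import Optional

deletion_queue: deque[str] = deque()  # stand-in for a real job queue


def read_status(paste_id: str, expires_at: Optional[datetime], now: datetime) -> int:
    """Return the HTTP status for a read, queueing cleanup for expired pastes."""
    if expires_at is not None and expires_at <= now:
        deletion_queue.append(paste_id)  # actual deletion happens asynchronously
        return 404
    return 200
```

This check is what keeps a paste in the 1-hour tier from being readable until the next nightly sweep or lifecycle run.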
The database schema
CREATE TABLE pastes (
    id             VARCHAR(32) PRIMARY KEY,  -- random 7-char ID or custom alias (up to 32 chars)
    s3_key         VARCHAR(128),
    content_inline TEXT,                     -- nullable; used for small pastes
    language       VARCHAR(32),
    byte_size      INT,
    created_at     TIMESTAMP NOT NULL DEFAULT NOW(),
    expires_at     TIMESTAMP,                -- NULL = never
    view_count     INT NOT NULL DEFAULT 0,
    is_public      BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE INDEX idx_expires ON pastes(expires_at) WHERE expires_at IS NOT NULL;
CREATE INDEX idx_public_created ON pastes(created_at) WHERE is_public;
Notes:
- Partial index on expires_at skips pastes that never expire.
- Partial index for public listing keeps it tiny.
- No FKs; view_count is a write-heavy column.
View counts
Updating view_count synchronously on every read creates a write bottleneck: to count a read that the CDN could have served from cache, the request would still have to reach the origin and trigger a DB write, which defeats the point. The standard fix is to decouple the count from the read path:
flowchart LR
R[Read paste] --> Q[Queue: increment view]
Q --> BATCH[Batched worker]
BATCH -->|every 30s| DB[(UPDATE pastes<br/>SET view_count = view_count + N)]
style Q fill:#ff6b1a,color:#0a0a0f
A queue absorbs the writes; a worker batches updates. The displayed count is eventually consistent but very cheap to write. Thirty seconds of lag on a view counter is a trade most products happily make.
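The batching step can be sketched as an in-memory accumulator that is flushed on a timer. The `ViewCounter` class is illustrative, and the SQL string is an assumed statement against the pastes table that is not executed here.

```python
from collections import Counter

# Assumed statement for the flush; run with executemany() over the batch.
UPDATE_SQL = "UPDATE pastes SET view_count = view_count + %s WHERE id = %s"


class ViewCounter:
    """Accumulates view events in memory and flushes them as batched updates."""

    def __init__(self) -> None:
        self._pending: Counter[str] = Counter()

    def record(self, paste_id: str) -> None:
        self._pending[paste_id] += 1

    def flush(self) -> list[tuple[int, str]]:
        # One (increment, id) pair per paste, however many views it received.
        batch = [(count, paste_id) for paste_id, count in sorted(self._pending.items())]
        self._pending.clear()
        return batch
```

A paste with 10,000 views in a flush window turns into a single UPDATE instead of 10,000, which is where the write savings come from.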
Syntax highlighting
Server-side or client-side? For most pastebins, client-side is the right default — ship Highlight.js or Prism.js to the browser, pay zero server cost, and the CDN still serves the raw text blob. Server-side pre-rendering (render HTML once, cache it) is worth it only if you need CDN-cacheable HTML or server-rendered exports like PDF. For private pastes that fall through to the origin anyway, do it on first read and cache.
Abuse prevention
The dirty secret of pastebin: people use it for credentials, malware, leaked data, copyright violation, and spam. You will face:
| Threat | Mitigation |
|---|---|
| Mass paste spam | Rate limit per IP. Captcha after N pastes/hour. See rate limiter design. |
| Credential dumps | Scan with regex (AWS keys, API tokens) on ingest; flag and notify. |
| Malware payloads | Hash check against known-bad; partner with security vendors. |
| Phishing pages | Pastebin isn't HTML — but rendered code can include URLs. Manual + automated abuse reports. |
| Copyrighted content | DMCA flow. |
| Hot pastes | Auto-block on excessive reads (could be a denial-of-wallet attack on your S3 egress). |
This is real work. Real pastebin operators have full-time abuse teams.
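As an illustration of the credential-scanning row, here is a minimal ingest-time scanner. The two patterns (the `AKIA` prefix format of AWS access key IDs and PEM private-key headers) are examples only; a production scanner would rely on a maintained rule set, and the function name is hypothetical.

```python
import re

# Illustrative patterns only, not a complete rule set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_for_secrets(content: str) -> list[str]:
    """Return the names of the patterns that match anywhere in the paste."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(content)]
```

A non-empty result would feed the "flag and notify" step; whether to block the paste outright is a product decision.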
Privacy: public vs unlisted vs private
| Tier | Indexed by search? | URL is enough? | Auth needed? |
|---|---|---|---|
| Public | Yes | Yes | No |
| Unlisted | No | Yes (unguessable URL) | No |
| Private | No | No | Yes |
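For unlisted pastes, the URL is the security boundary. The 7-char base62 IDs used throughout this design (62⁷ ≈ 3.5 trillion combinations) make guessing one specific paste's URL infeasible at any realistic request rate. Random probing for any live paste is a different matter: a year of unexpired writes at 10M/day fills about 0.1% of the space, so rate-limit 404s per client as well. Don't log full URLs anywhere they might leak.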
What if you go viral?
A paste gets posted to Hacker News and gets 1M reads in an hour.
Without caching: 1M / 3600 ≈ 280 reads/sec of one row, plus S3 reads. Not catastrophic but real.
With caching: CDN serves 99.9%+ of those, single-digit reads/sec hit the origin. CDN is the single biggest win for this workload.
Plan ahead:
- All paste reads through a CDN.
- Cache-Control set generously for immutable content.
- Origin protected by a cache-fronting layer (Cloudflare, Fastly).
Trade-offs to discuss in an interview
- Why split metadata and blob storage? Blob in DB bloats WAL, hurts replication; S3 is 10× cheaper and built for it.
- Why random IDs instead of sequential? Privacy. Sequential IDs let attackers enumerate.
- Random-with-retry vs KGS? Retry is simpler; KGS gives predictable write latency and decouples key generation from the write path. KGS shines once you're past ~1,000 writes/sec.
- Why a CDN? 90%+ read deflection. Cheapest and biggest performance win.
- Why eventually-consistent view counts? Each read writing is a bottleneck for trivial UX value.
- Why compress? 3–10× storage savings on text; basically free CPU.
- Inline vs blob storage threshold? ~8 KB. Below that, a Postgres round-trip beats an S3 round-trip; above, S3 wins on cost and DB health.
Things you should now be able to answer
- Why does a pastebin need different storage from a URL shortener?
- How do you size your ID space, and why does randomness matter?
- What does a CDN buy you here, and what fraction of reads does it absorb?
- How do you handle paste expiry without scanning the whole DB?
- What abuse vectors will your service face, and which are easy to block?
Further reading
- URL Shortener design — the simpler cousin
- Rate Limiter design — useful for abuse prevention
- GitHub Gist API docs — a richer pastebin
- AWS S3 lifecycle policies documentation
Frequently asked questions
Why store paste content in S3 instead of Postgres?
Putting blobs up to 10 MB in Postgres bloats the write-ahead log, hurts replication, and costs roughly 10 times more than object storage. S3 is built for binary blobs at scale; Postgres is kept for tiny metadata rows only — paste ID, expiry, language, and byte size.
What fraction of reads does a CDN absorb, and what does that mean for origin load?
A CDN absorbs 90%+ of reads, reducing origin traffic from roughly 12,000 reads per second down to about 1,200 reads per second at the 100:1 read-to-write ratio used in the article. Pastes are immutable until expiry, so setting Cache-Control: public, max-age=3600 turns a viral Hacker News link into a non-event at the origin.
When should you use a Key Generation Service instead of random base62 with retry?
Random base62 with collision retry is the right default and works cleanly up to roughly 1,000 writes per second. Switch to a KGS — which pre-generates a pool of unique keys and lets API servers reserve batches atomically using SELECT FOR UPDATE SKIP LOCKED — when you need predictable write latency or must eliminate retry overhead at higher write rates.
What is the inline paste optimization and when does it apply?
Pastes smaller than 8 KB are stored directly in a Postgres content_inline column instead of uploading to S3. Because roughly 70% of pastes are tiny, this skips the S3 round-trip for the majority of creates and reads, making a Postgres query faster than an S3 call for that common case.
How does the article recommend handling paste expiry?
The article combines three mechanisms: S3 lifecycle rules on tier-namespaced key prefixes (such as paste/1d/ or paste/1wk/) to delete blobs automatically, a nightly metadata sweep that runs SELECT id FROM pastes WHERE expires_at < NOW(), and a lazy-404 strategy that returns a 404 and queues async deletion when an expired paste is read between the two scheduled passes.