Design Google Docs (real-time collaborative editor)
Multiple people editing the same document simultaneously, every keystroke synced, never a corrupt merge. Operational Transformation, CRDTs, presence, and the architecture that's run quietly at Google for 17+ years.
The problem
Google Docs launched in 2006 (as the acquired Writely product), and by 2009 it had quietly solved one of the hardest problems in distributed systems: letting N people type into the same document at exactly the same time and guaranteeing they all end up with the same text. Today it handles roughly 50 million simultaneous editing sessions at peak, with every keystroke propagated to collaborators in under 200 milliseconds.
The product is a browser-based word processor you share with a link. Click "Share," paste a URL into Slack, and five colleagues can open the same document and all start typing — simultaneously, no turns, no locks. Their cursors appear on your screen in near-real time, colour-coded by user. The document stays coherent no matter who types what, when, or in what order the network delivers their changes.
That last sentence is the hard part. "Stays coherent" is not free. If two people insert text at the same position in the same millisecond, at least one of those inserts is based on a document version that no longer exists by the time it reaches the server. Naive approaches destroy data silently: the "last write wins" save strategy that works for a solo user will randomly wipe a collaborator's paragraph. Even diff-merge, the technique version-control systems use, produces garbled output when two edits overlap, with no way for the user to know something went wrong.
The core engineering tension is operation ordering under concurrency. Every edit must be expressed as a versioned, transformable operation; something must establish a canonical order for concurrent edits (in the Jupiter model, a single server per document); and every client must converge to the same final state regardless of the order in which messages arrive. Two approaches do this correctly at scale: Operational Transformation (the Jupiter model Google has run since 2009) and CRDTs (used by JupyterLab, with CRDT-inspired designs at Figma and others). Choosing between them, and building the sharding and offline model around the choice, is exactly what the interview is asking you to design.
Functional requirements
- Multiple users can edit the same document at the same time.
- Each user sees others' changes within a few hundred milliseconds.
- See other users' cursors, selections, and presence.
- Edits never get lost; no garbled merges.
- Works offline; sync on reconnect.
- Comments, suggestions, version history.
- Permissions: owner, editor, commenter, viewer.
Non-functional requirements
- p99 latency for "I typed a character → I see it" < 10ms (local echo — optimistic apply before server ack).
- p99 latency for server round-trip ack < 30ms in-region.
- p99 latency for "I typed → my collaborator sees it" < 200ms.
- Document state convergence — every client sees the same final document regardless of network ordering.
- Durability — never lose acknowledged edits.
- 100M+ active documents, 50M concurrent editing sessions at peak.
- Documents up to 10MB of text + embedded objects.
Capacity estimation
| Dimension | Estimate | How we got there |
|---|---|---|
| Total users | 1 B | — |
| Concurrent editors (peak) | ~50 M | Most "open" docs are 1 person reading |
| Active docs with 2+ editors | ~5 M | — |
| Ops/sec (peak) | ~10 M ops/sec | 50M × 0.2 active fraction × 1 op/sec/editor |
| Inbound bandwidth | ~500 MB/sec | 10M ops/sec × 50 B/op (insert/delete with position) |
| Text storage (active docs) | ~5 TB | 100M active docs × 50 KB text/doc |
| Op-log storage | ~50 TB | 100M docs × 500 KB op history/doc (~10× text during session) |
| Total raw (before compression) | ~50–100 TB | Heavily compressible |
Takeaway: 10M ops/sec of order-sensitive writes, sub-200ms multi-user delivery, and ~50–100 TB of op log — the bottleneck is the per-doc OT serialization, not raw storage or bandwidth.
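The arithmetic behind the table can be reproduced in a few lines. This is a back-of-envelope sketch: the constant names are illustrative, and every input is one of the table's estimates rather than a measured value.

```typescript
// Back-of-envelope capacity math. Inputs are the table's estimates, not measurements.
const concurrentEditors = 50e6;   // peak concurrent editing sessions
const activeFraction = 0.2;       // share of editors typing at a given moment
const opsPerEditorPerSec = 1;
const bytesPerOp = 50;            // insert/delete with position
const activeDocs = 100e6;
const textBytesPerDoc = 50e3;     // 50 KB
const opLogBytesPerDoc = 500e3;   // 500 KB, roughly 10x the text

const opsPerSec = concurrentEditors * activeFraction * opsPerEditorPerSec;
const inboundBytesPerSec = opsPerSec * bytesPerOp;
const textStorageBytes = activeDocs * textBytesPerDoc;
const opLogStorageBytes = activeDocs * opLogBytesPerDoc;

console.log(`ops/sec:        ${(opsPerSec / 1e6).toFixed(0)} M`);
console.log(`inbound:        ${(inboundBytesPerSec / 1e6).toFixed(0)} MB/s`);
console.log(`text storage:   ${(textStorageBytes / 1e12).toFixed(0)} TB`);
console.log(`op-log storage: ${(opLogStorageBytes / 1e12).toFixed(0)} TB`);
```

Changing `activeFraction` or `bytesPerOp` shows how sensitive the ops/sec and bandwidth figures are to those two guesses.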
Building up to the design
Collaborative editing is the textbook "naive approach destroys your data" problem. Walking forward, each fix earns its place by being the only thing that prevents a specific kind of corruption.
V1: Lock the document while one person edits
Classic 1990s shared-drive UX. User clicks "edit"; server takes a lock; everyone else sees "read-only, locked by Varun."
This is trivially correct — at any moment, exactly one author owns the document and there's nothing to merge. The problem is it isn't collaborative. Two people across the world can't both type into a meeting notes doc at the same time. Every shared-drive editor looked like this before browser-based real-time collab, which is precisely why Writely (2005, later acquired by Google to become Docs) felt revolutionary.
V2: Last-write-wins on the whole document
Both users edit locally; periodically POST the full document back. Server keeps whichever version arrived last.
This works end-to-end for a single user: the "save" button does its job. But bring in two users and you get catastrophic data loss: User A types a paragraph, User B types a sentence, and whoever saves second silently wipes the other's work. Every collaborative editing demo in 1995 had exactly this bug.
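The data loss is easy to reproduce. The following is a toy sketch, assuming a hypothetical `LwwServer` class that stores whichever full document arrived last; it is not how any real editor is built.

```typescript
// Toy last-write-wins server: keeps whichever full document was saved last.
class LwwServer {
  private text: string;
  constructor(initial: string) {
    this.text = initial;
  }
  read(): string {
    return this.text;
  }
  save(fullText: string): void {
    this.text = fullText;
  }
}

const server = new LwwServer('Agenda:');

// Both users load the same version, then edit their own local copy.
const copyA = server.read() + ' budget review (A)';
const copyB = server.read() + ' hiring plan (B)';

server.save(copyA);
server.save(copyB); // B never saw A's edit, so A's text is overwritten

const finalText = server.read();
console.log(finalText); // "Agenda: hiring plan (B)"
```

Neither user gets an error; User A's text is simply gone.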
V3: Diff and merge
Each client sends a diff — the changes since last sync — and the server attempts a 3-way merge (last common version plus each user's diff). Small edits in different parts of the document merge cleanly.
The problem surfaces when two users touch the same region. Unlike git, text editors don't surface conflict markers — they just produce wrong text that the user can't tell is wrong. User A inserts "not" at position 50; User B deletes characters 48–60 (which now covers A's "not"); the result depends on merge order, and either way the document silently disagrees on different clients. There's no heuristic fix for this. Text editing needs a different primitive entirely.
V4: Operational Transformation (OT)
Frame edits as operations on the document: insert(pos, text), delete(pos, len). When two operations arrive concurrently, transform each against the other so the final document is the same regardless of which order they're applied.
The textbook example — two concurrent inserts into "abc":
Start: "abc"
User A: insert(1, "X") based on "abc"
User B: insert(2, "Y") based on "abc"
Naive: apply A then B
"abc" → "aXbc" (A applied)
Now B wants insert(2, "Y") — but "Y" should go between b and c.
In "aXbc", that position is 3, not 2.
If we apply B's original op verbatim: insert(2, "Y") → "aXYbc" ✗
Naive: apply B then A
"abc" → "abYc" (B applied)
Apply A's insert(1, "X"): "aXbYc" ✓ (this one happens to work)
Without OT, the two orderings produce different documents. With OT, the server transforms B against A before applying it — insert(2, "Y") becomes insert(3, "Y") because A inserted one character before B's position — and both orderings converge to "aXbYc".
Every client converges on the same document regardless of which operation arrived first, as long as the transformation function is correct for every pair of op types. That last clause is the hard part. OT still needs a referee: peer-to-peer OT — every client transforming against every other — requires satisfying the TP2 property, and implementing TP2 correctly for the full range of text op types (insert, delete, format) is notoriously error-prone. Most attempts have produced subtle correctness bugs. Google sidesteps this entirely with a central server.
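The "abc" walkthrough above can be checked directly. This is a minimal sketch that handles inserts only; `applyInsert` and `transformInsert` are illustrative names, and the transform assumes the two positions differ (equal positions need a tie-break rule).

```typescript
type Insert = { pos: number; text: string };

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so it can be applied after `applied` has already changed the doc.
// Assumption: positions differ; a real transform needs a tie-break for equal positions.
function transformInsert(op: Insert, applied: Insert): Insert {
  return applied.pos < op.pos ? { ...op, pos: op.pos + applied.text.length } : op;
}

const start = 'abc';
const opA: Insert = { pos: 1, text: 'X' };
const opB: Insert = { pos: 2, text: 'Y' };

// Without transformation, B's stale position puts "Y" in the wrong place.
const naive = applyInsert(applyInsert(start, opA), opB); // "aXYbc"

// With transformation, both application orders agree.
const aThenB = applyInsert(applyInsert(start, opA), transformInsert(opB, opA)); // "aXbYc"
const bThenA = applyInsert(applyInsert(start, opB), transformInsert(opA, opB)); // "aXbYc"

console.log({ naive, aThenB, bThenA });
```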
V5: Jupiter model — server-centric OT
A single server owns the canonical operation order. Clients send operations tagged with a version number ("I'm editing on top of version 42"); the server transforms incoming ops against any ops accepted since then, applies them, and broadcasts the transformed op to all other clients with the new version number.
sequenceDiagram
participant A as Client A v42
participant S as Server canonical
participant B as Client B v42
A->>S: op_A based on v42
B->>S: op_B based on v42
S->>S: serialize: A first, transform B against A
S->>S: apply both → v44
S-->>A: ack op_A v43, deliver transformed op_B v44
S-->>B: deliver op_A v43, ack transformed op_B v44
This is production-grade collaborative editing — what Google has run since 2009 (when OT was adopted as the core technique for Google Wave and Google Docs), battle-tested at billions of edits per day. The remaining gap is scale and offline. One central server per document is a bottleneck for hot documents, and if you go offline for an hour and type 1000 characters, those ops need to be transformed against everything that happened while you were gone.
V6: Sharding by document + offline-first clients
Each document gets a "home shard" — one process that owns its operation log. All ops for that doc route to that shard via consistent hashing. The shard holds the OT state machine in memory, persists the op log to durable storage, and snapshots periodically.
Clients buffer ops locally when offline; on reconnect, they replay against the latest server version and the server transforms each one in turn.
V7: Presence, comments, suggestions, version history
Cursors, selections, and "Bob is typing…" flow over a separate pub/sub channel — ephemeral, no convergence guarantees needed. Comments anchor to character ranges via stable IDs that survive edits. Suggestions are a parallel op stream the doc owner can accept or reject. Version history is the op log itself — replay to any point.
This is the production design: V5 + V6 + V7.
flowchart LR
V1[V1: lock-on-edit<br/>not collaborative] --> V2[V2: LWW<br/>data loss]
V2 --> V3[V3: diff/merge<br/>garbled overlaps]
V3 --> V4[V4: OT operations<br/>convergence math]
V4 --> V5[V5: Jupiter server<br/>central referee]
V5 --> V6[V6: + sharding + offline<br/>scale + UX]
V6 --> V7[V7: + presence + history<br/>production Docs]
style V1 fill:#0e7490,color:#fff
style V4 fill:#15803d,color:#fff
style V5 fill:#ff6b1a,color:#0a0a0f
style V7 fill:#a855f7,color:#fff
High-level architecture
flowchart TD
A[Editor A<br/>browser/mobile] -->|WebSocket| GW[Realtime Gateway]
B[Editor B] -->|WebSocket| GW
C[Editor C] -->|WebSocket| GW
GW --> ROUTE{Doc Router}
ROUTE -->|"hash(doc_id)"| SHARD1[Doc Shard 1<br/>OT engine for doc 42]
ROUTE --> SHARD2[Doc Shard 2<br/>OT engine for doc 43]
SHARD1 --> OPLOG[(Op Log<br/>append-only, per doc)]
SHARD1 --> SNAP[(Snapshot Store<br/>Spanner / Bigtable)]
SHARD1 --> PUB[Pub/Sub<br/>presence + cursors]
PUB --> GW
DOC[Doc Service] --> SHARD1
META[(Metadata DB<br/>ACL, title, owner)] --> DOC
A -.media uploads.-> BLOB[(Blob Store)]
BLOB --> CDN
style SHARD1 fill:#ff6b1a,color:#0a0a0f
style OPLOG fill:#15803d,color:#fff
style GW fill:#0e7490,color:#fff
Five conceptual layers, each with a clear job:
- Realtime gateway — owns WebSocket connections, multiplexes per-doc subscriptions.
- Doc router — maps doc_id to the shard currently serving it.
- Doc shard — single-threaded OT engine for each owned document, in-memory operation log, periodic snapshot to durable storage.
- Op log + snapshot store — durability: every accepted op persisted before ack, snapshots for fast load.
- Pub/sub channel — non-convergent ephemeral state (cursors, presence, typing indicators).
The OT operation model
Every edit is encoded as a structured operation. The simplest set:
type Op =
| { type: 'insert', pos: number, text: string }
| { type: 'delete', pos: number, len: number }
| { type: 'retain', len: number } // skip N chars unchanged
For rich text (bold, italic, links, embeds), operations carry attributes:
// Illustrative subset of formatting attributes
type Attrs = { bold?: boolean, link?: string }
type RichOp =
| { type: 'insert', pos: number, text: string, attrs?: Attrs }
| { type: 'format', pos: number, len: number, attrs: Attrs }
| { type: 'delete', pos: number, len: number }
The key insight is that operations compose linearly: a document is "starting from empty, apply this list of ops in order."
The transformation function
For each pair of operation types, there's a transform(a, b) function that returns (a', b') such that applying a then b' produces the same result as applying b then a'. This is the math that lets the server reorder concurrent operations safely.
// Simplified: a real implementation handles every op-type pair
type UserOp = Op & { user: string }; // author id, used only to break ties

function transform(a: UserOp, b: UserOp): [UserOp, UserOp] {
  if (a.type === 'insert' && b.type === 'insert') {
    // Tie-break on equal positions: the lower user id keeps its position
    if (a.pos < b.pos || (a.pos === b.pos && a.user < b.user)) {
      return [a, { ...b, pos: b.pos + a.text.length }];
    } else {
      return [{ ...a, pos: a.pos + b.text.length }, b];
    }
  }
  // insert vs delete, delete vs delete, format vs anything...
  // ~10 cases total for plain text; many more for rich text
  throw new Error(`transform not implemented for ${a.type} vs ${b.type}`);
}
The full set of cases is the hard part. Get one wrong and you have a silent data-corruption bug that nobody can reproduce. This is why production OT systems take years to mature.
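To give a feel for the remaining cases, here is one more pair: insert versus delete. This is a sketch under stated assumptions, not a production transform. The function names are hypothetical, each function transforms in one direction only, and the delete side returns a list because an insert that lands inside the deleted range splits the delete in two.

```typescript
type Insert = { type: 'insert'; pos: number; text: string };
type Delete = { type: 'delete'; pos: number; len: number };

function apply(doc: string, op: Insert | Delete): string {
  return op.type === 'insert'
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
}

// Rewrite an insert so it applies after a concurrent delete has already run.
function insertAfterDelete(ins: Insert, del: Delete): Insert {
  if (ins.pos <= del.pos) return ins; // insert is before the deleted range
  if (ins.pos >= del.pos + del.len) return { ...ins, pos: ins.pos - del.len }; // after it
  return { ...ins, pos: del.pos }; // inside it: collapse to the range start
}

// Rewrite a delete so it applies after a concurrent insert has already run.
function deleteAfterInsert(del: Delete, ins: Insert): Delete[] {
  if (ins.pos <= del.pos) return [{ ...del, pos: del.pos + ins.text.length }];
  if (ins.pos >= del.pos + del.len) return [del];
  const left = ins.pos - del.pos; // chars deleted to the left of the inserted text
  return [
    { type: 'delete', pos: del.pos, len: left },
    // once the first part is gone, the inserted text starts at del.pos
    { type: 'delete', pos: del.pos + ins.text.length, len: del.len - left },
  ];
}

const doc = 'abcdef';
const ins: Insert = { type: 'insert', pos: 3, text: 'X' };
const del: Delete = { type: 'delete', pos: 1, len: 4 }; // removes "bcde"

const insFirst = deleteAfterInsert(del, ins).reduce<string>((d, op) => apply(d, op), apply(doc, ins));
const delFirst = apply(apply(doc, del), insertAfterDelete(ins, del));
console.log(insFirst, delFirst); // both "aXf"
```

The design choice here is to keep the concurrently inserted text even though its surroundings were deleted; other systems may choose differently, but every client must apply the same rule.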
The Jupiter model (Google's choice)
Every client tracks a pair of scalar sequence numbers — (ops_sent_to_server, ops_received_from_server). This is a 2D state coordinate, not a vector clock; the Jupiter paper uses it to identify exactly which transformations are still pending. On send, the op is tagged with the server version the client based it on. The server:
- Looks up all ops applied since that version.
- Transforms the incoming op against each one in order.
- Applies and persists the transformed op.
- Broadcasts to other clients (each receives a version they can apply directly).
sequenceDiagram
participant A as Client A
participant S as Server
participant B as Client B
Note over A,B: Both at v42
A->>S: insert(5,"x") @ v42
B->>S: delete(3,2) @ v42
S->>S: serialize: A first
S->>S: transform B's delete against A's insert
S-->>A: ACK insert as v43
S-->>A: deliver transformed delete as v44
S-->>B: deliver A's insert as v43
S-->>B: ACK transformed delete as v44
Each client's local state always tracks "v43 → v44 → …" — no client ever sees an inconsistent middle state.
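The server loop described above can be sketched in a few dozen lines. This is a hypothetical `DocServer` that handles inserts only and assumes each client sends one op at a time and waits for the ack; batches of dependent ops need the fuller pairwise rebase that a real Jupiter implementation performs. Versions start at 0 here rather than 42.

```typescript
type Insert = { pos: number; text: string };

// Shift `op` past an insert the server already accepted. Ties go to the
// already accepted op, so the newcomer lands after it.
function transformInsert(op: Insert, applied: Insert): Insert {
  return applied.pos <= op.pos ? { ...op, pos: op.pos + applied.text.length } : op;
}

class DocServer {
  private log: Insert[] = []; // log[i] took the doc from version i to i + 1
  private text: string;

  constructor(initial: string) {
    this.text = initial;
  }

  get version(): number {
    return this.log.length;
  }

  get doc(): string {
    return this.text;
  }

  // Accept an op that the client created on top of `baseVersion`.
  receive(op: Insert, baseVersion: number): { version: number; op: Insert } {
    if (baseVersion < 0 || baseVersion > this.log.length) {
      throw new Error('unknown base version');
    }
    let transformed = op;
    // Transform against every op accepted since the client's base version, in order.
    for (const applied of this.log.slice(baseVersion)) {
      transformed = transformInsert(transformed, applied);
    }
    this.text =
      this.text.slice(0, transformed.pos) + transformed.text + this.text.slice(transformed.pos);
    this.log.push(transformed); // a real shard persists this before acking
    return { version: this.log.length, op: transformed }; // broadcast to other clients
  }
}

const server = new DocServer('abc');
const ackA = server.receive({ pos: 1, text: 'X' }, 0); // arrives first
const ackB = server.receive({ pos: 2, text: 'Y' }, 0); // concurrent, same base version
console.log(ackA, ackB, server.doc); // doc is "aXbYc"
```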
CRDTs: the modern alternative
Conflict-free Replicated Data Types take a different approach: instead of transforming operations against each other, design data structures where merges are mathematically commutative. For text, this means giving every character a unique ID and ordering characters by those IDs rather than by integer positions. The most-used implementations are:
| | Yjs | Automerge |
|---|---|---|
| Algorithm | YATA | RGA with columnar encoding |
| Wire / memory overhead | ~few bytes/char; compact binary | ~20-30% over raw text at rest (Automerge 2.0 columnar compression; older versions could exceed 1,000× overhead) |
| Speed | Fast for text | Slower at runtime; compact at rest |
| Used by | JupyterLab, many editor integrations (CodeMirror, Monaco, Quill) | Local-first apps, structured docs |
CRDTs are appealing because they don't need a central server — any two clients can merge offline changes when they reconnect, and convergence is guaranteed by construction rather than by getting every transformation case right. The flip side is that per-character metadata is real overhead at scale, and intent preservation is harder: two users typing "cat" and "dog" at the same position can interleave to "cdaotg" — both clients converge, but on garbage. CRDTs are also harder to bolt onto an existing OT-based server.
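Both properties, convergence by construction and the interleaving hazard, show up in even a toy sequence CRDT. The sketch below is not YATA or RGA: it assumes a deliberately naive scheme in which each character gets a fractional key between its neighbours plus a site ID, and the document is the set of characters sorted by that pair. Floating-point keys run out of precision quickly, so treat this as illustration only.

```typescript
// Toy sequence CRDT: each character has a unique (key, site) ID, and the
// document is the set of characters sorted by that ID.
type Char = { key: number; site: string; ch: string };

function sortChars(chars: Char[]): Char[] {
  return chars
    .slice()
    .sort((x, y) => x.key - y.key || (x.site < y.site ? -1 : x.site > y.site ? 1 : 0));
}

// Merge is set union keyed by ID, so the order of merging does not matter.
function merge(a: Char[], b: Char[]): Char[] {
  const byId: Record<string, Char> = {};
  for (const c of a.concat(b)) byId[`${c.key}:${c.site}`] = c;
  return sortChars(Object.keys(byId).map((id) => byId[id]));
}

// Append `text` at the end of a replica, picking keys between the last char and 1.
function typeAtEnd(replica: Char[], site: string, text: string): Char[] {
  const out = sortChars(replica);
  let prev = out.length > 0 ? out[out.length - 1].key : 0;
  for (let i = 0; i < text.length; i++) {
    prev = (prev + 1) / 2; // midpoint between the previous key and 1
    out.push({ key: prev, site, ch: text.charAt(i) });
  }
  return out;
}

const render = (chars: Char[]): string => chars.map((c) => c.ch).join('');

// Two users start from the same empty document and type concurrently.
const alice = typeAtEnd([], 'A', 'cat');
const bob = typeAtEnd([], 'B', 'dog');

const mergedAB = render(merge(alice, bob));
const mergedBA = render(merge(bob, alice));
console.log(mergedAB, mergedBA); // both "cdaotg"
```

Both replicas agree, and both are wrong from the users' point of view. Production algorithms add extra ordering rules to reduce this kind of interleaving.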
Google's stance has been consistent: Jupiter-model OT has run without incident for 17+ years (since 2009). The cost of rewriting the stack outweighs the marginal benefit. That said, the industry is genuinely split. Figma explicitly rejected OT as unnecessarily complex for design-tool property changes and instead built a custom last-writer-wins property system inspired by CRDTs on a centralized server. JupyterLab and many editor integrations use Yjs. Notion and Linear each have homegrown sync layers. The takeaway: Jupiter-OT and modern CRDTs are both viable for a fresh design — neither has "won."
Storage: op log + snapshots
A document is fundamentally an append-only log of operations. Two storage concerns work together to keep things fast and durable:
flowchart LR
OP[Each accepted op] --> LOG[(Op Log<br/>append-only)]
LOG -.every N ops or T minutes.-> SNAP[Snapshot job]
SNAP --> S[(Snapshot Store<br/>compact doc state at v_N)]
LOAD[Load doc] --> S
S --> APPLY[Apply ops since snapshot]
APPLY --> READY[Doc ready in memory]
style LOG fill:#15803d,color:#fff
style S fill:#ff6b1a,color:#0a0a0f
The op log records every operation before acking it to the client. Sharded by doc_id, replicated, durable — Bigtable, Spanner, or Cassandra all fit here.
The snapshot kicks in every N ops (say, 100) or T minutes, writing a compact image of the full document state. Loading a doc then means: fetch the latest snapshot, replay ops since. Old op log entries can be pruned after a snapshot covers them, though Google keeps them for version history.
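The load path (latest snapshot, then replay the tail of the log) can be sketched with an in-memory stand-in. `DocStore` and its `snapshotEvery` parameter are hypothetical names; a real system would write to durable storage rather than arrays.

```typescript
type Op =
  | { type: 'insert'; pos: number; text: string }
  | { type: 'delete'; pos: number; len: number };

function apply(doc: string, op: Op): string {
  return op.type === 'insert'
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
}

// In-memory stand-in for one document's op log and snapshot store.
class DocStore {
  private ops: Op[] = []; // ops[i] produces version i + 1
  private snapshot = { version: 0, text: '' }; // latest compact image
  private current = ''; // live state held by the shard
  private snapshotEvery: number;

  constructor(snapshotEvery: number) {
    this.snapshotEvery = snapshotEvery;
  }

  append(op: Op): number {
    this.ops.push(op); // stands in for "persist before ack"
    this.current = apply(this.current, op);
    if (this.ops.length % this.snapshotEvery === 0) {
      this.snapshot = { version: this.ops.length, text: this.current };
    }
    return this.ops.length;
  }

  // Cold load: start from the latest snapshot and replay only the ops after it.
  load(): { text: string; replayed: number } {
    const tail = this.ops.slice(this.snapshot.version);
    const text = tail.reduce<string>((d, op) => apply(d, op), this.snapshot.text);
    return { text, replayed: tail.length };
  }
}

const store = new DocStore(3);
store.append({ type: 'insert', pos: 0, text: 'hello' });
store.append({ type: 'insert', pos: 5, text: ' world' });
store.append({ type: 'delete', pos: 0, len: 1 }); // snapshot taken at version 3
store.append({ type: 'insert', pos: 0, text: 'J' });

const loaded = store.load();
console.log(loaded); // { text: "Jello world", replayed: 1 }
```

Only one op is replayed on load even though four were written, which is the whole point of snapshotting.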
Document sharding
Each doc lives on one shard. That shard holds the latest doc state in memory, serializes incoming ops single-threaded (no locks needed), persists each op to the op log before acking, and broadcasts accepted ops to subscribed clients via the gateway. When no editors have been active for ~10 minutes, the shard evicts the doc to free memory and reloads on the next connection.
flowchart LR
GW[Realtime Gateway] --> ROUTE[Router<br/>consistent hash on doc_id]
ROUTE --> S1[Shard 1<br/>doc_id ∈ ring slice 1]
ROUTE --> S2[Shard 2<br/>doc_id ∈ ring slice 2]
ROUTE --> SN[Shard N]
S1 -.failover.-> S1B[Shard 1 standby]
style ROUTE fill:#ff6b1a,color:#0a0a0f
If a shard dies mid-session, a standby reads the snapshot plus recent ops and reattaches clients. Clients buffer ops during the brief reconnect window.
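The router's consistent-hash lookup might look like the sketch below. The `HashRing` class, the virtual-node count, and the FNV-1a style hash are all illustrative assumptions; the source does not say which hash or ring layout Google uses.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (an illustrative choice).
function hash32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to 32 bits.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h;
}

class HashRing {
  private points: { hash: number; shard: string }[] = [];

  constructor(shards: string[], vnodes = 64) {
    for (const shard of shards) {
      // Several virtual nodes per shard smooth out the distribution.
      for (let v = 0; v < vnodes; v++) {
        this.points.push({ hash: hash32(`${shard}#${v}`), shard });
      }
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }

  // The first ring point clockwise from the doc's hash owns the doc.
  shardFor(docId: string): string {
    const h = hash32(docId);
    let lo = 0;
    let hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.points[mid].hash < h) lo = mid + 1;
      else hi = mid;
    }
    return this.points[lo % this.points.length].shard; // wrap around the ring
  }
}

const ring = new HashRing(['shard-1', 'shard-2', 'shard-3']);
console.log(ring.shardFor('doc-42'));
```

The property that matters for failover is that removing one shard only moves the documents that shard owned; every other document keeps its owner.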
Three failure modes are worth calling out in an interview:
- Shard restart latency: reloading a large doc (snapshot fetch plus op replay) can take several seconds. The gateway must hold buffered ops from clients without dropping them, typically via a per-doc in-memory queue with a bounded timeout before surfacing a transient error.
- Hot documents: a viral shared doc linked from social media may attract thousands of concurrent editors, saturating a single shard's CPU. Shed presence traffic first (it's the highest volume), apply per-doc connection limits, and schedule hot docs onto beefy shard hosts via a placement service.
- Thundering herd on shard restart: if a shard that owns many hot docs crashes, all clients reconnect simultaneously, triggering many parallel snapshot reads. Jittered reconnect backoff at the gateway dampens this.
Presence and cursors
Cursor positions, selections, "Alice is typing…" — these don't need convergence. They flow over a separate pub/sub channel:
Client A: cursor moved to position 42
→ gateway → pub/sub topic doc:123:presence
→ all other clients of doc 123 receive the event
→ render Alice's cursor at position 42 (locally transformed if needed)
Lossy is fine. If a cursor update drops, the next one (~100ms later) corrects it. The interesting subtlety: cursor positions still need to be transformed on incoming ops. If Alice's cursor is at position 50 and Bob inserts "hello" at position 30, Alice's cursor should logically shift to 55. The client does this transformation locally based on the same op stream, without routing it through the server.
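The local cursor adjustment is a small pure function. In this sketch `transformCursor` is a hypothetical name, and the choice to leave the cursor in place when an insert lands exactly on it is an assumption; pushing it right is an equally valid convention.

```typescript
type Op =
  | { type: 'insert'; pos: number; text: string }
  | { type: 'delete'; pos: number; len: number };

// Shift a collaborator's cursor so it stays on the same logical character
// after `op` is applied to the local document.
function transformCursor(cursor: number, op: Op): number {
  if (op.type === 'insert') {
    // Assumption: an insert exactly at the cursor does not move it.
    return op.pos < cursor ? cursor + op.text.length : cursor;
  }
  if (cursor <= op.pos) return cursor; // deleted range is after the cursor
  if (cursor >= op.pos + op.len) return cursor - op.len; // deleted range is before it
  return op.pos; // cursor was inside the deleted range
}

// Alice's cursor is at 50; Bob inserts "hello" at 30.
const shifted = transformCursor(50, { type: 'insert', pos: 30, text: 'hello' });
console.log(shifted); // 55
```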
Offline editing
The client maintains three things: the latest acknowledged server version v_ack, a local op log of unacked operations, and the currently rendered document (server state with local ops applied on top).
sequenceDiagram
participant C as Client (offline)
participant L as Local Op Buffer
participant S as Server
Note over C,S: Connection drops at v_ack=50
C->>L: insert("hello") → local v51
C->>L: delete(3,2) → local v52
Note over C,S: Reconnects
C->>S: send [insert("hello") @ v50, delete(3,2) @ v50]
S->>S: transform each op against ops 51..N accepted while offline
S-->>C: transformed ops v_N+1, v_N+2
C->>C: rebase local UI on transformed ops
While offline, edits go into the local op log and the document updates locally — zero latency, no spinners. On reconnect, the client sends its unacked ops tagged with v_ack. The server transforms each against everything that happened since then and broadcasts the results. For long offline sessions of hours, the client may end up rebasing significantly — but the convergence math still holds.
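The client-side half of this (the three pieces of state and the reconnect payload) can be sketched as follows. `OfflineClient` is a hypothetical class, and `onSynced` simplifies the rebase by accepting the server's canonical text wholesale; a real client would instead apply the transformed ops it receives.

```typescript
type Op =
  | { type: 'insert'; pos: number; text: string }
  | { type: 'delete'; pos: number; len: number };

function apply(doc: string, op: Op): string {
  return op.type === 'insert'
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
}

class OfflineClient {
  private pending: Op[] = []; // local ops the server has not acknowledged
  private rendered: string; // server state with pending ops applied on top
  private vAck: number; // latest acknowledged server version

  constructor(serverText: string, vAck: number) {
    this.rendered = serverText;
    this.vAck = vAck;
  }

  // Local echo: apply immediately, buffer for later sync.
  edit(op: Op): void {
    this.rendered = apply(this.rendered, op);
    this.pending.push(op);
  }

  // What to send on reconnect: every unacked op, tagged with the base version.
  reconnectPayload(): { baseVersion: number; ops: Op[] } {
    return { baseVersion: this.vAck, ops: this.pending.slice() };
  }

  // Simplification: adopt the server's canonical text once the batch is accepted.
  onSynced(version: number, serverText: string): void {
    this.vAck = version;
    this.pending = [];
    this.rendered = serverText;
  }

  get text(): string {
    return this.rendered;
  }

  get unackedCount(): number {
    return this.pending.length;
  }
}

const client = new OfflineClient('abcdefgh', 50);
client.edit({ type: 'insert', pos: 8, text: 'hello' }); // renders "abcdefghhello"
client.edit({ type: 'delete', pos: 3, len: 2 }); // renders "abcfghhello"
const payload = client.reconnectPayload();
console.log(client.text, payload.baseVersion, payload.ops.length);
```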
Permissions and ACLs
CREATE TABLE doc_acls (
  doc_id BIGINT,
  user_id BIGINT,
  role TEXT, -- 'owner' | 'editor' | 'commenter' | 'viewer'
  expires_at TIMESTAMPTZ,
  PRIMARY KEY (doc_id, user_id)
);
ACL checks happen at the gateway (on connect) and at the shard (on every op). Viewers can subscribe to the op stream but the shard rejects their writes. Commenters can write comment ops but not document ops.
Sharing-with-link is just an ACL row with user_id = link_token and a special role.
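The per-op check at the shard reduces to a small rule table. This sketch uses hypothetical names (`AclRow`, `canWrite`) and makes two assumptions the text does not spell out: a null `expiresAt` means the grant never expires, and owners and editors may also write comment ops.

```typescript
type Role = 'owner' | 'editor' | 'commenter' | 'viewer';
type OpKind = 'document' | 'comment';
type AclRow = { docId: number; userId: number; role: Role; expiresAt: Date | null };

// Shard-side check run on every incoming op.
function canWrite(row: AclRow | undefined, kind: OpKind, now: Date): boolean {
  if (!row) return false; // no ACL row: no access
  // Assumption: null means the grant never expires.
  if (row.expiresAt !== null && row.expiresAt.getTime() <= now.getTime()) return false;
  if (row.role === 'owner' || row.role === 'editor') return true;
  if (row.role === 'commenter') return kind === 'comment';
  return false; // viewers may subscribe to the op stream but never write
}

const now = new Date('2024-01-01T00:00:00Z');
const commenter: AclRow = { docId: 1, userId: 7, role: 'commenter', expiresAt: null };
console.log(canWrite(commenter, 'comment', now), canWrite(commenter, 'document', now)); // true false
```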
Comments and suggestions
Both anchor to character ranges. A naive (start_offset, end_offset) breaks under concurrent edits — if someone deletes the commented text, the comment dangles.
The fix: an anchor op inserts an invisible marker into the doc; the comment refers to that marker by ID. Edits naturally move the marker via OT, just like any other character.
Suggestions are a parallel "tracked changes" stream: suggest_insert(pos, text, by=user_X). The owner can accept (merges into the main op stream) or reject (discards).
Version history
The op log is the version history. When the UI shows "Version from 3pm today," it:
- Finds the snapshot closest to the requested time.
- Replays ops up to the target timestamp.
- Renders read-only.
Restoring a version means creating a giant replace_all op that brings the current doc back to the historical state — applied through the normal OT path, so concurrent editors see it as just another edit.
Performance budgets
| Action | Budget |
|---|---|
| Local echo (I type, I see) | < 10ms |
| Local → server roundtrip ack | < 30ms in-region |
| Server → other clients | < 100ms in-region |
| Cross-region peer | < 300ms |
| Snapshot load (open doc) | < 1s |
Local echo is the most important UX number. Every keystroke must paint immediately — the client applies its own op optimistically, then reconciles when the server's transformed version arrives.
Edge cases
Undo / redo
Each user's undo stack is user-local — Alice's undo only undoes Alice's ops, not Bob's. Implementing this on top of OT requires tracking, for each op, an inverse op that's also transformed against intervening edits. Get this wrong and undo "un-types" someone else's work — a famously irritating bug.
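A small example makes the bug concrete. This sketch covers only the simplest case, undoing an insert after another user has inserted elsewhere; `invertInsert` and `rebaseUndo` are hypothetical names, and the case where the other user typed inside the text being undone is deliberately left unhandled.

```typescript
type Insert = { type: 'insert'; pos: number; text: string };
type Delete = { type: 'delete'; pos: number; len: number };

function apply(doc: string, op: Insert | Delete): string {
  return op.type === 'insert'
    ? doc.slice(0, op.pos) + op.text + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + op.len);
}

// The inverse of an insert is a delete of the same span.
function invertInsert(op: Insert): Delete {
  return { type: 'delete', pos: op.pos, len: op.text.length };
}

// Move an undo past an insert that another user made after the original op.
function rebaseUndo(undo: Delete, later: Insert): Delete {
  if (later.pos <= undo.pos) return { ...undo, pos: undo.pos + later.text.length };
  if (later.pos >= undo.pos + undo.len) return undo;
  throw new Error('insert landed inside the span being undone; needs a split delete');
}

const aliceOp: Insert = { type: 'insert', pos: 1, text: 'X' };
const bobOp: Insert = { type: 'insert', pos: 0, text: 'YY' };

const afterAlice = apply('abc', aliceOp); // "aXbc"
const afterBob = apply(afterAlice, bobOp); // "YYaXbc"

// Replaying the raw inverse deletes one of Bob's characters instead of Alice's.
const naiveUndo = apply(afterBob, invertInsert(aliceOp)); // "YaXbc"
// Transforming the inverse against Bob's op removes Alice's "X" as intended.
const rebasedUndo = apply(afterBob, rebaseUndo(invertInsert(aliceOp), bobOp)); // "YYabc"

console.log({ naiveUndo, rebasedUndo });
```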
Embedded objects (images, tables, charts)
Treat each embedded object as a single character at the document level. The actual asset lives in blob storage; the doc op stores insert(pos, "[blob_id=abc]"). Concurrent edits inside a table (cell edits) go through their own OT scope.
Massive paste
Pasting a 100-page document is one giant insert op. Most systems split into chunks (one op per 1KB, for example) to keep individual ops small and the UI responsive during sync.
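Chunking is mostly bookkeeping: each chunk must land immediately after the previous one. The sketch below uses a hypothetical `chunkInsert` helper and counts the chunk size in UTF-16 code units rather than bytes, which is an assumption made for simplicity.

```typescript
type Insert = { type: 'insert'; pos: number; text: string };

function applyInsert(doc: string, op: Insert): string {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Split one large insert into sequential smaller inserts.
// Assumption: chunkSize counts UTF-16 code units, not bytes.
function chunkInsert(pos: number, text: string, chunkSize: number): Insert[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive');
  const ops: Insert[] = [];
  for (let offset = 0; offset < text.length; offset += chunkSize) {
    // Each chunk goes right after the previous one, so its position advances.
    ops.push({
      type: 'insert',
      pos: pos + offset,
      text: text.slice(offset, offset + chunkSize),
    });
  }
  return ops;
}

const pasted = new Array(2501).join('x'); // 2500 characters
const chunks = chunkInsert(1, pasted, 1024);
const result = chunks.reduce<string>((d, op) => applyInsert(d, op), '[]');
console.log(chunks.length, result.length); // 3 2502
```

Applying the chunks in order produces the same document as the single large insert, provided no concurrent op is interleaved; if one is, each chunk is transformed like any other op.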
Long-running offline (a week)
The op log on the server may have millions of operations since the client went offline. The client doesn't try to replay everything — it just fetches the latest snapshot and treats its own unsynced ops as edits against that snapshot. Some convergence guarantees relax in this case; users rarely hit it in practice.
Cross-region writes
Each doc has a home region. Cross-region clients pay the round-trip cost (~80–100ms US to EU depending on routing). Google does not try to do multi-region writes per doc — the consistency cost isn't worth it. Read replicas can serve viewers, but writes always go to the home region.
Things to discuss in an interview
- Why not last-write-wins or diff/merge — concrete examples of data loss.
- OT vs CRDT — name both, pick OT-server for Google scale; mention CRDTs for new builds.
- Jupiter model — central server, scalar version counters, transformation against concurrent ops.
- Per-doc sharding — single-threaded OT engine, hand-off on failover.
- Separation of doc state and presence — convergent vs ephemeral channels.
- Offline support — client op buffer, transform-on-reconnect.
Things you should now be able to answer
- Why does naive last-write-wins lose data, with a concrete example?
- What does OT's transform(a, b) function actually do?
- Why does Google use a central server even though peer-to-peer OT exists?
- How do snapshots and op logs combine to make doc loads fast and durable?
- Why is cursor state on a different channel than document ops?
- What's the trade-off if you'd start fresh today — OT or CRDT?
Further reading
- "High-Latency, Low-Bandwidth Windowing in the Jupiter Collaboration System" (Nichols et al., 1995) — the OT paper Google built on
- "Operational Transformation in Real-Time Group Editors" (Sun & Ellis, 1998)
- "A commutative replicated data type for cooperative editing" (Preguiça et al., 2009) — early CRDT
- Yjs documentation and source — modern CRDT for text
- "How Figma's multiplayer technology works" — Figma engineering blog (explicitly rejected OT; uses a custom last-writer-wins + CRDT-inspired property system on a centralized server)
- "Towards a Unified Theory of Operational Transformation and CRDT" (Raph Levien)
Frequently asked questions
▸What is Operational Transformation and why does Google Docs use it instead of last-write-wins or diff-merge?
Operational Transformation encodes every keystroke as a versioned operation (insert, delete, format) and defines a transform(a, b) function that adjusts concurrent ops so every client converges to the same final document regardless of network ordering. Last-write-wins silently destroys one user's work; diff-merge produces garbled output when two users edit the same region with no way to detect the corruption. OT is the only primitive that gives convergence guarantees for simultaneous text editing.
▸What is the Jupiter model and why does Google use a central server instead of peer-to-peer OT?
The Jupiter model routes every op through a single authoritative server that serializes concurrent operations, transforms each incoming op against everything accepted since the client's base version, and broadcasts the canonical result. Peer-to-peer OT requires satisfying the TP2 property across all clients, which is notoriously error-prone and has produced subtle correctness bugs in most implementations. Google has run the Jupiter-model OT server without incident since 2009.
▸OT vs CRDTs: when should you choose one over the other for a collaborative editor?
Jupiter-model OT is the proven choice for Google-scale systems with a central server; it has run at billions of edits per day for 17+ years. CRDTs like Yjs give peer-to-peer merge and offline-first convergence without a central referee, but carry real per-character metadata overhead — Automerge 2.0 adds roughly 20-30% over raw text at rest, and older versions could exceed 1,000x overhead. For a fresh build, both are viable; the industry is genuinely split, with Figma using a custom last-writer-wins property system and JupyterLab using Yjs.
▸What are the latency requirements for Google Docs and how does the architecture meet them?
Local echo (user sees own keystroke) must be under 10 ms, achieved by optimistic apply on the client before the server acks. Server roundtrip ack must be under 30 ms in-region, and the collaborator must see the change within 200 ms. Cross-region peers pay an additional 80-100 ms round-trip cost; Google routes writes to a document's home region rather than attempting multi-region write consistency.
▸Why is cursor and presence data sent on a separate channel from document operations?
Cursor positions, selections, and typing indicators are ephemeral and lossy delivery is acceptable — if a cursor update drops, the next one arrives within about 100 ms. Routing presence through the OT engine would waste convergence math on state that does not need it. The separate pub/sub channel keeps the single-threaded doc shard focused on the order-sensitive writes that actually require transformation.