
◆◆◆ Advanced · asked at OpenAI, Meta, Hugging Face, Together AI, Databricks

Design an LLM Fine-Tuning Platform

Turn a base model and a dataset into a deployed fine-tuned adapter at scale — the end-to-end platform covering dataset ingestion, LoRA/QLoRA/DPO training, fault-tolerant distributed GPU scheduling, eval gating, and multi-LoRA serving for hundreds of concurrent fine-tunes.

37 min read · 2026-06-25 · Ironclad Academy

#interview #ai #llm #fine-tuning #training #mlops


The problem

In 2023, Bloomberg trained BloombergGPT, a 50B-parameter model pre-trained from scratch on a corpus of roughly 700 billion tokens, about half of it financial text. Outside estimates put the compute bill in the millions of dollars. Around the same time, researchers at Stanford showed with Alpaca that fine-tuning Llama on 52,000 instruction-following examples cost about $600 in total (roughly $500 in API calls to generate the data plus under $100 of training compute) and produced a model that behaved comparably to OpenAI's text-davinci-003 in the authors' preliminary human evaluation. The LIMA paper followed: 1,000 carefully curated examples matched models trained on 52,000 noisier ones. The conclusion was hard to avoid: for most tasks, fine-tuning a commodity base model is vastly more economical than pre-training from scratch, and data quality matters far more than data quantity.

This created a new class of infrastructure problem. OpenAI launched a fine-tuning API for GPT-3.5 and GPT-4o-mini. Hugging Face built AutoTrain. Together AI, Predibase (LoRAX), Anyscale, and Databricks (via the MosaicML acquisition) all offer managed fine-tuning. Internally, Meta maintains the Llama cookbook and torchtune library. Axolotl has become the de facto open-source harness. All of them solve the same core problem: take a base model and a labeled dataset, adapt the model's parameters to a specific task, and deploy the result — reliably, reproducibly, and without requiring every user to manage distributed training infrastructure.

The tensions this creates are real. GPU memory is finite and models are large — a LLaMA-2 7B model in FP16 mixed precision requires roughly 112 GB just for weights, gradients, and optimizer state before accounting for activations (16 bytes/param × 7B params). Spot GPU instances can disappear with two minutes' notice. Fine-tuned models have lineage: if you update the base model, every adapter trained against the old weights potentially degrades silently. Safety alignment erodes when you fine-tune on narrow domain data. And multi-tenant economics only work if you can serve hundreds of adapters without spawning hundreds of model replicas.

This article covers the platform layer — from dataset upload to deployed adapter. The internals of the serving system (KV cache, continuous batching, PagedAttention) live in Design an LLM Inference & Serving System. The job scheduling infrastructure shares patterns with Design a Distributed Job Scheduler. The eval gating system connects to Design an LLM Eval Platform. Dataset versioning and lineage tracking overlap with concepts in Design a Feature Store.

Functional requirements

Accept training datasets in JSONL, Parquet, or Hugging Face Hub format; validate schema, deduplicate, check for benchmark contamination.
Support fine-tuning methods: supervised fine-tuning (SFT), LoRA, QLoRA, DPO, and continued pre-training (CPT).
Schedule training jobs on a heterogeneous GPU pool (A100, H100); support spot/preemptible instances with automatic resume.
Track per-run metrics (loss, gradient norm, throughput) and hyperparameters; support experiment comparison.
Gate adapter promotion with automated eval: benchmark regression, task holdout, safety scan, LLM-as-judge pairwise comparison.
Register versioned adapters with full lineage (base model SHA + dataset version + hyperparams + eval results).
Serve fine-tuned adapters alongside a shared base model with dynamic hot-swap; serve full fine-tunes as standalone replicas.

Non-functional requirements

Job scheduling latency: a submitted job should begin provisioning within 60 seconds of queue entry.
Fault tolerance: spot preemption must not lose more than 500 training steps of work.
Training throughput: maintain >60% GPU utilization for standard SFT jobs, and track MFU (model FLOPs utilization) alongside it, since a GPU can report high utilization while achieving a much lower MFU.
Storage durability: dataset artifacts and model checkpoints must survive with 99.999999999% durability (S3-class).
Lineage completeness: every adapter in the registry must trace to an immutable dataset version and base model SHA.
Eval gating SLA: eval pipeline completes within 30 minutes of training completion for standard 7B models.

Capacity estimation

Dimension	Estimate	How we got there
Concurrent training jobs	50 peak	500 tenants × 2 jobs/week → 1,000 jobs/week; 6 concurrent avg during active scheduling windows (low-duration jobs); 8× burst-to-avg factor at peak business hours
GPU pool size	~240 A100 80GB + on-demand H100s	50 concurrent jobs × 4 GPUs avg per job = 200 GPUs; 20% scheduling-slack buffer → 240 A100s; 70B+ jobs spill to on-demand H100 pool
Pool cost (spot)	~$360–550/hr (A100 base)	240 × $1.49–2.29/hr spot A100 rate (mid-2025 post-44% AWS price cut); H100 on-demand jobs billed separately at $1.99–3.89/hr
Dataset storage	~26 TB/year	1,000 jobs/week × 500 MB avg dataset × 52 weeks = 26 TB
Checkpoint storage	~200 TB/year	1,000 jobs/week × 10 checkpoints/job × ~0.4 GB per LoRA checkpoint (adapter weights ~150 MB + Adam states ~200 MB ≈ 350–400 MB) × 52 weeks
Adapter registry	~2.6 TB/year	1,000 completions/week × 50 MB avg LoRA adapter × 52 weeks = 2.6 TB
Training memory for 7B LoRA	~28 GB	Weights (14 GB FP16) + adapter params + gradients (small, ~200 MB) + Adam states (~400 MB) + activations at seq_len 2048 (~14 GB); fits on one A100 80GB with headroom
Training memory for 70B full FT	~1,120 GB	16 bytes/param × 70B params = 1,120 GB; requires 16× H100 80GB with ZeRO-3 (~70 GB/GPU)
Typical LoRA job cost	$3–$80	7B model: ~$3–$9 spot ($1.49–2.29/hr × 2–4h × 1 GPU), up to ~$15 on-demand ($3.75/hr × 4h); 70B: ~$30–$80 (2–4 H100s × $1.99/hr spot × 8–10h)

Takeaway: LoRA is the workhorse — it fits a 7B fine-tune onto a single A100, costs under $15 on-demand (or $3–$9 on spot), and produces an adapter small enough to serve thousands of variants from one shared base model. Full fine-tuning of a 70B model requires 16 H100s and burns several thousand dollars — reserved for CPT or cases where the quality gap from LoRA is measurable and material. Full fine-tune replicas range from ~14 GB (7B FP16) to ~140 GB (70B FP16) per serving copy.
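The capacity arithmetic can be replayed in a few lines. This is a minimal sketch that uses only the inputs stated in the table (500 tenants, 2 jobs/week, 4 GPUs per job, 20% slack, decimal terabytes); the function names are illustrative and not part of any platform API.

```python
# Back-of-envelope capacity model using the inputs from the table above.
JOBS_PER_WEEK = 500 * 2          # 500 tenants x 2 jobs/week
WEEKS_PER_YEAR = 52


def gpu_pool(concurrent_jobs=50, gpus_per_job=4, slack=0.20):
    """GPUs needed for peak concurrency plus a scheduling-slack buffer."""
    return round(concurrent_jobs * gpus_per_job * (1 + slack))


def yearly_storage_tb(per_job_gb):
    """Yearly storage in TB (decimal units: 1 TB = 1000 GB)."""
    return JOBS_PER_WEEK * WEEKS_PER_YEAR * per_job_gb / 1000


pool = gpu_pool()                                  # A100 count for the spot pool
spot_cost_per_hr = (pool * 1.49, pool * 2.29)      # low/high spot rate, $/hr
datasets_tb = yearly_storage_tb(0.5)               # 500 MB per job
checkpoints_tb = yearly_storage_tb(10 * 0.4)       # 10 checkpoints x 0.4 GB
adapters_tb = yearly_storage_tb(0.05)              # 50 MB per adapter
```

Keeping the model in code makes it easy to re-run when an input changes, for example doubling jobs per tenant or raising the checkpoint count.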

Building up to the design

V1: Script on a single machine. The naive version is a Python script: load the dataset, call transformers.Trainer, save the weights to disk. This works for a proof of concept. It breaks immediately in production when: the model doesn't fit in one GPU's memory (a 13B model in FP16 needs 26 GB just for weights, plus 7x that, about 182 GB, for training states: FP16 gradients, FP32 master copy, Adam m, and Adam v); the machine crashes mid-run and you lose everything; two engineers try to reproduce each other's results but used different random seeds and tokenizer versions; and you need to serve the fine-tuned model to users who also need the base model.

V2: Add LoRA and multi-GPU support. The Hu et al. LoRA paper (ICLR 2022) showed you can freeze the base model weights and inject trainable low-rank matrices into each attention and MLP projection. For a 7B model with rank-8 LoRA applied to all linear layers, trainable parameters drop from 7B to about 20M — a 350x reduction. Now a single A100 80GB can run the job. Multi-GPU support via PyTorch FSDP or DeepSpeed ZeRO lets you distribute the remaining memory pressure. This solves the memory problem but not the reliability or reproducibility problems.

V3: Add checkpointing and a job scheduler. Register a SIGTERM signal handler that saves a checkpoint immediately when cloud preemption arrives (spot instances give a 2-minute warning). Write checkpoints to object storage every 500 steps: model state + optimizer state + RNG state + data loader offset so training resumes exactly where it left off. Wrap this in a job scheduler that tracks job state transitions, retries on failure, and routes jobs to available GPU capacity. Now the system is reliable, but there is still no data validation, no experiment tracking, no eval gating, and no serving integration.

flowchart TD
    A["V1: single-GPU script"] -->|OOM on 13B model| B["V2: LoRA + FSDP/ZeRO"]
    B -->|crash loses the job| C["V3: checkpointing + scheduler"]
    C -->|silent data quality bugs| D["V4: ingestion validation pipeline"]
    D -->|no way to compare runs| E["V5: experiment tracking + eval gating"]
    E -->|100 adapters need 100 replicas| F["V6: multi-LoRA adapter registry + hot-swap serving"]
    style A fill:#ff6b1a,color:#0a0a0f
    style B fill:#0e7490,color:#fff
    style C fill:#0e7490,color:#fff
    style D fill:#15803d,color:#fff
    style E fill:#a855f7,color:#fff
    style F fill:#ffaa00,color:#0a0a0f

V4: Dataset ingestion validation. The moment you open the platform to external users, data quality becomes the primary source of failures. A single wrong role label in the JSONL (e.g., "role": "system" appearing after "role": "assistant") causes the chat template to produce malformed training sequences. Near-duplicate examples overweight a narrow slice of the data and make training loss look better than it is without adding information. Including MMLU or HellaSwag questions in training data inflates benchmark scores without improving the model. You need a validation pipeline: schema checks, token count validation, deduplication (exact hashing plus MinHash for near-duplicates), benchmark contamination scan (n-gram overlap with eval splits), and a length histogram to catch distributions with many outlier-length examples.

V5: Experiment tracking and eval gating. Without a systematic way to compare runs, practitioners adjust hyperparameters blindly. Log every training step's loss, gradient norm, learning rate, GPU utilization, and tokens/sec to a time-series store (MLflow or W&B). Store hyperparameter configs immutably linked to each run. After training, run an automated eval suite: MMLU regression to catch catastrophic forgetting (flag if MMLU drops more than 2–3 percentage points), task-specific holdout with domain metrics, safety/toxicity scan, and LLM-as-judge pairwise comparison for chat-facing models. Only adapters that pass all gates get promoted to the registry.

V6: Multi-LoRA registry and serving. Fine-tuned adapters need to be served, but spawning a dedicated 7B model replica per adapter wastes enormous GPU capacity. The key insight from Punica (Chen et al., MLSys 2024): a custom SGMV (Segmented Gather Matrix-Vector) kernel can batch forward passes through different LoRA adapters in a single operation. When request A uses adapter_1 and request B uses adapter_2, the SGMV kernel dispatches both LoRA matmuls together as a batched operation rather than two sequential operations; the paper reports 12x higher throughput than existing serving systems on multi-LoRA workloads. vLLM integrated Punica-style kernels for multi-LoRA serving, and Predibase built LoRAX around the same idea. The serving system loads adapters on-demand from object storage, caches hot adapters in GPU memory (HBM), and evicts cold ones to CPU DRAM. The full design is in Design an LLM Inference & Serving System; this article covers the training platform that produces those adapters.

API

The platform exposes a REST API for job management and dataset operations.

# Submit a fine-tuning job
POST /v1/fine-tuning/jobs
{
  "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "training_file": "dataset_v3_abc123",   // versioned dataset artifact ID
  "validation_file": "dataset_v3_val_abc123",
  "method": "lora",                        // lora | qlora | sft | dpo | cpt
  "hyperparameters": {
    "n_epochs": 1,
    "learning_rate_multiplier": 1.0,       // scaled relative to platform default
    "batch_size": "auto",
    "lora_rank": 32,
    "lora_alpha": 64,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                             "gate_proj", "up_proj", "down_proj"]
  },
  "eval_config": {
    "run_mmlu_regression": true,
    "task_eval_file": "eval_holdout_abc123",
    "safety_scan": true
  }
}

# Response
{
  "job_id": "ftjob-xyz789",
  "status": "queued",
  "created_at": 1750800000,
  "estimated_start": 1750800060,
  "base_model": "meta-llama/Meta-Llama-3.1-8B-Instruct@sha256:a1b2c3...",
  "training_file_version": "dataset_v3_abc123@version:7"
}

# Check job status
GET /v1/fine-tuning/jobs/{job_id}
# Returns: status (queued|provisioning|running|checkpointing|preempted|evaluating|succeeded|failed)
#          + step, loss, tokens_per_sec, estimated_completion, checkpoint_path

# List fine-tuned models in the registry
GET /v1/models?type=fine-tuned&base_model=meta-llama/Meta-Llama-3.1-8B-Instruct
# Returns adapter metadata: artifact_uri, base_model_sha, dataset_version,
#                           eval_metrics, stage (dev|staging|production|archived)

# Deploy adapter to serving
POST /v1/models/{model_id}/deploy
{ "stage": "production", "serving_pool": "multi-lora-8b" }

The schema

The platform manages several core entities. The following represents the key tables in a relational store (PostgreSQL), with large blobs in object storage.

-- Dataset artifacts produced by the ingestion pipeline
CREATE TABLE dataset_artifacts (
    id            TEXT PRIMARY KEY,           -- content-hash-based ID
    name          TEXT NOT NULL,
    version       INT  NOT NULL,
    format        TEXT NOT NULL,              -- jsonl | parquet | hf_hub
    row_count     INT,
    token_count   BIGINT,
    dedup_removed INT,
    contamination_flagged BOOLEAN DEFAULT FALSE,
    s3_uri        TEXT NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT NOW()
);

-- Fine-tuning job lifecycle
CREATE TABLE training_jobs (
    job_id          TEXT PRIMARY KEY,
    tenant_id       TEXT NOT NULL,
    base_model      TEXT NOT NULL,
    base_model_sha  TEXT NOT NULL,            -- pinned SHA; critical for lineage
    dataset_id      TEXT REFERENCES dataset_artifacts(id),
    method          TEXT NOT NULL,            -- lora | qlora | sft | dpo | cpt
    hyperparams     JSONB NOT NULL,
    status          TEXT NOT NULL DEFAULT 'queued',
    gpu_type        TEXT,
    gpu_count       INT,
    spot_instance   BOOLEAN DEFAULT TRUE,
    step_current    INT DEFAULT 0,
    loss_current    FLOAT,
    throughput_tps  FLOAT,                    -- tokens/sec
    latest_checkpoint_uri TEXT,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ
);

-- Versioned model/adapter registry
CREATE TABLE model_artifacts (
    model_id        TEXT PRIMARY KEY,
    tenant_id       TEXT NOT NULL,
    job_id          TEXT REFERENCES training_jobs(job_id),
    base_model      TEXT NOT NULL,
    base_model_sha  TEXT NOT NULL,
    dataset_id      TEXT REFERENCES dataset_artifacts(id),
    artifact_uri    TEXT NOT NULL,            -- S3 path to adapter weights or full model
    artifact_type   TEXT NOT NULL,            -- lora_adapter | full_model
    lora_rank       INT,
    lora_alpha      FLOAT,
    lora_modules    TEXT[],
    stage           TEXT NOT NULL DEFAULT 'dev',  -- dev|staging|production|archived
    eval_metrics    JSONB,                    -- {mmlu_delta, task_f1, safety_pass}
    eval_passed     BOOLEAN,
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    promoted_at     TIMESTAMPTZ
);

-- Per-run experiment metrics (time-series; also mirrored to MLflow/W&B)
CREATE TABLE run_metrics (
    job_id    TEXT REFERENCES training_jobs(job_id),
    step      INT,
    loss      FLOAT,
    grad_norm FLOAT,
    lr        FLOAT,
    gpu_util  FLOAT,
    tps       FLOAT,                          -- tokens per second
    recorded_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (job_id, step)
);

Architecture

The production system has five major subsystems: ingestion, job scheduling, the training cluster, eval gating, and the serving fleet. Here is the full flowchart, followed by a sequence diagram of the hot path (a single training step and checkpoint cycle).

flowchart LR
    subgraph CLIENT["Clients"]
        SDK[SDK / UI]
    end
    subgraph INGEST["Ingestion Service"]
        UP[Upload Handler] --> VAL[Schema + Token Validator]
        VAL --> DD[MinHash Deduper]
        DD --> CC[Contamination Checker]
        CC --> PACK[Tokenizer and Packer]
        PACK --> OBJ[(Object Storage<br/>S3 / GCS)]
    end
    subgraph SCHED["Job Scheduler"]
        API2[Fine-Tune API] --> PQ[Priority Queue<br/>Redis Sorted Set]
        PQ --> PROV[GPU Provisioner]
        PROV --> SPOT[Spot Fleet<br/>A100 / H100]
        PROV --> OD[On-Demand Reserve]
    end
    subgraph TRAIN["Training Cluster"]
        SPOT --> RANK0[Rank-0 Coordinator]
        RANK0 --> WORKERS["Worker Ranks 1..N<br/>FSDP / ZeRO-3"]
        WORKERS --> CKPT[(Checkpoint Store<br/>S3)]
        WORKERS --> EXP[Experiment Tracker<br/>MLflow]
        RANK0 --> CKPT
    end
    subgraph EVAL["Eval Gating"]
        EAPI[Eval Trigger] --> MMLU[MMLU Regression]
        MMLU --> TASK[Task Holdout Eval]
        TASK --> SAFE[Safety Scanner]
        SAFE --> GATE{Pass?}
        GATE -->|yes| REG[(Model Registry<br/>Postgres + S3)]
        GATE -->|no| NOTIF[Notify Tenant]
    end
    subgraph SERVE["Serving Fleet"]
        REG --> LOADER[Adapter Loader]
        LOADER --> BASE[Base Model Replica<br/>vLLM]
        BASE --> SGMV[SGMV Multi-LoRA Kernel]
        SGMV --> RESP[Token Stream]
    end
    SDK --> UP
    SDK --> API2
    OBJ --> RANK0
    RANK0 -->|weights| EAPI
    style VAL fill:#0e7490,color:#fff
    style PACK fill:#0e7490,color:#fff
    style PQ fill:#ff6b1a,color:#0a0a0f
    style PROV fill:#ff6b1a,color:#0a0a0f
    style WORKERS fill:#15803d,color:#fff
    style CKPT fill:#ff2e88,color:#fff
    style GATE fill:#a855f7,color:#fff
    style REG fill:#15803d,color:#fff
    style SGMV fill:#ffaa00,color:#0a0a0f

The hot path — a single training step with spot preemption handling:

sequenceDiagram
    participant C as "Coordinator Rank-0"
    participant W as "Workers Rank-1..N"
    participant CS as "Checkpoint Store S3"
    participant ET as "Experiment Tracker"
    participant VM as "VM Manager Spot Fleet"

    C->>W: Broadcast micro-batch shard
    W->>W: Forward pass (activations checkpointed)
    W->>W: Backward pass (recompute activations)
    W->>W: ZeRO all-reduce gradients across ranks
    W->>W: Optimizer step (update sharded params)
    W-->>ET: Log step metrics (loss, grad_norm, tps)
    C->>C: step mod 500 equals 0?
    C->>CS: Write checkpoint (weights + opt state + RNG + data offset)
    VM-->>C: SIGTERM (spot preemption, T-120s)
    C->>W: Broadcast checkpoint-now signal
    W->>CS: Flush current shard to checkpoint
    C->>CS: Write checkpoint with in-progress flag
    Note over C,CS: Job re-queued; resumes from latest checkpoint

Dataset ingestion and curation

The ingestion pipeline is the first place quality is enforced, and it is cheaper to catch problems here than after 12 hours of GPU time.

Schema validation. Every example must conform to the model's expected chat format. For Llama 3.1, that means a messages array with role values from {system, user, assistant} in a valid sequence: a user turn must precede each assistant turn. Token count is validated against the configured context limit (for example 131,072 tokens, which is both Llama 3.1's context window and the limit cited for Together AI's fine-tuning endpoint). Examples exceeding the limit are truncated or rejected based on policy. The validation step also applies the model's actual tokenizer (not a generic tokenizer) so token counts are accurate.
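A minimal sketch of the role-sequence check, assuming a stricter policy than the text requires: an optional system turn only in first position, strict user/assistant alternation, and a final assistant turn. The function name and the error strings are hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of problems found in one chat example (empty list = valid)."""
    if not messages:
        return ["empty messages array"]
    errors = []
    expected = "user"  # next non-system role we expect to see
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"message {i}: unknown role {role!r}")
            continue
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: empty or non-string content")
        if role == "system":
            if i != 0:
                errors.append(f"message {i}: system turn must come first")
            continue
        if role != expected:
            errors.append(f"message {i}: expected {expected!r}, got {role!r}")
        expected = "assistant" if role == "user" else "user"
    if messages[-1].get("role") != "assistant":
        errors.append("example must end with an assistant turn")
    return errors
```

Returning every problem rather than raising on the first one lets the ingestion report show a tenant all the issues in a file in one pass.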

Deduplication. Exact duplicates are caught with SHA-256 hashing of the normalized text. Near-duplicates (examples that are 90%+ similar, often produced by synthetic data generation pipelines) are caught with MinHash LSH over character 5-grams. The LIMA paper showed that a model trained on Alpaca's 52,000 noisier examples could be matched by one trained on 1,000 high-quality ones; near-duplicate inflation is one of the main mechanisms that makes large, nominally high-quality datasets worse than their size suggests.
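The two dedup layers can be sketched with the standard library alone. This is an illustration under assumptions, not a production implementation: normalization is lowercasing plus whitespace collapsing, signatures use 128 universal hash functions, and signatures are compared pairwise instead of through LSH banding, which a real pipeline would add to avoid the quadratic comparison.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1
_rng = random.Random(0)  # fixed seed so signatures are reproducible
_PERMS = [(_rng.randrange(1, _PRIME), _rng.randrange(0, _PRIME)) for _ in range(128)]


def normalize(text):
    return " ".join(text.lower().split())


def exact_key(text):
    """SHA-256 of the normalized text, used to drop exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=5):
    """Set of character k-grams of the normalized text."""
    t = normalize(text)
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


def minhash(text):
    """128-value MinHash signature over character 5-grams."""
    base = [int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
            for s in shingles(text)]
    return [min((a * h + b) % _PRIME for h in base) for a, b in _PERMS]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A pair whose estimated similarity exceeds the configured threshold (the text uses 90%) would be collapsed to a single example.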

Benchmark contamination check. If the training dataset contains n-gram overlaps with standard evaluation benchmarks (MMLU, HellaSwag, ARC, TruthfulQA), post-training benchmark scores become meaningless. The contamination checker computes 13-gram overlap between every training example and each benchmark's test split. Examples with overlap above a threshold (typically >50% of 13-grams matching) are flagged and, depending on policy, either removed or escalated for human review.
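A rough sketch of the 13-gram overlap check follows. It assumes whitespace tokenization and lowercasing (a real checker might use the model tokenizer or stronger normalization), and the helper names are hypothetical. The 0.5 default mirrors the threshold quoted above.

```python
def ngrams(text, n=13):
    """Set of word-level n-grams; empty if the text has fewer than n tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n=13):
    """Union of all n-grams found in the benchmark test split."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def contamination_ratio(example, index, n=13):
    """Share of the example's n-grams that also occur in the benchmark index."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0  # shorter than n tokens: nothing to compare
    return len(grams & index) / len(grams)


def is_contaminated(example, index, threshold=0.5, n=13):
    return contamination_ratio(example, index, n) > threshold
```

Building the index once per benchmark and reusing it across uploads keeps the per-example cost to a set intersection.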

Tokenization and packing. After validation, each example is tokenized using the exact tokenizer the base model was trained with — critical because LLaMA-2 and LLaMA-3 use different vocabularies and different special tokens. The chat template is applied: tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False). Loss masks are constructed so only completion (assistant turn) tokens contribute to the loss. Prompt tokens are masked out with -100 (PyTorch's ignore_index).
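The loss-mask construction can be illustrated without a tokenizer. This sketch assumes the conversation has already been rendered through the chat template and split into per-role token spans; the `segments` structure is a hypothetical intermediate format, not something the tokenizer returns directly.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """segments: list of (role, token_ids) in conversation order.

    Returns (input_ids, labels) where only assistant tokens keep their ids
    in labels; everything else is masked so it does not contribute to loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels
```

For example, a three-token user prompt followed by a two-token assistant reply yields labels of three masked positions followed by the two reply token ids.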

Short examples are then packed into fixed-length windows (e.g., 4096 tokens) to eliminate padding waste. The critical implementation requirement: attention must not cross example boundaries. With Flash Attention 2 that means calling the variable-length kernel (flash_attn_varlen_func, which takes cumulative sequence lengths, cu_seqlens) with per-example position IDs; the equivalent with a dense mask is a block-diagonal causal mask that treats each example as independent. Without boundary-aware attention, token A from example 1 attends to token B from example 2, causing silent contamination. IBM Research (Kundu et al., 2024) measured up to 92.22% padding tokens on a typical SFT batch at 4096 context; packing nearly eliminates this.
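Packing with boundary metadata can be sketched as follows. The greedy first-fit strategy is an assumption chosen for brevity (sorted or best-fit packing wastes less space), and the dictionary layout is hypothetical. The cumulative offsets are the kind of boundary information a variable-length attention kernel consumes.

```python
def pack_examples(examples, window=4096):
    """Greedy first-fit packing of token-id lists into windows of `window` tokens."""
    bins = []
    for ex in examples:
        if len(ex) > window:
            raise ValueError("example longer than the packing window")
        for b in bins:
            if b["length"] + len(ex) <= window:
                break
        else:  # no existing window has room: open a new one
            b = {"examples": [], "length": 0}
            bins.append(b)
        b["examples"].append(ex)
        b["length"] += len(ex)
    return [_finalize(b["examples"]) for b in bins]


def _finalize(examples):
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for ex in examples:
        input_ids.extend(ex)
        position_ids.extend(range(len(ex)))          # positions restart per example
        cu_seqlens.append(cu_seqlens[-1] + len(ex))  # cumulative boundary offsets
    return {"input_ids": input_ids, "position_ids": position_ids,
            "cu_seqlens": cu_seqlens}
```

Restarting position IDs at each boundary keeps rotary embeddings consistent with what the model would see if each example were fed alone.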

Training methods: LoRA, QLoRA, SFT, DPO

Memory math first. Mixed-precision training of a 7B model requires: 2 bytes/param for FP16 weights (14 GB), 2 bytes/param for FP16 gradients (14 GB), 4 bytes/param for FP32 master copy (28 GB), 4 bytes/param for Adam first moment (28 GB), 4 bytes/param for Adam second moment (28 GB). Total: 16 bytes/param × 7B = 112 GB. Plus activations at seq_len 1024: roughly 13.7 GB. Activation memory grows faster than linearly with sequence length because of the attention term: at seq_len 4096 activations alone reach ~73 GB, which is why gradient checkpointing is non-negotiable for long-context training. A single A100 80GB cannot fit this for a full fine-tune without optimization.
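The 16 bytes/param budget is easy to encode so it can be reused for other model sizes. A minimal sketch, using decimal gigabytes and excluding activations; the names are illustrative.

```python
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def full_finetune_state_gb(n_params):
    """Weights, gradients, and optimizer state in GB (1 GB = 1e9 bytes)."""
    return {name: n_params * b / 1e9 for name, b in BYTES_PER_PARAM.items()}


breakdown_7b = full_finetune_state_gb(7e9)
total_7b = sum(breakdown_7b.values())
total_70b = sum(full_finetune_state_gb(70e9).values())
```

The same function reproduces the 70B figure used in the capacity table.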

LoRA (Hu et al., ICLR 2022) sidesteps this by freezing all base model weights and injecting trainable low-rank matrices at each projection. For a weight matrix W ∈ R^{d×k}, LoRA trains matrices B ∈ R^{d×r} and A ∈ R^{r×k} where r << min(d,k), replacing W with W + (α/r)·BA during forward passes, where α (lora_alpha) is a separate scaling hyperparameter. For a 7B model with r=8 applied to all linear layers (q, k, v, o, gate, up, down projections), trainable parameters are roughly 20.3M instead of 7B — a 345x reduction. Gradients and optimizer states only accumulate for those 20.3M parameters, dropping memory from 112 GB to roughly 14 GB for weights + 1 GB for adapter training states. A single A100 80GB handles it comfortably.
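The update rule W + (α/r)·BA can be shown on toy matrices. This is a dependency-free sketch for illustration only; a real implementation would use a tensor library. It follows the LoRA paper's convention of initializing B to zero so the adapter starts as a no-op.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha / r) * x A^T B^T for row-vector inputs x.

    W is d x k (frozen), B is d x r and A is r x k (trainable).
    """
    base = matmul(x, transpose(W))
    delta = matmul(matmul(x, transpose(A)), transpose(B))
    scale = alpha / r
    return [[b_ + scale * d_ for b_, d_ in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


W = [[1, 0, 0], [0, 1, 0]]      # d=2, k=3
A = [[1, 1, 1]]                 # r=1
x = [[1, 2, 3]]
y_init = lora_forward(x, W, A, [[0], [0]], alpha=2, r=1)   # B = 0: base output only
y_tuned = lora_forward(x, W, A, [[1], [0]], alpha=2, r=1)  # nonzero B shifts row 0
```

Computing x·Aᵀ first keeps the intermediate at width r, which is why the adapter path stays cheap even when d and k are large.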

The rank r is the primary quality lever. r=8 is a common default; r=16 is sufficient for style and tone adaptation. r=32 is the standard for general SFT on instruction-following datasets. r=64 is worth considering for complex coding tasks or multi-turn conversational fine-tunes. A well-established heuristic: set alpha = 2×r. One empirical result that surprises most practitioners: applying LoRA to ALL linear layers (including MLP projections, not just attention) matters more than increasing rank — at r=8 it raises trainable parameters from 4.2M (q+v only) to 20.3M (all 7 linear modules) and measurably improves quality at the same rank level.
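The parameter counts can be checked from the layer shapes. The sketch below uses the Llama-2 7B dimensions (hidden size 4096, MLP width 11008, 32 layers, vocabulary 32000). Note that the seven per-layer projections alone give about 20.0M at r=8; the 20.3M figure is reproduced when the output projection (lm_head) is also adapted, which is an inference from the arithmetic and not something the text states.

```python
def lora_params(shapes, r):
    """Trainable parameters for LoRA on matrices of shape (d_out, d_in):
    each adapted matrix adds B (d_out x r) and A (r x d_in)."""
    return sum(r * (d_out + d_in) for d_out, d_in in shapes)


HIDDEN, INTERMEDIATE, LAYERS, VOCAB = 4096, 11008, 32, 32000  # Llama-2 7B

attention = {name: (HIDDEN, HIDDEN)
             for name in ("q_proj", "k_proj", "v_proj", "o_proj")}
mlp = {
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

qv_only = LAYERS * lora_params([attention["q_proj"], attention["v_proj"]], r=8)
all_linear = LAYERS * lora_params(list(attention.values()) + list(mlp.values()), r=8)
with_lm_head = all_linear + lora_params([(VOCAB, HIDDEN)], r=8)  # assumption: head adapted too
```

Because the count is linear in r, doubling the rank doubles the adapter size, while adding the MLP projections at fixed rank raises it by roughly 4.8x over q+v only.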

QLoRA (Dettmers et al., 2023) adds NF4 4-bit quantization to the frozen base model, reducing total VRAM by roughly 33% relative to BF16 LoRA (the 4-bit quantization applies only to the frozen base weights; adapter weights and optimizer states remain in BF16, so the end-to-end saving is less than the 4× compression on weights alone would suggest). A Guanaco-65B model trained with QLoRA for 24 hours on a single 48GB GPU reached 99.3% of ChatGPT performance on the Vicuna benchmark. The cost: roughly 50% longer training time compared to BF16 LoRA (for a 7B model: 14.18 GB VRAM in 2.79h vs LoRA's 21.33 GB in 1.85h, per Raschka's benchmarks; 2.79/1.85 ≈ 1.51 and 14.18/21.33 ≈ 0.66). The memory tradeoff is significant when multiple jobs share a GPU or when the model would otherwise require renting a second GPU.

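Critical QLoRA compatibility note: QLoRA has historically not worked with ZeRO-3 out of the box. ZeRO-3 shards the frozen quantized weights across GPUs, which conflicts with how bitsandbytes stores and dequantizes 4-bit weights. The conservative production fix is ZeRO-1 or ZeRO-2 with QLoRA, or FSDP with custom quantization wrappers. Newer bitsandbytes and PEFT releases added support for sharding 4-bit weights under FSDP and ZeRO-3, so check the behavior of the exact versions you pin before relying on either path.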

Full fine-tuning (method sft in the API above) uses the same 16 bytes/param formula. At 7B it requires distributing across multiple GPUs; at 70B it requires 16+ H100s. Full fine-tuning is justified for continued pre-training (CPT) where the model must absorb new vocabulary, factual associations, and long-range statistical patterns that LoRA's rank-limited updates cannot capture. NVIDIA's ChipNeMo project ran domain-adaptive pre-training of a 13B chip design assistant on 23.1 billion domain tokens (roughly 1.8 tokens per parameter) before SFT, and used full fine-tuning for CPT because LoRA cannot adequately absorb new vocabulary.

DPO (Rafailov et al., NeurIPS 2023) alignment training follows SFT. The dataset format changes from (prompt, completion) pairs to (prompt, chosen, rejected) triples. The DPO loss is:

L_DPO = -E[(prompt,chosen,rejected)] [log σ(β · (log π_θ(chosen|prompt) - log π_ref(chosen|prompt))
                                        - β · (log π_θ(rejected|prompt) - log π_ref(rejected|prompt)))]

where π_ref is the frozen SFT checkpoint. The key operational insight: the reference model log-probs log π_ref(chosen|prompt) and log π_ref(rejected|prompt) can be precomputed offline once and stored as dataset features. This turns DPO training into effectively a single-model operation: you keep only the trainable policy in GPU memory and load reference log-probs from disk, rather than maintaining a live reference model during training. PPO keeps four live model copies (policy, reference, reward, value) and requires a rollout loop with an LLM generating trajectories, which is strictly harder infrastructure. For most chat and instruction-following tasks, DPO matches PPO quality at far lower operational complexity. See the Together AI platform for a concrete implementation: they support SFT + DPO in a single job specification, with LN-DPO and SimPO variants available.
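The loss above, with reference log-probs read from precomputed dataset columns, can be sketched in plain Python. Inputs are per-example sequence log-probabilities (summed over completion tokens); β = 0.1 is an assumed default, not a value from the text, and a real trainer would compute this on tensors with autograd.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss over a batch of sequence log-probabilities.

    ref_chosen / ref_rejected are the precomputed reference log-probs.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        margin = beta * ((pc - rc) - (pr - rr))
        # -log(sigmoid(margin)), written in a numerically stable form
        if margin > 0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin + math.log1p(math.exp(margin)))
    return sum(losses) / len(losses)
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy raises the chosen completion's log-prob relative to the rejected one.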

Distributed training infrastructure

Parallelism taxonomy. Four complementary strategies exist:

Data parallelism (DP) replicates the full model on each GPU, shards the training batch, and all-reduces gradients after each backward pass. It only works if the model fits in one GPU. Throughput scales linearly with GPU count as long as the all-reduce communication cost stays below the compute cost of a step.

Fully Sharded Data Parallelism (FSDP) / ZeRO shards parameters, gradients, and optimizer states across GPUs. DeepSpeed ZeRO has three stages. ZeRO-1 shards optimizer states only: on 8 GPUs this cuts per-GPU training memory from 16 bytes/param to about 5.5 bytes/param (2 + 2 + 12/8, a ~2.9× total reduction; the 12 bytes/param of optimizer state, meaning the FP32 master copy plus Adam m and v, shrink 8×, but FP16 weights and gradients stay replicated). ZeRO-2 also shards gradients (about 3.75 bytes/param, a ~4.3× total reduction on 8 GPUs). ZeRO-3 shards parameters too, achieving an 8× per-GPU reduction on 8 GPUs (112 GB → 14 GB/GPU for a 7B full fine-tune) and in general an N_d× reduction, where N_d is the data-parallel degree. The cost: ZeRO-3 moves roughly 1.5x the communication volume of plain data parallelism, because parameters are all-gathered during both the forward and the backward pass. For LoRA fine-tuning, ZeRO-1 or ZeRO-2 is usually sufficient because only the adapter parameters (20–200M) carry gradient and optimizer state; the frozen base model weights need neither.
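Per-GPU memory for each ZeRO stage can be computed directly. This sketch follows the accounting in the ZeRO paper, where the 12 bytes/param of optimizer state include the FP32 master copy along with Adam m and v; under that accounting, stage 1 on 8 GPUs comes to 5.5 bytes/param. Activations and communication buffers are ignored.

```python
def zero_bytes_per_param(stage, n_gpus):
    """Per-GPU bytes/param for mixed-precision Adam under ZeRO.

    2 B fp16 params + 2 B fp16 grads + 12 B optimizer state
    (fp32 master copy, Adam m, Adam v). stage=0 means plain data parallelism.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_gpus     # ZeRO-3: also shard parameters
    return params + grads + optim


def per_gpu_gb(n_params, stage, n_gpus):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes)."""
    return n_params * zero_bytes_per_param(stage, n_gpus) / 1e9
```

For a 7B full fine-tune on 8 GPUs this gives 112 GB, 38.5 GB, 26.25 GB, and 14 GB per GPU for stages 0 through 3.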

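One notorious gotcha: initializing Falcon 40B on 8 GPUs with ZeRO-1/2 can require more than 1.5 TB of CPU RAM, because without sharded initialization the full model is loaded onto every rank before partitioning. The fix is to enable sharded init (the zero_init=1 setting referenced here; the exact flag name depends on the launcher, for example DeepSpeed's deepspeed.zero.Init context or Accelerate's zero3_init_flag, both of which target ZeRO-3). Always confirm it is enabled for large models.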

Tensor parallelism (TP) shards individual weight matrices column-wise across GPUs. Every GPU participates in every matrix multiply, with an all-reduce after each matmul. This requires NVLink-class bandwidth (>400 GB/s) because the all-reduce happens at every layer. TP reduces per-GPU memory proportionally to the TP degree but introduces all-reduce latency on the critical path — only justified when you need low latency and have NVLink-connected GPUs. For fine-tuning (latency-insensitive, throughput-bound), DP/FSDP is almost always preferred over TP.

Pipeline parallelism (PP) assigns whole transformer layers to different devices. Activations flow forward through the pipeline; only inter-stage activations cross device boundaries. PP tolerates slower inter-node Ethernet because communication volume is lower (one activation tensor per stage boundary rather than all-reduces over full weight matrices). The downside: pipeline bubbles waste GPU cycles while the pipeline fills at the start of each batch and drains at the end; splitting the batch into more micro-batches shrinks the bubble but does not eliminate it. For fine-tuning jobs, the most common configuration is FSDP across all GPUs on a node, with pipeline parallelism across nodes when the model spans multiple nodes.

Flash Attention. Standard scaled dot-product attention materializes the full N×N attention matrix in GPU HBM, requiring O(N²) memory. Flash Attention (Dao et al., 2022) fuses the attention computation into a single kernel that tiles the work through on-chip SRAM, computing the softmax block by block and never writing the N×N matrix to HBM, so extra memory is O(N). Flash Attention v1 achieves 124 TFLOPS on A100; Flash Attention v2 reaches 220+ TFLOPS (versus the A100's theoretical peak of 312 TFLOPS in BF16). This ~77% speedup from v1 to v2, (220−124)÷124, comes from better work partitioning across warps and reduced shared memory reads. Flash Attention 2 is effectively mandatory for any training job with context length above 2048 tokens.

Gradient checkpointing. Activation memory scales as O(batch × layers × seq_len × d_model) with a quadratic attention term. At seq_len=4096 for LLaMA-7B (32 layers, d_model=4096), activation memory reaches roughly 73 GB without checkpointing, nearly the entire 80 GB of an A100 or H100. Gradient checkpointing discards activations after each layer's forward pass and recomputes them during the backward pass. The ~20% compute overhead is almost always worth it for sequence lengths above 1024 tokens.

GPU scheduling and fault tolerance

The scheduler maintains a priority queue (implemented as a Redis sorted set, with priority score derived from tenant tier + queue age to prevent starvation). When a job is dequeued, the provisioner checks the GPU pool for available capacity. It prefers spot instances (60–80% cost savings) and falls back to on-demand reserved instances when spot capacity is unavailable or when the job is a high-priority production run.
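One way to combine tenant tier and queue age into a single static score is to treat a higher tier as an earlier arrival time. The sketch below uses an in-process heap as a stand-in for the Redis sorted set (lowest score dequeues first); the tier names and head-start values are hypothetical.

```python
import heapq
import itertools

# Hypothetical tiers: seconds of "head start" a tier gets over the free tier.
TIER_HEAD_START_S = {"enterprise": 3600, "pro": 600, "free": 0}


class JobQueue:
    """In-process stand-in for a sorted set: lowest score dequeues first."""

    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # FIFO order among equal scores

    def enqueue(self, job_id, tier, enqueued_at):
        # Score is fixed at enqueue time, so no periodic re-scoring is needed.
        score = enqueued_at - TIER_HEAD_START_S[tier]
        heapq.heappush(self._heap, (score, next(self._tiebreak), job_id))

    def dequeue(self):
        _score, _order, job_id = heapq.heappop(self._heap)
        return job_id
```

Because the head start is bounded, a free-tier job that has waited longer than the largest head start always outranks newly submitted higher-tier jobs, which is the starvation guarantee the score is meant to provide.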

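Spot instance preemption is the primary reliability challenge. AWS posts a two-minute interruption notice before reclaiming a spot instance; the notice arrives through instance metadata and EventBridge, and a node-level termination handler forwards it to the training process as SIGTERM. The training script registers a handler: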

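import signal
import threading

# Set by the handler; checked at step boundaries so a checkpoint is never
# written from the middle of an optimizer update.
preempted = threading.Event()

def sigterm_handler(signum, frame):
    preempted.set()

signal.signal(signal.SIGTERM, sigterm_handler)

def train(trainer, num_steps):  # trainer.training_step() is a hypothetical hook
    for _ in range(num_steps):
        trainer.training_step()
        if preempted.is_set():
            trainer.save_checkpoint(emergency=True)  # in_progress=True marker; scheduler re-queues
            raise SystemExit(0)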

Normal checkpoints write every 500 steps (roughly every 10–20 minutes for typical batch sizes). The checkpoint bundle contains: adapter weights (or full model weights), optimizer states (Adam m and v tensors), RNG state (for reproducibility), and the data loader offset (which dataset shard and example index training was at). TorchElastic (torchrun) handles node count changes: if a replacement spot instance has different characteristics or if only partial capacity is restored, it re-runs rendezvous, reassigns ranks, and restarts the worker processes from the latest checkpoint without the job having to be resubmitted.
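Why the RNG state and data offset belong in the bundle is easiest to see in a toy loop. In this sketch "training" just draws random numbers; a real bundle would also carry model and optimizer tensors, and the field names are hypothetical.

```python
import pickle
import random


def run(steps, checkpoint=None):
    """Toy training loop. Returns (outputs, checkpoint_bytes)."""
    rng = random.Random(1234)
    step, offset = 0, 0
    if checkpoint is not None:
        state = pickle.loads(checkpoint)
        rng.setstate(state["rng_state"])          # resume the exact random stream
        step, offset = state["step"], state["data_offset"]
    outputs = []
    for _ in range(steps):
        outputs.append((offset, rng.random()))    # stands in for one optimizer step
        step += 1
        offset += 1                               # next example index to read
    blob = pickle.dumps({"step": step, "data_offset": offset,
                         "rng_state": rng.getstate()})
    return outputs, blob


uninterrupted, _ = run(10)
first_part, ckpt = run(6)                  # preempted after 6 steps
second_part, _ = run(4, checkpoint=ckpt)   # resumed on a new instance
```

If either the RNG state or the offset were dropped from the bundle, the resumed run would replay or skip data and diverge from the uninterrupted one.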

Databricks Mosaic AI Watchdog (the system managing MPT-7B training on 440 A100s over 9.5 days) auto-detected four hardware failures and resumed automatically. The key principle: the scheduler must distinguish between failed (a real training error — e.g., NaN loss, dataset exhaustion) and preempted (eviction by the cloud provider). Only preempted jobs are automatically re-queued; failed jobs require human review because automatic re-queuing on a NaN-loss job wastes GPU time.
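The failed versus preempted distinction can be expressed as a small classifier in the scheduler. The input signals (exit code, presence of the in-progress checkpoint marker, a NaN-loss flag from the run logs) are hypothetical names for whatever the scheduler actually collects.

```python
from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PREEMPTED = "preempted"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


def classify_exit(exit_code, in_progress_marker, saw_nan_loss):
    """Map a worker exit to a job status."""
    if saw_nan_loss:
        return JobStatus.FAILED           # a real training error
    if in_progress_marker:
        return JobStatus.PREEMPTED        # emergency checkpoint was written
    return JobStatus.SUCCEEDED if exit_code == 0 else JobStatus.FAILED


def next_action(status):
    """Only preempted jobs are re-queued automatically."""
    actions = {JobStatus.PREEMPTED: "requeue", JobStatus.FAILED: "hold_for_review"}
    return actions.get(status, "none")
```

Checking the NaN flag before the marker matters: a job that diverged and was then preempted should still be held for review instead of burning more GPU time.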

Experiment tracking and hyperparameter management

Every training run logs the following to the experiment tracker (MLflow or W&B):

Per-step: loss (training and validation), gradient norm, learning rate (for learning rate schedule visualization), GPU utilization, tokens/sec throughput.
Per-checkpoint: eval metrics on the validation split (perplexity, task-specific metrics if configured).
Hyperparameters: immutably stored at job creation — base model SHA, LoRA rank, alpha, target modules, learning rate schedule, batch size, packing enabled, loss mask config.

The hyperparameter config is hashed and stored with the run, so two runs with identical configs can be detected as duplicates without re-running. Sweep support allows grid or random search over key hyperparameters: common axes are learning rate (3e-5, 1e-4, 3e-4), LoRA rank (16, 32, 64), and number of epochs (1, 2, 3 — where 1 epoch is optimal more often than practitioners expect for datasets above 10k examples).
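
Duplicate detection and grid expansion can be sketched with the standard library alone. The field names and the placeholder SHA below are assumptions for illustration; the sweep axes are the ones listed above.

```python
import hashlib
import itertools
import json


def config_hash(config):
    """SHA-256 of the canonical JSON form, so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def grid(base, axes):
    """Expand a base config into one config per point of the grid."""
    names = sorted(axes)
    return [
        {**base, **dict(zip(names, values))}
        for values in itertools.product(*(axes[n] for n in names))
    ]


# Hypothetical base config; the SHA is a placeholder, not a real revision.
base = {"base_model_sha": "<pinned-sha>", "lora_alpha": 16, "packing": True}
axes = {
    "learning_rate": [3e-5, 1e-4, 3e-4],
    "lora_rank": [16, 32, 64],
    "epochs": [1, 2, 3],
}
runs = grid(base, axes)
seen = {config_hash(run) for run in runs}
```

Hashing the canonical form rather than the raw request body is the design choice that matters: two submissions that differ only in key order or whitespace map to the same run.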

Eval gating before promotion

Promoting a model that performs worse than the baseline on general benchmarks is worse than not training at all, because users who deploy it to production inherit a regression. Eval gating is the automated quality gate that prevents this.

Stage 1: Benchmark regression. Run MMLU (57 subjects, 5-shot), HellaSwag (10-shot), and ARC-Challenge (25-shot) on the fine-tuned model. Compare against the stored baseline scores for the same base model version. Block promotion if any benchmark drops more than 2–3 percentage points from baseline. This catches catastrophic forgetting — a common failure when training data is narrow and epochs are too many.

Stage 2: Task holdout eval. The tenant provides a holdout evaluation set at job submission time. After training, compute task-specific metrics (F1, accuracy, BLEU, or LLM-as-judge) on this set. The promotion threshold is configurable per-job; default is a relative improvement of ≥5% over the base model on the task-specific metric.

Stage 3: Safety scan. Generate responses from the fine-tuned model on a fixed set of adversarial prompts covering the harmful content taxonomy, and score them with a safety classifier (Llama Guard or equivalent). A fine-tuned model that refuses fewer than 95% of clearly harmful prompts is blocked from production promotion regardless of task performance. The motivating evidence is the Qi et al. (2023) finding that fine-tuning on as few as 10 adversarial examples, at a cost under $0.20, can erase 99% of safety refusals: safety degradation is a real attack surface, not a theoretical concern.

Stage 4: LLM-as-judge pairwise comparison. For models targeting chat or instruction-following use cases, run pairwise comparisons between the fine-tuned model and the baseline using an independent LLM judge (GPT-4o or a specialized judge model). Present prompts from the task distribution and ask the judge which response is preferred. This catches regressions that benchmark numbers miss — tone degradation, verbosity increases, sycophantic tendencies.
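
The numeric gates in stages 1 to 3 reduce to a few comparisons, sketched below. The function name, argument names, and example scores are made up for illustration; the thresholds are the defaults described above, with the benchmark tolerance set to the upper end of the 2–3 point range. Stage 4 is omitted because the text gives no numeric threshold for it.

```python
def gate_decision(baseline, candidate, task_base, task_tuned, refusal_rate,
                  max_drop_pts=3.0, min_rel_gain=0.05, min_refusal=0.95):
    """Apply stages 1-3 and return (promote, reasons).

    Benchmark scores are percentages; task metrics must be positive.
    """
    reasons = []
    # Stage 1: block if any general benchmark regresses past the tolerance.
    for name, base_score in baseline.items():
        drop = base_score - candidate[name]
        if drop > max_drop_pts:
            reasons.append(f"{name} dropped {drop:.1f} points")
    # Stage 2: require a relative gain on the tenant's holdout metric.
    rel_gain = (task_tuned - task_base) / task_base
    if rel_gain < min_rel_gain:
        reasons.append(f"task gain {rel_gain:.1%} is below {min_rel_gain:.0%}")
    # Stage 3: require a minimum refusal rate on clearly harmful prompts.
    if refusal_rate < min_refusal:
        reasons.append(f"refusal rate {refusal_rate:.1%} is below {min_refusal:.0%}")
    return (not reasons, reasons)


# Made-up scores: the task metric improves but MMLU falls by 4 points.
baseline = {"mmlu": 65.0, "hellaswag": 80.0, "arc_challenge": 55.0}
candidate = {"mmlu": 61.0, "hellaswag": 79.5, "arc_challenge": 54.0}
promote, reasons = gate_decision(baseline, candidate, 0.70, 0.80, 0.97)
```

Returning the list of reasons alongside the decision lets the platform show the tenant exactly which gate blocked promotion.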

The eval gating pipeline connects to Design an LLM Eval Platform for the detailed implementation of automated eval infrastructure.

Model and adapter registry

Every artifact in the registry records the full provenance chain: base model + base model SHA (pinned), training dataset + dataset version hash, hyperparameter config hash, training run ID, and eval metrics at promotion time. This is not optional — it is the core contract that makes the registry trustworthy.

The SHA pinning problem. If you update the base model from Llama-3.1-8B v1.0 to a refreshed v1.1 (e.g., a safety patch), every LoRA adapter registered against v1.0 becomes suspect. The adapter weight matrices were trained to modify the specific weight values in v1.0. Applied to v1.1 with different weight distributions, the adapter's LoRA deltas may produce degraded or unpredictable output. Production fix: pin the base model SHA in every adapter record. Block base model updates to the serving fleet until all dependent adapters have been re-evaluated against the new base. Databricks uses Unity Catalog for this lineage tracking; other platforms use MLflow Model Registry with custom tags.
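
A minimal sketch of SHA pinning as a schema constraint, using an in-memory SQLite database. The table and column names, the SHA strings, and the adapter names are hypothetical; a production registry would sit on whatever store the platform already uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE base_models (sha TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE adapters (
    adapter_id     TEXT PRIMARY KEY,
    base_model_sha TEXT NOT NULL REFERENCES base_models(sha)
);
CREATE TABLE adapter_evals (
    adapter_id     TEXT NOT NULL REFERENCES adapters(adapter_id),
    base_model_sha TEXT NOT NULL REFERENCES base_models(sha),
    passed         INTEGER NOT NULL,
    PRIMARY KEY (adapter_id, base_model_sha)
);
""")


def blocking_adapters(conn, new_sha):
    """Adapters with no passing eval recorded against the new base SHA."""
    rows = conn.execute(
        """
        SELECT a.adapter_id FROM adapters a
        LEFT JOIN adapter_evals e
          ON e.adapter_id = a.adapter_id
         AND e.base_model_sha = ?
         AND e.passed = 1
        WHERE e.adapter_id IS NULL
        ORDER BY a.adapter_id
        """,
        (new_sha,),
    ).fetchall()
    return [row[0] for row in rows]


conn.executemany("INSERT INTO base_models VALUES (?, ?)",
                 [("sha-v1.0", "Llama-3.1-8B"), ("sha-v1.1", "Llama-3.1-8B")])
conn.executemany("INSERT INTO adapters VALUES (?, ?)",
                 [("support-bot", "sha-v1.0"), ("legal-summarizer", "sha-v1.0")])
conn.execute("INSERT INTO adapter_evals VALUES ('support-bot', 'sha-v1.1', 1)")
blockers = blocking_adapters(conn, "sha-v1.1")
```

The fleet update to the new base is allowed only when `blocking_adapters` returns an empty list; here one adapter has not yet been re-evaluated, so the rollout stays blocked.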

Stage transitions. Adapters move through stages: dev (just trained, not yet evaluated) → staging (passed eval gates, available for testing) → production (actively serving traffic) → archived (replaced by a newer version). Promotion from staging to production requires explicit approval — either automated (if eval scores exceed a high-confidence threshold) or human-in-the-loop (for high-stakes models). Rollback means pointing the serving fleet back to a previous production adapter, which takes seconds via the hot-swap mechanism.
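
The stage lifecycle can be enforced with a small transition table. This is a sketch under stated assumptions: the `archived` to `production` edge models rollback to a previous version, and the `gates_passed` and `approved` flags are hypothetical inputs supplied by the eval pipeline and the approval step.

```python
ALLOWED = {
    "dev": {"staging"},
    "staging": {"production"},
    "production": {"archived"},
    "archived": {"production"},  # assumption: rollback re-promotes an archived version
}


def transition(current, target, gates_passed=False, approved=False):
    """Return the new stage, or raise if the move is not permitted."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    if target == "staging" and not gates_passed:
        raise PermissionError("eval gates have not passed")
    if target == "production" and not approved:
        raise PermissionError("promotion to production needs approval")
    return target


stage = transition("dev", "staging", gates_passed=True)
stage = transition(stage, "production", approved=True)
```

Keeping the table explicit makes it impossible to skip staging, which is the property the eval gates depend on.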

Serving fine-tuned models

Full fine-tuned models are served as standalone model replicas — identical to serving any other model, just with different weights. The cost is proportional to model size.

LoRA adapters are served differently. Predibase's LoRAX and vLLM's multi-LoRA support both build on the SGMV (Segmented Gather Matrix-Vector Multiplication) kernel from Punica (Chen et al., MLSys 2024). The mechanics: for a batch of N requests using M different adapters, a single fused kernel dispatches all adapter matmuls as one segmented operation rather than M sequential operations. This is the same principle as batched matrix multiplication but for heterogeneous adapter weights: requests that share an adapter are grouped into a segment, and each segment's A and B matrices are gathered for efficient GPU computation. The Punica paper reports 12x higher throughput than the state-of-the-art LLM serving systems it compared against when serving multiple LoRA models.
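
The data flow of a segmented adapter application can be shown in plain Python. This is a reference sketch of the semantics only, not the fused GPU kernel: it computes y_i = B_a (A_a x_i) for each request i with adapter a, and the function names and toy matrices are illustrative.

```python
from itertools import groupby


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def segmented_lora(xs, adapter_ids, adapters):
    """Apply each request's LoRA delta, one contiguous segment per adapter.

    Requests are ordered so that those sharing an adapter are adjacent.
    Each adapter's A and B are then gathered once per segment rather than
    once per request. Returns the outputs in the original request order
    plus the number of segments processed.
    """
    order = sorted(range(len(xs)), key=lambda i: adapter_ids[i])
    out = [None] * len(xs)
    segments = 0
    for adapter_id, members in groupby(order, key=lambda i: adapter_ids[i]):
        a_mat, b_mat = adapters[adapter_id]  # gathered once for the segment
        for i in members:
            out[i] = matvec(b_mat, matvec(a_mat, xs[i]))
        segments += 1
    return out, segments


# Toy rank-1 adapters on a 2-dimensional hidden state: A is 1x2, B is 2x1.
adapters = {
    "a": ([[1, 0]], [[2], [0]]),
    "b": ([[0, 1]], [[0], [3]]),
}
xs = [[1, 2], [3, 4], [5, 6]]
deltas, segments = segmented_lora(xs, ["a", "b", "a"], adapters)
```

Three requests over two adapters become two segments; on a GPU the same grouping is what lets one kernel launch cover the whole batch.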

Adapter loading is dynamic: when a request arrives for an adapter not currently in GPU memory (HBM), the serving system loads the adapter from object storage (typically 10–100 MB per LoRA adapter vs 14–140 GB for a full model in FP16) via an asynchronous prefetch mechanism, caches it in GPU memory if space permits, and evicts cold adapters to CPU DRAM using an LRU policy. For Predibase LoRAX, thousands of adapters can be "live" in the sense that their weights are cached in a DRAM pool and loaded to the GPU within milliseconds on the first request.
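
The caching policy can be sketched as a two-tier LRU. This is a simplified, synchronous illustration: the class name, the slot-count capacity model, and the `fetch` callback standing in for an object-storage download are all assumptions, and a real server would size tiers in bytes and prefetch asynchronously.

```python
from collections import OrderedDict


class AdapterCache:
    """Two-tier LRU: a small GPU tier backed by a larger host-DRAM tier."""

    def __init__(self, gpu_slots, fetch):
        self.gpu = OrderedDict()  # adapter_id -> weights, most recently used last
        self.dram = {}            # adapters evicted from the GPU tier
        self.gpu_slots = gpu_slots
        self.fetch = fetch        # hypothetical loader from object storage

    def get(self, adapter_id):
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)  # refresh recency on a hit
            return self.gpu[adapter_id]
        weights = self.dram.pop(adapter_id, None)
        if weights is None:
            weights = self.fetch(adapter_id)  # cold miss: go to object storage
        if len(self.gpu) >= self.gpu_slots:
            cold_id, cold_weights = self.gpu.popitem(last=False)
            self.dram[cold_id] = cold_weights  # demote the coldest adapter
        self.gpu[adapter_id] = weights
        return weights


fetches = []


def fake_fetch(adapter_id):
    fetches.append(adapter_id)
    return f"weights:{adapter_id}"


cache = AdapterCache(gpu_slots=2, fetch=fake_fetch)
for adapter_id in ["a", "b", "a", "c"]:
    cache.get(adapter_id)
```

After this sequence, adapter `b` has been demoted to the DRAM tier, so a later request for it is served from host memory without another object-storage fetch.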

The serving integration is referenced in Design an LLM Inference & Serving System — specifically the multi-LoRA and prefix caching sections. The fine-tuning platform's responsibility ends at registering the adapter artifact with verified eval metrics; the serving infrastructure picks it up from there.

Cost math

For an interview, being able to reason about GPU costs is as important as knowing the architecture.

A 7B LoRA fine-tune on Together AI or a self-hosted A100: 2–4 hours on one A100 at $1.49–2.29/hr spot = $3–$9 at spot rates; up to ~$15 at on-demand rates (~$3.75/hr × 4h). This covers datasets up to 100k examples at standard sequence lengths.

A 70B LoRA fine-tune: 8–10 hours on 2–4 H100s at $1.99/hr spot = $30–$80 total.

A 70B full fine-tune (SFT): 300–1000 wall-clock hours on 16–32 H100s at $1.99 per GPU-hour spot = $10,000–$64,000 total at spot rates. On-demand equivalents run $25,000–$160,000, so this is where spot instances and checkpointing save real money (a 60% discount saves $15,000–$96,000 on large runs). Eight H100s are insufficient: 8 × 80 GB = 640 GB < 1,120 GB required (16 bytes/param × 70B params), even with ZeRO-3 sharding state across all eight GPUs.
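
The arithmetic behind these figures is worth being able to reproduce on a whiteboard. The sketch below uses only the rates and durations quoted above; the helper names are illustrative.

```python
import math


def run_cost(hours, gpus, rate_per_gpu_hour):
    """Cost of a run: wall-clock hours x GPU count x hourly rate per GPU."""
    return hours * gpus * rate_per_gpu_hour


def min_gpus(params_billion, bytes_per_param=16, gpu_gb=80):
    """Smallest GPU count whose pooled memory holds weights, gradients,
    and optimizer state (activations excluded)."""
    return math.ceil(params_billion * bytes_per_param / gpu_gb)


SPOT_H100 = 1.99  # dollars per GPU-hour, as quoted above

lora_7b = (run_cost(2, 1, 1.49), run_cost(4, 1, 2.29))
lora_70b = (run_cost(8, 2, SPOT_H100), run_cost(10, 4, SPOT_H100))
full_70b = (run_cost(300, 16, SPOT_H100), run_cost(1000, 32, SPOT_H100))
gpus_needed_70b = min_gpus(70)
```

The memory bound gives 14 GPUs for a 70B full fine-tune before activations, which is why the estimate starts at 16: two full 8-GPU nodes.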

For CPT (continued pre-training) at scale, numbers get large fast. ChipNeMo (13B, 23B domain tokens) cost roughly $150,000–$200,000 using H100 clusters. Meta's Llama 3.1 405B pretraining used 16,000 H100s for months — not a fine-tuning problem anymore.

The operational implication: at 7–13B scale, LoRA is the obvious economic choice. At 70B+ full fine-tune, the training cost alone forces careful ROI analysis. The serving economics reinforce this: an A100 running a shared multi-LoRA setup can serve hundreds of fine-tuned 7B variants simultaneously; 100 full fine-tune replicas would need 100× the GPU capacity.

Edge cases & gotchas

Cross-example attention leakage from packing without boundary masks. When short examples are concatenated and attention is computed over the packed sequence without per-example boundaries (a block-diagonal 2D mask, or the sequence offsets that Flash Attention's variable-length interface takes), tokens from example B attend to example A. Training loss improves because the model gets "free" signal from adjacent examples, but eval performance degrades because the model learned cross-example dependencies that don't exist at inference time. This is fixed in Hugging Face Transformers 4.44+ with DataCollatorWithFlattening, but older implementations and custom training loops frequently get this wrong. Validate with a packing sanity check: run the same examples through the model one per batch and packed together, and verify that the per-example losses match to within numerical tolerance.
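
What a boundary-aware mask looks like is easiest to see on a tiny packed sequence. The sketch below builds the block-diagonal causal mask and the per-example position ids in plain Python; the function names are illustrative and no attention library is involved.

```python
def packed_causal_mask(lengths):
    """mask[i][j] is True when token i may attend to token j.

    A token sees only positions at or before itself inside its own example.
    """
    doc_of = [doc for doc, n in enumerate(lengths) for _ in range(n)]
    total = len(doc_of)
    return [
        [doc_of[i] == doc_of[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


def naive_causal_mask(total):
    """Plain causal mask that ignores example boundaries."""
    return [[j <= i for j in range(total)] for i in range(total)]


def packed_position_ids(lengths):
    """Positions restart at 0 at the start of every packed example."""
    return [pos for n in lengths for pos in range(n)]


lengths = [2, 3]  # example A has 2 tokens, example B has 3
mask = packed_causal_mask(lengths)
naive = naive_causal_mask(sum(lengths))
# Count attention edges that the naive mask allows across the A/B boundary.
leaks = sum(
    naive[i][j] and not mask[i][j]
    for i in range(len(naive))
    for j in range(len(naive))
)
positions = packed_position_ids(lengths)
```

With the naive mask every one of B's three tokens can see both of A's tokens, which is exactly the leakage described above.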

QLoRA + ZeRO-3 incompatibility. QLoRA uses bitsandbytes to quantize frozen base weights to 4-bit. ZeRO-3 tries to shard those quantized weights across ranks. The interaction is not supported and, depending on the bitsandbytes version, causes cryptic CUDA errors or silently incorrect gradients. Use ZeRO-1 or ZeRO-2 with QLoRA, or PyTorch FSDP with a custom quantization wrapper.

ZeRO-1/2 CPU RAM explosion during init. Falcon 40B on 8 GPUs with ZeRO-1 requires >1.5 TB CPU RAM during initialization unless zero_init: 1 is set. Most cloud instances have 256–512 GB RAM. The failure is an OOM during model loading before any training step. Always set zero_init: 1 in the DeepSpeed config for models above 13B parameters.

Chat template mismatch. Llama 2, Llama 3, and Mistral use different special tokens and template structures. Fine-tuning Llama 3.1 with the Llama 2 chat template produces a model that generates garbled output because it was trained on malformed instruction boundaries. Always call tokenizer.apply_chat_template() with the tokenizer matched to the exact base model checkpoint, and validate by decoding three sample examples and manually verifying the formatting before starting a run.

Base model SHA drift. Deploying a base model security patch without re-evaluating all mounted LoRA adapters causes silent quality regressions. The adapter deltas were optimized against the old weight matrices; applied to updated weights, they produce behavior neither the original model nor the fine-tuned version would exhibit. Mitigation: enforce SHA pinning in the registry schema as a NOT NULL foreign key constraint, and block base model updates to the serving fleet via a registry check that alerts on dependent un-evaluated adapters.

Multi-epoch overfitting on moderate datasets. Training 2+ epochs on 50k Alpaca-style examples consistently degrades performance in practice. Single epoch is optimal for datasets above ~10k examples; more epochs cause the model to memorize specific phrasings rather than generalizing. The counter-intuitive recommendation: use more data instead of more epochs.

Loss masking on prompt tokens. Computing cross-entropy loss over the full concatenated sequence (prompt + completion) trains the model to predict prompt tokens. On formatted chat datasets this means the model learns to reproduce system prompt and user message templates, which wastes gradient steps and can cause the model to prefix its outputs with partial prompt templates. DataCollatorForCompletionOnlyLM in HF TRL handles this but requires correct response_template strings to identify the completion boundary.

Reward hacking in RLHF/PPO. Gao et al. (2022) showed that under RL the gold reward follows R(d) = d(α − β log d), where d is the square root of the KL divergence between the policy and its initialization: the proxy reward keeps rising, but the gold reward peaks and then degrades as the policy drifts too far from the reference. Without a KL penalty between policy and reference model, PPO will produce a model that scores extremely well on the reward model while generating nonsensical or sycophantic output on real prompts. LLM-as-judge evaluations compound this: positional bias causes 40–60% disagreement when the same pair of responses is presented in different orders. Mitigations: always include a KL penalty term in PPO; use position-swapped pairwise comparisons with the same judge; supplement with domain-specific reward models rather than relying on a general-purpose LLM judge.
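
The peak of that curve has a closed form, which gives a concrete stopping point for drift. Taking d as the square root of the KL divergence between the policy and its initialization, setting the derivative α − β(log d + 1) to zero gives d* = exp(α/β − 1). The sketch below checks this numerically; the α and β values are arbitrary illustrations, not fitted coefficients from the paper.

```python
import math


def gold_reward(d, alpha, beta):
    """Gold reward as a function of distance d from the initial policy."""
    return d * (alpha - beta * math.log(d))


def peak_distance(alpha, beta):
    """Distance that maximizes gold reward: alpha - beta * (log d + 1) = 0."""
    return math.exp(alpha / beta - 1.0)


alpha, beta = 1.0, 0.5  # arbitrary illustrative coefficients
d_star = peak_distance(alpha, beta)
peak = gold_reward(d_star, alpha, beta)
```

At the peak the gold reward equals β·d*, and any further optimization against the proxy moves the policy past d* and lowers the true reward, which is the argument for a KL budget.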

Spot preemption with no SIGTERM handler. If the training script doesn't register a SIGTERM handler, the 2-minute spot eviction warning is wasted: the process is killed and every step since the last periodic checkpoint, up to 20 minutes of work at the 500-step cadence, is lost. Always register the handler before the training loop starts, not inside it, so that it is in place even if setup takes a long time.

Trade-offs to discuss in an interview

LoRA rank vs quality. Higher rank means more expressive adapter but larger delta weights and slower serving. For most tasks r=32 is optimal; r=64 adds marginal quality at measurable serving overhead. The debate worth having: whether to increase rank or increase the number of target modules (applying LoRA to MLP layers in addition to attention layers), which research suggests matters more than rank alone.

DPO vs PPO. DPO wins on simplicity (2 model copies, no rollout loop, no reward model). PPO wins when reward is non-stationary, online (tool-use RL), or when you need to push alignment beyond what static preference triples capture. Most platforms in 2024–2025 defaulted to DPO; PPO is reserved for frontier alignment work (e.g., NVIDIA NeMo-Aligner at 1000+ GPUs for Nemotron 340B).

Dataset quality vs quantity. LIMA showed 1,000 high-quality examples can match 50,000 noisy ones. Implication: invest engineering effort in dataset validation (schema, dedup, contamination, quality scoring) rather than raw data acquisition. Synthetic data generation (using GPT-4 to generate instruction-answer pairs) is common and effective, but it risks inflated eval scores if the same model family generates the training data and judges the result, because LLM judges tend to favor outputs that resemble their own.

Spot vs on-demand. Spot saves 60–80% on long training runs. The operational cost: checkpoint overhead (5–10% compute), SIGTERM handler complexity, occasional run elongation from multiple preemptions. For jobs under 4 hours, on-demand is often worth the cost premium for simplicity. For 70B+ runs that take days, spot with disciplined checkpointing is almost always the right call.

Centralized vs federated training. Most platforms centralize training on a managed GPU pool. Federated fine-tuning (running on the client's private GPUs without data leaving the organization) is occasionally required for data residency compliance, but introduces significant complexity in job monitoring, checkpointing, and cluster management.

Things you should now be able to answer

A 7B model in FP16 mixed precision requires approximately how many GB for full fine-tuning before activations? What does ZeRO-3 do to per-GPU memory on an 8-GPU node?
What is the difference between LoRA rank and alpha, and why does applying LoRA to MLP layers matter more than increasing rank for most tasks?
Why does QLoRA not work with ZeRO-3, and what is the recommended configuration for multi-GPU QLoRA training?
Walk through the sequence packing pipeline: what happens if you forget to apply boundary-aware 2D attention masks?
What data must a checkpoint include for training to resume exactly where it left off after spot preemption?
Your fine-tuned model scores 3% higher on the task holdout but 4% lower on MMLU. What does this indicate, and what would you change about the training process?
Describe the SGMV kernel and explain how it enables serving thousands of LoRA adapters on one base model replica.
Why does DPO precompute reference model log-probs offline, and what does that mean for infrastructure complexity versus PPO?
A tenant updates their base model from v1.0 to a v1.1 security patch. What must the platform do before allowing traffic to route through existing LoRA adapters?
What eval gates would you require before promoting a fine-tuned model to production, and what does each gate catch?

Frequently asked questions

When should you use LoRA instead of a full fine-tune, and when does full fine-tuning still make sense?

LoRA is almost always the right default for instruction-following, style, and domain-adaptation tasks — it trains 100–1000x fewer parameters, fits on a single GPU, and the resulting adapter weights (10–100 MB) enable cheap multi-tenant serving where thousands of adapters share one base model replica. Full fine-tuning makes sense when you are doing continued pre-training on a large domain corpus (the base model needs to absorb new vocabulary and factual associations that LoRA rank-limited updates cannot capture), when you need to change the model architecture (e.g., extending the context window), or when the required adapter rank climbs so high that the delta weight size approaches a full model copy anyway.

What is DPO and why has it largely replaced PPO for alignment fine-tuning in 2024–2025?

Direct Preference Optimization (Rafailov et al., NeurIPS 2023) reformulates the RLHF objective as a classification loss directly on (prompt, chosen, rejected) preference triples. Because the DPO loss implicitly maximizes a KL-constrained policy improvement, you get alignment without running a live reward model or a PPO training loop. The practical advantage is dramatic: PPO requires four simultaneous model copies (policy, reference, reward, value) plus a rollout generation loop, while DPO needs only two (policy and reference), and only the policy has to stay in GPU memory during training if the reference model's log-probs are precomputed offline. PPO still wins when the reward signal cannot be expressed as static preference pairs: online tool-use RL, active learning reward loops, or frontier alignment beyond what static preferences capture.
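
The per-triple DPO loss is short enough to write out. The sketch below takes summed log-probabilities of each full response as plain floats; the function name, argument names, and the β default of 0.1 are illustrative choices. The reference terms are constants with respect to the policy, which is why they can be computed once ahead of training.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    where each argument is the summed log-probability of a full response.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference learned yet.
at_init = dpo_loss(-11.0, -11.0, -11.0, -11.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

At initialization the loss is log 2, and it falls as the policy widens the gap between chosen and rejected responses relative to the reference.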

How does gradient checkpointing help and what does it cost?

During a forward pass, the GPU must store intermediate activations for every layer to compute gradients in the backward pass. For a LLaMA-7B model at sequence length 4096 this reaches roughly 73 GB of activation memory alone — more than an 80 GB H100 can hold after loading the weights. Gradient checkpointing discards those activations during the forward pass and recomputes them layer-by-layer during the backward pass. The tradeoff is roughly 20% additional compute in exchange for about a 3x reduction in activation memory. For long-context training it is almost always worth enabling.

What is sequence packing and why does it matter for training throughput?

Most SFT datasets have highly variable example lengths. When you batch short examples padded to the maximum sequence length, the padding tokens contribute no gradient signal; measurements on real datasets show up to 92% padding at a 4096-token context window. Packing concatenates multiple examples end-to-end within a single fixed-length window, separated by EOS tokens, and applies loss masks so only completion tokens contribute to the loss. The critical implementation detail: you must keep per-example boundaries in the attention computation, either as a block-diagonal 2D mask or through the sequence offsets that Flash Attention 2's variable-length interface takes, to prevent tokens from one example attending to tokens from another. Without them, examples contaminate each other and training metrics look great while eval degrades silently.

How do you prevent a fine-tune from erasing the base model safety alignment?

Safety degradation from fine-tuning is real and underestimated. A Stanford/Princeton study in 2023 showed that as few as 10 adversarial examples, at a cost under $0.20, reduced GPT-3.5 Turbo's refusal rate from 100% to 1% on harmful queries. Even benign domain fine-tuning causes measurable safety regression. Mitigations layered in production: (1) run a safety/toxicity scan as a mandatory promotion gate before any adapter reaches serving; (2) augment fine-tuning data with a small fraction of alignment-preserving safety examples; (3) apply lower learning rates and fewer training steps than intuition suggests; (4) apply a KL penalty relative to the reference model when using PPO; (5) reject user-uploaded datasets that score above a toxicity threshold during ingestion validation.


// RELATED