Serve token generation for a 70B-parameter model at scale, where KV cache memory, not FLOPs, caps concurrency, and continuous batching is what separates good GPU utilization from terrible utilization.
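To make the claim that KV cache caps concurrency concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption for illustration, not something taken from the text above: a model shape of 80 layers with 8 KV heads of dimension 128 (a grouped-query-attention layout typical of 70B-class models), fp16 weights and cache, a node of 8 GPUs with 80 GB each, and 4096 tokens held per sequence. The names `ModelShape`, `kv_bytes_per_token`, and `max_concurrent_sequences` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model shape; all values below are assumptions."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_elem: int  # 2 for fp16/bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each layer stores one key and one value vector per KV head, per token.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def max_concurrent_sequences(m: ModelShape, kv_budget_bytes: int,
                             tokens_per_seq: int) -> int:
    # Upper bound on sequences that fit if each holds tokens_per_seq tokens.
    return kv_budget_bytes // (kv_bytes_per_token(m) * tokens_per_seq)


GB = 10**9

# Assumed 70B-class shape with grouped-query attention, fp16 cache.
shape = ModelShape(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

total_hbm = 8 * 80 * GB          # assumed: 8 GPUs x 80 GB
weight_bytes = 70 * 10**9 * 2    # 70B parameters in fp16

# Optimistic budget: ignores activations, fragmentation, and runtime overhead.
kv_budget = total_hbm - weight_bytes

max_seqs = max_concurrent_sequences(shape, kv_budget, tokens_per_seq=4096)

print(f"KV cache per token: {kv_bytes_per_token(shape)} bytes")
print(f"Max concurrent 4096-token sequences: {max_seqs}")
```

Under these assumptions each token costs 327,680 bytes of cache, so only a few hundred full-length sequences fit no matter how much compute is idle. That memory ceiling is why a scheduler that refills freed cache slots as soon as a sequence finishes, as continuous batching does, matters so much for utilization.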