Writing

Long-form articles on LLM inference, performance engineering, low-level development, and AI compiler work for XPUs. Written here first, cross-posted to Medium.

2025

LLM Inference Series · §00 · Apr 23, 2025

Why Running an LLM Is Harder Than It Looks

Training gets the papers. Using models gets the API tutorials. In between sits inference — a complete systems engineering discipline that is almost entirely distinct from the ML research that produced the model.

LLM Inference Series · §01 · Coming soon

Memory Mapping and How a 140GB Model Actually Loads

How safetensors and GGUF make zero-copy model loading possible, and why .pt files cannot.

More articles are in progress. New pieces are published here first, then cross-posted to Medium.

Follow on Medium →