The gap nobody talks about
Most of the conversation around large language models focuses on two things: training them and using them.
Training gets the research papers, the GPU cluster announcements, the billion-dollar compute budgets. Using them gets the API tutorials, the prompt engineering guides, the "build a chatbot in 10 lines of Python" blog posts.
In between sits a problem that is neither of those things, and it is significantly harder than it is given credit for.
Inference: taking a trained model and making it respond to real requests, at low latency, on real hardware, repeatedly, for real users.
When a researcher finishes training a model, what they have is a collection of files on a disk. There is no "run" button. The weights do not execute themselves. Getting from "files on a disk" to "system that answers questions" requires a category of engineering that is almost entirely distinct from the ML research that produced the model in the first place.
This series is about that engineering.
Unlike traditional software, downloading an LLM doesn't give you an executable file. You just get a static collection of artifacts resting on your disk: weights, configurations, and tokenizers.
To actually generate text, you must rely on an inference engine like vLLM, llama.cpp, or TensorRT-LLM to lift these massive artifacts into memory and orchestrate their execution across your hardware. And every engine handles that heavy lifting differently.
What you actually download
When you pull a model from Hugging Face, say meta-llama/Llama-3-8B, you get a directory of
files. Most people treat this as a black box. Let's open it.
Llama-3-8B/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── generation_config.json
└── model.safetensors # or split across multiple shards
These fall into three categories, each with a distinct role.
The architectural blueprint: config.json
This is the most important file in the directory, and the one people look at the least.
{
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "rope_theta": 500000.0,
  "max_position_embeddings": 8192
}
Every number here is load-bearing. num_hidden_layers: 32 means there are 32 transformer blocks
stacked on top of each other. hidden_size: 4096 means every token is represented as a
4096-dimensional vector flowing through the network. num_key_value_heads: 8 tells you this model
uses Grouped Query Attention: 8 KV heads shared across 32 query heads, a design choice that directly affects
KV cache size during inference.
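To make the KV cache claim concrete, here is a rough sketch of the arithmetic. It assumes each head has hidden_size / num_attention_heads dimensions and that the cache is stored in BF16 (2 bytes per value); the function name kv_cache_bytes_per_token is made up for this example.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # One key vector and one value vector per KV head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


config = {
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "num_key_value_heads": 8,
    "max_position_embeddings": 8192,
}

# Assumption: head size is hidden_size split evenly across the query heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped Query Attention as configured: 8 KV heads.
gqa_bytes = kv_cache_bytes_per_token(
    config["num_hidden_layers"], config["num_key_value_heads"], head_dim
)
# Hypothetical comparison: one KV head per query head (32 KV heads).
mha_bytes = kv_cache_bytes_per_token(
    config["num_hidden_layers"], config["num_attention_heads"], head_dim
)

context = config["max_position_embeddings"]
print(gqa_bytes, gqa_bytes * context / 2**30)  # 131072 1.0
print(mha_bytes, mha_bytes * context / 2**30)  # 524288 4.0
```

Under these assumptions, 8 KV heads cost 128 KiB per token (1 GiB for one request at the full 8,192-token context), while 32 KV heads would cost four times that.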
Without config.json, the weight file is incomplete. Its header records each tensor's name, shape, and dtype,
but not how many attention heads a projection is split into, which rope_theta to use, or how the computation
flows from one tensor to the next. The inference engine reads config.json first and uses it to reconstruct the
computation graph before a single weight is loaded.
Think of it this way: config.json is the blueprint. The weight file is the raw materials. You
cannot build the structure without both.
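One way to see the blueprint at work is to count the parameters it implies. The sketch below is not how any particular engine does it; it assumes a Llama-style block layout (query/key/value/output projections without biases, a gated MLP with three matrices, two normalization weight vectors per block, and separate input and output embedding matrices). Those layout details are assumptions that do not appear in config.json itself.

```python
def count_parameters(cfg):
    """Estimate the parameter count implied by a Llama-style config (assumed layout)."""
    hidden = cfg["hidden_size"]
    head_dim = hidden // cfg["num_attention_heads"]
    kv_dim = cfg["num_key_value_heads"] * head_dim

    # Attention: query and output projections are hidden x hidden,
    # key and value projections are hidden x kv_dim (smaller under GQA).
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * cfg["intermediate_size"]
    # Two normalization weight vectors per block.
    norms = 2 * hidden
    per_layer = attention + mlp + norms

    # Input embedding table plus a separate output projection (assumed untied).
    embeddings = 2 * cfg["vocab_size"] * hidden
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + embeddings + final_norm


config = {
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_key_value_heads": 8,
    "vocab_size": 128256,
}

total = count_parameters(config)
print(total)            # 8030261248
print(total * 2 / 1e9)  # size in GB at 2 bytes per parameter
```

With these assumptions the eight numbers in the config account for about 8.03 billion parameters, which is where the "8B" in the model name comes from.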
The language interface: tokenizer files
tokenizer.json and its companion files define the model's vocabulary, the mapping between text
and the integer token IDs the model actually operates on.
When you send the text "Why is inference hard?" to a model, the tokenizer converts it to something
like [4599, 374, 45478, 2107, 30] before anything computational happens. When the model generates a
response, it produces token IDs that the tokenizer converts back to text.
This matters for inference engineering for a reason that is not obvious: the vocabulary size fixes the
shape of the model's embedding table and of its output. vocab_size: 128256 means the model's final layer
produces a score for each of 128,256 possible next tokens, a 128,256-dimensional output vector for
every token generated, which is then normalized into a probability distribution. The tokenizer's design directly impacts memory allocation.
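A quick back-of-the-envelope sketch of what that vocabulary size costs. The 2-byte embedding entries match the BF16 storage discussed in this article; the 4-byte (FP32) logits and the batch size of 64 are assumptions chosen for illustration.

```python
vocab_size = 128256
hidden_size = 4096

# Input embedding table: one hidden_size vector per vocabulary entry, 2 bytes each.
embedding_bytes = vocab_size * hidden_size * 2

# Output scores: one value per vocabulary entry for each sequence at each decode step.
# Assumption: scores are held in FP32 (4 bytes) for sampling.
logits_bytes_per_sequence = vocab_size * 4


def logits_bytes(batch_size):
    """Bytes of output scores produced per decode step for a batch."""
    return batch_size * logits_bytes_per_sequence


print(embedding_bytes)            # 1050673152
print(logits_bytes_per_sequence)  # 513024
print(logits_bytes(64))           # 32833536
```

So the vocabulary alone accounts for roughly 1 GB of embedding weights, plus about half a megabyte of output scores per sequence on every decode step.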
The weights: model.safetensors
This is the bulk of what you download. For an 8B parameter model stored in BF16 (bfloat16, 2 bytes per parameter), this file is approximately 16 gigabytes.
Apart from a small header, it contains nothing but raw floating point numbers: the learned parameters of the model, serialized sequentially. Every attention weight matrix, every feed-forward layer, every embedding vector. Nothing more.
For a 70B model in BF16, this file is 140 gigabytes.
That number is the beginning of every problem inference engineering has to solve.
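The sizes above are just parameter count times bytes per parameter. A minimal sketch, using round parameter counts and a hypothetical helper named weight_gb; the int8 and int4 rows ignore the extra scale metadata that real quantized formats carry.

```python
# Bytes per parameter for a few common storage types.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params, dtype):
    """Approximate weight size in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("8B", 8e9), ("70B", 70e9)]:
    sizes = {dtype: weight_gb(params, dtype) for dtype in BYTES_PER_PARAM}
    print(name, sizes)

# 8B  at bf16 -> 16.0 GB
# 70B at bf16 -> 140.0 GB
```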
Why 140GB is a problem
Consumer GPUs, the hardware most people actually have, typically top out at around 24GB of VRAM. High-end data center GPUs like the H100 have 80GB. Even with multiple GPUs, fitting a 70B model requires careful planning.
But VRAM is not the only constraint. Consider what inference actually requires at runtime:
- The model weights themselves: static, loaded once
- The KV cache: dynamic, grows with every token generated, must persist across the entire context window
- Activations: intermediate computation results flowing through the network
- The inference engine's own memory overhead: CUDA kernels, scheduling structures, batch buffers
The weights alone fill the GPU. Everything else still needs to fit.
This is not a problem you can brute-force with more hardware. A single H100 (80GB) cannot hold a 70B BF16 model at all, and two of them (160GB combined) leave only about 20GB for the KV cache, activations, and engine overhead of every concurrent user. The engineering problem is not "get a bigger GPU." It is "rethink what it means to have a model in memory."
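The budget can be sketched in a few lines. The helper name headroom_gb is hypothetical, and the calculation deliberately ignores how weights are split across devices; it only shows how much memory is left for everything that is not weights.

```python
def headroom_gb(num_gpus, vram_per_gpu_gb, weights_gb):
    """Memory left for KV cache, activations, and engine overhead (negative means no fit)."""
    return num_gpus * vram_per_gpu_gb - weights_gb


WEIGHTS_70B_BF16_GB = 140

print(headroom_gb(1, 24, WEIGHTS_70B_BF16_GB))  # -116, one 24GB consumer GPU
print(headroom_gb(1, 80, WEIGHTS_70B_BF16_GB))  # -60, one H100
print(headroom_gb(2, 80, WEIGHTS_70B_BF16_GB))  # 20, two H100s
print(headroom_gb(4, 80, WEIGHTS_70B_BF16_GB))  # 180, four H100s
```

A negative number means the weights do not fit at all; a small positive number means they fit but leave little room for the dynamic parts of the list above.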
That rethinking produced two major solutions, memory mapping and quantization, which are the subject of the first two articles in this series.
What an inference engine actually does
Here is where most explanations stop at the wrong level of abstraction.
An inference engine is not a wrapper around the model. It is a complete system that manages five distinct problems simultaneously:
- Loading: Getting the weights from disk into the appropriate memory tier (system RAM, GPU VRAM, or both) in a way that is fast and memory-efficient. This is solved by memory mapping and quantization, covered in Parts 1 and 2.
- Compute scheduling: Deciding in what order to execute operations, on which hardware, with what precision. A transformer has two fundamentally different computational phases, prefill and decode, with opposing hardware characteristics. This is covered in Part 3.
- Memory management: Allocating and releasing the KV cache for concurrent requests without fragmentation or waste. This is the problem PagedAttention was invented to solve, covered in Part 4.
- Batching: Grouping multiple user requests together to maximize GPU utilization. Static batching, dynamic batching, and continuous batching each represent a different understanding of what "utilization" means in practice. Also Part 4.
- Serving: Exposing the inference system as a network service with appropriate APIs, handling request queuing, timeouts, streaming responses, and multi-model routing. Part 4.
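To give the batching item some shape, here is a toy simulation comparing static and continuous batching. It is a sketch under strong assumptions: every request needs one decode step per output token, prefill is ignored, and output lengths are known up front (a real engine does not know them). The function names are invented for this example.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot for the next waiting request right away."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active request emits one token; requests that just finished drop out.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [2, 8, 3, 8]  # output tokens per request (each must be at least 1)
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

In the static case the short requests sit idle in their slots until the longest request in the batch is done; in the continuous case those slots are reused immediately, which is where the saved steps come from.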
Different inference engines make different tradeoffs across these five problems. Understanding those tradeoffs requires understanding the problems themselves, which is what this series builds toward.
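As one illustration of the memory management problem in the list above, here is a toy allocator that hands out the KV cache in fixed-size blocks from a shared pool. It is only a sketch of the paging idea, not vLLM's implementation; the class name, method names, and block size are all hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size KV cache blocks from a shared pool (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size  # tokens that fit in one block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of block ids it owns
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, taking a new block only when the last one is full."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("a")  # 17 tokens need 2 blocks
for _ in range(5):
    allocator.append_token("b")  # 5 tokens need 1 block

print(len(allocator.block_tables["a"]), len(allocator.free_blocks))  # 2 1
allocator.release("a")
print(len(allocator.free_blocks))  # 3
```

Because a request only ever wastes part of its last block, memory does not have to be reserved up front for the longest possible response.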
The inference engine landscape
There are five engines you will encounter repeatedly in this space. They are not interchangeable, and the differences between them are not superficial.
- llama.cpp (C++) is the fastest-starting, most hardware-flexible engine. It runs on Apple Silicon, NVIDIA GPUs, AMD GPUs, and CPU-only. It uses memory mapping by default, GGUF format natively, and has almost zero startup overhead. The tradeoff: it is optimized for single-user local inference, not high-concurrency serving.
- vLLM (Python/CUDA) is the dominant production serving engine. It introduced PagedAttention, implements continuous batching, and handles multi-GPU tensor parallelism natively. The tradeoff: heavy Python and CUDA initialization means startup can take minutes, and it is built primarily around NVIDIA GPUs, although it also supports other accelerators such as AMD GPUs.
- SGLang (Python) is newer, designed around structured generation and complex multi-call workflows. It introduced RadixAttention, which keeps the KV cache in a radix tree so that requests with shared prefixes can reuse it. It can be faster than vLLM for certain workloads, particularly those with repeated system prompts.
- TensorRT-LLM (C++/CUDA, by NVIDIA) is the performance-ceiling engine. It compiles models into highly optimized TensorRT execution plans, enabling custom CUDA kernels and operation fusion at a level that Python-based engines cannot reach. The tradeoff: NVIDIA-only, complex to set up, models must be explicitly compiled before serving.
- TGI (Text Generation Inference) (Rust/Python, by Hugging Face) prioritizes broad model support and production deployment features. It integrates tightly with the Hugging Face ecosystem and handles model sharding across GPUs natively.
The programming language these are written in is not the primary differentiator. The differences are in how each engine handles loading, memory management, batching strategy, and hardware targeting.
Why the file format matters more than you think
One detail that connects the artifacts to the inference engines: the format of the weight file determines what loading strategies are possible.
PyTorch's original .pt format uses Python's pickle serialization. It embeds Python
object metadata, requires the Python interpreter to deserialize, and does not guarantee tensor alignment. You
cannot memory-map a .pt file and directly access tensors as raw pointers; you must fully
deserialize it first, which means a full copy into RAM before any computation begins.
Safetensors (Hugging Face) and GGUF (llama.cpp) were both designed to fix this. They store raw tensor data at aligned offsets within the file, with a lightweight header that describes each tensor's name, shape, dtype, and byte offset. An inference engine can mmap the entire file and immediately pointer-cast any tensor offset to a typed array: no deserialization of tensor data, no copy.
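The header-plus-offsets idea is small enough to sketch with the standard library. The example below writes a tiny file in the safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and reads one tensor back through mmap. The helper names are made up, only float32 is handled, and it assumes a little-endian host; real loaders hand the mapped bytes to a tensor library instead of a memoryview.

```python
import json
import mmap
import os
import struct
import tempfile


def write_tiny_safetensors(path, tensors):
    """Write float32 tensors as: header length (u64), JSON header, raw tensor bytes."""
    header, payload = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": shape,
            # Offsets are relative to the start of the data section.
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # Pad the header with spaces so the data section starts on an 8-byte boundary.
    header_bytes += b" " * (-len(header_bytes) % 8)
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.write(payload)


def read_tensor(path, name):
    """Memory-map the file and view a single tensor's bytes in place."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack_from("<Q", mm, 0)
        entry = json.loads(mm[8:8 + header_len])[name]
        data_start = 8 + header_len
        begin, end = entry["data_offsets"]
        with memoryview(mm) as whole:
            # These views point into the mapped pages; no tensor bytes are copied yet.
            with whole[data_start + begin:data_start + end] as raw, raw.cast("f") as view:
                # Copied into a list only so the result outlives the mapping.
                values = view.tolist()
        return entry["shape"], values


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.safetensors")
    write_tiny_safetensors(path, {
        "a": ([2, 2], [1.0, 2.0, 3.0, 4.0]),
        "b": ([3], [0.5, -1.5, 2.25]),
    })
    shape, values = read_tensor(path, "b")

print(shape, values)  # [3] [0.5, -1.5, 2.25]
```

Only the small JSON header is parsed; finding tensor "b" is a dictionary lookup followed by an offset into the mapped file, and tensor "a" is never touched.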
This is not a minor implementation detail. It is the design decision that makes sub-10-second model loading possible. The file format and the loading strategy are inseparable: you cannot have one without the other being designed for it.
The thread that connects everything
Every problem in inference engineering traces back to the same root tension:
Models are large. Hardware is finite. Users expect low latency.
That tension does not resolve; it only gets managed. Memory mapping manages it at load time. Quantization manages it by compressing the weights. KV cache management manages it at runtime. Continuous batching manages it across concurrent users.
Each part of this series covers one layer of that management, from the moment a weight file sits on disk to the moment a token reaches a user.
The next article starts at the beginning: how a 140GB file lands in memory on a machine that does not have 140GB to spare.