I’ve recently been studying the nano-vLLM codebase, and I’m writing this article to document my notes and learnings for future reference.
config.py
- model: path to the model weight and config files.
- max_num_batched_tokens: The maximum total number of tokens that can be processed in a single batch across all sequences.
- max_num_seqs: The maximum number of sequences (requests) that can be processed simultaneously in a batch.
- max_model_len: The maximum length (in tokens) of a single sequence.
- gpu_memory_utilization: The fraction of GPU memory to use (90% by default).
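- tensor_parallel_size: The number of GPUs (ranks) the model is sharded across with tensor parallelism.
- enforce_eager: Whether to always run in eager mode instead of replaying captured CUDA graphs.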
- hf_config: The HuggingFace model configuration object.
- eos: likely the end-of-sequence token ID.
- kvcache_block_size: The size (in tokens) of each block in the key-value cache. Must be a multiple of 256.
- num_kvcache_blocks: likely the total number of KV-cache blocks to allocate.
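To make the field list above concrete, here is a minimal sketch of such a configuration object. It is not the project's actual class: every default except the 90% memory fraction and the multiple-of-256 rule is a placeholder I picked for illustration, and the last sanity check is my own addition.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Illustrative subset of the fields listed above (hypothetical defaults)."""
    model: str
    max_num_batched_tokens: int = 16384   # placeholder value
    max_num_seqs: int = 512               # placeholder value
    max_model_len: int = 4096             # placeholder value
    gpu_memory_utilization: float = 0.9   # 90% by default, per the notes
    tensor_parallel_size: int = 1
    enforce_eager: bool = False
    kvcache_block_size: int = 256
    num_kvcache_blocks: int = -1          # -1 means "not computed yet" in this sketch

    def __post_init__(self):
        if self.kvcache_block_size % 256 != 0:
            raise ValueError("kvcache_block_size must be a multiple of 256")
        if not 0.0 < self.gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        # Assumed sanity check: one full-length prompt should fit in a prefill batch.
        if self.max_num_batched_tokens < self.max_model_len:
            raise ValueError("max_num_batched_tokens must be >= max_model_len")
```

Validating in `__post_init__` keeps bad values from reaching the scheduler or the KV-cache allocator later on.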
sampling_params.py
- temperature
- max_tokens
- ignore_eos
bench.py
The script generates random token sequences, feeds them to a language model, and measures throughput.
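The idea can be sketched in a few lines. This is not the actual script: `make_random_prompts`, `measure_throughput`, and the stand-in `fake_generate` are hypothetical names, and a real run would call the engine's generate method instead of the fake one.

```python
import random
import time


def make_random_prompts(num_seqs, max_len, vocab_size, seed=0):
    """Build num_seqs prompts of random token ids with random lengths in [1, max_len]."""
    rng = random.Random(seed)
    return [
        [rng.randrange(vocab_size) for _ in range(rng.randint(1, max_len))]
        for _ in range(num_seqs)
    ]


def measure_throughput(generate, prompts, max_tokens):
    """Time one generate call and return (generated tokens, tokens per second)."""
    start = time.perf_counter()
    outputs = generate(prompts, max_tokens)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    total_tokens = sum(len(out) for out in outputs)
    return total_tokens, total_tokens / elapsed


def fake_generate(prompts, max_tokens):
    """Stand-in for a model: emits max_tokens dummy tokens per prompt."""
    return [[0] * max_tokens for _ in prompts]
```

Using random token ids instead of real text avoids tokenizer overhead and keeps the measurement focused on the engine.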
engine/sequence.py
This file implements a Sequence class that represents a single request/prompt being processed by the inference engine.
Attributes:
- seq_id: Unique id.
- status: waiting, running, finished.
- token_ids: full list of tokens.
- rest of the token ids, sampling params.
Properties:
- Multiple token ids, block counts.
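A rough sketch of what such a class can look like follows. The property names (`num_completion_tokens`, `num_blocks`, `last_block_num_tokens`) and the default sampling values are my own illustrative choices, not a copy of the file.

```python
from enum import Enum, auto
from itertools import count


class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()


class Sequence:
    block_size = 256      # tokens per KV-cache block
    _counter = count()    # source of unique ids

    def __init__(self, token_ids, temperature=1.0, max_tokens=64, ignore_eos=False):
        self.seq_id = next(Sequence._counter)
        self.status = SequenceStatus.WAITING
        self.token_ids = list(token_ids)            # prompt plus generated tokens
        self.num_prompt_tokens = len(self.token_ids)
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.ignore_eos = ignore_eos

    def __len__(self):
        return len(self.token_ids)

    @property
    def num_completion_tokens(self):
        return len(self) - self.num_prompt_tokens

    @property
    def num_blocks(self):
        # Ceiling division: a partially filled block still occupies a block.
        return (len(self) + self.block_size - 1) // self.block_size

    @property
    def last_block_num_tokens(self):
        return len(self) - (self.num_blocks - 1) * self.block_size

    def append_token(self, token_id):
        self.token_ids.append(token_id)
```

Deriving the block counts from the token list keeps the sequence and the block manager in agreement without storing redundant state.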
engine/block_manager.py
Block class for a single block:
- block_id
- ref_count: int (for prefix caching)
- hash: int (for prefix caching)
- token_ids: list[int]
BlockManager class for orchestration and prefix caching:
Attributes:
- block_size: int
- blocks: list[Block]
- hash_to_block_id: dict[int, int]
- free_block_ids: deque[int]
- used_block_ids: deque[int]
Methods:
- allocate
- deallocate
- can_allocate
- can_append
- may_append
- compute_hash
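To show how hash-based prefix caching can work with these pieces, here is a simplified sketch. It uses `hashlib.sha256` as a stand-in hash, leaves out deallocation, and `TinyBlockManager` is a hypothetical name; treat it as an illustration of the idea rather than the file's implementation.

```python
import hashlib
from collections import deque


def compute_hash(token_ids, prefix_hash=-1):
    """Hash one full block, chained with the hash of the block before it."""
    h = hashlib.sha256()
    if prefix_hash != -1:
        h.update(prefix_hash.to_bytes(8, "little"))
    for token_id in token_ids:
        h.update(token_id.to_bytes(4, "little"))
    return int.from_bytes(h.digest()[:8], "little")


class TinyBlockManager:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.ref_count = [0] * num_blocks
        self.block_tokens = [None] * num_blocks
        self.hash_to_block_id = {}
        self.free_block_ids = deque(range(num_blocks))

    def can_allocate(self, token_ids):
        # Conservative: assumes no block will be served from the cache.
        needed = -(-len(token_ids) // self.block_size)
        return len(self.free_block_ids) >= needed

    def allocate(self, token_ids):
        if not self.can_allocate(token_ids):
            raise RuntimeError("not enough free blocks")
        block_table, num_cached_tokens = [], 0
        prefix_hash, cache_miss = -1, False
        for start in range(0, len(token_ids), self.block_size):
            chunk = list(token_ids[start:start + self.block_size])
            is_full = len(chunk) == self.block_size
            # Only full blocks are hashed; a partial last block is never shared.
            block_hash = compute_hash(chunk, prefix_hash) if is_full else -1
            block_id = self.hash_to_block_id.get(block_hash, -1)
            if block_id == -1 or self.block_tokens[block_id] != chunk:
                cache_miss = True  # after the first miss, all later blocks miss too
            if cache_miss:
                block_id = self.free_block_ids.popleft()
                if is_full:
                    self.hash_to_block_id[block_hash] = block_id
                self.block_tokens[block_id] = chunk
            else:
                num_cached_tokens += self.block_size
            self.ref_count[block_id] += 1
            block_table.append(block_id)
            prefix_hash = block_hash
        return block_table, num_cached_tokens
```

Because each hash includes the previous block's hash, two blocks with identical tokens but different prefixes never collide on purpose, which is what makes sharing safe.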
utils/context.py
A context object that shares global state across different parts of the inference engine.
Attributes:
- is_prefill: whether it’s in prefill phase
- cu_seqlens_q: Cumulative sequence lengths for queries for FlashAttention
- cu_seqlens_k: Cumulative sequence lengths for keys for FlashAttention
- max_seqlen_q: Maximum query sequence length in the current batch
- max_seqlen_k: Maximum key sequence length in the current batch
- slot_mapping: token to physical KV cache slots mapping
- context_lens: Number of cached tokens for each sequence
- block_tables: Maps each sequence to its allocated memory blocks
The module keeps a single global Context instance, with functions to set and reset it.
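A minimal sketch of this pattern is below. The function names `get_context`, `set_context`, and `reset_context` are assumed for illustration, and plain Python objects stand in for the tensors a real engine would store.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Context:
    is_prefill: bool = False
    cu_seqlens_q: Optional[Any] = None
    cu_seqlens_k: Optional[Any] = None
    max_seqlen_q: int = 0
    max_seqlen_k: int = 0
    slot_mapping: Optional[Any] = None
    context_lens: Optional[Any] = None
    block_tables: Optional[Any] = None


_CONTEXT = Context()  # module-level singleton


def get_context():
    return _CONTEXT


def set_context(is_prefill, **fields):
    """Replace the global context before a forward pass."""
    global _CONTEXT
    _CONTEXT = Context(is_prefill=is_prefill, **fields)


def reset_context():
    """Restore the defaults after the forward pass."""
    global _CONTEXT
    _CONTEXT = Context()
```

The benefit is that deep layers such as attention can read batch metadata without every intermediate module having to pass it through its forward signature.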
engine/scheduler.py
The brain of continuous batching. It maintains queues for waiting and running sequences. It decides how many sequences from the waiting queue can be processed in the “prefill” phase, and which running sequences can proceed in the “decode” phase without exceeding the GPU’s KV cache capacity. It also handles “preemption” (pausing a sequence if the engine runs out of memory).
Methods:
- is_finished(): Returns True if both queues are empty, meaning the engine has no more work to do.
- add(seq: Sequence): Accepts a new incoming request (Sequence) and puts it at the back of the waiting queue.
- schedule(): This method is called before every forward pass of the model. It decides what the model will compute next. It returns a tuple containing the list of sequences to process and a boolean indicating whether the step is a prefill (True) or a decode (False).
- preempt(): Pauses a sequence when the engine runs out of KV-cache memory.
- postprocess(): After the model generates the next tokens (token_ids), this method applies them to the sequences.
Notes:
- max_num_seqs limits how many requests (sequences) can be scheduled in one iteration, while max_num_batched_tokens limits total prompt tokens during prefill. Prefill is batched: in one scheduling step, it can schedule multiple waiting requests together, stopping when it hits sequence limit, token limit, or KV-cache allocation limit. Decode also schedules multiple running sequences (typically one token each), and is constrained by max_num_seqs plus KV-cache append capacity rather than max_num_batched_tokens.
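The scheduling rules in these notes can be sketched as a toy scheduler. Everything here is a simplification under my own assumptions: sequences only track token counts, a plain free-block counter replaces the block manager, a preempted sequence gives back all of its blocks and returns to the front of the waiting queue, and a sequence finishes after a fixed number of new tokens.

```python
from collections import deque


class Seq:
    """Stand-in for a request: tracks counts only, not real token ids."""
    def __init__(self, num_tokens, max_new_tokens):
        self.num_tokens = num_tokens      # prompt + generated so far
        self.max_new_tokens = max_new_tokens
        self.generated = 0
        self.num_blocks = 0               # KV-cache blocks currently held


class TinyScheduler:
    def __init__(self, max_num_seqs, max_num_batched_tokens, total_blocks, block_size):
        self.max_num_seqs = max_num_seqs
        self.max_num_batched_tokens = max_num_batched_tokens
        self.free_blocks = total_blocks
        self.block_size = block_size
        self.waiting, self.running = deque(), deque()

    def add(self, seq):
        self.waiting.append(seq)

    def is_finished(self):
        return not self.waiting and not self.running

    def _blocks_for(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceiling division

    def _needs_block(self, seq):
        return self._blocks_for(seq.num_tokens) > seq.num_blocks

    def preempt(self, seq):
        self.free_blocks += seq.num_blocks
        seq.num_blocks = 0
        self.waiting.appendleft(seq)

    def schedule(self):
        scheduled, batched_tokens = [], 0
        # Prefill: admit waiting sequences until a limit is hit.
        while self.waiting and len(scheduled) < self.max_num_seqs:
            seq = self.waiting[0]
            need = self._blocks_for(seq.num_tokens)
            if (batched_tokens + seq.num_tokens > self.max_num_batched_tokens
                    or need > self.free_blocks):
                break
            self.free_blocks -= need
            seq.num_blocks = need
            batched_tokens += seq.num_tokens
            self.running.append(self.waiting.popleft())
            scheduled.append(seq)
        if scheduled:
            return scheduled, True
        # Decode: one token per running sequence, preempting if the cache is full.
        while self.running and len(scheduled) < self.max_num_seqs:
            seq = self.running.popleft()
            while self._needs_block(seq) and self.free_blocks == 0:
                if self.running:
                    self.preempt(self.running.pop())  # evict the newest sequence
                else:
                    self.preempt(seq)
                    break
            else:
                if self._needs_block(seq):
                    self.free_blocks -= 1
                    seq.num_blocks += 1
                scheduled.append(seq)
        self.running.extendleft(reversed(scheduled))
        return scheduled, False

    def postprocess(self, seqs):
        for seq in seqs:
            seq.num_tokens += 1
            seq.generated += 1
            if seq.generated >= seq.max_new_tokens:
                self.free_blocks += seq.num_blocks
                seq.num_blocks = 0
                self.running.remove(seq)
```

Note that prefill takes priority: decode only runs in a step where no waiting sequence could be admitted.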
engine/llm_engine.py
It defines the LLMEngine class — the central orchestrator of the nano-vllm inference engine. It ties together model execution, scheduling, tokenization, and multi-process tensor parallelism into a unified interface for running LLM inference.
Engine Initialization
- Config construction: Filters kwargs to only pass recognized fields to the Config dataclass
- Tensor parallelism setup: Rank 0 runs its own ModelRunner, and a worker process is created for each rank 1 and above.
- Tokenizer
- Scheduler
exit
Sends an “exit” command to the model runner (which propagates to all worker processes), then waits for all child processes to terminate with join().
add_request
- Accepts a prompt as either a string (which gets tokenized) or pre-tokenized token IDs.
- Wraps it in a Sequence object (which tracks state like generated tokens, finish status, etc.).
- Adds it to the scheduler’s queue.
step
- scheduler.schedule(): Selects a batch of sequences to run and determines whether this is a prefill or decode
- execute forward pass with model runner
- scheduler.postprocess(): Updates sequences with the newly generated tokens and marks finished sequences.
- Return: the completed sequences and a signed token count used for throughput reporting (positive for prefill, negative for decode).
generate
This is the user-facing method that runs end-to-end generation.
- Input normalization: If a single SamplingParams is provided, it’s broadcast to all prompts.
- Enqueuing: All prompts are added to the scheduler via add_request.
- Main loop: Repeatedly calls step() until all sequences finish.
- Throughput tracking (with tqdm)
- Output assembly
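The main loop can be sketched as follows, assuming step() returns a list of finished (seq_id, token_ids) pairs together with a signed token count. `run_until_done` and `FakeEngine` are hypothetical names, and the fake engine just replays a scripted list of step results.

```python
import time


def run_until_done(step, is_finished):
    """Call step() until the engine is idle; return outputs ordered by sequence id."""
    outputs = {}
    prefill_tps = decode_tps = 0.0
    while not is_finished():
        start = time.perf_counter()
        finished, num_tokens = step()
        elapsed = max(time.perf_counter() - start, 1e-9)
        if num_tokens > 0:
            prefill_tps = num_tokens / elapsed    # positive count: prefill step
        else:
            decode_tps = -num_tokens / elapsed    # negative count: decode step
        for seq_id, token_ids in finished:
            outputs[seq_id] = token_ids
    ordered = [outputs[seq_id] for seq_id in sorted(outputs)]
    return ordered, prefill_tps, decode_tps


class FakeEngine:
    """Replays scripted step results: one prefill step, then two decode steps."""
    def __init__(self):
        self.script = [([], 12), ([], -2), ([(1, [7, 8]), (0, [5, 6])], -2)]

    def is_finished(self):
        return not self.script

    def step(self):
        return self.script.pop(0)
```

Sorting by sequence id at the end matters because sequences can finish in a different order than they were submitted.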
engine/model_runner.py
ModelRunner is the workhorse that actually runs the LLM on a GPU. It handles model loading, KV cache allocation, input preparation for both prefill and decode phases, CUDA graph capture for decode acceleration, and tensor-parallel coordination across multiple GPUs via NCCL and shared memory.
Initialization
- NCCL init: establishes GPU-to-GPU communication
- Device binding
- Model creation: Instantiates model with the HuggingFace config, then loads pretrained weights via load_model()
- Sampler creation: Creates a Sampler that converts logits → token IDs using temperature-scaled softmax + Gumbel sampling.
- Warmup: Runs a dummy forward pass to trigger all lazy CUDA/Triton kernel compilation
- KV cache allocation: allocate_kv_cache() — uses the peak memory measurement to compute how many KV cache blocks fit in remaining GPU memory
- CUDA graph capture
- Tensor parallel setup: If world_size > 1, rank 0 creates a 1 MB shared memory segment (SharedMemory)
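For the sampler step in the list above, here is a small pure-Python sketch of Gumbel sampling. Adding independent Gumbel noise to temperature-scaled logits and taking the argmax draws from the same distribution as sampling from the temperature-scaled softmax. The function name and the clamping constant are my own; a real sampler would do this on batched tensors.

```python
import math
import random


def sample(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature) via the Gumbel-max trick."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    best_index, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        score = logit / temperature + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

With a very small temperature the scaled logits dominate the noise, so the result approaches greedy decoding.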
Tensor Parallel Communication
- The multi-GPU design uses a leader-follower pattern with shared memory + multiprocessing Events
- write_shm: Serializes a method name + arguments with pickle, writes a 4-byte length header + payload into shared memory, then signals all follower processes via their Events.
- read_shm: Blocks until signaled, deserializes the command, then clears the event.
- call: When rank 0 calls call(“run”, seqs, is_prefill), it first broadcasts the command to followers, then executes locally. Followers receive the command in loop() and execute the same method. This ensures all ranks execute the same operations in lock-step.
- loop: Non-zero ranks sit in this infinite loop, only breaking on an “exit” command.
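The length-prefixed framing can be tried out in a single process. This sketch leaves out the Event signalling and the worker processes entirely, assumes a little-endian header, and `demo_roundtrip` is a hypothetical helper that only exists to show a write followed by a read.

```python
import pickle
from multiprocessing.shared_memory import SharedMemory


def write_shm(shm, method_name, *args):
    """Pickle [method_name, *args] and store it behind a 4-byte length header."""
    payload = pickle.dumps([method_name, *args])
    n = len(payload)
    if n + 4 > shm.size:
        raise ValueError("command does not fit in the shared memory segment")
    shm.buf[0:4] = n.to_bytes(4, "little")
    shm.buf[4:4 + n] = payload


def read_shm(shm):
    """Read the header, then unpickle exactly that many payload bytes."""
    n = int.from_bytes(bytes(shm.buf[0:4]), "little")
    method_name, *args = pickle.loads(bytes(shm.buf[4:4 + n]))
    return method_name, args


def demo_roundtrip():
    shm = SharedMemory(create=True, size=2**20)  # 1 MB segment
    try:
        write_shm(shm, "run", [1, 2, 3], True)
        return read_shm(shm)
    finally:
        shm.close()
        shm.unlink()
```

The length header is needed because the segment has a fixed size and the reader otherwise cannot tell where the pickled payload ends.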
Model Warmup
Creates worst-case dummy sequences (maximum length, maximum batch) and runs a full prefill forward pass.
- Triggers all lazy CUDA kernel compilation
- Records peak memory usage so allocate_kv_cache knows how much memory the model needs
KV Cache Allocation
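As a rough illustration of the arithmetic behind sizing the cache, the sketch below computes the bytes one block needs and how many blocks fit in a memory budget. The budget formula (total memory times the utilization fraction, minus the peak measured during warmup) is a simplification I am assuming here, and the function names and example numbers are hypothetical.

```python
def kv_block_bytes(num_layers, block_size, num_kv_heads, head_dim, dtype_bytes, tp_size=1):
    """Bytes for one KV-cache block on one rank: K and V, for every layer."""
    heads_per_rank = num_kv_heads // tp_size
    return 2 * num_layers * block_size * heads_per_rank * head_dim * dtype_bytes


def num_kvcache_blocks(total_bytes, gpu_memory_utilization, peak_bytes, block_bytes):
    """How many blocks fit in the assumed budget."""
    budget = int(total_bytes * gpu_memory_utilization) - peak_bytes
    if budget < block_bytes:
        raise ValueError("not enough memory for even one KV-cache block")
    return budget // block_bytes


# Example: 28 layers, 256-token blocks, 8 KV heads of dim 128, 2-byte dtype.
block = kv_block_bytes(28, 256, 8, 128, 2)
# A 24 GiB device at 90% utilization with a 3 GiB warmup peak.
blocks = num_kvcache_blocks(24 * 2**30, 0.9, 3 * 2**30, block)
```

With these example numbers each block takes 28 MiB, so the block size and the number of KV heads per rank have a large effect on how many sequences fit.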
Input Preparation
- prepare_prefill: This builds inputs compatible with Flash Attention’s variable-length API
- prepare_decode: All tensors are created with pin_memory=True and transferred with .cuda(non_blocking=True) for overlapped CPU→GPU transfers.
- prepare_sample: Collects per-sequence temperature values into a tensor for the sampler.
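To illustrate the prefill metadata, here is a pure-Python sketch that builds cumulative sequence lengths and a slot mapping from per-sequence lengths and block tables. `build_prefill_metadata` is a hypothetical helper and plain lists stand in for tensors; I am also assuming that cached prefix tokens are skipped on the query side but counted on the key side.

```python
from itertools import accumulate


def build_prefill_metadata(seq_lens, cached_lens, block_tables, block_size):
    """seq_lens: total tokens per sequence; cached_lens: tokens already in the KV cache."""
    q_lens = [n - c for n, c in zip(seq_lens, cached_lens)]
    slot_mapping = []
    for n, c, table in zip(seq_lens, cached_lens, block_tables):
        for pos in range(c, n):  # only uncached tokens get written to the cache
            block_id = table[pos // block_size]
            slot_mapping.append(block_id * block_size + pos % block_size)
    return {
        "cu_seqlens_q": [0, *accumulate(q_lens)],
        "cu_seqlens_k": [0, *accumulate(seq_lens)],
        "max_seqlen_q": max(q_lens),
        "max_seqlen_k": max(seq_lens),
        "slot_mapping": slot_mapping,
    }
```

The cumulative arrays let all sequences be packed into one flat token dimension without padding: sequence i occupies positions cu_seqlens[i] up to cu_seqlens[i + 1].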
Model Execution
- Prefill or eager or batch > 512: Eager execution
- Decode with batch ≤ 512: CUDA graph replay
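The dispatch rule above fits in a single predicate. The function and parameter names are hypothetical; only the three conditions and the 512 threshold come from the notes.

```python
def use_cuda_graph(is_prefill, enforce_eager, batch_size, max_graph_batch_size=512):
    """Return True when a decode step can replay a captured CUDA graph."""
    return not (is_prefill or enforce_eager or batch_size > max_graph_batch_size)
```

Prefill stays eager because its input shapes vary from step to step, whereas decode processes one token per sequence and so has a small set of possible shapes.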
CUDA Graph Capture
layers/attention.py
This file implements the attention layer for a minimal vLLM-style inference engine. It has three components:
- A Triton GPU kernel for writing KV data into a paged KV cache
- A Python wrapper that launches that kernel
- An Attention module that orchestrates KV caching and dispatches to FlashAttention for both prefill and decode phases
The Triton Kernel
A GPU kernel written in Triton that copies newly computed key/value vectors into the paged KV cache.
The Python Wrapper
- Shape extraction: key has shape [N, num_heads, head_dim] where N = total tokens in the batch.
- Stride assertions: Verifies the tensors are contiguous in the expected layout — the last dimension (head_dim) must be contiguous (stride 1), and the num_heads dimension must have stride head_dim. This ensures the kernel can treat num_heads × head_dim as a single flat vector of size D.
- Launch: [(N,)] launches N thread blocks — one per token.
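What the kernel and wrapper accomplish can be emulated in plain Python, with nested lists standing in for tensors and a loop iteration standing in for each per-token program. This is only an emulation of the data movement under those assumptions, not Triton code.

```python
def store_kvcache(key, value, k_cache, v_cache, slot_mapping):
    """key/value: N x num_heads x head_dim; caches: num_slots x (num_heads * head_dim)."""
    assert len(key) == len(value) == len(slot_mapping)
    for token_idx, slot in enumerate(slot_mapping):  # one "program" per token
        # Flatten [num_heads, head_dim] into a single vector of size D.
        k_cache[slot] = [x for head in key[token_idx] for x in head]
        v_cache[slot] = [x for head in value[token_idx] for x in head]


# Two tokens, 2 heads of dim 2, written to slots 3 and 0 of a 4-slot cache.
key = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
value = [[[9, 9], [9, 9]], [[0, 1], [0, 1]]]
k_cache = [[0] * 4 for _ in range(4)]
v_cache = [[0] * 4 for _ in range(4)]
store_kvcache(key, value, k_cache, v_cache, slot_mapping=[3, 0])
```

The slot mapping is what decouples the logical token order in the batch from the physical location in the paged cache.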
The Attention Module
- Stores attention hyperparameters
- self.k_cache and self.v_cache are initialized as empty tensors. They get replaced later by the ModelRunner with properly sized paged cache buffers allocated by the block manager.
layers/linear.py & layers/embed_head.py
Tensor Parallelism in nano-vllm
Overview
These two files implement Megatron-LM style tensor parallelism (TP), splitting model weights across GPUs to serve large language models. The core pattern is:
Column-parallel (no comm) → local compute → Row-parallel (all-reduce)
Each transformer sub-block (attention, MLP) requires only one all-reduce, minimizing communication overhead.
Linear Layers (linear.py)
| Class | Sharding | Communication | Use Case |
|---|---|---|---|
| ReplicatedLinear | None (full copy) | None | Small layers not worth splitting |
| ColumnParallelLinear | Output dim (dim=0) | None | First projection (Q/K/V, gate, up) |
| MergedColumnParallelLinear | Output dim (multiple merged) | None | Fused gate + up proj |
| QKVParallelLinear | Output dim (Q/K/V merged) | None | Fused QKV with GQA support |
| RowParallelLinear | Input dim (dim=1) | all_reduce | Second projection (o_proj, down_proj) |
Key Mechanisms
- weight_loader hook: Attached to each parameter; called during checkpoint loading to extract the correct shard per GPU via .narrow()/.chunk().
- Column-parallel: Splits output dim → each GPU produces a slice of the output, no communication needed.
- Row-parallel: Splits input dim → each GPU computes a partial result, summed via all_reduce. Bias added only on rank 0 to avoid double-counting.
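A small pure-Python check shows why the column-then-row pairing needs only one reduction. It uses the math convention y = x @ W with W shaped [in, out], so the output dimension is the columns here, whereas a PyTorch linear layer stores weights as [out, in], which is why the table says dim=0. Biases and activations are left out, and all helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(m, parts):
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def tp_two_layer(x, w1, w2, tp_size):
    """Column-parallel w1, then row-parallel w2, then a single sum across ranks."""
    w1_shards = split_cols(w1, tp_size)   # each rank owns a slice of the output dim
    w2_shards = split_rows(w2, tp_size)   # each rank owns the matching input slice
    partials = [matmul(matmul(x, a), b) for a, b in zip(w1_shards, w2_shards)]
    # The elementwise sum plays the role of all_reduce.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2]]
w1 = [[1, 2, 3, 4], [5, 6, 7, 8]]          # shape [2, 4]
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]      # shape [4, 2]
```

Each rank's intermediate slice feeds directly into its own rows of the second weight, so no communication is needed between the two matmuls.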
QKV Weight Layout (per GPU)
For GQA (e.g., 8 Q heads, 2 KV heads, tp_size=2):
[Q_shard (4 heads) | K_shard (1 head) | V_shard (1 head)]
Loaded via loaded_shard_id ∈ {"q", "k", "v"} with computed offsets.
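The offsets in this layout follow from simple arithmetic, sketched below. Both helpers are hypothetical: the first gives (offset, size) of each part inside one rank's fused output dimension, and the second gives the rows a rank would slice out of the full checkpoint tensor.

```python
def qkv_shard_layout(num_heads, num_kv_heads, head_dim, tp_size):
    """Return {shard_id: (offset, size)} within one rank's fused QKV output dim."""
    q_size = (num_heads // tp_size) * head_dim
    kv_size = (num_kv_heads // tp_size) * head_dim
    return {
        "q": (0, q_size),
        "k": (q_size, kv_size),
        "v": (q_size + kv_size, kv_size),
    }


def checkpoint_rows(shard_size, rank):
    """Return (start, size) of this rank's slice in the full checkpoint weight."""
    return rank * shard_size, shard_size


# The example above: 8 Q heads, 2 KV heads, tp_size=2, with head_dim=128 assumed.
layout = qkv_shard_layout(num_heads=8, num_kv_heads=2, head_dim=128, tp_size=2)
```

With head_dim=128 each rank's fused weight has 512 + 128 + 128 = 768 output rows, compared with 1536 for the unsharded layer.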
Embedding & LM Head (embed_head.py)
| Class | Sharding | Communication | Purpose |
|---|---|---|---|
| VocabParallelEmbedding | Vocab rows | all_reduce | Input token embeddings |
| ParallelLMHead | Vocab rows | gather to rank 0 | Output logits for sampling |
Embedding Forward
- Mask tokens outside this rank’s vocab range.
- Remap global token IDs to local indices.
- Lookup embeddings, zero out masked positions.
- All-reduce to combine (each token is non-zero on exactly one rank).
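The four steps can be simulated in pure Python, with each rank holding a contiguous slice of the embedding table and a sum standing in for the all-reduce. The helper names are hypothetical.

```python
def embed_on_rank(token_ids, table_shard, rank):
    """Look up tokens in this rank's vocab shard; out-of-range tokens give zeros."""
    shard_size = len(table_shard)
    start, end = rank * shard_size, (rank + 1) * shard_size
    dim = len(table_shard[0])
    out = []
    for token_id in token_ids:
        if start <= token_id < end:                      # mask check
            out.append(list(table_shard[token_id - start]))  # global -> local index
        else:
            out.append([0.0] * dim)                      # zeroed masked position
    return out


def all_reduce_sum(per_rank_outputs):
    """Elementwise sum across ranks, standing in for all_reduce."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*per_rank_outputs)]


table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]   # vocab of 4, dim 2
shards = [table[0:2], table[2:4]]                          # two ranks
tokens = [3, 0, 2]
combined = all_reduce_sum([embed_on_rank(tokens, s, r) for r, s in enumerate(shards)])
```

Since every token id falls in exactly one rank's range, the sum simply reassembles the full lookup.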
LM Head Forward
- Prefill optimization: Extract only the last token per sequence (only those need logits).
- Linear projection with this rank’s vocab shard → partial logits.
- Gather to rank 0 and concatenate → only rank 0 has full logits for sampling. Other ranks return None.
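Two of these steps are easy to sketch with plain lists: picking the last token of each packed sequence from the cumulative query lengths, and concatenating per-rank logit shards along the vocab dimension. Both function names are hypothetical, and I am assuming the last-token indices are derived from cu_seqlens_q.

```python
def last_token_indices(cu_seqlens_q):
    """Index of the final token of each sequence in the packed batch."""
    return [end - 1 for end in cu_seqlens_q[1:]]


def gather_logits(per_rank_logits):
    """per_rank_logits[r][i] holds row i's logits over rank r's vocab shard."""
    return [
        [value for shard in rows for value in shard]
        for rows in zip(*per_rank_logits)
    ]
```

During prefill only the last position of each sequence predicts a new token, so projecting just those rows avoids computing logits for every prompt token.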
End-to-End Transformer Layer Flow
Input (identical on all GPUs)
│
├─► QKVParallelLinear (column, no comm)
│ └─► Attention (local)
│ └─► RowParallelLinear/o_proj (all_reduce) ──► + residual
│
├─► MergedColumnParallelLinear/gate+up (column, no comm)
│ └─► SiLU(gate) * up (local)
│ └─► RowParallelLinear/down_proj (all_reduce) ──► + residual
│
└─► (repeat N layers) ─► ParallelLMHead (gather to rank 0) ─► Sampling
Design Principles
- Minimize communication: Column + Row pairing ensures only one all_reduce per sub-block.
- Decoupled loading: weight_loader hooks let each GPU extract its shard from full checkpoint weights independently.
- Asymmetric head output: Embedding uses all_reduce (all GPUs need embeddings), LM head uses gather (only rank 0 samples), saving memory.
models/qwen3.py
Implementation of the Qwen3 causal language model for the nano-vllm inference engine, with tensor parallelism (TP) and KV caching support.
Architecture
Qwen3ForCausalLM # top-level entry point
└─ Qwen3Model # embedding + N decoder layers + final norm
└─ Qwen3DecoderLayer # single transformer block
├─ Qwen3Attention # GQA + RoPE + optional QK-norm
└─ Qwen3MLP # gated SiLU (SwiGLU) FFN
Classes
Qwen3Attention
Multi-head attention with Grouped-Query Attention (GQA) — fewer KV heads than Q heads to reduce memory.
- qkv_proj (QKVParallelLinear): Fused Q/K/V projection in one matmul, column-sharded across GPUs.
- o_proj (RowParallelLinear): Output projection, row-sharded with all_reduce.
- rotary_emb: Applies RoPE positional encoding via a precomputed cos/sin cache.
- attn (Attention): FlashAttention with KV cache; uses flash_attn_varlen_func for prefill and flash_attn_with_kvcache for decode.
- q_norm / k_norm (RMSNorm): Per-head QK normalization, active only when qkv_bias=False (Qwen3 default).
Qwen3MLP
Gated SiLU (SwiGLU) feed-forward network.
- gate_up_proj (MergedColumnParallelLinear): Fused gate + up projection, column-sharded.
- down_proj (RowParallelLinear): Down projection, row-sharded with all_reduce.
- act_fn (SiluAndMul): Splits output in half, applies SiLU(gate) * up.
Qwen3DecoderLayer
Single transformer block with Pre-RMSNorm and a fused residual add optimization.
- The RMSNorm.add_rms_forward method fuses residual = x + old_residual and x = RMSNorm(residual) into one compiled pass, reducing memory overhead.
- The residual tensor is passed between layers as a separate stream.
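The math of the fused step can be written out on plain lists. This only shows the arithmetic, not the compiled kernel; the function names and the default epsilon are assumptions for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def add_rms_norm(x, residual, weight, eps=1e-6):
    """Return (RMSNorm(x + residual), x + residual): the sum becomes the new residual."""
    new_residual = [a + b for a, b in zip(x, residual)]
    return rms_norm(new_residual, weight, eps), new_residual
```

Returning the pre-normalization sum alongside the normalized output is what lets the next layer continue the residual stream without a separate addition.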
Qwen3Model
Transformer backbone.
- embed_tokens (VocabParallelEmbedding): Vocabulary sharded across GPUs, combined via all_reduce.
- layers: Stack of Qwen3DecoderLayer modules.
- norm: Final RMSNorm that folds in the last residual.
Qwen3ForCausalLM
Top-level model class.
- packed_modules_mapping: Maps original checkpoint weight names (e.g., q_proj, k_proj, v_proj) to fused parameter names (e.g., qkv_proj) for weight loading.
- lm_head (ParallelLMHead): Projects hidden states to vocab logits; gathers across TP ranks.
- Supports weight tying between embedding and LM head.
Key Design Patterns
| Pattern | Details |
|---|---|
| Tensor Parallelism | Column-parallel for Q/K/V/gate/up; row-parallel + all_reduce for O/down |
| Fused Projections | QKV in one linear; gate + up in one linear |
| Fused Residual + Norm | RMSNorm combines residual addition and normalization in a single kernel compiled with @torch.compile |
| KV Caching | Triton kernel for cache storage; FlashAttention for prefill and paged decode |
| GQA | Fewer KV heads than Q heads to reduce memory |
| QK Normalization | Per-head RMSNorm on Q and K (active when no QKV bias) |