Nano-vLLM Study Notes

I’ve recently been studying the nano-vLLM codebase, and I’m writing this article to document my notes and learnings for future reference.

config.py

  • model: Path to the model weights and config files.
  • max_num_batched_tokens: The maximum total number of tokens that can be processed in a single batch across all sequences.
  • max_num_seqs: The maximum number of sequences (requests) that can be processed simultaneously in a batch.
  • max_model_len: The maximum length (in tokens) of a single sequence.
  • gpu_memory_utilization: The fraction of GPU memory to use (90% by default).
  • tensor_parallel_size: The number of GPUs to shard the model across via tensor parallelism.
  • enforce_eager: Whether to force eager execution instead of capturing CUDA graphs.
  • hf_config: The HuggingFace model configuration object.
  • eos: The end-of-sequence token ID (filled in from the tokenizer at startup).
  • kvcache_block_size: The size (in tokens) of each block in the KV cache. Must be a multiple of 256.
  • num_kvcache_blocks: The total number of KV-cache blocks to allocate, computed at startup from the available GPU memory.
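
Putting these together, the config plausibly looks like a small dataclass along the following lines. This is a sketch: only the field names come from the list above, and the defaults are illustrative guesses, not the actual values.

```python
from dataclasses import dataclass

from transformers import AutoConfig, PretrainedConfig


@dataclass
class Config:
    model: str
    max_num_batched_tokens: int = 16384   # illustrative default
    max_num_seqs: int = 512               # illustrative default
    max_model_len: int = 4096             # illustrative default
    gpu_memory_utilization: float = 0.9
    tensor_parallel_size: int = 1
    enforce_eager: bool = False
    hf_config: PretrainedConfig | None = None
    eos: int = -1                          # filled in from the tokenizer later
    kvcache_block_size: int = 256
    num_kvcache_blocks: int = -1           # computed from free GPU memory later

    def __post_init__(self):
        # Enforce the block-size constraint noted above.
        assert self.kvcache_block_size % 256 == 0
        self.hf_config = AutoConfig.from_pretrained(self.model)
```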

sampling_params.py

  • temperature: Controls sampling randomness; values near 0 approach greedy decoding, higher values flatten the distribution.
  • max_tokens: The maximum number of tokens to generate for the request.
  • ignore_eos: If true, keep generating even after the EOS token is produced (useful for benchmarking).
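
To make temperature concrete, here is a minimal sketch of how it is typically applied at the sampling step (an illustration of the standard technique, not necessarily the exact nano-vLLM sampler):

```python
import torch


def sample(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Sample one token id per row of logits (shape: [batch, vocab])."""
    if temperature < 1e-5:
        # Near-zero temperature degenerates to greedy decoding.
        return logits.argmax(dim=-1)
    # Dividing logits by the temperature flattens (T > 1) or sharpens
    # (T < 1) the distribution before sampling from it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```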

bench.py

The script generates random token sequences, feeds them to a language model, and measures throughput.
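
The overall pattern plausibly looks like the sketch below; the LLM/SamplingParams API shape, the model path, and the constants here are assumptions based on the description above.

```python
import time
from random import randint, seed

from nanovllm import LLM, SamplingParams  # assumed import path

seed(0)
llm = LLM("path/to/model")  # hypothetical model path

# Random prompts: lists of random token IDs with varying lengths.
prompt_token_ids = [
    [randint(0, 10000) for _ in range(randint(100, 1024))]
    for _ in range(256)
]
sampling_params = [
    SamplingParams(temperature=0.6, max_tokens=randint(100, 1024), ignore_eos=True)
    for _ in prompt_token_ids
]

start = time.time()
llm.generate(prompt_token_ids, sampling_params)
elapsed = time.time() - start

# With ignore_eos=True every request generates exactly max_tokens tokens.
total_tokens = sum(sp.max_tokens for sp in sampling_params)
print(f"Throughput: {total_tokens / elapsed:.2f} tok/s")
```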

engine/sequence.py

This file implements a Sequence class that represents a single request/prompt being processed by the inference engine.

Attributes:

  • seq_id: Unique id.
  • status: waiting, running, finished.
  • token_ids: The full list of token IDs, prompt plus generated tokens.
  • Plus bookkeeping such as the number of prompt tokens, the number of cached tokens, the block table, and the request’s sampling parameters.

Properties:

  • Derived views such as prompt vs. completion token IDs, and block counts computed from the sequence length and the block size.
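
A skeletal version of such a class might look like this; the field and property names follow the notes above and should be read as assumptions rather than the exact source:

```python
from enum import Enum, auto
from itertools import count


class SequenceStatus(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()


class Sequence:
    block_size = 256          # assumed to match Config.kvcache_block_size
    counter = count()         # monotonically increasing ids

    def __init__(self, token_ids: list[int], temperature: float = 1.0,
                 max_tokens: int = 64, ignore_eos: bool = False):
        self.seq_id = next(Sequence.counter)
        self.status = SequenceStatus.WAITING
        self.token_ids = list(token_ids)      # prompt + generated tokens
        self.num_prompt_tokens = len(token_ids)
        self.num_cached_tokens = 0            # tokens already in the KV cache
        self.block_table: list[int] = []      # physical KV-cache block ids
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.ignore_eos = ignore_eos

    @property
    def num_tokens(self) -> int:
        return len(self.token_ids)

    @property
    def num_completion_tokens(self) -> int:
        return self.num_tokens - self.num_prompt_tokens

    @property
    def num_blocks(self) -> int:
        # Blocks needed to hold all tokens (ceiling division).
        return (self.num_tokens + self.block_size - 1) // self.block_size

    def append_token(self, token_id: int) -> None:
        self.token_ids.append(token_id)
```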

engine/block_manager.py

The Block class represents a single KV-cache block:

  • block_id
  • ref_count: int
  • hash: int
  • token_ids: list[int]
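
In code this is essentially a small record; the sentinel values below are assumptions:

```python
class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_count = 0              # number of sequences sharing this block
        self.hash = -1                  # -1 = no valid hash yet (block not full)
        self.token_ids: list[int] = []  # tokens stored in this block
```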

The BlockManager class orchestrates block allocation and implements prefix caching.

Attributes:

  • block_size: int
  • blocks: list[Block]
  • hash_to_block_id: dict[int, int]
  • free_block_ids: deque[int]
  • used_block_ids: set[int]

Methods:

  • allocate: Assigns blocks for a sequence’s prompt, reusing cache-hit blocks where possible.
  • deallocate: Returns a sequence’s blocks to the free pool, decrementing reference counts.
  • can_allocate: Checks whether enough free blocks exist to admit a sequence.
  • can_append: Checks whether the next decode step can proceed (a free block exists if one will be needed).
  • may_append: Updates block state as a sequence grows by one token, allocating a new block when the last one fills up.
  • compute_hash: Hashes a full block of token IDs, chained with the previous block’s hash, for prefix-cache lookups (see the sketch below).
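
The interesting part is prefix caching: each full block of token IDs gets a hash that chains in the previous block’s hash, so a hash hit means the entire prefix up to and including that block matches. A minimal sketch of the idea (the real code may use a different hash function):

```python
import hashlib


def compute_hash(token_ids: list[int], prefix_hash: int = -1) -> int:
    # Chain in the previous block's hash so that equal hashes imply the
    # whole prefix matches, not just this one block of tokens.
    h = hashlib.sha256()
    h.update(prefix_hash.to_bytes(8, "little", signed=True))
    for t in token_ids:
        h.update(t.to_bytes(4, "little"))
    return int.from_bytes(h.digest()[:8], "little")
```

On allocate, the manager can then walk a sequence block by block: a hash hit whose stored token_ids actually match means the block is reused and its ref_count is incremented; otherwise a block id is popped from free_block_ids.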

utils/context.py

A Context object shares global state across different parts of the inference engine.

Attributes:

  • is_prefill: Whether the engine is in the prefill phase (as opposed to decode).
  • cu_seqlens_q: Cumulative query sequence lengths in FlashAttention’s varlen format (e.g. lengths [3, 5, 2] become [0, 3, 8, 10]).
  • cu_seqlens_k: Cumulative key sequence lengths, in the same format.
  • max_seqlen_q: Maximum query sequence length in the current batch.
  • max_seqlen_k: Maximum key sequence length in the current batch.
  • slot_mapping: Maps each token in the batch to its physical slot in the KV cache.
  • context_lens: Number of cached tokens for each sequence.
  • block_tables: Maps each sequence to its allocated KV-cache blocks.

A global Context instance is kept at module level, with helper functions to set and reset it.
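
A sketch of this pattern, with field names taken from the list above (the helper signatures are assumptions):

```python
from dataclasses import dataclass

import torch


@dataclass
class Context:
    is_prefill: bool = False
    cu_seqlens_q: torch.Tensor | None = None
    cu_seqlens_k: torch.Tensor | None = None
    max_seqlen_q: int = 0
    max_seqlen_k: int = 0
    slot_mapping: torch.Tensor | None = None
    context_lens: torch.Tensor | None = None
    block_tables: torch.Tensor | None = None


_CONTEXT = Context()  # module-level singleton


def get_context() -> Context:
    return _CONTEXT


def set_context(**kwargs) -> None:
    # Replace the singleton before each forward pass.
    global _CONTEXT
    _CONTEXT = Context(**kwargs)


def reset_context() -> None:
    # Clear everything after the forward pass finishes.
    global _CONTEXT
    _CONTEXT = Context()
```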