Python bindings for quant.cpp -- a minimal C inference engine for local LLMs with KV cache compression.
pip install quantcpp

Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64). On other platforms, pip falls back to the source distribution and compiles quant.h automatically with your system C compiler; no external dependencies are required.
cd quant.cpp/bindings/python
pip install . # build + install
pip install -e . # editable / development install

To point at a pre-built library instead:
export QUANTCPP_LIB=/path/to/libquant.dylib
pip install .

- Python >= 3.8
- A C compiler (cc, gcc, or clang)
- The quant.cpp repository (for quant.h)
from quantcpp import Model
m = Model.from_pretrained("Phi-3.5-mini") # ~2.4 GB, downloaded once and cached
print(m.ask("What is 2+2?"))

from_pretrained accepts any name from quantcpp.available_models().
Phi-3.5-mini is the recommended default: 3.8B params with the smallest
vocab (32K) in the registry, which makes the per-token lm_head matmul
the fastest of any model we ship. Other ready-to-use names:

- SmolLM2-1.7B: lightweight all-rounder (1.7 GB, vocab 49K)
- Llama-3.2-1B: smallest download (750 MB) but slower at inference
- SmolLM2-135M: 138 MB demo model, low quality
- Qwen3.5-0.8B
You can also load any local GGUF file directly:
m = Model("model.gguf")
print(m.ask("What is 2+2?"))

for token in m.generate("Once upon a time"):
    print(token, end="", flush=True)

m = Model.from_pretrained("Phi-3.5-mini")
history = ""
while True:
    user = input("\nYou: ")
    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
    print("AI: ", end="", flush=True)
    reply = ""
    for tok in m.chat(history):
        print(tok, end="", flush=True)
        reply += tok
    history += reply + "<|end|>\n"

m.chat() reuses the KV cache across turns, so turn N's prefill cost is
O(new tokens), not O(history). Catch quantcpp.ChatContextOverflow if
the conversation exceeds the model's context window.
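One way to handle the overflow case is sketched below. It is an illustration rather than the library's own recovery logic: `phi3_turn`, `run_turn`, and `FakeModel` are hypothetical helpers, the local `ChatContextOverflow` class is a stand-in so the snippet runs without quantcpp installed, and the policy of retrying with only the latest turn is an assumption.

```python
class ChatContextOverflow(Exception):
    """Stand-in for quantcpp.ChatContextOverflow so the sketch runs without the library."""


def phi3_turn(user_text):
    # Same Phi-3 style template as the chat loop above.
    return f"<|user|>\n{user_text}<|end|>\n<|assistant|>\n"


def run_turn(model, history, user_text, overflow_exc=ChatContextOverflow):
    """Run one chat turn and return (new_history, reply)."""
    prompt = history + phi3_turn(user_text)
    try:
        reply = "".join(model.chat(prompt))
    except overflow_exc:
        # Assumed recovery policy: drop the old history and retry with only the latest turn.
        prompt = phi3_turn(user_text)
        reply = "".join(model.chat(prompt))
    return prompt + reply + "<|end|>\n", reply


class FakeModel:
    """Hypothetical test double: overflows when the prompt exceeds `limit` characters."""

    def __init__(self, limit):
        self.limit = limit

    def chat(self, prompt):
        if len(prompt) > self.limit:
            raise ChatContextOverflow(prompt)
        yield from ["ok", "!"]
```

With the real package, the same function would presumably be called as `run_turn(m, history, user, overflow_exc=quantcpp.ChatContextOverflow)`.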
with Model.from_pretrained("Phi-3.5-mini") as m:
    print(m.ask("Explain gravity in one sentence"))

m = Model(
    "model.gguf",
    temperature=0.5,  # Lower = more deterministic
    top_p=0.9,        # Nucleus sampling
    max_tokens=512,   # Max tokens per generation
    n_threads=8,      # CPU threads
    kv_compress=2,    # 0=off, 1=4-bit K+V, 2=delta+3-bit
)

from quantcpp import load
m = load("model.gguf", kv_compress=2)
print(m.ask("Hello!"))

Load a GGUF model file and create an inference context.

- path -- Path to a .gguf model file.
- temperature -- Sampling temperature (0.0 = greedy).
- top_p -- Nucleus sampling threshold.
- max_tokens -- Maximum tokens per generation.
- n_threads -- CPU thread count.
- kv_compress -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).
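To make the temperature and top_p parameters concrete, here is a generic sketch of temperature-scaled nucleus sampling over a list of logits. It illustrates the technique in general and is not quant.cpp's sampler; `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=0.5, top_p=0.9, rng=random):
    """Pick a token index from raw logits using temperature and nucleus (top-p) sampling."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample within the nucleus, proportionally to the kept probabilities.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Lower temperatures sharpen the distribution before the cut, so fewer tokens survive into the nucleus; a temperature of 0.0 skips sampling entirely.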
Model.from_pretrained downloads a registered model from Hugging Face (cached at
~/.cache/quantcpp/) and returns an open Model. See
quantcpp.available_models() for the registry.

ask generates a complete response and returns the full text.

generate streams tokens one at a time, yielding individual token strings.

chat streams tokens with KV cache reuse across calls, so turn N pays only for
the new bytes since turn N-1. Pass prompt=None (or call
Model.reset_chat()) to start a fresh session. It raises
quantcpp.ChatContextOverflow when the history exceeds the model's
context window (the C side has already auto-reset by then).
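The reuse idea can be sketched as a longest-common-prefix check between what is already cached and the new prompt: only the part after the shared prefix needs prefill. This is a toy model of the bookkeeping, not the C implementation; `reusable_prefix_len` and `PrefixCache` are hypothetical names, and no real KV tensors are involved.

```python
def reusable_prefix_len(cached, prompt):
    """Length of the longest common prefix of two sequences (tokens or bytes)."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


class PrefixCache:
    """Toy session that records which items are already covered by a (hypothetical) KV cache."""

    def __init__(self):
        self.tokens = []

    def prefill(self, prompt_tokens):
        """Return the items that still need processing and update the cache record."""
        keep = reusable_prefix_len(self.tokens, prompt_tokens)
        new_tokens = list(prompt_tokens[keep:])
        self.tokens = list(prompt_tokens)
        return new_tokens

    def reset(self):
        """Forget everything, as starting a fresh session would."""
        self.tokens = []
```

This is why the chat loop passes the full, growing history each turn: as long as the earlier part is unchanged, only the appended suffix costs anything.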
Release resources. Called automatically via with or garbage collection.
The path to the loaded model file (read-only property).
The package looks for the compiled shared library in this order:

- QUANTCPP_LIB environment variable
- Installed alongside the Python package (normal pip install)
- build/ relative to the project root (development)
- System library path
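The lookup order above could be expressed roughly as follows. This is a sketch under assumptions, not the package's actual loader: the function names are hypothetical, and `package_dir`, `project_root`, and `lib_name` (for example libquant.dylib on macOS) are supplied by the caller.

```python
import os
from pathlib import Path


def candidate_library_paths(package_dir, project_root, lib_name, env=None):
    """List places to look for the shared library, highest priority first."""
    env = os.environ if env is None else env
    candidates = []

    override = env.get("QUANTCPP_LIB")
    if override:
        candidates.append(Path(override))  # explicit override via environment variable
    candidates.append(Path(package_dir) / lib_name)  # next to the installed package
    candidates.append(Path(project_root) / "build" / lib_name)  # development build
    candidates.append(Path(lib_name))  # bare name, left to the system loader
    return candidates


def first_existing(paths):
    """Return the first candidate that exists on disk, or None."""
    for p in paths:
        if p.exists():
            return p
    return None
```

Printing the result of `candidate_library_paths(...)` is a quick way to see which locations would be tried when a library fails to load.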
cd bindings/python
python -m pytest tests/