Name	Name	Last commit message	Last commit date
parent directory ..
bindings/python	bindings/python
quantcpp	quantcpp
tests	tests
turboquant	turboquant
.gitignore	.gitignore
MANIFEST.in	MANIFEST.in
README.md	README.md
example.py	example.py
pyproject.toml	pyproject.toml
quant.h	quant.h
setup.py	setup.py
turboquant.py	turboquant.py
turboquant_cli.py	turboquant_cli.py

quantcpp

Python bindings for quant.cpp -- a minimal C inference engine for local LLMs with KV cache compression.

Installation

pip install quantcpp

Pre-built wheels are published for Linux (x86_64, aarch64), macOS (Intel + Apple Silicon), and Windows (x64). On other platforms pip falls back to the source distribution and compiles quant.h automatically using your system C compiler — no external dependencies.

From source (dev tree)

cd quant.cpp/bindings/python
pip install .          # build + install
pip install -e .       # editable / development install

To point at a pre-built library instead:

export QUANTCPP_LIB=/path/to/libquant.dylib
pip install .

Requirements

Python >= 3.8
A C compiler (cc, gcc, or clang)
The quant.cpp repository (for quant.h)

Usage

Quick start (auto-download)

from quantcpp import Model

m = Model.from_pretrained("Phi-3.5-mini")  # ~2.4 GB, downloaded once and cached
print(m.ask("What is 2+2?"))

from_pretrained accepts any name from quantcpp.available_models(). Phi-3.5-mini is the recommended default — 3.8B params with the smallest vocab (32K) in the registry, which makes the per-token lm_head matmul the fastest of any model we ship. Other ready-to-use names:

SmolLM2-1.7B — lightweight all-rounder (1.7 GB, vocab 49K)
Llama-3.2-1B — smallest download (750 MB) but slower at inference
SmolLM2-135M — 138 MB demo model, low quality
Qwen3.5-0.8B

You can also load any local GGUF file directly:

m = Model("model.gguf")
print(m.ask("What is 2+2?"))

Streaming generation

for token in m.generate("Once upon a time"):
    print(token, end="", flush=True)

Multi-turn chat with KV cache reuse

m = Model.from_pretrained("Phi-3.5-mini")
history = ""
while True:
    user = input("\nYou: ")
    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
    print("AI: ", end="", flush=True)
    reply = ""
    for tok in m.chat(history):
        print(tok, end="", flush=True)
        reply += tok
    history += reply + "<|end|>\n"

m.chat() reuses the KV cache across turns — turn N's prefill cost is O(new tokens), not O(history). Catch quantcpp.ChatContextOverflow if the conversation exceeds the model's context window.

Context manager

with Model.from_pretrained("Phi-3.5-mini") as m:
    print(m.ask("Explain gravity in one sentence"))

Configuration

m = Model(
    "model.gguf",
    temperature=0.5,      # Lower = more deterministic
    top_p=0.9,            # Nucleus sampling
    max_tokens=512,       # Max tokens per generation
    n_threads=8,          # CPU threads
    kv_compress=2,        # 0=off, 1=4-bit K+V, 2=delta+3-bit
)

Convenience loader

from quantcpp import load

m = load("model.gguf", kv_compress=2)
print(m.ask("Hello!"))

API Reference

`Model(path, *, temperature=0.7, top_p=0.9, max_tokens=256, n_threads=4, kv_compress=1)`

Load a GGUF model file and create an inference context.

path -- Path to a .gguf model file.
temperature -- Sampling temperature (0.0 = greedy).
top_p -- Nucleus sampling threshold.
max_tokens -- Maximum tokens per generation.
n_threads -- CPU thread count.
kv_compress -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).

`Model.from_pretrained(name) -> Model`

Download a registered model from HuggingFace (cached at ~/.cache/quantcpp/) and return an open Model. See quantcpp.available_models() for the registry.

`Model.ask(prompt) -> str`

Generate a complete response. Returns the full text.

`Model.generate(prompt) -> Iterator[str]`

Stream tokens one at a time. Yields individual token strings.

`Model.chat(prompt) -> Iterator[str]`

Stream tokens with KV cache reuse across calls — turn N pays only for the new bytes since turn N-1. Pass prompt=None (or call Model.reset_chat()) to start a fresh session. Raises quantcpp.ChatContextOverflow when the history exceeds the model's context window (the C side has already auto-reset by then).

`Model.close()`

Release resources. Called automatically via with or garbage collection.

`Model.path -> str`

The path to the loaded model file (read-only property).

Library search order

The package looks for the compiled shared library in this order:

QUANTCPP_LIB environment variable
Installed alongside the Python package (normal pip install)
build/ relative to the project root (development)
System library path

Running tests

cd bindings/python
python -m pytest tests/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

quantcpp

Installation

From source (dev tree)

Requirements

Usage

Quick start (auto-download)

Streaming generation

Multi-turn chat with KV cache reuse

Context manager

Configuration

Convenience loader

API Reference

`Model(path, *, temperature=0.7, top_p=0.9, max_tokens=256, n_threads=4, kv_compress=1)`

`Model.from_pretrained(name) -> Model`

`Model.ask(prompt) -> str`

`Model.generate(prompt) -> Iterator[str]`

`Model.chat(prompt) -> Iterator[str]`

`Model.close()`

`Model.path -> str`

Library search order

Running tests

Uh oh!

FilesExpand file tree

python

Directory actions

More options

Directory actions

More options

Latest commit

History

python

Folders and files

parent directory

README.md

quantcpp

Installation

From source (dev tree)

Requirements

Usage

Quick start (auto-download)

Streaming generation

Multi-turn chat with KV cache reuse

Context manager

Configuration

Convenience loader

API Reference

Model(path, *, temperature=0.7, top_p=0.9, max_tokens=256, n_threads=4, kv_compress=1)

Model.from_pretrained(name) -> Model

Model.ask(prompt) -> str

Model.generate(prompt) -> Iterator[str]

Model.chat(prompt) -> Iterator[str]

Model.close()

Model.path -> str

Library search order

Running tests

`Model(path, *, temperature=0.7, top_p=0.9, max_tokens=256, n_threads=4, kv_compress=1)`

`Model.from_pretrained(name) -> Model`

`Model.ask(prompt) -> str`

`Model.generate(prompt) -> Iterator[str]`

`Model.chat(prompt) -> Iterator[str]`

`Model.close()`

`Model.path -> str`