qwen3-tts.cpp

Benchmark Snapshot (PyTorch vs qwen3-tts.cpp): Basic 3.19x faster, Clone 4.07x faster. Peak RSS delta: Basic +19.0%, Clone +7.7%.

C++ inference for Qwen3-TTS using the GGML tensor library.

Runs the full TTS pipeline in pure C++17, including text tokenization, speaker encoding, transformer code generation, and vocoder decoding, without Python or PyTorch at inference time.

Features

Full text-to-speech pipeline in C++17 with GGML backend
0.6B and 1.7B model support (Base, CustomVoice, VoiceDesign)
Voice cloning from reference audio (ECAPA-TDNN x-vector extraction)
Speaker embedding save/load for fast repeated cloning (skip encoder)
Voice steering instructions for controlling speech style and emotion
Unicode/multilingual tokenizer (Chinese, Japanese, Korean, German, etc.)
Greedy and sampled decoding (temperature, top-k, repetition penalty, seed)
GGUF model format with built-in C++ quantizer (F16, Q8_0, Q4_K, etc.)
Runtime backend selection: Metal (macOS), Vulkan (Linux/Windows), CUDA (NVIDIA)
Cross-platform: macOS, Linux, Windows (MSVC)
C API for FFI integration (Python, Rust, Nim, etc.)
Deterministic reference tests comparing C++ output against Python

Prerequisites

Compiler: C++17-capable GCC 9+ or Clang 10+ (Linux/macOS), or MSVC 2019+ (Windows)
CMake: 3.14+
GGML: Built from source (vendored as git submodule)
GPU backends (optional):
- Metal — macOS only, built into Xcode Command Line Tools
- Vulkan SDK — Linux and Windows (AMD, NVIDIA, Intel)
- CUDA Toolkit — NVIDIA GPUs
Python 3.10+ with uv — model conversion only; not needed if using pre-built GGUF files

Quickstart (macOS)

Run these commands from a fresh clone. For Linux or Windows, see the Build section below.

git clone https://github.com/predict-woo/qwen3-tts.cpp.git
cd qwen3-tts.cpp
git submodule update --init --recursive

# 1) Build GGML with Metal
cmake -S ggml -B ggml/build -DGGML_METAL=ON
cmake --build ggml/build -j4

# 2) Build qwen3-tts.cpp
cmake -S . -B build
cmake --build build -j4

# 3) Create a uv Python environment for setup/conversion tools
uv venv .venv
source .venv/bin/activate

# 4) Install Python dependencies
uv pip install --upgrade pip
uv pip install huggingface_hub gguf torch safetensors numpy tqdm coremltools

# Optional if model access requires auth:
# huggingface-cli login

# 5) Download and generate all runtime model artifacts
python scripts/setup_pipeline_models.py

# 6) Basic synthesis example
./build/qwen3-tts-cli \
  -m models \
  -t "Hello from qwen3-tts.cpp running on macOS with CoreML by default." \
  -o examples/readme_example_basic.wav

# 7) Voice-clone example using sample audio in this repo
./build/qwen3-tts-cli \
  -m models \
  -r examples/readme_clone_input.wav \
  -t "This is a voice cloning example generated from the sample audio file in this directory." \
  -o examples/readme_example_clone.wav

Expected model artifacts after step 5:

models/qwen3-tts-0.6b-f16.gguf
models/qwen3-tts-tokenizer-f16.gguf
models/coreml/code_predictor.mlpackage (on macOS)

Expected audio outputs after steps 6-7:

examples/readme_example_basic.wav
examples/readme_example_clone.wav

Included voice-clone input/output pair (so users can compare directly):

Input reference audio: examples/readme_clone_input.wav
Generated output audio: examples/readme_example_clone.wav

Audio preview (inline):

If your Markdown renderer does not show inline controls, use direct links:

Build

Clone and initialize

git clone https://github.com/predict-woo/qwen3-tts.cpp.git
cd qwen3-tts.cpp
git submodule update --init --recursive

Note: The top-level CMake expects GGML in ./ggml with libraries under ./ggml/build/src.

Build matrix

Pick the row that matches your platform and GPU. CPU backend is always compiled in — GPU flags just add an additional backend that QWEN3_TTS_BACKEND=auto (the default) will prefer when available.

Platform	Mode	GGML flags (step 1)	Notes
macOS (Apple Silicon / Intel)	Metal + CPU (default)	(none — Metal and Apple Accelerate BLAS are default-on)	CoreML code predictor also enabled by default; see below.
macOS	CPU only	`-DGGML_METAL=OFF -DGGML_BLAS=OFF` + top-level `-DQWEN3_TTS_COREML=OFF`	Useful for bisecting GPU issues or on machines without a usable GPU.
Linux	CPU only	(none)	Opt into BLAS with `-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS` (or similar) if installed.
Linux	Vulkan + CPU	`-DGGML_VULKAN=ON`	Works on AMD, NVIDIA, Intel; install the Vulkan SDK first.
Linux	CUDA + CPU	`-DGGML_CUDA=ON`	NVIDIA only; requires the CUDA Toolkit.
Windows (MSVC)	CPU only	(none)	Open a Developer Command Prompt for VS 2019/2022.
Windows	Vulkan + CPU	`-DGGML_VULKAN=ON`	Copy `ggml-vulkan.dll` next to `qwen3-tts-cli.exe`.
Windows	CUDA + CPU	`-DGGML_CUDA=ON`	Copy `ggml-cuda.dll` next to `qwen3-tts-cli.exe`.

"GPU + CPU" is the normal operating mode for any GPU build — the GGML scheduler runs most work on GPU with automatic CPU fallback for unsupported ops. See Backend Selection for how to force one or the other at runtime.

macOS (Metal + CoreML, default)

# Step 1: Build GGML. Metal and Accelerate BLAS are on by default on Apple;
# passing -DGGML_METAL=ON is optional and just makes the intent explicit.
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(sysctl -n hw.ncpu)

# Step 2: Build qwen3-tts.cpp. QWEN3_TTS_COREML is ON by default on APPLE.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

CoreML acceleration for the code predictor stage is enabled by default when models/coreml/code_predictor.mlpackage exists. Generate that with python scripts/setup_pipeline_models.py --coreml on. Turn CoreML off at runtime with QWEN3_TTS_USE_COREML=0 or at compile time with -DQWEN3_TTS_COREML=OFF.

macOS (CPU only)

# Disable Metal, Accelerate BLAS, and the CoreML code predictor bridge.
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release \
    -DGGML_METAL=OFF -DGGML_BLAS=OFF
cmake --build ggml/build -j$(sysctl -n hw.ncpu)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DQWEN3_TTS_COREML=OFF
cmake --build build -j$(sysctl -n hw.ncpu)

You can also keep a Metal build and force CPU at runtime with QWEN3_TTS_BACKEND=cpu — no rebuild required.

Linux (CPU only)

cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Linux (Vulkan + CPU)

Install the Vulkan SDK first, then:

cmake -S ggml -B ggml/build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Linux (CUDA + CPU)

Requires the CUDA Toolkit, then:

cmake -S ggml -B ggml/build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

At runtime, set QWEN3_TTS_BACKEND=cuda to force the CUDA backend, or leave it at the default auto to let the scheduler pick it when available (see Backend Selection).

Windows (MSVC)

Open a Developer Command Prompt for VS 2022 (or 2019), then pick one:

:: CPU only (default)
cmake -S ggml -B ggml\build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml\build --config Release

:: Vulkan + CPU (install Vulkan SDK from https://vulkan.lunarg.com/)
:: cmake -S ggml -B ggml\build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release

:: CUDA + CPU (install CUDA Toolkit)
:: cmake -S ggml -B ggml\build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

:: Then build qwen3-tts.cpp against the chosen GGML build.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

DLL placement: Copy these DLLs next to qwen3-tts-cli.exe (and libqwen3tts.dll if you use the shared lib):

Always: ggml.dll, ggml-base.dll, ggml-cpu.dll
Vulkan build: also ggml-vulkan.dll
CUDA build: also ggml-cuda.dll plus CUDA runtime DLLs from your toolkit install

These are produced under ggml\build\bin\Release\ (or ggml\build\src\ depending on generator).

Model Setup (Recommended)

Use the one-shot setup script:

source .venv/bin/activate
python scripts/setup_pipeline_models.py

Useful flags:

--force re-downloads and re-generates all artifacts.
--coreml auto|on|off controls CoreML export behavior.
--skip-download skips HF download and uses existing local model dirs.

Pre-built Models (Skip Python)

Community-hosted GGUF files are available on HuggingFace. Download directly — no Python required:

mkdir -p models
curl -L -o models/qwen3-tts-0.6b-f16.gguf \
  "https://huggingface.co/koboldcpp/tts/resolve/main/qwen3-tts-0.6b-f16.gguf"
curl -L -o models/qwen3-tts-tokenizer-f16.gguf \
  "https://huggingface.co/koboldcpp/tts/resolve/main/qwen3-tts-tokenizer-f16.gguf"

These are F16 (full-precision) models. Q8_0 quantized variants are also available and will be auto-detected if named qwen3-tts-0.6b-q8_0.gguf.

Manual Model Conversion (Advanced)

Convert HuggingFace models to GGUF format:

# Download the model
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --local-dir models/Qwen3-TTS-12Hz-0.6B-Base

# Convert TTS model (transformer + speaker encoder + tokenizer)
python scripts/convert_tts_to_gguf.py \
    models/Qwen3-TTS-12Hz-0.6B-Base \
    models/qwen3-tts-0.6b-f16.gguf

# Convert vocoder (audio decoder)
python scripts/convert_tokenizer_to_gguf.py \
    models/Qwen3-TTS-12Hz-0.6B-Base \
    models/qwen3-tts-tokenizer-f16.gguf

Place both .gguf files in a models/ directory.

Usage

# Basic synthesis
./build/qwen3-tts-cli -m models -t "Hello, world!" -o hello.wav

# Voice cloning from reference audio
./build/qwen3-tts-cli -m models -t "Hello! How are you?" -r reference.wav -o cloned.wav

# Greedy decoding with max length
./build/qwen3-tts-cli -m models -t "Hello!" -r ref.wav -o out.wav \
    --temperature 0 --max-tokens 2048

CLI Options

Flag	Description	Default
`-m, --model <dir>`	Model directory containing GGUF files	(required)
`--tts-model <file>`	TTS model filename (overrides auto-detection)	(auto)
`--tokenizer-model <file>`	Tokenizer/vocoder filename	(auto)
`-t, --text <text>`	Text to synthesize	(required)
`-o, --output <file>`	Output WAV file path	`output.wav`
`-r, --reference <file>`	Reference audio for voice cloning	(none)
`--speaker-embedding <file>`	Use precomputed speaker embedding (.json/.bin)	(none)
`--dump-speaker-embedding <file>`	Save extracted embedding from `--reference`	(none)
`--speaker <name>`	Use a named preset voice (CustomVoice / VoiceDesign models)	(none)
`--list-speakers`	Print the preset voices baked into the loaded model and exit	—
`--ref-text <text>`	Reference transcript (combined with `-r`) → ICL voice cloning	(none)
`-i, --instruction, --instruct <text>`	Voice steering instruction (e.g. "Speak happily")	(none)
`--temperature <val>`	Sampling temperature (0 = greedy)	0.9
`--top-k <n>`	Top-k sampling (0 = disabled)	50
`--top-p <val>`	Top-p (nucleus) sampling cutoff	1.0
`--max-tokens <n>`	Maximum audio frames (codec tokens) to generate	2048
`--repetition-penalty <val>`	Repetition penalty on codebook-0 token generation	1.05
`--seed <n>`	RNG seed for reproducible output	(random)
`--no-f32-acc`	Disable f32 matmul accumulation (faster, less precise)	(off)
`-l, --language <lang>`	Language: en, ru, zh, ja, ko, de, fr, es	en
`-j, --threads <n>`	Number of compute threads	4

ICL Voice Cloning (Base models)

For Qwen3-TTS-12Hz-*-Base checkpoints, Qwen's documented cloning mode is in-context learning (ICL) — you provide a reference WAV and its transcript, and the pipeline encodes the audio through the Mimi codec, then threads both the codes and the transcript into the talker prefill. This is different from our legacy x-vector (ECAPA-TDNN) cloning, which only sees a 1024-dim speaker embedding. ICL typically produces tighter, better-matched clones on Base models at the cost of a short encode pass over the reference audio.

./build/qwen3-tts-cli -m models \
  --tts-model qwen3-tts-1.7b-f16.gguf \
  -r examples/readme_clone_input.wav \
  --ref-text "This is a sample recording of a human voice reading text into a microphone." \
  -t "Hello, this is an ICL voice clone." \
  -o out.wav

When --ref-text is omitted, -r falls back to the x-vector path. ICL cannot currently be combined with --instruction.

In the Python server, supply reference_audio_path + reference_text in the /v1/audio/speech request body to route a single request through ICL:

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello","reference_audio_path":"/path/to/ref.wav","reference_text":"the transcript."}' \
  -o out.wav

Preset Voices (CustomVoice / VoiceDesign)

The Qwen3-TTS-12Hz-1.7B-CustomVoice and -VoiceDesign repos ship a set of built-in preset voices (e.g. serena, ryan, dylan) that are reserved codec token IDs. No reference audio is required — just pick a preset by name.

# Convert the CustomVoice checkpoint (once)
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --local-dir models/hf/qwen3-tts-1.7b-customvoice
python scripts/convert_tts_to_gguf.py \
    --input models/hf/qwen3-tts-1.7b-customvoice \
    --output models/qwen3-tts-1.7b-customvoice-f16.gguf --type f16

# List preset voices baked into the model
./build/qwen3-tts-cli -m models \
  --tts-model qwen3-tts-1.7b-customvoice-f16.gguf --list-speakers

# Synthesize with a named preset
./build/qwen3-tts-cli -m models \
  --tts-model qwen3-tts-1.7b-customvoice-f16.gguf \
  --speaker serena -t "Hello from the preset voice!" -o out.wav

Base variants (-Base on HF) carry no preset list — use voice cloning via -r reference.wav instead. The Python server's /v1/audio/voices endpoint lists both local JSON embeddings (from voices/) and model-level presets; /v1/audio/speech accepts either kind in the voice field.

Speaker Embedding Workflow

Precompute a speaker embedding once (saves ~20s per synthesis):

# Extract and save embedding from reference audio
./build/qwen3-tts-cli -m models -r reference.wav --dump-speaker-embedding speaker.json \
  -t "Initial extraction." -o /dev/null

# Reuse embedding for fast voice-cloned synthesis (no encoder needed)
./build/qwen3-tts-cli -m models --speaker-embedding speaker.json \
  -t "Fast voice cloning with cached embedding." -o output.wav

Voice Steering Instructions

Control voice characteristics with the --instruction flag (works best with 1.7B VoiceDesign models):

./build/qwen3-tts-cli -m models --instruction "Speak in a cheerful tone" \
  -t "Good morning, everyone!" -o cheerful.wav

Quantization

Quantize models from F16 to smaller formats without Python:

./build/qwen3-tts-quantize --input models/qwen3-tts-0.6b-f16.gguf \
  --output models/qwen3-tts-0.6b-q8_0.gguf --type q8_0

Supported types: q8_0, q4_0, q4_1, q5_0, q5_1, q4_k, q5_k, q6_k. The CLI auto-detects Q8_0 models when present.

The quantizer is built as a C++17 target by default; no C++20 compiler mode is required. To build only the quantizer after building GGML:

cmake --build build --target qwen3-tts-quantize -j4

On Windows, run this from a Developer Command Prompt:

cmake --build build --target qwen3-tts-quantize --config Release

Backend Selection

At runtime, each component (tokenizer, encoder, transformer, decoder) independently selects a backend and logs it — for example, TTSTransformer backend: MTL0 or AudioTokenizerDecoder backend: Vulkan0.

QWEN3_TTS_BACKEND controls the selection:

Value	Effect
`auto` (default)	Scheduler picks the best available device in the order `IGPU` → `GPU` → `ACCEL` → `CPU`. Metal, Vulkan, and CUDA all surface through `auto` when the corresponding GGML flag was enabled at build time.
`cpu`	Forces CPU-only execution regardless of what was built in. Useful for reproducibility and for bisecting GPU issues without rebuilding.
`cuda`	Forces the CUDA backend. Only valid when GGML was built with `-DGGML_CUDA=ON`; otherwise falls back to CPU with a warning.

There is no separate metal or vulkan value — those are reached via auto. If you need to disable them, either rebuild without their GGML flag or set QWEN3_TTS_BACKEND=cpu.

"Hybrid GPU + CPU" is the default behavior of any GPU build: the GGML scheduler keeps most work on the GPU and automatically falls back to CPU for ops the GPU backend does not support.

Idle GPU RAM offload is intended for long-lived CUDA/Vulkan processes that want to release VRAM between requests. It copies weights to host RAM after the idle timeout fires, frees GPU weight buffers, and reloads from RAM on the next tensor-using request. It does not stream layers during inference and is disabled for Metal/CPU backends.

Validating idle GPU RAM offload

Use the opt-in validation test on a machine with the GGUF models and a CUDA or Vulkan GGML build. The test keeps a Qwen3TTS instance alive across two synthesis requests, waits for the production idle worker, requires transformer and decoder weights to move to RAM, then verifies the next synthesis reloads them successfully.

For CUDA:

cmake -S ggml -B ggml/build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_gpu_idle_offload_validation -j
QWEN3_TTS_RUN_GPU_OFFLOAD_VALIDATION=1 \
QWEN3_TTS_BACKEND=cuda \
QWEN3_TTS_MODEL_DIR=models \
ctest --test-dir build -R gpu_idle_offload_validation --output-on-failure

For Vulkan, build GGML with Vulkan and leave backend selection on auto unless your local build has a custom selector:

cmake -S ggml -B ggml/build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_gpu_idle_offload_validation -j
QWEN3_TTS_RUN_GPU_OFFLOAD_VALIDATION=1 \
QWEN3_TTS_MODEL_DIR=models \
ctest --test-dir build -R gpu_idle_offload_validation --output-on-failure

Expected validation signs:

Logs show CUDA or Vulkan backends for TTSTransformer and AudioTokenizerDecoder.
For Vulkan validation, those backend logs must specifically show Vulkan. If auto selects another GPU backend, disable that backend in the GGML build or rebuild with only Vulkan enabled for this check.
Logs show GPU idle RAM offload: transformer copied ... to host RAM and GPU idle RAM offload: decoder copied ... to host RAM.
GPU memory rises during load/synthesis, drops after the idle timeout, and rises again for the second synthesis. Use nvidia-smi, nvtop, radeontop, vendor tooling, or Vulkan memory tooling as appropriate.
The validation test exits successfully after the post-offload synthesis.

Negative checks:

QWEN3_TTS_LOW_MEM=1 QWEN3_TTS_GPU_OFFLOAD_IDLE_SECS=2 ./build/qwen3-tts-cli -m models -t "low memory check" -o /tmp/qwen-lowmem.wav
QWEN3_TTS_BACKEND=cpu QWEN3_TTS_GPU_OFFLOAD_IDLE_SECS=1 ./build/test_pipeline_offload_lifecycle
QWEN3_TTS_RUN_GPU_OFFLOAD_VALIDATION=1 QWEN3_TTS_BACKEND=cpu ./build/test_gpu_idle_offload_validation

Low-memory mode should log that idle GPU RAM offload is disabled and keep the existing staged load/unload behavior. The CPU lifecycle check should pass without production-offloading weights. The final GPU validation guard should exit non-zero with QWEN3_TTS_BACKEND=cpu cannot validate CUDA/Vulkan idle offload; that confirms the CUDA/Vulkan validation gate is not being satisfied by a CPU run. Metal paths should log as ineligible/disabled for idle GPU RAM offload.

Runtime environment variables

Variable	Default	Purpose
`QWEN3_TTS_BACKEND`	`auto`	Runtime backend override. See Backend Selection.
`QWEN3_TTS_DEVICE`	`0`	CUDA device index when `QWEN3_TTS_BACKEND=cuda`. Ignored otherwise.
`QWEN3_TTS_DECODER_GPU_MAX_FRAMES`	`34`	Max frames per CUDA vocoder chunk. Lower it if the GPU OOMs during decode.
`QWEN3_TTS_DECODER_GPU_CONTEXT_FRAMES`	`12`	Left-context frames per CUDA vocoder chunk.
`QWEN3_TTS_LOW_MEM`	unset	Set to `1` to enable low-memory pipeline mode (loads/unloads components in sequence instead of holding everything resident).
`QWEN3_TTS_GPU_OFFLOAD_IDLE_SECS`	`0`	CUDA/Vulkan only. Positive values enable idle RAM offload after N idle seconds. Disabled when `QWEN3_TTS_LOW_MEM=1`.
`QWEN3_TTS_USE_COREML`	`1` on macOS when model exists	Set to `0` to disable the CoreML code-predictor bridge without rebuilding.
`QWEN3_TTS_COREML_MODEL`	auto-detected	Absolute path override for a custom `.mlpackage` location (macOS only).

C API

The shared library libqwen3tts.{so,dylib,dll} is built automatically by CMake (the qwen3tts_shared target). It provides a C-linkage API for FFI integration from Python, Rust, Nim, Go, etc.

Structs

typedef struct Qwen3TtsParams {
    int32_t max_audio_tokens;    // default: 2048
    float   temperature;         // default: 0.9, 0=greedy
    float   top_p;               // default: 1.0
    int32_t top_k;               // default: 50, 0=disabled
    int32_t n_threads;           // default: 4
    float   repetition_penalty;  // default: 1.05
    int32_t language_id;         // 2050=en, 2058=ja, 2055=zh, etc.
} Qwen3TtsParams;

typedef struct Qwen3TtsAudio {
    const float* samples;  // PCM float32 mono, 24kHz
    int32_t n_samples;
    int32_t sample_rate;   // always 24000
} Qwen3TtsAudio;

Functions

Function	Description
`qwen3_tts_default_params(params)`	Fill `Qwen3TtsParams` with defaults
`qwen3_tts_create(model_dir, n_threads)`	Load models, return opaque handle (NULL on failure)
`qwen3_tts_is_loaded(tts)`	Check if models are loaded
`qwen3_tts_synthesize(tts, text, params)`	Text to audio
`qwen3_tts_synthesize_with_voice_file(tts, text, wav_path, params)`	Voice clone from WAV
`qwen3_tts_synthesize_with_voice_samples(tts, text, samples, n, params)`	Voice clone from float32
`qwen3_tts_extract_embedding_file(tts, wav_path, buf, max)`	Extract speaker embedding
`qwen3_tts_synthesize_with_embedding(tts, text, emb, size, params)`	Synthesize with cached embedding
`qwen3_tts_sample_rate(tts)`	Returns 24000
`qwen3_tts_free_audio(audio)`	Free `Qwen3TtsAudio*` (required)
`qwen3_tts_destroy(tts)`	Destroy engine
`qwen3_tts_get_error(tts)`	Get last error string

C Usage Example

#include "qwen3tts_c_api.h"

Qwen3Tts *tts = qwen3_tts_create("./models", 4);
if (!tts) { fprintf(stderr, "load failed\n"); return 1; }

Qwen3TtsParams params;
qwen3_tts_default_params(&params);

Qwen3TtsAudio *audio = qwen3_tts_synthesize(tts, "Hello, world!", &params);
if (audio) {
    // audio->samples is PCM float32, audio->n_samples samples at 24kHz
    // ... write to WAV file ...
    qwen3_tts_free_audio(audio);
}

qwen3_tts_destroy(tts);

Python ctypes

A full Python binding is provided in server/qwen3_tts_binding.py. See the Server README for an OpenAI-compatible HTTP API.

Memory Management

Every Qwen3TtsAudio* returned by synthesis functions must be freed with qwen3_tts_free_audio()
The engine handle must be destroyed with qwen3_tts_destroy()
The qwen3_tts_get_error() string is owned by the engine and valid until the next API call

Architecture

Text ──► [Tokenizer] ──► token IDs
                              │
Reference Audio ──► [Speaker Encoder] ──► speaker embedding
                              │
token IDs + speaker embedding ──► [TTS Transformer] ──► speech codes (N frames x 16 codebooks)
                              │
speech codes ──► [Vocoder] ──► audio waveform (24kHz)

Source Files

Module	Files	Description
`src/common/`	`gguf_loader`, `coreml_code_predictor`, `speaker_embedding_io`	Shared infrastructure
`src/tokenizer/`	`text_tokenizer`, `tokenizer_unicode`	BPE tokenizer with Unicode/GPT-2 regex
`src/transformer/`	`tts_transformer`	Qwen2 talker + code predictor (0.6B and 1.7B)
`src/encoder/`	`audio_tokenizer_encoder`	ECAPA-TDNN x-vector speaker encoder
`src/decoder/`	`audio_tokenizer_decoder`	WavTokenizer vocoder (codes to 24kHz audio)
`src/pipeline/`	`qwen3_tts`, `qwen3tts_c_api`	Pipeline orchestration + C API
`src/`	`main.cpp`, `qwen3_tts_quantize.cpp`	CLI entry point + quantization tool

TTS Transformer Details

The transformer generates speech codes in two stages per frame:

Talker (28 layers, 16 heads, 1024 hidden) produces a hidden state and codebook-0 logits.
Code Predictor (5 layers) autoregressively generates codebooks 1-15 from that hidden state.

The prefill embedding mirrors the Python pipeline exactly:

Positions 0-2: text-projected role tokens (<|im_start|>, assistant, \n)
Positions 3-6: TTS pad + codec embeddings (think tokens, language ID)
Position 7: TTS pad + speaker embedding
Position 8: TTS BOS + codec pad embedding
Position 9+: text-projected text tokens + codec BOS/embeddings

Testing

# End-to-end user smoketest (macOS): build → download → synthesize on
# Metal + CPU × F16 + Q8_0 × basic/clone/instruction + Python server
bash scripts/run_user_smoketest.sh            # see scripts/USER_SMOKETEST.md

# Run full test suite (component + E2E reference comparison)
bash scripts/run_all_tests.sh

# Individual component tests
./build/test_tokenizer --model models/qwen3-tts-0.6b-f16.gguf
./build/test_encoder --tokenizer models/qwen3-tts-0.6b-f16.gguf \
    --audio clone.wav --reference reference/ref_audio_embedding.bin
./build/test_transformer --model models/qwen3-tts-0.6b-f16.gguf \
    --ref-dir reference/
./build/test_decoder --tokenizer models/qwen3-tts-tokenizer-f16.gguf \
    --codes reference/speech_codes.bin --reference reference/decoded_audio.bin

# End-to-end Python vs C++ comparison
uv run python scripts/compare_e2e.py

# Generate deterministic reference data from Python
uv run python scripts/generate_deterministic_reference.py

Test Results (F16 model)

Prefill logits: cosine similarity = 0.99999994 with Python reference
Codebook 0 match rate: 81% (frame-level exact match)
Codebooks 1-4: ~84% match rate
Audio output is perceptually equivalent; low waveform correlation is expected due to autoregressive divergence from F16 precision

Profiling

Build with compile-time timing instrumentation (zero overhead when disabled):

cmake .. -DQWEN3_TTS_TIMING=ON
make -j4

Example output (92 frames, 7.3s audio):

=== Detailed Generation Timing (92 frames) ===

  Prefill:
      Compute:           175.9 ms

  Talker forward_step:
      Graph build:        21.8 ms   (0.2 ms/frame)
      Graph alloc:        34.1 ms   (0.4 ms/frame)
      Compute:          7717.4 ms   (83.9 ms/frame)

  Code predictor:
      Init/KV/embed:       7.7 ms   (0.1 ms/frame)
      Prefill (2tok):   1393.2 ms   (15.1 ms/frame)
      Steps (14):      19531.7 ms   (212.3 ms/frame)
      Compute:         20702.6 ms   (225.0 ms/frame)

  Total generate:      28915.0 ms   (3.2 frames/s)

The code predictor (15 sequential forward passes per frame) accounts for ~71% of generation time.

Troubleshooting

Linux: shared library link error (`recompile with -fPIC`)

If you see relocation R_X86_64_32S ... recompile with -fPIC when building the shared library, your CMakeLists.txt is missing set(CMAKE_POSITION_INDEPENDENT_CODE ON). This is fixed in current builds.

Vulkan: segfault when loading vocoder

Older versions had a bug where normalize_codebooks() wrote directly to GPU-mapped memory, causing a segfault on Vulkan backends. This is fixed in current builds. If you hit this on a fork, see issue #20 for the fix.

Windows: missing DLL errors at runtime

Copy all required DLLs next to qwen3-tts-cli.exe:

ggml.dll, ggml-base.dll, ggml-cpu.dll
Backend DLL: ggml-vulkan.dll or ggml-cuda.dll (if using GPU)

These are built under ggml\build\bin\Release\ (or ggml\build\src\ depending on generator).

CUDA: slower than expected

Ensure you are using the CUDA backend at runtime:

QWEN3_TTS_BACKEND=cuda QWEN3_TTS_DEVICE=0 ./build/qwen3-tts-cli -m models -t "test" -o out.wav

Tune chunked vocoder decode with QWEN3_TTS_DECODER_GPU_MAX_FRAMES (default 34) and QWEN3_TTS_DECODER_GPU_CONTEXT_FRAMES (default 12) if you see OOM or slow decode on your GPU.

Vocoder runs on CPU despite GPU backend shown

If logs show AudioTokenizerDecoder backend: Vulkan0 but vocoder decode is still slow (~17s), this is the backend/buffer mismatch bug. Fixed in current builds — update to the latest version.

Acknowledgments

Qwen3-TTS by Alibaba Qwen team
GGML tensor library by Georgi Gerganov
WavTokenizer vocoder architecture
predict-woo/qwen3-tts.cpp — original upstream project
gonwan/qwen3-tts.cpp — 1.7B model support, speaker embedding save/load, explicit model selection, seed parameter, Windows _fseeki64 fix, --no-f32-acc flag
Danmoreng/qwen3-tts.cpp — Unicode/UTF-8 GPT-2 regex tokenizer (derived from llama.cpp), voice steering instructions, C++ quantizer tool, decoder graph caching, ggml_pad_reflect_1d optimization
SiaoZeng — Vulkan vocoder backend-safe normalize_codebooks fix (issue #20)
kevinzhow/clawd20130 — CUDA runtime backend override and chunked GPU vocoder decode (PR #11)
DeryabinIvan — Windows build fixes (PR #2)
TheTastefulToastie — cleaner CMake approach for MSVC _USE_MATH_DEFINES / NOMINMAX

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
.github		.github
.sisyphus		.sisyphus
docs		docs
examples		examples
ggml @ 5cecdad		ggml @ 5cecdad
reference		reference
scripts		scripts
server		server
src		src
tests		tests
voices		voices
.gitignore		.gitignore
.gitmodules		.gitmodules
AGENTS.md		AGENTS.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
OPTIMIZATION.md		OPTIMIZATION.md
README.md		README.md
reference_text.txt		reference_text.txt

Folders and files

Latest commit

History

Repository files navigation

qwen3-tts.cpp

Features

Prerequisites

Quickstart (macOS)

Build

Clone and initialize

Build matrix

macOS (Metal + CoreML, default)

macOS (CPU only)

Linux (CPU only)

Linux (Vulkan + CPU)

Linux (CUDA + CPU)

Windows (MSVC)

Model Setup (Recommended)

Pre-built Models (Skip Python)

Manual Model Conversion (Advanced)

Usage

CLI Options

ICL Voice Cloning (Base models)

Preset Voices (CustomVoice / VoiceDesign)

Speaker Embedding Workflow

Voice Steering Instructions

Quantization

Backend Selection

Validating idle GPU RAM offload

Runtime environment variables

C API

Structs

Functions

C Usage Example

Python ctypes

Memory Management

Architecture

Source Files

TTS Transformer Details

Testing

Test Results (F16 model)

Profiling

Troubleshooting

Linux: shared library link error (recompile with -fPIC)

Vulkan: segfault when loading vocoder

Windows: missing DLL errors at runtime

CUDA: slower than expected

Vocoder runs on CPU despite GPU backend shown

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Linux: shared library link error (`recompile with -fPIC`)

Packages