Voice AI API Documentation

Speech Synthesis REST API + WebSocket API
Features and Capabilities

asticaVoice Text to Speech Engine

Play audio in real-time with the Streaming REST API, or connect with WebSockets for lower latency: time to first audio between 250 - 400ms.

If you have time to wait, use the low priority mode for reduced costs.

Access hundreds of natural sounding voices with a single API: generate natural, expressive speech from text for real‑time conversations, narration, games, agents, and more.

  • Expressive voices: rich character and emotion, ready for real-time use.
  • Programmable voices: fine‑grained control over tone and persona.
  • Neural voices: clean, clear, production‑ready narration in many accents.

You can choose from hundreds of voices across ages, genders, and nationalities, including both pre‑built characters and custom voices you create with voice cloning.

Try Online Code Samples

Text to Speech REST API

  • POST /api/tts — main text‑to‑speech endpoint (all engines).
  • POST /api/voice_list — list of public voices available for use.
  • POST /api/voice_clone — create a personal custom voice.
  • POST /api/voice_clone_list — list your private custom voice clones.
  • POST /api/tts/task — poll results for low‑priority queued jobs.
How It Works
  1. Choose a voice:
    • Expressive voices like "expressive_ava"
    • Programmable aliases like "prog_avery".
    • Neural voices like "neural_jennifer"
    • Custom voices like "clone_15"
  2. Call POST /api/tts with:
    • tkn: your API token.
    • text: the text to speak.
    • voice: the desired voice.
    • stream: true for audio streaming, or false for a single JSON response.
    • timestamps (optional): start and end times of each spoken word.
  3. Receive audio as:
    • A continuous HTTP audio stream (for low‑latency playback), or
    • A JSON payload with audio_b64 (Base64 WAV) and metadata.
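
For example, a minimal non‑streaming request can be made with a short script like the following (a sketch using the Python requests library; the voice name and output file name are only examples):

import base64
import requests

# Minimal non-streaming TTS request (sketch): choose a voice, call /api/tts,
# then decode the Base64 WAV from the JSON response.
resp = requests.post(
    "https://voice.astica.ai/api/tts",
    json={
        "tkn": "YOUR_API_TOKEN",
        "text": "Hello from asticaVoice.",
        "voice": "expressive_ava",  # example voice id; expressive, programmable, or neural
        "stream": False,
    },
    timeout=60,
)
data = resp.json()
with open("hello.wav", "wb") as f:
    f.write(base64.b64decode(data["audio_b64"]))
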
Introduction

The asticaVoice API allows developers to integrate natural-sounding voice outputs into their applications. Offering a wide selection of voices, seamless integration through a plug-and-play REST API or WebSocket access, and multilingual support, the asticaVoice API can power your application or next project.

1. Real-time Speech Generation:
The asticaVoice API allows you to generate realistic speech suitable for time-sensitive applications and real-time use. All voices are synthesized on demand, just in time, with time to first audio between 250 and 400 ms.

2. Diverse and Realistic Voices:
Browse a vast library of hundreds of unique voices spanning different age groups, genders, and nationalities. This enables developers to tailor the voice output and personality to their specific needs and the preferences of their users, for a more personalized and engaging experience.

3. Multilingual Support:
asticaVoice supports multiple languages with a high level of fluency. The ability to handle translated text to speech is instrumental in delivering seamless experiences for global audiences and adapting content to diverse linguistic demographics.

4. Naturally Unique Speech:
All voice output from the Expressive and Programmable voices is unique, with its own inflections and natural disfluencies. That means each recording sounds a little different, like a real person talking, allowing you to create high-quality, interactive, and engaging voice experiences.

Try Online Code Samples
Get Started with asticaVoice API:
asticaVoice TTS API FAQ
Discover Frequently Asked Questions
Speech Synthesis API - Common Questions

Is there a WebSocket API?

Yes. You can use the WebSockets API for lower latency and increased functionality for real-time integrations. There is no additional cost or requirement to use the WebSocket API. You can swap between the REST API and WebSocket API depending on your use case.


How do I get word‑level timestamps?

Per-word timestamps are supported only for expressive voices; programmable and neural voices do not support them. Timestamps are available when:

  • WebSockets API: stream = true OR stream = false
  • REST API: stream = false, and timestamps = true is set in the request body.

The response will include a meta.timestamps array containing the start and end time of each spoken word.
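
As a quick illustration, the request and the returned word timings might be handled like this (a sketch using the Python requests library; the voice name is only an example):

import requests

resp = requests.post(
    "https://voice.astica.ai/api/tts",
    json={
        "tkn": "YOUR_API_TOKEN",
        "text": "hello, how are you",
        "voice": "expressive_ava",  # timestamps require an expressive voice
        "stream": False,            # REST timestamps require non-streaming mode
        "timestamps": True,
    },
    timeout=60,
)
meta = resp.json()["meta"]

# Each entry contains the spoken text with its start/end time in seconds.
for word in meta["timestamps"]:
    print(f'{word["text"]}: {word["start_s"]:.2f}s - {word["stop_s"]:.2f}s')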


Are voice clones private?

Yes. All custom voice clones are private and will only be available to your account. The asticaVoice API allows you to manage your custom voice clones:

  • Create new custom voice clones.
  • List all existing custom voice clones.
  • Use them to generate speech via voice = "clone_n".

Other users cannot access your clones or their underlying audio/embeddings through the public API.


When should I use streaming vs. non‑streaming?
  • Use streaming (stream = true) when you want to start playback as soon as possible, e.g. live agents or interactive apps.
  • Use non‑streaming (stream = false) when you need a complete audio file (Base64 WAV in audio_b64), word timestamps, or easy integration with storage/CDNs.

What is the difference between voice types?

All three voice types are accessed via the same /api/tts endpoint; the difference is in how you select and control the voice:

  • Expressive Voices (Recommended)
    Rich character and emotion; best for agents, games, and storytelling. Many built‑in characters plus custom clones. Supports word‑level timestamps in non‑streaming mode and low‑priority queueing.
  • Programmable Voices
    Controlled by a prompt that describes persona and style. Great for assistants and dynamic character work where you want to adjust tone per request.
  • Neural Voices
    Clean, consistent narration voices across accents and genders. Ideal for tutorials, IVR, and long‑form reading where you want a straightforward sound.
Voice AI - REST API Endpoints
Generate AI Voice Using REST API
Streaming REST API
REST API: Voices, parameters, and responses

The Voice API exposes a single text‑to‑speech endpoint. The voice you pass determines which engine is used internally (expressive, programmable, or neural).

Endpoint
POST https://voice.astica.ai/api/tts
Content-Type: application/json
Request body
Field Type Required Description
tkn string yes Your API token. Can also be sent as an X-API-Key header when using the REST API.
text string yes The text to synthesize into speech.
voice string no The chosen voice identifier; its prefix determines the engine:
  • Expressive: "expressive_jeanne", "expressive_ava", "clone_1".
  • Programmable: "prog_avery" (see programmable section).
  • Neural: "neural_jennifer".
stream boolean no

Stream audio in real-time: defaults to false if not specified.

  • false: return JSON with audio_b64 (Base64 WAV) and metadata.
  • true: stream raw audio bytes over HTTP as they are generated.
Streaming format depends on engine (see below).
timestamps boolean no Set to true if you would like to receive word‑level timestamps: list of start and end times for each word spoken.

Only supported for expressive voices when stream = false within the REST API or when using stream = true within the WebSockets API.
prompt string no Style instructions for programmable voices. Prompts are ignored by expressive and neural voices. See the programmable section for examples and best practices for prompting programmable voices.
low_priority boolean no Submit the task as a discounted low‑priority request. Only valid for expressive voices with stream = false using the REST API.

Low‑priority tasks are processed in a queue; the response includes a task_id that you can poll for the completed audio file. See the low‑priority section for details.
Response: non‑streaming (stream = false)

When stream = false, all engines return a JSON payload of the form:

{
  "status": "success",
  "result": "ok",
  "engine": "expressive",
  "voice": "expressive_sarah",
  "cost_units": 45,
  "meta": {
    "sample_rate": 24000,
    "timestamps": [
		{
			"text": "hello,",
			"start_s": 0.24,
			"stop_s": 0.96
		},
		{
			"text": "how",
			"start_s": 0.96,
			"stop_s": 1.2
		},
		{
			"text": "are",
			"start_s": 1.2,
			"stop_s": 1.36
		},
		{
			"text": "you",
			"start_s": 1.36,
			"stop_s": 1.44
		}
]
  },
  "audio_b64": "BASE64_WAV_DATA",
  "audio_format": "wav"
}
  • cost_units — logical units (roughly words + punctuation).
  • meta.timestamps — present for expressive, non‑streaming requests when timestamps = true.
Response: streaming (stream = true)

When stream = true, the HTTP response is raw audio; there is no JSON wrapper:

  • Expressive (GPU)
    Content-Type: audio/pcm, mono 16‑bit PCM chunks.
    This is ideal for extremely responsive playback in clients that can decode PCM.
  • Neural and Programmable
    Content-Type: audio/wav, WAV bytes streamed as they are generated.

Use your HTTP client’s streaming APIs (e.g. ReadableStream in browsers, or response.iter_content in Python) to incrementally read and play the audio.
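
For example, a streaming response can be read incrementally in Python like this (a sketch using the requests library; the voice name and output file name are only examples — expressive voices stream raw PCM, while neural and programmable voices stream WAV bytes):

import requests

resp = requests.post(
    "https://voice.astica.ai/api/tts",
    json={
        "tkn": "YOUR_API_TOKEN",
        "text": "Streaming audio as it is generated.",
        "voice": "expressive_ava",  # expressive voices stream raw PCM (audio/pcm)
        "stream": True,
    },
    stream=True,   # tell requests not to buffer the whole body
    timeout=60,
)

# Collect chunks as they arrive; a real player would feed these to an
# audio device instead of writing them to a file.
with open("stream.pcm", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)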

Personalized Voices: Instruction-based TTS
Customize voices in real-time with custom prompts.
How to Use Programmable Voices

Programmable voices and the prompt parameter

Programmable voices are designed to be steered by a short natural‑language prompt. You specify what kind of speaker the voice should be (persona, mood, context), and the engine adapts pronunciation, pacing, and emphasis while still reading your text exactly.

Selecting a programmable voice

The prompt input is only usable with programmable voices. Programmable voices use dedicated aliases in the voice field, for example:

  • "prog_avery"
  • "prog_lena"
  • "prog_naomi"
  • "prog_morgan"

How prompt is used
  • Optional, but highly recommended for persona‑driven use cases.
  • Up to ~255 characters; concise descriptions work best.
  • Interpreted as instructions to the speaker. The spoken content always comes from text.
  • Safe to change on every request: you can reuse the same programmable voice with many different prompts.
Example request with custom voice prompt
{
  "tkn": "YOUR_API_TOKEN",
  "voice": "prog_avery",
  "text": "Welcome to the product tour. Let me walk you through the main features.",
  "prompt": "You are a friendly, modern product specialist on a video call, "
          + "speaking clearly and confidently, with upbeat but not exaggerated energy.",
  "stream": false
}

Programmable voices support REST API streaming and WebSockets API for interactive real-time experiences. The length of your prompt can impact the time to first audio for that request.

Prompt examples

The following are example inputs that might be used for the prompt field. You can experiment with these examples using the online Web UI to see how each voice reacts.

Character / role prompts
  • "You're a cowboy with a lazy drawl, western twang, frontier wisdom, friendly and calm, like a seasoned ranch hand around a campfire at sunset, partner."
  • "You are a seasoned news anchor on a national broadcast, speaking crisply, with neutral accent and professional, measured pacing."
  • "You are a warm kindergarten teacher reading a bedtime story, soft and soothing, smiling as you speak, pausing gently at the end of each sentence."
  • "You are a sarcastic but good‑natured tech reviewer on a YouTube channel, energetic and witty, with quick, expressive delivery."
Support and assistant prompts
  • "You are a calm, empathetic support agent on a phone line, speaking slowly and clearly, reassuring and non‑judgmental."
  • "You are a concise voice assistant on a smart speaker, neutral and direct, keeping responses short and to the point."
Narration & learning prompts
  • "You are an audiobook narrator bringing a non‑fiction book to life, engaged but not theatrical, with clear emphasis on key ideas."
  • "You are a college professor explaining concepts to first‑year students, patient and precise, occasionally pausing after important terms."
Style‑only prompts

You can also use short style‑only prompts when you don't need a full persona. These can be useful for handling subtle mood changes with a voice:

  • "Soft‑spoken, introspective tone with gentle pacing."
  • "High‑energy, excited delivery like a game show host."
  • "Understated, documentary‑style narration."
Guidelines for effective prompts
  • Describe the role: e.g. “teacher”, “coach”, “anchor”, “friend”.
  • Mood and energy: calm, excited, serious, playful, etc.
  • Context: on a podcast, phone call, game, bedtime story, etc.
  • Keep it focused: avoid very long multi‑paragraph prompts.
  • Do not repeat the main text inside prompt; it should describe how to speak, not what to say.
Create Private Custom Voices
The TTS voice engine that supports your voice.
Instant Voice Cloning with 3 Seconds of Audio.

Custom Voice Cloning API

Voice cloning allows you to create private custom voices from short audio samples. Once a clone is ready, you can use it like any other expressive voice by referencing clone_1, clone_2, and so on in the voice field of /api/tts.

Note that each clone ID auto increments and is specific to your user account. The very first custom voice that is created can be used by requesting voice "clone_1", and the next will be "clone_2".


Overview
  • The maximum number of custom voice clones that you can create is determined by your voice upgrade tier; you can upgrade at any time with prorated billing.
  • You must have a positive voice compute balance to create a clone.
  • Input audio should be a single speaker, clean, and between 5 and 7 seconds long for best results. The API accepts a minimum of 2 seconds and a maximum of 30 seconds of audio.
  • When you submit a custom voice cloning request, the new voice is typically available for speech generation within 3 seconds.
  • Clones are private to your account; other users cannot see or use them. You can permanently remove custom voice clones at any time.

Custom Voice Cloning Quota

The maximum number of custom voice clones that are available to you depends on your account upgrades. This is separate from the Pay as You Go Voice Compute and begins at $3.79/month.

  • You can upgrade or downgrade your quota at any time with prorated pricing.
  • You can remove existing custom voices and create more: the total number of clones you can process is based on monthly capacity × 6.2. If you have a quota of 1000 custom voices, you are permitted to process up to 6200 unique clones per month.

Create a clone

Use POST https://voice.astica.ai/api/voice_clone with multipart/form-data:

POST /api/voice_clone
Content-Type: multipart/form-data
Field Type Required Description
tkn text yes Your API token.
nickname text no A friendly display name for the clone (max ~64 characters).
audio file yes Audio sample of the voice.
(WAV, MP3, M4A, AAC, OGG)
Sample JSON response
{
  "clone_id": 1,          // per-user ID (1, 2, 3, ...)
  "status": "queued",
  "nickname": "My Voice",
  "duration_sec": 24.3,
  "clone_limit": 10,
  "clones_used": 1,
  "clones_remaining": 9
}

The clone begins in status 0 (pending). Once the request has finished processing, it transitions to status 1 (completed) and becomes available for producing speech via /api/tts.
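
A minimal upload might look like the following (a sketch using the Python requests library with a multipart upload; the sample file name and nickname are only examples):

import requests

# Create a custom voice clone from a short, clean, single-speaker sample.
with open("my_voice_sample.wav", "rb") as sample:
    resp = requests.post(
        "https://voice.astica.ai/api/voice_clone",
        data={"tkn": "YOUR_API_TOKEN", "nickname": "My Voice"},
        files={"audio": sample},
        timeout=120,
    )

clone = resp.json()
print(clone["clone_id"], clone["status"])   # e.g. 1, "queued"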

List your clones

You can list your clones via a POST request:

  • POST /api/voice_clone_list
Request body (for POST):
{
  "tkn": "YOUR_API_TOKEN"
}
Example response:
{
  "status": "success",
  "clones": [
    {
      "clone_id": 1,
      "nickname": "Brand Main",
      "status": 1,
      "error": "",
      "duration_sec": 24,
      "date_created": 1732300000,
      "date_updated": 1732300123
    }
  ]
}
  • status = 0 — pending.
  • status = 1 — ready.
  • status = 3 — failed (see error field).
Use a clone in TTS

Each clone has a per‑user clone_id starting at 1. To use your first clone, set the voice field in /api/tts to "clone_1":

{
  "tkn": "YOUR_API_TOKEN",
  "text": "This is my custom voice.",
  "voice": "clone_1",
  "stream": false
}
  • You can also use "clone-1"; both underscore and dash are accepted.
  • If the clone is not ready or does not belong to your user, TTS returns invalid_custom_voice.
Update a clone nickname

To rename a clone, call POST /api/voice_clone/{id} with a JSON body:

POST /api/voice_clone/123
Content-Type: application/json
{
  "tkn": "YOUR_API_TOKEN",
  "nickname": "New Friendly Name"
}
Response:
{
  "status": "success",
  "id": 123,
  "nickname": "New Friendly Name"
}
Delete a clone

To permanently remove a custom voice clone, use:
POST https://voice.astica.ai/api/voice_clone/123

Body or query string must include tkn. Example JSON response:

{
  "status": "success",
  "id": 123
}
  • The clone is marked as cancelled (status 2).
  • Its audio and embedding references are cleared.
  • Future TTS calls using "clone_{id}" for this user will fail once that clone is deleted.
  • Your used custom voice capacity is freed immediately, allowing you to train a new custom voice if you had previously reached your capacity limit.
Low Latency: Real-time Responses.
Designed for Interactive Integrations
Real-time Text to Speech WebSockets API

WebSocket API (advanced streaming)

The WebSocket API is an advanced option for streaming text‑to‑speech. For most applications you should use the REST API; switch to WebSockets only when you need continuous audio with minimal latency and tight alignment between playback and word timings.

The primary benefit of the WebSocket API is enabling real‑time alignment of audio and timestamps for compatible expressive voices — the spoken words and their timing information can be streamed together, rather than waiting for synthesis to finish as in the REST API.

Endpoint

Connect to the same host and port as the HTTPS API, using the unified WebSocket endpoint:

wss://voice.astica.ai/ws/api

Your application should send a TTS message immediately after connecting, as described below.

Client TTS request message

To synthesize speech over WebSockets, send a JSON message with type "tts" (or "speak"):

Request body
{
  "type": "tts",              // or "speak"
  "request_id": "optional-id",// echoed back in responses
  "tkn": "YOUR_API_TOKEN",
  "text": "Hello from WebSockets.",
  "voice": "expressive_steven", 
  "stream": true,             // recommended for WS
  "prompt": "optional style instructions for programmable voices",
    "timestamps": true,         // for expressive alignment (engine-dependent)
  "low_priority": false       // not supported over WS; must be false/omitted
}

If stream is omitted, WebSocket TTS defaults to streaming mode. The voice field is interpreted the same way as in the REST API and controls which engine is used internally.

Acknowledgement

When a TTS request is accepted, the server sends a quick acknowledgement before any audio:

Example response:
{
  "type": "tts_ack",
  "request_id": "YOUR_REQUEST_ID",
  "voice": "expressive_steven",
  "cost_units": 42,
  "stream": true
}

The acknowledgement confirms routing and estimated cost, and signals that audio will begin streaming shortly if there are no errors.

Streaming audio messages

Audio is delivered as a sequence of tts_audio messages followed by tts_audio_end. The format depends on the engine:

  • Expressive (GPU) and Neural: format: "pcm_s16le" (raw 16‑bit mono PCM, typically 24 kHz).
  • Programmable: format: "wav" (WAV bytes).
Example streaming chunk:
{
  "type": "tts_audio",
  "request_id": "YOUR_REQUEST_ID",
  "seq": 0,
  "chunk_b64": "BASE64_ENCODED_AUDIO_BYTES",
  "format": "pcm_s16le",      // or "wav"
  "sample_rate": 24000
}
Example end‑of‑audio marker:
{
  "type": "tts_audio_end",
  "request_id": "YOUR_REQUEST_ID"
}

Clients should reassemble and decode the chunk_b64 payloads in order of the seq field, feeding them directly into an audio buffer or streaming decoder for immediate playback.
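
A minimal client loop might look like this (a sketch assuming the third‑party Python websockets package; the request_id, voice, and token values are only examples, and the message fields follow the examples above):

import asyncio
import base64
import json

import websockets  # third-party package: pip install websockets

async def speak(text: str) -> bytes:
    pcm = bytearray()
    async with websockets.connect("wss://voice.astica.ai/ws/api") as ws:
        # Send the TTS request immediately after connecting.
        await ws.send(json.dumps({
            "type": "tts",
            "request_id": "demo-1",
            "tkn": "YOUR_API_TOKEN",
            "text": text,
            "voice": "expressive_steven",
            "stream": True,
        }))
        # Collect audio chunks until the stream ends or an error arrives.
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "tts_audio":
                pcm.extend(base64.b64decode(msg["chunk_b64"]))
            elif msg["type"] in ("tts_audio_end", "tts_complete", "tts_error"):
                break
    return bytes(pcm)

audio = asyncio.run(speak("Hello from WebSockets."))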

Completion and metadata

After audio has finished streaming (or after a non‑streaming TTS over WS), the server sends a tts_complete message with metadata and, for some modes, a full audio buffer.

Example completion message:
{
  "type": "tts_complete",
  "request_id": "YOUR_REQUEST_ID",
  "status": "success",
  "voice": "expressive_steven",
  "audio_b64": null,               // or Base64 WAV for non-streaming WS calls
  "audio_format": "pcm_s16le",     // or "wav"
  "meta": {
    "sample_rate": 24000,
    "audio_format": "pcm_s16le",
    "timestamps": [ /* word timestamps data */ ]
  }
}

For expressive voices with timestamp support, the meta.timestamps array can be used to align on‑screen text, captions, or highlights with audio playback in near real time.

Error messages

Any error during request validation or synthesis is reported as a tts_error message. Typical conditions include invalid tokens, exceeded concurrency, or upstream engine issues.

Example error message:
{
  "type": "tts_error",
  "request_id": "YOUR_REQUEST_ID",
  "code": "insufficient_balance" | "missing_text" | "tts_engine_failed" | "...",
  "error": "human-readable description",
  "http_status": 401        // present for some auth-related errors
}

In addition to standard API rate limits, the WebSocket API enforces a separate limit on concurrent TTS jobs per WebSocket connection. If you exceed this, you will receive tts_error with code "too_many_inflight_requests"; open multiple connections or queue requests client‑side if you need higher concurrency.

The concurrent request limit per WebSocket connection is set to 50% of your account's standard API rate limit.

When to choose WebSockets vs REST

  • Prefer REST for most integrations: simpler client code, easy non‑streaming responses, and built‑in support for timestamps on expressive non‑streaming calls.
  • Use WebSockets when you need:
    • Continuous, low‑latency audio streaming.
    • Fine‑grained synchronization between playback and word timings for on‑screen text or avatars.
    • Many small, rapid TTS exchanges on a single persistent connection.
Low Priority Mode: Reduced Costs
It pays to be patient:
Use Low Priority Mode for Lower Cost

Low‑priority Mode

Low‑priority mode lets you generate voice (TTS) at a discounted rate by placing requests into a background queue. This is ideal for batch jobs, large backlogs, or non‑interactive workloads where reducing cost is more important than response time.

How it works
  1. Submit your request with low_priority = true.
  2. You receive a task_id immediately and poll a separate endpoint for results.
  3. Periodically poll the endpoint to receive your generated audio.
  4. Benefit from significantly reduced costs.
Key properties
  • Low‑priority mode is only supported by expressive voices and custom voice clones.
  • Non‑streaming only: stream must be false (or omitted).
  • The default queue capacity allows up to 5000 requests per user account to be queued. Your low priority tasks will be processed in the order that they were received. If you require an increased limit please get in touch.
Enqueue a low‑priority job

Send a normal /api/tts request with low_priority = true and an expressive voice:

{
  "tkn": "YOUR_API_TOKEN",
  "text": "Generate this in the background.",
  "voice": "expressive_sarah",
  "stream": false,
  "low_priority": true
}

Example response:

{
  "status": "queued",
  "result": "low_priority",
  "task_id": "2b3e2f5e-0acf-4a2d-9b64-8d1a6b3a5f87",
  "voice": "expressive_sarah",
  "cost_units": 42
}
  • No audio is returned at this stage.
  • Use the task_id to poll for task status.
Poll for task completion

Use POST /api/tts/task with your API token and the returned task_id:

POST https://voice.astica.ai/api/tts/task
Content-Type: application/json
{
  "tkn": "YOUR_API_TOKEN",
  "task_id": "2b3e2f5e-0acf-4a2d-9b64-8d1a6b3a5f87"
}
Pending response
{
  "status": "pending",
  "task_status": 0
}
Success response

Once completed, you receive a success payload with the same audio fields as a normal expressive non‑streaming call:

{
  "status": "success",
  "task_status": 1,
  "result": {
    "audio_b64": "BASE64_WAV_DATA",
    "audio_format": "wav",
    "meta": {
      "sample_rate": 24000,
      "timestamps": [ /* if timestamps were enabled */ ]
    }
  }
}
Error response

If the task fails or is cancelled, you receive:

{
  "status": "error",
  "task_status": 2 | 3,
  "error": "tts_engine_failed" | "..."
}
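
Putting the two calls together, a simple enqueue‑and‑poll loop might look like this (a sketch using the Python requests library; the poll interval and output file name are arbitrary choices):

import base64
import time

import requests

API = "https://voice.astica.ai"
TKN = "YOUR_API_TOKEN"

# 1. Enqueue a discounted low-priority job (expressive voices only).
job = requests.post(f"{API}/api/tts", json={
    "tkn": TKN,
    "text": "Generate this in the background.",
    "voice": "expressive_sarah",
    "stream": False,
    "low_priority": True,
}, timeout=60).json()

# 2. Poll /api/tts/task until the task leaves the pending state.
while True:
    task = requests.post(f"{API}/api/tts/task", json={
        "tkn": TKN,
        "task_id": job["task_id"],
    }, timeout=60).json()
    if task["status"] != "pending":
        break
    time.sleep(5)   # arbitrary poll interval

# 3. Save the audio if the task completed successfully.
if task["status"] == "success":
    with open("queued.wav", "wb") as f:
        f.write(base64.b64decode(task["result"]["audio_b64"]))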
Capacity and queue limits
  • If the queue is full, /api/tts with low_priority = true returns 429 with an error such as "low_priority queue is full".
  • Before running a queued job, the system re‑checks your live voice balance. If your balance has fallen too low, queued tasks for that user are cancelled with status 2 (cancelled).
When to use low‑priority vs. normal TTS
  • Use normal TTS (without low_priority) for interactive applications, live agents, or any user‑facing experience where latency matters.
  • Use low‑priority for batch pre‑generation (e.g., audiobooks, bulk content, nightly jobs) where jobs can be processed opportunistically.

Note: low‑priority mode is not available over the WebSocket API and is not supported for neural or programmable voices.

Voice API - Troubleshooting
Common Errors and Responses Explained
Troubleshoot Your API Integration

What error format should I expect?

For REST calls that return JSON, errors generally have the shape:

{
  "status": "error",
  "error": "error_code_or_message"
}

HTTP status codes are used to differentiate classes of problems:

  • 400 — client error (missing token, missing text, invalid parameters).
  • 401 — invalid API token.
  • 402 — insufficient balance.
  • 403 — forbidden (e.g., admin‑only endpoints).
  • 404 — not found (task or clone not found).
  • 429 — too many requests or low‑priority queue full.
  • 500 — internal error or TTS engine failure.
  • 503 — temporarily unavailable

What does “insufficient balance” mean?

When your voice balance is too low to cover the projected cost of a request, /api/tts returns an error:

{
  "status": "error",
  "error": "insufficient balance"
}

You can also receive 402 responses from authentication checks if your underlying voice balance is zero or negative. In both cases you need to top up your account before making more calls.

Note that consumption prioritizes active upgrade quota over pay as you go voice compute.


How is billing calculated?

Each request returns:

  • cost_units — an integer based on the input text to be spoken.

Costs differ by engine and mode (expressive vs. neural vs. programmable, normal vs. low‑priority).


What are the rate limits?

The default requests per minute (RPM) limit is:

  • 60 requests per minute per user
Get in touch to increase your limit.

If you exceed your request limit, you will receive HTTP 429 with an error message asking you not to exceed the configured RPM. Implement client‑side throttling or queuing to stay within limits.
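
One simple client‑side approach is to retry with exponential backoff when a 429 is returned (a sketch; the retry count and delays are arbitrary choices, not values defined by the API):

import time

import requests

def tts_with_backoff(payload: dict, max_retries: int = 5) -> requests.Response:
    """POST to /api/tts, backing off when the RPM limit (HTTP 429) is hit."""
    delay = 1.0
    resp = None
    for attempt in range(max_retries):
        resp = requests.post("https://voice.astica.ai/api/tts", json=payload, timeout=60)
        if resp.status_code != 429:
            return resp
        time.sleep(delay)            # wait before retrying
        delay = min(delay * 2, 30.0) # exponential backoff, capped at 30s
    return resp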


Why do I get invalid_system_voice?

This error usually indicates:

  • You referenced a voice that does not exist (e.g. a typo in "expressive_..."), or used a voice engine prefix (expressive, neural, programmable) that is not valid for that particular voice.

Why do I get invalid_custom_voice?

This error usually indicates one of:

  • You used "clone_n" but that clone does not exist or is not owned by this token.
  • The clone is still pending and not yet in status 1 (completed).

Use the clone listing endpoint (/api/voice_clone_list) or voice list endpoint (/api/voice_list) to verify available voices, and ensure you are using the correct clone_id and spelling.

Voice API - Audio Specification
Understanding Audio Outputs
Synthetic Audio - Technical Specifications

What audio format does asticaVoice API return?

All voice types ultimately return linear PCM audio, 16‑bit, little‑endian, mono. The primary differences are the container (raw PCM vs. WAV) and how streaming is delivered:

  • Expressive Voices (GPU):
    • HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
    • HTTP streaming: raw PCM S16LE over audio/pcm (no container).
    • WebSocket streaming: base64 chunks with format: "pcm_s16le" (raw PCM).
  • Neural Voices (Azure):
    • HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
    • HTTP streaming: WAV over audio/wav (full WAV file streamed).
    • WebSocket streaming: base64 chunks with format: "pcm_s16le" (raw PCM).
  • Programmable Voices (OpenAI):
    • HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
    • HTTP streaming: WAV over audio/wav (full WAV file streamed).
    • WebSocket streaming: base64 chunks with format: "wav" (WAV file bytes).

In all cases the audio is suitable for direct playback in most media players and WebAudio libraries once properly decoded (base64 → bytes → PCM/WAV).


PCM & WAV format reference

This section summarizes the exact technical audio formats and how they are delivered from each voice type.

Context Format How it is delivered
/api/tts (JSON, all engines, stream=false) Container: WAV (RIFF/WAVE)

Encoding: PCM S16LE (signed 16‑bit, little‑endian)

Channels: Mono (1)

Sample rate: 24 kHz
JSON with audio_b64 (base64 of full WAV) and audio_format: "wav".
/api/tts streaming, Expressive engine Container: None (raw bytes)

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
HTTP response with Content-Type: audio/pcm and Transfer-Encoding: chunked.
Each chunk is an arbitrary slice of the PCM stream; concatenate in order.
/api/tts streaming, Neural & Programmable engines Container: WAV

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
HTTP response with Content-Type: audio/wav and Transfer-Encoding: chunked.
The first chunk starts at byte 0 of the WAV and includes the full header; later chunks continue the same file.
WebSocket /ws/api, streaming, Expressive Container: None (raw bytes)

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
JSON messages of type "tts_audio" with chunk_b64, format: "pcm_s16le", sample_rate: 24000.
Base64‑decode and concatenate chunks in seq order.
WebSocket /ws/api, streaming, Neural Container: None (raw bytes)

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
JSON messages of type "tts_audio" with chunk_b64, format: "pcm_s16le", sample_rate: 24000.
Base64‑decode and concatenate chunks in seq order, then treat as 16‑bit mono PCM at 24 kHz.
WebSocket /ws/api, streaming, Programmable Container: WAV

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
JSON messages of type "tts_audio" with chunk_b64, format: "wav", sample_rate: 24000.
Chunks together form a single WAV file (the first chunk includes the WAV header).
WebSocket /ws/api, non‑streaming (stream=false) Container: WAV (for neural & programmable)

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 24 kHz
Final "tts_complete" message contains audio_b64 (full WAV) and audio_format: "wav" for neural & programmable engines.
Expressive (GPU) currently supports only streaming mode (stream=true) over WebSockets.
User clone uploads (/api/voice_clone) after conversion Container: WAV

Encoding: PCM S16LE

Channels: Mono (1)

Sample rate: 16 kHz
You may upload various audio types (.wav, .mp3, .m4a, etc.).
System GPU voices (/api/admin/system_voice_clones, audio field) Container: WAV

Encoding: PCM S16LE (recommended)

Channels: Mono (1, recommended)

Sample rate: 16–24 kHz (recommended)
You must upload WAV. The API validates the file as RIFF/WAVE and enforces duration limits.
For best quality, use mono, 16‑bit PCM, 16–24 kHz.

How do streaming responses work over HTTP?

Expressive Voices, /api/tts with "stream": true:

  • Content-Type: audio/pcm, Transfer-Encoding: chunked.
  • No header and no framing beyond HTTP chunking.
  • Each chunk is raw PCM S16LE at 24 kHz mono.
  • To play back:
    1. Read each chunk of the HTTP body in order.
    2. Concatenate them into a single byte buffer.
    3. Interpret as 16‑bit signed little‑endian mono PCM at 24 kHz.

Neural & Programmable engines, /api/tts with "stream": true:

  • Content-Type: audio/wav, Transfer-Encoding: chunked.
  • The first chunk begins at byte 0 and contains the entire WAV header.
  • Later chunks are just continuation of the same WAV file (more PCM frames).
  • To play back:
    1. Concatenate all chunks into a single buffer.
    2. Decode as a normal PCM S16LE mono 24 kHz WAV file.
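
For example, raw PCM collected from an expressive stream can be wrapped in a WAV container using Python's standard wave module (a sketch; the input file name is hypothetical, e.g. bytes saved from a streaming call):

import wave

# Wrap raw PCM S16LE (mono, 24 kHz) from an expressive stream in a WAV container.
with open("stream.pcm", "rb") as f:
    pcm_bytes = f.read()

with wave.open("stream.wav", "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 16-bit samples = 2 bytes
    wav.setframerate(24000)   # 24 kHz
    wav.writeframes(pcm_bytes)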

How do streaming responses work over WebSockets?

For the WebSocket API (/ws/api), audio is always sent as JSON messages with base64‑encoded chunks.

Streaming responses:

{
  "type": "tts_audio",
  "request_id": "your_request_id",
  "seq": 0,
  "chunk_b64": "base64_audio_chunk",
  "format": "pcm_s16le" | "wav",
  "sample_rate": 24000
}
  • format: "pcm_s16le" — expressive (GPU) and neural (Azure) engines, raw 16‑bit PCM.
  • format: "wav" — programmable (OpenAI) engine, WAV file bytes.
  • seq is a monotonically increasing integer (≥ 0); use it to order chunks.

When the stream is finished, you will receive:

{
  "type": "tts_audio_end",
  "request_id": "your_request_id"
}

In non‑streaming WebSocket mode ("stream": false in your TTS request), there are no tts_audio chunks for neural & programmable engines. Instead you receive a single completion frame:

{
  "type": "tts_complete",
  "request_id": "your_request_id",
  "status": "success",
  "result": "ok",
  "voice": "…",
  "cost_units": 123,
  "audio_b64": "base64_wav_bytes",
  "audio_format": "wav",
  "meta": { "sample_rate": 24000, "audio_format": "wav", ... }
}

Expressive (GPU) over WebSockets is currently intended for streaming ("stream": true); non‑streaming mode is best used with neural or programmable engines.


What does pcm_s16le mean?

Many clients and libraries use the string "pcm_s16le" to describe the raw sample format used by the expressive and neural WebSocket streams.

  • pcm — uncompressed linear Pulse Code Modulation.
  • s16 — signed 16‑bit integer samples.
  • le — little‑endian byte order (low byte first).

In practical terms, each audio sample is a 16‑bit signed integer in the range [-32768, 32767], encoded as two bytes (little‑endian), at 24,000 samples per second, one channel.
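
As a small illustration, the raw bytes can be turned into integer samples with Python's standard array module (a sketch; the input file name is hypothetical, and native little-endian byte order is assumed, which matches most platforms):

import array

# Interpret pcm_s16le bytes as signed 16-bit integer samples.
with open("stream.pcm", "rb") as f:   # raw bytes saved from an expressive stream
    raw = f.read()

samples = array.array("h")            # 'h' = signed 16-bit integers
samples.frombytes(raw)                # assumes a little-endian platform

# 24,000 samples per second, one channel.
duration_s = len(samples) / 24000
peak = max(abs(s) for s in samples) if samples else 0
print(f"{duration_s:.2f}s of audio, peak amplitude {peak}")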


What format should I upload for voice cloning?

User clones via /api/voice_clone:

  • Submit multipart/form‑data with a file field named audio.
  • We accept typical audio formats: .wav, .mp3, .m4a, .aac, .ogg, or any audio/* MIME type.
  • We convert your upload to:
    • WAV container
    • PCM S16LE
    • Mono (1 channel)
    • 16 kHz sample rate
  • Duration requirements:
    • Minimum: VOICE_CLONE_MIN_SEC (typically 1 second).
    • Maximum: VOICE_CLONE_MAX_SEC (typically 35 seconds).

System GPU voices via /api/admin/system_voice_clones (audio field):

  • Must be WAV (PCM S16LE is strongly recommended).
  • We validate it as a RIFF/WAVE file and enforce the same duration limits.
  • For best results, use:
    • Mono (1 channel)
    • 16‑bit PCM
    • Sample rate 16–24 kHz

Providing clean, noise‑free voice clips that meet these specs will give the best performance across expressive, neural, and programmable engines.

astica ai Discover More AI

Experiment with different kinds of artificial intelligence. See, hear, and speak with astica.
