Voice AI API Documentation
asticaVoice Text to Speech Engine
Play audio in real-time with the Streaming REST API, or connect with WebSockets for lower latency: time to first audio between 250 - 400ms.
If you have time to wait, use the low priority mode for reduced costs.
Access hundreds of natural sounding voices with a single API: generate natural, expressive speech from text for real‑time conversations, narration, games, agents, and more.
- Expressive voices: rich character, emotion, and ready for real-time.
- Programmable voices: fine‑grained control over tone and persona.
- Neural voices: clean, clear, production‑ready narration in many accents.
You can choose from hundreds of voices across ages, genders, and nationalities, including both pre‑built characters and custom voices you create with voice cloning.
Text to Speech REST API
- POST /api/tts — main text‑to‑speech endpoint (all engines).
- POST /api/voice_list — list of public voices available for use.
- POST /api/voice_clone — create a personal custom voice.
- POST /api/voice_clone_list — list your private custom voice clones.
- POST /api/tts/task — poll results for low‑priority queued jobs.
1. Choose a voice:
   - Expressive voices like "expressive_ava"
   - Programmable aliases like "prog_avery"
   - Neural voices like "neural_jennifer"
   - Custom voices like "clone_15"
2. Call POST /api/tts with:
   - tkn: your API token.
   - text: the text to speak.
   - voice: the desired voice.
   - stream: true for audio streaming, or false for a single JSON response.
   - timestamps (optional): start and end times of each spoken word.
3. Receive audio as:
   - A continuous HTTP audio stream (for low‑latency playback), or
   - A JSON payload with audio_b64 (Base64 WAV) and metadata.
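For example, here is a minimal non‑streaming request in Python. This is a sketch using the third‑party requests package; the token, voice name, and output path are placeholders.

import base64
import requests

API_URL = "https://voice.astica.ai/api/tts"

payload = {
    "tkn": "YOUR_API_TOKEN",           # your API token
    "text": "Hello from asticaVoice.",
    "voice": "expressive_ava",         # any expressive_/prog_/neural_/clone_ voice
    "stream": False,                   # request a single JSON response
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
data = resp.json()

# The non-streaming response carries the full WAV as Base64 in audio_b64.
with open("output.wav", "wb") as f:
    f.write(base64.b64decode(data["audio_b64"]))

print("engine:", data.get("engine"), "cost_units:", data.get("cost_units"))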
The asticaVoice API allows developers to integrate natural-sounding voice output into their applications. With a wide selection of voices, seamless plug-and-play integration via the REST API or WebSocket access, and multilingual support, the asticaVoice API can power your application or next project.
1. Real-time Speech Generation:
The asticaVoice API allows you to generate realistic speech suitable for time-sensitive applications and real-time use.
All voices are synthesized on demand, just in time, with a time to first audio between 250 and 400 ms.
2. Diverse and Realistic Voices:
Browse a vast library containing hundreds of unique voices to choose from, comprised of different age groups, genders, and nationalities.
This enables developers to tailor the voice output and personality to suit their specific needs and the preferences of their users, for a more personalized and engaging user experience.
3. Multilingual Support:
asticaVoice is capable of supporting multiple languages with a high level of fluency.
The ability to handle translated text to speech is instrumental in supporting seamless experiences for global audiences and adapting content to cater to diverse linguistic demographics.
4. Naturally Unique Speech:
All voice output from the expressive and programmable voices is unique, with its own inflections and natural disfluencies.
That means each recording sounds a little different, like a real person talking, allowing you to create high-quality, interactive, and engaging voice experiences.
Is there a WebSocket API?
Yes. You can use the WebSocket API for lower latency and increased functionality in real-time integrations. There is no additional cost or requirement to use the WebSocket API, and you can switch between the REST API and WebSocket API depending on your use case.
How do I get word‑level timestamps?
Per-word timestamps are available with stream = true when using the WebSockets API.
With the REST API, timestamps are only available when stream = false. Note that only expressive voices support timestamps, not programmable or neural voices. Timestamps are available exclusively for expressive voices when:
- WebSockets API: stream = true OR stream = false
- REST API: stream = false, and timestamps = true is set in the request body.
The response will include a meta.timestamps array containing the start and end time of each spoken word.
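For example, a minimal Python sketch (assuming the third‑party requests package; the token and voice are placeholders) that requests timestamps and prints per‑word timings:

import requests

resp = requests.post(
    "https://voice.astica.ai/api/tts",
    json={
        "tkn": "YOUR_API_TOKEN",
        "text": "Hello, how are you",
        "voice": "expressive_ava",
        "stream": False,
        "timestamps": True,   # ask for per-word start/stop times
    },
    timeout=60,
)
resp.raise_for_status()

for word in resp.json()["meta"]["timestamps"]:
    # Each entry has the spoken text plus start and stop times in seconds.
    print(f'{word["start_s"]:6.2f}s - {word["stop_s"]:6.2f}s  {word["text"]}')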
Are voice clones private?
Yes. All custom voice clones are private and will only be available to your account. The asticaVoice API allows you to manage your custom voice clones:
- Create new custom voice clones.
- List all existing custom voice clones.
- Use them to generate speech via voice = "clone_n".
Other users cannot access your clones or their underlying audio/embeddings through the public API.
When should I use streaming vs. non‑streaming?
- Use streaming (stream = true) when you want to start playback as soon as possible, e.g. live agents or interactive apps.
- Use non‑streaming (stream = false) when you need a complete audio file (Base64 WAV in audio_b64), word timestamps, or easy integration with storage/CDNs.
What is the difference between voice types?
All three voice types are accessed via the same /api/tts endpoint; the difference is in how
you select and control the voice:
- Expressive Voices (recommended): rich character and emotion, best for agents, games, and storytelling. Many built‑in characters plus custom clones. Supports word‑level timestamps in non‑streaming mode and low‑priority queueing.
- Programmable Voices: controlled by a prompt that describes persona and style. Great for assistants and dynamic character work where you want to adjust tone per request.
- Neural Voices: clean, consistent narration voices across accents and genders. Ideal for tutorials, IVR, and long‑form reading where you want a straightforward sound.
The Voice API exposes a single text‑to‑speech endpoint. The voice you pass
determines which engine is used internally (expressive, programmable, or neural).
POST https://voice.astica.ai/api/tts
Content-Type: application/json
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| tkn | string | yes | Your API token. Can also be sent as X-API-Key when using the REST API. |
| text | string | yes | The text to synthesize into speech. |
| voice | string | no | Voice identifier that selects the voice and the engine used internally (expressive, programmable, neural, or a custom clone). |
| stream | boolean | no | Stream audio in real-time; see the streaming and non-streaming response sections below. |
| timestamps | boolean | no | Set to true to receive word‑level timestamps: a list of start and end times for each spoken word. Only supported for expressive voices when stream = false in the REST API, or when stream = true in the WebSockets API. |
| prompt | string | no | Style instructions for programmable voices. Prompts are ignored by expressive and neural voices. See the programmable section for examples and prompting best practices. |
| low_priority | boolean | no | Submit the task as a discounted low‑priority request. Only valid for expressive voices with stream = false using the REST API. Low-priority tasks are processed in a queue and you poll for the completed audio file; see the low‑priority section for details. |
Non‑streaming JSON response (stream = false)
When stream = false, all engines return a JSON payload of the form:
{
"status": "success",
"result": "ok",
"engine": "expressive",
"voice": "expressive_sarah",
"cost_units": 45,
"meta": {
"sample_rate": 24000,
"timestamps": [
{
"text": "hello,",
"start_s": 0.24,
"stop_s": 0.96
},
{
"text": "how",
"start_s": 0.96,
"stop_s": 1.2
},
{
"text": "are",
"start_s": 1.2,
"stop_s": 1.36
},
{
"text": "you",
"start_s": 1.36,
"stop_s": 1.44
}
]
},
"audio_b64": "BASE64_WAV_DATA",
"audio_format": "wav"
}
- cost_units — logical units (roughly words + punctuation).
- meta.timestamps — present for expressive, non‑streaming requests when timestamps = true.
Streaming response (stream = true)
When stream = true, the HTTP response is raw audio; there is no JSON wrapper:
- Expressive (GPU): Content-Type: audio/pcm, mono 16‑bit PCM chunks. This is ideal for extremely responsive playback in clients that can decode PCM.
- Neural and Programmable: Content-Type: audio/wav, WAV bytes streamed as they are generated.
Use your HTTP client’s streaming APIs (e.g. ReadableStream in browsers, or response.iter_content in Python) to incrementally read and play the audio.
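For example, here is a sketch of HTTP streaming in Python using the requests package (token and voice are placeholders; playback is represented by simply collecting the chunks):

import requests

with requests.post(
    "https://voice.astica.ai/api/tts",
    json={
        "tkn": "YOUR_API_TOKEN",
        "text": "Streaming example.",
        "voice": "expressive_ava",
        "stream": True,
    },
    stream=True,        # tell requests not to buffer the whole body
    timeout=60,
) as resp:
    resp.raise_for_status()
    content_type = resp.headers.get("Content-Type", "")
    chunks = []
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            chunks.append(chunk)   # feed each chunk to an audio buffer for live playback

audio_bytes = b"".join(chunks)
# audio/pcm -> raw 16-bit mono PCM at 24 kHz (wrap in a WAV header to save to disk)
# audio/wav -> already a complete WAV file
print(content_type, len(audio_bytes), "bytes received")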
Programmable voices and the prompt parameter
Programmable voices are designed to be steered by a short natural‑language
prompt. You specify what kind of speaker the voice should be
(persona, mood, context), and the engine adapts pronunciation, pacing, and emphasis
while still reading your text exactly.
The prompt input is only usable with programmable voices.
Programmable voices use dedicated aliases in the voice field, for example: "prog_avery", "prog_lena", "prog_naomi", "prog_morgan".
How prompt is used
- Optional, but highly recommended for persona‑driven use cases.
- Up to ~255 characters; concise descriptions work best.
- Interpreted as instructions to the speaker. The spoken content always comes from text.
- Safe to change on every request: you can reuse the same programmable voice with many different prompts.
{
"tkn": "YOUR_API_TOKEN",
"voice": "prog_avery",
"text": "Welcome to the product tour. Let me walk you through the main features.",
"prompt": "You are a friendly, modern product specialist on a video call, "
+ "speaking clearly and confidently, with upbeat but not exaggerated energy.",
"stream": false
}
Programmable voices support REST API streaming and WebSockets API for interactive real-time experiences. The length of your prompt can impact the time to first audio for that request.
Prompt examples
The following are example inputs that might be used for the prompt field.
You can experiment with these examples using the online Web UI to see how each voice reacts.
- "You're a cowboy with a lazy drawl, western twang, frontier wisdom, friendly and calm, like a seasoned ranch hand around a campfire at sunset, partner."
- "You are a seasoned news anchor on a national broadcast, speaking crisply, with neutral accent and professional, measured pacing."
- "You are a warm kindergarten teacher reading a bedtime story, soft and soothing, smiling as you speak, pausing gently at the end of each sentence."
- "You are a sarcastic but good‑natured tech reviewer on a YouTube channel, energetic and witty, with quick, expressive delivery."
- "You are a calm, empathetic support agent on a phone line, speaking slowly and clearly, reassuring and non‑judgmental."
- "You are a concise voice assistant on a smart speaker, neutral and direct, keeping responses short and to the point."
- "You are an audiobook narrator bringing a non‑fiction book to life, engaged but not theatrical, with clear emphasis on key ideas."
- "You are a college professor explaining concepts to first‑year students, patient and precise, occasionally pausing after important terms."
You can also use short style‑only prompts when you don't need a full persona. These can be useful for handling subtle mood changes with a voice:
- "Soft‑spoken, introspective tone with gentle pacing."
- "High‑energy, excited delivery like a game show host."
- "Understated, documentary‑style narration."
- Describe the role: e.g. “teacher”, “coach”, “anchor”, “friend”.
- Mood and energy: calm, excited, serious, playful, etc.
- Context: on a podcast, phone call, game, bedtime story, etc.
- Keep it focused: avoid very long multi‑paragraph prompts.
- Do not repeat the main text inside prompt; it should describe how to speak, not what to say.
Custom Voice Cloning API
Voice cloning allows you to create private custom voices from short audio samples.
Once a clone is ready, you can use it like any other expressive voice by referencing
clone_1, clone_2, and so on in the voice field of /api/tts.
Note that each clone ID auto-increments and is specific to your user account. The first custom voice you create can be used by requesting voice "clone_1", and the next will be "clone_2".
Overview
- The maximum number of custom voice clones that you can create is determined by your voice upgrade tier; you can upgrade at any time with prorated billing.
- You must have a positive voice compute balance to create a clone.
- Input audio should be a single speaker, clean, and between 5 and 7 seconds long for best results. The API accepts a minimum of 2 seconds and a maximum of 30 seconds of audio.
- When you submit a custom voice cloning request the new voice is typically available for speech generation within 3 seconds.
- Clones are private to your account; other users cannot see or use them. You can permanently remove custom voice clones at any time.
Custom Voice Cloning Quota
The maximum number of custom voice clones that are available to you depends on your account upgrades. This is separate from the Pay as You Go Voice Compute and begins at $3.79/month.
- You can upgrade or downgrade your quota at any time with pro-rated pricing.
- You can remove existing custom voices and create more: the total number of clones you can process is based on monthly capacity * 6.2. If you have a quota of 1,000 custom voices, you are permitted to process up to 6,200 unique clones per month.
Create a clone
Use POST https://voice.astica.ai/api/voice_clone with multipart/form-data:
POST /api/voice_clone
Content-Type: multipart/form-data
| Field | Type | Required | Description |
|---|---|---|---|
| tkn | text | yes | Your API token. |
| nickname | text | no | A friendly name for the clone (max ~64 characters). |
| audio | file | yes | Audio sample of the voice (WAV, MP3, M4A, AAC, OGG). |
{
"clone_id": 1, // per-user ID (1, 2, 3, ...)
"status": "queued",
"nickname": "My Voice",
"duration_sec": 24.3,
"clone_limit": 10,
"clones_used": 1,
"clones_remaining": 9
}
The clone begins in status 0 (pending). Once the request has finished processing,
it transitions to status 1 (completed) and becomes available for producing speech via the TTS endpoint.
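For example, a sketch of a clone upload in Python using the requests package (the token, nickname, and file path are placeholders):

import requests

with open("my_voice_sample.wav", "rb") as f:
    resp = requests.post(
        "https://voice.astica.ai/api/voice_clone",
        data={"tkn": "YOUR_API_TOKEN", "nickname": "My Voice"},   # form fields
        files={"audio": ("my_voice_sample.wav", f, "audio/wav")}, # multipart file field
        timeout=120,
    )
resp.raise_for_status()
clone = resp.json()
print("clone_id:", clone["clone_id"], "status:", clone["status"])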
You can list your clones via a POST request:
POST /api/voice_clone_list
{
"tkn": "YOUR_API_TOKEN"
}
Example response:
{
"status": "success",
"clones": [
{
"clone_id": 1,
"nickname": "Brand Main",
"status": 1,
"error": "",
"duration_sec": 24,
"date_created": 1732300000,
"date_updated": 1732300123
}
]
}
- status = 0 — pending.
- status = 1 — ready.
- status = 3 — failed (see the error field).
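A simple polling sketch in Python (requests package; the token and clone_id are placeholders) that waits for a clone to reach status 1:

import time
import requests

def wait_for_clone(tkn: str, clone_id: int, poll_s: float = 2.0, timeout_s: float = 120.0) -> dict:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.post(
            "https://voice.astica.ai/api/voice_clone_list",
            json={"tkn": tkn},
            timeout=30,
        )
        resp.raise_for_status()
        for clone in resp.json().get("clones", []):
            if clone["clone_id"] == clone_id:
                if clone["status"] == 1:          # ready
                    return clone
                if clone["status"] == 3:          # failed
                    raise RuntimeError(clone.get("error") or "clone failed")
        time.sleep(poll_s)                        # still pending; poll again
    raise TimeoutError("clone did not become ready in time")

ready = wait_for_clone("YOUR_API_TOKEN", clone_id=1)
print("ready:", ready["nickname"])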
Each clone has a per‑user clone_id starting at 1. To use your first clone,
set the voice field in /api/tts to "clone_1":
{
"tkn": "YOUR_API_TOKEN",
"text": "This is my custom voice.",
"voice": "clone_1",
"stream": false
}
- You can also use "clone-1"; both underscore and dash are accepted.
- If the clone is not ready or does not belong to your user, TTS returns invalid_custom_voice.
To rename a clone, call POST /api/voice_clone/{id} with a JSON body:
POST /api/voice_clone/123
Content-Type: application/json
{
"tkn": "YOUR_API_TOKEN",
"nickname": "New Friendly Name"
}
Response:
{
"status": "success",
"id": 123,
"nickname": "New Friendly Name"
}
Delete a clone
To permanently remove a custom voice clone, use:
POST https://voice.astica.ai/api/voice_clone/123
Body or query string must include tkn. Example JSON response:
{
"status": "success",
"id": 123
}
When a clone is deleted:
- The clone is marked as cancelled (status 2).
- Its audio and embedding references are cleared.
- Future TTS calls using "clone_{id}" for this user will fail once that clone is deleted.
- Your used custom voice capacity is reduced immediately, allowing you to train a new custom voice if you had previously reached your capacity limit.
WebSocket API (advanced streaming)
The WebSocket API is an advanced option for streaming text‑to‑speech. For most applications you should use the REST API; switch to WebSockets only when you need continuous audio with minimal latency and tight alignment between playback and word timings.
The primary benefit of the WebSocket API is enabling real‑time alignment of audio and timestamps for compatible expressive voices — the spoken words and their timing information can be streamed together, rather than waiting for synthesis to finish as in the REST API.
Endpoint
Connect to the same host and port as the HTTPS API, using the unified WebSocket endpoint:
wss://voice.astica.ai/ws/api
Your application should send a TTS message immediately after connecting, as described below.
Client TTS request message
To synthesize speech over WebSockets, send a JSON message with type
"tts" (or "speak"):
{
"type": "tts", // or "speak"
"request_id": "optional-id",// echoed back in responses
"tkn": "YOUR_API_TOKEN",
"text": "Hello from WebSockets.",
"voice": "expressive_steven",
"stream": true, // recommended for WS
"prompt": "optional style instructions for programmable voices",
"timestamps": true, // for expressive alignment (engine-dependent)
"low_priority": false // not supported over WS; must be false/omitted
}
If stream is omitted, WebSocket TTS defaults to streaming mode. The
voice field is interpreted the same way as in the REST API and controls
which engine is used internally.
When a TTS request is accepted, the server sends a quick acknowledgement before any audio:
Example response:
{
"type": "tts_ack",
"request_id": "YOUR_REQUEST_ID",
"voice": "expressive_steven",
"cost_units": 42,
"stream": true
}
The acknowledgement confirms routing and estimated cost, and signals that audio will begin streaming shortly if there are no errors.
Streaming audio messages
Audio is delivered as a sequence of tts_audio messages followed by
tts_audio_end. The format depends on the engine:
- Expressive (GPU) and Neural: format: "pcm_s16le" (raw 16‑bit mono PCM, typically 24 kHz).
- Programmable: format: "wav" (WAV bytes).
{
"type": "tts_audio",
"request_id": "YOUR_REQUEST_ID",
"seq": 0,
"chunk_b64": "BASE64_ENCODED_AUDIO_BYTES",
"format": "pcm_s16le", // or "wav"
"sample_rate": 24000
}
Example end‑of‑audio marker:
{
"type": "tts_audio_end",
"request_id": "YOUR_REQUEST_ID"
}
Clients should reassemble and decode the chunk_b64 payloads in order of
the seq field, feeding them directly into an audio buffer or streaming
decoder for immediate playback.
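For example, a minimal WebSocket client sketch in Python using the third‑party websockets package (the token, voice, and request_id are placeholders; playback is represented by joining the decoded chunks):

import asyncio
import base64
import json

import websockets

async def speak(text: str) -> bytes:
    chunks = {}
    async with websockets.connect("wss://voice.astica.ai/ws/api") as ws:
        await ws.send(json.dumps({
            "type": "tts",
            "request_id": "demo-1",
            "tkn": "YOUR_API_TOKEN",
            "text": text,
            "voice": "expressive_steven",
            "stream": True,
        }))
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "tts_audio":
                # collect base64 chunks keyed by seq so they can be ordered
                chunks[msg["seq"]] = base64.b64decode(msg["chunk_b64"])
            elif msg["type"] == "tts_audio_end":
                break
            elif msg["type"] == "tts_error":
                raise RuntimeError(msg.get("error"))
    # Reassemble in seq order (raw PCM for expressive, WAV bytes for programmable).
    return b"".join(chunks[i] for i in sorted(chunks))

audio = asyncio.run(speak("Hello from WebSockets."))
print(len(audio), "bytes of audio")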
After audio has finished streaming (or after a non‑streaming TTS over WS), the
server sends a tts_complete message with metadata and, for some modes,
a full audio buffer.
{
"type": "tts_complete",
"request_id": "YOUR_REQUEST_ID",
"status": "success",
"voice": "expressive_steven",
"audio_b64": null, // or Base64 WAV for non-streaming WS calls
"audio_format": "pcm_s16le", // or "wav"
"meta": {
"sample_rate": 24000,
"audio_format": "pcm_s16le",
"timestamps": [ /* word timestamps data */ ]
}
}
For expressive voices with timestamp support, the meta.timestamps array
can be used to align on‑screen text, captions, or highlights with audio playback
in near real time.
Any error during request validation or synthesis is reported as a tts_error
message. Typical conditions include invalid tokens, exceeded concurrency, or
upstream engine issues.
{
"type": "tts_error",
"request_id": "YOUR_REQUEST_ID",
"code": "insufficient_balance" | "missing_text" | "tts_engine_failed" | "...",
"error": "human-readable description",
"http_status": 401 // present for some auth-related errors
}
In addition to standard API rate limits, the WebSocket API enforces a separate limit on concurrent
TTS jobs per WebSocket connection. If you exceed this, you will receive
tts_error with code "too_many_inflight_requests"; open multiple
connections or queue requests client‑side if you need higher concurrency.
The concurrent request limit per WebSocket connection is set to 50% of your account's standard API rate limit.
When to choose WebSockets vs REST
- Prefer REST for most integrations: simpler client code, easy non‑streaming responses, and built‑in support for timestamps on expressive non‑streaming calls.
- Use WebSockets when you need:
  - Continuous, low‑latency audio streaming.
  - Fine‑grained synchronization between playback and word timings for on‑screen text or avatars.
  - Many small, rapid TTS exchanges on a single persistent connection.
Low‑priority Mode
Low‑priority mode lets you generate voice (TTS) at a discounted rate by placing requests into a background queue. This is ideal for batch jobs, large backlogs, or non‑interactive workloads where reducing cost is more important than response time.
Key properties
- Submit your request with low_priority = true.
- You receive a task_id immediately and poll a separate endpoint for results.
- Periodically poll the endpoint to receive your generated audio.
- Benefit from significantly reduced costs.
- Low-priority mode is only supported by expressive voices and custom voice clones.
- Non‑streaming only: stream must be false (or omitted).
- The default queue capacity allows up to 5,000 requests per user account to be queued. Your low-priority tasks are processed in the order they were received. If you require an increased limit, please get in touch.
Send a normal /api/tts request with low_priority = true and an expressive voice:
{
"tkn": "YOUR_API_TOKEN",
"text": "Generate this in the background.",
"voice": "expressive_sarah",
"stream": false,
"low_priority": true
}
Example response:
{
"status": "queued",
"result": "low_priority",
"task_id": "2b3e2f5e-0acf-4a2d-9b64-8d1a6b3a5f87",
"voice": "expressive_sarah",
"cost_units": 42
}
- No audio is returned at this stage.
- Use the task_id to poll for task status.
Use POST /api/tts/task with your API token and the returned task_id:
POST https://voice.astica.ai/api/tts/task
Content-Type: application/json
{
"tkn": "YOUR_API_TOKEN",
"task_id": "2b3e2f5e-0acf-4a2d-9b64-8d1a6b3a5f87"
}
Pending response
{
"status": "pending",
"task_status": 0
}
Success response
Once completed, you receive a success payload with the same audio fields as a normal expressive non‑streaming call:
{
"status": "success",
"task_status": 1,
"result": {
"audio_b64": "BASE64_WAV_DATA",
"audio_format": "wav",
"meta": {
"sample_rate": 24000,
"timestamps": [ /* if timestamps were enabled */ ]
}
}
}
Error response
If the task fails or is cancelled, you receive:
{
"status": "error",
"task_status": 2 | 3,
"error": "tts_engine_failed" | "..."
}
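A polling sketch in Python (requests package; the token and text are placeholders) that submits a low‑priority job and waits for the result:

import base64
import time

import requests

BASE = "https://voice.astica.ai"

# Submit the job with low_priority = true and a non-streaming expressive voice.
queued = requests.post(f"{BASE}/api/tts", json={
    "tkn": "YOUR_API_TOKEN",
    "text": "Generate this in the background.",
    "voice": "expressive_sarah",
    "stream": False,
    "low_priority": True,
}, timeout=60).json()

task_id = queued["task_id"]

# Poll /api/tts/task until the task succeeds or fails.
while True:
    status = requests.post(f"{BASE}/api/tts/task", json={
        "tkn": "YOUR_API_TOKEN",
        "task_id": task_id,
    }, timeout=30).json()
    if status["status"] == "success":
        with open("low_priority.wav", "wb") as f:
            f.write(base64.b64decode(status["result"]["audio_b64"]))
        break
    if status["status"] == "error":
        raise RuntimeError(status.get("error"))
    time.sleep(10)   # still pending; poll again after a delay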
Capacity and queue limits
- If the queue is full, /api/tts with low_priority = true returns 429 with an error such as "low_priority queue is full".
- Before running a queued job, the system re‑checks your live voice balance. If your balance has fallen too low, queued tasks for that user are cancelled with status 2 (cancelled).
- Use normal TTS (without low_priority) for interactive applications, live agents, or any user‑facing experience where latency matters.
- Use low‑priority for batch pre‑generation (e.g., audiobooks, bulk content, nightly jobs) where jobs can be processed opportunistically.
Note: low‑priority mode is not available over the WebSocket API and is not supported for neural or programmable voices.
What error format should I expect?
For REST calls that return JSON, errors generally have the shape:
{
"status": "error",
"error": "error_code_or_message"
}
HTTP status codes are used to differentiate classes of problems:
- 400 — client error (missing token, missing text, invalid parameters).
- 401 — invalid API token.
- 402 — insufficient balance.
- 403 — forbidden (e.g., admin‑only endpoints).
- 404 — not found (task or clone not found).
- 429 — too many requests or low‑priority queue full.
- 500 — internal error or TTS engine failure.
- 503 — temporarily unavailable.
What does “insufficient balance” mean?
When your voice balance is too low to cover the projected cost of a request,
/api/tts returns an error:
{
"status": "error",
"error": "insufficient balance"
}
You can also receive 402 responses from authentication checks if your
underlying voice balance is zero or negative. In both cases you need to top up your
account before making more calls.
How is billing calculated?
Each request returns:
- cost_units — an integer based on the input text to be spoken.
Costs differ by engine and mode (expressive vs. neural vs. programmable, normal vs. low‑priority).
What are the rate limits?
The default requests per minute (RPM) limit is:
- 60 requests per minute per user
If you exceed your request limit, you will receive HTTP 429 with an error message asking you
not to exceed the configured RPM. Implement client‑side throttling or queuing to
stay within limits.
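A simple client‑side throttling sketch in Python (requests package; the pacing and retry values are illustrative, not part of the API):

import time

import requests

MIN_INTERVAL_S = 60.0 / 60   # 60 requests per minute -> at most one request per second
_last_call = 0.0

def tts(payload: dict) -> requests.Response:
    """Send a TTS request while pacing calls to stay under the RPM limit."""
    global _last_call
    wait = MIN_INTERVAL_S - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)                       # pace requests to respect the RPM limit
    _last_call = time.time()
    resp = requests.post("https://voice.astica.ai/api/tts", json=payload, timeout=60)
    if resp.status_code == 429:                # throttled anyway: back off and retry once
        time.sleep(5)
        resp = requests.post("https://voice.astica.ai/api/tts", json=payload, timeout=60)
    return resp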
Why do I get invalid_system_voice?
This error usually indicates:
- You referenced a voice that does not exist (e.g. a typo in "expressive_..."), or
- You used an invalid voice engine prefix (expressive, neural, programmable) for that particular voice.
Why do I get invalid_custom_voice?
This error usually indicates one of:
- You used "clone_n" but that clone does not exist or is not owned by this token.
- The clone is still pending and not yet in status 1 (completed).
Use the clone listing endpoint (/api/voice_clone_list) or voice list endpoint
(/api/voice_list) to verify available voices, and ensure you are using
the correct clone_id and spelling.
What audio format does asticaVoice API return?
All voice types ultimately return linear PCM audio, 16‑bit, little‑endian, mono. The primary differences are the container (raw PCM vs. WAV) and how streaming is delivered:
- Expressive Voices (GPU):
  - HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
  - HTTP streaming: raw PCM S16LE over audio/pcm (no container).
  - WebSocket streaming: base64 chunks with format: "pcm_s16le" (raw PCM).
- Neural Voices (Azure):
  - HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
  - HTTP streaming: WAV over audio/wav (full WAV file streamed).
  - WebSocket streaming: base64 chunks with format: "pcm_s16le" (raw PCM).
- Programmable Voices (OpenAI):
  - HTTP non‑streaming: WAV (PCM S16LE, 24 kHz, mono) in audio_b64.
  - HTTP streaming: WAV over audio/wav (full WAV file streamed).
  - WebSocket streaming: base64 chunks with format: "wav" (WAV file bytes).
In all cases the audio is suitable for direct playback in most media players and WebAudio libraries once properly decoded (base64 → bytes → PCM/WAV).
PCM & WAV format reference
This section summarizes the exact technical audio formats and how they are delivered from each voice type.
| Context | Format | How it is delivered |
|---|---|---|
| /api/tts (JSON, all engines, stream=false) | Container: WAV (RIFF/WAVE); Encoding: PCM S16LE (signed 16‑bit, little‑endian); Channels: mono (1); Sample rate: 24 kHz | JSON with audio_b64 (base64 of the full WAV) and audio_format: "wav". |
| /api/tts streaming, Expressive engine | Container: none (raw bytes); Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | HTTP response with Content-Type: audio/pcm and Transfer-Encoding: chunked. Each chunk is an arbitrary slice of the PCM stream; concatenate in order. |
| /api/tts streaming, Neural & Programmable engines | Container: WAV; Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | HTTP response with Content-Type: audio/wav and Transfer-Encoding: chunked. The first chunk starts at byte 0 of the WAV and includes the full header; later chunks continue the same file. |
| WebSocket /ws/api, streaming, Expressive | Container: none (raw bytes); Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | JSON messages of type "tts_audio" with chunk_b64, format: "pcm_s16le", sample_rate: 24000. Base64‑decode and concatenate chunks in seq order. |
| WebSocket /ws/api, streaming, Neural | Container: none (raw bytes); Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | JSON messages of type "tts_audio" with chunk_b64, format: "pcm_s16le", sample_rate: 24000. Base64‑decode and concatenate chunks in seq order, then treat as 16‑bit mono PCM at 24 kHz. |
| WebSocket /ws/api, streaming, Programmable | Container: WAV; Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | JSON messages of type "tts_audio" with chunk_b64, format: "wav", sample_rate: 24000. Chunks together form a single WAV file (the first chunk includes the WAV header). |
| WebSocket /ws/api, non‑streaming (stream=false) | Container: WAV (for neural & programmable); Encoding: PCM S16LE; Channels: mono (1); Sample rate: 24 kHz | Final "tts_complete" message contains audio_b64 (full WAV) and audio_format: "wav" for neural & programmable engines. Expressive (GPU) currently supports streaming mode (stream=true) over WebSockets. |
| User clone uploads (/api/voice_clone) after conversion | Container: WAV; Encoding: PCM S16LE; Channels: mono (1); Sample rate: 16 kHz | You may upload various audio types (.wav, .mp3, .m4a, etc.); uploads are converted to this format. |
| System GPU voices (/api/admin/system_voice_clones, audio field) | Container: WAV; Encoding: PCM S16LE (recommended); Channels: mono (1, recommended); Sample rate: 16–24 kHz (recommended) | You must upload WAV. The file is validated as RIFF/WAVE and duration limits are enforced. For best quality, use mono, 16‑bit PCM, 16–24 kHz. |
How do streaming responses work over HTTP?
Expressive Voices, /api/tts with "stream": true:
- Content-Type: audio/pcm, Transfer-Encoding: chunked.
- No header and no framing beyond HTTP chunking.
- Each chunk is raw PCM S16LE at 24 kHz mono.
- To play back:
  1. Read each chunk of the HTTP body in order.
  2. Concatenate them into a single byte buffer.
  3. Interpret as 16‑bit signed little‑endian mono PCM at 24 kHz.

Neural & Programmable engines, /api/tts with "stream": true:
- Content-Type: audio/wav, Transfer-Encoding: chunked.
- The first chunk begins at byte 0 and contains the entire WAV header.
- Later chunks are just a continuation of the same WAV file (more PCM frames).
- To play back:
  1. Concatenate all chunks into a single buffer.
  2. Decode as a normal PCM S16LE mono 24 kHz WAV file.
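For example, a small Python sketch using the standard wave module to wrap streamed raw PCM bytes (from an expressive stream) into a playable WAV file:

import wave

def pcm_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Wrap raw PCM S16LE mono audio in a WAV container so it plays anywhere."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples -> 2 bytes each
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# pcm_to_wav(audio_bytes, "streamed.wav")   # audio_bytes = concatenated audio/pcm chunks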
How do streaming responses work over WebSockets?
For the WebSocket API (/ws/api), audio is always sent as JSON messages with base64‑encoded chunks.
Streaming responses:
{
"type": "tts_audio",
"request_id": "your_request_id",
"seq": 0,
"chunk_b64": "base64_audio_chunk",
"format": "pcm_s16le" | "wav",
"sample_rate": 24000
}
format: "pcm_s16le"— expressive (GPU) and neural (Azure) engines, raw 16‑bit PCM.format: "wav"— programmable (OpenAI) engine, WAV file bytes.seqis a monotonically increasing integer (≥ 0); use it to order chunks.
When the stream is finished, you will receive:
{
"type": "tts_audio_end",
"request_id": "your_request_id"
}
In non‑streaming WebSocket mode ("stream": false in your TTS request), there are no tts_audio chunks for neural & programmable engines. Instead you receive a single completion frame:
{
"type": "tts_complete",
"request_id": "your_request_id",
"status": "success",
"result": "ok",
"voice": "…",
"cost_units": 123,
"audio_b64": "base64_wav_bytes",
"audio_format": "wav",
"meta": { "sample_rate": 24000, "audio_format": "wav", ... }
}
Expressive (GPU) over WebSockets is currently intended for streaming ("stream": true); non‑streaming mode is best used with neural or programmable engines.
What does pcm_s16le mean?
Many clients and libraries use the string "pcm_s16le" to describe the raw sample format used by the expressive and neural WebSocket streams.
- pcm — uncompressed linear Pulse Code Modulation.
- s16 — signed 16‑bit integer samples.
- le — little‑endian byte order (low byte first).
In practical terms, each audio sample is a 16‑bit signed integer in the range [-32768, 32767],
encoded as two bytes (little‑endian), at 24,000 samples per second, one channel.
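A small Python sketch using only the standard library to interpret pcm_s16le bytes as signed 16‑bit samples:

import array
import sys

def decode_pcm_s16le(pcm_bytes: bytes) -> array.array:
    """Return the samples as signed 16-bit integers in [-32768, 32767]."""
    samples = array.array("h")         # signed 16-bit integers
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":         # bytes on the wire are little-endian
        samples.byteswap()
    return samples                     # 24,000 samples per second, one channel

# peak = max(abs(s) for s in decode_pcm_s16le(chunk))   # e.g. simple level metering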
What format should I upload for voice cloning?
User clones via /api/voice_clone:
- Submit multipart/form‑data with a file field named audio.
- We accept typical audio formats: .wav, .mp3, .m4a, .aac, .ogg, or any audio/* MIME type.
- We convert your upload to:
- WAV container
- PCM S16LE
- Mono (1 channel)
- 16 kHz sample rate
- Duration requirements:
  - Minimum: VOICE_CLONE_MIN_SEC (typically 1 second).
  - Maximum: VOICE_CLONE_MAX_SEC (typically 35 seconds).
System GPU voices via /api/admin/system_voice_clones (audio field):
- Must be WAV (PCM S16LE is strongly recommended).
- We validate it as a RIFF/WAVE file and enforce the same duration limits.
- For best results, use:
- Mono (1 channel)
- 16‑bit PCM
- Sample rate 16–24 kHz
Providing clean, noise‑free voice clips that meet these specs will give the best performance across expressive, neural, and programmable engines.
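A small Python sketch (standard wave module; the client‑side checks are illustrative) that verifies a WAV sample roughly matches these recommendations before upload:

import wave

def check_clone_sample(path: str) -> float:
    """Print warnings if a WAV clip is not mono or falls outside the recommended length."""
    with wave.open(path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
        if wav.getnchannels() != 1:
            print("warning: clip is not mono; a single clean speaker works best")
        if not (5.0 <= duration_s <= 7.0):
            print(f"note: {duration_s:.1f}s sample; 5-7 seconds is the recommended range")
        return duration_s

# check_clone_sample("my_voice_sample.wav")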
Discover More AI
Experiment with different kinds of artificial intelligence. See, hear, and speak with astica.