Text to speech (TTS) turns the model's reply into audio. It is the most visible part of the stack: customers hear it, form an instant opinion about it, and do not forget. That makes TTS the part of the pipeline teams tune first and most heavily, often while the rest of the pipeline goes under-tuned.
Four axes that matter
- Naturalness — does it sound like a person, or like a robot reading?
- Latency to first byte — how fast does the first word come back after generation starts?
- Language coverage — does it sound good in the language and accent your customer actually speaks?
- Cost — TTS adds up fast at real call volume; the provider you pick matters.
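Latency to first byte is the easiest of these axes to measure directly. The sketch below is one way to do it, assuming the provider's response can be consumed as an iterable of byte chunks. The names `time_to_first_byte` and `fake_tts_stream` are hypothetical, and the fake stream only stands in for a real provider call.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def time_to_first_byte(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, Iterator[bytes]]:
    """Return (seconds until the first non-empty chunk, iterator over all chunks).

    If the stream ends without any audio, the elapsed time covers the whole stream.
    """
    start = clock()
    iterator = iter(chunks)
    buffered: List[bytes] = []
    for chunk in iterator:
        buffered.append(chunk)
        if chunk:  # first chunk that actually carries audio
            break
    elapsed = clock() - start

    def replay() -> Iterator[bytes]:
        # Hand back what was already consumed, then the rest of the stream.
        yield from buffered
        yield from iterator

    return elapsed, replay()


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(delay_s)  # simulated synthesis delay before the first audio
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    ttfb, audio = time_to_first_byte(fake_tts_stream())
    print(f"first audio after {ttfb * 1000:.0f} ms, {len(b''.join(audio))} bytes total")
```

The timer starts before the first chunk is requested, so any work the provider does before emitting audio is counted. The injectable `clock` is there only to make the measurement testable.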
What we wire up
Vocily AI separates TTS from the rest of the pipeline so teams can pick the voice that fits the use case, without redesigning the rest of the agent. Today the platform supports four providers, each with a different sweet spot.
- Sarvam (Bulbul) — Indian-language voices including Hindi, Hinglish auto, and regional languages.
- Smallest (Lightning) — low-latency Indian-language playback when responsiveness matters more than expressiveness.
- Cartesia (Sonic) — English voices tuned for low latency.
- ElevenLabs (Turbo) — expressive English voices when the brand wants more character.
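The list above amounts to a small routing table keyed on language and on whether latency or expressiveness matters more. A minimal sketch of that idea follows. It is not Vocily AI's actual configuration: `VoiceChoice`, `choose_voice`, the language tags (including `hi-en` for Hinglish), and the contents of `INDIAN_LANGUAGES` are all illustrative assumptions, and the set is not a statement of any provider's real coverage.

```python
from dataclasses import dataclass
from typing import Literal

Priority = Literal["latency", "expressiveness"]


@dataclass(frozen=True)
class VoiceChoice:
    provider: str
    model: str


# Illustrative tags only; a real deployment would list what it has verified.
INDIAN_LANGUAGES = frozenset({"hi", "hi-en", "ta", "te", "bn", "mr"})


def choose_voice(language: str, priority: Priority) -> VoiceChoice:
    """Map a language tag and a priority to one of the four providers."""
    lang = language.lower()
    if lang in INDIAN_LANGUAGES:
        if priority == "latency":
            return VoiceChoice("Smallest", "Lightning")
        return VoiceChoice("Sarvam", "Bulbul")
    if lang.startswith("en"):
        if priority == "latency":
            return VoiceChoice("Cartesia", "Sonic")
        return VoiceChoice("ElevenLabs", "Turbo")
    raise ValueError(f"no voice configured for language {language!r}")


if __name__ == "__main__":
    print(choose_voice("hi", "latency"))
    print(choose_voice("en-US", "expressiveness"))
```

Keeping the choice in one function like this means the rest of the agent only ever sees a `VoiceChoice`, so swapping providers does not touch the rest of the pipeline.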
Naturalness is not just the model
Even the best TTS sounds wrong if it mispronounces a brand name, breaks a long number in the wrong place, or stresses the wrong syllable. Most of this is solved by feeding the model cleaner text: adding SSML hints, normalising numbers ahead of time, and treating names as protected tokens. The model can only read the text you hand it.
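A rough sketch of that kind of pre-processing is shown here. It makes several assumptions: long numbers (five or more digits, an arbitrary threshold) are read digit by digit, protected names come from a caller-supplied pronunciation map, and the optional SSML output uses the standard `<sub alias="...">` element. Whether a given provider accepts SSML at all is something to check per provider. The function name `prepare_for_tts` and the brand "Kredo" are hypothetical.

```python
import re
from html import escape
from typing import List, Mapping

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
LONG_NUMBER = re.compile(r"\d{5,}")  # assumed threshold for "long"


def spell_digits(match: "re.Match[str]") -> str:
    """Read a long number one digit at a time, e.g. order or account numbers."""
    return " ".join(DIGIT_WORDS[int(d)] for d in match.group())


def prepare_for_tts(text: str, pronunciations: Mapping[str, str],
                    use_ssml: bool = False) -> str:
    """Normalise numbers while leaving protected names untouched."""
    if pronunciations:
        # Longest names first so a longer name wins over its own prefix.
        names = sorted(pronunciations, key=len, reverse=True)
        pattern = re.compile("(" + "|".join(re.escape(n) for n in names) + ")")
        pieces = pattern.split(text)  # odd indices hold the matched names
    else:
        pieces = [text]

    out: List[str] = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            spoken = pronunciations[piece]
            if use_ssml:
                out.append(f'<sub alias="{escape(spoken, quote=True)}">{escape(piece)}</sub>')
            else:
                out.append(spoken)
        else:
            normalised = LONG_NUMBER.sub(spell_digits, piece)
            out.append(escape(normalised) if use_ssml else normalised)
    return "".join(out)


if __name__ == "__main__":
    names = {"Kredo": "KRAY-doh"}  # hypothetical brand and respelling
    print(prepare_for_tts("Kredo order 482915 is ready", names))
    print(prepare_for_tts("Kredo order 482915 is ready", names, use_ssml=True))
```

With `use_ssml=True` the result is a fragment; a caller would still wrap it in a `<speak>` root element before sending it to a provider that expects a full SSML document.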