SonarAI: Zero-Shot Voice Cloning & TTS Engine

1. Introduction & Project Objective

Modern media production, training material, and documentary narration frequently require high-quality voiceovers. However, relying on commercial cloud APIs raises data-privacy concerns, incurs recurring subscription costs, and limits offline use.

SonarAI was engineered to solve these constraints. It is a production-grade, 100% locally deployed text-to-speech (TTS) system. By running inference entirely on local hardware (supporting CUDA, Apple MPS, and CPU fallback), the system ensures that voice clips and proprietary scripts never leave the host machine while still delivering broadcast-ready audio.

2. Dual-Engine Architecture

A resilient TTS system cannot rely on a single model. Different use cases require different approaches to latency, consistency, and flexibility. SonarAI implements a unified routing layer over two distinct open-weight models:

Kokoro-82M (Apache 2.0): Serves as the primary engine for high-consistency, fixed-voice tasks. It provides 54 built-in professional voices across 9 languages. It is highly optimized and requires no reference audio.
XTTS v2 (Coqui CPML): Powers the system's dynamic capabilities via zero-shot voice cloning. By analyzing a brief 5–12 second reference audio clip, the model extracts a speaker embedding and conditions its autoregressive decoder to replicate the voice's timbre and prosody across 17 different languages, without any fine-tuning.
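A minimal sketch of how such a routing layer might decide between the two engines. The `SynthesisRequest` fields, the `choose_engine` function, and the rule itself (a reference clip implies cloning) are illustrative assumptions, not SonarAI's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynthesisRequest:
    """Hypothetical request shape used only for this illustration."""
    text: str
    language: str = "en"
    voice: Optional[str] = None            # name of a built-in voice
    reference_audio: Optional[str] = None  # path to a reference clip for cloning


def choose_engine(request: SynthesisRequest) -> str:
    """Return a label for the engine that should handle the request.

    Assumed rule: a reference clip means the caller wants a cloned voice,
    so the request goes to XTTS v2; everything else uses Kokoro.
    """
    if request.reference_audio is not None:
        return "xtts_v2"
    return "kokoro"
```

Keeping the decision in one small function means callers never need to know which model produced the audio.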

3. Audio Engineering & Broadcast Pipeline

Raw output from machine learning acoustic models is rarely suitable for immediate media integration. SonarAI implements a robust 7-step post-processing pipeline to ensure the audio is ready to drop into NLE software (like Premiere Pro or DaVinci Resolve) without manual equalization.

The EBU R128 Pipeline

Every generated waveform undergoes automated refinement. The principal stages are:

Resampling: Conversion from the engine's native 24 kHz to an anti-aliased 44.1 kHz stereo output.
Spectral Noise Reduction: Subtracts latent artifact hum introduced by the neural vocoder.
Frequency Sculpting: Applies an 80 Hz high-pass filter to remove low-frequency rumble, paired with a +2 dB presence shelf at 3 kHz to improve vocal intelligibility.
Loudness Normalization: Uses `pyloudnorm` to reach a target integrated loudness (e.g., -23 LUFS for broadcast, -14 LUFS for streaming), followed by a -1.0 dBTP true-peak limiter to prevent clipping.
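The arithmetic behind the loudness step can be sketched in a few lines. This is a simplified illustration using only the standard library: the measured loudness is assumed to come from a meter such as `pyloudnorm`, the function names are hypothetical, and the check uses the sample peak as a rough stand-in for a true-peak measurement, which would require oversampling.

```python
import math

CEILING_DB = -1.0  # peak ceiling quoted in the text


def gain_db(measured_lufs: float, target_lufs: float) -> float:
    """Gain in dB needed to move measured loudness to the target."""
    return target_lufs - measured_lufs


def apply_gain(samples, gain_in_db: float):
    """Scale samples by a gain expressed in dB."""
    factor = 10.0 ** (gain_in_db / 20.0)
    return [s * factor for s in samples]


def peak_dbfs(samples) -> float:
    """Sample peak in dBFS (an approximation, not a true-peak reading)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20.0 * math.log10(peak) if peak > 0 else float("-inf")


def exceeds_ceiling(samples, ceiling_db: float = CEILING_DB) -> bool:
    """True when the scaled audio would need limiting."""
    return peak_dbfs(samples) > ceiling_db
```

For example, audio measured at -20 LUFS needs -3 dB of gain to reach the -23 LUFS broadcast target, but +6 dB to reach the -14 LUFS streaming target, which is where a limiter becomes necessary.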

4. Engineering Insights & Problem Solving

Deploying large language and audio models locally uncovers several complex software engineering challenges. Key mitigations designed into SonarAI include:

A. Thread Safety in Autoregressive Models

The XTTS v2 model's internal attention-layer tensor buffers are not thread-safe, and concurrent synthesis calls result in memory corruption. To enable batch processing of long documentary scripts, a `threading.Lock` serializes inference calls so that concurrent requests queue up instead of crashing the FastAPI server.
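The locking pattern can be sketched as follows. The `SerializedSynthesizer` wrapper and the `synthesize_fn` callable are hypothetical names standing in for whatever actually invokes the model; only `threading.Lock` is taken from the description above.

```python
import threading


class SerializedSynthesizer:
    """Wraps a non-thread-safe synthesis callable behind a single lock."""

    def __init__(self, synthesize_fn):
        self._synthesize_fn = synthesize_fn  # assumed: callable taking text
        self._lock = threading.Lock()

    def synthesize(self, text):
        # Only one thread may run inference at a time; others block here
        # until the lock is released, even if the call raises.
        with self._lock:
            return self._synthesize_fn(text)
```

This matters in a FastAPI context because synchronous endpoints run in a thread pool, so two requests can otherwise reach the model at the same time. The trade-off is throughput: requests are processed strictly one after another.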

B. Cross-Lingual Text Segmentation

Neural TTS models have strict input-length limits (e.g., 250 characters), and standard English text splitters fail on non-Latin scripts. For example, the system pre-splits Hindi text on Danda characters (।, ॥) and Japanese text on sentence-ending punctuation (。！？) before passing chunks to the model, which eliminates the garbled or truncated audio that often affects multilingual generation.
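A minimal sketch of language-aware splitting followed by chunk packing. The function names, the language codes, and the greedy packing strategy are assumptions for illustration; only the delimiter characters and the 250-character figure come from the text above.

```python
import re

# Sentence-ending characters per language code (codes are assumed).
SENTENCE_END = {
    "hi": "।॥",
    "ja": "。！？",
}
DEFAULT_END = ".!?"


def split_sentences(text, language):
    """Split text after each sentence-ending character, keeping the character."""
    enders = SENTENCE_END.get(language, DEFAULT_END)
    pattern = "(?<=[" + re.escape(enders) + "])"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def chunk_text(text, language, max_chars=250):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is left as its own oversized
    chunk; a real system would need a further fallback for that case.
    """
    joiner = "" if language == "ja" else " "
    chunks, current = [], ""
    for sentence in split_sentences(text, language):
        candidate = current + joiner + sentence if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries first means a chunk never ends mid-sentence, which is what keeps prosody intact across chunk joins.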

C. Multi-Reference Speaker Averaging

A single 5-second audio clip often fails to capture a speaker's full emotional range. SonarAI's architecture was extended to allow multi-reference audio conditioning. By extracting speaker latents from 2 to 4 distinct clips and mathematically averaging them into a composite representation, the system dramatically improves clone stability and resemblance.
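The averaging step itself is simple element-wise arithmetic. The sketch below uses plain Python lists so it runs without dependencies; in practice the latents would be tensors produced by the model, and the `average_latents` name is hypothetical.

```python
def average_latents(latents):
    """Element-wise mean of several equally sized speaker latent vectors."""
    if not latents:
        raise ValueError("at least one latent vector is required")
    dim = len(latents[0])
    if any(len(vector) != dim for vector in latents):
        raise ValueError("all latent vectors must have the same length")
    count = len(latents)
    return [sum(vector[i] for vector in latents) / count for i in range(dim)]
```

Averaging smooths out clip-specific quirks such as room tone or a momentary change in pitch, which is one plausible reason the composite is more stable than any single reference.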

5. Narration Control via SSML

To provide directors with granular control over pacing, SonarAI integrates an SSML (Speech Synthesis Markup Language) preprocessor. Before text reaches the acoustic model, tags are parsed to dynamically instruct the engine:

<pause ms="500"/> inserts the specified number of milliseconds of silence into the output audio tensor.
<emphasis level="strong"> resamples the enclosed word to play at 0.85× speed, slowing it down so the term stands out naturally.
<say-as interpret-as="characters">NASA</say-as> forces the normalizer to read N-A-S-A rather than treating it as a single word.
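A simplified sketch of how a preprocessor might turn these three tags into synthesis instructions. The tuple layout, the regular expression, and the choice to count silence in samples at the 24 kHz native rate are assumptions for illustration; nested tags and other SSML elements are not handled.

```python
import re

SAMPLE_RATE = 24_000  # engine-native rate mentioned earlier (assumed here)

TAG_PATTERN = re.compile(
    r'<pause ms="(?P<ms>\d+)"\s*/>'
    r'|<emphasis level="strong">(?P<emph>.*?)</emphasis>'
    r'|<say-as interpret-as="characters">(?P<chars>.*?)</say-as>'
)


def parse_ssml(markup):
    """Return a list of (kind, value, speed) segments.

    ("text", string, speed) is text to synthesize at the given speed;
    ("silence", sample_count, None) is a gap to insert into the audio.
    """
    segments = []
    position = 0
    for match in TAG_PATTERN.finditer(markup):
        plain = markup[position:match.start()].strip()
        if plain:
            segments.append(("text", plain, 1.0))
        if match.group("ms") is not None:
            samples = int(match.group("ms")) * SAMPLE_RATE // 1000
            segments.append(("silence", samples, None))
        elif match.group("emph") is not None:
            segments.append(("text", match.group("emph"), 0.85))
        else:
            # Spell the token letter by letter, e.g. NASA -> N-A-S-A.
            segments.append(("text", "-".join(match.group("chars")), 1.0))
        position = match.end()
    tail = markup[position:].strip()
    if tail:
        segments.append(("text", tail, 1.0))
    return segments
```

Under these assumptions a 500 ms pause becomes 12,000 samples of silence, and each text segment carries its own speed so the synthesis loop can handle emphasis without re-parsing the markup.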

