TTS Providers

Qwen3-TTS (local, GPU)

Generate voiceover narration locally on a CUDA GPU using Alibaba's Qwen3-TTS models.

QwenTtsProvider runs TTS locally via Alibaba's Qwen3-TTS family. It supports two modes:

  • Clone — synthesize text in the voice of a reference WAV/MP3 (voiceSample).
  • Design — synthesize text in a voice described by a prompt (voiceDescription). The provider first generates a reference WAV from the description, then clones it for every subtitle.

A Python sidecar (PyTorch + qwen-tts + flash-attn) is spawned once per pipeline run and held open across the whole subtitle batch.

Requirements

  • CUDA-capable GPU — ~4–8 GB VRAM depending on model.
  • Python 3 — invoked via the path you pass as pythonBin.
  • HF_TOKEN in the environment if the chosen weights are gated.

No npm peer dependency is required — the provider talks to the sidecar over stdin/stdout JSON.
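
The wire format itself is not documented on this page. As a rough sketch only, assuming newline-delimited JSON (one object per line) and hypothetical message shapes, the framing on the Node side could look like this:

```typescript
// Hypothetical message shapes: the real sidecar protocol is not documented here.
type SidecarRequest = { id: number; op: 'clone' | 'design'; text: string }
type SidecarResponse = { id: number; ok: boolean; path?: string; error?: string }

// Serialize one request as a single line of JSON (assumed NDJSON framing).
function encodeRequest(req: SidecarRequest): string {
  return JSON.stringify(req) + '\n'
}

// Accumulates stdout chunks and returns one parsed message per complete line.
class LineDecoder {
  private buffer = ''

  push(chunk: string): SidecarResponse[] {
    this.buffer += chunk
    const lines = this.buffer.split('\n')
    // The last element is an incomplete line (or ''); keep it for the next chunk.
    this.buffer = lines.pop() ?? ''
    return lines
      .filter((line) => line.trim() !== '')
      .map((line) => JSON.parse(line) as SidecarResponse)
  }
}
```

Buffering matters because a pipe can deliver a response split across several chunks; the decoder only parses a line once its terminating newline has arrived.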

Python environment setup

The sidecar's stack (PyTorch + flash-attn + Qwen) is ~5–8 GB on disk, so where you install it matters. pip caches downloads in ~/.cache/pip, but installs are copied into each venv's site-packages — so a per-project venv with PyTorch costs ~5–8 GB per project.

Shared venv reused across projects

python3 -m venv ~/.venvs/playwright-recast
~/.venvs/playwright-recast/bin/pip install -r node_modules/playwright-recast/dist/voiceover/providers/qwen-sidecar/requirements.txt

Then point pythonBin at that venv's interpreter:

QwenTtsProvider({
  mode: 'clone',
  voiceSample: './my-voice.wav',
  refText: 'Welcome! In this screencast we will walk through the key concepts.',
  pythonBin: `${process.env.HOME}/.venvs/playwright-recast/bin/python3`,
})

Absolute pythonBin works regardless of shell activation — useful for CI, cron, and IDE runners.
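
If you prefer to fail fast before the pipeline starts, a small preflight check can help. The sketch below is illustrative: both helper names are hypothetical and not part of playwright-recast. It builds the path with os.homedir(), which also works when HOME is unset, and confirms the interpreter can actually be launched.

```typescript
import { spawnSync } from 'node:child_process'
import { homedir } from 'node:os'
import { join } from 'node:path'

// Build the absolute interpreter path for a shared venv under ~/.venvs/<name>.
// The ~/.venvs layout follows the example above; the helper itself is hypothetical.
function sharedVenvPython(venvName: string): string {
  return join(homedir(), '.venvs', venvName, 'bin', 'python3')
}

// Return true only if the interpreter can be spawned and `--version` exits with 0.
function canLaunchPython(pythonBin: string): boolean {
  const result = spawnSync(pythonBin, ['--version'], { encoding: 'utf8' })
  return result.error === undefined && result.status === 0
}
```

A caller could run canLaunchPython(sharedVenvPython('playwright-recast')) and print an actionable message when it returns false, instead of waiting for the sidecar to fail.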

flash-attn build notes

flash-attn needs a CUDA toolchain at install time. If no precompiled wheel matches your CUDA/PyTorch combo, follow the build instructions at Dao-AILab/flash-attention.

Usage — clone mode

import { Recast } from 'playwright-recast'
import { QwenTtsProvider } from 'playwright-recast/providers/qwen'

await Recast
  .from('./traces')
  .parse()
  .subtitlesFromSrt('./narration.srt')
  .voiceover(QwenTtsProvider({
    mode: 'clone',
    voiceSample: './my-voice.wav',
    refText: 'Welcome! In this screencast we will walk through the key concepts.',
    language: 'English',
    cacheAudio: true,
  }))
  .render({ format: 'mp4' })
  .toFile('demo.mp4')

refText is the literal transcript of voiceSample. Qwen needs both to clone the voice correctly. The most reliable way to get an exact transcript is to record the sample yourself from a known script; otherwise, transcribe the sample with Whisper and check the result by ear.

Usage — design mode

.voiceover(QwenTtsProvider({
  mode: 'design',
  voiceDescription: 'A clear, steady male voice with a calm and even tone.',
  refText: 'Welcome! In this screencast we will walk through the key concepts.',
  language: 'English',
  cacheAudio: true,
  cacheVoiceDesign: true,           // reuse the generated reference WAV across runs
}))

In design mode refText is the line the model will speak into the reference WAV — it doesn't transcribe an existing file. Pick something close in tone and length to your actual narration.

Configuration options

Common

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mode | 'clone' \| 'design' | (required) | Which sub-API to use |
| refText | string | (required) | Reference text (transcript of voiceSample in clone mode; the line spoken into the generated reference WAV in design mode) |
| language | string | 'English' | Generation language; pass the Qwen language name ('English', 'German', 'Chinese', …) |
| cloneModel | string | 'Qwen/Qwen3-TTS-12Hz-0.6B-Base' | HuggingFace model ID for cloning |
| cacheDir | string | './.recast-cache/voice' | Where cached artifacts are written |
| cacheAudio | boolean | false | Cache per-segment MP3s |
| pythonBin | string | 'python3' | Python interpreter to launch the sidecar |
| device | string | 'cuda:0' | Torch device |
| dtype | 'bfloat16' \| 'float16' \| 'float32' | 'bfloat16' | Model precision |

Clone mode

| Option | Type | Description |
| --- | --- | --- |
| voiceSample | string | Path to a .wav / .mp3 of the voice you want to clone |

Design mode

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| voiceDescription | string | (required) | Natural-language description of the target voice |
| designModel | string | 'Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign' | HuggingFace model ID for voice design |
| cacheVoiceDesign | boolean | false | Cache the generated reference WAV across runs |
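
For orientation, the option tables above can be read as a discriminated union on mode. The following is a hypothetical type sketch derived from those tables, not the package's exported types; in particular, treating voiceSample as required in clone mode is an assumption. withDefaults is likewise an illustrative helper.

```typescript
// Hypothetical types derived from the option tables; the real exports may differ.
interface QwenCommonOptions {
  refText: string
  language?: string
  cloneModel?: string
  cacheDir?: string
  cacheAudio?: boolean
  pythonBin?: string
  device?: string
  dtype?: 'bfloat16' | 'float16' | 'float32'
}

interface QwenCloneOptions extends QwenCommonOptions {
  mode: 'clone'
  voiceSample: string // assumed required in clone mode
}

interface QwenDesignOptions extends QwenCommonOptions {
  mode: 'design'
  voiceDescription: string
  designModel?: string
  cacheVoiceDesign?: boolean
}

type QwenOptions = QwenCloneOptions | QwenDesignOptions

// Defaults copied from the "Common" table.
const COMMON_DEFAULTS: Required<Omit<QwenCommonOptions, 'refText'>> = {
  language: 'English',
  cloneModel: 'Qwen/Qwen3-TTS-12Hz-0.6B-Base',
  cacheDir: './.recast-cache/voice',
  cacheAudio: false,
  pythonBin: 'python3',
  device: 'cuda:0',
  dtype: 'bfloat16',
}

// Fill in the documented defaults; explicitly passed values win.
function withDefaults(opts: QwenOptions) {
  return { ...COMMON_DEFAULTS, ...opts }
}
```

With this shape, TypeScript rejects voiceDescription in clone mode and voiceSample in design mode at compile time.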

Caching

Caching is opt-in; both flags default to off:

| Flag | What it caches | Hash includes |
| --- | --- | --- |
| cacheAudio | Per-segment MP3s at cacheDir/audio/<hash>.mp3 | target text, refText, ref-audio fingerprint, language, model, dtype |
| cacheVoiceDesign (design only) | The generated reference WAV at cacheDir/design/<hash>.wav | description, refText, language, designModel, dtype |

cacheDir defaults to ./.recast-cache/voice. The cache grows unbounded — manage retention yourself.
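
One simple retention policy is to delete cached files that have not been modified for a given number of days. The sketch below is a hypothetical helper, not something the library ships; it only looks at regular files directly inside the directory you pass in.

```typescript
import { readdirSync, statSync, unlinkSync } from 'node:fs'
import { join } from 'node:path'

type CacheEntry = { name: string; mtimeMs: number }

// Pure selection step: names of entries last modified more than maxAgeDays before `now`.
function selectExpired(entries: CacheEntry[], maxAgeDays: number, now: number): string[] {
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000
  return entries.filter((e) => e.mtimeMs < cutoff).map((e) => e.name)
}

// Delete expired regular files directly inside `dir` and return their names.
function pruneCacheDir(dir: string, maxAgeDays: number, now: number = Date.now()): string[] {
  const entries = readdirSync(dir)
    .map((name) => ({ name, stats: statSync(join(dir, name)) }))
    .filter((e) => e.stats.isFile())
    .map((e) => ({ name: e.name, mtimeMs: e.stats.mtimeMs }))
  const expired = selectExpired(entries, maxAgeDays, now)
  for (const name of expired) unlinkSync(join(dir, name))
  return expired
}
```

For example, pruneCacheDir('./.recast-cache/voice/audio', 30) would remove per-segment MP3s older than 30 days, assuming the default cacheDir.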

In clone mode, the ref-audio fingerprint is a deterministic hash over the contents of voiceSample, so swapping the file (even keeping the same path) correctly invalidates the cache.
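
To make the invalidation behavior concrete, here is a minimal sketch of content-based cache keys. It assumes SHA-256 and a JSON-encoded field list; the provider's actual hash function and field encoding are not documented here and may differ.

```typescript
import { createHash } from 'node:crypto'

// Fingerprint of the reference audio: a hash over the file's bytes, so the
// path is irrelevant and any content change yields a new value.
function fingerprint(audio: Uint8Array): string {
  return createHash('sha256').update(audio).digest('hex')
}

// The fields the caching table lists for the cacheAudio hash.
interface AudioKeyFields {
  text: string
  refText: string
  refAudioFingerprint: string
  language: string
  model: string
  dtype: string
}

// Derive a key from the fields in a fixed order. Encoding them as a JSON array
// keeps field boundaries unambiguous (assumed scheme, for illustration only).
function audioCacheKey(f: AudioKeyFields): string {
  const ordered = [f.text, f.refText, f.refAudioFingerprint, f.language, f.model, f.dtype]
  return createHash('sha256').update(JSON.stringify(ordered)).digest('hex')
}
```

In practice the fingerprint input would be the bytes of voiceSample, for example from readFileSync. Because the key depends on those bytes rather than the path, replacing the file in place produces a different key and therefore a cache miss.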

Error handling

Sidecar failures surface as QwenSidecarError with a stage field:

| stage | Meaning |
| --- | --- |
| init | Missing Python deps, malformed request, or CUDA / model load failure |
| design | Voice design (design mode only) failed |
| clone | Per-segment synthesis failed |

import { QwenSidecarError } from 'playwright-recast/providers/qwen'

try {
  await pipeline.toFile('demo.mp4')
} catch (err) {
  if (err instanceof QwenSidecarError) {
    console.error(`Qwen failed at ${err.stage}:`, err.message)
    console.error(err.pythonTraceback)
  }
  throw err
}
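
The stage field also lends itself to targeted remediation hints. The sketch below uses a structural check instead of importing QwenSidecarError so that it runs on its own; the hint wording and the helper names are illustrative, not part of the library.

```typescript
type QwenStage = 'init' | 'design' | 'clone'

// Hints paraphrase the stage table; the wording is illustrative.
const STAGE_HINTS: Record<QwenStage, string> = {
  init: 'Check the Python dependencies, the request options, and that CUDA and the model weights load.',
  design: 'Voice design failed; review voiceDescription and designModel.',
  clone: 'Per-segment synthesis failed; review the segment text and the reference audio.',
}

// Structural check: true when `err` is an object whose stage is a known stage name.
function hasStage(err: unknown): err is { stage: QwenStage } {
  if (typeof err !== 'object' || err === null || !('stage' in err)) return false
  const stage = (err as { stage: unknown }).stage
  return typeof stage === 'string' && Object.prototype.hasOwnProperty.call(STAGE_HINTS, stage)
}

// Return a remediation hint for errors that carry a known stage, else undefined.
function hintFor(err: unknown): string | undefined {
  return hasStage(err) ? STAGE_HINTS[err.stage] : undefined
}
```

Inside the catch block, logging hintFor(err) next to err.message gives readers of CI logs a starting point without having to parse the Python traceback.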

Notes

  • The sidecar process is held open for the full subtitle batch — model load happens once, not per line.
  • Output is always MP3 at the Qwen sample rate (12 kHz native, re-encoded by the sidecar).
  • No CLI flag yet — wire it up programmatically.
