Qwen3-TTS with MLX — Local Voice Guide

1. Where it fits

Qwen3-TTS through MLX-Audio is a practical local text-to-speech path for Apple Silicon. It can run reference-audio voice cloning, voice-design prompts, and preset-style voices without sending text or reference audio to a hosted API.

The workflow that holds up is not “send a huge Markdown file to the model.” It is:

Markdown or text → normalize → sentence chunks → Qwen3-TTS → WAV chunks → inspect → repair → stitch

Need	Best mode
Clone from an authorized short reference clip	Qwen3 Base model with `--ref_audio` and `--ref_text`
Design a voice from a description	Qwen3 VoiceDesign model with `--instruct`
Use bundled named speakers/styles	Qwen3 CustomVoice model or preset wrapper
Read long Markdown files	Chunk, render, inspect, repair, and stitch

Important: Qwen3 “CustomVoice” preset models are not the same thing as arbitrary voice cloning. For cloning from your own reference clip, use a Base model with reference audio and its exact transcript.

2. Reference and consent rules

Voice references should be boring, clean, and authorized.

Use

A voice you own, recorded yourself, licensed, or have explicit permission to use.
5–15 seconds of dry spoken audio.
One speaker.
A 24 kHz mono WAV reference.
The exact words spoken in the reference clip.

Avoid

Music-backed clips or singing.
Interviews with overlap.
Reverb-heavy rooms.
Placeholder transcripts.
Public-person imitation unless you have explicit rights for the exact use.

A quiet phone recording usually beats a polished clip with backing audio. The model conditions on everything in the recording, including room tone, noise, and music, not only on the speaker's identity.
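The measurable parts of the "Use" list (one channel, 24 kHz, 5 to 15 seconds) can be checked before any render. The following is a minimal sketch using only the Python standard library; the function name `check_reference` is hypothetical, and it assumes the reference is a PCM WAV such as the one ffmpeg writes by default. It cannot tell whether the clip is dry, single-speaker, or authorized.

```python
import wave


def check_reference(path, min_seconds=5.0, max_seconds=15.0, expected_rate=24000):
    """Return a list of problems found in a PCM WAV reference clip (empty means OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if not min_seconds <= seconds <= max_seconds:
        problems.append(
            f"expected {min_seconds}-{max_seconds} s, found {seconds:.1f} s"
        )
    return problems
```

Calling `check_reference("refs/voice-a.wav")` and refusing to render when the list is non-empty catches the most common conversion mistakes early.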

3. Install

Create a small project folder and keep model/cache state local to that project:

mkdir -p qwen-voice/{bin,refs,text,out,tmp}
cd qwen-voice

python3.12 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -U mlx-audio soundfile huggingface_hub

When running inside a restricted agent or sandbox, put Hugging Face cache files inside the project:

export HF_HOME="$PWD/tmp/hf-home"
export HF_HUB_CACHE="$HF_HOME/hub"
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
mkdir -p "$HF_HUB_CACHE"

4. Smoke test

Start with a neutral one-sentence render before adding a reference voice:

python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "Hello from local Qwen text to speech." \
  --lang_code English \
  --output_path out \
  --file_prefix smoke \
  --audio_format wav \
  --join_audio \
  --max_tokens 80

Play it:

afplay out/smoke.wav

If this fails, fix MLX, Python, model download, or audio output before trying long text or a custom voice.

5. Reference voice

Convert the reference to 24 kHz mono WAV:

ffmpeg -y -i input-reference.m4a -ac 1 -ar 24000 refs/voice-a.wav

Then render with the exact transcript of that clip:

python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "This is the sentence you want the model to read." \
  --lang_code English \
  --output_path out \
  --file_prefix voice-a-test \
  --audio_format wav \
  --join_audio \
  --ref_audio refs/voice-a.wav \
  --ref_text "exact words spoken in the reference clip" \
  --temperature 0.55 \
  --top_p 0.82 \
  --top_k 40 \
  --repetition_penalty 1.7 \
  --max_tokens 120

If it sounds like the default voice, your wrapper probably did not pass the reference audio and reference transcript into the model command. Print those paths in private logs, but keep them out of public docs.
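One way to rule out a silently dropped reference is to build the command in one place and refuse half-specified references. This is a sketch, not the wrapper used by the author; `build_generate_command` is a hypothetical name, and it only uses flags that already appear in the commands above.

```python
import sys


def build_generate_command(text, file_prefix, ref_audio=None, ref_text=None,
                           model="mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
                           output_path="out", max_tokens=120):
    """Build the argv list for mlx_audio.tts.generate with this guide's settings."""
    # A reference clip without its transcript (or the reverse) is a wrapper bug.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("pass ref_audio and ref_text together, or neither")
    cmd = [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
        "--lang_code", "English",
        "--output_path", output_path,
        "--file_prefix", file_prefix,
        "--audio_format", "wav",
        "--join_audio",
        "--temperature", "0.55",
        "--top_p", "0.82",
        "--top_k", "40",
        "--repetition_penalty", "1.7",
        "--max_tokens", str(max_tokens),
    ]
    if ref_audio is not None:
        cmd += ["--ref_audio", str(ref_audio), "--ref_text", ref_text]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Logging the list privately before running it shows at a glance whether `--ref_audio` and `--ref_text` made it into the call.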

6. Long-form Markdown

Do not feed a full Markdown document as one generation. Long-form quality holds up much better when each chunk is short, and the shorter the reference clip, the shorter the chunks should be.

Practical starting settings

--chunk-words 16
--temperature 0.55
--top_p 0.82
--top_k 40
--repetition_penalty 1.7

Chunking rules

Convert Markdown to plain text first.
Remove YAML frontmatter, raw URLs, table pipes, code fences, and wiki syntax.
Preserve headings as optional spoken cues.
Split on sentence boundaries first.
For very short reference clips, keep chunks around 12–20 words.
Avoid chunks that end on weak words like “and”, “to”, “with”, or “the”.
Stitch chunk WAVs with ffmpeg after rendering.
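The first three rules can be approximated with a few regular expressions. The sketch below is an assumption about how such a normalizer might look, not a full Markdown parser; `markdown_to_speech_text` is a hypothetical name, and stripping list bullets and emphasis markers goes slightly beyond the rules listed above.

```python
import re


def markdown_to_speech_text(markdown, keep_headings=True):
    """Strip Markdown syntax that should not be read aloud (regex approximation)."""
    text = re.sub(r"\A---\n.*?\n---\n", "", markdown, flags=re.DOTALL)  # YAML frontmatter
    text = re.sub(r"^```.*?^```[ \t]*$", "", text,
                  flags=re.DOTALL | re.MULTILINE)                       # fenced code blocks
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)       # [[page|label]] -> label
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)              # [label](url) -> label
    text = re.sub(r"https?://\S+", "", text)                            # raw URLs
    lines = []
    for line in text.splitlines():
        heading = re.match(r"#{1,6}\s+(.*)", line)
        if heading:
            title = heading.group(1).strip()
            if keep_headings and title:
                # End the heading with punctuation so it becomes its own sentence.
                lines.append(title if title[-1] in ".!?" else title + ".")
            continue
        if re.fullmatch(r"[\s|:\-*_]+", line):
            continue                                  # table separators, rules, blank lines
        line = re.sub(r"^\s*[-*+]\s+", "", line)      # list bullets
        line = line.replace("|", " ")                 # table pipes
        line = re.sub(r"[*`]+", "", line)             # emphasis and inline-code markers
        lines.append(line)
    return re.sub(r"\s+", " ", " ".join(lines)).strip()
```

List items and table cells that lack closing punctuation will run together in the output, so tables are usually better rewritten as sentences before rendering.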

The goal is boring repeatability. A slightly smaller chunk size is usually better than one fluent paragraph followed by one broken segment.
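The sentence-first and weak-ending rules can be sketched as follows. This is one possible implementation under stated assumptions: `chunk_sentences`, `split_long`, and `WEAK_ENDINGS` are hypothetical names, the sentence splitter is a naive regex that will also split after abbreviations such as "e.g.", and `max_words=16` mirrors the `--chunk-words 16` starting value.

```python
import re

WEAK_ENDINGS = {"and", "to", "with", "the"}


def split_long(words, max_words):
    """Split an over-long sentence, moving each cut earlier to avoid weak endings."""
    pieces = []
    while len(words) > max_words:
        cut = max_words
        while cut > 1 and words[cut - 1].lower().strip(",;:") in WEAK_ENDINGS:
            cut -= 1
        pieces.append(words[:cut])
        words = words[cut:]
    if words:
        pieces.append(words)
    return pieces


def chunk_sentences(text, max_words=16):
    """Pack whole sentences into chunks of at most max_words words."""
    chunks, current = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        if not words:
            continue
        if len(current) + len(words) <= max_words:
            current += words              # the sentence fits in the chunk being built
            continue
        if current:
            chunks.append(" ".join(current))
            current = []
        pieces = split_long(words, max_words)
        chunks += [" ".join(piece) for piece in pieces[:-1]]
        current = pieces[-1]              # the tail may absorb the next short sentence
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Only a sentence longer than `max_words` is ever cut mid-sentence, which keeps most chunk boundaries on natural pauses.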

7. Inspect and repair

Every long-form render should produce a manifest with chunk number, word count, token budget, WAV path, and duration. Then flag suspicious chunks by seconds per word.

Flag	Threshold	Meaning
Too fast	below 0.27 sec/word	likely skipped, garbled, or collapsed
Too slow	above 0.75 sec/word	likely repeated, dragged, or hallucinated
Too short	under 2 seconds for an 8+ word chunk	likely failed render
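The thresholds in the table translate directly into a per-chunk check. The sketch below assumes a manifest record with the five fields named above; `ChunkRecord` and `flag_chunk` are hypothetical names, and the duration is expected to be measured by the caller (for example with `soundfile.info(path).duration`).

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    index: int          # chunk number in reading order
    words: int          # word count of the chunk text
    token_budget: int   # max_tokens value used for the render
    wav_path: str       # where the chunk WAV was written
    seconds: float      # measured duration of the WAV


def flag_chunk(record, fast=0.27, slow=0.75, min_seconds=2.0, min_words=8):
    """Return the flags from the table that apply to one manifest record."""
    flags = []
    pace = record.seconds / max(record.words, 1)   # seconds per word
    if pace < fast:
        flags.append("too_fast")
    if pace > slow:
        flags.append("too_slow")
    if record.words >= min_words and record.seconds < min_seconds:
        flags.append("too_short")
    return flags
```

For example, a 16-word chunk that lasts 6.4 seconds runs at 0.4 seconds per word and gets no flags, while the same chunk at 1.6 seconds is both too fast and too short.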

Repair only the bad chunks, then restitch. Do not rerender the whole document unless the reference or settings are wrong globally.
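Restitching after a repair only needs the chunk paths in reading order. A minimal sketch, assuming all chunks share the same sample rate and channel layout so ffmpeg's concat demuxer can copy the audio without re-encoding; `write_concat_list` and `stitch_command` are hypothetical helper names.

```python
from pathlib import Path


def write_concat_list(chunk_paths, list_path):
    """Write an ffmpeg concat-demuxer list file, one chunk per line, in order."""
    lines = []
    for chunk in chunk_paths:
        # Inside single quotes, ffmpeg needs a literal quote written as '\''.
        escaped = str(Path(chunk).resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def stitch_command(list_path, output_path):
    """Build the ffmpeg argv that joins the listed chunks into one file."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",   # -safe 0 allows absolute paths in the list
        "-i", str(list_path),
        "-c", "copy",
        str(output_path),
    ]
```

After rerendering a flagged chunk to the same path, rewriting the list and running `subprocess.run(stitch_command(...), check=True)` rebuilds the full file without touching the good chunks.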

Recommended wrapper contract

./bin/qwen-ref refs/voice-a.wav "exact words spoken in the reference"
./bin/qwen-read "Text to read."
./bin/qwen-read text/article.md
./bin/qwen-inspect out/article.wav
./bin/qwen-repair out/article.wav 14 22
./bin/qwen-batch text/*.md

8. Publish safely

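Before turning a private experiment into a public guide, search the new files for private data. The pattern names below are placeholders; replace them with regexes for your own paths, cloud folder names, sample labels, topics, and transcripts: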

rg -n "PATH_PATTERN|CLOUD_FOLDER_PATTERN|SAMPLE_LABEL_PATTERN|TOPIC_PATTERN|TRANSCRIPT_PATTERN" guides/ output/ || true

Also check that the guide does not contain personal filesystem paths, real voice sample filenames, private generated content, real reference transcripts, names of people whose voices were tested, or screenshots/logs showing local directories.

Use neutral placeholders: refs/voice-a.wav, text/article.md, out/article.wav, and "exact words spoken in the reference clip".
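The filesystem-path part of these checks is easy to automate. The sketch below is an illustration only: `find_private_strings` and `DEFAULT_PATTERNS` are hypothetical names, and the single default pattern (macOS and Linux home directories) is an assumed starting point that should be extended with your own private terms.

```python
import re

# Assumed starting pattern: macOS and Linux home directories.
DEFAULT_PATTERNS = {"home_path": r"/(?:Users|home)/[^/\s\"']+"}


def find_private_strings(text, patterns=None):
    """Return (line_number, pattern_name, match) for every private-looking string."""
    compiled = {
        name: re.compile(pattern)
        for name, pattern in (patterns or DEFAULT_PATTERNS).items()
    }
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, regex in compiled.items():
            for match in regex.finditer(line):
                hits.append((line_no, name, match.group(0)))
    return hits
```

A draft that mentions only the neutral placeholders, such as `refs/voice-a.wav`, produces no hits, while any line containing a personal home directory is reported with its line number.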
