Qwen3-TTS with MLX
Local Apple Silicon voice generation with reference clips, Markdown chunking, inspection, and repair.
1. Where it fits
Qwen3-TTS through MLX-Audio is a practical local text-to-speech path for Apple Silicon. It can run reference-audio voice cloning, voice-design prompts, and preset-style voices without sending text or reference audio to a hosted API.
The workflow that holds up is not “send a huge Markdown file to the model.” It is:
Markdown or text → normalize → sentence chunks → Qwen3-TTS → WAV chunks → inspect → repair → stitch
| Need | Best mode |
|---|---|
| Clone from an authorized short reference clip | Qwen3 Base model with --ref_audio and --ref_text |
| Design a voice from a description | Qwen3 VoiceDesign model with --instruct |
| Use bundled named speakers/styles | Qwen3 CustomVoice model or preset wrapper |
| Read long Markdown files | Chunk, render, inspect, repair, and stitch |
2. Reference and consent rules
Voice references should be boring, clean, and authorized.
Use
- A voice you own, recorded yourself, licensed, or have explicit permission to use.
- 5–15 seconds of dry spoken audio.
- One speaker.
- A 24 kHz mono WAV reference.
- The exact words spoken in the reference clip.
Avoid
- Music-backed clips or singing.
- Interviews with overlap.
- Reverb-heavy rooms.
- Placeholder transcripts.
- Public-person imitation unless you have explicit rights for the exact use.
A quiet phone recording usually beats a polished clip with backing audio. The model conditions on the whole recording, including room sound and background audio, not only on the speaker's identity.
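The measurable parts of the checklist above (mono, 24 kHz, 5–15 seconds) can be checked before any render. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_reference` and its default limits are illustrative assumptions taken from the list above, not part of MLX-Audio.

```python
import wave


def check_reference(path, min_seconds=5.0, max_seconds=15.0, expected_rate=24000):
    """Return a list of problems found in a reference WAV (empty list means OK).

    Hypothetical helper: the limits mirror the guide's checklist.
    The wave module only reads PCM WAV files, so other encodings raise wave.Error.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / float(rate)

    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    if not (min_seconds <= seconds <= max_seconds):
        problems.append(
            f"expected {min_seconds}-{max_seconds} s, found {seconds:.1f} s"
        )
    return problems
```

Calling `check_reference("refs/voice-a.wav")` and refusing to render when the list is non-empty catches format mistakes early. It cannot judge reverb, overlap, or consent, which still need a human check.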
3. Install
Create a small project folder and keep model/cache state local to that project:
mkdir -p qwen-voice/{bin,refs,text,out,tmp}
cd qwen-voice
python3.12 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -U mlx-audio soundfile huggingface_hub
When running inside a restricted agent or sandbox, put Hugging Face cache files inside the project:
export HF_HOME="$PWD/tmp/hf-home"
export HF_HUB_CACHE="$HF_HOME/hub"
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
mkdir -p "$HF_HUB_CACHE"
4. Smoke test
Start with a neutral one-sentence render before adding a reference voice:
python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "Hello from local Qwen text to speech." \
  --lang_code English \
  --output_path out \
  --file_prefix smoke \
  --audio_format wav \
  --join_audio \
  --max_tokens 80
Play it:
afplay out/smoke.wav
If this fails, fix MLX, Python, model download, or audio output before trying long text or a custom voice.
5. Reference voice
Convert the reference to 24 kHz mono WAV:
ffmpeg -y -i input-reference.m4a -ac 1 -ar 24000 refs/voice-a.wav
Then render with the exact transcript of that clip:
python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "This is the sentence you want the model to read." \
  --lang_code English \
  --output_path out \
  --file_prefix voice-a-test \
  --audio_format wav \
  --join_audio \
  --ref_audio refs/voice-a.wav \
  --ref_text "exact words spoken in the reference clip" \
  --temperature 0.55 \
  --top_p 0.82 \
  --top_k 40 \
  --repetition_penalty 1.7 \
  --max_tokens 120
If it sounds like the default voice, your wrapper probably did not pass the reference audio and reference transcript into the model command. Print those paths in private logs, but keep them out of public docs.
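One way to make the "reference was silently dropped" failure impossible is to have the wrapper refuse to build a command without both reference inputs. The sketch below assembles the same command line shown above as an argument list. The function name `build_clone_command` and its defaults are hypothetical, and only flags already used in this guide appear.

```python
import sys
from pathlib import Path

MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit"


def build_clone_command(text, ref_audio, ref_text, out_dir="out", prefix="voice-a-test"):
    """Build the argv list for a reference-voice render, failing fast on missing inputs.

    Hypothetical wrapper helper: it only assembles arguments and does not run anything.
    """
    ref_path = Path(ref_audio)
    if not ref_path.is_file():
        raise FileNotFoundError(f"reference audio not found: {ref_path}")
    if not ref_text.strip():
        raise ValueError("reference transcript is empty")

    return [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--model", MODEL,
        "--text", text,
        "--lang_code", "English",
        "--output_path", out_dir,
        "--file_prefix", prefix,
        "--audio_format", "wav",
        "--join_audio",
        "--ref_audio", str(ref_path),
        "--ref_text", ref_text,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Sampling flags such as `--temperature` are left out here for brevity and can be appended the same way.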
6. Long-form Markdown
Do not feed a full Markdown document as one generation. Long-form quality is much better when each chunk is short relative to the reference clip: a short clip supports only short chunks.
Practical starting settings
--chunk-words 16
--temperature 0.55
--top_p 0.82
--top_k 40
--repetition_penalty 1.7
Chunking rules
- Convert Markdown to plain text first.
- Remove YAML frontmatter, raw URLs, table pipes, code fences, and wiki syntax.
- Preserve headings as optional spoken cues.
- Split on sentence boundaries first.
- For very short reference clips, keep chunks around 12–20 words.
- Avoid chunks that end on weak words like “and”, “to”, “with”, or “the”.
- Stitch chunk WAVs with ffmpeg after rendering.
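The first three rules can be sketched as a small normalization pass. This is a rough, regex-based approximation rather than a full Markdown parser. The function name `markdown_to_plain` is hypothetical, and turning headings into short sentences is just one way to keep them as spoken cues.

```python
import re


def markdown_to_plain(md):
    """Reduce Markdown to speakable plain text (regex sketch, not a real parser)."""
    text = md.replace("\r\n", "\n")
    # YAML frontmatter, only when it opens the document.
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    # Fenced code blocks, including their contents.
    text = re.sub(r"^```.*?^```[ \t]*$", "", text, flags=re.DOTALL | re.MULTILINE)
    # Wiki links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Inline links: [label](url) -> label.
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Raw URLs that remain.
    text = re.sub(r"https?://\S+", "", text)

    lines = []
    for line in text.split("\n"):
        stripped = line.strip()
        # Skip table separator rows and horizontal rules.
        if stripped and re.fullmatch(r"[\s:|*_-]+", stripped):
            continue
        # Table rows: drop the pipes and read cells as a comma-separated line.
        if stripped.startswith("|"):
            cells = [c.strip() for c in stripped.strip("|").split("|")]
            stripped = ", ".join(c for c in cells if c)
        # Headings become a short sentence so they can serve as a spoken cue.
        heading = re.match(r"#{1,6}\s+(.*)", stripped)
        if heading:
            stripped = heading.group(1).rstrip(".:") + "."
        # List markers, then asterisk emphasis and backticks (underscores are left alone).
        stripped = re.sub(r"^[-*+]\s+", "", stripped)
        stripped = re.sub(r"[*`]+", "", stripped)
        stripped = re.sub(r"\s+", " ", stripped).strip()
        if stripped:
            lines.append(stripped)
    return "\n".join(lines)
```

Dropping the heading branch omits headings entirely, which matches treating them as optional.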
The goal is boring repeatability. A slightly smaller chunk size is usually better than one fluent paragraph followed by one broken segment.
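A minimal chunker along these lines might look like the sketch below. It assumes plain text as input, treats each sentence as its own chunk, and splits only sentences longer than the word budget, backing off so that a chunk does not end on a weak word. The names `chunk_text` and `WEAK_ENDINGS` are hypothetical, and the sentence splitter is a naive punctuation rule that will mis-split abbreviations.

```python
import re

# Weak trailing words named in the chunking rules; extend as needed.
WEAK_ENDINGS = {"and", "to", "with", "the"}


def split_sentences(text):
    """Naive sentence split on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_text(text, chunk_words=16):
    """Split text into chunks of at most chunk_words words, sentence by sentence."""
    chunks = []
    for sentence in split_sentences(text):
        words = sentence.split()
        while len(words) > chunk_words:
            cut = chunk_words
            # Back off while the chunk would end on a weak word (always keep one word).
            while cut > 1 and words[cut - 1].lower().strip(",;:") in WEAK_ENDINGS:
                cut -= 1
            chunks.append(" ".join(words[:cut]))
            words = words[cut:]
        if words:
            chunks.append(" ".join(words))
    return chunks
```

For example, `chunk_text("One two three four and five six seven. Short one.", chunk_words=5)` cuts the first sentence before "and" instead of after it. Merging very short neighboring sentences is left out of this sketch.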
7. Inspect and repair
Every long-form render should produce a manifest with chunk number, word count, token budget, WAV path, and duration. Then flag suspicious chunks by seconds per word.
| Flag | Threshold | Meaning |
|---|---|---|
| Too fast | below 0.27 sec/word | likely skipped, garbled, or collapsed |
| Too slow | above 0.75 sec/word | likely repeated, dragged, or hallucinated |
| Too short | under 2 seconds for an 8+ word chunk | likely failed render |
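The thresholds in the table translate directly into a small check. The sketch below assumes each manifest row is a dictionary with `chunk`, `words`, and `duration` keys; those key names and the flag labels are illustrative, not a fixed format.

```python
def flag_chunk(word_count, duration_seconds, fast=0.27, slow=0.75):
    """Return the flags from the table above that apply to one rendered chunk."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    seconds_per_word = duration_seconds / word_count
    flags = []
    if seconds_per_word < fast:
        flags.append("too_fast")
    if seconds_per_word > slow:
        flags.append("too_slow")
    if word_count >= 8 and duration_seconds < 2.0:
        flags.append("too_short")
    return flags


def flag_manifest(rows):
    """Map chunk number to flags for every suspicious row in a manifest.

    Assumes rows shaped like {"chunk": 14, "words": 16, "duration": 13.1}.
    """
    report = {}
    for row in rows:
        flags = flag_chunk(row["words"], row["duration"])
        if flags:
            report[row["chunk"]] = flags
    return report
```

The keys of the returned dictionary are the chunk numbers worth listening to or rerendering.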
Repair only the bad chunks, then restitch. Do not rerender the whole document unless the reference or settings are wrong globally.
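Restitching can be done with ffmpeg's concat demuxer, which reads a list file of inputs. The sketch below writes that list and builds the command. It assumes every chunk WAV shares the same sample rate, channel count, and encoding, so that stream copy (`-c copy`) is valid. The function names are hypothetical.

```python
from pathlib import Path


def write_concat_list(chunk_paths, list_path):
    """Write an ffmpeg concat-demuxer list file and return its lines."""
    lines = []
    for path in chunk_paths:
        # Absolute paths avoid surprises with the list file's own location.
        escaped = str(Path(path).resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def stitch_command(list_path, output_path):
    """Build the ffmpeg command that joins the listed chunks without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat",
        "-safe", "0",  # required because the list uses absolute paths
        "-i", str(list_path),
        "-c", "copy",
        str(output_path),
    ]
```

After a repair, rewriting the list in chunk order and running the command again replaces only the final stitched file; the untouched chunk WAVs are reused as they are.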
Recommended wrapper contract
./bin/qwen-ref refs/voice-a.wav "exact words spoken in the reference"
./bin/qwen-read "Text to read."
./bin/qwen-read text/article.md
./bin/qwen-inspect out/article.wav
./bin/qwen-repair out/article.wav 14 22
./bin/qwen-batch text/*.md
8. Publish safely
Before turning a private experiment into a public guide, grep the new files for private data:
rg -n "PATH_PATTERN|CLOUD_FOLDER_PATTERN|SAMPLE_LABEL_PATTERN|TOPIC_PATTERN|TRANSCRIPT_PATTERN" guides/ output/ || true
Also check that the guide does not contain personal filesystem paths, real voice sample filenames, private generated content, real reference transcripts, names of people whose voices were tested, or screenshots/logs showing local directories.
Use neutral placeholders: refs/voice-a.wav, text/article.md, out/article.wav, and "exact words spoken in the reference clip".
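The same check can be scripted so that it runs before every publish. The sketch below scans text for personal filesystem paths. The two patterns shown are illustrative assumptions, and a real list would add private patterns for sample names, transcripts, and people.

```python
import re
from pathlib import Path

# Illustrative patterns only; real private patterns belong in an unpublished file.
PRIVATE_PATTERNS = {
    "home_path": re.compile(r"/(?:Users|home)/[^/\s]+"),
    "windows_path": re.compile(r"[A-Za-z]:\\Users\\[^\\\s]+"),
}


def scan_text(text, patterns=PRIVATE_PATTERNS):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in patterns.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


def scan_tree(root, suffixes=(".md", ".txt", ".json")):
    """Scan matching files under root and return {path: hits} for files with hits."""
    report = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = scan_text(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                report[str(path)] = hits
    return report
```

An empty result from `scan_tree("guides")` is a necessary check, not a sufficient one: screenshots and audio files still need a manual look.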