Qwen3-TTS with MLX

Local Apple Silicon voice generation with reference clips, Markdown chunking, inspection, and repair.

1. Where it fits

Qwen3-TTS through MLX-Audio is a practical local text-to-speech path for Apple Silicon. It can run reference-audio voice cloning, voice-design prompts, and preset-style voices without sending text or reference audio to a hosted API.

The workflow that holds up is not “send a huge Markdown file to the model.” It is:

Markdown or text → normalize → sentence chunks → Qwen3-TTS → WAV chunks → inspect → repair → stitch

| Need | Best mode |
| --- | --- |
| Clone from an authorized short reference clip | Qwen3 Base model with --ref_audio and --ref_text |
| Design a voice from a description | Qwen3 VoiceDesign model with --instruct |
| Use bundled named speakers/styles | Qwen3 CustomVoice model or preset wrapper |
| Read long Markdown files | Chunk, render, inspect, repair, and stitch |

Important: Qwen3 “CustomVoice” preset models are not the same thing as arbitrary voice cloning. For cloning from your own reference clip, use a Base model with reference audio and its exact transcript.

3. Install

Create a small project folder and keep model/cache state local to that project:

mkdir -p qwen-voice/{bin,refs,text,out,tmp}
cd qwen-voice

python3.12 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -U mlx-audio soundfile huggingface_hub

When running inside a restricted agent or sandbox, put Hugging Face cache files inside the project:

export HF_HOME="$PWD/tmp/hf-home"
export HF_HUB_CACHE="$HF_HOME/hub"
export HF_HUB_DISABLE_SYMLINKS_WARNING=1
mkdir -p "$HF_HUB_CACHE"

4. Smoke test

Start with a neutral one-sentence render before adding a reference voice:

python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "Hello from local Qwen text to speech." \
  --lang_code English \
  --output_path out \
  --file_prefix smoke \
  --audio_format wav \
  --join_audio \
  --max_tokens 80

Play it:

afplay out/smoke.wav

If this fails, fix MLX, Python, model download, or audio output before trying long text or a custom voice.

5. Reference voice

Convert the reference to 24 kHz mono WAV:

ffmpeg -y -i input-reference.m4a -ac 1 -ar 24000 refs/voice-a.wav
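
Before rendering, it can help to confirm that the converted clip really is 24 kHz mono. The sketch below uses only Python's standard `wave` module; the function name `reference_problems` is hypothetical and not part of MLX-Audio, and it assumes ffmpeg wrote an ordinary PCM WAV file.

```python
import wave


def reference_problems(path, expected_rate=24000, expected_channels=1):
    """Return a list of problems found in a reference WAV (empty list if none)."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    if frames == 0:
        problems.append("file contains no audio frames")
    return problems
```

Calling `reference_problems("refs/voice-a.wav")` should return an empty list for a correctly converted clip.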

Then render with the exact transcript of that clip:

python -m mlx_audio.tts.generate \
  --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
  --text "This is the sentence you want the model to read." \
  --lang_code English \
  --output_path out \
  --file_prefix voice-a-test \
  --audio_format wav \
  --join_audio \
  --ref_audio refs/voice-a.wav \
  --ref_text "exact words spoken in the reference clip" \
  --temperature 0.55 \
  --top_p 0.82 \
  --top_k 40 \
  --repetition_penalty 1.7 \
  --max_tokens 120

If the output sounds like the default voice, your wrapper probably did not pass the reference audio and reference transcript into the model command. Log the reference path and transcript privately to confirm they reach the command, but keep them out of public docs.
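
One way to rule out a silently dropped reference is to have the wrapper build the argument list in a single place and refuse to continue when the reference is missing. This is a minimal sketch, not the guide's actual wrapper: `build_generate_command` and its parameters are hypothetical, and the flags are simply the ones from the command above.

```python
import sys
from pathlib import Path

DEFAULT_MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit"


def build_generate_command(text, ref_audio, ref_text, file_prefix,
                           output_path="out", model=DEFAULT_MODEL, max_tokens=120):
    """Return the argv list for one cloned-voice render; raise if the reference is unusable."""
    if not Path(ref_audio).is_file():
        raise FileNotFoundError(f"reference audio not found: {ref_audio}")
    if not ref_text.strip():
        raise ValueError("reference transcript is empty")
    return [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--model", model,
        "--text", text,
        "--lang_code", "English",
        "--output_path", str(output_path),
        "--file_prefix", file_prefix,
        "--audio_format", "wav",
        "--join_audio",
        "--ref_audio", str(ref_audio),   # always present, never optional
        "--ref_text", ref_text,
        "--temperature", "0.55",
        "--top_p", "0.82",
        "--top_k", "40",
        "--repetition_penalty", "1.7",
        "--max_tokens", str(max_tokens),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Because the reference arguments are required parameters, a render cannot start without them.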

6. Long-form Markdown

Do not feed a full Markdown document to the model as one generation. Long-form quality is much better when each chunk is short, and the shorter the reference clip, the shorter the chunks should be.

Practical starting settings

--chunk-words 16
--temperature 0.55
--top_p 0.82
--top_k 40
--repetition_penalty 1.7

Chunking rules

  • Convert Markdown to plain text first.
  • Remove YAML frontmatter, raw URLs, table pipes, code fences, and wiki syntax.
  • Preserve headings as optional spoken cues.
  • Split on sentence boundaries first.
  • For very short reference clips, keep chunks around 12–20 words.
  • Avoid chunks that end on weak words like “and”, “to”, “with”, or “the”.
  • Stitch chunk WAVs with ffmpeg after rendering.

The goal is boring repeatability. A slightly smaller chunk size is usually better than one fluent paragraph followed by one broken segment.
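
The chunking rules above can be sketched roughly as follows. This is an illustrative approximation rather than a complete Markdown parser: the function names, the regular expressions, and the short list of weak ending words are all assumptions, and the default of 16 words mirrors the starting setting listed earlier.

```python
import re

WEAK_ENDINGS = {"and", "to", "with", "the"}


def normalize_markdown(md):
    """Reduce Markdown to plain text that is safe to read aloud."""
    text = re.sub(r"\A---\n.*?\n---\n", "", md, flags=re.S)        # YAML frontmatter
    text = re.sub(r"```.*?```", " ", text, flags=re.S)             # fenced code blocks
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"https?://\S+", " ", text)                      # raw URLs
    lines = []
    for line in text.splitlines():
        if re.fullmatch(r"[\s|:-]+", line) and "-" in line:
            continue                                               # table separator rows, rules
        line = line.replace("|", " ").strip()                      # table pipes
        heading = re.match(r"#+\s+(.*)", line)
        if heading:                                                # keep heading as a spoken cue
            line = heading.group(1).rstrip()
            if line and line[-1] not in ".!?":
                line += "."
        if line:
            lines.append(line)
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


def split_long_sentence(words, max_words):
    """Cut an over-long sentence into word windows that avoid weak final words."""
    parts, start = [], 0
    while len(words) - start > max_words:
        end = start + max_words
        while end - start > 1 and words[end - 1].lower().strip(",;:") in WEAK_ENDINGS:
            end -= 1                                               # back up off the weak word
        parts.append(words[start:end])
        start = end
    parts.append(words[start:])
    return parts


def chunk_text(text, max_words=16):
    """Pack whole sentences into chunks of at most max_words words."""
    chunks, current = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        if not words:
            continue
        if len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current = []
            chunks.extend(" ".join(p) for p in split_long_sentence(words, max_words))
        elif len(current) + len(words) <= max_words:
            current.extend(words)
        else:
            chunks.append(" ".join(current))
            current = list(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A typical call would be `chunk_text(normalize_markdown(source), max_words=16)`, with each returned chunk rendered to its own WAV file.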

7. Inspect and repair

Every long-form render should produce a manifest with chunk number, word count, token budget, WAV path, and duration. Then flag suspicious chunks by seconds per word.

| Flag | Threshold | Meaning |
| --- | --- | --- |
| Too fast | below 0.27 sec/word | likely skipped, garbled, or collapsed |
| Too slow | above 0.75 sec/word | likely repeated, dragged, or hallucinated |
| Too short | under 2 seconds for an 8+ word chunk | likely failed render |
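
The thresholds in this table translate directly into a small check. The sketch below assumes the manifest has been loaded as a list of dictionaries with `chunk`, `words`, and `seconds` keys; those field names and the function names are hypothetical.

```python
def flag_chunk(words, seconds, fast=0.27, slow=0.75, min_seconds=2.0, min_words=8):
    """Return the flags that apply to one rendered chunk (empty list if it looks fine)."""
    if words <= 0:
        return ["no_words"]
    flags = []
    pace = seconds / words                     # seconds per word
    if pace < fast:
        flags.append("too_fast")
    if pace > slow:
        flags.append("too_slow")
    if words >= min_words and seconds < min_seconds:
        flags.append("too_short")
    return flags


def suspicious_chunks(manifest):
    """Map chunk number to flags for every manifest row that has at least one flag."""
    flagged = {}
    for row in manifest:
        flags = flag_chunk(row["words"], row["seconds"])
        if flags:
            flagged[row["chunk"]] = flags
    return flagged
```

Note that an 8+ word chunk under 2 seconds is necessarily below 0.27 sec/word as well, so "too short" chunks will also carry the "too fast" flag.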

Repair only the bad chunks, then restitch. Do not rerender the whole document unless the reference or settings are wrong globally.
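
Restitching after a repair can reuse the same chunk list. The sketch below writes a list file for ffmpeg's concat demuxer and builds the matching command. The helper names are hypothetical, and stream copy (`-c copy`) assumes every chunk WAV shares the same sample rate, channel count, and encoding.

```python
from pathlib import Path


def write_concat_list(chunk_paths, list_path):
    """Write an ffmpeg concat-demuxer list file naming the chunks in playback order."""
    lines = []
    for chunk in chunk_paths:
        # Inside single quotes, the concat list format escapes a quote as '\''.
        escaped = str(Path(chunk).resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(list_path)


def stitch_command(list_path, output_path):
    """Return the ffmpeg argv that joins the listed chunks without re-encoding."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_path),
        "-c", "copy",
        str(output_path),
    ]
```

After repairing chunks 14 and 22, for example, the wrapper would regenerate only those two WAV files, rewrite the list, and run the command with `subprocess.run(cmd, check=True)`.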

Recommended wrapper contract

./bin/qwen-ref refs/voice-a.wav "exact words spoken in the reference"
./bin/qwen-read "Text to read."
./bin/qwen-read text/article.md
./bin/qwen-inspect out/article.wav
./bin/qwen-repair out/article.wav 14 22
./bin/qwen-batch text/*.md

8. Publish safely

Before turning a private experiment into a public guide, grep the new files for private data:

rg -n "PATH_PATTERN|CLOUD_FOLDER_PATTERN|SAMPLE_LABEL_PATTERN|TOPIC_PATTERN|TRANSCRIPT_PATTERN" guides/ output/ || true

Also check that the guide does not contain personal filesystem paths, real voice sample filenames, private generated content, real reference transcripts, names of people whose voices were tested, or screenshots/logs showing local directories.
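
Personal filesystem paths are the easiest item on this list to check mechanically. The sketch below is a hypothetical helper that reports lines containing a home-directory path. It assumes macOS, Linux, or Windows style user directories, and it does not replace a manual read-through for names, transcripts, or screenshots.

```python
import re

# Matches the user directory part of a path, e.g. /Users/name or C:\Users\name.
PERSONAL_PATH = re.compile(r"(?:/Users/|/home/|[A-Za-z]:\\Users\\)[^\s/\\]+")


def find_personal_paths(text):
    """Return (line_number, matched_prefix) pairs for every personal path in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for match in PERSONAL_PATH.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

An empty result for each guide file is a reasonable precondition for publishing, alongside the pattern search shown earlier.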

Use neutral placeholders: refs/voice-a.wav, text/article.md, out/article.wav, and "exact words spoken in the reference clip".
