Text-to-Speech for Any LLM via MCP

June 25, 2026 · 15 min read

Yes, you can make almost any tool-calling LLM speak by exposing text-to-speech through an MCP tool schema. I’d keep the setup simple: define a strict JSON schema, map it to either the TTSBuddy CLI or /v1/agent-tts, clean Markdown before synthesis, and return one fixed JSON response shape every time.

Here’s the short version:

I’d use text, voice, and output_format as the core tool inputs.
I’d keep optional controls limited to things like speaking_rate, pitch, language, or Markdown stripping.
I’d choose between sync, async, streaming, or batch based on response length and playback timing.
I’d store the ttsb_ API key in environment variables, not in code.
I’d return fixed fields like job_id, status, and audio_link or file_path so the LLM always knows what happened.
I’d handle limits up front, including 500,000 characters per request, 1 submission per minute, and 30 status checks per minute.

A few parts matter more than the rest. Bad schemas lead to bad tool calls. Loose voice settings can cause invalid IDs. Long text can trigger timeouts. And if Markdown is left in place, screen readers and voice apps may read symbols out loud.

That’s why I’d treat this as a simple contract:

The LLM writes the text
MCP exposes the TTS tool
The server maps inputs to CLI or API
The client plays or saves the audio

If I wanted one setup that works across many MCP hosts, I’d keep the schema small, use enums for voice choices, apply idempotency keys for retries, and add a playback lock when more than one agent can speak at once.

Part	What I’d do	Why it matters
Schema	Keep required fields strict	Cuts down invalid tool calls
Voice control	Use enums or server-side mapping	Stops made-up voice IDs
Text cleanup	Strip Markdown before TTS	Avoids spoken formatting noise
Execution mode	Pick sync, async, stream, or batch	Matches speed and job length
Output format	Return one fixed JSON structure	Makes host behavior easier to handle
Error handling	Use fixed error codes	Helps the LLM retry or adjust input
Security	Store API keys outside source files	Lowers secret exposure risk

In other words, I’d build the TTS tool once, keep the input and output contract tight, and let any MCP-ready LLM use it without custom glue code each time.

How to Design an MCP Schema for a Reusable TTS Tool

With the tool wrapper in place, the next step is the input schema. The inputSchema is a JSON Schema object that defines the parameters the LLM can send. This is where most of the hard choices happen.

Required Fields: `text`, `voice`, and `output_format`

Start with text and voice, which every call needs. Treat output_format as required only if the caller has to pick the audio container; otherwise give it a default.

text holds the content you want spoken. voice or voice_id tells the system which speaker to use. For voice, use an enum with supported IDs such as alloy or en-US-Neural2-D, so the LLM selects from actual options instead of making a guess [1][5].

For output_format, mp3 is a safe default for most use cases. wav or ogg can work too, depending on the workflow. It also helps to set a maximum length on text, so oversized requests are rejected before they can time out.

Schema field	Purpose in TTS	Typical value	Required
`text`	The content to be spoken	`"Hello, how can I help?"`	Yes
`voice`	Identifier for the speaker	`alloy`, `en-US-Neural2-D`	Yes
`output_format`	Audio file container	`mp3`, `wav`, `ogg`	Yes (or default)
`speaking_rate`	Speed of narration	`1.0` (range 0.5–2.0)	No
`pitch`	Tone/frequency adjustment	`-2.0` to `2.0`	No
`language`	Target language code	`en-US`, `de-DE`	No
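
The table above can be written down as an inputSchema. The following is a minimal sketch, not a definitive schema: the tool name generate_speech comes from later in this article, the voice IDs are the two examples mentioned above rather than a verified provider list, and output_format is given a default instead of being required. The check_arguments helper is a hypothetical hand-rolled check, not a full JSON Schema validator.

```python
# Hypothetical inputSchema for a generate_speech tool.
# Field names and ranges follow the table above; the voice enum is illustrative.
GENERATE_SPEECH_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string", "minLength": 1, "maxLength": 500_000},
        "voice": {"type": "string", "enum": ["alloy", "en-US-Neural2-D"]},
        "output_format": {"type": "string", "enum": ["mp3", "wav", "ogg"], "default": "mp3"},
        "speaking_rate": {"type": "number", "minimum": 0.5, "maximum": 2.0, "default": 1.0},
        "pitch": {"type": "number", "minimum": -2.0, "maximum": 2.0},
        "language": {"type": "string"},
    },
    "required": ["text", "voice"],
    "additionalProperties": False,
}


def check_arguments(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    props = GENERATE_SPEECH_SCHEMA["properties"]
    for name in GENERATE_SPEECH_SCHEMA["required"]:
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        elif spec["type"] == "string":
            if not isinstance(value, str):
                problems.append(f"{name} must be a string")
            elif "enum" in spec and value not in spec["enum"]:
                problems.append(f"{name} must be one of {spec['enum']}")
            elif not spec.get("minLength", 0) <= len(value) <= spec.get("maxLength", len(value)):
                problems.append(f"{name} has an invalid length")
        elif spec["type"] == "number":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name} must be a number")
            elif not spec["minimum"] <= value <= spec["maximum"]:
                problems.append(f"{name} is out of range")
    return problems
```

Keeping the voice list as an enum means a made-up voice ID fails validation on the server before any synthesis request is sent.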

Optional Fields for Accessibility and Narration Quality

speaking_rate gives the LLM a way to slow narration for dense instructions or speed it up for short status messages. pitch changes tone, which can help separate one narration style from another.

For accessibility work, strip_markdown matters a lot. LLMs often return Markdown, and if you leave it in, the audio may read formatting symbols out loud [6][7]. That gets awkward fast. A category or message_type field can also help the server pick the right voice or priority level [6][3].

Schema Trade-offs That Affect Cross-LLM Reuse

After you settle on the fields, the next choice is how much control to give the model.

A profile-locked schema keeps voice, language, and output_format in server-side config. The model only sees text and maybe speaking_rate. That cuts down on bad voice IDs and keeps output consistent, but it also means the LLM can't swap personas or change tone on the fly [8].

A full parameter control schema exposes every field to the LLM. That gives the model more room to tailor the output, but it also makes invalid tool calls more likely.

The sync-versus-async choice has the same kind of trade-off. Synchronous execution is easier to build, but long documents can hit timeout limits. Async submission returns a job_id right away, though it also means you need status or cancel tools.

Schema choice	Advantage	Disadvantage
Profile-Locked	High reliability; LLM only controls `text`/`speaking_rate`	LLM can't change voices dynamically
Full Parameter Control	LLM can adapt tone and speed to context	Higher risk of invalid calls or invalid voice IDs
Synchronous Result	Simple implementation; audio is ready immediately	Can cause timeouts for long documents
Async/Streaming	Low latency; agent continues while audio plays	Requires stop/poll tools for state management
Auto-Markdown Strip	Cleaner audio; no spoken formatting syntax	May remove intentional emphasis if not handled carefully

For most reusable setups, a category-driven mapping works well. Expose a category enum with values like summary, error, and status, then let the server map each one to the right voice and settings [6]. That keeps the LLM side simple and steady, while the backend handles the extra logic.
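
A category-driven mapping could look roughly like this. It is a sketch under assumptions: the category names come from the paragraph above, while the voice and speaking_rate values assigned to each one are purely illustrative.

```python
# Hypothetical category-to-settings table kept on the server.
# The LLM only sends a category and the text; the values here are examples.
CATEGORY_PROFILES = {
    "summary": {"voice": "alloy", "speaking_rate": 1.0},
    "error": {"voice": "en-US-Neural2-D", "speaking_rate": 0.9},
    "status": {"voice": "alloy", "speaking_rate": 1.2},
}


def resolve_category(category: str, text: str) -> dict:
    """Expand a category into full synthesis arguments."""
    profile = CATEGORY_PROFILES.get(category)
    if profile is None:
        allowed = ", ".join(sorted(CATEGORY_PROFILES))
        raise ValueError(f"unknown category {category!r}; expected one of: {allowed}")
    return {"text": text, **profile}
```

With this shape, changing the voice used for error messages is a one-line server change and needs no update to the tool schema the LLM sees.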

Once the schema is stable, map each field to the TTSBuddy CLI and /v1/agent-tts request shape.

How to Connect the Schema to TTSBuddy CLI and REST API

Once the schema is set, the MCP server has a pretty simple job: take a tool call, translate it for TTSBuddy, and send the result back.

Map MCP Parameters to TTSBuddy CLI Commands

The field mapping is direct. text maps to --text, voice maps to --voice, output_format maps to --format, and speaking_rate maps to --speaking-rate [8].

TTSBuddy can also read piped stdin, which makes scripted workflows much easier. If you want the server to parse the response cleanly and pass it to the LLM client, turn on structured output with --json [8]. For local playback, return an absolute file path.
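
As a rough sketch, the flag mapping can be a small lookup table. The flags are the ones named above; the executable name ttsbuddy is an assumption, since this article does not give it, and run_cli assumes that --json makes the CLI print a single JSON object on stdout.

```python
import json
import subprocess

TTSBUDDY_BIN = "ttsbuddy"  # assumption: the executable name is not stated in the article

# MCP field name -> CLI flag, as described in the text above.
FLAG_MAP = {
    "text": "--text",
    "voice": "--voice",
    "output_format": "--format",
    "speaking_rate": "--speaking-rate",
}


def build_cli_args(tool_args: dict) -> list[str]:
    """Translate MCP tool arguments into an argv list with structured output on."""
    argv = [TTSBUDDY_BIN, "--json"]
    for field, flag in FLAG_MAP.items():
        if field in tool_args:
            argv += [flag, str(tool_args[field])]
    return argv


def run_cli(tool_args: dict) -> dict:
    """Run the CLI and parse its output (assumes one JSON object on stdout)."""
    done = subprocess.run(
        build_cli_args(tool_args), capture_output=True, text=True, check=True
    )
    return json.loads(done.stdout)
```

Passing an argv list to subprocess.run, rather than a shell string, keeps LLM-written text from being interpreted by the shell.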

The key idea is simple: keep the same field names across both paths. That way, one MCP schema can power local CLI execution and hosted API requests without extra translation logic.

Map the Same Schema to `/v1/agent-tts` Requests

The REST path works the same way. The MCP server takes the tool arguments, turns them into a JSON body for POST /v1/agent-tts, and sends authentication with a ttsb_-prefixed API key in the Authorization: Bearer header [4].

MCP Schema Field	REST API Field	Notes
`text`	`text`	Up to 500,000 characters per request
`voice`	`voice_id`	Must match a supported voice identifier
`output_format`	`output_format`	`mp3`, `wav`, or `ogg`
`speaking_rate`	`speaking_rate`	Numerical tempo value

When audio needs to start right away, use streaming. It returns a single-use stream_url for immediate playback [4]. For longer documents or batch jobs, async submission is the better fit. It returns a job_id right away, so the agent can move on while the audio is generated in the background [4].

The API allows up to 500,000 characters per request. It also limits usage to 1 submission per minute and 30 status checks per minute, so polling needs to stay conservative [4].
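
One way to stay inside those limits is to pace calls on the client side. The sketch below is an assumption about how a server might do it, not something TTSBuddy provides: 30 status checks per minute works out to at least 2 seconds between checks, and 1 submission per minute to at least 60 seconds between submissions.

```python
import time


class MinIntervalLimiter:
    """Space calls evenly so that no more than per_minute happen in a minute."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may run

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.interval
        return delay


# Limits taken from the paragraph above.
status_limiter = MinIntervalLimiter(per_minute=30)
submit_limiter = MinIntervalLimiter(per_minute=1)
```

Calling status_limiter.wait() before each status request keeps polling conservative without scattering sleep calls through the code.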

If the same text and voice_id have already been generated, the API may return a cached audio_link instead of making the file again [4]. In plain English, the same input contract should lead to the same audio behavior no matter which transport you use.

Store Defaults and Secrets Safely

Never hardcode a ttsb_ API key in source files. Put secrets in the MCP server's environment or config, not in the codebase [4][2][7]. TTSBuddy hashes API keys with SHA-256 on the server side, so the full key is not stored there.

For defaults like voice, language, and output_format, use a profile config file instead of sending them with every request [8][9]. The MCP server can load those defaults at startup, then let per-call overrides take priority when someone wants a different voice or file type.

If more than one agent can trigger speech at the same time, add a system-wide file lock or mutex so audio playback doesn't collide [2].
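
A simple cross-process lock can be built from an exclusively created lock file. This is a minimal sketch with a hypothetical lock path; it does not detect stale locks left behind by a crashed process, which a production version would need to handle.

```python
import contextlib
import os
import tempfile
import time

# Hypothetical lock location shared by every agent on the machine.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "tts_playback.lock")


@contextlib.contextmanager
def playback_lock(path: str = LOCK_PATH, timeout: float = 30.0, poll: float = 0.1):
    """Hold an exclusive lock file while audio plays; raise TimeoutError if busy."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another process already holds the lock.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"playback lock busy: {path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.remove(path)
```

Playback code then runs inside `with playback_lock():`, so a second agent waits its turn instead of talking over the first.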

How to Build End-to-End LLM-to-Audio Workflows

Once your schema is mapped and your secrets are in place, the flow is pretty simple: the LLM writes the text, calls the MCP tool, and the client either plays the audio or saves it. The big variables are text cleanup and how you run the job. The same setup works for chatbot replies, summaries, and narrated status updates.

Turn Plain LLM Replies into Spoken Output

The most direct setup is a pass-through flow. The assistant writes a reply, the MCP server calls generate_speech with that reply as the text argument, and TTSBuddy returns either a file path or a shared audio link based on the output target you set.

For terminal-based flows, the CLI writes the file and prints the path. For web or app clients, a shared link is often the better fit. The client can play the URL directly, which makes it handy for screen-reader-friendly narration and hands-free use.

One small but important detail: suppress the Speaking: [text] console message so it never reenters model context.

Handle Markdown, Transcripts, and Long Documents

Before you send text to speech, clean it up. LLM output often comes wrapped in Markdown, and that formatting should be stripped before synthesis. TTSBuddy's CLI already handles Markdown files by removing headings, links, images, and code blocks before synthesis [6]. Use that same cleanup step before sending text to /v1/agent-tts.
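
For the API path, a cleanup step along these lines may be enough. It is a regex-based sketch, not TTSBuddy's own implementation: it drops fenced code blocks and images, keeps link text, and removes heading, bullet, emphasis, and inline-code markers. It will mishandle some edge cases, such as asterisks used as multiplication signs.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common Markdown syntax so it is not read aloud."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)                 # fenced code blocks
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)                       # images
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)                   # links -> link text
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # heading markers
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)     # bullet markers
    text = re.sub(r"\*{1,3}([^*\n]+)\*{1,3}", r"\1", text)                 # *emphasis*, **strong**
    # Underscore emphasis only at word edges, so snake_case names survive.
    text = re.sub(r"(?<!\w)_{1,2}([^_\n]+)_{1,2}(?!\w)", r"\1", text)
    text = re.sub(r"`([^`\n]*)`", r"\1", text)                             # inline code
    text = re.sub(r"[ \t]+", " ", text)                                    # collapse spaces
    text = re.sub(r"\n{3,}", "\n\n", text)                                 # collapse blank lines
    return text.strip()
```

Running this on the server, rather than asking the LLM to avoid Markdown, keeps the behavior the same across hosts.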

Long inputs need another pass. If you're working with a large document or a long chat thread, summarize it before sending it to TTS. Otherwise the narration runs long, and in synchronous mode a large request is more likely to time out.

Choose Between Sync, Async, or Batch Execution

After the text is ready, pick the execution mode that fits the user experience you want. In plain terms, choose based on latency and narration length.

Workflow type	Latency	Complexity	Best for
Sync (Blocking)	High	Low	Critical confirmations, step-by-step instructions
Async (Non-blocking)	Low	Medium	Real-time status updates and background narration
Streaming	Lowest	High	Long-form content where immediate start is vital
Batch	Very High	High	Bulk content creation, offline playlists, and cost-optimization pipelines

If more than one agent or view can trigger speech at the same time, add a mutex or shared lock so only one audio stream plays at once [2][5]. That one guardrail can save you from overlapping playback and a pretty messy user experience.

How to Validate Results, Handle Errors, and Lock In the Pattern

After you pick sync, async, or batch, the next job is simple: make the output contract rigid and make retry behavior predictable.

Return Predictable Outputs and Errors

Every generate_speech call should return the same JSON shape, no matter which host handles it. At a minimum, include:

job_id
status (pending, completed, or failed)
either audio_link or file_path

For async jobs, return job_id and status right away. Then, once the job finishes, return audio_link or file_path.
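
That response shape can be pinned down with a small data class. This is a sketch: the field names and status values are the ones listed above, while the rule that a completed job must carry a link or a path is an assumption added for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ALLOWED_STATUSES = ("pending", "completed", "failed")


@dataclass
class SpeechResult:
    job_id: str
    status: str
    audio_link: Optional[str] = None
    file_path: Optional[str] = None

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"status must be one of {ALLOWED_STATUSES}")
        # Assumed rule: a completed job always says where the audio is.
        if self.status == "completed" and not (self.audio_link or self.file_path):
            raise ValueError("completed results need audio_link or file_path")

    def to_payload(self) -> dict:
        """Return the same keys on every call, with None for unset fields."""
        return asdict(self)
```

Because to_payload always emits all four keys, a host never has to guess whether a missing field means pending or failed.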

Errors should also be structured. More importantly, they should use fixed error codes so the LLM can decide what to do next: retry, change inputs, or show a clear message to the user [9]. Use this fixed set:

Error Code	Meaning	Recommended LLM Action
`VALIDATION_ERROR`	Malformed input or missing fields	Check schema and resubmit
`INVALID_VOICE`	Voice ID not found or unavailable	Call `search_voices` to find a valid ID
`INSUFFICIENT_CREDITS`	Account balance too low	Notify user to top up credits
`STREAM_NOT_SUPPORTED`	Voice family lacks streaming capability	Switch to async delivery mode
`RATE_LIMIT_EXCEEDED`	Too many requests in a short window	Implement exponential backoff retry
`SYNTHESIS_FAILED`	Internal provider error	Retry only when `retryable` is true
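
The table above translates into a small dispatch function. This is a sketch under assumptions: the error payload is assumed to carry the code under a code key and a boolean retryable flag, the action names are hypothetical labels, and the 60-second backoff base with a 10-minute cap is only an illustrative choice.

```python
# Error code -> next step, following the table above. Action names are hypothetical.
ERROR_ACTIONS = {
    "VALIDATION_ERROR": "fix_input",
    "INVALID_VOICE": "search_voices",
    "INSUFFICIENT_CREDITS": "notify_user",
    "STREAM_NOT_SUPPORTED": "switch_to_async",
    "RATE_LIMIT_EXCEEDED": "backoff",
    "SYNTHESIS_FAILED": "retry_if_retryable",
}


def next_action(error: dict, attempt: int = 0) -> dict:
    """Decide what to do after a failed call; attempt counts earlier retries."""
    action = ERROR_ACTIONS.get(error.get("code"), "notify_user")
    if action == "backoff":
        # Exponential backoff: 60s, 120s, 240s, ... capped at 600s (assumed values).
        return {"action": "retry", "delay_seconds": min(60 * 2 ** attempt, 600)}
    if action == "retry_if_retryable":
        retry = bool(error.get("retryable"))
        return {"action": "retry" if retry else "notify_user", "delay_seconds": 0}
    return {"action": action, "delay_seconds": 0}
```

Unknown codes fall back to notifying the user, so a new server-side error never triggers a blind retry loop.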

Set isError: true on validation failures so the LLM can correct the parameters and try again [9]. Use idempotency keys on every generate_speech call so you don’t end up with duplicate files or duplicate charges [4]. And use tts_{timestamp}_{hash}.{ext} for filenames to avoid collisions when requests hit at the same time [2].

That way, every host reads the same success and failure signals. No guesswork. No weird edge-case parsing.

Key Takeaways for Production Use

Once the transport path is set, reliability becomes the last layer.

Use one strict schema, clean inputs, predictable outputs, and safe retries. Define your MCP tool schema once with the required fields: text, voice, and output_format. Then map that schema cleanly to either the TTSBuddy CLI or /v1/agent-tts. Strip Markdown before synthesis, pick async or stream based on your latency needs, and use idempotency keys so retries stay safe.

With that setup, any MCP-compatible LLM that can read a tool definition can trigger speech in a steady way for screen readers, hands-free flows, and voice-enabled apps.

FAQs

How do I choose between sync, async, and streaming TTS?

Choose based on how fast audio needs to start and whether other work should keep running at the same time.

Streaming: best when you want audio to start right away. It plays in chunks as it’s being generated.
Async: best when audio can run in the background. It returns a job ID so you can fetch the result later.
Sync: best when you want clear, one-at-a-time playback. It helps prevent overlapping audio that can get confusing.

What should I do if the LLM sends invalid voice settings?

Use diagnostic tools to check your setup. Run tts_doctor to verify authentication, profile settings, and playback status. You can also call get_current_config to see your active voice, device, and other TTS settings.

Make sure voice IDs, speed, and format values match the provider’s requirements. Some profiles lock those fields or allow only certain ranges. If the problem sticks around, check your environment variables and your system output device settings.

How can I keep long or Markdown-heavy replies from sounding awkward?

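Strip Markdown before synthesis so formatting symbols are not read aloud, and summarize long inputs first. Skip all-caps too. Some systems read capitalized words letter by letter, which sounds awkward.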

For smoother playback, use tools that support SSML so you can add pauses and guide the rhythm a bit better. Streaming modes also help when you want audio to start right away and play back in chunks instead of making people wait for the whole file.

You can also turn off echo output or use sequential queuing so audio clips don’t talk over each other. And if the text is more complex, pick tools that deal well with formatting or support multi-voice synthesis.
