A self-hosted Discord voice agent needs three pieces wired up correctly: the bot has to hold the Connect, Speak, and Read Message History permissions on the target channel, the gateway needs a realtime voice model registered, and the capture pipeline has to be tuned to the room so the agent does not cut speakers off. As of v2026.5.7 the gateway audits the first piece for you, ships a saner default for the second, and exposes a config knob for the third. This post walks through what changed, how to use the new audit, and how to stop the bot from talking over people.
What changed in v2026.5.7 for Discord voice
Three changes in the latest stable release target Discord voice specifically. They look small in the notes and matter a lot in practice.
| Change | What it does | Why it matters |
|---|---|---|
| Permission audit in channels capabilities | Reports Connect / Speak / Read Message History on configured voice channels, including auto-join targets | You see the missing permission before /vc join fails silently |
| voice.captureSilenceGraceMs (default 2.5s) | Waits 2.5s of silence after a user stops speaking before ending the capture window | Fewer chopped sentences, fewer false end-of-turn signals |
| Tightened STT prompt around live fragments | The model gets a cleaner prompt while speech-to-text is still streaming | Less confused early responses while a user is mid-sentence |
If you have been running a Discord voice bot from an older build and it interrupts people or fails to join, these fixes are reasons to upgrade.
Prerequisites
Before configuring anything, confirm you have:
- A self-hosted instance (pnpm install, Docker, or the bundled CLI). The Docker deployment guide covers a clean install.
- A Discord application with a bot token, invited to your server with the bot and applications.commands scopes.
- A voice channel you control. The bot user must have a role that includes Connect, Speak, and Read Message History on that channel.
- A realtime-capable speech model configured (OpenAI Realtime, Gemini Live, or any provider with a registered TTS + STT pair). The bundled voice-call plugin handles the audio routing.
Step 1: invite the bot and confirm permissions
Discord layers per-channel overwrites on top of server-wide role permissions, so it is easy to give the bot View Channel everywhere and forget Connect on the one channel that matters. After inviting:
- Right-click the target voice channel in the Discord client and pick Edit Channel > Permissions.
- Add the bot’s role (or the bot user directly) to the permission list.
- Tick: View Channel, Connect, Speak, Use Voice Activity, Read Message History. Leave Priority Speaker off unless you want the bot to duck other users.
Then run the audit:
channels capabilities
For each Discord voice channel the gateway knows about, you should see a green Connect/Speak/Read row. If any are red, fix the channel-level permission and rerun. The audit also covers auto-join targets configured under channels.discord.autoJoin, so you catch the right-channel-wrong-permission case before the bot actually tries to join.
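How the gateway implements the audit is not documented here, but the underlying check is simple to picture. The sketch below assumes you already have the bot's resolved permission bitfield for a channel and reports which of the three required permissions are missing. The bit positions are the ones in Discord's developer documentation; the function name and the REQUIRED table are illustrative, not part of the gateway.

```python
from typing import List

# Discord permission bit flags (values from Discord's developer documentation).
VIEW_CHANNEL = 1 << 10
READ_MESSAGE_HISTORY = 1 << 16
CONNECT = 1 << 20
SPEAK = 1 << 21

# The three permissions the audit reports on, in display order.
REQUIRED = {
    "Connect": CONNECT,
    "Speak": SPEAK,
    "Read Message History": READ_MESSAGE_HISTORY,
}


def missing_permissions(resolved: int) -> List[str]:
    """Return the names of required permissions absent from a resolved bitfield."""
    return [name for name, bit in REQUIRED.items() if not resolved & bit]


# A bot that can see and join the channel but was never granted Speak
# or Read Message History:
print(missing_permissions(VIEW_CHANNEL | CONNECT))
```

An empty list corresponds to an all-green row; anything else names the permission to fix on the channel.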
You can also probe at startup:
channels status --probe
This walks every configured channel and reports missing capabilities in the same output stream as offline channels, so a self-hosted personal agent setup does not fail silently.
Step 2: wire the voice model
voice-call is a bundled plugin, so it ships with the core install. Open .json and verify the plugin is enabled and pointed at a realtime model:
{
  "plugins": {
    "entries": [
      { "id": "@openclaw/voice-call", "enabled": true }
    ]
  },
  "voice": {
    "model": "openai/gpt-realtime",
    "voiceId": "alloy",
    "captureSilenceGraceMs": 2500
  }
}
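If you want to sanity-check the file before restarting the gateway, a small script can do it. This is a minimal sketch that assumes the config has exactly the shape shown above; check_voice_config is a hypothetical helper, not something the gateway ships, and the 2500 ms fallback mirrors the documented v2026.5.7 default.

```python
import json
from typing import List

PLUGIN_ID = "@openclaw/voice-call"  # plugin id as it appears in the config above


def check_voice_config(raw: str) -> List[str]:
    """Return a list of problems found in a config string (empty means OK)."""
    cfg = json.loads(raw)
    problems: List[str] = []

    # The plugin must be listed and explicitly enabled.
    entries = cfg.get("plugins", {}).get("entries", [])
    if not any(e.get("id") == PLUGIN_ID and e.get("enabled") is True for e in entries):
        problems.append("voice-call plugin is not enabled")

    voice = cfg.get("voice", {})
    if not voice.get("model"):
        problems.append("voice.model is not set")

    # Assumption: a missing key falls back to the documented 2500 ms default.
    grace = voice.get("captureSilenceGraceMs", 2500)
    if isinstance(grace, bool) or not isinstance(grace, (int, float)) or grace <= 0:
        problems.append("voice.captureSilenceGraceMs must be a positive number")

    return problems


SAMPLE = """
{
  "plugins": {"entries": [{"id": "@openclaw/voice-call", "enabled": true}]},
  "voice": {"model": "openai/gpt-realtime", "voiceId": "alloy", "captureSilenceGraceMs": 2500}
}
"""

print(check_voice_config(SAMPLE))
```

It only checks structure. Whether the model is actually reachable is something only the gateway can tell you.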
The voice.model key picks the TTS+STT pair. Anything that exposes a realtime audio interface works; the provider catalog lists the cheaper options if you do not want to pay Realtime rates.
voiceId is provider-specific. For Gemini Live, use a voice name from the Live API roster instead.
captureSilenceGraceMs is the new knob. 2500 ms is the v2026.5.7 default and is what you want for most Discord servers — long enough that pauses, breaths, and “umm” gaps do not end the turn, short enough that the agent does not feel laggy. Tune up to 3500 ms if you are in a noisy room where the silence detector keeps triggering early; tune down to 1500 ms if you want a snappier back-and-forth and the room is quiet.
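To see why the grace value matters, it helps to model the capture window directly. The sketch below is a toy model, not the plugin's code: it assumes audio arrives as fixed-length frames already labelled speech or silence, and it closes the window once the configured amount of continuous silence has accumulated. The function name and frame representation are made up for illustration.

```python
from typing import List, Optional


def capture_end_ms(frames: List[bool], frame_ms: int, grace_ms: int) -> Optional[int]:
    """Return the time (ms) at which the capture window closes, or None if it stays open.

    frames[i] is True when frame i contains speech. The window closes once
    grace_ms of continuous silence has accumulated after speech has started.
    """
    silence_ms = 0
    started = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            started = True
            silence_ms = 0  # any speech resets the silence timer
        elif started:
            silence_ms += frame_ms
            if silence_ms >= grace_ms:
                return (i + 1) * frame_ms
    return None


# 500 ms frames: 1 s of speech, a 2 s thinking pause, 1 s of speech, then 3 s of silence.
utterance = [True] * 2 + [False] * 4 + [True] * 2 + [False] * 6

print(capture_end_ms(utterance, frame_ms=500, grace_ms=1500))  # 2500: cut off mid-pause
print(capture_end_ms(utterance, frame_ms=500, grace_ms=2500))  # 6500: whole turn captured
```

With a 1500 ms grace the two-second pause ends the turn early; with 2500 ms the speaker gets to finish, at the cost of a longer wait after the last word.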
Step 3: join the channel and test
With the model wired and capabilities clean, the agent can join. The most reliable way is the slash command (registered automatically by the Discord plugin):
/vc join channel: #voice-lab
The bot connects, plays a short greeting if you set one, and waits for speech. If join fails, the most common reasons are:
- The voice channel is full (Discord enforces user limits).
- A channel overwrite denies Connect for the bot. Discord applies overwrites in a fixed order (@everyone, then roles, then the member), and a member-level deny wins over any role-level allow.
- A captcha or rate limit on the bot. Setting gateway.discord.rateLimitDebug: true surfaces these.
To leave:
/vc leave
The agent flushes the audio queue and disconnects cleanly. As of v2026.5.4 the realtime pipeline bounds the paced audio queue and closes overloaded streams before provider audio piles up behind websocket backpressure, so leave-during-speech no longer leaves a stuck connection.
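The idea behind that fix can be sketched in a few lines. This is a toy illustration of a bounded queue that closes the stream instead of buffering without limit; the class name, the frame limit, and the close-on-overflow policy are assumptions for the example, not the gateway's actual implementation.

```python
from collections import deque
from typing import Deque, Optional


class BoundedAudioQueue:
    """Toy bounded queue: closes the stream rather than letting audio pile up."""

    def __init__(self, max_frames: int) -> None:
        self.max_frames = max_frames
        self.frames: Deque[bytes] = deque()
        self.closed = False

    def push(self, frame: bytes) -> bool:
        """Enqueue a frame; return False (and close) if the queue is overloaded."""
        if self.closed:
            return False
        if len(self.frames) >= self.max_frames:
            self.close()
            return False
        self.frames.append(frame)
        return True

    def pop(self) -> Optional[bytes]:
        """Return the next frame for paced playback, or None when empty."""
        return self.frames.popleft() if self.frames else None

    def close(self) -> None:
        """Drop pending audio and refuse further frames (for example on /vc leave)."""
        self.frames.clear()
        self.closed = True


q = BoundedAudioQueue(max_frames=2)
print(q.push(b"a"), q.push(b"b"), q.push(b"c"))  # the third push overflows and closes
print(q.closed, q.pop())
```

Because close() empties the queue, a leave in the middle of speech has nothing left to drain before the connection can go away.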
Step 4: tune capture for your room
The default 2.5s grace is a reasonable middle ground. If you are running the agent for one person in a quiet home office, push it down; if it is in a group voice chat with cross-talk, push it up. Two extra signals worth knowing about:
- Echo cancellation. Discord clients do this themselves, but if you are bridging in from a hardware mixer, you may pick up the agent’s own voice and feed it back into STT. Mute the bot’s playback path in your mixer.
- VAD aggressiveness. The voice-call plugin uses voice activity detection upstream of the silence grace. If short utterances (“yes”, “go ahead”) get dropped, lower the VAD threshold; if every breath triggers a capture, raise it. The plugin’s voice.vad.aggressiveness setting takes values 0–3, with 2 as the default.
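The threshold trade-off in the VAD bullet is easier to see with numbers. The sketch below is a generic energy gate, not the detector the plugin uses, and how the 0–3 aggressiveness values map onto any internal threshold is not something this post documents. The amplitudes and thresholds are invented for illustration.

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in the range [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech(frames: Sequence[Sequence[float]], threshold: float) -> List[bool]:
    """Label each frame as speech when its RMS level reaches the threshold."""
    return [rms(f) >= threshold for f in frames]


# Three made-up frames: a breath, a quiet "yes", and normal speech.
breath = [0.02] * 160
quiet_yes = [0.05] * 160
speech = [0.30] * 160
frames = [breath, quiet_yes, speech]

print(detect_speech(frames, threshold=0.10))  # strict: the quiet "yes" is dropped
print(detect_speech(frames, threshold=0.04))  # balanced: "yes" kept, breath ignored
print(detect_speech(frames, threshold=0.01))  # loose: even the breath triggers
```

Whatever frames the detector marks as speech are what reset the silence grace timer, so a detector that is too loose keeps captures open and one that is too strict loses short replies.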
FAQ
Does this work without paying for Realtime?
Yes. Any provider that registers a realtime audio model works. Gemini Live is the cheapest option as of writing and is what most self-hosters land on for personal use.
Can the agent speak first when it joins?
Yes. Set voice.greeting to the text you want spoken on join. The plugin sends it through TTS before opening the capture stream.
Why is the bot interrupting people?
Almost always either captureSilenceGraceMs is too low or VAD aggressiveness is too high. Start by raising the grace to 3000 ms and see if it stops.
Why does the audit show green but join still fails?
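Check the channel overwrites again. The audit reports the resolved permission as seen by the bot at the moment it runs, so a deny on Connect added afterward (on @everyone, on one of the bot’s roles, or on the bot user directly) can change the result between the audit and the join. A member-level deny overrides any role-level allow. If the permissions still resolve cleanly, look at the other join failures listed in Step 3, such as a full channel or a rate limit.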
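For reference, Discord's developer documentation describes channel overwrites as being applied in a fixed order: the @everyone overwrite first, then all of the member's role overwrites combined (denies before allows), then the member-specific overwrite. The sketch below follows that order. It deliberately leaves out the server-owner and Administrator shortcuts, and the ids and the function name are placeholders.

```python
from typing import Dict, Iterable, Tuple

CONNECT = 1 << 20  # Discord's Connect permission bit

Overwrite = Tuple[int, int]  # (allow bits, deny bits)


def resolve_channel_permissions(
    base: int,
    everyone_id: str,
    role_ids: Iterable[str],
    member_id: str,
    overwrites: Dict[str, Overwrite],
) -> int:
    """Apply channel overwrites to server-level permissions in Discord's documented order."""
    perms = base

    # 1. The @everyone overwrite.
    if everyone_id in overwrites:
        allow, deny = overwrites[everyone_id]
        perms = (perms & ~deny) | allow

    # 2. All role overwrites combined: denies are applied first, then allows.
    role_allow = role_deny = 0
    for rid in role_ids:
        if rid in overwrites:
            allow, deny = overwrites[rid]
            role_allow |= allow
            role_deny |= deny
    perms = (perms & ~role_deny) | role_allow

    # 3. The member-specific overwrite has the final say.
    if member_id in overwrites:
        allow, deny = overwrites[member_id]
        perms = (perms & ~deny) | allow

    return perms


# One role denies Connect, another allows it: the allow wins among roles.
ow = {"role-a": (0, CONNECT), "role-b": (CONNECT, 0)}
roles_only = resolve_channel_permissions(CONNECT, "everyone", ["role-a", "role-b"], "bot", ow)
print(bool(roles_only & CONNECT))

# A deny placed on the bot user itself overrides the role allow.
ow["bot"] = (0, CONNECT)
with_member_deny = resolve_channel_permissions(CONNECT, "everyone", ["role-a", "role-b"], "bot", ow)
print(bool(with_member_deny & CONNECT))
```

If the bot loses Connect unexpectedly, walking through these three steps for the affected channel usually shows which overwrite is responsible.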
Will voice work in Google Meet or Twilio dial-ins?
Yes — v2026.5.4 made Twilio Meet joins speak through the realtime Gemini bridge with paced audio and backpressure-aware buffering, which is also the path the Discord voice plugin uses internally. The configuration shape is the same; only the channel adapter differs.
Putting a Discord voice agent together
For a clean self-hosted Discord voice agent: upgrade to v2026.5.7, run channels capabilities until it is all green, set voice.captureSilenceGraceMs to 2500 (or your room’s number), and confirm the realtime model is reachable before /vc join. The two pieces people skip are the per-channel Discord permission and the silence grace, and they are both first-class in the new release.
If you are building this out as part of a wider personal AI agent setup, the voice bridge plays nicely with everything else the gateway hosts (text in Slack, screenshots from the file-transfer plugin, cron-driven proactive messages) without you having to glue providers together by hand.