A self-hosted Discord voice agent needs three pieces wired up correctly: the bot has to hold the Connect, Speak, and Read Message History permissions on the target channel, the gateway needs a realtime voice model registered, and the capture pipeline has to be tuned to the room so the agent does not cut speakers off. As of v2026.5.7 the gateway audits the first piece for you, ships a saner default for the second, and exposes a config knob for the third. This post walks through what changed, how to use the new audit, and how to stop the bot from talking over people.

What changed in v2026.5.7 for Discord voice

Two fixes in the latest stable release target Discord voice specifically. They look small in the notes and matter a lot in practice.

Change | What it does | Why it matters
Permission audit in channels capabilities | Reports Connect / Speak / Read Message History on configured voice channels, including auto-join targets | You see the missing permission before /vc join fails silently
voice.captureSilenceGraceMs (default 2.5s) | Waits 2.5s of silence after a user stops speaking before ending the capture window | Fewer chopped sentences, fewer false end-of-turn signals
Tightened STT prompt around live fragments | The model gets a cleaner prompt while speech-to-text is still streaming | Less confused early responses while a user is mid-sentence

If you have been running a Discord voice bot from an older build and it interrupts people or fails to join, both fixes are reasons to upgrade.

Prerequisites

Before configuring anything, confirm you have:

  • A self-hosted instance (pnpm install, Docker, or the bundled CLI). The Docker deployment guide covers a clean install.
  • A Discord application with a bot token, invited to your server with the bot and applications.commands scopes.
  • A voice channel you control. The bot user must have a role that includes Connect, Speak, and Read Message History on that channel.
  • A realtime-capable speech model configured (OpenAI Realtime, Gemini Live, or any provider with a registered TTS + STT pair). The bundled voice-call plugin handles the audio routing.

Step 1: invite the bot and confirm permissions

Discord layers per-channel permission overwrites on top of server-wide role permissions, so it is easy to give the bot View Channel everywhere and forget Connect on the one channel that matters. After inviting:

  1. Right-click the target voice channel in the Discord client and pick Edit Channel > Permissions.
  2. Add the bot’s role (or the bot user directly) to the permission list.
  3. Tick: View Channel, Connect, Speak, Use Voice Activity, Read Message History. Leave Priority Speaker off unless you want the bot to duck other users.

Then run the audit:

 channels capabilities

For each Discord voice channel the gateway knows about, you should see a green Connect/Speak/Read row. If any are red, fix the channel-level permission and rerun. The audit also covers auto-join targets configured under channels.discord.autoJoin, so you catch the right-channel-wrong-permission case before the bot actually tries to join.
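To make the green/red rows concrete, here is a minimal sketch of the kind of check an audit like this performs on a resolved permission bitfield. The bit positions come from Discord's developer documentation. The function name and the sample bitfield are hypothetical, and this is not the gateway's actual implementation.

```python
# Permission bit positions as listed in Discord's developer documentation.
VIEW_CHANNEL = 1 << 10
READ_MESSAGE_HISTORY = 1 << 16
CONNECT = 1 << 20
SPEAK = 1 << 21

# The three permissions the audit reports on, per this post.
REQUIRED = {
    "Connect": CONNECT,
    "Speak": SPEAK,
    "Read Message History": READ_MESSAGE_HISTORY,
}


def missing_permissions(resolved: int) -> list[str]:
    """Return names of required permissions absent from a resolved bitfield."""
    return [name for name, bit in REQUIRED.items() if not resolved & bit]


# Hypothetical resolved permissions for the bot on one voice channel:
# Speak was never ticked, so it should be the only red row.
bot_perms = VIEW_CHANNEL | CONNECT | READ_MESSAGE_HISTORY
print(missing_permissions(bot_perms))  # ['Speak']
```

An empty list corresponds to an all-green row. Anything else names the permission to go and tick in the channel settings.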

You can also probe at startup:

 channels status --probe

This walks every configured channel and surfaces missing capabilities in the same output stream as offline channels, which is what you want in a self-hosted personal agent setup where you do not want a silent failure mode.

Step 2: wire the voice model

voice-call is a bundled plugin, so it ships with the core install. Open .json and verify the plugin is enabled and pointed at a realtime model:

{
  "plugins": {
    "entries": [
      { "id": "@openclaw/voice-call", "enabled": true }
    ]
  },
  "voice": {
    "model": "openai/gpt-realtime",
    "voiceId": "alloy",
    "captureSilenceGraceMs": 2500
  }
}

The voice.model key picks the TTS+STT pair. Anything that exposes a realtime audio interface works; the provider catalog lists the cheaper options if you do not want to pay Realtime rates.

voiceId is provider-specific. For Gemini Live, use a voice name from the Live API roster instead.

captureSilenceGraceMs is the new knob. 2500 ms is the v2026.5.7 default and is what you want for most Discord servers — long enough that pauses, breaths, and “umm” gaps do not end the turn, short enough that the agent does not feel laggy. Tune up to 3500 ms if you are in a noisy room where the silence detector keeps triggering early; tune down to 1500 ms if you want a snappier back-and-forth and the room is quiet.
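If you want a quick sanity check before restarting the gateway, something like the following sketch can catch the obvious mistakes. It assumes the config shape shown above. The check_voice_config helper is hypothetical, and the 1500-3500 ms window is only the tuning range suggested in this post, not a limit the gateway enforces.

```python
import json

# Same shape as the config shown above; values are examples only.
CONFIG_TEXT = """
{
  "plugins": {"entries": [{"id": "@openclaw/voice-call", "enabled": true}]},
  "voice": {"model": "openai/gpt-realtime", "voiceId": "alloy",
            "captureSilenceGraceMs": 2500}
}
"""


def check_voice_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means it looks sane."""
    problems = []
    entries = config.get("plugins", {}).get("entries", [])
    if not any(e.get("id") == "@openclaw/voice-call" and e.get("enabled") for e in entries):
        problems.append("voice-call plugin is missing or disabled")
    voice = config.get("voice", {})
    if not voice.get("model"):
        problems.append("voice.model is not set")
    grace = voice.get("captureSilenceGraceMs", 2500)  # 2500 ms is the stated default
    if isinstance(grace, bool) or not isinstance(grace, (int, float)) or grace <= 0:
        problems.append("voice.captureSilenceGraceMs must be a positive number")
    elif not 1500 <= grace <= 3500:
        # Suggested tuning range from the text, not a hard limit.
        problems.append(f"captureSilenceGraceMs={grace} is outside the suggested 1500-3500 ms range")
    return problems


print(check_voice_config(json.loads(CONFIG_TEXT)))  # []
```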

Step 3: join the channel and test

With the model wired and capabilities clean, the agent can join. The most reliable way is the slash command (registered automatically by the Discord plugin):

/vc join channel: #voice-lab

The bot connects, plays a short greeting if you set one, and waits for speech. If join fails, the most common reasons are:

  • The voice channel is full (Discord enforces user limits).
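  • A channel overwrite denies Connect for the bot. Discord applies @everyone, role, and member overwrites in that order, so a role-level deny sticks unless another of the bot's roles (or a member overwrite) allows Connect, and a member-level deny wins over any role-level allow.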
  • A captcha or rate limit on the bot. gateway.discord.rateLimitDebug: true surfaces these.

To leave:

/vc leave

The agent flushes the audio queue and disconnects cleanly. As of v2026.5.4 the realtime pipeline bounds the paced audio queue and closes overloaded streams before provider audio piles up behind websocket backpressure, so leave-during-speech no longer leaves a stuck connection.
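The idea behind a bounded queue is easier to see in code. The sketch below is a simplified illustration, assuming a fixed frame limit and a close-on-overflow policy. The class name and the limit are hypothetical, and the real pipeline's pacing and backpressure handling is certainly more involved.

```python
from collections import deque
from typing import Optional


class BoundedAudioQueue:
    """Hold outbound audio frames up to a fixed limit, then close the stream."""

    def __init__(self, max_frames: int) -> None:
        self.max_frames = max_frames
        self.frames: deque = deque()
        self.closed = False

    def push(self, frame: bytes) -> bool:
        """Queue a frame; return False if the stream is closed or overloaded."""
        if self.closed:
            return False
        if len(self.frames) >= self.max_frames:
            # Overloaded: stop accepting audio rather than buffering without bound.
            self.close()
            return False
        self.frames.append(frame)
        return True

    def pop(self) -> Optional[bytes]:
        """Take the next frame to send at the paced rate, if any."""
        return self.frames.popleft() if self.frames else None

    def close(self) -> None:
        """Drop anything still queued, for example on a leave during speech."""
        self.frames.clear()
        self.closed = True


queue = BoundedAudioQueue(max_frames=3)
results = [queue.push(b"\x00" * 4) for _ in range(5)]
print(results, queue.closed)  # [True, True, True, False, False] True
```

Closing on overflow trades a dropped utterance for a connection that never gets stuck behind a backlog, which matches the behavior described above.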

Step 4: tune capture for your room

The default 2.5s grace is a reasonable middle ground. If you are running the agent for one person in a quiet home office, push it down; if it is in a group voice chat with cross-talk, push it up. Two extra signals worth knowing about:

  • Echo cancellation. Discord clients do this themselves, but if you are bridging in from a hardware mixer, you may pick up the agent’s own voice and feed it back into STT. Mute the bot’s playback path in your mixer.
  • VAD aggressiveness. The voice-call plugin uses voice activity detection upstream of the silence grace. If short utterances (“yes”, “go ahead”) get dropped, lower the VAD threshold; if every breath triggers a capture, raise it. The plugin’s voice.vad.aggressiveness setting takes values 0–3, with 2 as the default.
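To see why the aggressiveness setting drops quiet utterances at one level and picks up breaths at another, here is a toy energy-based detector. It is a stand-in for illustration only: the plugin's actual VAD algorithm is not described here, and the RMS thresholds mapped to levels 0-3 are invented for the example.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


# Hypothetical mapping from aggressiveness 0-3 to an RMS threshold.
# Higher aggressiveness means a frame must be louder to count as speech.
THRESHOLDS = {0: 200.0, 1: 500.0, 2: 1000.0, 3: 2000.0}


def is_speech(frame: bytes, aggressiveness: int = 2) -> bool:
    """Classify a frame as speech if its level reaches the threshold."""
    return frame_rms(frame) >= THRESHOLDS[aggressiveness]


def tone(amplitude: int, samples: int = 160) -> bytes:
    """Build a constant-amplitude square wave as a test frame."""
    values = [amplitude if i % 2 == 0 else -amplitude for i in range(samples)]
    return struct.pack(f"<{samples}h", *values)


quiet_yes = tone(700)  # a soft, short utterance
breath = tone(300)     # low-level breath noise
print(is_speech(quiet_yes, 2), is_speech(quiet_yes, 1))  # False True
print(is_speech(breath, 1), is_speech(breath, 0))        # False True
```

In this toy model, level 1 is the setting that keeps the quiet "yes" and still ignores the breath. The right value for a real room is found the same way, by testing with the voices that actually use it.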

FAQ

Does this work without paying for Realtime?

Yes. Any provider that registers a realtime audio model works. Gemini Live is the cheapest option as of writing and is what most self-hosters land on for personal use.

Can the agent speak first when it joins?

Yes, set voice.greeting to the text you want spoken on join. The plugin sends it through TTS before opening the capture stream.

Why is the bot interrupting people?

Almost always either captureSilenceGraceMs is too low or VAD aggressiveness is too high. Start by raising the grace to 3000 ms and see if it stops.
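A small simulation shows how the grace value decides whether a mid-sentence pause ends the turn. This is a sketch under simple assumptions (20 ms frames, speech already classified by the VAD). The turn_end_ms function is hypothetical and not the plugin's code.

```python
def turn_end_ms(speech_frames_ms, grace_ms, frame_ms=20):
    """Return the time at which the capture window would close, or None.

    speech_frames_ms holds the start times of frames classified as speech.
    The window closes once grace_ms of silence has passed after a speech
    frame ends without another speech frame starting.
    """
    if not speech_frames_ms:
        return None
    times = sorted(speech_frames_ms)
    for current, following in zip(times, times[1:]):
        silence_start = current + frame_ms
        if following - silence_start >= grace_ms:
            return silence_start + grace_ms
    return times[-1] + frame_ms + grace_ms


# A speaker talks for 1 s, pauses for 2 s to think, then talks for 1 s more.
frames = list(range(0, 1000, 20)) + list(range(3000, 4000, 20))

print(turn_end_ms(frames, 1500))  # 2500: turn ends during the pause
print(turn_end_ms(frames, 2500))  # 6500: the pause is tolerated
print(turn_end_ms(frames, 3000))  # 7000: tolerated, with a longer wait at the end
```

The cost of a longer grace is visible in the last two lines: the agent waits that much longer after the speaker has genuinely finished.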

Why does the audit show green but join still fails?

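Check the channel's permission overwrites. A member-level deny on Connect overrides any role-level allow, and a role-level deny applies unless another of the bot's roles allows Connect on that channel. The audit reports the resolved permission as seen by the bot at the time it runs, so the result can go stale if roles or overwrites change between the audit and the join.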
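Discord's developer documentation describes overwrite resolution as a fixed sequence: the @everyone overwrite first, then all role overwrites combined, then the member-specific overwrite. The sketch below follows that documented order. The Overwrite class and resolve function are illustrative names, not part of any library.

```python
from dataclasses import dataclass

ADMINISTRATOR = 1 << 3
CONNECT = 1 << 20


@dataclass
class Overwrite:
    allow: int = 0
    deny: int = 0


def resolve(base, everyone, roles, member):
    """Apply channel overwrites to server-level permissions in documented order."""
    if base & ADMINISTRATOR:
        return ~0  # all bits set: administrators bypass channel overwrites
    perms = (base & ~everyone.deny) | everyone.allow
    role_allow = role_deny = 0
    for overwrite in roles:
        role_allow |= overwrite.allow
        role_deny |= overwrite.deny
    # Role denies are applied before role allows, so an allow on any role wins.
    perms = (perms & ~role_deny) | role_allow
    return (perms & ~member.deny) | member.allow


# One of the bot's roles denies Connect, another allows it: Connect survives.
mixed = resolve(CONNECT, Overwrite(), [Overwrite(deny=CONNECT), Overwrite(allow=CONNECT)], Overwrite())
# A deny set directly on the bot user beats the role-level allow.
blocked = resolve(CONNECT, Overwrite(), [Overwrite(allow=CONNECT)], Overwrite(deny=CONNECT))
print(bool(mixed & CONNECT), bool(blocked & CONNECT))  # True False
```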

Will voice work in Google Meet or Twilio dial-ins?

Yes — v2026.5.4 made Twilio Meet joins speak through the realtime Gemini bridge with paced audio and backpressure-aware buffering, which is also the path the Discord voice plugin uses internally. The configuration shape is the same; only the channel adapter differs.

Putting a Discord voice agent together

For a clean self-hosted Discord voice agent on the current release: upgrade to v2026.5.7, run channels capabilities until it is all green, set voice.captureSilenceGraceMs to 2500 (or your room's number), and confirm the realtime model is reachable before /vc join. The two pieces people skip are the per-channel Discord permission and the silence grace, and they are both first-class in the new release.

If you are building this out as part of a wider personal AI agent setup, the voice bridge plays nicely with everything else the gateway hosts (text in Slack, screenshots from the file-transfer plugin, cron-driven proactive messages) without you having to glue providers together by hand.
