# Voice Input

Realtime speech-to-text, voice activity detection, mic capture, and barge-in.

AvatarLayer supports a full voice input pipeline: microphone capture, optional voice activity detection (VAD), realtime speech-to-text, and automatic barge-in to interrupt the avatar when the user speaks.
## Setup

Voice input requires a `realtimeSTT` adapter in the session config:
```typescript
import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
  DeepgramSTTAdapter,
} from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  tts: new ElevenLabsAdapter({ apiKey: "..." }),
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  realtimeSTT: new DeepgramSTTAdapter({ apiKey: "..." }),
  voice: {
    bargeIn: true, // default: true
    bargeInMinLength: 2, // default: 2 characters
  },
});
```

## Starting voice input
Pass an `AsyncIterable<Float32Array>` audio source to `startListening`. The SDK provides `MicCapture` for browser microphone access:
```typescript
import { MicCapture } from "avatarlayer";

const mic = new MicCapture();
await mic.start();
await session.startListening(mic);
```

### MicCapture options
| Option | Type | Default | Description |
|---|---|---|---|
| `workletUrl` | `string` | `"/audio/microphone-worklet.js"` | URL to the AudioWorklet processor |
| `sampleRate` | `number` | `16000` | Output sample rate |
| `bufferSize` | `number` | `4096` | Buffer size in samples |
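Because `startListening` accepts any `AsyncIterable<Float32Array>`, you are not limited to `MicCapture`. As an illustration (not part of the SDK), here is a hypothetical audio source that replays a prerecorded mono buffer in `MicCapture`-style fixed-size chunks:

```typescript
// Split a mono sample buffer into fixed-size chunks, mirroring the
// bufferSize option above. The last chunk may be shorter.
function chunkBuffer(buffer: Float32Array, bufferSize = 4096): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let i = 0; i < buffer.length; i += bufferSize) {
    chunks.push(buffer.subarray(i, Math.min(i + bufferSize, buffer.length)));
  }
  return chunks;
}

// Hypothetical replacement for MicCapture: yields chunks as an
// AsyncIterable<Float32Array>, the shape startListening expects.
async function* replaySource(
  buffer: Float32Array,
  bufferSize = 4096,
): AsyncIterable<Float32Array> {
  for (const chunk of chunkBuffer(buffer, bufferSize)) {
    yield chunk;
  }
}

// Usage (assumes an existing session and a decoded sample buffer):
// await session.startListening(replaySource(samples));
```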
## Stopping voice input

```typescript
session.stopListening();
```

For a graceful stop that waits for the final transcript:

```typescript
session.stopListening({ drain: true });
```

With `drain`, the microphone stops but the STT connection stays alive until the server delivers the final transcript (or a 5-second safety timeout fires).
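The drain-with-timeout behavior can be pictured as a race between the final transcript and a safety timer. This is a sketch of the pattern, not the SDK's internal code:

```typescript
// Wait for the final transcript, but give up after a safety timeout
// (the SDK uses 5000 ms). Resolves to null if the timeout fires first.
function waitForFinalTranscript(
  finalTranscript: Promise<string>,
  timeoutMs = 5000,
): Promise<string | null> {
  const timeout = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), timeoutMs),
  );
  return Promise.race([finalTranscript, timeout]);
}
```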
## Barge-in

When `voice.bargeIn` is `true` (the default), receiving a final transcript while the avatar is speaking or thinking will automatically:

- Call `interrupt()` to stop the current pipeline
- Send the transcribed text as a new message via `sendMessage()`

The `bargeInMinLength` option (default `2`) prevents very short noise transcripts from triggering interrupts.
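The decision described above can be modeled as a small predicate. This is an illustrative sketch of the rules, not the SDK's actual implementation (the state names are assumptions):

```typescript
type AvatarState = "idle" | "thinking" | "speaking";

interface VoiceConfig {
  bargeIn?: boolean; // default: true
  bargeInMinLength?: number; // default: 2
}

// Returns true when a final transcript should interrupt the avatar
// and be resent as a new message, per the barge-in rules above.
function shouldBargeIn(
  transcript: string,
  state: AvatarState,
  config: VoiceConfig = {},
): boolean {
  const { bargeIn = true, bargeInMinLength = 2 } = config;
  if (!bargeIn) return false;
  // Only interrupt while the avatar is speaking or thinking.
  if (state !== "speaking" && state !== "thinking") return false;
  // Filter out very short noise transcripts.
  return transcript.trim().length >= bargeInMinLength;
}
```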
## Voice activity detection (VAD)

VAD detects when the user starts and stops speaking. You can provide a `vad` adapter in the session config:
```typescript
import { AmplitudeVADAdapter } from "avatarlayer";

const session = new AvatarSession({
  // ...other config
  vad: new AmplitudeVADAdapter({
    positiveSpeechThreshold: 0.02, // RMS threshold to start
    negativeSpeechThreshold: 0.01, // RMS threshold to stop
    minSpeechMs: 250,
    silenceDurationMs: 500,
  }),
});
```

### AmplitudeVADAdapter
Simple RMS-based VAD that runs on the main thread. Good enough for many use cases.
| Option | Type | Default | Description |
|---|---|---|---|
| `positiveSpeechThreshold` | `number` | `0.02` | RMS threshold to trigger speech start |
| `negativeSpeechThreshold` | `number` | `0.01` | RMS threshold to trigger speech end |
| `minSpeechMs` | `number` | `250` | Minimum speech duration before committing |
| `silenceDurationMs` | `number` | `500` | Silence duration before triggering speech end |
| `speechPadMs` | `number` | `30` | Padding around speech segments |
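The dual thresholds implement hysteresis: speech starts only above the positive threshold and ends only below the negative one, so levels in between do not cause flapping. A minimal sketch of that core idea (illustrative only, not `AmplitudeVADAdapter`'s source; the timing options `minSpeechMs` and `silenceDurationMs` are omitted for brevity):

```typescript
// Root-mean-square level of one audio frame.
function rms(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

// RMS-based VAD with hysteresis between the two thresholds.
class SimpleAmplitudeVAD {
  private speaking = false;

  constructor(
    private positiveSpeechThreshold = 0.02,
    private negativeSpeechThreshold = 0.01,
  ) {}

  // Returns "start" or "end" on a state transition, otherwise null.
  process(frame: Float32Array): "start" | "end" | null {
    const level = rms(frame);
    if (!this.speaking && level >= this.positiveSpeechThreshold) {
      this.speaking = true;
      return "start";
    }
    if (this.speaking && level <= this.negativeSpeechThreshold) {
      this.speaking = false;
      return "end";
    }
    return null; // no transition (includes levels between the thresholds)
  }
}
```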
### SileroVADAdapter (local)

Neural-network VAD running on-device via ONNX. More accurate than amplitude-based VAD, especially in noisy environments. Available from the `avatarlayer/local` subpath:
```typescript
import { SileroVADAdapter } from "avatarlayer/local";

const vad = new SileroVADAdapter({
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  minSpeechMs: 250,
  silenceDurationMs: 500,
});
await vad.init();
```

See Local ML for more on-device options.
## Events

| Event | Payload | Description |
|---|---|---|
| `listening-change` | `boolean` | Voice input started or stopped |
| `transcript` | `(text: string, opts: TranscriptionTextOptions)` | Partial or final transcript from realtime STT |
`TranscriptionTextOptions` includes:

| Field | Type | Description |
|---|---|---|
| `final` | `boolean` | Whether this is the final transcript for a completed utterance |
| `fullText` | `string` | Full accumulated text across all utterances |
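One way to picture the relationship between `final` and `fullText` is as an accumulator: partial transcripts preview the current utterance against the committed history, and a final transcript commits it. This is a hypothetical model for illustration, not SDK code:

```typescript
// Hypothetical accumulator: shows how partial and final transcripts
// could combine into a fullText value like the field described above.
class TranscriptAccumulator {
  private committed = "";

  // Returns the full accumulated text after applying a transcript.
  apply(text: string, final: boolean): string {
    const fullText = (this.committed + " " + text).trim();
    if (final) this.committed = fullText; // commit the completed utterance
    return fullText;
  }
}
```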
## React: useMic

In React, use the `useMic` hook for one-click mic control:

```tsx
import { useAvatarSession, useMic } from "avatarlayer/react";

function VoiceControls() {
  const { listening } = useAvatarSession();
  const { startMic, stopMic } = useMic();

  return (
    <button onClick={listening ? stopMic : startMic}>
      {listening ? "Stop listening" : "Start listening"}
    </button>
  );
}
```