Voice Input

Realtime speech-to-text, voice activity detection, mic capture, and barge-in.

AvatarLayer supports a full voice input pipeline: microphone capture, optional voice activity detection (VAD), realtime speech-to-text, and automatic barge-in to interrupt the avatar when the user speaks.

Setup

Voice input requires a realtimeSTT adapter in the session config:

import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
  DeepgramSTTAdapter,
} from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  tts: new ElevenLabsAdapter({ apiKey: "..." }),
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  realtimeSTT: new DeepgramSTTAdapter({ apiKey: "..." }),
  voice: {
    bargeIn: true,         // default: true
    bargeInMinLength: 2,   // default: 2 characters
  },
});

Starting voice input

Pass an AsyncIterable<Float32Array> audio source to startListening. The SDK provides MicCapture for browser microphone access:

import { MicCapture } from "avatarlayer";

const mic = new MicCapture();
await mic.start();

await session.startListening(mic);
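Because startListening accepts any AsyncIterable<Float32Array>, you can also feed it a custom source, such as pre-recorded PCM audio. A minimal, self-contained sketch (the pcmChunks helper is illustrative, not part of the SDK):

```typescript
// Illustrative only: wrap a buffer of PCM samples as the
// AsyncIterable<Float32Array> shape that startListening expects.
async function* pcmChunks(
  samples: Float32Array,
  chunkSize = 4096,
): AsyncIterable<Float32Array> {
  for (let i = 0; i < samples.length; i += chunkSize) {
    // subarray creates a view, so no samples are copied
    yield samples.subarray(i, Math.min(i + chunkSize, samples.length));
  }
}

// Example: 1 second of audio at 16 kHz, split into 4096-sample chunks.
const source = pcmChunks(new Float32Array(16000));
```

Passing source to session.startListening would stream the recording through the same STT path as live microphone audio.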

MicCapture options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| workletUrl | string | "/audio/microphone-worklet.js" | URL to the AudioWorklet processor |
| sampleRate | number | 16000 | Output sample rate |
| bufferSize | number | 4096 | Buffer size in samples |
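For example, to self-host the worklet and trade buffer size for latency (option names and defaults come from the table above; the specific values here are illustrative):

```typescript
const mic = new MicCapture({
  workletUrl: "/audio/microphone-worklet.js", // AudioWorklet processor URL
  sampleRate: 16000,                          // output sample rate in Hz
  bufferSize: 2048,                           // smaller buffer, lower latency
});
```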

Stopping voice input

session.stopListening();

For a graceful stop that waits for the final transcript:

session.stopListening({ drain: true });

With drain, the microphone stops but the STT connection stays alive until the server delivers the final transcript (or a 5-second safety timeout fires).
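The safety timeout described above amounts to racing the final-transcript promise against a 5-second timer. A self-contained sketch of that pattern (not SDK code):

```typescript
// Illustrative helper: resolve with the promise's value, or with
// `fallback` if `ms` milliseconds pass first.
function withTimeout<T>(promise: Promise<T>, ms: number, fallback: T): Promise<T> {
  const timer = new Promise<T>((resolve) =>
    setTimeout(() => resolve(fallback), ms),
  );
  return Promise.race([promise, timer]);
}
```

A drain implementation along these lines would await something like withTimeout(finalTranscript, 5000, "") before closing the STT connection.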

Barge-in

When voice.bargeIn is true (the default), receiving a final transcript while the avatar is speaking or thinking will automatically:

  1. Call interrupt() to stop the current pipeline
  2. Send the transcribed text as a new message via sendMessage()

The bargeInMinLength option (default 2) prevents very short noise transcripts from triggering interrupts.
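The gating logic amounts to: interrupt only when the avatar is busy and the transcript clears the minimum length. A self-contained sketch (the function name and state values are illustrative, not SDK API):

```typescript
type AvatarState = "idle" | "thinking" | "speaking";

// Illustrative: should a final transcript trigger barge-in?
function shouldBargeIn(
  transcript: string,
  state: AvatarState,
  minLength = 2, // mirrors voice.bargeInMinLength
): boolean {
  const busy = state === "thinking" || state === "speaking";
  return busy && transcript.trim().length >= minLength;
}
```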

Voice activity detection (VAD)

VAD detects when the user starts and stops speaking. You can provide a vad adapter in the session config:

import { AmplitudeVADAdapter } from "avatarlayer";

const session = new AvatarSession({
  // ...other config
  vad: new AmplitudeVADAdapter({
    positiveSpeechThreshold: 0.02,  // RMS threshold to start
    negativeSpeechThreshold: 0.01,  // RMS threshold to stop
    minSpeechMs: 250,
    silenceDurationMs: 500,
  }),
});

AmplitudeVADAdapter

Simple RMS-based VAD that runs on the main thread. Good enough for many use cases.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| positiveSpeechThreshold | number | 0.02 | RMS threshold to trigger speech start |
| negativeSpeechThreshold | number | 0.01 | RMS threshold to trigger speech end |
| minSpeechMs | number | 250 | Minimum speech duration before committing |
| silenceDurationMs | number | 500 | Silence duration before triggering speech end |
| speechPadMs | number | 30 | Padding around speech segments |

SileroVADAdapter (local)

Neural-network VAD running via ONNX on-device. More accurate than amplitude, especially in noisy environments. Available from the avatarlayer/local subpath:

import { SileroVADAdapter } from "avatarlayer/local";

const vad = new SileroVADAdapter({
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  minSpeechMs: 250,
  silenceDurationMs: 500,
});

await vad.init();

See Local ML for more on-device options.

Events

| Event | Payload | Description |
| --- | --- | --- |
| listening-change | boolean | Voice input started or stopped |
| transcript | (text: string, opts: TranscriptionTextOptions) | Partial or final transcript from realtime STT |

TranscriptionTextOptions includes:

| Field | Type | Description |
| --- | --- | --- |
| final | boolean | Whether this is the final transcript for a completed utterance |
| fullText | string | Full accumulated text across all utterances |
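The relationship between per-utterance text and fullText can be modelled as a small accumulator: partial transcripts preview the combined text, and final ones commit it. An illustrative sketch, not the SDK's internal implementation:

```typescript
// Illustrative: accumulate final utterances into a running fullText,
// matching the shape reported via TranscriptionTextOptions.
class TranscriptAccumulator {
  private committed = "";

  handle(text: string, final: boolean): { final: boolean; fullText: string } {
    const fullText = this.committed ? `${this.committed} ${text}` : text;
    if (final) this.committed = fullText; // commit completed utterances only
    return { final, fullText };
  }
}
```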

React: useMic

In React, use the useMic hook for one-click mic control:

import { useAvatarSession, useMic } from "avatarlayer/react";

function VoiceControls() {
  const { listening } = useAvatarSession();
  const { startMic, stopMic } = useMic();

  return (
    <button onClick={listening ? stopMic : startMic}>
      {listening ? "Stop listening" : "Start listening"}
    </button>
  );
}