# Voice Input

Realtime speech-to-text, voice activity detection, mic capture, and barge-in.

AvatarLayer supports a full voice input pipeline: microphone capture, optional voice activity detection (VAD), realtime speech-to-text, and automatic barge-in to interrupt the avatar when the user speaks.
## Setup

Voice input requires a `realtimeSTT` adapter in the session config:
```typescript
import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
  DeepgramSTTAdapter,
} from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  tts: new ElevenLabsAdapter({ apiKey: "..." }),
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  realtimeSTT: new DeepgramSTTAdapter({ apiKey: "..." }),
  voice: {
    bargeIn: true, // default: true
    bargeInMinLength: 2, // default: 2 characters
  },
});
```

## Starting voice input
Pass an `AsyncIterable<Float32Array>` audio source to `startListening`. The SDK provides `MicCapture` for browser microphone access:
```typescript
import { MicCapture } from "avatarlayer";

const mic = new MicCapture();
await mic.start();
await session.startListening(mic);
```

### MicCapture options
| Option | Type | Default | Description |
|---|---|---|---|
| `workletUrl` | `string` | `"/audio/microphone-worklet.js"` | URL to the AudioWorklet processor |
| `sampleRate` | `number` | `16000` | Output sample rate |
| `bufferSize` | `number` | `4096` | Buffer size in samples |
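Because `startListening` accepts any `AsyncIterable<Float32Array>`, you are not limited to `MicCapture`. As an illustration (not part of the SDK), here is a hypothetical audio source that replays a prerecorded mono buffer in `MicCapture`-style fixed-size chunks:

```typescript
// Split a mono sample buffer into fixed-size chunks, mirroring the
// bufferSize option above. The last chunk may be shorter.
function chunkBuffer(buffer: Float32Array, bufferSize = 4096): Float32Array[] {
  const chunks: Float32Array[] = [];
  for (let i = 0; i < buffer.length; i += bufferSize) {
    chunks.push(buffer.subarray(i, Math.min(i + bufferSize, buffer.length)));
  }
  return chunks;
}

// Hypothetical replacement for MicCapture: yields chunks as an
// AsyncIterable<Float32Array>, the shape startListening expects.
async function* replaySource(
  buffer: Float32Array,
  bufferSize = 4096,
): AsyncIterable<Float32Array> {
  for (const chunk of chunkBuffer(buffer, bufferSize)) {
    yield chunk;
  }
}

// Usage (assumes an existing session and a decoded sample buffer):
// await session.startListening(replaySource(samples));
```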
## Stopping voice input

```typescript
session.stopListening();
```

For a graceful stop that waits for the final transcript:

```typescript
session.stopListening({ drain: true });
```

With `drain`, the microphone stops but the STT connection stays alive until the server delivers the final transcript (or a 5-second safety timeout fires).
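The drain-with-timeout behavior can be pictured as a race between the final transcript and a safety timer. This is a sketch of the pattern, not the SDK's internal code:

```typescript
// Wait for the final transcript, but give up after a safety timeout
// (the SDK uses 5000 ms). Resolves to null if the timeout fires first.
function waitForFinalTranscript(
  finalTranscript: Promise<string>,
  timeoutMs = 5000,
): Promise<string | null> {
  const timeout = new Promise<null>((resolve) =>
    setTimeout(() => resolve(null), timeoutMs),
  );
  return Promise.race([finalTranscript, timeout]);
}
```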
## Barge-in

When `voice.bargeIn` is `true` (the default), receiving a final transcript while the avatar is speaking or thinking will automatically:

- Call `interrupt()` to stop the current pipeline
- Send the transcribed text as a new message via `sendMessage()`

The `bargeInMinLength` option (default `2`) prevents very short noise transcripts from triggering interrupts.
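The decision described above can be modeled as a small predicate. This is an illustrative sketch of the rules, not the SDK's actual implementation (the state names are assumptions):

```typescript
type AvatarState = "idle" | "thinking" | "speaking";

interface VoiceConfig {
  bargeIn?: boolean; // default: true
  bargeInMinLength?: number; // default: 2
}

// Returns true when a final transcript should interrupt the avatar
// and be resent as a new message, per the barge-in rules above.
function shouldBargeIn(
  transcript: string,
  state: AvatarState,
  config: VoiceConfig = {},
): boolean {
  const { bargeIn = true, bargeInMinLength = 2 } = config;
  if (!bargeIn) return false;
  // Only interrupt while the avatar is speaking or thinking.
  if (state !== "speaking" && state !== "thinking") return false;
  // Filter out very short noise transcripts.
  return transcript.trim().length >= bargeInMinLength;
}
```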
## Voice activity detection (VAD)

VAD detects when the user starts and stops speaking. You can provide a `vad` adapter in the session config:
```typescript
import { AmplitudeVADAdapter } from "avatarlayer";

const session = new AvatarSession({
  // ...other config
  vad: new AmplitudeVADAdapter({
    positiveSpeechThreshold: 0.02, // RMS threshold to start
    negativeSpeechThreshold: 0.01, // RMS threshold to stop
    minSpeechMs: 250,
    silenceDurationMs: 500,
  }),
});
```

### AmplitudeVADAdapter
Simple RMS-based VAD that runs on the main thread. Good enough for many use cases.
| Option | Type | Default | Description |
|---|---|---|---|
| `positiveSpeechThreshold` | `number` | `0.02` | RMS threshold to trigger speech start |
| `negativeSpeechThreshold` | `number` | `0.01` | RMS threshold to trigger speech end |
| `minSpeechMs` | `number` | `250` | Minimum speech duration before committing |
| `silenceDurationMs` | `number` | `500` | Silence duration before triggering speech end |
| `speechPadMs` | `number` | `30` | Padding around speech segments |
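The dual thresholds implement hysteresis: speech starts only above the positive threshold and ends only below the negative one, so levels in between do not cause flapping. A minimal sketch of that core idea (illustrative only, not `AmplitudeVADAdapter`'s source; the timing options `minSpeechMs` and `silenceDurationMs` are omitted for brevity):

```typescript
// Root-mean-square level of one audio frame.
function rms(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

// RMS-based VAD with hysteresis between the two thresholds.
class SimpleAmplitudeVAD {
  private speaking = false;

  constructor(
    private positiveSpeechThreshold = 0.02,
    private negativeSpeechThreshold = 0.01,
  ) {}

  // Returns "start" or "end" on a state transition, otherwise null.
  process(frame: Float32Array): "start" | "end" | null {
    const level = rms(frame);
    if (!this.speaking && level >= this.positiveSpeechThreshold) {
      this.speaking = true;
      return "start";
    }
    if (this.speaking && level <= this.negativeSpeechThreshold) {
      this.speaking = false;
      return "end";
    }
    return null; // no transition (includes levels between the thresholds)
  }
}
```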
### SileroVADAdapter (local)

Neural-network VAD running on-device via ONNX. More accurate than amplitude-based VAD, especially in noisy environments. Available from the `avatarlayer/local` subpath:
```typescript
import { SileroVADAdapter } from "avatarlayer/local";

const vad = new SileroVADAdapter({
  positiveSpeechThreshold: 0.5,
  negativeSpeechThreshold: 0.35,
  minSpeechMs: 250,
  silenceDurationMs: 500,
});
await vad.init();
```

See Local ML for more on-device options.
## Events

| Event | Payload | Description |
|---|---|---|
| `listening-change` | `boolean` | Voice input started or stopped |
| `transcript` | `(text: string, opts: TranscriptionTextOptions)` | Partial or final transcript from realtime STT |
`TranscriptionTextOptions` includes:

| Field | Type | Description |
|---|---|---|
| `final` | `boolean` | Whether this is the final transcript for a completed utterance |
| `fullText` | `string` | Full accumulated text across all utterances |
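One way to picture the relationship between `final` and `fullText` is as an accumulator: partial transcripts preview the current utterance against the committed history, and a final transcript commits it. This is a hypothetical model for illustration, not SDK code:

```typescript
// Hypothetical accumulator: shows how partial and final transcripts
// could combine into a fullText value like the field described above.
class TranscriptAccumulator {
  private committed = "";

  // Returns the full accumulated text after applying a transcript.
  apply(text: string, final: boolean): string {
    const fullText = (this.committed + " " + text).trim();
    if (final) this.committed = fullText; // commit the completed utterance
    return fullText;
  }
}
```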
## React: useMic

In React, use the `useMic` hook for one-click mic control:

```tsx
import { useAvatarSession, useMic } from "avatarlayer/react";

function VoiceControls() {
  const { listening } = useAvatarSession();
  const { startMic, stopMic } = useMic();

  return (
    <button onClick={listening ? stopMic : startMic}>
      {listening ? "Stop listening" : "Start listening"}
    </button>
  );
}
```