STT Adapters

Configure speech-to-text adapters — batch transcription and realtime streaming.

AvatarLayer provides two categories of STT adapters:

  • Batch — Transcribe a complete audio blob in one request (STTProvider)
  • Realtime — Stream audio frames over a persistent connection and receive partial/final transcripts (RealtimeSTTProvider)

Realtime adapters are used with startListening() for the voice input pipeline. Batch adapters are used for one-off transcription via the stt config field.
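For a concrete sense of the difference, a batch adapter can also be called directly through its transcribe() method (see the STTProvider interface at the end of this page). A minimal sketch, assuming the audio Blob comes from your own recording code (for example a MediaRecorder):

import { OpenAISTTAdapter } from "avatarlayer";

const stt = new OpenAISTTAdapter({ apiKey: "sk-..." });

// Batch: one complete recording in, one transcript string out.
async function transcribeRecording(recording: Blob): Promise<string> {
  return stt.transcribe(recording);
}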

Batch adapters

OpenAI STT

import { OpenAISTTAdapter } from "avatarlayer";

const stt = new OpenAISTTAdapter({
  apiKey: "sk-...",
  model: "whisper-1",  // optional
});

Option  | Type   | Default        | Description
apiKey  | string | required       | OpenAI API key
model   | string | "whisper-1"    | Model identifier
baseURL | string | OpenAI default | Base URL for API

Google STT

import { GoogleSTTAdapter } from "avatarlayer";

const stt = new GoogleSTTAdapter({
  apiKey: "...",
  languageCode: "en-US",  // optional
});

Option          | Type   | Default  | Description
apiKey          | string | required | Google Cloud API key
sampleRateHertz | number | 16000    | Audio sample rate
languageCode    | string | "en-US"  | BCP-47 language code

Azure STT

import { AzureSTTAdapter } from "avatarlayer";

const stt = new AzureSTTAdapter({
  subscriptionKey: "...",
  region: "eastus",
  language: "en-US",  // optional
});

Option          | Type   | Default  | Description
subscriptionKey | string | required | Azure Speech subscription key
region          | string | required | Azure region
language        | string | "en-US"  | Recognition language

Realtime adapters

Realtime adapters are designed for the voice input pipeline. They maintain a persistent connection, accept streamed audio frames, and emit transcript events with partial and final results.

For realtime adapters that require an API key, use the tokenUrl pattern to avoid exposing secrets to the browser. Your server returns a short-lived token, and the adapter uses it to connect.
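A minimal sketch of such an endpoint, assuming an Express server; mintShortLivedToken is a hypothetical stand-in for your STT vendor's token-issuance call (each provider documents its own):

import express from "express";

const app = express();

// Hypothetical stand-in for the vendor's token-issuance API.
async function mintShortLivedToken(apiKey: string): Promise<string> {
  const response = await fetch("https://stt-vendor.example.com/v1/token", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const { token } = await response.json();
  return token;
}

app.get("/api/stt-token", async (_req, res) => {
  // The long-lived API key stays on the server; the browser only
  // ever receives the short-lived token returned here.
  const token = await mintShortLivedToken(process.env.STT_API_KEY!);
  res.json({ token });
});

The browser-side adapter is then constructed with tokenUrl: "/api/stt-token" and never sees the long-lived key.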

Deepgram

import { DeepgramSTTAdapter } from "avatarlayer";

const realtimeSTT = new DeepgramSTTAdapter({
  apiKey: "...",          // or use tokenUrl for production
  model: "nova-3",        // optional
  language: "en",          // optional
});

Option     | Type   | Default          | Description
apiKey     | string | –                | Deepgram API key (use tokenUrl in production)
tokenUrl   | string | –                | URL returning a temporary Deepgram token
model      | string | "nova-3"         | Transcription model
language   | string | "en"             | Language code
encoding   | string | "linear16"       | Audio encoding
sampleRate | number | 16000            | Sample rate
baseURL    | string | Deepgram default | Base URL for WebSocket

ElevenLabs STT

import { ElevenLabsSTTAdapter } from "avatarlayer";

const realtimeSTT = new ElevenLabsSTTAdapter({
  apiKey: "...",  // or tokenUrl
});

Option                  | Type   | Default | Description
apiKey                  | string | –       | ElevenLabs API key
tokenUrl                | string | –       | URL returning a temporary token
modelId                 | string | –       | Model identifier
vadSilenceThresholdSecs | number | 1.0     | Silence threshold for speech end

Azure Speech STT

import { AzureSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechSTTAdapter({
  tokenUrl: "/api/azure-speech-token",
  language: "en-US",  // optional
});

Option           | Type   | Default | Description
subscriptionKey  | string | –       | Azure Speech key (use tokenUrl in production)
region           | string | –       | Azure region (required with subscriptionKey)
tokenUrl         | string | –       | URL returning { token, region }
language         | string | –       | Recognition language
silenceTimeoutMs | number | 1000    | Silence timeout
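
As a concrete example of the tokenUrl option above, here is a sketch of an endpoint returning the expected { token, region } shape, assuming the standard Azure Speech token-issuance endpoint and an Express server:

import express from "express";

const app = express();
const region = process.env.AZURE_SPEECH_REGION!;

app.get("/api/azure-speech-token", async (_req, res) => {
  const response = await fetch(
    `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    {
      method: "POST",
      headers: { "Ocp-Apim-Subscription-Key": process.env.AZURE_SPEECH_KEY! },
    }
  );
  const token = await response.text(); // Azure returns the token as plain text
  res.json({ token, region });
});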

Amazon Transcribe STT

import { AmazonTranscribeSTTAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeSTTAdapter({
  tokenUrl: "/api/transcribe-token",
  language: "en-US",  // optional
});

Option           | Type   | Default | Description
credentials      | object | –       | AWS credentials (use tokenUrl in production)
tokenUrl         | string | –       | URL returning temporary credentials
region           | string | –       | AWS region
language         | string | –       | Language code
silenceTimeoutMs | number | 1000    | Silence timeout

WebSpeech STT

import { WebSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new WebSpeechSTTAdapter({
  language: "en-US",  // optional
});

No API key needed — uses the browser's built-in SpeechRecognition API. Note that sendAudio is a no-op; the browser captures audio directly.

Option         | Type    | Default | Description
language       | string  | "en-US" | Recognition language
continuous     | boolean | true    | Continuous recognition
interimResults | boolean | true    | Emit partial results

Combined STT + VAD adapters

These adapters bundle realtime STT with voice activity detection, emitting speech-start and speech-end events in addition to transcripts.

AzureSpeechVADAdapter

import { AzureSpeechVADAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechVADAdapter({
  tokenUrl: "/api/azure-speech-token",
  maxDurationMs: 20000,  // optional
  prerollMs: 500,         // optional
});

AmazonTranscribeVADAdapter

import { AmazonTranscribeVADAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeVADAdapter({
  tokenUrl: "/api/transcribe-token",
  maxDurationMs: 20000,
  prerollMs: 500,
});
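
Both adapters emit the speech-start and speech-end events mentioned above alongside transcripts. A sketch of wiring them up, assuming the VAD events take no-argument handlers:

realtimeSTT.on("speech-start", () => {
  // The user began speaking: pause avatar playback, show a listening indicator, etc.
});

realtimeSTT.on("speech-end", () => {
  // The user stopped speaking: finalize the current turn.
});

realtimeSTT.on("transcript", (text, opts) => {
  console.log("transcript:", text, opts);
});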

Interfaces

STTProvider (batch)

interface STTProvider {
  readonly id: string;
  transcribe(audio: Blob, opts?: STTOptions): Promise<string>;
}
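
Implementing STTProvider yourself only requires these two members. A minimal sketch of a custom batch adapter, assuming the STTProvider and STTOptions types are exported by the package and using a hypothetical HTTP endpoint that accepts raw audio and returns { text }:

import type { STTProvider, STTOptions } from "avatarlayer";

class MyBatchSTTAdapter implements STTProvider {
  readonly id = "my-batch-stt";

  async transcribe(audio: Blob, opts?: STTOptions): Promise<string> {
    // Hypothetical endpoint: POST the raw audio, receive { text } back.
    const response = await fetch("https://example.com/transcribe", {
      method: "POST",
      headers: { "Content-Type": audio.type || "application/octet-stream" },
      body: audio,
    });
    const { text } = await response.json();
    return text;
  }
}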

RealtimeSTTProvider (streaming)

interface RealtimeSTTProvider {
  readonly id: string;
  connect(signal?: AbortSignal): Promise<void>;
  sendAudio(pcm: Float32Array): void;
  disconnect(): void;
  on(event: "transcript", fn: (text: string, opts: TranscriptionTextOptions) => void): void;
  on(event: "session-started", fn: () => void): void;
  on(event: "error", fn: (err: Error) => void): void;
  on(event: "close", fn: () => void): void;
}
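
A sketch of driving a realtime adapter through this interface directly, using the Deepgram adapter as an example (in a full app, startListening() captures microphone audio and calls sendAudio for you):

import { DeepgramSTTAdapter } from "avatarlayer";

const realtimeSTT = new DeepgramSTTAdapter({ tokenUrl: "/api/stt-token" });

realtimeSTT.on("transcript", (text, opts) => {
  // opts distinguishes partial from final results.
  console.log("transcript:", text, opts);
});
realtimeSTT.on("error", (err) => console.error(err));

await realtimeSTT.connect();

// Feed Float32Array PCM frames from your capture pipeline:
// realtimeSTT.sendAudio(frame);

// Later, tear down the connection:
realtimeSTT.disconnect();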

See Voice Input for how to wire realtime STT into the session, and Custom Adapters for implementing your own.