STT Adapters

Configure speech-to-text adapters — batch transcription and realtime streaming.

AvatarLayer provides two categories of STT adapters:

  • Batch — Transcribe a complete audio blob in one request (STTProvider)
  • Realtime — Stream audio frames over a persistent connection and receive partial/final transcripts (RealtimeSTTProvider)

Realtime adapters are used with startListening() for the voice input pipeline. Batch adapters are used for one-off transcription via the stt config field.
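For a concrete sense of the difference, a batch adapter can also be called directly through its transcribe() method (see the STTProvider interface at the end of this page). A minimal sketch, assuming the audio Blob comes from your own recording code (for example a MediaRecorder):

import { OpenAISTTAdapter } from "avatarlayer";

const stt = new OpenAISTTAdapter({ apiKey: "sk-..." });

// Batch: one complete recording in, one transcript string out.
async function transcribeRecording(recording: Blob): Promise<string> {
  return stt.transcribe(recording);
}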

Batch adapters

OpenAI STT

import { OpenAISTTAdapter } from "avatarlayer";

const stt = new OpenAISTTAdapter({
  apiKey: "sk-...",
  model: "whisper-1",  // optional
});

Option  | Type   | Default        | Description
apiKey  | string | required       | OpenAI API key
model   | string | "whisper-1"    | Model identifier
baseURL | string | OpenAI default | Base URL for API

Google STT

import { GoogleSTTAdapter } from "avatarlayer";

const stt = new GoogleSTTAdapter({
  apiKey: "...",
  languageCode: "en-US",  // optional
});

Option          | Type   | Default  | Description
apiKey          | string | required | Google Cloud API key
sampleRateHertz | number | 16000    | Audio sample rate
languageCode    | string | "en-US"  | BCP-47 language code

Azure STT

import { AzureSTTAdapter } from "avatarlayer";

const stt = new AzureSTTAdapter({
  subscriptionKey: "...",
  region: "eastus",
  language: "en-US",  // optional
});

Option          | Type   | Default  | Description
subscriptionKey | string | required | Azure Speech subscription key
region          | string | required | Azure region
language        | string | "en-US"  | Recognition language

Realtime adapters

Realtime adapters are designed for the voice input pipeline. They maintain a persistent connection, accept streamed audio frames, and emit transcript events with partial and final results.

For realtime adapters that require an API key, use the tokenUrl pattern to avoid exposing secrets to the browser. Your server returns a short-lived token, and the adapter uses it to connect.
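A minimal sketch of such an endpoint, assuming an Express server; mintShortLivedToken is a hypothetical stand-in for your STT vendor's token-issuance call (each provider documents its own):

import express from "express";

const app = express();

// Hypothetical stand-in for the vendor's token-issuance API.
async function mintShortLivedToken(apiKey: string): Promise<string> {
  const response = await fetch("https://stt-vendor.example.com/v1/token", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  const { token } = await response.json();
  return token;
}

app.get("/api/stt-token", async (_req, res) => {
  // The long-lived API key stays on the server; the browser only
  // ever receives the short-lived token returned here.
  const token = await mintShortLivedToken(process.env.STT_API_KEY!);
  res.json({ token });
});

The browser-side adapter is then constructed with tokenUrl: "/api/stt-token" and never sees the long-lived key.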

Deepgram

import { DeepgramSTTAdapter } from "avatarlayer";

const realtimeSTT = new DeepgramSTTAdapter({
  apiKey: "...",          // or use tokenUrl for production
  model: "nova-3",        // optional
  language: "en",          // optional
});

Option     | Type   | Default          | Description
apiKey     | string | –                | Deepgram API key (use tokenUrl in production)
tokenUrl   | string | –                | URL returning a temporary Deepgram token
model      | string | "nova-3"         | Transcription model
language   | string | "en"             | Language code
encoding   | string | "linear16"       | Audio encoding
sampleRate | number | 16000            | Sample rate
baseURL    | string | Deepgram default | Base URL for WebSocket

ElevenLabs STT

import { ElevenLabsSTTAdapter } from "avatarlayer";

const realtimeSTT = new ElevenLabsSTTAdapter({
  apiKey: "...",  // or tokenUrl
});

Option                  | Type   | Default | Description
apiKey                  | string | –       | ElevenLabs API key
tokenUrl                | string | –       | URL returning a temporary token
modelId                 | string | –       | Model identifier
vadSilenceThresholdSecs | number | 1.0     | Silence threshold for speech end

Azure Speech STT

import { AzureSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechSTTAdapter({
  tokenUrl: "/api/azure-speech-token",
  language: "en-US",  // optional
});

Option           | Type   | Default | Description
subscriptionKey  | string | –       | Azure Speech key (use tokenUrl in production)
region           | string | –       | Azure region (required with subscriptionKey)
tokenUrl         | string | –       | URL returning { token, region }
language         | string | –       | Recognition language
silenceTimeoutMs | number | 1000    | Silence timeout
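
As a concrete example of the tokenUrl option above, here is a sketch of an endpoint returning the expected { token, region } shape, assuming the standard Azure Speech token-issuance endpoint and an Express server:

import express from "express";

const app = express();
const region = process.env.AZURE_SPEECH_REGION!;

app.get("/api/azure-speech-token", async (_req, res) => {
  const response = await fetch(
    `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    {
      method: "POST",
      headers: { "Ocp-Apim-Subscription-Key": process.env.AZURE_SPEECH_KEY! },
    }
  );
  const token = await response.text(); // Azure returns the token as plain text
  res.json({ token, region });
});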

Amazon Transcribe STT

import { AmazonTranscribeSTTAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeSTTAdapter({
  tokenUrl: "/api/transcribe-token",
  language: "en-US",  // optional
});

Option           | Type   | Default | Description
credentials      | object | –       | AWS credentials (use tokenUrl in production)
tokenUrl         | string | –       | URL returning temporary credentials
region           | string | –       | AWS region
language         | string | –       | Language code
silenceTimeoutMs | number | 1000    | Silence timeout

WebSpeech STT

import { WebSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new WebSpeechSTTAdapter({
  language: "en-US",  // optional
});

No API key needed — uses the browser's built-in SpeechRecognition API. Note that sendAudio is a no-op; the browser captures audio directly.

Option         | Type    | Default | Description
language       | string  | "en-US" | Recognition language
continuous     | boolean | true    | Continuous recognition
interimResults | boolean | true    | Emit partial results

Combined STT + VAD adapters

These adapters bundle realtime STT with voice activity detection, emitting speech-start and speech-end events in addition to transcripts.

AzureSpeechVADAdapter

import { AzureSpeechVADAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechVADAdapter({
  tokenUrl: "/api/azure-speech-token",
  maxDurationMs: 20000,  // optional
  prerollMs: 500,         // optional
});

AmazonTranscribeVADAdapter

import { AmazonTranscribeVADAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeVADAdapter({
  tokenUrl: "/api/transcribe-token",
  maxDurationMs: 20000,
  prerollMs: 500,
});
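
Both adapters emit the speech-start and speech-end events mentioned above alongside transcripts. A sketch of wiring them up, assuming the VAD events take no-argument handlers:

realtimeSTT.on("speech-start", () => {
  // The user began speaking: pause avatar playback, show a listening indicator, etc.
});

realtimeSTT.on("speech-end", () => {
  // The user stopped speaking: finalize the current turn.
});

realtimeSTT.on("transcript", (text, opts) => {
  console.log("transcript:", text, opts);
});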

Interfaces

STTProvider (batch)

interface STTProvider {
  readonly id: string;
  transcribe(audio: Blob, opts?: STTOptions): Promise<string>;
}
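
Implementing STTProvider yourself only requires these two members. A minimal sketch of a custom batch adapter, assuming the STTProvider and STTOptions types are exported by the package and using a hypothetical HTTP endpoint that accepts raw audio and returns { text }:

import type { STTProvider, STTOptions } from "avatarlayer";

class MyBatchSTTAdapter implements STTProvider {
  readonly id = "my-batch-stt";

  async transcribe(audio: Blob, opts?: STTOptions): Promise<string> {
    // Hypothetical endpoint: POST the raw audio, receive { text } back.
    const response = await fetch("https://example.com/transcribe", {
      method: "POST",
      headers: { "Content-Type": audio.type || "application/octet-stream" },
      body: audio,
    });
    const { text } = await response.json();
    return text;
  }
}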

RealtimeSTTProvider (streaming)

interface RealtimeSTTProvider {
  readonly id: string;
  connect(signal?: AbortSignal): Promise<void>;
  sendAudio(pcm: Float32Array): void;
  disconnect(): void;
  on(event: "transcript", fn: (text: string, opts: TranscriptionTextOptions) => void): void;
  on(event: "session-started", fn: () => void): void;
  on(event: "error", fn: (err: Error) => void): void;
  on(event: "close", fn: () => void): void;
}
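
A sketch of driving a realtime adapter through this interface directly, using the Deepgram adapter as an example (in a full app, startListening() captures microphone audio and calls sendAudio for you):

import { DeepgramSTTAdapter } from "avatarlayer";

const realtimeSTT = new DeepgramSTTAdapter({ tokenUrl: "/api/stt-token" });

realtimeSTT.on("transcript", (text, opts) => {
  // opts distinguishes partial from final results.
  console.log("transcript:", text, opts);
});
realtimeSTT.on("error", (err) => console.error(err));

await realtimeSTT.connect();

// Feed Float32Array PCM frames from your capture pipeline:
// realtimeSTT.sendAudio(frame);

// Later, tear down the connection:
realtimeSTT.disconnect();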

See Voice Input for how to wire realtime STT into the session, and Custom Adapters for implementing your own.