# STT Adapters

Configure speech-to-text adapters — batch transcription and realtime streaming.
AvatarLayer provides two categories of STT adapters:

- **Batch** — transcribe a complete audio blob in one request (`STTProvider`)
- **Realtime** — stream audio frames over a persistent connection and receive partial/final transcripts (`RealtimeSTTProvider`)

Realtime adapters are used with `startListening()` for the voice input pipeline. Batch adapters are used for one-off transcription via the `stt` config field.
## Batch adapters

### OpenAI STT

```ts
import { OpenAISTTAdapter } from "avatarlayer";

const stt = new OpenAISTTAdapter({
  apiKey: "sk-...",
  model: "whisper-1", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `apiKey` | string | required | OpenAI API key |
| `model` | string | `"whisper-1"` | Model identifier |
| `baseURL` | string | OpenAI default | Base URL for the API |
### Google STT

```ts
import { GoogleSTTAdapter } from "avatarlayer";

const stt = new GoogleSTTAdapter({
  apiKey: "...",
  languageCode: "en-US", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `apiKey` | string | required | Google Cloud API key |
| `sampleRateHertz` | number | `16000` | Audio sample rate |
| `languageCode` | string | `"en-US"` | BCP-47 language code |
### Azure STT

```ts
import { AzureSTTAdapter } from "avatarlayer";

const stt = new AzureSTTAdapter({
  subscriptionKey: "...",
  region: "eastus",
  language: "en-US", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `subscriptionKey` | string | required | Azure Speech subscription key |
| `region` | string | required | Azure region |
| `language` | string | `"en-US"` | Recognition language |
## Realtime adapters

Realtime adapters are designed for the voice input pipeline. They maintain a persistent connection, accept streamed audio frames, and emit transcript events with partial and final results.

For realtime adapters that require an API key, use the `tokenUrl` pattern to avoid exposing secrets to the browser: your server returns a short-lived token, and the adapter uses it to connect.
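The client side of this pattern can be sketched as a small caching helper. The `{ token, expiresAt }` response shape, the injected fetcher, and the 10-second refresh margin are all illustrative assumptions — each adapter implements its own variant of this internally, so match the shape to what your server actually returns.

```ts
// Fetch a short-lived token from your own server and cache it
// until shortly before it expires.
type TokenResponse = { token: string; expiresAt: number }; // epoch ms
type Fetcher = (url: string) => Promise<TokenResponse>;

function makeTokenSource(tokenUrl: string, fetchToken: Fetcher) {
  let cached: TokenResponse | null = null;
  return async function getToken(): Promise<string> {
    // Refresh when missing or within 10 s of expiry.
    if (!cached || cached.expiresAt - Date.now() < 10_000) {
      cached = await fetchToken(tokenUrl);
    }
    return cached.token;
  };
}
```

The server side is a plain endpoint that exchanges your long-lived API key for a temporary token using the provider's token API, and returns only the temporary token to the browser.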
### Deepgram

```ts
import { DeepgramSTTAdapter } from "avatarlayer";

const realtimeSTT = new DeepgramSTTAdapter({
  apiKey: "...", // or use tokenUrl for production
  model: "nova-3", // optional
  language: "en", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `apiKey` | string | — | Deepgram API key (use `tokenUrl` in production) |
| `tokenUrl` | string | — | URL returning a temporary Deepgram token |
| `model` | string | `"nova-3"` | Transcription model |
| `language` | string | `"en"` | Language code |
| `encoding` | string | `"linear16"` | Audio encoding |
| `sampleRate` | number | `16000` | Sample rate |
| `baseURL` | string | Deepgram default | Base URL for the WebSocket |
### ElevenLabs STT

```ts
import { ElevenLabsSTTAdapter } from "avatarlayer";

const realtimeSTT = new ElevenLabsSTTAdapter({
  apiKey: "...", // or tokenUrl
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `apiKey` | string | — | ElevenLabs API key |
| `tokenUrl` | string | — | URL returning a temporary token |
| `modelId` | string | — | Model identifier |
| `vadSilenceThresholdSecs` | number | `1.0` | Silence threshold for speech end |
### Azure Speech STT

```ts
import { AzureSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechSTTAdapter({
  tokenUrl: "/api/azure-speech-token",
  language: "en-US", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `subscriptionKey` | string | — | Azure Speech key (use `tokenUrl` in production) |
| `region` | string | — | Azure region (required with `subscriptionKey`) |
| `tokenUrl` | string | — | URL returning `{ token, region }` |
| `language` | string | — | Recognition language |
| `silenceTimeoutMs` | number | `1000` | Silence timeout |
### Amazon Transcribe STT

```ts
import { AmazonTranscribeSTTAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeSTTAdapter({
  tokenUrl: "/api/transcribe-token",
  language: "en-US", // optional
});
```

| Option | Type | Default | Description |
|---|---|---|---|
| `credentials` | object | — | AWS credentials (use `tokenUrl` in production) |
| `tokenUrl` | string | — | URL returning temporary credentials |
| `region` | string | — | AWS region |
| `language` | string | — | Language code |
| `silenceTimeoutMs` | number | `1000` | Silence timeout |
### WebSpeech STT

```ts
import { WebSpeechSTTAdapter } from "avatarlayer";

const realtimeSTT = new WebSpeechSTTAdapter({
  language: "en-US", // optional
});
```

No API key needed — uses the browser's built-in SpeechRecognition API. Note that `sendAudio` is a no-op; the browser captures audio directly.

| Option | Type | Default | Description |
|---|---|---|---|
| `language` | string | `"en-US"` | Recognition language |
| `continuous` | boolean | `true` | Continuous recognition |
| `interimResults` | boolean | `true` | Emit partial results |
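Browser support for SpeechRecognition varies, and Chrome exposes the API under a `webkit` prefix, so it is worth feature-checking before constructing this adapter. A minimal sketch — the `win` parameter exists only to keep the helper testable outside a browser:

```ts
// Returns true if the page can use the Web Speech recognition API,
// either unprefixed or via Chrome's webkit prefix.
function supportsWebSpeech(win: object): boolean {
  return "SpeechRecognition" in win || "webkitSpeechRecognition" in win;
}

// In the browser:
//   if (supportsWebSpeech(window)) { /* use WebSpeechSTTAdapter */ }
//   else { /* fall back to a server-backed realtime adapter */ }
```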
## Combined STT + VAD adapters

These adapters bundle realtime STT with voice activity detection, emitting speech-start and speech-end events in addition to transcripts.
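As a sketch of how these events fit together in a session — the `"speech-start"`/`"speech-end"` event names follow the description above and should be verified against the actual adapters; a tiny stand-in emitter keeps the example self-contained:

```ts
// Minimal stand-in for a combined STT + VAD adapter's event surface.
type Handler = (...args: unknown[]) => void;

class FakeVADAdapter {
  private handlers = new Map<string, Handler[]>();
  on(event: string, fn: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(fn);
    this.handlers.set(event, list);
  }
  emit(event: string, ...args: unknown[]): void {
    for (const fn of this.handlers.get(event) ?? []) fn(...args);
  }
}

const adapter = new FakeVADAdapter();
const log: string[] = [];

// Typical wiring: pause avatar output while the user talks, resume after.
adapter.on("speech-start", () => log.push("user started speaking"));
adapter.on("transcript", (text) => log.push(`heard: ${text}`));
adapter.on("speech-end", () => log.push("user stopped speaking"));

// Simulated session:
adapter.emit("speech-start");
adapter.emit("transcript", "hello");
adapter.emit("speech-end");
```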
### AzureSpeechVADAdapter

```ts
import { AzureSpeechVADAdapter } from "avatarlayer";

const realtimeSTT = new AzureSpeechVADAdapter({
  tokenUrl: "/api/azure-speech-token",
  maxDurationMs: 20000, // optional
  prerollMs: 500, // optional
});
```

### AmazonTranscribeVADAdapter

```ts
import { AmazonTranscribeVADAdapter } from "avatarlayer";

const realtimeSTT = new AmazonTranscribeVADAdapter({
  tokenUrl: "/api/transcribe-token",
  maxDurationMs: 20000,
  prerollMs: 500,
});
```

## Interfaces
### STTProvider (batch)

```ts
interface STTProvider {
  readonly id: string;
  transcribe(audio: Blob, opts?: STTOptions): Promise<string>;
}
```

### RealtimeSTTProvider (streaming)

```ts
interface RealtimeSTTProvider {
  readonly id: string;
  connect(signal?: AbortSignal): Promise<void>;
  sendAudio(pcm: Float32Array): void;
  disconnect(): void;
  on(event: "transcript", fn: (text: string, opts: TranscriptionTextOptions) => void): void;
  on(event: "session-started", fn: () => void): void;
  on(event: "error", fn: (err: Error) => void): void;
  on(event: "close", fn: () => void): void;
}
```

See Voice Input for how to wire realtime STT into the session, and Custom Adapters for implementing your own.
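As a starting point, a custom batch adapter implementing `STTProvider` might look like the following sketch. The injected `send` function and the `language` field on `STTOptions` are illustrative stand-ins — a real adapter would POST the blob to its provider's transcription endpoint, and the actual `STTOptions` fields are defined by the library:

```ts
// Assumed shape of STTOptions for this sketch.
interface STTOptions {
  language?: string;
}

interface STTProvider {
  readonly id: string;
  transcribe(audio: Blob, opts?: STTOptions): Promise<string>;
}

class MyBatchSTTAdapter implements STTProvider {
  readonly id = "my-batch-stt";

  // The transport is injected so the adapter shape is clear without
  // a real backend; swap in an HTTP call to your provider here.
  constructor(
    private send: (audio: Blob, language?: string) => Promise<string>,
  ) {}

  async transcribe(audio: Blob, opts?: STTOptions): Promise<string> {
    if (audio.size === 0) throw new Error("empty audio blob");
    return this.send(audio, opts?.language);
  }
}
```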