AvatarLayer — pluggable SDK for realtime conversational avatars

AvatarSession is the central orchestrator. It manages the conversational pipeline from user input through LLM streaming, sentence segmentation, TTS synthesis, and avatar speech. It also manages voice input, video capture, vision workloads, memory persistence, and thread management.

State machine

autoSpeak true (default):
idle → connecting → ready ⇄ thinking → speaking → ready
                      ↑                              |
                      └──── interrupt() ─────────────┘

autoSpeak false (text-first):
idle → connecting → ready ⇄ thinking → ready
                      ↑       speak(i) → speaking → ready
                      └──── interrupt() ────────────┘

States

State	Description
`idle`	Session created but not mounted
`connecting`	Renderer is mounting (establishing connections)
`ready`	Waiting for user input
`thinking`	LLM is streaming a response
`speaking`	TTS audio is playing through the avatar
`error`	Something went wrong
`destroyed`	Session torn down via `destroy()`
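
The diagrams above can be read as a transition table. The following is a minimal sketch under that reading, not the SDK's implementation: `TRANSITIONS` and `canTransition` are hypothetical names, and the `error` and `destroyed` states are left out because the diagrams do not show how they are entered or left.

```typescript
// Sketch only: transitions are read off the two state diagrams.
// `error` and `destroyed` are intentionally omitted (not shown in the diagrams).
type SessionState =
  | "idle"
  | "connecting"
  | "ready"
  | "thinking"
  | "speaking"
  | "error"
  | "destroyed";

const TRANSITIONS: Partial<Record<SessionState, readonly SessionState[]>> = {
  idle: ["connecting"],
  connecting: ["ready"],
  ready: ["thinking", "speaking"], // ready -> speaking is the speak(i) path in text-first mode
  thinking: ["speaking", "ready"], // -> ready on interrupt(), or when autoSpeak is false
  speaking: ["ready"], // playback finished, or interrupt()
};

// Hypothetical helper: true when the diagrams allow moving from `from` to `to`.
function canTransition(from: SessionState, to: SessionState): boolean {
  return TRANSITIONS[from]?.includes(to) ?? false;
}
```

A UI could use a table like this to decide which controls to enable for the state reported by `state-change`.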

Events

Subscribe to events via session.on():

session.on("state-change", (state) => { /* ... */ });
session.on("message", (msg) => { /* ... */ });
session.on("chunk", (text, accumulated) => { /* ... */ });
session.on("speech-start", () => { /* ... */ });
session.on("speech-end", () => { /* ... */ });
session.on("error", (err) => { /* ... */ });
session.on("listening-change", (listening) => { /* ... */ });
session.on("transcript", (text, opts) => { /* ... */ });
session.on("emotion", (payload) => { /* ... */ });
session.on("segment", (segment) => { /* ... */ });
session.on("thread-change", (thread) => { /* ... */ });

Event	Payload	Description
`state-change`	`SessionState`	Fires on every state transition
`message`	`ChatMessage`	User or assistant message committed to history
`chunk`	`(text: string, accumulated: string)`	Streaming LLM text delta. `accumulated` contains the running full text in plain mode; when `emotions: true` it is always `""` (emotion markers are stripped inline).
`speech-start`	—	TTS audio playback begins
`speech-end`	—	TTS audio playback finishes
`error`	`Error`	Error from any pipeline stage
`listening-change`	`boolean`	Voice input started or stopped
`video-change`	`boolean`	Video capture started or stopped
`transcript`	`(text: string, opts: TranscriptionTextOptions)`	Realtime STT transcript (partial or final)
`emotion`	`{ name: string, intensity: number }`	Emotion marker parsed from LLM output (requires `emotions: true`)
`segment`	`Segment`	Sentence-sized text segment produced by the LLM (always emitted when `autoSpeak: false`)
`thread-change`	`Thread`	Active thread changed (create, switch, or initial load)
`history-loaded`	`ChatMessage[]`	Persisted messages loaded into history (on start or `switchThread`)
`vision-context`	`VisionContextEntry`	Vision workload completed with new context
`vision-workloads-change`	`boolean`	Vision workload ticker started or stopped
`vision-error`	`Error`	Vision workload inference failed
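
Because `accumulated` is always `""` when `emotions: true`, a `chunk` listener that needs the running text in both modes may prefer to build it from the deltas. The sketch below shows one way to do that; `createChunkAccumulator` is a hypothetical helper, and it assumes the `text` delta already has emotion markers stripped.

```typescript
// Hypothetical helper: builds a "chunk" listener that tracks the running text itself,
// so it behaves the same whether or not the session fills in `accumulated`.
function createChunkAccumulator(onUpdate: (fullText: string) => void) {
  let running = "";
  return {
    // Pass this as the listener: session.on("chunk", accumulator.onChunk)
    onChunk(text: string, _accumulated: string): void {
      running += text; // append the delta; `_accumulated` is ignored on purpose
      onUpdate(running);
    },
    // Call when a new response starts, e.g. on the transition to "thinking".
    reset(): void {
      running = "";
    },
  };
}
```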

The pipeline in detail

When sendMessage(text) is called:

Interrupt — Any in-progress pipeline is cancelled via AbortController.

User message — The text is wrapped in a ChatMessage and added to history. The message event fires.

Memory recall — If memory with semantic recall is configured, relevant past messages are retrieved and injected into context.

LLM streaming — State transitions to thinking. The full message history (with system prompt or character card) is sent to the LLM. As chunks arrive, the chunk event fires with each delta.

Emotion parsing — If emotions: true, inline markers like <|ACT {...}|> are parsed from the stream, stripped from TTS text, and applied to the renderer.

Sentence splitting — The streaming text is buffered and split on sentence boundaries (., !, ?, newlines). Complete sentences are passed to TTS immediately — no need to wait for the full response.

TTS synthesis (pre-buffered) — Sentences are synthesized to audio blobs concurrently. Up to ttsBufferSize (default 3) TTS requests run in parallel so that the next audio blob is already waiting when the current one finishes playing. If the renderer has speakText, the external TTS step is skipped entirely and sentences are spoken sequentially.

Avatar speech — State transitions to speaking when the first blob is ready. Blobs are played through renderer.speak() in strict sentence order. The speech-start and speech-end events fire around each sentence's playback.

Completion — After all sentences are spoken, the assistant message is committed to history and the message event fires. State returns to ready.
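
The sentence-splitting and pre-buffered synthesis steps above can be sketched as follows. This is an illustration of the technique, not the SDK's internals: `SentenceSplitter`, `speakInOrder`, `synthesize`, and `play` are hypothetical names, the boundary rule (terminal punctuation followed by whitespace, or a newline) is an assumption, and the sketch takes a finished list of sentences where a real pipeline would feed them in as they stream.

```typescript
// Buffers streamed text and releases complete sentences as soon as a boundary arrives.
class SentenceSplitter {
  private buffer = "";

  // Feed one LLM delta; returns the sentences completed by it (possibly none).
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Boundary: a run of . ! ? followed by whitespace, or a newline.
    // Punctuation at the very end of the buffer waits for the next delta.
    const boundary = /[.!?]+(?=\s)|\n/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) sentences.push(sentence);
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the stream ends to release any trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}

// Plays sentences in order while keeping up to `bufferSize` syntheses in flight.
async function speakInOrder<A>(
  sentences: string[],
  synthesize: (text: string) => Promise<A>,
  play: (audio: A) => Promise<void>,
  bufferSize = 3,
): Promise<void> {
  const pending: Promise<A>[] = [];
  let next = 0;
  // Start syntheses until the look-ahead window is full.
  const fill = (): void => {
    while (pending.length < bufferSize && next < sentences.length) {
      pending.push(synthesize(sentences[next++]));
    }
  };
  fill();
  while (pending.length > 0) {
    const audio = await pending.shift()!; // oldest request first keeps sentence order
    fill(); // top up the window before playback starts
    await play(audio);
  }
}
```

Awaiting the oldest request first is what keeps playback in sentence order even when a later synthesis finishes sooner. The sketch does no error handling; a failed synthesis would need to cancel the remaining requests.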

Interruption

interrupt() cancels the pipeline at whatever stage it's in:

Cancels the speech queue, aborting all in-flight and buffered TTS requests
Aborts the LLM stream via AbortController.abort()
Calls renderer.interrupt() to stop audio/video playback
Returns the session to ready

If the LLM had already streamed non-empty text before the interrupt, the partial assistant message is committed to history and a message event fires. If no text was produced yet, nothing is committed.
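
The partial-commit rule can be illustrated with a small model of the streaming stage. This is a sketch under stated assumptions: `MiniPipeline` and its members are hypothetical, it covers only the LLM stream (no speech queue or renderer), and it uses the standard `AbortController` API to mark a run as cancelled.

```typescript
type Message = { role: "user" | "assistant"; content: string };

// Hypothetical model of the streaming stage; not the SDK's source.
class MiniPipeline {
  readonly history: Message[] = [];
  private controller: AbortController | null = null;
  private partial = "";

  // Consume a stream of LLM deltas until it ends or interrupt() is called.
  async run(stream: AsyncIterable<string>): Promise<void> {
    this.interrupt(); // a new run cancels whatever was in progress
    const controller = new AbortController();
    this.controller = controller;
    this.partial = "";
    for await (const delta of stream) {
      if (controller.signal.aborted) return; // interrupt() already handled the commit
      this.partial += delta;
    }
    if (controller.signal.aborted) return;
    this.commit(); // normal completion
    this.controller = null;
  }

  interrupt(): void {
    if (!this.controller) return; // nothing in progress
    this.controller.abort();
    this.controller = null;
    this.commit(); // keeps partial text; does nothing if no text was streamed
  }

  private commit(): void {
    if (this.partial !== "") {
      this.history.push({ role: "assistant", content: this.partial });
    }
    this.partial = "";
  }
}
```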

Session configuration

interface AvatarSessionConfig {
  llm: LLMProvider;
  tts?: TTSProvider;
  renderer: AvatarRenderer;
  systemPrompt?: string;
  characterCard?: CharacterCard;
  reasoningEffort?: "none" | "low" | "medium" | "high";
  realtimeSTT?: RealtimeSTTProvider;
  voice?: VoiceConfig;
  vision?: VisionConfig;
  emotions?: boolean;
  memory?: MemoryConfig;
  visionWorkloads?: VisionWorkloadsConfig;
  ttsBufferSize?: number;
  autoSpeak?: boolean;
}

Field	Type	Description
`llm`	`LLMProvider`	Required. The language model adapter.
`tts`	`TTSProvider`	Optional when the renderer implements `speakText`.
`renderer`	`AvatarRenderer`	Required. The avatar renderer.
`systemPrompt`	`string`	System message content. Ignored if `characterCard` is set.
`characterCard`	`CharacterCard`	V3 character card. Builds system prompt from structured fields. Takes precedence over `systemPrompt`.
`reasoningEffort`	`ReasoningEffort`	Extended thinking budget passed to the LLM: `"none"`, `"low"`, `"medium"`, or `"high"`.
`realtimeSTT`	`RealtimeSTTProvider`	Enables `startListening` / `stopListening` voice pipeline.
`voice`	`VoiceConfig`	Voice input options: `bargeIn` (default `true`), `bargeInMinLength` (default `2`).
`vision`	`VisionConfig`	Image format, quality, and max width for snapshots.
`emotions`	`boolean`	Enable emotion marker parsing from LLM output.
`memory`	`MemoryConfig`	Thread-based persistent memory. See Memory.
`visionWorkloads`	`VisionWorkloadsConfig`	Periodic vision analysis. See Vision.
`ttsBufferSize`	`number`	Max sentences synthesized ahead of playback (default `3`). Ignored for `speakText` renderers.
`autoSpeak`	`boolean`	When `false`, `sendMessage()` emits `segment` events instead of automatically synthesizing and playing audio. Call `speak(index)` to play on demand. Default `true`.
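
The documented defaults can be summarized in code. The sketch below is illustrative only: `SpeechOptions`, `VoiceOptions`, and `withDefaults` are hypothetical names, and only the default values themselves come from the table above. It uses `??` rather than `||` so that an explicit `false` or `0` is not replaced by the default.

```typescript
// Hypothetical option shapes covering only the fields with documented defaults.
interface VoiceOptions {
  bargeIn?: boolean;
  bargeInMinLength?: number;
}

interface SpeechOptions {
  ttsBufferSize?: number;
  autoSpeak?: boolean;
  voice?: VoiceOptions;
}

// Fill in the defaults listed in the configuration table.
function withDefaults(options: SpeechOptions) {
  return {
    ttsBufferSize: options.ttsBufferSize ?? 3,
    autoSpeak: options.autoSpeak ?? true, // `||` would wrongly turn false into true
    voice: {
      bargeIn: options.voice?.bargeIn ?? true,
      bargeInMinLength: options.voice?.bargeInMinLength ?? 2,
    },
  };
}
```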

Methods

Method	Description
`start(container)`	Mount the renderer and transition to ready
`sendMessage(text, opts?)`	Run the full LLM → TTS → speak pipeline. `opts.images` attaches images.
`interrupt()`	Cancel the current pipeline
`startListening(source)`	Begin voice input from an `AsyncIterable<Float32Array>` (e.g. `MicCapture`)
`stopListening(opts?)`	Stop voice input. `opts.drain` keeps the connection alive for the final transcript.
`startVideo(stream)`	Start video capture from a `MediaStream`
`stopVideo()`	Stop video capture
`startVisionWorkloads()`	Start the periodic vision workload ticker
`stopVisionWorkloads()`	Stop the vision workload ticker
`updateControl(control)`	Update avatar face/emotion/body/scene
`setLLM(llm)`	Swap the LLM provider at runtime
`setTTS(tts)`	Swap the TTS provider at runtime
`setRenderer(renderer, container)`	Swap the renderer (unmounts old one)
`setCharacterCard(card)`	Set or replace the character card
`switchThread(threadId)`	Switch to an existing thread (loads persisted messages)
`newThread(opts?)`	Create a new thread. `opts.title` sets the thread title.
`speak(index)`	Synthesize and play a single segment on demand. Only works in the `ready` state.
`destroy()`	Tear down everything
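
Since `startListening(source)` accepts any `AsyncIterable<Float32Array>`, a synthetic source can stand in for a microphone during testing. The generator below is a sketch of such a source: `toneSource` is a hypothetical helper, and the frame size and 16 kHz sample rate are assumptions rather than values required by the SDK.

```typescript
// Hypothetical test source: yields `frames` frames of a quiet sine tone.
// Frame size and sample rate are assumptions; match whatever your STT provider expects.
async function* toneSource(
  frames: number,
  frameSize = 1600,
  sampleRate = 16000,
  frequency = 440,
): AsyncGenerator<Float32Array> {
  let n = 0; // running sample index, so the tone is continuous across frames
  for (let f = 0; f < frames; f++) {
    const frame = new Float32Array(frameSize);
    for (let i = 0; i < frameSize; i++, n++) {
      frame[i] = 0.2 * Math.sin((2 * Math.PI * frequency * n) / sampleRate);
    }
    yield frame;
  }
}
```

An async generator satisfies `AsyncIterable<Float32Array>`, so `toneSource(10)` could be passed where the table mentions `MicCapture`.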

Properties

Property	Type	Description
`state`	`SessionState`	Current session state
`messages`	`readonly ChatMessage[]`	Full conversation history
`thread`	`Thread | null`	Current thread (when memory is configured)
`threadId`	`string | null`	Current thread ID
`listening`	`boolean`	Whether voice input is active
`videoActive`	`boolean`	Whether video capture is active
`visionWorkloadsActive`	`boolean`	Whether the vision workload ticker is running
`visionContext`	`VisionContextEntry[]`	Latest context from vision workloads
`segments`	`readonly Segment[]`	Accumulated text segments from the current/last response (populated when `autoSpeak: false`)

Text-first mode

By default, AvatarSession runs the full pipeline: LLM stream → sentence split → TTS → avatar speech. Set autoSpeak: false to decouple text generation from speech. The session streams LLM output and emits structured segment events for each sentence-sized chunk, but does not synthesize or play audio. You can then call speak(index) to play any segment on demand.

This is useful for multi-character UIs where you want to display dialogue as text first and let the user (or your orchestration layer) decide when to play each line.

const session = new AvatarSession({
  llm,
  tts,
  renderer,
  autoSpeak: false,
});

session.on("segment", (segment) => {
  console.log(`[${segment.index}] ${segment.text}`);
  // segment.emotion is set when emotions: true
});

await session.sendMessage("Tell me a story");

// Play a specific segment later
await session.speak(0);

Segment type

interface Segment {
  index: number;
  text: string;
  emotion?: EmotionPayload;
}

Segments are cleared at the start of each sendMessage() call. The segments getter exposes the accumulated segments from the current or most recent response. Calling interrupt() during speak() stops playback but preserves the segment list — segments are only cleared on the next sendMessage().
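
The lifecycle rules above can be modelled in a few lines. This is a sketch, not the SDK's implementation: `SegmentStore` is a hypothetical class, the `emotion` field is left out for brevity, and it assumes indices restart at 0 for each response, as the `speak(0)` example suggests.

```typescript
interface Segment {
  index: number;
  text: string;
}

// Hypothetical model of the segment list's lifecycle.
class SegmentStore {
  private items: Segment[] = [];

  get segments(): readonly Segment[] {
    return this.items;
  }

  // Start of sendMessage(): the previous response's segments are dropped.
  beginResponse(): void {
    this.items = [];
  }

  // A new sentence-sized segment arrived (indices assumed to restart at 0 per response).
  add(text: string): Segment {
    const segment: Segment = { index: this.items.length, text };
    this.items.push(segment);
    return segment;
  }

  // Lookup for a speak(index)-style call. Interrupting playback never touches the list.
  get(index: number): Segment {
    const segment = this.items[index];
    if (!segment) throw new RangeError(`No segment at index ${index}`);
    return segment;
  }
}
```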

Session Lifecycle