Session Lifecycle
Session states, events, and the conversational pipeline.
AvatarSession is the central orchestrator. It manages the conversational pipeline from user input through LLM streaming, sentence segmentation, TTS synthesis, and avatar speech. It also manages voice input, video capture, vision workloads, memory persistence, and thread management.
State machine
autoSpeak true (default):
idle → connecting → ready ⇄ thinking → speaking → ready
↑ |
└──── interrupt() ─────────────┘
autoSpeak false (text-first):
idle → connecting → ready ⇄ thinking → ready
↑ speak(i) → speaking → ready
└──── interrupt() ────────────┘
States
| State | Description |
|---|---|
| idle | Session created but not mounted |
| connecting | Renderer is mounting (establishing connections) |
| ready | Waiting for user input |
| thinking | LLM is streaming a response |
| speaking | TTS audio is playing through the avatar |
| error | Something went wrong |
| destroyed | Session torn down via destroy() |
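
The two diagrams above can also be read as a transition table. The following is a minimal sketch rather than the library's implementation: `TRANSITIONS` and `canTransition` are hypothetical names, and only the edges drawn in the diagrams are included. Transitions into `error` and `destroyed` are not drawn there, so the sketch leaves them out.

```typescript
type SessionState =
  | "idle"
  | "connecting"
  | "ready"
  | "thinking"
  | "speaking"
  | "error"
  | "destroyed";

// Edges copied from the diagrams. error/destroyed get no outgoing edges here;
// that is an assumption of this sketch, not documented behavior.
const TRANSITIONS: Record<SessionState, readonly SessionState[]> = {
  idle: ["connecting"],
  connecting: ["ready"],
  ready: ["thinking", "speaking"], // ready → speaking only via speak(i) in text-first mode
  thinking: ["speaking", "ready"], // → ready on interrupt() or when autoSpeak is false
  speaking: ["ready"],
  error: [],
  destroyed: [],
};

// Returns true when the diagrams contain an edge from `from` to `to`.
function canTransition(from: SessionState, to: SessionState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

A table like this can be handy in a state-change handler for asserting that the UI only ever sees the transitions it expects.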
Events
Subscribe to events via session.on():
session.on("state-change", (state) => { /* ... */ });
session.on("message", (msg) => { /* ... */ });
session.on("chunk", (text, accumulated) => { /* ... */ });
session.on("speech-start", () => { /* ... */ });
session.on("speech-end", () => { /* ... */ });
session.on("error", (err) => { /* ... */ });
session.on("listening-change", (listening) => { /* ... */ });
session.on("transcript", (text, opts) => { /* ... */ });
session.on("emotion", (payload) => { /* ... */ });
session.on("segment", (segment) => { /* ... */ });
session.on("thread-change", (thread) => { /* ... */ });
| Event | Payload | Description |
|---|---|---|
| state-change | SessionState | Fires on every state transition |
| message | ChatMessage | User or assistant message committed to history |
| chunk | (text: string, accumulated: string) | Streaming LLM text delta. accumulated contains the running full text in plain mode; when emotions: true it is always "" (emotion markers are stripped inline). |
| speech-start | — | TTS audio playback begins |
| speech-end | — | TTS audio playback finishes |
| error | Error | Error from any pipeline stage |
| listening-change | boolean | Voice input started or stopped |
| video-change | boolean | Video capture started or stopped |
| transcript | (text: string, opts: TranscriptionTextOptions) | Realtime STT transcript (partial or final) |
| emotion | { name: string, intensity: number } | Emotion marker parsed from LLM output (requires emotions: true) |
| segment | Segment | Sentence-sized text segment produced by the LLM (always emitted when autoSpeak: false) |
| thread-change | Thread | Active thread changed (create, switch, or initial load) |
| history-loaded | ChatMessage[] | Persisted messages loaded into history (on start or switchThread) |
| vision-context | VisionContextEntry | Vision workload completed with new context |
| vision-workloads-change | boolean | Vision workload ticker started or stopped |
| vision-error | Error | Vision workload inference failed |
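
Because accumulated is always "" when emotions: true, a handler that needs the running text in both modes can accumulate the deltas itself. A minimal sketch: `createTextCollector` is a hypothetical helper, not part of the API, and it assumes the text deltas are exactly what you want to display.

```typescript
// Hypothetical helper: builds the running text from chunk deltas,
// ignoring the `accumulated` argument so it behaves the same in both modes.
function createTextCollector() {
  let full = "";
  return {
    // Same parameter shape as the documented chunk payload: (text, accumulated).
    onChunk(text: string, _accumulated: string): void {
      full += text;
    },
    // Call at the start of each response, e.g. when the state becomes "thinking".
    reset(): void {
      full = "";
    },
    get text(): string {
      return full;
    },
  };
}

// Possible wiring (session is an AvatarSession):
//   const collector = createTextCollector();
//   session.on("chunk", collector.onChunk);
```

The handler closes over its own buffer instead of using `this`, so it can be passed to session.on() directly.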
The pipeline in detail
When sendMessage(text) is called:
1. Interrupt: Any in-progress pipeline is cancelled via AbortController.
2. User message: The text is wrapped in a ChatMessage and added to history. The message event fires.
3. Memory recall: If memory with semantic recall is configured, relevant past messages are retrieved and injected into context.
4. LLM streaming: State transitions to thinking. The full message history (with system prompt or character card) is sent to the LLM. As chunks arrive, the chunk event fires with each delta.
5. Emotion parsing: If emotions: true, inline markers like <|ACT {...}|> are parsed from the stream, stripped from TTS text, and applied to the renderer.
6. Sentence splitting: The streaming text is buffered and split on sentence boundaries (., !, ?, newlines). Complete sentences are passed to TTS immediately, without waiting for the full response.
7. TTS synthesis (pre-buffered): Sentences are synthesized to audio blobs concurrently. Up to ttsBufferSize (default 3) TTS requests run in parallel so that the next audio blob is already waiting when the current one finishes playing. If the renderer has speakText, the external TTS step is skipped entirely and sentences are spoken sequentially.
8. Avatar speech: State transitions to speaking when the first blob is ready. Blobs are played through renderer.speak() in strict sentence order. The speech-start and speech-end events fire around each sentence's playback.
9. Completion: After all sentences are spoken, the assistant message is committed to history and the message event fires. State returns to ready.
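
The sentence-splitting step can be illustrated with a small buffer that accepts streamed deltas and releases complete sentences as soon as a boundary arrives. This is a simplified sketch, not the library's splitter: `SentenceBuffer` is a hypothetical name, and the boundary rule is deliberately naive (it would also split inside "3.14", for example).

```typescript
// Hypothetical sketch of streaming sentence segmentation.
class SentenceBuffer {
  private buffer = "";

  // Feed one streamed delta; returns the sentences completed by it (possibly none).
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // A boundary is a run of . ! ? characters, or a run of newlines.
    const boundary = /[.!?]+|\n+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = end;
    }
    // Keep the unterminated tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the stream ends to release trailing text that has no terminator.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each sentence returned by push() could be handed to TTS right away, which is what lets speech start before the LLM has finished its response.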
Interruption
interrupt() cancels the pipeline at whatever stage it's in:
- Cancels the speech queue, aborting all in-flight and buffered TTS requests
- Aborts the LLM stream via AbortController.abort()
- Calls renderer.interrupt() to stop audio/video playback
- Returns the session to ready
If the LLM had already streamed non-empty text before the interrupt, the partial assistant message is committed to history and a message event fires. If no text was produced yet, nothing is committed.
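
The partial-commit rule can be sketched with a standard AbortController. The functions below are hypothetical stand-ins for the streaming stage, not the session's internals: `consumeStream` stops at the next chunk once the signal is aborted, and `commitIfNonEmpty` applies the rule described above.

```typescript
// Hypothetical stand-in for the streaming stage: consumes chunks until aborted.
function consumeStream(
  chunks: readonly string[],
  signal: AbortSignal,
  onChunk: (delta: string) => void,
): string {
  let partial = "";
  for (const delta of chunks) {
    if (signal.aborted) break; // cooperative cancellation point
    partial += delta;
    onChunk(delta);
  }
  return partial;
}

// Commit the partial text only if something non-empty was produced.
// `history` is a plain string array here; the real history holds ChatMessage objects.
function commitIfNonEmpty(partial: string, history: string[]): boolean {
  if (partial.length === 0) return false;
  history.push(partial);
  return true;
}
```

Aborting after two of three chunks leaves the first two chunks in history; aborting before the first chunk commits nothing.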
Session configuration
interface AvatarSessionConfig {
llm: LLMProvider;
tts?: TTSProvider;
renderer: AvatarRenderer;
systemPrompt?: string;
characterCard?: CharacterCard;
reasoningEffort?: "none" | "low" | "medium" | "high";
realtimeSTT?: RealtimeSTTProvider;
voice?: VoiceConfig;
vision?: VisionConfig;
emotions?: boolean;
memory?: MemoryConfig;
visionWorkloads?: VisionWorkloadsConfig;
ttsBufferSize?: number;
autoSpeak?: boolean;
}
| Field | Type | Description |
|---|---|---|
| llm | LLMProvider | Required. The language model adapter. |
| tts | TTSProvider | Optional when the renderer implements speakText. |
| renderer | AvatarRenderer | Required. The avatar renderer. |
| systemPrompt | string | System message content. Ignored if characterCard is set. |
| characterCard | CharacterCard | V3 character card. Builds system prompt from structured fields. Takes precedence over systemPrompt. |
| reasoningEffort | ReasoningEffort | Extended thinking budget passed to the LLM. |
| realtimeSTT | RealtimeSTTProvider | Enables startListening / stopListening voice pipeline. |
| voice | VoiceConfig | Voice input options: bargeIn (default true), bargeInMinLength (default 2). |
| vision | VisionConfig | Image format, quality, and max width for snapshots. |
| emotions | boolean | Enable emotion marker parsing from LLM output. |
| memory | MemoryConfig | Thread-based persistent memory. See Memory. |
| visionWorkloads | VisionWorkloadsConfig | Periodic vision analysis. See Vision. |
| ttsBufferSize | number | Max sentences synthesized ahead of playback (default 3). Ignored for speakText renderers. |
| autoSpeak | boolean | When false, sendMessage() emits segment events instead of automatically synthesizing and playing audio. Call speak(index) to play on demand. Default true. |
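
To make the effect of ttsBufferSize concrete, here is a minimal sketch of bounded prefetching with in-order playback. It is not the session's actual scheduler: `speakWithPrefetch`, `Synthesize`, and `Play` are hypothetical names, and audio is represented as a string instead of a blob to keep the example self-contained.

```typescript
// Hypothetical signatures; the real TTSProvider and renderer interfaces are not shown here.
type Synthesize = (sentence: string) => Promise<string>;
type Play = (audio: string) => Promise<void>;

async function speakWithPrefetch(
  sentences: readonly string[],
  synthesize: Synthesize,
  play: Play,
  bufferSize = 3,
): Promise<void> {
  const limit = Math.max(1, bufferSize); // at least one request must be allowed
  const pending: Promise<string>[] = []; // outstanding requests, in sentence order
  let next = 0;

  // Start requests until `limit` are outstanding or the input is exhausted.
  const fill = (): void => {
    while (pending.length < limit && next < sentences.length) {
      const sentence = sentences[next];
      next += 1;
      if (sentence !== undefined) pending.push(synthesize(sentence));
    }
  };

  fill();
  while (pending.length > 0) {
    const head = pending.shift();
    if (head === undefined) break;
    const audio = await head; // wait for the oldest request, which preserves order
    fill(); // top up the buffer before playback starts
    await play(audio);
  }
}
```

A larger buffer hides TTS latency better but spends more requests on sentences that are discarded if the user interrupts.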
Methods
| Method | Description |
|---|---|
| start(container) | Mount the renderer and transition to ready |
| sendMessage(text, opts?) | Run the full LLM → TTS → speak pipeline. opts.images attaches images. |
| interrupt() | Cancel the current pipeline |
| startListening(source) | Begin voice input from an AsyncIterable<Float32Array> (e.g. MicCapture) |
| stopListening(opts?) | Stop voice input. opts.drain keeps connection alive for final transcript. |
| startVideo(stream) | Start video capture from a MediaStream |
| stopVideo() | Stop video capture |
| startVisionWorkloads() | Start the periodic vision workload ticker |
| stopVisionWorkloads() | Stop the vision workload ticker |
| updateControl(control) | Update avatar face/emotion/body/scene |
| setLLM(llm) | Swap the LLM provider at runtime |
| setTTS(tts) | Swap the TTS provider at runtime |
| setRenderer(renderer, container) | Swap the renderer (unmounts old one) |
| setCharacterCard(card) | Set or replace the character card |
| switchThread(threadId) | Switch to an existing thread (loads persisted messages) |
| newThread(opts?) | Create a new thread. opts.title sets the thread title. |
| speak(index) | Synthesize and play a single segment on demand. Only works in ready state. |
| destroy() | Tear down everything |
Properties
| Property | Type | Description |
|---|---|---|
| state | SessionState | Current session state |
| messages | readonly ChatMessage[] | Full conversation history |
| thread | Thread \| null | Current thread (when memory is configured) |
| threadId | string \| null | Current thread ID |
| listening | boolean | Whether voice input is active |
| videoActive | boolean | Whether video capture is active |
| visionWorkloadsActive | boolean | Whether the vision workload ticker is running |
| visionContext | VisionContextEntry[] | Latest context from vision workloads |
| segments | readonly Segment[] | Accumulated text segments from the current/last response (populated when autoSpeak: false) |
Text-first mode
By default, AvatarSession runs the full pipeline: LLM stream → sentence split → TTS → avatar speech. Set autoSpeak: false to decouple text generation from speech. The session streams LLM output and emits structured segment events for each sentence-sized chunk, but does not synthesize or play audio. You can then call speak(index) to play any segment on demand.
This is useful for multi-character UIs where you want to display dialogue as text first and let the user (or your orchestration layer) decide when to play each line.
const session = new AvatarSession({
llm,
tts,
renderer,
autoSpeak: false,
});
session.on("segment", (segment) => {
console.log(`[${segment.index}] ${segment.text}`);
// segment.emotion is set when emotions: true
});
await session.sendMessage("Tell me a story");
// Play a specific segment later
await session.speak(0);
Segment type
interface Segment {
index: number;
text: string;
emotion?: EmotionPayload;
}
Segments are cleared at the start of each sendMessage() call. The segments getter exposes the accumulated segments from the current or most recent response. Calling interrupt() during speak() stops playback but preserves the segment list; segments are only cleared on the next sendMessage().
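
For orchestration layers that want to play several lines back to back, a small helper built only on segments and speak(index) may be enough. This is a sketch under stated assumptions: `speakFrom` and `SegmentSpeaker` are hypothetical names, and it assumes the promise returned by speak() resolves once playback has finished and the session is back in ready.

```typescript
interface Segment {
  index: number;
  text: string;
  // emotion omitted for brevity
}

// Structural stand-in for the parts of AvatarSession this helper touches.
interface SegmentSpeaker {
  readonly segments: readonly Segment[];
  speak(index: number): Promise<void>;
}

// Hypothetical helper: plays segments one after another, starting at startIndex.
// Returns the number of segments that were played.
async function speakFrom(session: SegmentSpeaker, startIndex = 0): Promise<number> {
  let spoken = 0;
  // Iterate over a copy so the loop is unaffected if the list changes while awaiting.
  const snapshot = session.segments.slice();
  for (const segment of snapshot) {
    if (segment.index < startIndex) continue;
    await session.speak(segment.index);
    spoken += 1;
  }
  return spoken;
}
```

Playing sequentially keeps each speak() call inside the ready state it requires; a UI could instead call speak(index) from a per-line play button.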