# Session Lifecycle

Session states, events, and the conversational pipeline.
AvatarSession is the central orchestrator. It manages the conversational pipeline from user input through LLM streaming, sentence segmentation, TTS synthesis, and avatar speech. It also manages voice input, video capture, vision workloads, memory persistence, and thread management.
## State machine

```
idle → connecting → ready ⇄ thinking → speaking → ready
                      ↑                    │
                      └──── interrupt() ───┘
```

### States

| State | Description |
|---|---|
| `idle` | Session created but not mounted |
| `connecting` | Renderer is mounting (establishing connections) |
| `ready` | Waiting for user input |
| `thinking` | LLM is streaming a response |
| `speaking` | TTS audio is playing through the avatar |
| `error` | Something went wrong |
| `destroyed` | Session torn down via `destroy()` |
## Events

Subscribe to events via `session.on()`:

```ts
session.on("state-change", (state) => { /* ... */ });
session.on("message", (msg) => { /* ... */ });
session.on("chunk", (text, accumulated) => { /* ... */ });
session.on("speech-start", () => { /* ... */ });
session.on("speech-end", () => { /* ... */ });
session.on("error", (err) => { /* ... */ });
session.on("listening-change", (listening) => { /* ... */ });
session.on("transcript", (text, opts) => { /* ... */ });
session.on("emotion", (payload) => { /* ... */ });
session.on("thread-change", (thread) => { /* ... */ });
```

| Event | Payload | Description |
|---|---|---|
| `state-change` | `SessionState` | Fires on every state transition |
| `message` | `ChatMessage` | User or assistant message committed to history |
| `chunk` | `(text: string, accumulated: string)` | Streaming LLM text delta with the full accumulated text |
| `speech-start` | — | TTS audio playback begins |
| `speech-end` | — | TTS audio playback finishes |
| `error` | `Error` | Error from any pipeline stage |
| `listening-change` | `boolean` | Voice input started or stopped |
| `video-change` | `boolean` | Video capture started or stopped |
| `transcript` | `(text: string, opts: TranscriptionTextOptions)` | Realtime STT transcript (partial or final) |
| `emotion` | `{ name: string, intensity: number }` | Emotion marker parsed from LLM output (requires `emotions: true`) |
| `thread-change` | `Thread` | Active thread changed (create, switch, or initial load) |
| `history-loaded` | `ChatMessage[]` | Persisted messages loaded into history (on `start` or `switchThread`) |
| `vision-context` | `VisionContextEntry` | Vision workload completed with new context |
| `vision-workloads-change` | `boolean` | Vision workload ticker started or stopped |
| `vision-error` | `Error` | Vision workload inference failed |
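The `emotion` payload is produced by stripping inline markers out of the LLM stream before the text reaches TTS. A minimal sketch of that parsing step, assuming the marker body is a JSON object with `name` and `intensity` fields (the exact JSON shape inside the marker is an assumption for illustration):

```ts
interface EmotionPayload {
  name: string;
  intensity: number;
}

// Matches inline markers of the form <|ACT {...}|> in streamed text.
const MARKER = /<\|ACT\s+(\{.*?\})\|>/g;

// Strips all markers from a chunk and collects their parsed payloads.
function extractEmotions(chunk: string): { text: string; emotions: EmotionPayload[] } {
  const emotions: EmotionPayload[] = [];
  const text = chunk.replace(MARKER, (_match, json: string) => {
    try {
      const parsed = JSON.parse(json);
      if (typeof parsed.name === "string" && typeof parsed.intensity === "number") {
        emotions.push({ name: parsed.name, intensity: parsed.intensity });
      }
    } catch {
      // Malformed marker: drop it rather than speak it aloud.
    }
    return ""; // remove the marker from the TTS text
  });
  return { text, emotions };
}
```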
## The pipeline in detail

When `sendMessage(text)` is called:

1. **Interrupt** — Any in-progress pipeline is cancelled via `AbortController`.
2. **User message** — The text is wrapped in a `ChatMessage` and added to history. The `message` event fires.
3. **Memory recall** — If memory with semantic recall is configured, relevant past messages are retrieved and injected into context.
4. **LLM streaming** — State transitions to `thinking`. The full message history (with system prompt or character card) is sent to the LLM. As chunks arrive, the `chunk` event fires with each delta.
5. **Emotion parsing** — If `emotions: true`, inline markers like `<|ACT {...}|>` are parsed from the stream, stripped from the TTS text, and applied to the renderer.
6. **Sentence splitting** — The streaming text is buffered and split on sentence boundaries (`.`, `!`, `?`, newlines). Complete sentences are passed to TTS immediately — no need to wait for the full response.
7. **TTS synthesis** — Each sentence is synthesized to an audio blob. If the renderer has `speakText`, the external TTS step is skipped entirely.
8. **Avatar speech** — State transitions to `speaking`. The audio blob is passed to `renderer.speak()`. The `speech-start` and `speech-end` events fire around playback.
9. **Completion** — After all sentences are spoken, the assistant message is committed to history and the `message` event fires. State returns to `ready`.
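The sentence-splitting step can be sketched as a small buffer that emits complete sentences as soon as a terminator arrives. The boundary rules below (split on `.`, `!`, `?`, and newlines, trim whitespace) follow the description above; a production implementation would also handle abbreviations, decimals, and similar edge cases:

```ts
class SentenceBuffer {
  private buffer = "";

  // Feed a streamed LLM delta; returns any complete sentences it closed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    let start = 0;
    for (let i = 0; i < this.buffer.length; i++) {
      const ch = this.buffer[i];
      if (ch === "." || ch === "!" || ch === "?" || ch === "\n") {
        const sentence = this.buffer.slice(start, i + 1).trim();
        if (sentence) sentences.push(sentence); // skip empty fragments
        start = i + 1;
      }
    }
    this.buffer = this.buffer.slice(start); // keep the incomplete tail
    return sentences;
  }

  // Call when the stream ends to flush any trailing fragment.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest || null;
  }
}
```

Each string returned by `push` can be handed to TTS immediately, which is what lets playback begin before the LLM finishes responding.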
## Interruption

`interrupt()` cancels the pipeline at whatever stage it's in:

- Aborts the LLM stream via `AbortController.abort()`
- Aborts any in-flight TTS request
- Calls `renderer.interrupt()` to stop audio/video playback
- Returns the session to `ready`
The interrupted response is not committed to message history.
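The cancellation pattern behind this can be sketched with a plain `AbortController`; `fakeStream` below is a stand-in for a real LLM token stream, not part of this API:

```ts
// Yields tokens one microtask at a time so an abort can land mid-stream.
async function* fakeStream(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ["Hello", " world", "!"]) {
    await Promise.resolve();      // give interrupt() a chance to run
    if (signal.aborted) return;   // stop cleanly instead of yielding more
    yield token;
  }
}

// Consumes the stream under a signal, like the session's pipeline does.
async function runPipeline(signal: AbortSignal): Promise<string> {
  let accumulated = "";
  for await (const token of fakeStream(signal)) {
    accumulated += token; // a real session would also feed the sentence splitter here
  }
  return accumulated; // an aborted run's partial text is simply dropped
}
```

Here `interrupt()` corresponds to `controller.abort()`: the stream observes `signal.aborted` on its next tick and ends, so the partial response never reaches history.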
## Session configuration

```ts
interface AvatarSessionConfig {
  llm: LLMProvider;
  tts?: TTSProvider;
  stt?: STTProvider;
  vad?: VADProvider;
  renderer: AvatarRenderer;
  systemPrompt?: string;
  characterCard?: CharacterCard;
  reasoningEffort?: "none" | "low" | "medium" | "high";
  realtimeSTT?: RealtimeSTTProvider;
  voice?: VoiceConfig;
  vision?: VisionConfig;
  emotions?: boolean;
  memory?: MemoryConfig;
  visionWorkloads?: VisionWorkloadsConfig;
}
```

| Field | Type | Description |
|---|---|---|
| `llm` | `LLMProvider` | Required. The language model adapter. |
| `tts` | `TTSProvider` | Optional when the renderer implements `speakText`. |
| `stt` | `STTProvider` | Batch STT provider (not used for `startListening`). |
| `vad` | `VADProvider` | Voice activity detection provider. |
| `renderer` | `AvatarRenderer` | Required. The avatar renderer. |
| `systemPrompt` | `string` | System message content. Ignored if `characterCard` is set. |
| `characterCard` | `CharacterCard` | V3 character card. Builds the system prompt from structured fields. Takes precedence over `systemPrompt`. |
| `reasoningEffort` | `ReasoningEffort` | Extended thinking budget passed to the LLM. |
| `realtimeSTT` | `RealtimeSTTProvider` | Enables the `startListening` / `stopListening` voice pipeline. |
| `voice` | `VoiceConfig` | Voice input options: `bargeIn` (default `true`), `bargeInMinLength` (default `2`). |
| `vision` | `VisionConfig` | Image format, quality, and max width for snapshots. |
| `emotions` | `boolean` | Enable emotion marker parsing from LLM output. |
| `memory` | `MemoryConfig` | Thread-based persistent memory. See Memory. |
| `visionWorkloads` | `VisionWorkloadsConfig` | Periodic vision analysis. See Vision. |
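Putting the table together, a minimal configuration might look like the following sketch. The `myLLM`, `myTTS`, and `myRenderer` values are hypothetical adapter instances, and constructing `AvatarSession` directly from this object is an assumption based on the config interface above:

```ts
const session = new AvatarSession({
  llm: myLLM,           // required: any LLMProvider
  tts: myTTS,           // optional if the renderer implements speakText
  renderer: myRenderer, // required: any AvatarRenderer
  systemPrompt: "You are a friendly assistant.",
  emotions: true,       // parse <|ACT {...}|> markers from the stream
  voice: { bargeIn: true, bargeInMinLength: 2 },
});
```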
## Methods
| Method | Description |
|---|---|
| `start(container)` | Mount the renderer and transition to `ready` |
| `sendMessage(text, opts?)` | Run the full LLM → TTS → speak pipeline. `opts.images` attaches images. |
| `interrupt()` | Cancel the current pipeline |
| `startListening(source)` | Begin voice input from an `AsyncIterable<Float32Array>` (e.g. `MicCapture`) |
| `stopListening(opts?)` | Stop voice input. `opts.drain` keeps the connection alive for the final transcript. |
| `startVideo(stream)` | Start video capture from a `MediaStream` |
| `stopVideo()` | Stop video capture |
| `startVisionWorkloads()` | Start the periodic vision workload ticker |
| `stopVisionWorkloads()` | Stop the vision workload ticker |
| `updateControl(control)` | Update avatar face/emotion/body/scene |
| `setLLM(llm)` | Swap the LLM provider at runtime |
| `setTTS(tts)` | Swap the TTS provider at runtime |
| `setRenderer(renderer, container)` | Swap the renderer (unmounts the old one) |
| `setCharacterCard(card)` | Set or replace the character card |
| `switchThread(threadId)` | Switch to an existing thread (loads persisted messages) |
| `newThread(opts?)` | Create a new thread. `opts.title` sets the thread title. |
| `destroy()` | Tear down everything |
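Since `startListening` only needs an `AsyncIterable<Float32Array>`, any frame producer works as a source. A sketch with a synthetic source (a real app would pass something like a `MicCapture` instance instead; the 16 kHz sample rate and 20 ms frame size here are assumptions for illustration):

```ts
// Yields `frameCount` frames of silence, 20 ms each at the given rate.
async function* silentFrames(
  frameCount: number,
  sampleRate = 16000,
): AsyncGenerator<Float32Array> {
  const samplesPerFrame = Math.round(sampleRate * 0.02); // 20 ms of samples
  for (let i = 0; i < frameCount; i++) {
    yield new Float32Array(samplesPerFrame); // zero-filled = silence
  }
}

// Drains a source the way a voice pipeline would; returns total samples seen.
async function collect(source: AsyncIterable<Float32Array>): Promise<number> {
  let samples = 0;
  for await (const frame of source) samples += frame.length;
  return samples;
}
```

With a real microphone source, the same iterable shape would be handed straight to `session.startListening(source)`.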
## Properties
| Property | Type | Description |
|---|---|---|
| `state` | `SessionState` | Current session state |
| `messages` | `readonly ChatMessage[]` | Full conversation history |
| `thread` | `Thread \| null` | Current thread (when memory is configured) |
| `threadId` | `string \| null` | Current thread ID |
| `listening` | `boolean` | Whether voice input is active |
| `videoActive` | `boolean` | Whether video capture is active |
| `visionWorkloadsActive` | `boolean` | Whether the vision workload ticker is running |
| `visionContext` | `VisionContextEntry[]` | Latest context from vision workloads |