Session Lifecycle

Session states, events, and the conversational pipeline.

AvatarSession is the central orchestrator. It manages the conversational pipeline from user input through LLM streaming, sentence segmentation, TTS synthesis, and avatar speech. It also manages voice input, video capture, vision workloads, memory persistence, and thread management.

State machine

idle → connecting → ready ⇄ thinking → speaking → ready
                      ↑                              |
                      └──── interrupt() ─────────────┘

States

| State | Description |
| --- | --- |
| idle | Session created but not mounted |
| connecting | Renderer is mounting (establishing connections) |
| ready | Waiting for user input |
| thinking | LLM is streaming a response |
| speaking | TTS audio is playing through the avatar |
| error | Something went wrong |
| destroyed | Session torn down via destroy() |
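
The allowed transitions can be expressed as a lookup table. This is a minimal sketch inferred from the diagram above; the real AvatarSession may permit additional transitions (for example, into error from any active state):

```typescript
// Sketch of the session state machine. The transition map below is inferred
// from the state diagram; it is not the library's actual implementation.
type SessionState =
  | "idle" | "connecting" | "ready" | "thinking" | "speaking" | "error" | "destroyed";

const transitions: Record<SessionState, SessionState[]> = {
  idle: ["connecting"],
  connecting: ["ready", "error"],
  ready: ["thinking", "destroyed"],
  thinking: ["speaking", "ready"], // interrupt() returns straight to ready
  speaking: ["ready"],             // speech-end or interrupt()
  error: ["destroyed"],
  destroyed: [],                   // terminal
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return transitions[from].includes(to);
}
```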

Events

Subscribe to events via session.on():

session.on("state-change", (state) => { /* ... */ });
session.on("message", (msg) => { /* ... */ });
session.on("chunk", (text, accumulated) => { /* ... */ });
session.on("speech-start", () => { /* ... */ });
session.on("speech-end", () => { /* ... */ });
session.on("error", (err) => { /* ... */ });
session.on("listening-change", (listening) => { /* ... */ });
session.on("transcript", (text, opts) => { /* ... */ });
session.on("emotion", (payload) => { /* ... */ });
session.on("thread-change", (thread) => { /* ... */ });
| Event | Payload | Description |
| --- | --- | --- |
| state-change | SessionState | Fires on every state transition |
| message | ChatMessage | User or assistant message committed to history |
| chunk | (text: string, accumulated: string) | Streaming LLM text delta with full accumulated text |
| speech-start | (none) | TTS audio playback begins |
| speech-end | (none) | TTS audio playback finishes |
| error | Error | Error from any pipeline stage |
| listening-change | boolean | Voice input started or stopped |
| video-change | boolean | Video capture started or stopped |
| transcript | (text: string, opts: TranscriptionTextOptions) | Realtime STT transcript (partial or final) |
| emotion | { name: string, intensity: number } | Emotion marker parsed from LLM output (requires emotions: true) |
| thread-change | Thread | Active thread changed (create, switch, or initial load) |
| history-loaded | ChatMessage[] | Persisted messages loaded into history (on start or switchThread) |
| vision-context | VisionContextEntry | Vision workload completed with new context |
| vision-workloads-change | boolean | Vision workload ticker started or stopped |
| vision-error | Error | Vision workload inference failed |
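
To illustrate where the emotion payload comes from, here is a small parser for the <|ACT {...}|> markers described in the pipeline section below. This is a sketch under assumptions: the marker syntax matches the pipeline description, but the real parser's grammar and error handling may differ.

```typescript
// Hedged sketch: extract emotion markers of the form <|ACT {...}|> from a
// text chunk, returning the stripped text (for TTS) and the parsed payloads.
interface EmotionPayload { name: string; intensity: number; }

const MARKER = /<\|ACT\s+(\{.*?\})\s*\|>/g;

function extractEmotions(chunk: string): { text: string; emotions: EmotionPayload[] } {
  const emotions: EmotionPayload[] = [];
  const text = chunk.replace(MARKER, (_, json) => {
    try { emotions.push(JSON.parse(json)); } catch { /* ignore malformed markers */ }
    return ""; // strip the marker from the TTS text
  });
  return { text, emotions };
}
```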

The pipeline in detail

When sendMessage(text) is called:

1. Interrupt — Any in-progress pipeline is cancelled via AbortController.

2. User message — The text is wrapped in a ChatMessage and added to history. The message event fires.

3. Memory recall — If memory with semantic recall is configured, relevant past messages are retrieved and injected into context.

4. LLM streaming — State transitions to thinking. The full message history (with system prompt or character card) is sent to the LLM. As chunks arrive, the chunk event fires with each delta.

5. Emotion parsing — If emotions: true, inline markers like <|ACT {...}|> are parsed from the stream, stripped from TTS text, and applied to the renderer.

6. Sentence splitting — The streaming text is buffered and split on sentence boundaries (., !, ?, newlines). Complete sentences are passed to TTS immediately — no need to wait for the full response.

7. TTS synthesis — Each sentence is synthesized to an audio blob. If the renderer has speakText, the external TTS step is skipped entirely.

8. Avatar speech — State transitions to speaking. The audio blob is passed to renderer.speak(). The speech-start and speech-end events fire around playback.

9. Completion — After all sentences are spoken, the assistant message is committed to history and the message event fires. State returns to ready.
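
The sentence-splitting step can be sketched as a small buffer that consumes streaming deltas and emits complete sentences. This is a simplified sketch; the real splitter likely also handles abbreviations, decimals, and minimum sentence lengths.

```typescript
// Sketch of streaming sentence segmentation: buffer LLM deltas and emit
// complete sentences on ., !, ? or newline boundaries.
class SentenceBuffer {
  private buf = "";

  // Feed one streaming delta; returns any complete sentences found so far.
  push(delta: string): string[] {
    this.buf += delta;
    const out: string[] = [];
    const boundary = /[^.!?\n]*[.!?\n]+/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buf)) !== null) {
      const sentence = match[0].trim();
      if (sentence) out.push(sentence);
      consumed = boundary.lastIndex;
    }
    this.buf = this.buf.slice(consumed); // keep the incomplete tail
    return out;
  }

  // Flush whatever remains when the stream ends.
  flush(): string | null {
    const rest = this.buf.trim();
    this.buf = "";
    return rest || null;
  }
}
```

Each emitted sentence would be handed to TTS while the LLM stream is still running, which is what lets playback start before the full response arrives.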

Interruption

interrupt() cancels the pipeline at whatever stage it's in:

  • Aborts the LLM stream via AbortController.abort()
  • Aborts any in-flight TTS request
  • Calls renderer.interrupt() to stop audio/video playback
  • Returns the session to ready

The interrupted response is not committed to message history.
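
The LLM-stage cancellation can be sketched with a plain AbortController, assuming the provider exposes its stream as an AsyncIterable that the consumer checks against the signal. That shape is an assumption; the real adapters may integrate the signal more deeply (e.g. aborting the underlying network request).

```typescript
// Sketch of abortable stream consumption. The AsyncIterable<string> shape for
// the chunk stream is an assumption for illustration, not the documented API.
async function runPipeline(
  chunks: AsyncIterable<string>,
  signal: AbortSignal,
): Promise<string> {
  let accumulated = "";
  for await (const chunk of chunks) {
    if (signal.aborted) break; // interrupted: stop consuming
    accumulated += chunk;
  }
  // An interrupted response is discarded, never committed to history.
  return signal.aborted ? "" : accumulated;
}
```

Calling controller.abort() mid-stream makes the loop bail out and the partial response is dropped, matching the rule above that interrupted responses are not committed.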

Session configuration

interface AvatarSessionConfig {
  llm: LLMProvider;
  tts?: TTSProvider;
  stt?: STTProvider;
  vad?: VADProvider;
  renderer: AvatarRenderer;
  systemPrompt?: string;
  characterCard?: CharacterCard;
  reasoningEffort?: "none" | "low" | "medium" | "high";
  realtimeSTT?: RealtimeSTTProvider;
  voice?: VoiceConfig;
  vision?: VisionConfig;
  emotions?: boolean;
  memory?: MemoryConfig;
  visionWorkloads?: VisionWorkloadsConfig;
}
| Field | Type | Description |
| --- | --- | --- |
| llm | LLMProvider | Required. The language model adapter. |
| tts | TTSProvider | Optional when the renderer implements speakText. |
| stt | STTProvider | Batch STT provider (not used for startListening). |
| vad | VADProvider | Voice activity detection provider. |
| renderer | AvatarRenderer | Required. The avatar renderer. |
| systemPrompt | string | System message content. Ignored if characterCard is set. |
| characterCard | CharacterCard | V3 character card. Builds the system prompt from structured fields. Takes precedence over systemPrompt. |
| reasoningEffort | ReasoningEffort | Extended thinking budget passed to the LLM. |
| realtimeSTT | RealtimeSTTProvider | Enables the startListening / stopListening voice pipeline. |
| voice | VoiceConfig | Voice input options: bargeIn (default true), bargeInMinLength (default 2). |
| vision | VisionConfig | Image format, quality, and max width for snapshots. |
| emotions | boolean | Enable emotion marker parsing from LLM output. |
| memory | MemoryConfig | Thread-based persistent memory. See Memory. |
| visionWorkloads | VisionWorkloadsConfig | Periodic vision analysis. See Vision. |

Methods

| Method | Description |
| --- | --- |
| start(container) | Mount the renderer and transition to ready |
| sendMessage(text, opts?) | Run the full LLM → TTS → speak pipeline. opts.images attaches images. |
| interrupt() | Cancel the current pipeline |
| startListening(source) | Begin voice input from an AsyncIterable<Float32Array> (e.g. MicCapture) |
| stopListening(opts?) | Stop voice input. opts.drain keeps the connection alive for the final transcript. |
| startVideo(stream) | Start video capture from a MediaStream |
| stopVideo() | Stop video capture |
| startVisionWorkloads() | Start the periodic vision workload ticker |
| stopVisionWorkloads() | Stop the vision workload ticker |
| updateControl(control) | Update avatar face/emotion/body/scene |
| setLLM(llm) | Swap the LLM provider at runtime |
| setTTS(tts) | Swap the TTS provider at runtime |
| setRenderer(renderer, container) | Swap the renderer (unmounts the old one) |
| setCharacterCard(card) | Set or replace the character card |
| switchThread(threadId) | Switch to an existing thread (loads persisted messages) |
| newThread(opts?) | Create a new thread. opts.title sets the thread title. |
| destroy() | Tear down everything |

Properties

| Property | Type | Description |
| --- | --- | --- |
| state | SessionState | Current session state |
| messages | readonly ChatMessage[] | Full conversation history |
| thread | Thread \| null | Current thread (when memory is configured) |
| threadId | string \| null | Current thread ID |
| listening | boolean | Whether voice input is active |
| videoActive | boolean | Whether video capture is active |
| visionWorkloadsActive | boolean | Whether the vision workload ticker is running |
| visionContext | VisionContextEntry[] | Latest context from vision workloads |