Session Lifecycle

Session states, events, and the conversational pipeline.

AvatarSession is the central orchestrator. It manages the conversational pipeline from user input through LLM streaming, sentence segmentation, TTS synthesis, and avatar speech. It also manages voice input, video capture, vision workloads, memory persistence, and thread management.

State machine

autoSpeak true (default):
idle → connecting → ready ⇄ thinking → speaking → ready
                      ↑                              |
                      └──── interrupt() ─────────────┘

autoSpeak false (text-first):
idle → connecting → ready ⇄ thinking → ready
                      ↑       speak(i) → speaking → ready
                      └──── interrupt() ────────────┘

States

State       Description
idle        Session created but not mounted
connecting  Renderer is mounting (establishing connections)
ready       Waiting for user input
thinking    LLM is streaming a response
speaking    TTS audio is playing through the avatar
error       Something went wrong
destroyed   Session torn down via destroy()
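
The two diagrams can be read as transition tables. The sketch below is one way to encode them, for example in tests or UI guards. It is an illustrative model, not the library's implementation: canTransition, AUTO_SPEAK, and TEXT_FIRST are hypothetical names, and the handling of error and destroyed is an assumption, since the diagrams do not show those edges.

```typescript
// Illustrative model of the documented diagrams; not the library's implementation.
type SessionState =
  | "idle" | "connecting" | "ready" | "thinking"
  | "speaking" | "error" | "destroyed";

type TransitionTable = Partial<Record<SessionState, readonly SessionState[]>>;

// autoSpeak: true. thinking -> ready covers an interrupt() before speech starts.
const AUTO_SPEAK: TransitionTable = {
  idle: ["connecting"],
  connecting: ["ready"],
  ready: ["thinking"],
  thinking: ["speaking", "ready"],
  speaking: ["ready"],
};

// autoSpeak: false. speaking is only entered from ready, via speak(index).
const TEXT_FIRST: TransitionTable = {
  idle: ["connecting"],
  connecting: ["ready"],
  ready: ["thinking", "speaking"],
  thinking: ["ready"],
  speaking: ["ready"],
};

function canTransition(
  from: SessionState,
  to: SessionState,
  autoSpeak = true,
): boolean {
  // Assumption: destroyed is terminal, and error/destroyed are reachable
  // from every other state. The diagrams do not cover these edges.
  if (from === "destroyed") return false;
  if (to === "error" || to === "destroyed") return true;
  const table = autoSpeak ? AUTO_SPEAK : TEXT_FIRST;
  return table[from]?.includes(to) ?? false;
}
```

A guard like this can be compared against the values delivered by the state-change event to catch transitions the UI does not expect.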

Events

Subscribe to events via session.on():

session.on("state-change", (state) => { /* ... */ });
session.on("message", (msg) => { /* ... */ });
session.on("chunk", (text, accumulated) => { /* ... */ });
session.on("speech-start", () => { /* ... */ });
session.on("speech-end", () => { /* ... */ });
session.on("error", (err) => { /* ... */ });
session.on("listening-change", (listening) => { /* ... */ });
session.on("transcript", (text, opts) => { /* ... */ });
session.on("emotion", (payload) => { /* ... */ });
session.on("segment", (segment) => { /* ... */ });
session.on("thread-change", (thread) => { /* ... */ });

Event                    Payload                                         Description
state-change             SessionState                                    Fires on every state transition
message                  ChatMessage                                     User or assistant message committed to history
chunk                    (text: string, accumulated: string)             Streaming LLM text delta. accumulated contains the running full text in plain mode; when emotions: true it is always "" (emotion markers are stripped inline).
speech-start             (none)                                          TTS audio playback begins
speech-end               (none)                                          TTS audio playback finishes
error                    Error                                           Error from any pipeline stage
listening-change         boolean                                         Voice input started or stopped
video-change             boolean                                         Video capture started or stopped
transcript               (text: string, opts: TranscriptionTextOptions)  Realtime STT transcript (partial or final)
emotion                  { name: string, intensity: number }             Emotion marker parsed from LLM output (requires emotions: true)
segment                  Segment                                         Sentence-sized text segment produced by the LLM (always emitted when autoSpeak: false)
thread-change            Thread                                          Active thread changed (create, switch, or initial load)
history-loaded           ChatMessage[]                                   Persisted messages loaded into history (on start or switchThread)
vision-context           VisionContextEntry                              Vision workload completed with new context
vision-workloads-change  boolean                                         Vision workload ticker started or stopped
vision-error             Error                                           Vision workload inference failed
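
Because accumulated is always "" when emotions: true, a consumer that wants the running text in both modes may prefer to concatenate the deltas itself. The helper below is a minimal sketch of that idea. createChunkAccumulator is a hypothetical name, not part of the library, and the sketch assumes that deltas arrive in order and that, with emotions enabled, they reach the listener with markers already stripped.

```typescript
// Hypothetical helper, not part of the library: rebuilds the running text
// from chunk deltas so the result is the same with or without emotions.
function createChunkAccumulator() {
  let full = "";
  return {
    // Same parameter shape as the "chunk" listener: (text, accumulated).
    // The second argument is ignored on purpose, since it may be "".
    onChunk(text: string, _accumulated: string): string {
      full += text;
      return full;
    },
    // Call when a new response starts.
    reset(): void {
      full = "";
    },
    get text(): string {
      return full;
    },
  };
}

// Intended wiring (assumes a `session` and a `render` function exist):
//   const running = createChunkAccumulator();
//   session.on("state-change", (s) => { if (s === "thinking") running.reset(); });
//   session.on("chunk", (text, accumulated) => render(running.onChunk(text, accumulated)));
```

Resetting on the transition to thinking is one possible choice; resetting just before each sendMessage() call would work as well.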

The pipeline in detail

When sendMessage(text) is called:

1. Interrupt: Any in-progress pipeline is cancelled via AbortController.

2. User message: The text is wrapped in a ChatMessage and added to history. The message event fires.

3. Memory recall: If memory with semantic recall is configured, relevant past messages are retrieved and injected into context.

4. LLM streaming: State transitions to thinking. The full message history (with system prompt or character card) is sent to the LLM. As chunks arrive, the chunk event fires with each delta.

5. Emotion parsing: If emotions: true, inline markers like <|ACT {...}|> are parsed from the stream, stripped from TTS text, and applied to the renderer.

6. Sentence splitting: The streaming text is buffered and split on sentence boundaries (., !, ?, newlines). Complete sentences are passed to TTS immediately, without waiting for the full response.

7. TTS synthesis (pre-buffered): Sentences are synthesized to audio blobs concurrently. Up to ttsBufferSize (default 3) TTS requests run in parallel so that the next audio blob is already waiting when the current one finishes playing. If the renderer has speakText, the external TTS step is skipped entirely and sentences are spoken sequentially.

8. Avatar speech: State transitions to speaking when the first blob is ready. Blobs are played through renderer.speak() in strict sentence order. The speech-start and speech-end events fire around each sentence's playback.

9. Completion: After all sentences are spoken, the assistant message is committed to history and the message event fires. State returns to ready.
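
The sentence splitting step can be sketched as an incremental buffer that emits a sentence as soon as its boundary is visible. This is an illustration only: SentenceBuffer is a hypothetical name, and the boundary rule used here (a run of ., !, or ? followed by whitespace, or a newline) is an assumption about details the description above leaves open.

```typescript
// Illustrative sketch of incremental sentence splitting.
// The library's actual boundary rules may differ.
class SentenceBuffer {
  private buffer = "";

  // Feed one streamed delta; returns the sentences it completed, if any.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Boundary: punctuation followed by whitespace, or a newline.
    // Requiring the whitespace keeps "3.14" intact and makes a trailing
    // "." wait until the next delta (or flush) confirms the boundary.
    const boundary = /[.!?]+(?=\s)|\n/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the stream ends to emit whatever text is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each string returned by push() could be handed to synthesis right away, which is what lets playback begin before the LLM has finished its response.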

Interruption

interrupt() cancels the pipeline at whatever stage it's in:

  • Cancels the speech queue, aborting all in-flight and buffered TTS requests
  • Aborts the LLM stream via AbortController.abort()
  • Calls renderer.interrupt() to stop audio/video playback
  • Returns the session to ready

If the LLM had already streamed non-empty text before the interrupt, the partial assistant message is committed to history and a message event fires. If no text was produced yet, nothing is committed.
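
The partial-commit rule can be illustrated with a small stand-in for the LLM stage. The sketch below only demonstrates the general AbortController pattern. consumeStream and shouldCommitPartial are hypothetical names, and the stream is modeled as an AsyncIterable<string>, which is an assumption rather than the library's actual provider interface.

```typescript
// Illustrative stand-in for the LLM streaming stage; not the library's code.
interface StreamResult {
  text: string;         // text received before the stream ended or was aborted
  interrupted: boolean; // true when the abort was observed mid-stream
}

async function consumeStream(
  deltas: AsyncIterable<string>,
  signal: AbortSignal,
): Promise<StreamResult> {
  let text = "";
  for await (const delta of deltas) {
    // Stop as soon as the abort is observed; later deltas are discarded.
    if (signal.aborted) return { text, interrupted: true };
    text += delta;
  }
  return { text, interrupted: false };
}

// Documented rule: an interrupted response is committed to history
// only when some non-empty text had already been streamed.
function shouldCommitPartial(result: StreamResult): boolean {
  return result.interrupted && result.text.length > 0;
}

// Typical shape of a caller (names are placeholders):
//   const controller = new AbortController();
//   const pending = consumeStream(llmDeltas, controller.signal);
//   controller.abort(); // the cancellation step described above
//   const result = await pending;
//   if (shouldCommitPartial(result)) { /* add to history, emit "message" */ }
```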

Session configuration

interface AvatarSessionConfig {
  llm: LLMProvider;
  tts?: TTSProvider;
  renderer: AvatarRenderer;
  systemPrompt?: string;
  characterCard?: CharacterCard;
  reasoningEffort?: "none" | "low" | "medium" | "high";
  realtimeSTT?: RealtimeSTTProvider;
  voice?: VoiceConfig;
  vision?: VisionConfig;
  emotions?: boolean;
  memory?: MemoryConfig;
  visionWorkloads?: VisionWorkloadsConfig;
  ttsBufferSize?: number;
  autoSpeak?: boolean;
}

Field            Type                   Description
llm              LLMProvider            Required. The language model adapter.
tts              TTSProvider            Optional when the renderer implements speakText.
renderer         AvatarRenderer         Required. The avatar renderer.
systemPrompt     string                 System message content. Ignored if characterCard is set.
characterCard    CharacterCard          V3 character card. Builds system prompt from structured fields. Takes precedence over systemPrompt.
reasoningEffort  ReasoningEffort        Extended thinking budget passed to the LLM.
realtimeSTT      RealtimeSTTProvider    Enables startListening / stopListening voice pipeline.
voice            VoiceConfig            Voice input options: bargeIn (default true), bargeInMinLength (default 2).
vision           VisionConfig           Image format, quality, and max width for snapshots.
emotions         boolean                Enable emotion marker parsing from LLM output.
memory           MemoryConfig           Thread-based persistent memory. See Memory.
visionWorkloads  VisionWorkloadsConfig  Periodic vision analysis. See Vision.
ttsBufferSize    number                 Max sentences synthesized ahead of playback (default 3). Ignored for speakText renderers.
autoSpeak        boolean                When false, sendMessage() emits segment events instead of automatically synthesizing and playing audio. Call speak(index) to play on demand. Default true.
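
The ttsBufferSize look-ahead can be pictured as a bounded queue of in-flight synthesis requests that is always drained oldest-first. The following is a minimal sketch under that reading, not the library's scheduler: speakInOrder is a hypothetical name, and the synthesize and play callbacks are placeholders standing in for a TTS provider and for renderer.speak().

```typescript
// Illustrative look-ahead scheduler; not the library's implementation.
async function speakInOrder<A>(
  sentences: readonly string[],
  synthesize: (sentence: string) => Promise<A>, // placeholder for a TTS call
  play: (audio: A) => Promise<void>,            // placeholder for playback
  bufferSize = 3,
): Promise<void> {
  const limit = Math.max(1, bufferSize);
  const inFlight: Promise<A>[] = [];
  let next = 0;

  // Start requests until `limit` are pending or no sentences remain.
  const fill = (): void => {
    while (inFlight.length < limit && next < sentences.length) {
      const request = synthesize(sentences[next++]);
      // Mark the promise as handled so a failure that lands during playback
      // is not reported as unhandled; the await below still rethrows it.
      request.catch(() => {});
      inFlight.push(request);
    }
  };

  fill();
  while (inFlight.length > 0) {
    // Always wait for the oldest request, which preserves sentence order
    // even when later requests finish first.
    const audio = await inFlight.shift()!;
    fill(); // top up the look-ahead before playback starts
    await play(audio);
  }
}
```

A larger bufferSize hides slow synthesis at the cost of more requests that are wasted if the response is interrupted; a value of 1 still overlaps one synthesis request with the current playback.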

Methods

Method                            Description
start(container)                  Mount the renderer and transition to ready
sendMessage(text, opts?)          Run the full LLM → TTS → speak pipeline. opts.images attaches images.
interrupt()                       Cancel the current pipeline
startListening(source)            Begin voice input from an AsyncIterable<Float32Array> (e.g. MicCapture)
stopListening(opts?)              Stop voice input. opts.drain keeps connection alive for final transcript.
startVideo(stream)                Start video capture from a MediaStream
stopVideo()                       Stop video capture
startVisionWorkloads()            Start the periodic vision workload ticker
stopVisionWorkloads()             Stop the vision workload ticker
updateControl(control)            Update avatar face/emotion/body/scene
setLLM(llm)                       Swap the LLM provider at runtime
setTTS(tts)                       Swap the TTS provider at runtime
setRenderer(renderer, container)  Swap the renderer (unmounts old one)
setCharacterCard(card)            Set or replace the character card
switchThread(threadId)            Switch to an existing thread (loads persisted messages)
newThread(opts?)                  Create a new thread. opts.title sets the thread title.
speak(index)                      Synthesize and play a single segment on demand. Only works in ready state.
destroy()                         Tear down everything

Properties

Property               Type                    Description
state                  SessionState            Current session state
messages               readonly ChatMessage[]  Full conversation history
thread                 Thread | null           Current thread (when memory is configured)
threadId               string | null           Current thread ID
listening              boolean                 Whether voice input is active
videoActive            boolean                 Whether video capture is active
visionWorkloadsActive  boolean                 Whether the vision workload ticker is running
visionContext          VisionContextEntry[]    Latest context from vision workloads
segments               readonly Segment[]      Accumulated text segments from the current/last response (populated when autoSpeak: false)

Text-first mode

By default, AvatarSession runs the full pipeline: LLM stream → sentence split → TTS → avatar speech. Set autoSpeak: false to decouple text generation from speech. The session streams LLM output and emits structured segment events for each sentence-sized chunk, but does not synthesize or play audio. You can then call speak(index) to play any segment on demand.

This is useful for multi-character UIs where you want to display dialogue as text first and let the user (or your orchestration layer) decide when to play each line.

const session = new AvatarSession({
  llm,
  tts,
  renderer,
  autoSpeak: false,
});

session.on("segment", (segment) => {
  console.log(`[${segment.index}] ${segment.text}`);
  // segment.emotion is set when emotions: true
});

await session.sendMessage("Tell me a story");

// Play a specific segment later
await session.speak(0);

Segment type

interface Segment {
  index: number;
  text: string;
  emotion?: EmotionPayload;
}

Segments are cleared only at the start of each sendMessage() call. The segments getter exposes the accumulated segments from the current or most recent response. Calling interrupt() during speak() stops playback but leaves the segment list intact, so any segment can still be replayed afterwards.
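
Since each segment is addressable by index, an orchestration layer can replay a whole response, or the tail of one, by calling speak(index) in sequence. The helper below is a sketch of that pattern. playFrom and SpeakableSession are hypothetical names, and the sketch assumes that speak(index) returns a promise that settles once playback has finished and the session is back in ready.

```typescript
// Structural subset of the session used by this sketch (hypothetical type).
interface SpeakableSession {
  readonly segments: readonly { index: number; text: string }[];
  speak(index: number): Promise<unknown>;
}

// Plays every segment from `from` onward, one at a time, in index order.
// Returns the indices that were played.
async function playFrom(
  session: SpeakableSession,
  from = 0,
): Promise<number[]> {
  // Snapshot first: the live list is cleared by the next sendMessage().
  const queue = session.segments
    .filter((segment) => segment.index >= from)
    .sort((a, b) => a.index - b.index);

  const played: number[] = [];
  for (const segment of queue) {
    // Assumed to resolve when playback of this segment ends.
    await session.speak(segment.index);
    played.push(segment.index);
  }
  return played;
}
```

For example, `await playFrom(session, 2)` would skip the first two segments and speak the rest in order, provided the assumption about speak() holds.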