Introduction
AvatarLayer is a pluggable TypeScript SDK for realtime conversational avatars. It provides a clean provider model for LLM, TTS, STT, and avatar rendering — supporting local 3D avatars (VRM, Live2D), remote video avatar services (LemonSlice, Atlas, HeyGen), voice input, persistent memory, character cards, and on-device ML behind a single unified interface.
The pipeline
Every conversational turn follows the same flow:
User text / voice → LLM stream → sentence split → TTS → renderer.speak()

When voice input is enabled, the pipeline extends to:

Mic → RealtimeSTT → transcript → (barge-in or sendMessage) → LLM → TTS → renderer

AvatarSession handles streaming, sentence segmentation, interruption, voice activity detection, memory recall, and state transitions automatically. You plug in the providers you want and the SDK does the rest.
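The sentence-split stage can be pictured with a small standalone segmenter. This is an illustrative sketch, not AvatarLayer's internal splitter: it buffers streamed LLM chunks and emits a sentence as soon as a terminator arrives, so TTS can start before the full response is done.

```ts
// Buffers incoming token chunks and emits complete sentences eagerly.
// Illustrative only; the SDK's real segmenter may behave differently.
function createSentenceSplitter(onSentence: (s: string) => void) {
  let buffer = "";
  return {
    push(chunk: string) {
      buffer += chunk;
      // Split on sentence-ending punctuation followed by whitespace.
      const parts = buffer.split(/(?<=[.!?])\s+/);
      buffer = parts.pop() ?? "";
      for (const sentence of parts) onSentence(sentence);
    },
    flush() {
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}

// Feed it chunks the way an LLM delta stream would arrive:
const sentences: string[] = [];
const splitter = createSentenceSplitter((s) => sentences.push(s));
splitter.push("Hello there! I am an ava");
splitter.push("tar. Nice to");
splitter.push(" meet you");
splitter.flush();
```

Emitting per sentence rather than per token is what lets the avatar begin speaking while the tail of the LLM response is still streaming.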
Install
One-line install
AvatarLayer is published on npm and works with any Node.js package manager.
npm install avatarlayer

Quick start
```ts
import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
} from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-...", model: "gpt-5.4-mini" }),
  tts: new ElevenLabsAdapter({ apiKey: "...", voiceId: "21m00Tcm4TlvDq8ikWAM" }),
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  systemPrompt: "You are a helpful avatar assistant.",
});

await session.start(document.getElementById("avatar-container")!);
await session.sendMessage("Hello! Tell me about yourself.");
```

Key features
13+ LLM adapters
OpenAI, Anthropic, Gemini, Groq, DeepSeek, Mistral, xAI, OpenRouter, Together, Fireworks, Azure OpenAI, Ollama, and Chrome Prompt API.
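Because every adapter fills the same `llm` slot, switching providers is a one-line change. A sketch reusing the quick-start config; the commented alternative uses a hypothetical `OllamaAdapter` constructor name, so check the Providers page for the exact exports:

```ts
const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-...", model: "gpt-5.4-mini" }),
  // llm: new OllamaAdapter({ model: "llama3.1" }), // hypothetical local swap
  // tts, renderer, systemPrompt as in the quick start
});
```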
Multiple avatar backends
Local 3D (VRM, Live2D), remote video (LemonSlice, Atlas, HeyGen). One AvatarRenderer interface, swap at runtime.
Voice input
Realtime STT with barge-in, VAD, and mic capture. Deepgram, ElevenLabs, Azure Speech, Amazon Transcribe, WebSpeech, or local Whisper.
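Wiring this up is a session-level setting. A sketch, assuming an `stt` option and a `DeepgramAdapter` constructor (both names are hypothetical here; Deepgram is one of the listed providers, and the Voice Input page has the real configuration):

```ts
const session = new AvatarSession({
  // llm, tts, renderer as in the quick start
  stt: new DeepgramAdapter({ apiKey: "..." }), // hypothetical adapter name
});
```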
Memory and threads
Persist conversations across sessions with pluggable thread providers. Semantic recall via vector embeddings for long-term context.
Character cards
Load V3 character cards from PNG or JSON. Lorebook, personality, scenario, and message examples — all structured.
Emotions
Inline emotion markers in LLM output. Automatic expression mapping for VRM and Live2D renderers.
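The idea can be illustrated with a standalone parser. The real marker syntax is AvatarLayer's own and is not shown here; this sketch assumes a bracketed `[emotion]` form purely to show how markers get stripped from the speech text and collected as expression cues:

```ts
// Split LLM output into speech text plus emotion cues, assuming a
// hypothetical [emotion] marker syntax (the real syntax may differ).
function parseEmotionMarkers(text: string): { speech: string; emotions: string[] } {
  const emotions: string[] = [];
  const speech = text
    .replace(/\[(\w+)\]/g, (_match, name: string) => {
      emotions.push(name); // collect the cue for the renderer
      return ""; // remove the marker from what gets spoken
    })
    .replace(/\s+/g, " ")
    .trim();
  return { speech, emotions };
}

const turn = parseEmotionMarkers("[happy] Great to see you! [surprised] A VRM model?");
```

The speech text goes to TTS while the cues drive the VRM or Live2D expression mapping.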
React bindings
AvatarProvider, useAvatarSession, AvatarView, and useMic — drop an avatar into any React app in minutes.
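A minimal sketch of the shape this takes. The component and hook names come from the list above, but the import path and exact prop/return shapes are assumptions; see the React Integration page for the real API:

```tsx
import { AvatarProvider, AvatarView, useAvatarSession } from "avatarlayer";

function Chat() {
  // Hook and prop shapes are illustrative assumptions.
  const session = useAvatarSession();
  return (
    <div>
      <AvatarView />
      <button onClick={() => session.sendMessage("Hi!")}>Say hi</button>
    </div>
  );
}

export function App() {
  return (
    <AvatarProvider /* session config as in the quick start */>
      <Chat />
    </AvatarProvider>
  );
}
```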
Interruptible pipeline
Cancel LLM streaming, TTS synthesis, and avatar speech at any point with a single interrupt() call.
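In practice, assuming the `session` object from the quick start and a `stopButton` element of your own, a stop control is one listener:

```ts
// Stop generation, synthesis, and avatar speech mid-turn.
stopButton.addEventListener("click", () => session.interrupt());
```

Barge-in uses the same mechanism: when voice input detects the user speaking, the current turn is interrupted before the new transcript is sent.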
Local ML
Run TTS (Kokoro, Kitten), STT (Whisper), VAD (Silero), and embeddings entirely on-device via WebGPU / WASM.
Vision
Video input with periodic vision workloads — screen interpretation, OCR, UI automation context injected into the LLM.
Custom adapters
Implement any provider interface to add new LLMs, TTS engines, STT services, or renderers.
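The general pattern is satisfying a provider interface with your own class. The interface shape below is a guess for illustration, not the SDK's real contract (check the exported types); the sketch shows a no-op TTS adapter useful for testing a pipeline without a real backend:

```ts
// Hypothetical minimal TTS contract; the real interface in avatarlayer
// may have more methods and different names.
interface TTSAdapterLike {
  synthesize(text: string): Promise<ArrayBuffer>;
}

// A custom adapter is just a class (or object) satisfying the interface.
class SilentTTS implements TTSAdapterLike {
  async synthesize(text: string): Promise<ArrayBuffer> {
    // Return a silent buffer sized roughly to the text length (2 bytes
    // per character, as if it were 16-bit PCM), for pipeline testing.
    return new ArrayBuffer(text.length * 2);
  }
}

const tts = new SilentTTS();
```

The same approach applies to LLM, STT, and renderer slots: implement the interface, pass the instance to `AvatarSession`.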
Avatar control schema
Fine-grained control over face, emotion, body, and scene via the avatar-runtime v0.2 contract.
Next steps
Getting Started
Install, configure, and run your first avatar session
Providers
Configure LLM, TTS, and STT adapters
Renderers
Local VRM/Live2D, LemonSlice, Atlas, and HeyGen
Voice Input
Realtime speech-to-text, VAD, and mic capture
Memory
Persistent threads and semantic recall
React Integration
Use AvatarLayer in React apps