Introduction

AvatarLayer is a pluggable TypeScript SDK for realtime conversational avatars. It provides a clean provider model for LLM, TTS, STT, and avatar rendering, supporting local 3D avatars (VRM, Live2D), remote video avatar services (LemonSlice, Atlas, HeyGen), voice input, persistent memory, character cards, and on-device ML, all behind a single unified interface.

The pipeline

Every conversational turn follows the same flow:

User text / voice → LLM stream → sentence split → TTS → renderer.speak()

When voice input is enabled, the pipeline extends to:

Mic → RealtimeSTT → transcript → (barge-in or sendMessage) → LLM → TTS → renderer

AvatarSession handles streaming, sentence segmentation, interruption, voice activity detection, memory recall, and state transitions automatically. You plug in the providers you want and the SDK does the rest.
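The sentence-split stage is what lets TTS start before the LLM finishes responding. The following is a minimal standalone sketch of that idea, not the SDK's internal implementation: buffer streamed tokens and emit each complete sentence as soon as a boundary appears.

```typescript
// Sketch of the sentence-split stage (illustration only, not SDK code):
// scan the buffer for sentence boundaries and return complete sentences
// plus the unfinished remainder to carry into the next chunk.
function splitSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  // A sentence ends with ., !, or ? followed by whitespace (or end of buffer).
  const re = /[^.!?]*[.!?]+(?:\s+|$)/g;
  let lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(buffer)) !== null) {
    sentences.push(m[0].trim());
    lastIndex = re.lastIndex;
  }
  return { sentences, rest: buffer.slice(lastIndex) };
}

// Simulate an LLM response arriving in streamed chunks.
const chunks = ["Hi! I am an ava", "tar. Ask me any", "thing."];
let pending = "";
const spoken: string[] = [];
for (const chunk of chunks) {
  pending += chunk;
  const { sentences, rest } = splitSentences(pending);
  spoken.push(...sentences); // each complete sentence would go to TTS here
  pending = rest;
}
```

Each entry in `spoken` can be handed to TTS immediately, which is why the avatar starts talking mid-stream instead of waiting for the full response.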

Install

One-line install

AvatarLayer is published on npm and works with any Node.js package manager.

npm install avatarlayer

Quick start

import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
} from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-...", model: "gpt-5.4-mini" }),
  tts: new ElevenLabsAdapter({ apiKey: "...", voiceId: "21m00Tcm4TlvDq8ikWAM" }),
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  systemPrompt: "You are a helpful avatar assistant.",
});

await session.start(document.getElementById("avatar-container")!);
await session.sendMessage("Hello! Tell me about yourself.");

Key features

13+ LLM adapters

OpenAI, Anthropic, Gemini, Groq, DeepSeek, Mistral, xAI, OpenRouter, Together, Fireworks, Azure OpenAI, Ollama, and Chrome Prompt API.

Multiple avatar backends

Local 3D (VRM, Live2D), remote video (LemonSlice, Atlas, HeyGen). One AvatarRenderer interface, swap at runtime.

Voice input

Realtime STT with barge-in, VAD, and mic capture. Deepgram, ElevenLabs, Azure Speech, Amazon Transcribe, WebSpeech, or local Whisper.
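Enabling voice means adding an STT provider to the session config from the quick start. The `stt` option key, the `DeepgramAdapter` export, and its constructor options below are assumptions for illustration; check the adapter reference for the exact names.

```typescript
import {
  AvatarSession,
  OpenAIAdapter,
  ElevenLabsAdapter,
  VRMLocalRenderer,
} from "avatarlayer";
// Hypothetical STT adapter import -- consult the adapter reference for the real export.
import { DeepgramAdapter } from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-...", model: "gpt-5.4-mini" }),
  tts: new ElevenLabsAdapter({ apiKey: "...", voiceId: "21m00Tcm4TlvDq8ikWAM" }),
  stt: new DeepgramAdapter({ apiKey: "..." }), // hypothetical option shape
  renderer: new VRMLocalRenderer({ modelUrl: "/models/avatar.vrm" }),
  systemPrompt: "You are a helpful avatar assistant.",
});
```

With an STT provider in place, mic transcripts flow into the same pipeline as typed messages, and speech during avatar playback triggers barge-in.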

Memory and threads

Persist conversations across sessions with pluggable thread providers. Semantic recall via vector embeddings for long-term context.
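Conceptually, semantic recall ranks stored messages by embedding similarity to the current query and injects the best matches into the LLM context. The sketch below illustrates the ranking step with plain cosine similarity; it is not the SDK's implementation.

```typescript
// Illustration of semantic recall (not SDK code): rank stored memories by
// cosine similarity between embedding vectors, return the top-K texts.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function recall(
  query: number[],
  memories: { text: string; embedding: number[] }[],
  topK: number,
): string[] {
  return [...memories]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, topK)
    .map((m) => m.text);
}
```

In the real pipeline the embeddings come from an embedding provider (local or remote), and the recalled texts are prepended to the conversation before the LLM call.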

Character cards

Load V3 character cards from PNG or JSON. Lorebook, personality, scenario, and message examples — all structured.

Emotions

Inline emotion markers in LLM output. Automatic expression mapping for VRM and Live2D renderers.
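To make the marker idea concrete, here is a standalone sketch of extracting emotion tags from LLM output. The `[emotion]` bracket syntax is an assumed format for illustration; see the emotions guide for the exact marker grammar the SDK uses.

```typescript
// Sketch of emotion-marker extraction (assumed [emotion] syntax, not SDK code):
// strip markers from the text destined for TTS and collect them so the
// renderer can map each one to a VRM/Live2D expression.
function extractEmotions(text: string): { clean: string; emotions: string[] } {
  const emotions: string[] = [];
  const clean = text
    .replace(/\[(\w+)\]/g, (_match, name: string) => {
      emotions.push(name);
      return "";
    })
    .replace(/\s+/g, " ")
    .trim();
  return { clean, emotions };
}
```

Separating the spoken text from the markers matters: TTS should never read "happy" aloud, while the renderer needs the marker at the right point in the utterance.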

React bindings

AvatarProvider, useAvatarSession, AvatarView, and useMic — drop an avatar into any React app in minutes.

Interruptible pipeline

Cancel LLM streaming, TTS synthesis, and avatar speech at any point with a single interrupt() call.
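One way to see how a single call can stop every stage is the shared-signal pattern: all pipeline stages observe one AbortSignal, so aborting the controller cancels them together. This is a conceptual sketch of that pattern, not the SDK's internals.

```typescript
// Conceptual sketch (not SDK code): every stage checks one shared AbortSignal,
// so a single abort() stops LLM streaming, TTS, and playback together.
async function runPipeline(tokens: string[], signal: AbortSignal): Promise<string[]> {
  const spoken: string[] = [];
  for (const token of tokens) {
    if (signal.aborted) break; // each stage bails out as soon as the signal fires
    spoken.push(token);        // stand-in for TTS synthesis + renderer.speak()
    await new Promise((resolve) => setTimeout(resolve, 5));
  }
  return spoken;
}

const controller = new AbortController();
// Elsewhere (e.g. on barge-in), one call interrupts everything:
// controller.abort();
```

In the SDK, `interrupt()` plays the role of `controller.abort()` here: it is the single point that tears down all in-flight work.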

Local ML

Run TTS (Kokoro, Kitten), STT (Whisper), VAD (Silero), and embeddings entirely on-device via WebGPU / WASM.

Vision

Video input with periodic vision workloads: screen interpretation, OCR, and UI-automation context injected into the LLM.


Custom adapters

Implement any provider interface to add new LLMs, TTS engines, STT services, or renderers.
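As a shape for what a custom adapter looks like, here is a minimal sketch of a do-nothing TTS adapter. The interface (`synthesize()` returning audio bytes) is an assumption for illustration; match the actual TTS provider interface exported by the SDK.

```typescript
// Sketch only: an assumed minimal TTS interface, not the SDK's real one.
interface TTSLike {
  synthesize(text: string): Promise<ArrayBuffer>;
}

// A silent adapter: returns empty audio. Useful as a test double or a
// "muted" mode while the rest of the pipeline runs normally.
class SilentTTS implements TTSLike {
  async synthesize(_text: string): Promise<ArrayBuffer> {
    return new ArrayBuffer(0);
  }
}
```

The same pattern applies to LLM, STT, and renderer providers: implement the interface, then pass an instance into the `AvatarSession` config in place of a built-in adapter.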

Avatar control schema

Fine-grained control over face, emotion, body, and scene via the avatar-runtime v0.2 contract.

Next steps