Vision

Video input, image attachments, and periodic vision workloads.

AvatarLayer supports visual input through three mechanisms:

  • Image attachments — Attach images directly to messages
  • Video input — Capture frames from a live MediaStream
  • Vision workloads — Periodic background analysis of captured frames using a vision-capable LLM

Image attachments

Attach images (data URLs or HTTPS URLs) to any message:

await session.sendMessage("What do you see in this image?", {
  images: ["data:image/png;base64,..."],
});

Each entry in the images array is included as an ImageContentPart in the message sent to the LLM.
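If an image exists as raw bytes rather than a URL, it can be encoded into a data URL before attaching. A minimal sketch (the toDataUrl helper below is our own, not part of the AvatarLayer API):

```typescript
// Encode raw image bytes as a data URL suitable for the `images` option.
// `toDataUrl` is a hypothetical helper, not an AvatarLayer export.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Example: encode a tiny byte payload (the PNG magic-number prefix).
const url = toDataUrl(new Uint8Array([0x89, 0x50, 0x4e, 0x47]), "image/png");
// → "data:image/png;base64,iVBORw=="
```

The resulting string can be passed directly in the images array of sendMessage.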

Video input

Start and stop video capture from a MediaStream:

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
session.startVideo(stream);

// Later, when capture is no longer needed:
session.stopVideo();

VisionConfig

Control how frames are captured for vision:

const session = new AvatarSession({
  // ...other config
  vision: {
    imageFormat: "image/webp", // default
    imageQuality: 0.5,         // default
    maxWidth: 512,             // default
  },
});

Field         Type     Default         Description
imageFormat   string   "image/webp"    MIME type for captured frames
imageQuality  number   0.5             Image quality (0-1)
maxWidth      number   512             Max width in pixels; frames are downscaled for token efficiency
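To see the effect of maxWidth, here is one plausible way the downscale could be computed (our own sketch, not AvatarLayer's internals): width is clamped to maxWidth and height is scaled to preserve the aspect ratio.

```typescript
// Compute the capture size for a frame, clamping width to `maxWidth`
// while preserving aspect ratio. Hypothetical helper; AvatarLayer's
// actual resize logic may differ.
function captureSize(
  sourceWidth: number,
  sourceHeight: number,
  maxWidth: number,
): { width: number; height: number } {
  if (sourceWidth <= maxWidth) {
    return { width: sourceWidth, height: sourceHeight }; // never upscale
  }
  const scale = maxWidth / sourceWidth;
  return { width: maxWidth, height: Math.round(sourceHeight * scale) };
}

// A 1920x1080 frame with the default maxWidth of 512:
const size = captureSize(1920, 1080, 512);
// → { width: 512, height: 288 }
```

Smaller frames mean fewer image tokens per capture, which matters once vision workloads start sampling frames every few seconds.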

Vision workloads

Vision workloads run a vision-capable LLM against captured video frames on a periodic interval. The results are injected into the chat LLM's context, giving the avatar awareness of what's on screen.

import { AvatarSession, OpenAIAdapter, GeminiAdapter } from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  // ...other config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", "screen:ocr"],
    intervalMs: 3000,
    autoStart: false,
  },
});

VisionWorkloadsConfig

Field            Type                                    Default   Description
llm              LLMProvider                             required  Vision-capable LLM (can differ from the chat LLM)
workloads        (BuiltinWorkloadId | VisionWorkload)[]  required  Workloads to run
intervalMs       number                                  3000      Capture interval in milliseconds
captureQuality   number                                  0.5       Frame capture quality (0-1)
captureMaxWidth  number                                  512       Max frame width in pixels
autoStart        boolean                                 false     Auto-start when a video source becomes active
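Conceptually, the workload runner is a ticker: every intervalMs it captures a frame and runs each configured workload against it. A stripped-down sketch of that lifecycle (our own illustration; WorkloadTicker and runOnce are not AvatarLayer APIs):

```typescript
// Minimal periodic ticker illustrating the intervalMs / start / stop
// lifecycle. `runOnce` stands in for "capture a frame and run every
// configured workload"; it is not an AvatarLayer API.
class WorkloadTicker {
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly intervalMs: number;
  private readonly runOnce: () => void;

  constructor(intervalMs: number, runOnce: () => void) {
    this.intervalMs = intervalMs;
    this.runOnce = runOnce;
  }

  get running(): boolean {
    return this.timer !== null;
  }

  start(): void {
    if (this.timer !== null) return; // already running
    this.timer = setInterval(this.runOnce, this.intervalMs);
  }

  stop(): void {
    if (this.timer === null) return;
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

In the real API, startVisionWorkloads() and stopVisionWorkloads() play the role of start() and stop(), and autoStart ties start() to a video source becoming active.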

Built-in workloads

ID                    Description
screen:interpret      General interpretation of what's visible on screen
screen:understand     Deeper understanding of screen context and user intent
screen:ocr            Extract text visible on screen
screen:ui-automation  Identify UI elements and possible interactions

Custom workloads

const customWorkload = {
  id: "my-workload",
  label: "Custom Analysis",
  description: "Analyze the screen for specific patterns",
  prompt: "Describe any charts or data visualizations you see.",
};

const session = new AvatarSession({
  // ...config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", customWorkload],
  },
});
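From the example above, a custom workload is an object with id, label, description, and prompt fields. A sketch of that shape as a TypeScript interface, plus a runtime guard for validating user-supplied workloads (the guard is our own illustration, not part of AvatarLayer):

```typescript
// Shape of a custom workload as implied by the example above.
interface VisionWorkload {
  id: string;
  label: string;
  description: string;
  prompt: string;
}

// Hypothetical runtime guard, e.g. for validating workloads loaded
// from configuration before passing them to the session.
function isVisionWorkload(value: unknown): value is VisionWorkload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.label === "string" &&
    typeof v.description === "string" &&
    typeof v.prompt === "string"
  );
}
```

Built-in workload IDs are plain strings ("screen:ocr"), so the workloads array mixes strings and objects; a guard like this distinguishes the two at runtime.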

Controlling workloads

session.startVisionWorkloads();
session.stopVisionWorkloads();

Events

Event                    Payload             Description
vision-context           VisionContextEntry  A workload completed with new context
vision-workloads-change  boolean             Workload ticker started or stopped
vision-error             Error               A workload inference failed

session.on("vision-context", (entry) => {
  console.log(`[${entry.workloadLabel}]: ${entry.text}`);
});
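vision-context entries can also be collected into a small rolling buffer, e.g. for logging or for inspecting what the chat LLM currently "sees". One plausible sketch, keeping only the latest entry per workload (an assumption on our part; AvatarLayer's own context-injection strategy may differ):

```typescript
// Event payload shape, matching the VisionContextEntry fields documented here.
interface VisionContextEntry {
  workloadId: string;
  workloadLabel: string;
  text: string;
  capturedAt: number;
}

// Keep only the newest entry per workload and render them as lines.
// Illustrative strategy only, not AvatarLayer's actual behavior.
class VisionContextBuffer {
  private latest: Map<string, VisionContextEntry> = new Map();

  add(entry: VisionContextEntry): void {
    this.latest.set(entry.workloadId, entry); // overwrite stale context
  }

  render(): string {
    return Array.from(this.latest.values())
      .map((e) => `[${e.workloadLabel}] ${e.text}`)
      .join("\n");
  }
}
```

Wired up, this would be `session.on("vision-context", (entry) => buffer.add(entry))`, with render() called whenever a snapshot of the current screen context is needed.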

VisionContextEntry

Field          Type    Description
workloadId     string  Workload that produced this context
workloadLabel  string  Human-readable workload label
text           string  The LLM's analysis result
capturedAt     number  Timestamp of the frame capture