# Vision

Video input, image attachments, and periodic vision workloads.
AvatarLayer supports visual input through three mechanisms:

- Image attachments — attach images directly to messages
- Video input — capture frames from a live MediaStream
- Vision workloads — periodic background analysis of captured frames using a vision-capable LLM
## Image attachments
Attach images (data URLs or HTTPS URLs) to any message:
```ts
await session.sendMessage("What do you see in this image?", {
  images: ["data:image/png;base64,..."],
});
```

The `images` option accepts data URLs or HTTPS URLs. They are included as `ImageContentPart` entries in the message sent to the LLM.
## Video input
Start and stop video capture from a MediaStream:
```ts
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
session.startVideo(stream);

// Later:
session.stopVideo();
```

### VisionConfig
Control how frames are captured for vision:
```ts
const session = new AvatarSession({
  // ...other config
  vision: {
    imageFormat: "image/webp", // default
    imageQuality: 0.5, // default
    maxWidth: 512, // default
  },
});
```

| Field | Type | Default | Description |
|---|---|---|---|
| `imageFormat` | `string` | `"image/webp"` | MIME type for captured frames |
| `imageQuality` | `number` | `0.5` | Image quality (0-1) |
| `maxWidth` | `number` | `512` | Max width in pixels — frames are resized for token efficiency |
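The `maxWidth` cap implies that frames wider than the limit are downscaled. Assuming the resize preserves aspect ratio (not confirmed by this page), the math is simply:

```ts
// Scale a frame down to maxWidth, preserving aspect ratio.
// Frames already narrower than maxWidth are left untouched.
function fitToMaxWidth(width: number, height: number, maxWidth: number) {
  if (width <= maxWidth) return { width, height };
  const scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}

console.log(fitToMaxWidth(1920, 1080, 512)); // { width: 512, height: 288 }
```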
## Vision workloads
Vision workloads run a vision-capable LLM against captured video frames on a periodic interval. The results are injected into the chat LLM's context, giving the avatar awareness of what's on screen.
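Conceptually this is a timer loop: capture a frame, run each workload's prompt through the vision LLM, surface the result. A simplified sketch with hypothetical `captureFrame` and `analyze` stand-ins (not the SDK's actual internals):

```ts
// Simplified workload ticker: every intervalMs, capture one frame and
// run each workload's prompt against it.
type Workload = { id: string; prompt: string };

function startTicker(
  workloads: Workload[],
  intervalMs: number,
  captureFrame: () => Promise<string>, // hypothetical: frame as a data URL
  analyze: (prompt: string, frame: string) => Promise<string>, // hypothetical LLM call
  onContext: (workloadId: string, text: string) => void,
) {
  const timer = setInterval(async () => {
    const frame = await captureFrame();
    for (const w of workloads) {
      onContext(w.id, await analyze(w.prompt, frame));
    }
  }, intervalMs);
  return () => clearInterval(timer); // returns a stop function
}
```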
```ts
import { AvatarSession, OpenAIAdapter, GeminiAdapter } from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  // ...other config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", "screen:ocr"],
    intervalMs: 3000,
    autoStart: false,
  },
});
```

### VisionWorkloadsConfig

| Field | Type | Default | Description |
|---|---|---|---|
| `llm` | `LLMProvider` | required | Vision-capable LLM (can differ from the chat LLM) |
| `workloads` | `(BuiltinWorkloadId \| VisionWorkload)[]` | required | Workloads to run |
| `intervalMs` | `number` | `3000` | Capture interval in milliseconds |
| `captureQuality` | `number` | `0.5` | Frame capture quality (0-1) |
| `captureMaxWidth` | `number` | `512` | Max frame width in pixels |
| `autoStart` | `boolean` | `false` | Auto-start when a video source becomes active |
### Built-in workloads
| ID | Description |
|---|---|
| `screen:interpret` | General interpretation of what's visible on screen |
| `screen:understand` | Deeper understanding of screen context and user intent |
| `screen:ocr` | Extract text visible on screen |
| `screen:ui-automation` | Identify UI elements and possible interactions |
### Custom workloads

A custom workload supplies its own prompt:
```ts
const customWorkload = {
  id: "my-workload",
  label: "Custom Analysis",
  description: "Analyze the screen for specific patterns",
  prompt: "Describe any charts or data visualizations you see.",
};

const session = new AvatarSession({
  // ...config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", customWorkload],
  },
});
```

### Controlling workloads
```ts
session.startVisionWorkloads();
session.stopVisionWorkloads();
```

## Events
| Event | Payload | Description |
|---|---|---|
| `vision-context` | `VisionContextEntry` | A workload completed with new context |
| `vision-workloads-change` | `boolean` | Workload ticker started or stopped |
| `vision-error` | `Error` | A workload inference failed |
```ts
session.on("vision-context", (entry) => {
  console.log(`[${entry.workloadLabel}]: ${entry.text}`);
});
```

### VisionContextEntry
| Field | Type | Description |
|---|---|---|
| `workloadId` | `string` | Workload that produced this context |
| `workloadLabel` | `string` | Human-readable workload label |
| `text` | `string` | The LLM's analysis result |
| `capturedAt` | `number` | Timestamp of the frame capture |
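How the entries are consumed is up to the application. One common pattern is to keep only the most recent entry per workload, e.g. for a debug panel; a hypothetical helper (not part of the SDK):

```ts
// Keep only the latest VisionContextEntry per workload, using capturedAt
// to discard out-of-order results.
type VisionContextEntry = {
  workloadId: string;
  workloadLabel: string;
  text: string;
  capturedAt: number;
};

const latest = new Map<string, VisionContextEntry>();

function record(entry: VisionContextEntry) {
  const prev = latest.get(entry.workloadId);
  if (!prev || entry.capturedAt >= prev.capturedAt) {
    latest.set(entry.workloadId, entry);
  }
}

// Typically wired up as:
// session.on("vision-context", record);
```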