Vision

Video input, image attachments, and periodic vision workloads.

AvatarLayer supports visual input through three mechanisms:

  • Image attachments — Attach images directly to messages
  • Video input — Capture frames from a live MediaStream
  • Vision workloads — Periodic background analysis of captured frames using a vision-capable LLM

Image attachments

Attach images (data URLs or HTTPS URLs) to any message:

await session.sendMessage("What do you see in this image?", {
  images: ["data:image/png;base64,..."],
});

Each entry in the images array is included as an ImageContentPart in the message sent to the LLM.
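If an image exists as raw bytes rather than a URL, it can be encoded into a data URL before attaching. A minimal sketch (the toDataUrl helper below is our own, not part of the AvatarLayer API):

```typescript
// Encode raw image bytes as a data URL suitable for the `images` option.
// `toDataUrl` is a hypothetical helper, not an AvatarLayer export.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Example: encode a tiny byte payload (the PNG magic-number prefix).
const url = toDataUrl(new Uint8Array([0x89, 0x50, 0x4e, 0x47]), "image/png");
// → "data:image/png;base64,iVBORw=="
```

The resulting string can be passed directly in the images array of sendMessage.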

Video input

Start and stop video capture from a MediaStream:

const stream = await navigator.mediaDevices.getUserMedia({ video: true });
session.startVideo(stream);

// Later, when capture is no longer needed:
session.stopVideo();

VisionConfig

Control how frames are captured for vision:

const session = new AvatarSession({
  // ...other config
  vision: {
    imageFormat: "image/webp", // default
    imageQuality: 0.5,         // default
    maxWidth: 512,             // default
  },
});

Field         Type     Default         Description
imageFormat   string   "image/webp"    MIME type for captured frames
imageQuality  number   0.5             Image quality (0-1)
maxWidth      number   512             Max width in pixels; frames are downscaled for token efficiency
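To see the effect of maxWidth, here is one plausible way the downscale could be computed (our own sketch, not AvatarLayer's internals): width is clamped to maxWidth and height is scaled to preserve the aspect ratio.

```typescript
// Compute the capture size for a frame, clamping width to `maxWidth`
// while preserving aspect ratio. Hypothetical helper; AvatarLayer's
// actual resize logic may differ.
function captureSize(
  sourceWidth: number,
  sourceHeight: number,
  maxWidth: number,
): { width: number; height: number } {
  if (sourceWidth <= maxWidth) {
    return { width: sourceWidth, height: sourceHeight }; // never upscale
  }
  const scale = maxWidth / sourceWidth;
  return { width: maxWidth, height: Math.round(sourceHeight * scale) };
}

// A 1920x1080 frame with the default maxWidth of 512:
const size = captureSize(1920, 1080, 512);
// → { width: 512, height: 288 }
```

Smaller frames mean fewer image tokens per capture, which matters once vision workloads start sampling frames every few seconds.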

Vision workloads

Vision workloads run a vision-capable LLM against captured video frames on a periodic interval. The results are injected into the chat LLM's context, giving the avatar awareness of what's on screen.

import { AvatarSession, OpenAIAdapter, GeminiAdapter } from "avatarlayer";

const session = new AvatarSession({
  llm: new OpenAIAdapter({ apiKey: "sk-..." }),
  // ...other config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", "screen:ocr"],
    intervalMs: 3000,
    autoStart: false,
  },
});

VisionWorkloadsConfig

Field            Type                                    Default   Description
llm              LLMProvider                             required  Vision-capable LLM (can differ from the chat LLM)
workloads        (BuiltinWorkloadId | VisionWorkload)[]  required  Workloads to run
intervalMs       number                                  3000      Capture interval in milliseconds
captureQuality   number                                  0.5       Frame capture quality (0-1)
captureMaxWidth  number                                  512       Max frame width in pixels
autoStart        boolean                                 false     Auto-start when a video source becomes active
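Conceptually, the workload runner is a ticker: every intervalMs it captures a frame and runs each configured workload against it. A stripped-down sketch of that lifecycle (our own illustration; WorkloadTicker and runOnce are not AvatarLayer APIs):

```typescript
// Minimal periodic ticker illustrating the intervalMs / start / stop
// lifecycle. `runOnce` stands in for "capture a frame and run every
// configured workload"; it is not an AvatarLayer API.
class WorkloadTicker {
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly intervalMs: number;
  private readonly runOnce: () => void;

  constructor(intervalMs: number, runOnce: () => void) {
    this.intervalMs = intervalMs;
    this.runOnce = runOnce;
  }

  get running(): boolean {
    return this.timer !== null;
  }

  start(): void {
    if (this.timer !== null) return; // already running
    this.timer = setInterval(this.runOnce, this.intervalMs);
  }

  stop(): void {
    if (this.timer === null) return;
    clearInterval(this.timer);
    this.timer = null;
  }
}
```

In the real API, startVisionWorkloads() and stopVisionWorkloads() play the role of start() and stop(), and autoStart ties start() to a video source becoming active.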

Built-in workloads

ID                    Description
screen:interpret      General interpretation of what's visible on screen
screen:understand     Deeper understanding of screen context and user intent
screen:ocr            Extract text visible on screen
screen:ui-automation  Identify UI elements and possible interactions

Custom workloads

const customWorkload = {
  id: "my-workload",
  label: "Custom Analysis",
  description: "Analyze the screen for specific patterns",
  prompt: "Describe any charts or data visualizations you see.",
};

const session = new AvatarSession({
  // ...config
  visionWorkloads: {
    llm: new GeminiAdapter({ apiKey: "..." }),
    workloads: ["screen:interpret", customWorkload],
  },
});
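From the example above, a custom workload is an object with id, label, description, and prompt fields. A sketch of that shape as a TypeScript interface, plus a runtime guard for validating user-supplied workloads (the guard is our own illustration, not part of AvatarLayer):

```typescript
// Shape of a custom workload as implied by the example above.
interface VisionWorkload {
  id: string;
  label: string;
  description: string;
  prompt: string;
}

// Hypothetical runtime guard, e.g. for validating workloads loaded
// from configuration before passing them to the session.
function isVisionWorkload(value: unknown): value is VisionWorkload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.label === "string" &&
    typeof v.description === "string" &&
    typeof v.prompt === "string"
  );
}
```

Built-in workload IDs are plain strings ("screen:ocr"), so the workloads array mixes strings and objects; a guard like this distinguishes the two at runtime.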

Controlling workloads

session.startVisionWorkloads();
session.stopVisionWorkloads();

Events

Event                    Payload             Description
vision-context           VisionContextEntry  A workload completed with new context
vision-workloads-change  boolean             Workload ticker started or stopped
vision-error             Error               A workload inference failed

session.on("vision-context", (entry) => {
  console.log(`[${entry.workloadLabel}]: ${entry.text}`);
});
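vision-context entries can also be collected into a small rolling buffer, e.g. for logging or for inspecting what the chat LLM currently "sees". One plausible sketch, keeping only the latest entry per workload (an assumption on our part; AvatarLayer's own context-injection strategy may differ):

```typescript
// Event payload shape, matching the VisionContextEntry fields documented here.
interface VisionContextEntry {
  workloadId: string;
  workloadLabel: string;
  text: string;
  capturedAt: number;
}

// Keep only the newest entry per workload and render them as lines.
// Illustrative strategy only, not AvatarLayer's actual behavior.
class VisionContextBuffer {
  private latest: Map<string, VisionContextEntry> = new Map();

  add(entry: VisionContextEntry): void {
    this.latest.set(entry.workloadId, entry); // overwrite stale context
  }

  render(): string {
    return Array.from(this.latest.values())
      .map((e) => `[${e.workloadLabel}] ${e.text}`)
      .join("\n");
  }
}
```

Wired up, this would be `session.on("vision-context", (entry) => buffer.add(entry))`, with render() called whenever a snapshot of the current screen context is needed.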

VisionContextEntry

Field          Type    Description
workloadId     string  Workload that produced this context
workloadLabel  string  Human-readable workload label
text           string  The LLM's analysis result
capturedAt     number  Timestamp of the frame capture