AI Video Editor · Miika Kulmala

What it is

A self-hosted system that ingests full TV series or movies, indexes every scene with rich metadata, and lets an autonomous LLM agent assemble polished (debatable) 15–30 second TikTok/Reels-style edits from a plain-English instruction. The agent searches the indexed clip library, builds a two-track timeline, applies effects and overlays, and renders the result. It is built for a single operator running their own media library and GPU hardware.

How it works

Indexing pipeline: scene detection, keyframe extraction, LLM vision analysis, faster-whisper transcription, and clip pre-extraction populate a PostgreSQL catalog of scenes and clips.
Character identity: faces are detected and embedded as 768-dim CCIP vectors via ONNX Runtime, then clustered and matched against a per-series gallery. Each identity stays provisional until a human names it.
Editor agent: a ReAct-style orchestrator in the renderer drives ~56 LLM-visible tools (clip retrieval, timeline edits, effects, overlays, audio) and feeds extracted video frames back to a vision model so it can see its own edits.
Stack: TypeScript monorepo (pnpm workspaces) — Hono REST+WebSocket API, BullMQ/Redis job queue, Drizzle ORM on PostgreSQL 16, React 19 + Vite + Tailwind front end, and FFmpeg with CUDA/NVENC for compositing.
Deployment: seven Docker Compose services (API, renderer, indexer, Whisper, character-embedder, Postgres, Redis), wired to an OpenAI-compatible local LLM endpoint.
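
The gallery-matching step under "Character identity" can be sketched roughly as follows. This is a minimal illustration, not the project's code: the type names, the use of cosine similarity, and both threshold values are assumptions made for the example. It shows one way to keep auto-matching conservative, by accepting a match only when the best score is high and clearly ahead of the runner-up.

```typescript
// Hypothetical types and thresholds; none of these names come from the project.
type GalleryEntry = { characterId: string; embedding: number[] };

type MatchResult =
  | { status: "matched"; characterId: string; score: number }
  | { status: "provisional"; bestScore: number };

// Cosine similarity between two equal-length vectors (assumed metric).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

function matchAgainstGallery(
  query: number[],
  gallery: GalleryEntry[],
  minScore = 0.85, // assumed: minimum similarity to auto-accept
  minMargin = 0.05, // assumed: required lead over the second-best character
): MatchResult {
  // Keep only the best score seen for each character.
  const bestPerCharacter = new Map<string, number>();
  for (const entry of gallery) {
    const score = cosineSimilarity(query, entry.embedding);
    const previous = bestPerCharacter.get(entry.characterId) ?? -Infinity;
    if (score > previous) bestPerCharacter.set(entry.characterId, score);
  }

  const ranked = [...bestPerCharacter.entries()].sort((x, y) => y[1] - x[1]);
  if (ranked.length === 0) return { status: "provisional", bestScore: 0 };

  const [bestId, bestScore] = ranked[0];
  const runnerUp = ranked.length > 1 ? ranked[1][1] : -Infinity;

  // Accept only a confident, unambiguous match; otherwise leave it for a human.
  if (bestScore >= minScore && bestScore - runnerUp >= minMargin) {
    return { status: "matched", characterId: bestId, score: bestScore };
  }
  return { status: "provisional", bestScore };
}
```

In this sketch, a face that scores equally well against two characters stays provisional even if both scores are high, which is the "don't guess" behaviour described above. Real embeddings would be 768-dimensional; the function does not depend on the length.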

Why it's interesting

The agent closes the perception loop: rather than emitting edit commands blind, it renders preview frames and feeds them back to a vision model, so it can judge cuts and transitions the way a human reviewing the timeline would. Character recognition runs entirely on local ONNX models with deliberately conservative auto-matching, which keeps ambiguous identities provisional rather than guessing.
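
The perception loop might look something like the sketch below. Everything here is hypothetical: the `VisionModel` and `Editor` interfaces, the `runEditLoop` function, and the step budget are invented for illustration, and the calls are kept synchronous for brevity where a real system would await model and render requests. The point it illustrates is only the ordering: act, render frames, then let the model see those frames before choosing the next action.

```typescript
// All names are hypothetical; the project's real tool and model interfaces are not shown here.
type Frame = { timestampSec: number; pngBase64: string };

type AgentStep =
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "finish"; summary: string };

// What the model gets back after each action: tool output plus rendered frames.
type Observation = { text: string; frames: Frame[] };

interface VisionModel {
  // Chooses the next step from the instruction and everything observed so far.
  nextStep(instruction: string, history: Observation[]): AgentStep;
}

interface Editor {
  applyTool(name: string, args: Record<string, unknown>): string;
  renderPreviewFrames(): Frame[];
}

function runEditLoop(
  instruction: string,
  model: VisionModel,
  editor: Editor,
  maxSteps = 20, // assumed budget so a confused agent cannot loop forever
): { summary: string; steps: number } {
  const history: Observation[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const action = model.nextStep(instruction, history);
    if (action.kind === "finish") {
      return { summary: action.summary, steps: step };
    }
    // Act: apply the chosen edit to the timeline.
    const text = editor.applyTool(action.name, action.args);
    // Perceive: render frames of the edited timeline and hand them back,
    // so the next decision is based on what the edit actually looks like.
    const frames = editor.renderPreviewFrames();
    history.push({ text, frames });
  }
  return { summary: "step budget exhausted", steps: maxSteps };
}
```

Passing the model and editor in as interfaces keeps the loop testable with scripted fakes, without a GPU or an LLM endpoint.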

Status

Private hobby project, work in progress — runs end-to-end via Docker Compose.