What it is
A self-hosted system that ingests full TV series or movies, indexes every scene with rich metadata, and lets an autonomous LLM agent assemble polished (debatable) 15–30 second TikTok/Reels-style edits from a plain-English instruction. The agent searches the indexed clip library, builds a two-track timeline, applies effects and overlays, and renders the result. It is built for a single operator running their own media library and GPU hardware.
How it works
- Indexing pipeline: scene detection, keyframe extraction, LLM vision analysis, faster-whisper transcription, and clip pre-extraction populate a PostgreSQL catalog of scenes and clips.
- Character identity: faces are detected and embedded as 768-dim CCIP vectors via ONNX Runtime, then clustered and matched against a per-series gallery; each identity stays provisional until a human names it.
- Editor agent: a ReAct-style orchestrator in the renderer drives ~56 LLM-visible tools (clip retrieval, timeline edits, effects, overlays, audio) and feeds extracted video frames back to a vision model so the agent can see its own edits.
- Stack: TypeScript monorepo (pnpm workspaces) with a Hono REST+WebSocket API, BullMQ/Redis job queue, Drizzle ORM on PostgreSQL 16, React 19 + Vite + Tailwind front end, and FFmpeg with CUDA/NVENC for compositing.
- Deployment: seven Docker Compose services (API, renderer, indexer, Whisper, character-embedder, Postgres, Redis), wired to an OpenAI-compatible local LLM endpoint.
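To make the character-identity step concrete, here is a minimal sketch of conservative gallery matching. It is not the project's code: the type names, the `matchFace` helper, and both thresholds (`minScore`, `minMargin`) are assumptions made for illustration. The idea is that a face is auto-matched only when its best cosine similarity is high and clearly ahead of the next-best character; otherwise it stays provisional.

```typescript
// Hypothetical shapes and thresholds; none of these names come from the project.
type GalleryEntry = { characterId: string; embedding: number[] };

type MatchResult =
  | { status: "matched"; characterId: string; score: number }
  | { status: "provisional"; bestScore: number };

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

function matchFace(
  face: number[],
  gallery: GalleryEntry[],
  minScore = 0.85, // assumed: minimum similarity to accept a match
  minMargin = 0.1, // assumed: required lead over the next-best character
): MatchResult {
  // Keep only the best score per character, since a character may have
  // several reference embeddings in the gallery.
  const bestPerCharacter = new Map<string, number>();
  for (const entry of gallery) {
    const score = cosineSimilarity(face, entry.embedding);
    const previous = bestPerCharacter.get(entry.characterId) ?? -Infinity;
    if (score > previous) bestPerCharacter.set(entry.characterId, score);
  }

  // Find the top character and the runner-up among the other characters.
  let bestId: string | null = null;
  let bestScore = -Infinity;
  let runnerUp = -Infinity;
  for (const [id, score] of bestPerCharacter) {
    if (score > bestScore) {
      runnerUp = bestScore;
      bestScore = score;
      bestId = id;
    } else if (score > runnerUp) {
      runnerUp = score;
    }
  }

  if (bestId === null) return { status: "provisional", bestScore: 0 };
  if (bestScore >= minScore && bestScore - runnerUp >= minMargin) {
    return { status: "matched", characterId: bestId, score: bestScore };
  }
  return { status: "provisional", bestScore };
}
```

The margin check is what makes this conservative: two look-alike characters that both score highly leave the face provisional for a human to resolve, instead of picking whichever one is marginally ahead.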
Why it's interesting
The editing agent closes the perception loop: rather than emitting edit commands blind, it renders preview frames and feeds them back to a vision model, so it can judge cuts and transitions the way a human reviewing the timeline would. Character recognition runs entirely on local ONNX models with deliberately conservative auto-matching, keeping ambiguous identities provisional rather than guessing.
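The perception loop can be sketched roughly as follows. This is an illustration under assumptions, not the actual orchestrator: the `AgentDeps` interface, the message shapes, `runEditLoop`, and the step limit are all hypothetical, and the model, tools, and frame renderer are injected so the loop itself stays independent of any particular LLM client or FFmpeg wrapper.

```typescript
// Hypothetical shapes; the project's real tool and message formats are not shown here.
type Frame = { timestampSec: number; pngBase64: string };

type Message =
  | { role: "user" | "assistant"; text: string }
  | { role: "tool"; toolName: string; text: string; frames: Frame[] };

type ModelReply =
  | { kind: "tool_call"; toolName: string; args: Record<string, unknown> }
  | { kind: "done"; summary: string };

// Injected dependencies (all assumed): a vision-capable model call,
// a tool dispatcher, and a preview-frame renderer.
interface AgentDeps {
  callModel(history: Message[]): Promise<ModelReply>;
  runTool(name: string, args: Record<string, unknown>): Promise<string>;
  renderPreviewFrames(): Promise<Frame[]>;
}

async function runEditLoop(
  instruction: string,
  deps: AgentDeps,
  maxSteps = 20, // assumed safety limit so a confused agent cannot loop forever
): Promise<{ summary: string; history: Message[] }> {
  const history: Message[] = [{ role: "user", text: instruction }];

  for (let step = 0; step < maxSteps; step++) {
    // Reason: the model sees the instruction plus all earlier observations.
    const reply = await deps.callModel(history);
    if (reply.kind === "done") {
      return { summary: reply.summary, history };
    }

    // Act: record the chosen tool call, then execute it.
    history.push({ role: "assistant", text: `call ${reply.toolName}` });
    const observation = await deps.runTool(reply.toolName, reply.args);

    // Observe: attach freshly rendered frames to the tool result so the
    // next model call can look at the edit, not just read about it.
    const frames = await deps.renderPreviewFrames();
    history.push({ role: "tool", toolName: reply.toolName, text: observation, frames });
  }

  throw new Error(`agent did not finish within ${maxSteps} steps`);
}
```

Rendering frames after every tool call is the simplest version of the idea; a real agent would more plausibly render only after calls that change the timeline, since each render and vision pass costs GPU time.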
Status
Private hobby project, work in progress. It runs end-to-end via Docker Compose.