Building a Video Understanding Agent: Search, Citations & Reasoning

Hack Session

About the session

Video is the richest and least searchable format of information we produce. Feeding a two- to three-hour video directly to an LLM is impractical: the cost and latency alone make it a non-starter.

This talk covers building an agent-powered system that ingests long-form video and makes it as queryable as a long-form text document. We'll walk through the async ingestion pipeline, how a multimodal tool-calling agent orchestrates search and reasoning across clips, and how verifiable timestamp citations keep AI answers grounded in actual video evidence.
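To make the shape of such a system concrete, here is a minimal sketch of the three pieces named above: concurrent ingestion, a search tool an agent could call, and answers that carry timestamp citations. It is an illustration under stated assumptions, not the session's implementation. All names (`Clip`, `ingest`, `search_clips`, `answer_with_citations`) are hypothetical, keyword overlap stands in for real multimodal retrieval, and no LLM is called.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical unit of indexed video: a time span plus its transcript text."""
    start_s: int
    end_s: int
    text: str


def fmt_ts(seconds: int) -> str:
    """Format seconds as HH:MM:SS so a citation can be checked against the video."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


async def ingest_segment(start_s: int, end_s: int, text: str) -> Clip:
    """Stand-in for per-clip work (transcription, captioning, embedding)."""
    await asyncio.sleep(0)  # yield control, as real I/O-bound work would
    return Clip(start_s, end_s, text.lower())


async def ingest(segments: list[tuple[int, int, str]]) -> list[Clip]:
    """Process all segments concurrently; gather returns results in input order."""
    return await asyncio.gather(*(ingest_segment(*seg) for seg in segments))


def search_clips(index: list[Clip], query: str, top_k: int = 2) -> list[Clip]:
    """Toy search tool: rank clips by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(len(words & set(clip.text.split())), clip) for clip in index]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps video order on ties
    return [clip for _, clip in hits[:top_k]]


def answer_with_citations(index: list[Clip], query: str) -> list[str]:
    """Return evidence lines, each prefixed with the time span it came from."""
    return [
        f"[{fmt_ts(c.start_s)}-{fmt_ts(c.end_s)}] {c.text}"
        for c in search_clips(index, query)
    ]


if __name__ == "__main__":
    # Made-up transcript segments: (start seconds, end seconds, text).
    demo = [
        (0, 60, "Welcome and agenda overview"),
        (3600, 3660, "The ingestion pipeline runs asynchronously"),
    ]
    index = asyncio.run(ingest(demo))
    for line in answer_with_citations(index, "ingestion pipeline"):
        print(line)
```

In a fuller system, `search_clips` would be exposed to the agent as a tool and backed by embeddings over frames and audio. The part this sketch is meant to show is that every returned line keeps its time span, so each claim in an answer can be traced back to a moment in the video.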

Speaker
