LLMs are good at reasoning over text. They are bad at reasoning over raw video. The gap between “here is a video file” and “here is a structured description an LLM can plan against” is real engineering — frame extraction, deduplication, object detection, vision-model summarization, schema design — and most teams that want to build video-aware AI products end up either reinventing that pipeline or paying for a closed-source service that does it badly.
The Video Intelligence API is the pipeline I'd want to consume if I were building such a product. You hand it a video. It returns a structured, timestamped JSON document that a downstream LLM can reason about — what's in the video, when it happens, what changes scene-to-scene — without ever touching the original frames. It was built during the PowerSync AI Hackathon in March 2026.
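To make "structured, timestamped JSON" concrete, here is a minimal sketch of what such a document could look like. The post does not publish the real schema, so every class and field name below (`VideoDocument`, `Scene`, `Keyframe`, `Detection`, `timestamp_s`, and so on) is an assumption for illustration; only the kinds of content (timestamps, detections, scene-level and video-level summaries) come from the description of the pipeline.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Detection:
    label: str                                # object class name, e.g. "person"
    confidence: float                         # detector score in [0, 1]
    box: Tuple[float, float, float, float]    # x1, y1, x2, y2 in pixels


@dataclass
class Keyframe:
    timestamp_s: float                        # offset from the start of the video
    detections: List[Detection] = field(default_factory=list)
    description: Optional[str] = None         # None if no vision-model description exists


@dataclass
class Scene:
    start_s: float
    end_s: float
    summary: str
    keyframes: List[Keyframe] = field(default_factory=list)


@dataclass
class VideoDocument:
    video_id: str
    duration_s: float
    summary: str                              # video-level summary
    scenes: List[Scene] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses through nested dataclasses, lists and tuples.
        return json.dumps(asdict(self), indent=2)
```

A downstream LLM would receive only the output of `to_json()`, never the frames themselves, which is the point of the design.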
The core insight is that this is not an AI product. It's a smart preprocessing layer on top of existing models. The hard parts aren't model training; they're the parts the models alone don't solve — choosing which frames are worth analyzing, deduplicating near-identical frames, calling expensive vision models only when cheap ones don't have an answer, and stitching everything into a coherent timestamped output.
The pipeline has five stages. FFmpeg standardizes the input: variable frame rates, arbitrary resolutions, and dozens of codecs become a uniform 1 fps PNG sequence at a fixed resolution. OpenCV with PySceneDetect identifies scene boundaries and discards near-duplicate frames within scenes, typically reducing a 10-minute video to 60–250 keyframes. YOLOv8n runs as a cheap object-detection pass on every keyframe, producing a coarse object-and-bounding-box layer that serves both as final output and as a routing signal: frames containing objects of interest get escalated to the next stage. Gemini Flash runs as the vision-model stage, generating natural-language descriptions of the keyframes that survive the routing filter. A JSON stitcher assembles the final document with timestamps, object detections, scene-level summaries, and a video-level summary.
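The two filtering ideas in that pipeline, dropping near-duplicate frames and escalating only frames with objects of interest, can be sketched in a few lines. This is a simplified stand-in, not the project's code: the real pipeline uses OpenCV and PySceneDetect, while this version compares small grayscale thumbnails by mean absolute pixel difference. The function names, the thresholds, and the `(label, confidence)` detection format are all assumptions.

```python
from typing import FrozenSet, List, Sequence, Tuple

Frame = Sequence[int]  # flattened grayscale thumbnail, pixel values 0..255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: Sequence[Frame], threshold: float = 8.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    The threshold is a hypothetical tuning knob, not a value from the project.
    """
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or mean_abs_diff(frames[kept[-1]], frame) > threshold:
            kept.append(i)
    return kept


def should_escalate(
    detections: Sequence[Tuple[str, float]],
    labels_of_interest: FrozenSet[str],
    min_confidence: float = 0.5,
) -> bool:
    """Route a keyframe to the vision model only if the cheap detector
    found at least one confident object of interest."""
    return any(
        label in labels_of_interest and confidence >= min_confidence
        for label, confidence in detections
    )
```

Comparing each frame against the last kept frame, rather than the immediately preceding one, keeps a slow pan from slipping through as a long chain of individually small changes.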
The reason this stack makes sense is cost. A naive “send every frame to GPT-4o” pipeline on a 10-minute video would cost dollars per video and take minutes. Routing through YOLOv8n first (which runs locally and is essentially free) and using Gemini Flash (cheaper than GPT-4o by an order of magnitude) for the vision stage drops the cost to roughly $0.18 per 10-minute video, which is the difference between a viable API product and a vanity demo.
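The shape of that argument is simple arithmetic: vision-model spend scales with the number of frames sent multiplied by the per-frame price, and routing attacks both factors at once. The sketch below only encodes that relationship. It contains no real pricing; `naive_price_per_frame` and `routed_price_per_frame` are hypothetical inputs you would fill in from current provider price lists, and the function name is an assumption.

```python
def compare_vision_costs(
    total_frames: int,
    escalated_frames: int,
    naive_price_per_frame: float,
    routed_price_per_frame: float,
) -> dict:
    """Compare 'send every frame to the expensive model' against
    'send only escalated keyframes to the cheaper model'.

    Local stages (FFmpeg, scene detection, YOLOv8n) are treated as free,
    matching the post's 'essentially free' framing.
    """
    if escalated_frames > total_frames:
        raise ValueError("cannot escalate more frames than exist")
    naive = total_frames * naive_price_per_frame
    routed = escalated_frames * routed_price_per_frame
    return {
        "naive_cost": naive,
        "routed_cost": routed,
        # How many times cheaper the routed pipeline is (None if routed is free).
        "savings_factor": (naive / routed) if routed > 0 else None,
    }
```

For a 10-minute video sampled at 1 fps, `total_frames` is 600, and `escalated_frames` is at most the 60–250 keyframes that survive deduplication.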
┌──────────────────────────────────────────┐
│ Upload → FastAPI → Cloudflare R2 │
└──────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Celery worker pulls job from Redis │
└──────────────────────────────────────────┘
│
▼
FFmpeg → OpenCV/scenedetect → YOLOv8n → Gemini Flash → JSON stitcher
│
▼
┌──────────────────────────────────────────┐
│ Result stored, retrievable by video_id │
└──────────────────────────────────────────┘
The API surface is intentionally small. POST /v1/analyze accepts a video and returns a video_id immediately. GET /v1/status/{video_id} reports progress. GET /v1/result/{video_id} returns the structured JSON. There's also a synchronous endpoint for short videos under 60 seconds that skips the polling cycle. Everything else is async because video processing is long-running work: a 10-minute video legitimately takes a couple of minutes to process, and pretending otherwise leads to bad timeout behavior at every layer.
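From the client's side, the submit-then-poll flow could look roughly like the sketch below. The HTTP calls are passed in as plain callables so the example stays self-contained; in practice `get_status` and `get_result` would wrap GET /v1/status/{video_id} and GET /v1/result/{video_id}. The status payload shape and the `"done"` and `"failed"` state names are assumptions, since the post does not document the response format.

```python
import time
from typing import Callable


def wait_for_result(
    video_id: str,
    get_status: Callable[[str], dict],
    get_result: Callable[[str], dict],
    poll_interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll the status endpoint until the job finishes, then fetch the result."""
    deadline = clock() + timeout_s
    while True:
        state = get_status(video_id).get("state")  # hypothetical field name
        if state == "done":
            return get_result(video_id)
        if state == "failed":
            raise RuntimeError(f"processing failed for {video_id}")
        if clock() >= deadline:
            raise TimeoutError(f"gave up waiting for {video_id}")
        sleep(poll_interval_s)
```

Putting an explicit client-side deadline on the loop is the counterpart of the server being honest about processing time: neither side pretends the job is instantaneous.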
Storage is split. Incoming videos go to a temporary R2 bucket and auto-delete after processing: they're large, expensive to store, and not the product. Keyframes live in temporary storage only while a job is processing. The output JSON is permanent and retrievable, indexed by video_id.
The biggest tradeoff was scope discipline under hackathon time pressure. The original architecture document had a v2 list — live-stream analysis, audio analysis, multi-language support, fine-grained billing tiers — and shipping any of those instead of the core pipeline would have been the wrong call. The hackathon version ships the five-stage pipeline, the async job queue, basic API key auth, and nothing else.
A second tradeoff: model choice. Gemini Flash is cheaper than GPT-4o by a wide margin, but it's also less accurate on subtle scene description. For the hackathon I shipped Flash because the cost shape matters more than marginal accuracy when you're trying to demonstrate a viable API. A production version would expose model selection as a tier — Flash for the cheap tier, a stronger model for the premium tier — but that's product work, not hackathon work.
A third tradeoff worth noting: the API key auth is simple by design, not by accident. Real auth (JWT, per-tier rate limiting, billing integration) is a project of its own. The hackathon needed something that proved the API was protected and a paying customer could be identified, not a complete auth system.
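"Simple by design" can still be done carefully. The following is one plausible shape for a minimal key check, not the project's actual implementation: keys are stored as SHA-256 hashes and compared in constant time, and a successful match returns the customer identifier. The function names and the `key_hashes` mapping are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional


def hash_key(api_key: str) -> str:
    """Hash a key so the raw secret never needs to be stored."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def identify_customer(presented_key: str, key_hashes: Dict[str, str]) -> Optional[str]:
    """Return the customer id whose stored hash matches the presented key.

    key_hashes maps customer id -> hex SHA-256 of that customer's key.
    Every entry is checked with a constant-time comparison, and the loop
    does not exit early on a match.
    """
    presented = hash_key(presented_key)
    match: Optional[str] = None
    for customer_id, stored in key_hashes.items():
        if hmac.compare_digest(presented, stored):
            match = customer_id
    return match
```

That covers the two things the hackathon version needed to prove: a request without a valid key is rejected, and a request with one is attributable to a specific customer.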
The pipeline shipped, end-to-end, within the hackathon window. The submission included the full async job queue, the five-stage processing pipeline, the API surface, and a small demo client that exercised it. The cost numbers held: videos in the 5–10 minute range processed in a couple of minutes for under a quarter in API spend.
The architecture has held up well enough as a portfolio piece that I've kept the codebase active rather than archiving it. The interesting engineering — the routing-by-detection logic, the scene-boundary deduplication, the JSON schema design — is reusable for other LLM-input-preparation problems beyond video. There's a version of this idea that handles audio, and a version that handles long PDFs, that share most of the same skeleton.
The honest next step is product, not engineering. The pipeline works; what it lacks is the surrounding shape of an actual API service — billing, tiered model selection, a dashboard, sane rate limits. None of that is technically hard, but it's only worth building for a real audience, and a hackathon submission is not a real audience. If the project gets revived as a commercial thing, the work is mostly product surface, not core architecture.