Research Engineer

Cloudglue - Video Understanding Infrastructure

Cloudglue is a Y Combinator-backed startup building developer APIs that turn video and audio into structured, searchable data. We handle the hard infrastructure - transcription, visual analysis, search, extraction - so developers can build on top of video without managing ML pipelines themselves.

We process millions of minutes of video for customers building search, analytics, and automation products. The research problems are real: how do you retrieve the right 10 seconds from 10,000 hours of video? How do you extract structured facts from noisy, multimodal content? How do you reason across visual and spoken information at scale?

Our team has shipped large-scale systems at Snapchat and Amazon, with work presented at NeurIPS, ICCV, CVPR, KubeCon, and DEF CON. We’re a small, technical team where researchers ship code and engineers read papers.

The Role

We’re looking for a research engineer to work on the core multimodal retrieval and video reasoning systems that power Cloudglue. This is a 50/50 research and engineering role - you’ll design novel approaches to hard retrieval and understanding problems, and you’ll ship them into production where real customers depend on them.

You’ll work across:

Multimodal retrieval - finding relevant moments across visual, audio, and text signals in large video collections
Structured extraction - pulling entities, facts, and relationships from video content
Video reasoning - understanding temporal, causal, and semantic relationships across long-form content
Evaluation and benchmarking - designing metrics and datasets to measure real-world system quality

This is not a pure research role. You’ll be expected to take ideas from paper to prototype to production. But it’s also not a pure engineering role - we need someone with genuine research depth who can identify the right problems to work on and design novel solutions.

What You’ll Do

Multimodal retrieval: Design and improve retrieval systems that search across video, audio, and text - including embedding models, re-ranking, and hierarchical search strategies.
Video understanding: Build systems that extract structured information from video - temporal segmentation, entity extraction, scene understanding, and content summarization.
Model fine-tuning & integration: Fine-tune and adapt vision and language models for production use cases, using parameter-efficient methods such as LoRA as well as full fine-tuning. Evaluate open-source and proprietary models and orchestrate them in serving pipelines.
Experiment and ship: Run experiments, analyze results rigorously, and turn successful research into production systems that handle real-world video at scale.
Collaborate: Work directly with founders and infrastructure engineers. Short feedback loops, no layers of process.

What We’re Looking For

Required