0. About AI Student

AI Student is an LLM-based evaluator that watches an educational video from a learner's point of view and reports how well the video would have taught them. It is designed to stand in for a human reviewer: given a video and a description of a target learner (the persona), AI Student simulates watching the video with that persona's background knowledge, pacing preference, and learning style, then scores the video across four independent axes.

What AI Student evaluates

AI Student evaluates short-to-medium educational videos (typically 3–15 minutes) on technical or academic topics. Each submission consists of the video itself, its title, and a target learner persona.

Before scoring, AI Student extracts an internal content_map: an ordered list of the teaching units it detected in the video (each unit: topic, time span, claimed concept). The content_map is used both as the reference for coverage checks (S3 Content Completeness) and to define the "slide / beat" counting unit used by the B-scale metrics.
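The content_map described above can be pictured as a simple ordered structure. A minimal sketch in Python — the unit fields (topic, time span, claimed concept) come from this rubric, but the exact field names are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class TeachingUnit:
    topic: str            # what this unit teaches
    start_s: float        # time span start, in seconds (field names assumed)
    end_s: float          # time span end, in seconds
    claimed_concept: str  # the concept the unit claims to convey

# A content_map is an ordered list of the teaching units detected in the video.
content_map: list[TeachingUnit] = [
    TeachingUnit("binary search intuition", 0.0, 95.0, "halving the search space"),
    TeachingUnit("worked example", 95.0, 210.0, "maintaining the loop invariant"),
]

# The "slide / beat" counting unit used by the B-scale metrics is then just
# the number of detected units.
beat_count = len(content_map)
```

Under this sketch, coverage checks (S3) compare the units in content_map against what the title promises, and beat_count anchors any per-beat metric.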

How AI Student works internally

AI Student is a three-agent pipeline. All three agents are LLM instances (currently gemini-3-flash), differing only in their prompts and the inputs they receive.

[Figure: the three-agent pipeline]

Agent 1 — Perceiving LLM. Inputs: the raw video, the title, and the persona. Output: a Content Map (the ordered list of teaching units referenced throughout this rubric — see S3, and D6's "slide/beat" unit) plus a Multi-modal Audit of presentation, visual alignment, and accessibility. Role: this agent does not score. It extracts what is in the video so the scoring agents can work from a shared, structured observation. → Produces the agent1_content_analyst block in the output JSON.

Agent 2 — Grading LLM. Inputs: Agent 1's Content Map & Audit only (no access to raw video). Output: Accuracy and Logic scores, with metric-level ratings and evidence. Role: because Agent 2 operates on Agent 1's structured observations, its scoring is reproducible and auditable against a fixed artifact. If a contestant contests a score, the Content Map is the shared ground truth. → Produces the agent2_gap_analysis_judge block.

Agent 3 — Persona Judging LLM. Inputs: the raw video, the title, the persona, and Agent 1's Content Map & Audit. Output: Adaptability and Engagement scores. Role: these two dimensions are irreducibly subjective — "does the pacing feel right for this persona?", "is the voice energizing?" — so Agent 3 re-watches the video itself while cross-referencing Agent 1's structured extraction. → Produces the subjective_evaluation block.
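Putting the three agents together, the evaluator's output JSON contains one block per agent. A hedged sketch of the overall shape — the three top-level block names come from this rubric, while every nested field is an illustrative assumption:

```python
import json

# Skeleton of the output JSON. Top-level block names are from the rubric;
# all nested fields are assumed for illustration only.
report = {
    "agent1_content_analyst": {      # Agent 1: Content Map + Multi-modal Audit
        "content_map": [
            {"topic": "intro", "span": [0.0, 40.0], "claimed_concept": "motivation"},
        ],
        "multimodal_audit": {
            "presentation": "...",
            "visual_alignment": "...",
            "accessibility": "...",
        },
    },
    "agent2_gap_analysis_judge": {   # Agent 2: Accuracy and Logic, scored
        "accuracy": 4.5,             # from Agent 1's artifact only
        "logic": 4.0,
        "evidence": ["..."],
    },
    "subjective_evaluation": {       # Agent 3: Adaptability and Engagement,
        "adaptability": 3.5,         # scored while re-watching the video
        "engagement": 4.0,
    },
}

serialized = json.dumps(report, indent=2)
```

Because Agent 2 sees only the agent1_content_analyst block and never the raw video, its two scores can be audited against that fixed artifact.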

Design implication for contestants. Because Agent 2 grades from Agent 1's Content Map alone, material that Agent 1 cannot extract as a distinct teaching unit is effectively invisible to the Accuracy and Logic scores: make each unit's topic and claimed concept explicit in the video itself.

The four dimensions, at a glance

Dimension    | The question it answers
------------ | ----------------------------------------------------------------
Accuracy     | Is the content correct, complete, and faithful to the title?
Logic        | Does it build coherently, without unjustified jumps or overload?
Adaptability | Is it matched to the given learner persona?
Engagement   | Does it hold attention without being hollow spectacle?

Each dimension is scored independently in [0.0, 5.0] and reported as a separate number. This rubric never collapses the four into a single overall grade — any aggregated leaderboard score is defined by the competition rules, not here.
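Since each dimension is reported as an independent number in [0.0, 5.0], a consumer of the report only needs a per-dimension bounds check, never an aggregation step. A minimal sketch, with the function name and dict shape assumed:

```python
# The four rubric dimensions, each scored independently in [0.0, 5.0].
DIMENSIONS = ("accuracy", "logic", "adaptability", "engagement")

def validate_scores(scores: dict[str, float]) -> dict[str, float]:
    """Check that exactly the four dimensions are present and each lies in
    [0.0, 5.0]. No overall grade is computed here: aggregation, if any,
    is defined by the competition rules, not by this rubric."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError(f"expected exactly {DIMENSIONS}, got {sorted(scores)}")
    for dim, value in scores.items():
        if not 0.0 <= value <= 5.0:
            raise ValueError(f"{dim} score {value} outside [0.0, 5.0]")
    return scores

validated = validate_scores(
    {"accuracy": 4.5, "logic": 4.0, "adaptability": 3.5, "engagement": 4.0}
)
```

Keeping the four numbers separate all the way to the consumer is what lets a leaderboard define its own weighting without touching this rubric.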