To motivate our work, consider the following scenarios:
Scenario 1 - When watching sports such as diving or gymnastics in the Olympics, one might wonder: how do judges score the performances? To a casual observer, many of the actions may appear quite similar, making it difficult to distinguish the best from the good.
Scenario 2 - In medical training, precision is critical. For instance, medical students practicing surgery are evaluated by expert professors who assess their technique step-by-step. This raises a key question — how does an expert determine these grades in a consistent and informative manner?

A representative example of diving.

A representative example of gymnastics.

A representative example of suturing — a common step done by medical professionals.
Across these examples, a few common elements emerge: a performer executing a precise sequence of actions (the diver, gymnast, or medical student), an expert who evaluates these actions (the judge or professor), and, often, feedback that enables improvement.
Experts typically rely on a set of evaluation rubrics — structured guidelines that define what constitutes correct or high-quality performance. For instance, in diving, rubrics may specify the expected body position, number of twists, or entry angle. Similarly, in medical skill assessment, rubrics may define criteria for hand positioning or needle angle. Since a single expert’s score can carry uncertainty, it is common practice to involve multiple evaluators while judging an action.
Automated action quality assessment (AQA) with deep learning is already an active area of research, yet existing methods largely ignore these rubrics. Our central research question is therefore:
Can we model and predict performance scores using deep learning methods that explicitly leverage these predefined rubrics?

Motivation for our work, RICA2: we incorporate rubrics when assessing the quality of an action, and we calibrate the model's uncertainty, i.e., "our model knows when it doesn't know," reporting low confidence when its rating is unreliable.
Just as a single human expert's score carries uncertainty, machine learning models can make overconfident predictions, assigning high confidence to incorrect assessments. We therefore aim to calibrate our model's uncertainty, ensuring that it can recognize when its predictions are unreliable.
To illustrate our approach, consider a simplified five-step diving sequence, annotated as follows:
Step number - [1, 2, 3, 4, 5]
Step action information - [Arm forward, 1 Twist, 1 Twist, 2 Somersault pike, Entry]
That is, each step corresponds to a specific action segment within the video. (In cases where step annotations are unavailable, we demonstrate an augmentation approach in Supplement C.4.)
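To make the annotation concrete, here is a minimal sketch of how such step-wise labels could be stored. The field names and frame ranges are purely illustrative assumptions, not the actual dataset schema:

```python
# Hypothetical representation of the five-step diving annotation above.
# "segment" gives an assumed (start_frame, end_frame) range in the video.
dive_steps = [
    {"step": 1, "action": "Arm forward",       "segment": (0, 12)},
    {"step": 2, "action": "1 Twist",           "segment": (12, 30)},
    {"step": 3, "action": "1 Twist",           "segment": (30, 48)},
    {"step": 4, "action": "2 Somersault pike", "segment": (48, 80)},
    {"step": 5, "action": "Entry",             "segment": (80, 95)},
]

def actions(steps):
    """Return the ordered list of step-action labels."""
    return [s["action"] for s in steps]
```

Each entry ties one rubric step to the portion of the video it covers, which is exactly the pairing the model needs.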
Now, the key question becomes: How can we integrate these step-wise actions with the corresponding scoring rubrics?
To achieve this, we employ a Graph Neural Network (GNN) structured as a Directed Acyclic Graph (DAG). Conceptually, the graph is directed, since information flows in a known order from one step to the next as specified by the rubrics, and acyclic, since this flow progresses only forward without loops (see Sec. 3.1 for details).
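The directed-acyclic flow can be sketched with a toy forward pass: each step aggregates messages from its predecessors and never sends information backward. This is an illustration of the DAG idea only, assuming scalar per-step features, and is not the actual RICA2 layer:

```python
# Toy message passing over a step DAG: each node's value becomes its own
# feature plus the mean of its parents' (already updated) values.
def propagate(node_feats, edges):
    """node_feats: dict step -> scalar feature (toy 1-D embedding).
    edges: list of (src, dst) pairs; src must precede dst (acyclic).
    Returns dict step -> updated value, visiting steps in order.
    """
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    updated = {}
    for step in sorted(node_feats):  # step order is a valid topological order
        msgs = [updated[p] for p in parents.get(step, [])]
        incoming = sum(msgs) / len(msgs) if msgs else 0.0
        updated[step] = node_feats[step] + incoming
    return updated

# A chain matching the five diving steps: 1 -> 2 -> 3 -> 4 -> 5.
chain_edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
```

Because updates proceed in step order and edges only point forward, each pass visits every node exactly once, which is what makes DAG-structured propagation well defined.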