Seyeon An Mar 19, 2021


DramaQA, a dataset proposed for the development of video story understanding AI, provides QA pairs annotated with cognition-based difficulty levels that serve as a hierarchical evaluation metric. It also provides coreference-resolved scripts and rich, character-centered visual metadata for the video.

“Stories are the communal currency of humanity,” according to Tahir Shah. Stories have existed as long as humans have, and they will not cease to exist until humans do. They are indispensable tools for conveying what we see, hear, feel, and know. Stories can be transmitted by word of mouth, but also in many other forms: novels, cartoons, plays, and films. Humans not only listen to stories, they also create them through these media. This is why the ability to understand stories is a crucial part of human intelligence that sets humans apart from other species. This distinction suggests that the capacity to understand stories as humans do could be a proper medium for developing human-level AI. Drama in particular, typically in the form of video, is a fitting medium, since it conveys a story through human senses such as sight, hearing, and action.

In this post, we introduce DramaQA, which has brought significant progress toward enabling computers to understand complicated stories such as those found in a drama. This dataset contributes to solving problems that computer vision and natural language processing have not been able to handle until now. The ability to understand stories, originally conceived as the communal currency of humanity, can become a currency of computers as well, if DramaQA is put to use in further research. We have open-sourced the full dataset, and we have also held challenges that encourage further AI development using our dataset.

A Quick Overview of the DramaQA Dataset

For this video story understanding research, the DramaQA dataset was collected from the popular Korean drama Another Miss Oh, which spans 18 episodes, 20.5 hours in total.

The figure below shows an overview of the DramaQA dataset.

An example from the DramaQA dataset, which contains video clips, scripts, and QA pairs with levels of difficulty

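To make the figure concrete, here is a minimal sketch of what a single annotated sample might look like once loaded into Python. The field names (vid, qa_level, visual_metadata, and so on) and the specific values are illustrative assumptions, not the dataset's exact schema; consult the released data for the real format.

```python
# Hypothetical illustration of one DramaQA sample; field names and values
# are assumptions for readability, not the actual released schema.
sample = {
    "vid": "AnotherMissOh_ep01_scene012",   # hypothetical clip identifier
    "question": "Why did Haeyoung1 smile at Dokyung?",
    "answers": [
        "Because she was happy to see him.",
        "Because she was embarrassed.",
        "Because she wanted to leave.",
        "Because she was angry.",
        "Because she lost her phone.",
    ],
    "correct_idx": 0,
    "qa_level": 3,                # difficulty level on the hierarchical scale
    "script": [                   # coreference-resolved dialogue lines
        {"speaker": "Haeyoung1",
         "utterance": "I (Haeyoung1) missed you (Dokyung)."},
    ],
    "visual_metadata": [          # character-centered annotations per frame
        {"frame": 101, "person": "Haeyoung1",
         "behavior": "smile", "emotion": "happiness"},
    ],
}

# Inspect the question together with its ground-truth answer.
print(sample["question"], "->", sample["answers"][sample["correct_idx"]])
```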

QA sets classified by levels of difficulty

The ability to understand stories and to answer questions about them varies with the stages of cognitive development. To collect question-answer pairs with levels of difficulty, we propose two criteria: Memory Capacity and Logical Complexity.

Memory Capacity: From a machine learning perspective, the longer the video is, and consequently the more data it contains, the harder it is to infer the answer from the video. The VideoQA problem considers two levels of memory capacity: shot and scene.

Logical Complexity: Complicated questions often require more steps of logical reasoning than simple questions; thus, the VideoQA set considers the logical complexity of the problem as another measure of difficulty. The DramaQA set defines four levels of logical complexity, from simple recall to high-level reasoning, similar to the hierarchical stages of human development.
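As a rough illustration of how the two criteria could combine into a single hierarchical difficulty level, the toy function below assigns a higher level to questions that require a longer memory span (scene rather than shot) or more reasoning steps. This is a sketch of the general idea only; the exact mapping used by DramaQA is defined in the paper and may differ.

```python
# Toy sketch (not the authors' exact rule): combine memory span and
# reasoning depth into one hierarchical difficulty level from 1 to 4.
def difficulty_level(memory: str, reasoning_steps: int) -> int:
    """memory: 'shot' or 'scene'; reasoning_steps: 1 (simple recall) and up."""
    if memory == "shot" and reasoning_steps == 1:
        return 1   # recall a single supporting fact within one shot
    if memory == "shot":
        return 2   # several supporting facts, still within one shot
    if reasoning_steps <= 2:
        return 3   # scene-level span with modest reasoning
    return 4       # scene-level span with high-level (e.g., causal) reasoning


print(difficulty_level("shot", 1))    # -> 1
print(difficulty_level("scene", 4))   # -> 4
```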