TemporalBench

Benchmarking Fine-grained Temporal Understanding for
Multimodal Video Models


1University of Wisconsin-Madison, 2Microsoft Research, Redmond, 3Ohio State University,
4University of California, San Diego, 5Northeastern University, 6University of California, Santa Cruz,
7Chinese University of Hong Kong, 8Illinois Institute of Technology, 9Georgia Institute of Technology

*Work done during an internship at Microsoft Research        Equal Advising

What is TemporalBench?


The tasks of TemporalBench. TemporalBench starts from fine-grained video descriptions and supports diverse video understanding tasks, including video QA, video captioning, and long video understanding. It differs from existing benchmarks in the average number of words per video (middle top), word density (center), and the coverage of various temporal aspects (middle bottom).




Introduction

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited for evaluating models' temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs, derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities such as action frequency, motion magnitude, and event order. Moreover, it enables evaluation across tasks (video question answering and captioning), across video lengths (short and long video understanding), and across model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, revealing a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall in multi-choice QA: LLMs can detect the subtle changes in negative captions and use the resulting centralized description as a cue for their predictions. We propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.

Leaderboard

Accuracy scores on TemporalBench are reported for short video and long video question answering, using both binary accuracy and multiple binary accuracy. We also report results on short video detailed captioning. Overall QA denotes the average performance between short and long video QA.

Short Video: [0,20] seconds          Long Video: [0,20] minutes

The leaderboard is sorted by multiple binary accuracy on short videos by default.

MBA = Multiple Binary Accuracy; BA = Binary Accuracy; Similarity = detailed captioning similarity.

| Model | Frames | Date | Overall QA MBA (%) | Overall QA BA (%) | Short Video QA MBA (%) | Short Video QA BA (%) | Long Video QA MBA (%) | Long Video QA BA (%) | Detailed Captioning Similarity |
|---|---|---|---|---|---|---|---|---|---|
| Human Performance | - | - | - | - | 67.9 | 89.7 | - | - | - |
| Random Chance | - | - | 9.5 | 50.0 | 9.5 | 50.0 | 9.5 | 50.0 | - |
| GPT-4o | 64 | 2024-06-15 | 35.3 | 73.2 | 38.0 | 76.0 | 32.7 | 70.5 | 63.5 |
| GPT-4o | 32 | 2024-06-15 | 32.9 | 71.5 | 38.3 | 75.9 | 27.4 | 67.0 | 63.2 |
| GPT-4o | 16 | 2024-06-15 | 34.3 | 72.8 | 38.5 | 75.7 | 30.1 | 69.8 | 61.3 |
| Gemini-1.5-Pro | 1 FPS | 2024-08-01 | 25.6 | 66.4 | 26.6 | 67.5 | 24.7 | 65.2 | 56.5 |
| Claude-3.5-Sonnet | 16 | 2024-07-30 | 23.2 | 64.1 | 23.5 | 65.9 | 22.9 | 62.4 | 54.1 |
| Claude-3.5-Sonnet | 8 | 2024-07-30 | 24.1 | 65.0 | 23.6 | 65.5 | 24.5 | 64.6 | 53.1 |
| LLaVA-Video-72B | 32 | 2024-09-30 | 33.7 | 72.4 | 37.7 | 75.9 | 29.6 | 68.8 | 54.8 |
| LLaVA-Video-7B | 32 | 2024-09-30 | 22.9 | 63.6 | 22.9 | 63.3 | 22.9 | 63.9 | 52.1 |
| Aria | 32 | 2024-10-10 | 25.0 | 65.9 | 26.6 | 68.4 | 23.5 | 63.5 | 51.5 |
| LongVU | 1 FPS | 2024-10-22 | 18.9 | 58.5 | 20.9 | 61.7 | 16.9 | 55.3 | 40.5 |
| Qwen2-VL-72B | 32 | 2024-06-15 | 31.7 | 70.2 | 38.3 | 75.8 | 25.0 | 64.5 | 56.1 |
| Qwen2-VL-72B | 8 | 2024-06-15 | 30.1 | 68.9 | 34.0 | 73.1 | 26.2 | 64.7 | 51.4 |
| Qwen2-VL-7B | 32 | 2024-06-15 | 21.7 | 62.0 | 24.7 | 64.4 | 18.8 | 59.7 | 51.9 |
| LLaVA-OneVision-72B | 32 | 2024-08-08 | 26.6 | 66.6 | 30.7 | 70.5 | 22.4 | 62.7 | 53.9 |
| LLaVA-OneVision-72B | 8 | 2024-08-08 | 28.1 | 67.8 | 33.0 | 72.1 | 23.1 | 63.6 | 55.0 |
| LLaVA-OneVision-7B | 32 | 2024-08-08 | 18.7 | 59.4 | 21.2 | 61.9 | 16.2 | 56.9 | 50.1 |
| LLaVA-NeXT-Video-34B | 32 | 2024-04-30 | 19.9 | 61.1 | 22.0 | 64.0 | 17.7 | 58.2 | 53.1 |
| LLaVA-NeXT-Video-7B | 8 | 2024-04-30 | 20.5 | 61.2 | 23.6 | 65.1 | 17.3 | 57.2 | 50.1 |
| InternLM-XC2.5 | 1 FPS | 2024-04-30 | 16.7 | 57.3 | 17.9 | 58.8 | 15.6 | 55.8 | 52.4 |
| VideoLLaVA | 8 | 2023-11-16 | 20.3 | 61.5 | 25.5 | 67.1 | 15.1 | 56.0 | 46.0 |
| MiniCPM-V2.6 | 1 FPS | 2024-08-12 | 20.4 | 61.3 | 21.4 | 62.3 | 19.3 | 60.3 | 47.2 |
| Phi-3.5-Vision | 2 | 2024-08-16 | 15.5 | 56.2 | 16.9 | 58.0 | 14.1 | 54.4 | 42.9 |
| MA-LMM | 4 | 2024-04-08 | 9.1 | 47.4 | 9.2 | 48.0 | 9.0 | 46.9 | 38.7 |
| M3 | 6 | 2024-05-27 | 13.3 | 54.7 | 14.8 | 56.4 | 11.8 | 53.1 | 47.8 |
| GPT-4o | 1 | 2024-06-15 | 26.4 | 67.3 | 28.4 | 70.0 | 24.5 | 64.7 | 52.3 |
| LLaVA-1.5-13B | 1 | 2023-10-05 | 13.7 | 55.1 | 13.1 | 55.7 | 14.2 | 54.5 | 47.9 |
| LLaVA-1.5-7B | 1 | 2023-10-05 | 15.3 | 56.8 | 18.3 | 60.5 | 12.3 | 53.2 | 45.7 |
| LLaVA-NeXT-34B | 1 | 2024-01-30 | 19.0 | 60.5 | 18.0 | 60.5 | 19.9 | 60.5 | 49.1 |
| Phi-3-Vision | 1 | 2024-05-19 | 15.4 | 55.2 | 15.1 | 54.4 | 15.6 | 56.0 | 42.0 |

Benchmark

Data Examples

All data are newly collected and annotated by humans, not from any existing video dataset.

Benchmark Curation

The Benchmark Is Built from Human Annotation


Overview of the annotation pipeline for TemporalBench. In step 1, we first collect high-quality captions for the videos from qualified AMT annotators and then refine them. In step 2, we leverage existing LLMs to generate negative captions by replacing selected words and reordering the sequence of actions, before filtering them ourselves.
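
To make step 2 concrete, here is a minimal sketch of a negative-caption generation prompt. It is our own illustration: the prompt wording, the function name `build_negative_prompt`, and the example caption are assumptions, not the authors' actual prompts or filtering procedure.

```python
# Hypothetical prompt template for step 2 (our own sketch; the authors' actual
# prompts and filtering criteria are not reproduced here).
NEGATIVE_CAPTION_PROMPT = """\
You are given a fine-grained video caption:

"{caption}"

Write {n} hard negative captions. Each negative must change exactly one
temporal detail (e.g., the action frequency, the motion direction or magnitude,
or the order of two events) while keeping everything else identical.
Return one caption per line."""


def build_negative_prompt(caption: str, n: int = 5) -> str:
    """Fill the template; the result would be sent to an LLM, and the generated
    negatives would then be manually filtered, as described above."""
    return NEGATIVE_CAPTION_PROMPT.format(caption=caption, n=n)


print(build_negative_prompt("The person lifts the cup twice and then puts it down."))
```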




TemporalBench Provides High-quality Negatives


Comparison of negative captions generated from the original captions versus from our detailed captions in TemporalBench. With fine-grained details, the negatives are more difficult and more temporally centric.

Experiment Results

Different Question Types


All current LMMs show a large gap relative to human performance. Visualization of binary accuracy for short video QA by (a) subset and (b) negative type. Human performance is far above that of GPT-4o, Qwen2-VL-72B, LLaVA-OneVision-72B, and Gemini-1.5-Pro.

More Frames Help, but Not Much

Model performance on TemporalBench with varying numbers of input frames for short video understanding. With more frames, LMMs mostly perform better, but the improvement is limited.
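
For reference, varying the frame budget is typically realized by sampling a fixed number of frames spread evenly across the clip. The sketch below is a generic illustration under that assumption; the exact sampling strategy differs per model and is not specified here.

```python
# Generic uniform frame sampling (our own illustration; per-model sampling
# strategies may differ).
def uniform_frame_indices(num_video_frames: int, num_sampled: int) -> list[int]:
    """Pick `num_sampled` frame indices spread evenly across the video."""
    if num_sampled >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / num_sampled
    # Take the midpoint of each of the `num_sampled` equal-length segments.
    return [int(step * i + step / 2) for i in range(num_sampled)]


# e.g., a 16-frame budget for a ~10 s clip at 24 FPS (240 frames):
print(uniform_frame_indices(240, 16))
```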

A Pitfall in Multi-choice Question Answering

While developing our benchmark, we noticed another previously overlooked but critical pitfall for multi-choice QA. Specifically, if every negative answer choice is generated by changing a small part of the correct answer, an LLM can detect those changes, identify the centralized description, and use it as a cue for its prediction. To study this, given a positive caption C and an associated negative caption N_1(C), we intentionally derive additional negatives from N_1(C) (instead of from C), yielding N_1(N_1(C)) and N_2(N_1(C)). With [C, N_1(C), N_1(N_1(C)), N_2(N_1(C))] as options, N_1(C) becomes the centralized description. Surprisingly, we find that 66.4% of text-only GPT-4o's predictions correspond to N_1(C), while only 6.4% correspond to C. Our findings also align with human behavior analysis from psychology (Furman et al., 2008), which shows that humans can achieve better-than-random-chance performance on multi-choice QAs using similar cues.
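
To illustrate this cue, the sketch below simply picks the option most similar to all the others using a generic string-similarity measure; the similarity metric and the example captions are our own assumptions, used only to show why N_1(C) stands out as the centralized description without any access to the video.

```python
# Minimal sketch of the "centralized description" cue (our illustration, not the
# benchmark's prompting setup): with options [C, N1(C), N1(N1(C)), N2(N1(C))],
# N1(C) is the option most similar to all the others, so a text-only strategy
# that picks the most "central" string recovers N1(C) without seeing the video.
from difflib import SequenceMatcher


def most_central_option(options: list[str]) -> int:
    """Return the index of the option with the highest total similarity to the rest."""
    def total_similarity(i: int) -> float:
        return sum(
            SequenceMatcher(None, options[i], options[j]).ratio()
            for j in range(len(options))
            if j != i
        )
    return max(range(len(options)), key=total_similarity)


# Hypothetical example options (illustrative, not taken from TemporalBench):
options = [
    "The person lifts the cup twice and then puts it down.",               # C
    "The person lifts the cup three times and then puts it down.",         # N1(C)
    "The person lifts the cup three times and then slowly puts it down.",  # N1(N1(C))
    "The person lifts the cup three times and then puts it on the table.", # N2(N1(C))
]
print(most_central_option(options))  # prints 1, i.e., N1(C) is the most central option
```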

Motivated by these findings, we propose to decompose a single multi-choice QA into multiple binary QAs. This eliminates the centralized option, since each query presents only two choices. As a result, given M negatives, the multiple binary QAs query a model M times, and the random-chance performance drops from 1/(M+1) to (1/2)^M. Since (1/2)^M < 1/(M+1) for every M > 1 (for example, with M = 3 the chance level falls from 25% to 12.5%), multiple binary QA is a more difficult task than multi-choice QA.
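
For clarity, the sketch below shows how the two reported metrics could be computed from per-item binary results; the data layout and function names are ours, not the official evaluation code.

```python
# Sketch of the two QA metrics (data layout and names are our own, not the
# official evaluation code). `results[item]` lists, for one positive caption,
# whether the model answered each of its M binary QAs correctly.

def binary_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of individual binary QAs (positive vs. one negative) answered correctly."""
    flat = [ok for per_item in results.values() for ok in per_item]
    return sum(flat) / len(flat)


def multiple_binary_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of items whose M binary QAs are *all* answered correctly."""
    return sum(all(per_item) for per_item in results.values()) / len(results)


# Toy example: item "v1" has M=3 negatives, item "v2" has M=2.
results = {"v1": [True, True, False], "v2": [True, True]}
print(binary_accuracy(results))           # 0.8
print(multiple_binary_accuracy(results))  # 0.5
```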

GPT-4o Fails to Distinguish Basic Temporal Dynamics

Citation


      @article{cai2024temporalbench,
        title={TemporalBench: Towards Fine-grained Temporal Understanding for Multimodal Video Models},
        author={Cai, Mu and Tan, Reuben and Zhang, Jianrui and Zou, Bocheng and Zhang, Kai and Yao, Feng and Zhu, Fangrui and Gu, Jing and Zhong, Yiwu and Shang, Yuzhang and Dou, Yao and Park, Jaden and Gao, Jianfeng and Lee, Yong Jae and Yang, Jianwei},
        journal={arXiv preprint arXiv:2410.10818},
        year={2024}
      }