TemporalBench

Benchmarking Fine-grained Temporal Understanding for
Multimodal Video Models


1University of Wisconsin-Madison, 2Microsoft Research, Redmond, 3Ohio State University,
4University of California, San Diego, 5Northeastern University, 6University of California, Santa Cruz,
7Chinese University of Hong Kong, 8Illinois Institute of Technology, 9Georgia Institute of Technology

*Work done during an internship at Microsoft Research        Equal Advising

What is TemporalBench?


The tasks of TemporalBench. TemporalBench starts from fine-grained video descriptions and supports diverse video understanding tasks, including video QA, video captioning, and long video understanding. It differs from existing benchmarks in the average number of words per video (middle top), word density (center), and coverage of various temporal aspects (middle bottom).




Introduction

Understanding fine-grained temporal dynamics is crucial for multimodal video comprehension and generation. Due to the lack of fine-grained temporal annotations, existing video benchmarks mostly resemble static image benchmarks and are ill-suited for evaluating temporal understanding. In this paper, we introduce TemporalBench, a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. TemporalBench consists of ~10K video question-answer pairs derived from ~2K high-quality human annotations detailing the temporal dynamics in video clips. As a result, our benchmark provides a unique testbed for evaluating various temporal understanding and reasoning abilities, such as action frequency, motion magnitude, and event order. Moreover, it enables evaluation across tasks (video question answering and captioning, short and long video understanding) and across model types (multimodal video embedding models and text generation models). Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench, demonstrating a significant gap (~30%) between humans and AI in temporal understanding. Furthermore, we identify a critical pitfall of multi-choice QA: when every negative option is a small edit of the correct answer, LLMs can detect these subtle changes and use the resulting centralized description as a cue for their predictions. We propose Multiple Binary Accuracy (MBA) to correct this bias. We hope that TemporalBench can foster research on improving models' temporal reasoning capabilities. Both the dataset and evaluation code will be made available.

Leaderboard

Accuracy scores on TemporalBench are reported for short and long video question answering, measured by both binary accuracy and multiple binary accuracy. We also report results on short video captioning. Overall QA denotes the average performance over short and long video QA (see the short sanity-check sketch after the table).

Short Video: [0,20] seconds          Long Video: [0,20] minutes

The leaderboard is sorted by multiple binary accuracy on short videos by default. To view other sorted results, please click on the corresponding cell.

MBA = Multiple Binary Accuracy, BA = Binary Accuracy.

| Model | Frames | Date | Overall QA MBA (%) | Overall QA BA (%) | Short Video QA MBA (%) | Short Video QA BA (%) | Long Video QA MBA (%) | Long Video QA BA (%) | Detailed Captioning (Similarity) |
|---|---|---|---|---|---|---|---|---|---|
| Human Performance | - | - | - | - | 67.9 | 89.7 | - | - | - |
| Random Chance | - | - | 9.5 | 50.0 | 9.5 | 50.0 | 9.5 | 50.0 | - |
| GPT-4o | 64 | 2024-06-15 | 35.3 | 73.2 | 38.0 | 76.0 | 32.7 | 70.5 | 63.5 |
| GPT-4o | 32 | 2024-06-15 | 32.9 | 71.5 | 38.3 | 75.9 | 27.4 | 67.0 | 63.2 |
| GPT-4o | 16 | 2024-06-15 | 34.3 | 72.8 | 38.5 | 75.7 | 30.1 | 69.8 | 61.3 |
| Gemini-1.5-Pro | 1 FPS | 2024-08-01 | 25.6 | 66.4 | 26.6 | 67.5 | 24.7 | 65.2 | 56.5 |
| Claude-3.5-Sonnet | 16 | 2024-07-30 | 23.2 | 64.1 | 23.5 | 65.9 | 22.9 | 62.4 | 54.1 |
| Claude-3.5-Sonnet | 8 | 2024-07-30 | 24.1 | 65.0 | 23.6 | 65.5 | 24.5 | 64.6 | 53.1 |
| Qwen2-VL-72B | 32 | 2024-06-15 | 31.7 | 70.2 | 38.3 | 75.8 | 25.0 | 64.5 | 56.1 |
| Qwen2-VL-72B | 8 | 2024-06-15 | 30.1 | 68.9 | 34.0 | 73.1 | 26.2 | 64.7 | 51.4 |
| Qwen2-VL-7B | 32 | 2024-06-15 | 21.7 | 62.0 | 24.7 | 64.4 | 18.8 | 59.7 | 51.9 |
| LLaVA-OneVision-72B | 32 | 2024-08-08 | 26.6 | 66.6 | 30.7 | 70.5 | 22.4 | 62.7 | 53.9 |
| LLaVA-OneVision-72B | 8 | 2024-08-08 | 28.1 | 67.8 | 33.0 | 72.1 | 23.1 | 63.6 | 55.0 |
| LLaVA-OneVision-7B | 32 | 2024-08-08 | 18.7 | 59.4 | 21.2 | 61.9 | 16.2 | 56.9 | 50.1 |
| LLaVA-NeXT-Video-34B | 32 | 2024-04-30 | 19.9 | 61.1 | 22.0 | 64.0 | 17.7 | 58.2 | 53.1 |
| LLaVA-NeXT-Video-7B | 8 | 2024-04-30 | 20.5 | 61.2 | 23.6 | 65.1 | 17.3 | 57.2 | 50.1 |
| InternLM-XC2.5 | 1 FPS | 2024-04-30 | 16.7 | 57.3 | 17.9 | 58.8 | 15.6 | 55.8 | 52.4 |
| VideoLLaVA | 8 | 2023-11-16 | 20.3 | 61.5 | 25.5 | 67.1 | 15.1 | 56.0 | 46.0 |
| MiniCPM-V2.6 | 1 FPS | 2024-08-12 | 20.4 | 61.3 | 21.4 | 62.3 | 19.3 | 60.3 | 47.2 |
| Phi-3.5-Vision | 2 | 2024-08-16 | 15.5 | 56.2 | 16.9 | 58.0 | 14.1 | 54.4 | 42.9 |
| MA-LMM | 4 | 2024-04-08 | 9.1 | 47.4 | 9.2 | 48.0 | 9.0 | 46.9 | 38.7 |
| M3 | 6 | 2024-05-27 | 13.3 | 54.7 | 14.8 | 56.4 | 11.8 | 53.1 | 47.8 |
| GPT-4o | 1 | 2024-06-15 | 26.4 | 67.3 | 28.4 | 70.0 | 24.5 | 64.7 | 52.3 |
| LLaVA-1.5-13B | 1 | 2023-10-05 | 13.7 | 55.1 | 13.1 | 55.7 | 14.2 | 54.5 | 47.9 |
| LLaVA-1.5-7B | 1 | 2023-10-05 | 15.3 | 56.8 | 18.3 | 60.5 | 12.3 | 53.2 | 45.7 |
| LLaVA-NeXT-34B | 1 | 2024-01-30 | 19.0 | 60.5 | 18.0 | 60.5 | 19.9 | 60.5 | 49.1 |
| Phi-3-Vision | 1 | 2024-05-19 | 15.4 | 55.2 | 15.1 | 54.4 | 15.6 | 56.0 | 42.0 |
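
For reference, the Overall QA columns are simply the average of the corresponding short and long video columns. The snippet below is a small sanity check of that relationship (our own illustration, not the official evaluation code), using the GPT-4o 16-frame row as an example.

```python
# Sanity check (not the official evaluation code): Overall QA is the mean
# of the short and long video QA scores reported in the table above.

def overall_qa(short_score: float, long_score: float) -> float:
    """Overall QA as the mean of the short and long video QA scores."""
    return (short_score + long_score) / 2

# GPT-4o with 16 input frames, values taken from the leaderboard.
print(overall_qa(38.5, 30.1))  # multiple binary accuracy: ~34.3 (listed as 34.3)
print(overall_qa(75.7, 69.8))  # binary accuracy: 72.75, listed as 72.8 after rounding
```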

Benchmark

Data Examples

All data are newly collected and annotated by humans, not from any existing video dataset.

Benchmark Curation

The Benchmark Is Built from Human Annotations


Overview of the annotation pipeline for TemporalBench. In Step 1, we first collect high-quality captions for the videos from qualified AMT annotators and then refine them. In Step 2, we leverage existing LLMs to generate negative captions by replacing selected words and reordering the sequence of actions, and then filter them ourselves.
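
To make the two Step-2 strategies concrete, here is a purely illustrative sketch of word substitution and action reordering. The caption, the edits, and the helper names are hypothetical and do not reproduce the actual LLM prompts or the human filtering step.

```python
# Illustrative sketch of the two negative-caption strategies described above:
# (1) replacing selected words and (2) reordering the sequence of actions.
# The example caption and edits are hypothetical, not taken from TemporalBench.

positive = "The person picks up the cup, takes two sips, and then sets it down."

def substitute_detail(caption: str, old: str, new: str) -> str:
    """Create a negative by swapping a fine-grained detail (e.g., an action count)."""
    return caption.replace(old, new)

def reorder_actions(actions: list[str]) -> str:
    """Create a negative by swapping the order of the first two actions."""
    reordered = [actions[1], actions[0]] + actions[2:]
    return ", then ".join(reordered).capitalize() + "."

negative_substitution = substitute_detail(positive, "two sips", "three sips")
negative_reordering = reorder_actions(
    ["the person picks up the cup", "the person takes two sips", "the person sets it down"]
)

print(negative_substitution)  # the count is changed: "takes three sips"
print(negative_reordering)    # the first two actions are swapped
```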




TemporalBench Provides High-quality Negatives


Comparison of negative captions generated from the original captions versus from our detailed captions in TemporalBench. With fine-grained details, the negatives are more difficult and more temporally centric.

Experiment Results

Different Question Types


All current LMMs show a large gap to human performance. Visualization of binary accuracy for short video QA by (a) subset and (b) negative type. Humans perform much better than GPT-4o, Qwen2-VL-72B, LLaVA-OneVision-72B, and Gemini-1.5-Pro.

More Frames Help, but Not Much

Model performance on TemporalBench with a varying number of frames for short video understanding. With more frames, LMMs mostly perform better, but the improvement is limited.
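
For context on the Frames column and this ablation: video LMMs are typically fed a fixed number of frames sampled roughly uniformly from the clip, or frames at a fixed FPS. Below is a minimal uniform-sampling sketch under that assumption; the function is ours, and individual models may sample differently.

```python
# Minimal uniform frame-sampling sketch (an assumption about typical LMM
# preprocessing; individual models may sample frames differently).

def sample_frame_indices(num_total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` roughly evenly spaced frame indices from the clip."""
    if num_total_frames <= num_frames:
        return list(range(num_total_frames))
    step = num_total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]

# e.g., 8 frames from a 10-second clip decoded at 30 FPS (300 frames total)
print(sample_frame_indices(300, 8))  # [18, 56, 93, 131, 168, 206, 243, 281]
```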

A Pitfall in Multi-choice Question Answering

While developing our benchmark, we noticed another previously overlooked but critical pitfall of multi-choice QA. Specifically, if every negative answer choice is generated by changing a small part of the correct answer, an LLM can detect those changes, identify the centralized description, and use it as a cue for its prediction. To study this, given a positive caption C and one of its associated negative captions N_1(C), we intentionally derive additional negatives from N_1(C) (instead of from C), namely N_1(N_1(C)) and N_2(N_1(C)), yielding the options [C, N_1(C), N_1(N_1(C)), N_2(N_1(C))], so that N_1(C) becomes the centralized description (see the negative caption generation figure above). Surprisingly, we find that 66.4% of text-only GPT-4o's predictions correspond to N_1(C), while only 6.4% correspond to C. Our findings also align with human behavior analysis from psychology (Furman et al., 2008), where humans can achieve better-than-chance performance on multi-choice QAs using similar cues.
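
To see why the centralized description is such a strong cue, consider the crude text-only heuristic sketched below (our own illustration, not the paper's analysis): score each option by its average word-overlap similarity to the other options and pick the most central one. With options constructed as above, the heuristic selects N_1(C), the wrong answer, without ever looking at the video. The captions are hypothetical.

```python
# Illustration of the centralized-description cue: a text-only heuristic that
# picks the option most similar, on average, to the other options.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two captions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def most_central(options: list[str]) -> str:
    """Return the option with the highest mean similarity to the other options."""
    def centrality(i: int) -> float:
        others = [o for j, o in enumerate(options) if j != i]
        return sum(jaccard(options[i], o) for o in others) / len(others)
    return options[max(range(len(options)), key=centrality)]

# Hypothetical captions mirroring the construction [C, N1(C), N1(N1(C)), N2(N1(C))]
C   = "the man stirs the pot twice and then adds salt"          # positive caption C
N1  = "the man stirs the pot three times and then adds salt"    # N1(C), an edit of C
N11 = "the man stirs the pot three times and then adds pepper"  # N1(N1(C)), an edit of N1(C)
N12 = "the woman stirs the pot three times and then adds salt"  # N2(N1(C)), an edit of N1(C)

print(most_central([C, N1, N11, N12]) == N1)  # True: the centralized (wrong) option wins
```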

Motivated by these findings, we propose to decompose a single multi-choice QA into multiple binary QAs. This eliminates the centralized-description cue, since each query presents only two options. As a result, given M negatives, the multiple binary QAs query a model M times, and the random chance performance drops from 1/(M+1) to (1/2)^M. Since (1/2)^M < 1/(M+1) for every M >= 2, multiple binary QA is a more difficult task than multi-choice QA.
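
Concretely, multiple binary accuracy groups the binary questions that share the same positive caption and credits that caption only if all of them are answered correctly. Below is a minimal reference sketch under that reading; the field names (clip_id, correct) are hypothetical, not the official evaluation schema.

```python
# Minimal sketch of binary accuracy vs. multiple binary accuracy (MBA).
# Field names ("clip_id", "correct") are hypothetical, not the official schema.
from collections import defaultdict

def binary_accuracy(results: list[dict]) -> float:
    """Fraction of individual binary questions answered correctly."""
    return sum(r["correct"] for r in results) / len(results)

def multiple_binary_accuracy(results: list[dict]) -> float:
    """Fraction of clips whose M binary questions are *all* answered correctly.

    Random chance is (1/2) ** M per clip, versus 1 / (M + 1) for multi-choice QA.
    """
    per_clip = defaultdict(list)
    for r in results:
        per_clip[r["clip_id"]].append(r["correct"])
    return sum(all(v) for v in per_clip.values()) / len(per_clip)

# Toy example: two clips, two binary questions each.
results = [
    {"clip_id": "clip_0", "correct": True},
    {"clip_id": "clip_0", "correct": True},
    {"clip_id": "clip_1", "correct": True},
    {"clip_id": "clip_1", "correct": False},
]
print(binary_accuracy(results))           # 0.75
print(multiple_binary_accuracy(results))  # 0.5 (only clip_0 is fully correct)
```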

GPT-4o Fails in Distinguishing Basic Temporal Dynamics

Citation


      @article{cai2024temporalbench,
        title={TemporalBench: Towards Fine-grained Temporal Understanding for Multimodal Video Models},
        author={Cai, Mu and Tan, Reuben and Zhang, Jianrui and Zou, Bocheng and Zhang, Kai and Yao, Feng and Zhu, Fangrui and Gu, Jing and Zhong, Yiwu and Shang, Yuzhang and Dou, Yao and Park, Jaden and Gao, Jianfeng and Lee, Yong Jae and Yang, Jianwei},
        journal={arXiv preprint arXiv:2410.10818},
        year={2024}
      }