An LLM debate judge evaluating a debate between Alice and Bob. The LLM needs to understand the arguments and how they counter each other (purple bubble); it also needs to evaluate the speeches along multiple dimensions (orange bubble). However, multi-round debates are often long, diluting the LLM's attention or exceeding its context window (light gray bubble).
How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. At the same time, current research mainly focuses on short dialogues, rarely touching upon the evaluation of an entire debate.
By leveraging Large Language Models (LLMs), we propose Debatrix, which makes the analysis and assessment of multi-turn debates more aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration.
To align with real-world debate scenarios, we introduce the PanelBench benchmark, comparing our system's performance to actual debate outcomes. The findings indicate a notable enhancement over directly using LLMs for debate evaluation.
We propose Debatrix, a fine-grained automatic debate judging framework based on LLMs that breaks the task down along both chronological and dimensional axes.
An illustration of the General Structure of Debatrix. ①: Add speech to context memory; ②: Retrieve relevant pieces of context and analysis; ③: Add analysis and reflections to analysis memory; ④: Fetch analysis for final judgment. The framework can generate speech judgments, debater judgments, and the winner verdict based on analysis from single or multiple dimensions.
Iterative Speech Analysis
We instruct the LLM to analyze the debate speech by speech, maintaining speech and analysis streams with a memory system and providing content analysis of previous speeches when analyzing new speeches. After reviewing all speeches, the LLM makes decisions based on all analyses. This iterative approach lets the LLM concentrate on one speech at a time and understand the context more effectively. It also produces feedback or decisions for each speech, each debater, and the final winner.
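As a concrete illustration, below is a minimal Python sketch of this loop, assuming a generic `llm(prompt) -> str` chat-completion wrapper; the class, memory layout, and prompt wording are our own simplifications, not Debatrix's actual implementation. The numbered comments mirror steps ①-④ in the framework figure.

```python
# Minimal sketch of iterative chronological analysis (illustrative only).
# `llm` is assumed to be any callable that maps a prompt string to a reply.

class IterativeJudge:
    def __init__(self, llm):
        self.llm = llm
        self.context_memory = []   # raw speeches seen so far
        self.analysis_memory = []  # per-speech analyses and reflections

    def analyze_speech(self, speaker, speech):
        # ① add the new speech to the context memory
        self.context_memory.append(f"{speaker}: {speech}")
        # ② retrieve prior context and analyses; a real system would select
        # only the relevant pieces, here we simply pass everything so far
        prior = "\n".join(self.analysis_memory) or "(none yet)"
        analysis = self.llm(
            "You are judging a debate. Analyses of previous speeches:\n"
            f"{prior}\n\nAnalyze the new speech by {speaker}:\n{speech}"
        )
        # ③ store the fresh analysis for later speeches to build on
        self.analysis_memory.append(f"[{speaker}] {analysis}")
        return analysis

    def final_verdict(self):
        # ④ judge from the accumulated analyses, not the full transcript
        return self.llm(
            "Based on these speech-by-speech analyses, decide the winner and "
            "justify the verdict:\n" + "\n".join(self.analysis_memory)
        )
```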
Multi-Dimensional Collaboration
Debatrix also allows LLMs to focus on a specific judging dimension, such as argument, language, or clash, during speech analysis. Each LLM agent comments on its own aspect. For the overall judgment, these individual analyses are combined into one summary, providing a systematic judgment across multiple dimensions.
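Continuing the sketch above, one way to compose per-dimension agents is to give each its own iterative judge and then merge their verdicts. The dimension names follow the merged DebateArt set (argument, source, language); the wrapper and prompts are again illustrative assumptions.

```python
# Sketch of multi-dimensional collaboration, reusing IterativeJudge above.

DIMENSIONS = ["argument", "source", "language"]

def focused(llm, dim):
    """Wrap an LLM so every call is steered toward one judging dimension."""
    return lambda prompt: llm(
        f"Focus exclusively on the '{dim}' dimension of the debate.\n\n{prompt}"
    )

def multi_dimensional_verdict(llm, speeches):
    # one iterative judge per dimension, each with a dimension-focused LLM
    judges = {dim: IterativeJudge(focused(llm, dim)) for dim in DIMENSIONS}
    for speaker, speech in speeches:
        for judge in judges.values():
            judge.analyze_speech(speaker, speech)

    # combine the per-dimension judgments into one systematic summary
    report = "\n\n".join(
        f"[{dim}] {judge.final_verdict()}" for dim, judge in judges.items()
    )
    return llm(
        "Combine these per-dimension judgments into an overall verdict "
        "naming a single winner:\n\n" + report
    )
```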
Furthermore, we introduce PanelBench, a novel benchmark for evaluating automatic debate judging systems. PanelBench consists of two collections of debates with judgments: DebateArt for 1v1 debates with dimensional judgments and BP-Competition for high-quality debates with multiple debaters on each side.
1v1 Debates with Dimensional Judgments
DebateArt debates are sourced from DebateArt, an online debate platform providing 1v1 debate arenas that follow competitive debate formats. Debates vary in speech count and length and come with dimensional voting results. PanelBench includes 100 debates with valid votes from DebateArt.
 | # Speeches | # Speech Tokens | # Debate Tokens |
---|---|---|---|
Min | 4.0 | 53.0 | 468.0 |
Mean | 6.7 | 650.5 | 4,342.6 |
Max | 10.0 | 2,368.0 | 12,337.0 |
On DebateArt, voters must consider and vote on four metrics for comparative performance insights: argument, source, legibility, and conduct. To align with oral debates, which carry no written formatting, we merged two of these dimensions, legibility and conduct, into a single language dimension representing language style.
Dimension | Pro | Tie | Con | D2O RMSE |
---|---|---|---|---|
Argument | 33 | 11 | 56 | 23.85 |
Source | 14 | 67 | 19 | 41.02 |
Language | 9 | 66 | 25 | 47.20 |
General | 37 | 7 | 56 | - |
High-Quality Debates with More Debaters
BP-Competition includes 22 debates transcribed from world-class competitive debate competitions. These debates follow the British Parliamentary (BP) format involving four teams (two on each side), enriching PanelBench with long, complex, high-quality samples.
 | # Speech Tokens | # Debate Tokens |
---|---|---|
Min | 1,478.0 | 13,571.0 |
Mean | 1,892.5 | 15,139.9 |
Max | 2,411.0 | 17,089.0 |
In BP debates, four teams (OG, OO, CG, and CO) are divided between the two sides of a motion but compete against all three other teams. PanelBench requires judging which of the four teams is the best. Some BP debates have more than one winning team; PanelBench treats predicting any of the winning teams as correct.
Team | # Wins |
---|---|
Opening Government (OG) | 8 |
Opening Opposition (OO) | 16 |
Closing Government (CG) | 8 |
Closing Opposition (CO) | 6 |
We conducted experiments on PanelBench to evaluate the debate judging performance of LLMs, comparing our Debatrix framework (using ChatGPT as the backbone LLM) against judging directly with LLMs. For DebateArt debates, we predict the winner in two ways, either by comparing per-debater scores (Score Comparison) or by asking for the winner outright (Direct Prediction), and calculate the RMSE against the true winners.
For BP-Competition debates, we always predict the winner directly, measuring the completion rate and the prediction accuracy.
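For clarity, here is a sketch of these metrics under stated assumptions: the mapping from verdicts to the numeric scores behind the DebateArt RMSE is not specified here, so `rmse` takes already-numeric scores; we take "completion" to mean producing a valid verdict, assume BP accuracy is counted over all debates (failed runs score zero), and implement the any-winner rule described earlier.

```python
import math

def rmse(predicted_scores, gold_scores):
    """Root-mean-square error between numeric verdict scores (DebateArt)."""
    return math.sqrt(
        sum((p - g) ** 2 for p, g in zip(predicted_scores, gold_scores))
        / len(predicted_scores)
    )

def bp_metrics(predictions, winner_sets):
    """Completion rate and any-winner accuracy (BP-Competition).

    `predictions[i]` is a team name ('OG', 'OO', 'CG', 'CO') or None when
    the judge failed to produce a valid verdict; `winner_sets[i]` is the set
    of gold winning teams (some debates have several winners).
    """
    n = len(predictions)
    completion_rate = sum(p is not None for p in predictions) / n
    # assumed: accuracy is counted over all debates, failed runs score 0
    accuracy = sum(
        p is not None and p in winners
        for p, winners in zip(predictions, winner_sets)
    ) / n
    return completion_rate, accuracy
```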
Model | DebateArt: Score Comparison (RMSE↓) | DebateArt: Direct Prediction (RMSE↓) | BP-Competition: Completion Rate (%↑) | BP-Competition: Accuracy (%↑) |
---|---|---|---|---|
ChatGPT | 49.99 | 51.16 | 13.64 | 0.00 |
GPT-4 | 44.84 | 46.55 | 100.00 | 34.85 |
Chronological | 48.61 | 48.71 | 100.00 | 30.30 |
Dimensional | 44.91 | 45.01 | 36.36 | 12.12 |
NonIterative | 44.18 | 44.03 | 66.67 | 36.36 |
Debatrix | 42.21 | 41.75 | 100.00 | 51.52 |
Iterative chronological analysis is crucial for ChatGPT to handle very long debates, while dimensional collaboration also helps on shorter debates; combining the two yields even better performance. Finally, by using previous content analyses iteratively, Debatrix consistently outperforms all baselines on both debate collections, including direct judging with the more powerful GPT-4.
DebateArt debates vary in speech count and length: among all 100 debates, 34 have at least 8 speeches, and 51 contain at least 4,000 tokens. Some baseline models show advantages on parts of this range but fail to cover all scenarios, whereas Debatrix maintains a relatively low RMSE regardless of the number of speeches or tokens. This shows that Debatrix effectively assists the LLM in evaluating long, multi-turn debates while extending its advantage to short ones.
Model | Score Comparison↓ | Direct Prediction↓ |
---|---|---|
Dimensional | 52.06 | 52.23 |
Dimensional (GPT-4) | 51.13 | 51.06 |
NonIterative | 50.87 | 50.37 |
Debatrix | 47.50 | 47.67 |
Among all dimensions, argument is the one that most affects debaters' persuasiveness. Using the more powerful GPT-4 does not improve Dimensional much on this dimension; instead, chronological analysis brings a significant improvement. Analyzing speeches iteratively is also beneficial, allowing Debatrix to understand arguments better without resorting to larger models.
Improvements in specific dimensions eventually result in better overall judgments under dimensional collaboration: models that split dimensions (Dimensional and Debatrix) consistently outperform their counterparts without dimension splitting (ChatGPT and Chronological).
Model | OG (Opening) | OO (Opening) | CG (Closing) | CO (Closing) | N/A |
---|---|---|---|---|---|
GPT-4 | 0 | 0 | 10 | 56 | 0 |
NonIterative | 17 | 12 | 9 | 6 | 22 |
Debatrix | 13 | 21 | 15 | 17 | 0 |
Gold | 12 | 30 | 13.5 | 10.5 | 0 |
Further investigation on BP-Competition reveals that GPT-4 always predicts a team from the closing half (CG and CO), which speaks after the opening half (OG and OO); in most cases, it selects CO, which speaks last. Meanwhile, ChatGPT-based Debatrix gives relatively balanced predictions that roughly match the gold distribution.
We conjecture that position bias is a significant factor behind GPT-4's failure to judge BP debates: the LLM may prefer the last speaker, who can refute others without being refuted and thus seems more convincing.
@article{liang2024debatrix,
title={Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM},
author={Liang, Jingcong and Ye, Rong and Han, Meng and Lai, Ruofei and Zhang, Xinyu and Huang, Xuanjing and Wei, Zhongyu},
journal={arXiv preprint arXiv:2403.08010},
year={2024}
}