An LLM debate judge evaluating a debate between Alice and Bob. The LLM needs to understand the arguments and how they counter each other (purple bubble); it also needs to evaluate the speeches along multiple dimensions (orange bubble). However, multi-round debates are often long, diluting the LLM's attention or exceeding its context window (light gray bubble).
How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. At the same time, current research mainly focuses on short dialogues, rarely touching upon the evaluation of an entire debate.
By leveraging Large Language Models (LLMs), we propose Debatrix, which makes the analysis and assessment of multi-turn debates more aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration.
To align with real-world debate scenarios, we introduce the PanelBench benchmark, comparing our system's performance to actual debate outcomes. The findings indicate a notable enhancement over directly using LLMs for debate evaluation.
We propose Debatrix, a fine-grained automatic debate judging framework based on LLMs that breaks the task down along both chronological and dimensional axes.
An illustration of the General Structure of Debatrix. ①: Add speech to context memory; ②: Retrieve relevant pieces of context and analysis; ③: Add analysis and reflections to analysis memory; ④: Fetch analysis for final judgment. The framework can generate speech judgments, debater judgments, and the winner verdict based on analysis from single or multiple dimensions.
Iterative Speech Analysis
We instruct the LLM to analyze the debate speech by speech, maintaining speech and analysis streams with a memory system and providing content analysis of previous speeches when analyzing new speeches. After reviewing all speeches, the LLM makes decisions based on all analyses. This iterative approach lets the LLM concentrate on one speech at a time and understand the context more effectively. It also produces feedback or decisions for each speech, each debater, and the final winner.
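As a concrete illustration, below is a minimal Python sketch of this loop, assuming a generic `llm(prompt) -> str` chat-completion wrapper; the class, memory layout, and prompt wording are our own simplifications, not Debatrix's actual implementation. The numbered comments mirror steps ①-④ in the framework figure.

```python
# Minimal sketch of iterative chronological analysis (illustrative only).
# `llm` is assumed to be any callable that maps a prompt string to a reply.

class IterativeJudge:
    def __init__(self, llm):
        self.llm = llm
        self.context_memory = []   # raw speeches seen so far
        self.analysis_memory = []  # per-speech analyses and reflections

    def analyze_speech(self, speaker, speech):
        # ① add the new speech to the context memory
        self.context_memory.append(f"{speaker}: {speech}")
        # ② retrieve prior context and analyses; a real system would select
        # only the relevant pieces, here we simply pass everything so far
        prior = "\n".join(self.analysis_memory) or "(none yet)"
        analysis = self.llm(
            "You are judging a debate. Analyses of previous speeches:\n"
            f"{prior}\n\nAnalyze the new speech by {speaker}:\n{speech}"
        )
        # ③ store the fresh analysis for later speeches to build on
        self.analysis_memory.append(f"[{speaker}] {analysis}")
        return analysis

    def final_verdict(self):
        # ④ judge from the accumulated analyses, not the full transcript
        return self.llm(
            "Based on these speech-by-speech analyses, decide the winner and "
            "justify the verdict:\n" + "\n".join(self.analysis_memory)
        )
```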
Multi-Dimensional Collaboration
Debatrix also allows LLMs to focus on a specific judging dimension, such as argument, language, or clash, during speech analysis. Each LLM agent comments on its own aspect. For the overall judgment, these individual analyses are combined into one summary, providing a systematic judgment across multiple dimensions.
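Continuing the sketch above, one way to compose per-dimension agents is to give each its own iterative judge and then merge their verdicts. The dimension names follow the merged DebateArt set (argument, source, language); the wrapper and prompts are again illustrative assumptions.

```python
# Sketch of multi-dimensional collaboration, reusing IterativeJudge above.

DIMENSIONS = ["argument", "source", "language"]

def focused(llm, dim):
    """Wrap an LLM so every call is steered toward one judging dimension."""
    return lambda prompt: llm(
        f"Focus exclusively on the '{dim}' dimension of the debate.\n\n{prompt}"
    )

def multi_dimensional_verdict(llm, speeches):
    # one iterative judge per dimension, each with a dimension-focused LLM
    judges = {dim: IterativeJudge(focused(llm, dim)) for dim in DIMENSIONS}
    for speaker, speech in speeches:
        for judge in judges.values():
            judge.analyze_speech(speaker, speech)

    # combine the per-dimension judgments into one systematic summary
    report = "\n\n".join(
        f"[{dim}] {judge.final_verdict()}" for dim, judge in judges.items()
    )
    return llm(
        "Combine these per-dimension judgments into an overall verdict "
        "naming a single winner:\n\n" + report
    )
```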
Furthermore, we introduce PanelBench, a novel benchmark for evaluating automatic debate judging systems. PanelBench consists of two collections of debates with judgments: DebateArt for 1v1 debates with dimensional judgments and BP-Competition for high-quality debates with multiple debaters on each side.
1v1 Debates with Dimensional Judgments
DebateArt debates are sourced from DebateArt, an online debate platform providing 1v1 debate arenas that follow competitive debate formats. Debates vary in speech count and length and come with dimensional voting results. PanelBench includes 100 debates with valid votes from DebateArt.
 | # Speeches | # Speech Tokens | # Debate Tokens |
---|---|---|---|
Min | 4.0 | 53.0 | 468.0 |
Mean | 6.7 | 650.5 | 4,342.6 |
Max | 10.0 | 2,368.0 | 12,337.0 |
On DebateArt, voters must consider and vote on four metrics for comparative performance insights: argument, source, legibility, and conduct. To align with oral debates, which carry no written formatting, we merged two of these dimensions, legibility and conduct, into a single language dimension representing language style.
Dimension | Pro | Tie | Con | D2O RMSE |
---|---|---|---|---|
Argument | 33 | 11 | 56 | 23.85 |
Source | 14 | 67 | 19 | 41.02 |
Language | 9 | 66 | 25 | 47.20 |
General | 37 | 7 | 56 | - |
High-Quality Debates with More Debaters
BP-Competition includes 22 debates transcribed from world-class competitive debate competitions. These debates follow the British Parliamentary (BP) format involving four teams (two on each side), enriching PanelBench with long, complex, high-quality samples.
 | # Speech Tokens | # Debate Tokens |
---|---|---|
Min | 1,478.0 | 13,571.0 |
Mean | 1,892.5 | 15,139.9 |
Max | 2,411.0 | 17,089.0 |
In BP debates, four teams (OG, OO, CG, and CO) are divided between the two sides of a motion but compete against all three other teams. PanelBench requires judging which of the four teams is the best. Some BP debates have more than one winning team; PanelBench treats predicting any of the winning teams as correct.
Team | # Wins |
---|---|
Opening Government (OG) | 8 |
Opening Opposition (OO) | 16 |
Closing Government (CG) | 8 |
Closing Opposition (CO) | 6 |
We conducted experiments on PanelBench to evaluate the debate judging performance of LLMs, comparing our Debatrix framework (using ChatGPT as the backbone LLM) against judging directly with LLMs. For DebateArt debates, we predict the winner in two ways, either by comparing per-debater scores (Score Comparison) or by asking for the winner outright (Direct Prediction), and calculate the RMSE against the true winners.
For BP-Competition debates, we always predict the winner directly, measuring the completion rate and the prediction accuracy.
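For clarity, here is a sketch of these metrics under stated assumptions: the mapping from verdicts to the numeric scores behind the DebateArt RMSE is not specified here, so `rmse` takes already-numeric scores; we take "completion" to mean producing a valid verdict, assume BP accuracy is counted over all debates (failed runs score zero), and implement the any-winner rule described earlier.

```python
import math

def rmse(predicted_scores, gold_scores):
    """Root-mean-square error between numeric verdict scores (DebateArt)."""
    return math.sqrt(
        sum((p - g) ** 2 for p, g in zip(predicted_scores, gold_scores))
        / len(predicted_scores)
    )

def bp_metrics(predictions, winner_sets):
    """Completion rate and any-winner accuracy (BP-Competition).

    `predictions[i]` is a team name ('OG', 'OO', 'CG', 'CO') or None when
    the judge failed to produce a valid verdict; `winner_sets[i]` is the set
    of gold winning teams (some debates have several winners).
    """
    n = len(predictions)
    completion_rate = sum(p is not None for p in predictions) / n
    # assumed: accuracy is counted over all debates, failed runs score 0
    accuracy = sum(
        p is not None and p in winners
        for p, winners in zip(predictions, winner_sets)
    ) / n
    return completion_rate, accuracy
```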
Model | DebateArt: Score Comparison (RMSE↓) | DebateArt: Direct Prediction (RMSE↓) | BP-Competition: Completion Rate (%↑) | BP-Competition: Accuracy (%↑) |
---|---|---|---|---|
ChatGPT | 49.99 | 51.16 | 13.64 | 0.00 |
GPT-4 | 44.84 | 46.55 | 100.00 | 34.85 |
Chronological | 48.61 | 48.71 | 100.00 | 30.30 |
Dimensional | 44.91 | 45.01 | 36.36 | 12.12 |
NonIterative | 44.18 | 44.03 | 66.67 | 36.36 |
Debatrix | 42.21 | 41.75 | 100.00 | 51.52 |
Iterative chronological analysis is crucial for ChatGPT to handle very long debates, while dimensional collaboration also helps on shorter debates; combining the two yields even better performance. Finally, by using previous content analyses iteratively, Debatrix consistently outperforms all baselines on both debate collections, including direct judging with the more powerful GPT-4.
DebateArt debates vary in speech count and length: among all 100 debates, 34 have at least 8 speeches, and 51 contain at least 4,000 tokens. Some baseline models show advantages on parts of this range but fail to cover all scenarios, whereas Debatrix maintains a relatively low RMSE regardless of the number of speeches or tokens. This shows that Debatrix effectively assists the LLM in evaluating long, multi-turn debates while extending its advantage to short ones.
Model | Score Comparison↓ | Direct Prediction↓ |
---|---|---|
Dimensional | 52.06 | 52.23 |
Dimensional (GPT-4) | 51.13 | 51.06 |
NonIterative | 50.87 | 50.37 |
Debatrix | 47.50 | 47.67 |
Among all dimensions, argument is the one that most affects debaters' persuasiveness. Using the more powerful GPT-4 does not improve Dimensional much on this dimension; instead, chronological analysis brings a significant improvement. Analyzing speeches iteratively is also beneficial, allowing Debatrix to understand arguments better without resorting to larger models.
Improvements in specific dimensions eventually result in better overall judgments under dimensional collaboration: models that split dimensions (Dimensional and Debatrix) consistently outperform their counterparts without dimension splitting (ChatGPT and Chronological).
Model | OG (Opening) | OO (Opening) | CG (Closing) | CO (Closing) | N/A |
---|---|---|---|---|---|
GPT-4 | 0 | 0 | 10 | 56 | 0 |
NonIterative | 17 | 12 | 9 | 6 | 22 |
Debatrix | 13 | 21 | 15 | 17 | 0 |
Gold | 12 | 30 | 13.5 | 10.5 | 0 |
Further investigation on BP-Competition reveals that GPT-4 always predicts a team from the closing half (CG and CO), which speaks after the opening half (OG and OO); in most cases, it selects CO, which speaks last. Meanwhile, ChatGPT-based Debatrix gives relatively balanced predictions that roughly match the gold distribution.
We conjecture that position bias is a significant factor behind GPT-4's failure to judge BP debates: the LLM may prefer the last speaker, who can refute others without being refuted and thus seems more convincing.
@article{liang2024debatrix,
title={Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM},
author={Liang, Jingcong and Ye, Rong and Han, Meng and Lai, Ruofei and Zhang, Xinyu and Huang, Xuanjing and Wei, Zhongyu},
journal={arXiv preprint arXiv:2403.08010},
year={2024}
}