A multi-agent system and benchmark for end-to-end peer review modeling, enabling real-world evaluation and predictive insights for future conference decisions.
Enter your OpenReview submission ID or paper title to check predictions from our multi-agent system.
Want the complete dataset? Download all ~13,000 predictions with detailed reasoning.
📥 Download Full Accept List & Reasoning

Peer review is fundamental to academic research but remains challenging to model due to its subjectivity, dynamics, and multi-stage complexity. Previous efforts leveraging large language models (LLMs) have primarily explored isolated sub-tasks, such as review generation or score prediction, failing to capture the entire evaluation workflow.
To this end, we introduce PaperDecision to model the peer review process end-to-end. Central to our approach is PaperDecision-Bench, a large-scale multimodal benchmark that links OpenReview papers, reviews, rebuttals, and final decisions across multiple conference cycles. By continuously incorporating newly released conference rounds, the benchmark remains forward-looking and helps avoid risks of data leakage in evaluation.
Building on this benchmark, we develop PaperDecision-Agent, a multi-agent system that simulates the roles and interactions of authors, reviewers, and area chairs. Empirically, frontier multimodal LLMs achieve up to ~82% accuracy in accept/reject prediction. We further provide an in-depth analysis of the decision-making process, identifying several key factors associated with acceptance outcomes, such as reviewer expertise and score changes. Overall, PaperDecision establishes the first dynamic and extensible benchmark for automated peer review, laying the groundwork for more accurate, transparent, and scalable AI-assisted paper decision systems.
A structured framework for comprehensive peer review modeling.
A large-scale, multimodal benchmark integrating research papers and reviews across multiple stages, with a three-tier design for real-world prediction.
Multi-agent system simulating interactions among authors, reviewers, and area chairs with specialized roles for each stage.
82.44% accuracy on ICLR 2025 for accept/reject prediction. Stable performance across years (2023-2025) demonstrates robust generalization.
Our system models key roles in the real-world peer review process through specialized agents; a minimal pipeline sketch follows the role descriptions below.
Initial evaluation of novelty, methodology, and quality
Aggregates feedback & assesses reliability
Evaluates author responses & attitude shifts
Final accept/reject decision with reasoning
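To make the staged workflow concrete, here is a minimal Python sketch of how the four roles could be chained. The `call_llm` helper, the role prompts, and the data layout are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a four-role pipeline (Reviewer -> Summarizer ->
# Rebuttal Analyzer -> Decision Maker). Role prompts and the `call_llm`
# helper are illustrative assumptions, not the released API.
from dataclasses import dataclass

@dataclass
class Submission:
    paper_text: str
    reviews: list[str]      # official reviewer texts (with scores)
    rebuttals: list[str]    # author responses

def call_llm(role_prompt: str, content: str) -> str:
    """Placeholder for a backbone multimodal LLM call (e.g., an API client)."""
    raise NotImplementedError

def predict_decision(sub: Submission) -> str:
    # 1. Reviewer agent: initial evaluation of novelty, methodology, and quality.
    initial = call_llm("Act as a reviewer: assess novelty, methodology, and quality.",
                       sub.paper_text)
    # 2. Summarizer agent: aggregates feedback and assesses reviewer reliability.
    summary = call_llm("Aggregate this feedback and rate each reviewer's reliability.",
                       "\n\n".join([initial] + sub.reviews))
    # 3. Rebuttal analyzer: evaluates author responses and reviewer attitude shifts.
    rebuttal_report = call_llm("Judge whether the rebuttal resolves each concern.",
                               summary + "\n\n" + "\n\n".join(sub.rebuttals))
    # 4. Decision maker: final accept/reject decision with tier and reasoning.
    return call_llm("As area chair, output Accept/Reject, a tier, and reasoning.",
                    summary + "\n\n" + rebuttal_report)
```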
Targets ICLR 2026 decision prediction, where models observe papers and reviews while final outcomes remain hidden: a gold-standard test of cross-temporal generalization.
Complete ICLR 2023–2025 data for solid retrospective evaluation, enabling robust model comparison and systematic error analysis.
Cost-efficient benchmark focusing on MLLM, 3D, and RL papers with ambiguous decision boundaries for rapid iteration.
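For orientation, a hypothetical layout for one benchmark entry is sketched below; the field names and the `split` values are assumptions for illustration and may differ from the released dataset.

```python
# Hypothetical layout of one PaperDecision-Bench entry. Field names and the
# `split` values are assumptions for illustration, not the released schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BenchEntry:
    openreview_id: str                      # OpenReview forum/submission ID
    year: int                               # conference cycle, e.g., 2023-2026
    split: str                              # "forward-2026" | "full-2023-2025" | "cost-efficient"
    paper_pdf: Optional[bytes] = None       # multimodal input (text, figures, tables)
    reviews: list[dict] = field(default_factory=list)    # reviewer scores and texts
    rebuttals: list[dict] = field(default_factory=list)  # author responses per reviewer
    decision: Optional[str] = None          # "Oral"/"Spotlight"/"Poster"/"Reject"; hidden for 2026
```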
Comprehensive evaluation of frontier multimodal LLMs on ICLR paper acceptance prediction.
We report performance on the ICLR 2025 benchmark, using ICLR 2024/2023 as historical reference. We evaluate both accept/reject decisions and the predicted tier (Oral/Spotlight/Poster).
Tables sorted by Accept Precision (descending).
Each row represents a complete multi-agent system powered by the specified backbone model. The backbone handles all agent roles (Reviewer, Summarizer, Rebuttal Analyzer, Decision Maker).
Our experiments show that single-agent prediction (direct paper → decision) achieves much lower accuracy than our multi-agent pipeline. The structured workflow with specialized roles significantly improves prediction quality.
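For reference, the reported metrics can be computed as in the sketch below; the toy labels and the use of scikit-learn are assumptions, since the metric implementation is not specified here, but the definitions mirror the table columns.

```python
# Sketch of the reported metrics on toy labels. The use of scikit-learn
# is an assumption; only the metric definitions mirror the tables.
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = ["accept", "accept", "accept", "reject", "reject", "accept"]
y_pred = ["accept", "accept", "reject", "reject", "accept", "accept"]

accept_prec = precision_score(y_true, y_pred, pos_label="accept")  # "Accept Prec"
reject_prec = precision_score(y_true, y_pred, pos_label="reject")  # "Reject Prec"
acc = accuracy_score(y_true, y_pred)                               # "Acc"

tier_true = ["Oral", "Spotlight", "Poster", "Reject", "Reject", "Poster"]
tier_pred = ["Oral", "Poster", "Reject", "Reject", "Poster", "Spotlight"]
# Per-class F1 for the accepted tiers ("Oral F1", "Spotlight F1", "Poster F1")
oral_f1, spotlight_f1, poster_f1 = f1_score(
    tier_true, tier_pred, labels=["Oral", "Spotlight", "Poster"], average=None
)
print(f"Accept Prec {accept_prec:.2%}, Reject Prec {reject_prec:.2%}, Acc {acc:.2%}")
```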
| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 85.55% | 79.95% | 81.92% | 0.00% | 9.33% | 66.38% |
| MiniMax M2.1 | 80.30% | 73.01% | 75.20% | 13.60% | 29.81% | 54.13% |
| GPT-4.1 | 79.94% | 81.08% | 80.63% | 38.64% | 32.52% | 47.09% |
| Qwen3-Max | 72.50% | 75.56% | 74.38% | 22.88% | 15.15% | 36.39% |
| Gemini 3 Pro | 66.04% | 93.77% | 76.72% | 55.50% | 40.11% | 62.65% |
| Grok-4 | 64.69% | 84.19% | 73.33% | 26.29% | 51.84% | 63.39% |
| GLM-4.6v | 63.00% | 81.39% | 71.28% | 0.00% | 18.37% | 62.26% |
| Gemini 2.5 Pro | 61.44% | 85.26% | 70.83% | 35.42% | 15.03% | 31.03% |
| MiMo-VL-7B (RL) | 59.62% | 81.35% | 68.49% | 16.25% | 17.85% | 57.40% |
| Doubao 1.6 | 58.05% | 89.77% | 67.85% | 24.74% | 22.71% | 41.83% |
| Kimi-VL | 44.29% | 92.49% | 45.93% | 0.92% | 20.06% | 50.57% |
| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 85.50% | 82.84% | 83.70% | 0.00% | 14.78% | 63.96% |
| Gemini 3 Pro | 63.90% | 94.54% | 76.91% | 33.52% | 39.41% | 58.96% |
| Doubao 1.6 | 54.67% | 90.36% | 66.73% | 18.05% | 29.07% | 39.36% |
Results on ICLR 2024 show patterns consistent with 2025, with GPT-5.2 achieving the highest accuracy and accept precision.
| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 98.31% | 72.01% | 77.03% | 0.00% | 8.33% | 44.78% |
| Gemini 3 Pro | 91.20% | 86.16% | 87.95% | 36.46% | 39.23% | 62.62% |
| Doubao 1.6 | 77.03% | 69.37% | 71.07% | 10.81% | 30.71% | 35.01% |
ICLR 2023 results demonstrate strong cross-year generalization, with Gemini 3 Pro achieving 87.95% accuracy.
Understanding model behaviors and limitations on ICLR 2025.
The key factors that influence paper decisions.
| Feature | Correlation | Quick Take |
|---|---|---|
| Average Reviewer Score † | +0.705 | The single biggest factor. |
| Minimum Reviewer Score † | +0.617 | A low floor hurts badly. |
| Maximum Reviewer Score † | +0.542 | High scores help, but less than the average. |
| Rebuttal Success Rate * | +0.530 | Rebuttals matter nearly as much as scores. |
| Paper Novelty * | +0.270 | Being unique gives a moderate boost. |
| Converted Reviewers (Low→High) * | +0.213 | Changing minds is a good sign. |
| Paper Visual Quality * | +0.200 | Good formatting helps a little. |
| Number of Shallow Reviewers * | +0.089 | Shallow reviews can actually help papers get accepted. |
| Score Variance † | -0.106 | Disagreement usually hurts. |
| Score Range † | -0.119 | Wide gaps in scores are bad. |
| Number of Expert Reviewers * | -0.132 | Experts are harder to please. |
| Explicitly Stubborn Reviewers * | -0.323 | Stubborn reviewers are deal-breakers. |
† Score-based features: Calculated directly from official ICLR reviewer scores (e.g., average, min, max, variance, range).
* Agent-analyzed features: Evaluated by our multi-agent system. For example, Rebuttal Success Rate is assessed by the Rebuttal Analyzer agent, which analyzes rebuttal content and reviewer responses, then maps the outcome to a quantitative score for correlation analysis.
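The correlations above are simple feature-vs-outcome associations. A minimal sketch of how such a coefficient can be computed is shown below on toy values; the choice of estimator (point-biserial/Pearson) and the toy numbers are assumptions, not the paper's exact analysis code.

```python
# Point-biserial (Pearson) correlation between a feature and the binary
# accept decision, on toy values. The estimator choice is an assumption.
import numpy as np
from scipy.stats import pearsonr

accepted = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # 1 = accepted
avg_score = np.array([6.5, 4.0, 6.0, 7.3, 5.0, 3.8, 6.8, 4.5])   # Average Reviewer Score
r, p_value = pearsonr(avg_score, accepted)
print(f"Average Reviewer Score: r = {r:+.3f} (p = {p_value:.3f})")
```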
Papers assigned to expert reviewers are accepted less often (41.2% vs 58.8%). Experts are stricter: even if other reviewers like your paper, a single low score (≤4) from an expert can kill it. Their opinion often outweighs that of the non-experts.
A "Strong" rebuttal boosts acceptance to 66.1%, while a bad one drops it to 5.5%. How you reply is key: don't admit unfixable bugs or get angry. Authors who frame errors as "limitations for future work" succeed much more often.
"Groundbreaking" papers have a high (~70%) chance of acceptance. On the other hand, "Incremental" work—just combining existing methods with small changes—is rejected nearly 90% of the time. The community strongly dislikes trivial updates.
Surprisingly, having "shallow" (less detailed) reviewers actually helps your paper. They tend to give higher scores, raising your average. And even when they give a low score, Area Chairs often discount it because the review lacks detailed evidence.
57.1% of papers have unresolved baseline errors, and they still get 37.6% acceptance. But papers with no serious flaws achieve 71.5% acceptance and take 61.8% of oral slots. Fixing your weaknesses matters as much as showing your strengths.
Converting at least one reviewer from a low to a high score increases acceptance from 32.6% to 52.5%. Successfully turning a skeptic into a supporter is a strong signal for acceptance.
§ Agent-Identified Categories: Error types and rebuttal quality (e.g., "Strong") are classified by our agent system based on the analysis of review and rebuttal texts.
¶ "High Risk" Flag: Indicates papers containing credible technical criticisms identified by the agent, such as definition errors or data leakage. Critically, 43.1% of these high-score rejected cases were correctly flagged as High Risk by our system despite their positive numerical scores.
All data, code, and predictions are available.
Complete accept IDs for ICLR 2026.
Complete accept list with decisions (Oral/Spotlight/Poster/Reject) and detailed reasoning for ICLR 2026.
Full benchmark with ICLR 2023-2025 data, reviews, rebuttals, and decisions.
PaperDecision-Agent implementation and evaluation scripts.