ICLR 2026 Acceptance Prediction: Benchmarking Decision Process with A Multi-Agent System

Yi-Fan Zhang*, Yuhao Dong*, Saining Zhang*, Kai Wu, Liang Wang,
Caifeng Shan, Ziwei Liu, Ran He, Hao Zhao, Chaoyou Fu#

NJU  •  CASIA  •  NTU  •  THU

* Equal Contribution    # Project Leader

Contact: yifanzhang.cs@gmail.com, bradyfu24@gmail.com

A multi-agent system and benchmark for end-to-end peer review modeling, enabling real-world evaluation and predictive insights for future conference decisions.

🔮 ICLR 2026 Prediction Lookup

Enter your OpenReview submission ID or paper title to check predictions from our multi-agent system.


Want the complete dataset? Download all ~13,000 predictions with detailed reasoning.

📥 Download Full Accept List & Reasoning

Why and How We Build PaperDecision

Peer review is fundamental to academic research but remains challenging to model due to its subjectivity, dynamics, and multi-stage complexity. Previous efforts leveraging large language models (LLMs) have primarily explored isolated sub-tasks, such as review generation or score prediction, failing to capture the entire evaluation workflow.

To this end, we introduce PaperDecision to model the peer review process end-to-end. Central to our approach is PaperDecision-Bench, a large-scale multimodal benchmark that links OpenReview papers, reviews, rebuttals, and final decisions across multiple conference cycles. By continuously incorporating newly released conference rounds, the benchmark remains forward-looking and helps avoid risks of data leakage in evaluation.

Building on this benchmark, we develop PaperDecision-Agent, a multi-agent system that simulates the roles and interactions of authors, reviewers, and area chairs. Empirically, frontier multimodal LLMs achieve up to ~82% accuracy in accept-reject prediction. We further provide in-depth analysis of the decision-making process, identifying several key factors associated with acceptance outcomes, such as reviewer expertise and score change. Overall, PaperDecision establishes the first dynamic and extensible benchmark for automated peer review, laying the groundwork for more accurate, transparent, and scalable AI-assisted paper decision systems.

Factors influencing paper acceptance, visualized as a race toward the 'ACCEPTANCE' gate. Arrow direction and metaphorical elements reflect the sign and relative strength of correlations between review process features and final acceptance decisions.

About PaperDecision

A structured framework for comprehensive peer review modeling.

📊

PaperDecision Benchmark

Large-scale, multimodal benchmark integrating research papers and reviews across multiple stages, with a three-tier design for real-world prediction.

🤖

PaperDecision-Agent

Multi-agent system simulating interactions among authors, reviewers, and area chairs with specialized roles for each stage.

🎯

Reliable Prediction

82.44% accuracy on ICLR 2025 for accept/reject prediction. Stable performance across years (2023-2025) demonstrates robust generalization.

Multi-Agent System

Our system models key roles in the real-world peer review process through specialized agents.

Multi-Agent Framework Workflow

Overview of the PaperDecision-Agent workflow, simulating interactions among authors, reviewers, and ACs.

📝 Paper Reviewer

Initial evaluation of novelty, methodology, and quality

📋 Review Summarizer

Aggregates feedback & assesses reliability

💬 Rebuttal Analyzer

Evaluates author responses & attitude shifts

⚖️ Decision Agent

Final accept/reject decision with reasoning
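
To make the workflow concrete, here is a minimal sketch of how the four roles can be chained around a single backbone model. All names here (`call_llm`, `predict_decision`, and the prompt templates) are illustrative assumptions, not the released PaperDecision-Agent API.

```python
# Minimal sketch of the four-stage workflow; names and prompts are
# illustrative assumptions, not the released PaperDecision-Agent code.

def call_llm(prompt: str) -> str:
    """Placeholder for one call to the chosen backbone multimodal LLM."""
    raise NotImplementedError


def predict_decision(paper_text: str, reviews: list[str], rebuttals: list[str]) -> str:
    # Paper Reviewer: initial evaluation of novelty, methodology, and quality.
    initial_assessment = call_llm(
        f"Assess the novelty, methodology, and quality of this paper:\n{paper_text}"
    )

    # Review Summarizer: aggregate reviewer feedback and assess its reliability.
    review_summary = call_llm(
        "Summarize these reviews and rate each reviewer's reliability:\n" + "\n\n".join(reviews)
    )

    # Rebuttal Analyzer: evaluate author responses and reviewer attitude shifts.
    rebuttal_report = call_llm(
        "Judge how well the rebuttal addresses the concerns and whether any reviewer shifted:\n"
        + "\n\n".join(rebuttals)
    )

    # Decision Agent: final accept/reject (and tier) with written reasoning.
    return call_llm(
        "Given the material below, output Accept or Reject, a tier, and the reasoning.\n"
        f"{initial_assessment}\n{review_summary}\n{rebuttal_report}"
    )
```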

PaperDecision Benchmark Structure

The PaperDecision-Bench structure comprising Future Prediction (B1), Retrospective (B2), and Mini-Benchmark (B3).

B1: Future Prediction

Targets ICLR 2026 decision prediction: models observe papers and reviews while the final outcomes remain hidden, a gold-standard test of cross-temporal generalization.

B2: Retrospective

Complete ICLR 2023–2025 data for solid retrospective evaluation, enabling robust model comparison and systematic error analysis.

B3: Mini-Benchmark

Cost-efficient benchmark focusing on MLLM, 3D, and RL papers with ambiguous decision boundaries for rapid iteration.
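
To make the three-tier organization concrete, a single benchmark entry can be pictured as the hypothetical record below, tying a paper's content, reviews, rebuttals, and (possibly hidden) decision to its split. The field names are our own assumptions, not the released data schema.

```python
# Hypothetical shape of one PaperDecision-Bench entry; field names are
# illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PaperRecord:
    openreview_id: str                                     # OpenReview forum ID
    year: int                                              # conference cycle, e.g. 2023-2026
    paper_content: str                                     # path/URL to the multimodal paper
    reviews: list[dict] = field(default_factory=list)      # official reviews with scores
    rebuttals: list[dict] = field(default_factory=list)    # author responses per reviewer
    decision: Optional[str] = None                         # Oral/Spotlight/Poster/Reject; None when hidden (B1)
    split: str = "B2"                                      # "B1" future, "B2" retrospective, "B3" mini-benchmark
```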

Experimental Results

Comprehensive evaluation of frontier multimodal LLMs on ICLR paper acceptance prediction.

📐 Evaluation Metrics

We report performance on the ICLR 2025 benchmark, using ICLR 2024/2023 as historical reference. We evaluate both accept/reject decisions and the predicted tier (Oral/Spotlight/Poster).

Precision Metrics
  • Accept Prec — Of all predicted accepts, how many are truly accepted
  • Reject Prec — Of all predicted rejects, how many are truly rejected
Overall & Tier Metrics
  • Acc — Binary accuracy (Accept vs Reject): the total number of correct predictions (true positives and true negatives) divided by the total number of predictions.
  • Oral/Spotlight/Poster F1 — F1 score (balances precision and recall) for each tier
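
For concreteness, these metrics can be computed as in the sketch below, a generic scikit-learn implementation that assumes string labels such as "Accept"/"Reject" and tier names; it is not our official evaluation script.

```python
# Generic metric computation matching the definitions above (assumes string
# labels; not the official PaperDecision evaluation script).
from sklearn.metrics import accuracy_score, f1_score, precision_score


def binary_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    return {
        "Accept Prec": precision_score(y_true, y_pred, pos_label="Accept"),
        "Reject Prec": precision_score(y_true, y_pred, pos_label="Reject"),
        "Acc": accuracy_score(y_true, y_pred),
    }


def tier_f1(true_tiers: list[str], pred_tiers: list[str], tier: str) -> float:
    # F1 for a single tier (e.g. "Oral"), treating that tier as the positive class.
    return f1_score(true_tiers, pred_tiers, labels=[tier], average="macro")
```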

Tables sorted by Accept Precision (descending).

🤖

Multi-Agent System Performance

Each row represents a complete multi-agent system powered by the specified backbone model. The backbone handles all agent roles (Reviewer, Summarizer, Rebuttal Analyzer, Decision Maker).

💡 Why Multi-Agent?

Our experiments show that single-agent prediction (direct paper → decision) achieves much lower accuracy than our multi-agent pipeline. The structured workflow with specialized roles significantly improves prediction quality.

ICLR 2025

| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 85.55% | 79.95% | 81.92% | 0.00% | 9.33% | 66.38% |
| MiniMax M2.1 | 80.30% | 73.01% | 75.20% | 13.60% | 29.81% | 54.13% |
| GPT-4.1 | 79.94% | 81.08% | 80.63% | 38.64% | 32.52% | 47.09% |
| Qwen3-Max | 72.50% | 75.56% | 74.38% | 22.88% | 15.15% | 36.39% |
| Gemini 3 Pro | 66.04% | 93.77% | 76.72% | 55.50% | 40.11% | 62.65% |
| Grok-4 | 64.69% | 84.19% | 73.33% | 26.29% | 51.84% | 63.39% |
| GLM-4.6v | 63.00% | 81.39% | 71.28% | 0.00% | 18.37% | 62.26% |
| Gemini 2.5 Pro | 61.44% | 85.26% | 70.83% | 35.42% | 15.03% | 31.03% |
| MiMo-VL-7B (RL) | 59.62% | 81.35% | 68.49% | 16.25% | 17.85% | 57.40% |
| Doubao 1.6 | 58.05% | 89.77% | 67.85% | 24.74% | 22.71% | 41.83% |
| Kimi-VL | 44.29% | 92.49% | 45.93% | 0.92% | 20.06% | 50.57% |
🤖 ICLR 2024: results from our multi-agent pipeline using each backbone model.

| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 85.50% | 82.84% | 83.70% | 0.00% | 14.78% | 63.96% |
| Gemini 3 Pro | 63.90% | 94.54% | 76.91% | 33.52% | 39.41% | 58.96% |
| Doubao 1.6 | 54.67% | 90.36% | 66.73% | 18.05% | 29.07% | 39.36% |

Results on ICLR 2024 show patterns consistent with 2025, with GPT-5.2 achieving the highest accuracy and accept precision.

🤖 ICLR 2023: results from our multi-agent pipeline using each backbone model.

| Backbone Model | Accept Prec | Reject Prec | Acc | Oral F1 | Spotlight F1 | Poster F1 |
|---|---|---|---|---|---|---|
| GPT-5.2 | 98.31% | 72.01% | 77.03% | 0.00% | 8.33% | 44.78% |
| Gemini 3 Pro | 91.20% | 86.16% | 87.95% | 36.46% | 39.23% | 62.62% |
| Doubao 1.6 | 77.03% | 69.37% | 71.07% | 10.81% | 30.71% | 35.01% |

ICLR 2023 results demonstrate strong cross-year generalization, with Gemini 3 Pro achieving 87.95% accuracy.

Error Analysis

Understanding model behaviors and limitations on ICLR 2025.

GPT Series vs. Gemini Series

  • GPT Series (5.2 & 4.1): The "High Precision" choice. GPT-5.2 achieves the highest accuracy (~82%) together with very high Accept Precision (~85%): if GPT says "Accept", the paper is likely solid. GPT-4.1 is slightly less accurate.
  • Gemini Series (3 Pro & 2.5 Pro): The "High Recall" choice. Gemini 3 Pro has an impressive 93.77% Reject Precision, meaning it rarely wrongly rejects a good paper. However, both Gemini models are "noisier" and tend to accept more borderline papers, with lower Accept Precision (around 60-66%) than GPT.

Challenge: Predicting Top-Tier Papers

  • Hard to Distinguish "Great" from "Good": Predicting exact tiers (Oral/Spotlight) is much harder than simple Accept/Reject. Most models have low F1 scores here.
  • The "Too Safe" Problem: Surprisingly, the most accurate model, GPT-5.2, has a 0% F1 score on Orals. It plays it too safe, categorizing almost all accepted papers as "Posters" and failing to recognize excellent work.
  • Current Limit: Only Gemini 3 Pro shows a decent ability to spot Orals (55% F1), making it the strongest current option for identifying the very top papers.

Understanding How ICLR Makes Decisions

The key factors that influence paper decisions.

What Correlates with Acceptance?

| Feature | Correlation | Quick Take |
|---|---|---|
| Average Reviewer Score | +0.705 | The single biggest factor. |
| Minimum Reviewer Score | +0.617 | A low floor hurts badly. |
| Maximum Reviewer Score | +0.542 | High scores help, but less than the average. |
| Rebuttal Success Rate * | +0.530 | Rebuttals matter nearly as much as scores. |
| Paper Novelty * | +0.270 | Being unique gives a moderate boost. |
| Converted Reviewers (Low→High) * | +0.213 | Changing minds is a good sign. |
| Paper Visual Quality * | +0.200 | Good formatting helps a little. |
| Number of Shallow Reviewers * | +0.089 | Shallow reviewers can actually help papers get accepted. |
| Score Variance | -0.106 | Disagreement usually hurts. |
| Score Range | -0.119 | Wide gaps in scores are bad. |
| Number of Expert Reviewers * | -0.132 | Experts are harder to please. |
| Explicitly Stubborn Reviewers * | -0.323 | Stubborn reviewers are deal-breakers. |

Score-based features: calculated directly from official ICLR reviewer scores (e.g., average, min, max, variance, range).

* Agent-analyzed features: evaluated by our multi-agent system. For example, Rebuttal Success Rate is assessed by the Rebuttal Analyzer agent, which analyzes rebuttal content and reviewer responses, then maps the outcome to a quantitative score for correlation analysis.
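
For readers who want to run this style of analysis on their own exports, the sketch below computes feature-versus-acceptance correlations as point-biserial correlations (Pearson against a 0/1 accept label). The file and column names are hypothetical, and the estimator behind the table above may differ in detail.

```python
# Sketch of the feature-vs-acceptance correlation analysis; the CSV path and
# column names are hypothetical assumptions.
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.read_csv("paperdecision_features.csv")          # one row per paper
accepted = (df["decision"] != "Reject").astype(int)     # binary acceptance label

for feature in ["avg_score", "min_score", "rebuttal_success_rate", "num_expert_reviewers"]:
    r, p = pointbiserialr(accepted, df[feature])
    print(f"{feature:>25s}: r = {r:+.3f} (p = {p:.3g})")
```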

Key Insights

01

Experts are Harder to Please

Papers assigned to expert reviewers are accepted less often (41.2% vs 58.8%). Experts are stricter: even if other reviewers like your paper, a single low score (≤4) from an expert can kill it. Their opinion often outweighs that of the non-experts.

02

Rebuttal Matters: Don't Be Defensive

A "Strong" rebuttal boosts acceptance to 66.1%, while a bad one drops it to 5.5%. How you reply is key: don't admit unfixable bugs or get angry. Authors who frame errors as "limitations for future work" succeed much more often.

03

Novelty as Primary Discriminator

"Groundbreaking" papers have a high (~70%) chance of acceptance. On the other hand, "Incremental" work—just combining existing methods with small changes—is rejected nearly 90% of the time. The community strongly dislikes trivial updates.

04

The "Shallow Reviewer" Advantage

Surprisingly, having "shallow" (less detailed) reviewers actually helps your paper. They tend to give higher scores and raise your average. On the other hand, even when they give a low score, Area Chairs often discount it because the review lacks detailed evidence.

05

Avoid Big Mistakes

57.1% of papers have unresolved baseline errors, yet these still reach a 37.6% acceptance rate. Papers with no serious flaws, by contrast, achieve 71.5% acceptance and take 61.8% of oral slots. Fixing your weaknesses matters as much as showing your strengths.

06

Reviewer Conversion Efficacy

Converting at least one reviewer from a low to a high score raises the acceptance rate from 32.6% to 52.5%. Successfully "converting" a skeptic into a supporter is one of the strongest signals for acceptance.

High-Score Rejections (Avg ≥ 6.5)

  • 72 papers rejected despite high scores
  • 51.4%: Unresolved baseline errors §
  • 11.1%: Unresolved math errors §
  • 43.1%: Flagged as "High Risk"

Low-Score Acceptances (Avg < 5.5)

  • 165 papers accepted despite low scores
  • 57.0%: Had "Strong" rebuttals §
  • Successfully converted skeptical reviewers
  • Credibly addressed technical concerns

§ Agent-Identified Categories: Error types and rebuttal quality (e.g., "Strong") are classified by our agent system based on the analysis of review and rebuttal texts.

"High Risk" Flag: Indicates papers containing credible technical criticisms identified by the agent, such as definition errors or data leakage. Critically, 43.1% of these high-score rejected cases were correctly flagged as High Risk by our system despite their positive numerical scores.

Open Source Resources

All data, code, and predictions are available.

🎯

ICLR 2026 Accept IDs

Complete accept IDs for ICLR 2026.

🎯

ICLR 2026 Decision/Reasoning

Complete prediction list with decisions (Oral/Spotlight/Poster/Reject) and detailed reasoning for ICLR 2026.

📊

PaperDecision Benchmark

Full benchmark with ICLR 2023-2025 data, reviews, rebuttals, and decisions.

💻

Source Code

PaperDecision-Agent implementation and evaluation scripts.