Title: CHUNKED BUDGET ALLOCATION PREVENTS NON-MONOTONIC REGRESSIONS IN WORLD-MODEL VERIFICATION FARS
PDF: f77159c8-1702-416f-af6c-f8c1368d5cc6.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.8
Elapsed: 206.3s

Strengths:
1. The paper identifies a real and counterintuitive failure mode — trajectory drift — in sequential verify-and-retry for irreversible actions, and provides concrete evidence: Table 2 shows 60.5% (49/81) of differential failures between chunked and sequential conditions are attributable to trajectory drift, where rejected checkouts lead agents to re-browse and select worse products. This is a legitimate mechanistic insight.
2. The experimental design includes a clean 3-condition comparison (A: sequential K=1, B: chunked with consensus, C: chunked without aggregation) that isolates the structural effect (fewer cycles) from the aggregation effect. The finding that Condition C (no-aggregation, 21.50%) outperforms Condition B (consensus, 14.83%) is genuinely surprising and informative — it demonstrates that the benefit comes from cycle reduction, not from aggregation sophistication (Table 1, Section 4.2).
3. The temperature ablation (Table 3) is well-designed: showing that the chunked advantage persists and even slightly increases under deterministic decoding (temp=0: +9.67 pp vs temp=0.7: +7.67 pp) cleanly eliminates the hypothesis that stochastic diversity across parallel rollouts drives the benefit. This is a proper confound-removal experiment (Section 4.5).

Weaknesses:
1. The 25× improvement claim (21.50% vs 0.86%) is inflated by a near-constant-reject world model acting as a strawman baseline. Section 4.6 reveals the world model predicts p̂=0.0 for 98.4% of cycles (306/311), with actual success rate of 27.5% when p̂=0.0 — meaning the world model has an extreme false-negative bias. In Condition A (sequential, K=1), this single near-certain rejection blocks nearly every checkout on the first cycle, guaranteeing near-zero success. The comparison is thus between 'a system that almost never allows checkout' vs 'a system that always allows checkout after one cycle,' which is an artifact of the world model's miscalibration rather than a fundamental insight about budget allocation. With a moderately calibrated world model, the sequential vs. chunked gap would shrink dramatically.
2. The core finding — that fewer sequential retries reduce compounding errors — is intuitive and thin for a top-venue contribution. The prescription 'set M=1' is equivalent to 'do not retry after verification failure,' which reduces to a binary accept/reject decision with no retry. This is a trivially obvious structural point: if retrying hurts, don't retry. The paper's contribution is the empirical documentation of *why* retrying hurts (trajectory drift), but the solution itself requires no algorithmic innovation. The non-monotonic intermediate results (M=2 at 8.5% < M=5 at 14.0%) in Figure 2 are explained away as a threshold interaction artifact (Section 4.3) rather than yielding deeper insight.
3. Evaluation is narrow: single environment (WebShop), single world model (fine-tuned Qwen2.5-7B from Li et al. 2025), single acting agent (Gemini-2.5-Flash), single budget (B=10), and modest absolute success rate (21.50% best). The paper itself acknowledges this limitation (Section 5), but the narrow scope means we cannot assess whether the findings generalize beyond this specific world-model–agent pairing. The near-constant rejection (p̂=0.0 in 98.4% of cycles, Section 4.6) may be a pathology of this particular fine-tuned model rather than a general property of world models for verification. Additionally, 200 tasks × 3 seeds = 600 episodes per condition is a modest sample, and no significance tests beyond bootstrap CIs are reported: no paired t-tests and no multiple-comparison corrections for the M-sweep in Figure 2.
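
The headline figures quoted in the weaknesses above can be re-derived from the raw counts the paper reports. The snippet below is only an arithmetic sanity check on numbers already cited in this review; the helper name `pct` and the variable names are illustrative and do not come from the paper.

```python
def pct(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100.0 * numerator / denominator, digits)

# Share of cycles where the world model predicts p_hat = 0.0 (Section 4.6).
zero_pred_share = pct(306, 311)          # 98.4

# Share of differential failures attributed to trajectory drift (Table 2).
drift_share = pct(49, 81)                # 60.5

# Ratio behind the "25x" claim: Condition C vs Condition A success rates.
improvement_ratio = 21.50 / 0.86         # approximately 25

# Episodes per condition: 200 tasks x 3 seeds.
episodes_per_condition = 200 * 3         # 600

print(zero_pred_share, drift_share, round(improvement_ratio, 1), episodes_per_condition)
```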

Must Fix Items:
1. Replace or supplement the 25× claim with comparisons against better-calibrated world models or against a sequential baseline that uses a corrected threshold (e.g., τ recalibrated to compensate for the world model's false-negative bias). The current comparison conflates 'chunked allocation is better' with 'this particular world model is catastrophically miscalibrated.'
2. Report formal statistical significance tests (e.g., McNemar's test for paired binary outcomes) rather than relying solely on non-overlapping CIs as evidence. The bootstrap CI for C−A of [+17.50, +24.00] pp is convincing for the main comparison, but the intermediate M conditions (M=2 vs M=5) in Figure 2 lack formal testing.
3. Evaluate on at least one additional environment or world model to assess generalizability. If the world model's near-constant rejection (p̂=0.0 in 98.4% of cycles) is the key driver, the paper should demonstrate that the findings hold with a less pathological world model (or explicitly frame the contribution as a negative result about poorly-calibrated world models).
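
To make Must Fix item 2 concrete, the following is a minimal sketch of an exact two-sided McNemar test on paired per-episode binary outcomes, with a Bonferroni adjustment for the several pairwise comparisons in the M-sweep. It assumes each episode (task, seed) is run under both conditions and recorded as success or failure; the function names and the sample outcomes are hypothetical and are not taken from the paper.

```python
from math import comb

def discordant_counts(outcomes_x, outcomes_y):
    """Count discordant pairs: b = x succeeds and y fails, c = y succeeds and x fails."""
    b = sum(1 for x, y in zip(outcomes_x, outcomes_y) if x and not y)
    c = sum(1 for x, y in zip(outcomes_x, outcomes_y) if y and not x)
    return b, c

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (binomial test on discordant pairs, p = 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def bonferroni(p_value: float, num_comparisons: int) -> float:
    """Bonferroni-adjusted p-value for `num_comparisons` tests."""
    return min(1.0, p_value * num_comparisons)

# Hypothetical paired outcomes for the same eight episodes under two settings of M.
m2 = [0, 0, 1, 0, 0, 0, 1, 0]
m5 = [1, 0, 1, 1, 0, 1, 1, 0]
b, c = discordant_counts(m2, m5)
p = mcnemar_exact(b, c)
print(b, c, p, bonferroni(p, num_comparisons=3))
```

The exact form is used here because the number of discordant pairs between adjacent M settings may be small, where the chi-squared approximation is less reliable.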

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.8 error=None