Title: ANYTIME-CBU: ADAPTIVE ROLLOUT ALLOCATION FOR CONSEQUENCE-BASED UTILITY SCORING FARS Analemma
PDF: anytime-cbu-evaluator.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 46.8s

Strengths:
1. Honest reporting of a negative primary result. The paper explicitly acknowledges that Anytime-CBU achieves only 0–2% rollout reduction, far below the ≥50% target, and does not try to spin this as a success. This transparency is commendable and rare in ML venue submissions (Section 4.2, Table 1).
2. Root-cause analysis is well executed. The paper identifies a structural mismatch: flat utility landscapes in RealMath make the LUCB stopping condition unsatisfiable. It supports this with both theoretical justification (confidence radius ≈ 0.125 at Kmax=16, so stopping requires gaps > 0.25) and empirical evidence (48% of targets have all-zero or tied utilities; median gap ≈ 0.0625) (Section 3.4, Section 4.3, Table 2).
3. The secondary finding that adaptive allocation outperforms random allocation at matched cost is a constructive insight. On DeepSeek, Anytime-CBU achieves +3.6pp Acc@1 and +3.2pp AUC over Random-K Matched, suggesting LUCB-guided allocation has value even without early stopping (Section 4.2, Table 1).
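
The arithmetic behind Strength 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper uses Beta-posterior bounds, whereas the radius form c / sqrt(k) with c = 0.5 is an assumption chosen only because it reproduces the reported 0.125 at Kmax=16. The function names are hypothetical.

```python
import math


def radius(k: int, c: float = 0.5) -> float:
    """Hypothetical confidence radius c / sqrt(k).

    Assumption: this form is not taken from the paper; it is used here
    because it yields the reported radius of 0.125 at k = 16.
    """
    return c / math.sqrt(k)


def lucb_can_stop(mean_best: float, mean_runner_up: float, k: int) -> bool:
    """LUCB-style stopping check: the lower bound of the leading candidate
    must exceed the upper bound of the runner-up."""
    r = radius(k)
    return (mean_best - r) > (mean_runner_up + r)


K_MAX = 16
r = radius(K_MAX)
print(f"radius at K_MAX: {r}, required gap: {2 * r}")

# Gap equal to the reported median (0.0625): the intervals still overlap.
print(lucb_can_stop(0.5625, 0.5, K_MAX))

# Gap of 0.3125, above the 0.25 threshold: the check passes.
print(lucb_can_stop(0.8125, 0.5, K_MAX))
```

Under this assumption the required gap is twice the radius, so a median gap of about 0.0625 is a factor of four short of what early stopping needs, even with the full rollout budget spent.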

Weaknesses:
1. The core contribution is a negative result with minimal practical impact. The method fails at its stated goal (≥50% cost reduction), achieving only 0–2% reduction, and the secondary benefit (adaptive > random at matched cost) is both small and not statistically significant. A paper whose primary hypothesis is disconfirmed needs either a very strong secondary finding or a broadly generalizable insight; neither is present here (Table 1, Table 2).
2. Extremely limited experimental scope. Only 79 targets (Qwen) and 28 targets (DeepSeek) are evaluated, both from a single dataset (RealMath). The 28-target DeepSeek condition is far too small to draw reliable conclusions: the 95% CIs on Acc@1 span 0.32–0.68, which is too wide to support any comparison between methods. No evaluation on other datasets or domains is provided (Section 4.1).
3. The BAI reformulation is standard and does not represent a methodological advance. Applying LUCB with Beta-posterior bounds to Bernoulli rewards is a textbook adaptation, not a novel algorithmic contribution. The paper itself acknowledges that the optimization attempts (tighter bounds, arm elimination, stopping margin) are incremental and all fail (Section 4.3). The insight that flat utility landscapes violate BAI assumptions, while valid, is somewhat obvious in hindsight: if candidate utilities are nearly identical, no adaptive sampling method can identify a best arm efficiently.
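
To illustrate the interval width cited in Weakness 2, the sketch below computes a Wilson score interval for a proportion on 28 targets. It rests on two assumptions: the paper's CI method is not stated in this review, and the count of 14 correct out of 28 is a hypothetical midpoint chosen because the reported interval is centered near 0.5.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical count: 14 of 28 targets correct (Acc@1 = 0.5).
lo, hi = wilson_interval(14, 28)
print(f"n=28: [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}")

# Same accuracy on the 79-target Qwen condition, for comparison.
lo79, hi79 = wilson_interval(40, 79)
print(f"n=79: [{lo79:.3f}, {hi79:.3f}], width {hi79 - lo79:.3f}")
```

At n=28 the interval is more than 30 percentage points wide, so differences of a few points between methods fall well inside it.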

Must Fix Items:
1. The secondary comparison (Anytime-CBU vs. Random-K at matched cost) lacks statistical significance. With only 28 DeepSeek targets and overlapping CIs, the claimed +3.6pp Acc@1 improvement cannot be distinguished from noise. The paper should either run on more data to establish significance or substantially soften this claim.
2. The Random-K baseline for Qwen uses 'default parameters with lower cost' and is marked as 'not directly comparable' (Table 1 footnote), yet the paper still discusses it. A proper matched-cost Random-K baseline for Qwen should be included for fair comparison.
3. The paper needs evaluation on at least one additional dataset or problem domain where utility landscapes might be more separable, to test whether the negative result is specific to RealMath or general. Without this, the contribution is overly narrow.
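
As a rough guide to the scale of data Must Fix item 1 implies, the sketch below applies the standard normal-approximation sample-size formula for comparing two independent proportions. It is an approximation under stated assumptions: a baseline Acc@1 of 0.5 is hypothetical, and the paper's comparison is evaluated on the same targets, so a paired test (for example McNemar's) could need fewer. Note also that, if Acc@1 is the fraction of correct targets, one target out of 28 is about 3.6pp, which is the full size of the claimed improvement.

```python
import math
from statistics import NormalDist


def targets_per_arm(p_base: float, delta: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate targets per arm needed to detect an absolute accuracy
    difference `delta` with a two-sided test at level `alpha`.

    Uses the unpooled normal approximation for two independent proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_new = p_base + delta
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Granularity of Acc@1 on 28 targets: one target is about 3.6pp.
print(f"one target out of 28 = {1 / 28:.3f}")

# Hypothetical baseline accuracy of 0.5, claimed gain of 3.6pp.
print(targets_per_arm(p_base=0.50, delta=0.036))
```

The exact figure depends on the baseline accuracy and the test used, but the order of magnitude is far above the 28 targets available.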

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None