Title: COMPUTE-MATCHED REPETITION ADVANTAGE LONG-COT SUPERVISED FINE-TUNING
PDF: compute-matched-repetition-advantage.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.60
Elapsed: 52.0s

Strengths:
1. Token-budget matching is a clean and well-motivated methodological contribution. The paper correctly identifies that step-matched comparisons in long-CoT SFT conflate data repetition effects with compute differences due to variable response lengths. The proposed early-stopping mechanism achieves tight compute control (0.50% token deviation, 0.56% FLOPs deviation), which is a meaningful improvement over the 2.85% token surplus in step-matched conditions. (Section 3.2, Table 3)
2. The mechanism analysis decomposing accuracy into termination rate and conditional accuracy given termination is insightful and well-executed. The finding that Condition C's advantage operates entirely through improved termination (87-91% vs. 29-48%) while conditional accuracy is actually *lower* is non-obvious and provides genuine mechanistic understanding rather than just reporting performance numbers. (Table 4, Section 4.4)
3. The decision rule with a pre-specified threshold (Δtok/Δstep ≥ 0.8) provides a clear, falsifiable criterion for evaluating the compute confound hypothesis. This is good scientific practice compared to post-hoc interpretation. The aggregate ratio of 6.33 far exceeds the threshold, making the conclusion robust to reasonable threshold variations. (Section 3.4, Table 2)
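
The decomposition praised in Strength 2 can be illustrated with a minimal sketch. It assumes that responses which fail to terminate are scored as incorrect, so overall accuracy equals termination rate times conditional accuracy. The function name and all counts below are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def decompose_accuracy(n_total, n_terminated, n_correct):
    """Split overall accuracy into termination rate and conditional accuracy.

    Assumption (not from the paper): a response that does not terminate
    is counted as incorrect, so n_correct <= n_terminated <= n_total.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not (0 <= n_correct <= n_terminated <= n_total):
        raise ValueError("expected 0 <= n_correct <= n_terminated <= n_total")
    termination_rate = n_terminated / n_total
    # Conditional accuracy is undefined when nothing terminates; report 0.0.
    conditional_accuracy = n_correct / n_terminated if n_terminated else 0.0
    overall_accuracy = n_correct / n_total
    return termination_rate, conditional_accuracy, overall_accuracy


# Hypothetical counts: the second condition terminates far more often
# but is less accurate once it does terminate.
low_termination = decompose_accuracy(n_total=100, n_terminated=40, n_correct=20)
high_termination = decompose_accuracy(n_total=100, n_terminated=90, n_correct=36)

for name, (term, cond, overall) in [
    ("low termination", low_termination),
    ("high termination", high_termination),
]:
    print(f"{name}: termination={term:.2f} conditional={cond:.2f} overall={overall:.2f}")
```

In this toy case the higher-termination condition has lower conditional accuracy yet higher overall accuracy, which is the pattern the review describes for Condition C.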

Weaknesses:
1. Extremely narrow experimental scope: only one model (Qwen2.5-1.5B-Instruct), one dataset (NuminaMath-CoT), and three benchmarks (AIME'24, AIME'25, GPQA Diamond). The 1.5B model is very small by current standards, and it is unclear whether these findings generalize to larger models (7B, 14B, 72B) or to non-mathematical reasoning domains. The paper acknowledges this in the conclusion but does nothing to address it. This significantly limits the contribution's impact and generality. (Section 3.1, Section 5)
2. The '6.33× amplification' claim is misleading because a ratio becomes unstable when its denominator is small. The step-matched advantage (Δstep) on aggregate Pass@k is only 4.78 percentage points (from 36.05% to 40.83%), and a denominator this small inflates the ratio dramatically. The GPQA Pass@4 ratio of 12.00 comes from Δstep = 2.86 and Δtok = 34.34, but this means Conditions A and B are nearly indistinguishable on GPQA, making the ratio meaningless as evidence of 'amplification.' The raw effect sizes (Δtok values) are more informative than the ratios. (Table 2)
3. Near-complete memorization (>99.7% token accuracy for Conditions B and C, Figure 3) raises serious concerns about overfitting and the practical utility of the repetition approach. The training dynamics show that the model essentially memorizes the 1,600 training samples. While termination rate improves, the conditional accuracy *decreases* (Table 4), suggesting the model learns to produce complete but worse answers. This tradeoff is acknowledged but not adequately explored: how does this model perform under distribution shift or on out-of-domain tasks? (Figure 3, Table 4, Appendix A)
4. The constant learning rate (LR) schedule combined with early stopping may introduce a confound: Condition C stops before the final portion of training, where a cosine decay schedule would have lowered the LR, potentially making Condition C's LR effectively higher at 'equivalent' training progress. While the authors justify the constant LR as a way to avoid schedule-length confounds, the choice itself may advantage Condition C by removing the late-training LR decay that could help Condition A generalize better from its more diverse data. (Section 3.2)
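
The instability described in Weakness 2 can be checked with a short sketch. It uses only the GPQA Pass@4 values quoted above (Δtok = 34.34, Δstep = 2.86) and the 0.8 threshold from the decision rule; the ±1 percentage point perturbation of Δstep is an arbitrary illustrative choice, not an uncertainty estimate from the paper, and the function name is hypothetical.

```python
def advantage_ratio(delta_tok, delta_step):
    """Ratio of token-matched to step-matched advantage (both in percentage points)."""
    if delta_step == 0:
        raise ZeroDivisionError("ratio is undefined when delta_step is zero")
    return delta_tok / delta_step


THRESHOLD = 0.8        # pre-specified decision threshold quoted in the review
GPQA_DELTA_TOK = 34.34  # percentage points, as quoted in the review
GPQA_DELTA_STEP = 2.86  # percentage points, as quoted in the review

reported = advantage_ratio(GPQA_DELTA_TOK, GPQA_DELTA_STEP)
print(f"reported ratio: {reported:.2f}")

# Illustrative perturbation (an assumption): shift delta_step by +/- 1 point
# while holding delta_tok fixed, to see how far the ratio moves.
for shift in (-1.0, 1.0):
    perturbed = advantage_ratio(GPQA_DELTA_TOK, GPQA_DELTA_STEP + shift)
    print(f"delta_step {shift:+.1f} pp -> ratio {perturbed:.2f}, "
          f"passes threshold: {perturbed >= THRESHOLD}")
```

A one-point change in the denominator moves the ratio by several units while the decision against the threshold stays the same, which is why the raw Δtok and Δstep values are the more stable quantities to report.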

Must Fix Items:
1. Report raw effect sizes (Δtok, Δstep in percentage points) alongside ratios, and explicitly acknowledge the ratio-of-small-numbers instability, especially for GPQA where Δstep is only 2.86 percentage points.
2. Add at least one experiment with a different model scale (e.g., Qwen2.5-7B) or a different domain to demonstrate generality of the token-budget matching methodology and the repetition advantage finding.
3. Discuss the implications of near-complete memorization (>99.7% token accuracy) for the practical applicability of the repetition approach, and test on at least one out-of-distribution benchmark to assess whether memorization harms generalization.

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.6 error=None