Title: THE REPETITION ADVANTAGE IN LONG-COT SFT IS A TERMINATION EFFECT FARS Analemma
PDF: termination-aware-sft-repetition-advantage.pdf
Score: 4.0
Verdict: Reject
Confidence: 0.60
Elapsed: 46.3s

Strengths:
1. Clean and insightful diagnostic framework: The ParseRate × Acc|Parse decomposition is a simple yet powerful lens for disentangling termination artifacts from genuine reasoning improvements. The finding that Acc|Parse reverses (A beats B by 6.7pp conditional on parsing) while unconditional accuracy favors B by 17.6pp is striking and well-demonstrated (Table 1).
2. Mediation analysis provides formal quantification: Going beyond qualitative observation, the mediation fraction analysis (M > 1.0 on AIME benchmarks, Table 2) rigorously shows that parseability over-explains the accuracy gap, meaning data scaling actually produces better reasoning when both models terminate. This is a meaningful analytical contribution.
3. Well-designed ablation study (Table 3): The ablation of TA-SFT components (EOS-only, think-only, no-cap) reveals that both token types are synergistically needed and that uncapped reweighting hurts performance. The finding that C-think-only achieves the highest Acc|Parse (58.2%) even above baseline (57.7%) is an interesting secondary observation.
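
The decomposition praised in item 1 can be sketched as below. This is a minimal illustration, not the paper's code: the counts are hypothetical, unparsed outputs are assumed to count as incorrect, and the `mediation_fraction` definition (one minus the ratio of the conditional gap to the total gap) is an assumption chosen only because it yields M > 1 when the conditional gap reverses sign. The paper's Table 2 may use a different estimator.

```python
from dataclasses import dataclass


@dataclass
class Decomposition:
    parse_rate: float       # fraction of outputs with a parseable final answer
    acc_given_parse: float  # accuracy among parseable outputs only
    accuracy: float         # unconditional accuracy (unparsed counted as wrong)


def decompose(n_total: int, n_parsed: int, n_correct: int) -> Decomposition:
    """Split accuracy into ParseRate x Acc|Parse from raw counts."""
    if n_total <= 0 or not (0 <= n_correct <= n_parsed <= n_total):
        raise ValueError("expected 0 <= n_correct <= n_parsed <= n_total, n_total > 0")
    parse_rate = n_parsed / n_total
    acc_given_parse = n_correct / n_parsed if n_parsed else 0.0
    return Decomposition(parse_rate, acc_given_parse, n_correct / n_total)


def mediation_fraction(total_gap: float, conditional_gap: float) -> float:
    """ASSUMED definition: share of the total gap not explained by Acc|Parse."""
    if total_gap == 0:
        raise ValueError("total_gap must be non-zero")
    return 1.0 - conditional_gap / total_gap


# Hypothetical counts, not taken from the paper.
a = decompose(n_total=1000, n_parsed=440, n_correct=252)
b = decompose(n_total=1000, n_parsed=840, n_correct=428)

total_gap = b.accuracy - a.accuracy                      # B ahead unconditionally
conditional_gap = b.acc_given_parse - a.acc_given_parse  # negative: A ahead once parsed
m = mediation_fraction(total_gap, conditional_gap)
print(f"total gap={total_gap:.3f} conditional gap={conditional_gap:.3f} M={m:.2f}")
```

With these made-up counts B wins on unconditional accuracy purely through a higher parse rate, so the assumed M exceeds 1, which is the pattern the review describes for AIME.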

Weaknesses:
1. TA-SFT is a weak intervention that recovers only 14% of the repetition advantage: The proposed method (Condition C) improves aggregate accuracy by only 2.0pp (25.2→27.2) versus B's 42.8%. The paper itself acknowledges this, but it raises the question of whether the 'insight' from the diagnostic framework translates into any actionable improvement. A 2.0pp gain on a single model/benchmark setup is marginal and may not replicate (Table 1).
2. Limited evaluation scope and statistical concerns: Only one base model (OLMo-3-7B) is used. AIME has only 30 problems per year, and with n=16 samples, variance is high. No error bars, confidence intervals, or significance tests are reported anywhere in the paper. For a paper whose central claim is about decomposing and reversing accuracy gaps of a few percentage points, the absence of any statistical quantification is a serious concern (Tables 1-3).
3. Benchmark-dependent findings undermine generality: On GPQA, the mediation fraction is only 0.69 (Table 2), meaning parseability does NOT fully explain the gap, and the conditional gap remains positive (+6.3pp for B over A). The paper acknowledges this but does not investigate why the 'termination effect' story holds for AIME but not GPQA. This inconsistency weakens the claim that the repetition advantage is 'primarily a termination effect' as a general statement (Section 4.3).
4. The paper was generated by an automated research system (as stated in the abstract): This raises concerns about depth of analysis, novelty of framing, and whether the experimental design was optimized for producing publishable-looking results rather than genuine understanding. The narrow scope (1 model, 3 benchmarks, no significance tests) is consistent with automated experimentation patterns.
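
To make the variance concern in item 2 concrete, the following back-of-envelope sketch estimates the standard error of an accuracy figure on a 30-problem benchmark. It assumes n=16 means 16 samples per problem (480 generations in total) and uses an illustrative accuracy of 0.25; the intra-problem correlation `icc` is a hypothetical parameter, not a quantity reported in the paper.

```python
import math


def binomial_se(p: float, n: float) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def design_effect(samples_per_problem: int, icc: float) -> float:
    """Kish design effect for clustered samples: 1 + (m - 1) * icc."""
    return 1.0 + (samples_per_problem - 1) * icc


n_problems, samples_per_problem, p = 30, 16, 0.25  # p is illustrative only
n_total = n_problems * samples_per_problem

# Best case: all 480 generations are independent.
se_independent = binomial_se(p, n_total)

# Worst case: samples from the same problem are perfectly correlated (icc = 1),
# so the effective sample size collapses to the number of problems.
n_effective = n_total / design_effect(samples_per_problem, icc=1.0)
se_clustered = binomial_se(p, n_effective)

print(f"SE if independent: {100 * se_independent:.1f}pp")
print(f"SE if fully clustered: {100 * se_clustered:.1f}pp")
```

Under these assumptions the standard error lies between roughly 2pp and 8pp, so gaps of 2.0pp or 6.7pp could plausibly sit within noise unless the paper reports intervals.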

Must Fix Items:
1. Add statistical significance tests or confidence intervals for all accuracy comparisons, especially the conditional gap reversals on AIME that are central to the paper's claim.
2. Test on at least one additional base model to demonstrate that the termination effect finding is not specific to OLMo-3-7B.
3. Explain the GPQA anomaly (M=0.69, conditional gap still positive for B). This result directly contradicts the paper's general claim and needs analysis.
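
One way to address item 1 is a paired bootstrap over problems, sketched below. This is a suggestion rather than a procedure the paper describes: the function name, the per-problem accuracy lists, and the percentile-interval choice are all assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(acc_a, acc_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(acc_b - acc_a), resampling problems with replacement.

    acc_a and acc_b hold per-problem accuracies (e.g. the fraction of the
    samples for that problem that were correct), aligned by problem index.
    """
    if len(acc_a) != len(acc_b) or not acc_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(acc_a)
    diffs = [b - a for a, b in zip(acc_a, acc_b)]
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Hypothetical per-problem accuracies for a 30-problem benchmark.
demo_rng = random.Random(1)
acc_a = [demo_rng.choice([0.0, 0.25, 0.5]) for _ in range(30)]
acc_b = [demo_rng.choice([0.25, 0.5, 0.75]) for _ in range(30)]

point, lo, hi = paired_bootstrap_ci(acc_a, acc_b)
print(f"gap={point:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

For the conditional comparison, the same resampling would need to recompute Acc|Parse within each resample from the parsed and correct counts, since the set of parsed outputs differs between models.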

Runs:
- run=1 score=4 verdict=Reject confidence=0.6 error=None