Title: HARD EXAMPLES BEAT EASY
PDF: low-nll-coreset-repetition.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.60
Elapsed: 50.5s

Strengths:
1. Clear, focused research question with practical relevance: The paper addresses a concrete and underexplored question (whether easy or hard examples should be selected for repetition-heavy CoT SFT) and delivers an actionable recommendation: select high-NLL examples. The question is directly motivated by the surprising finding of Kopiczko et al. (2026) and fills a gap left by prior data-selection work, which focused on single-epoch regimes (Section 1, Section 2.1).
2. Length-matched selection controls a key confound: The per-decile NLL ranking procedure (Section 3.3) ensures that the Low-NLL and High-NLL subsets have near-identical length distributions (2817 vs 2815 tokens, Section 5), isolating the effect of difficulty from that of response length. This is a methodologically sound design choice that strengthens causal attribution.
3. Optimization attempts test the robustness of the finding: Table 2 reports two attempts to recover Low-NLL performance (hyperparameter tuning and trigram filtering), both of which failed. This strengthens the claim that the limitation is fundamental to the selection strategy rather than an artifact of a specific training configuration. The trigram filtering experiment also helps disentangle the textual-repetition confound from the NLL effect.
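
For reference, the kind of length-matched selection credited in Strength 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual procedure from Section 3.3: the function name `length_matched_split`, the `(length, nll)` input format, and the synthetic pool are all hypothetical.

```python
import random
from statistics import mean


def length_matched_split(examples, n_select, n_bins=10):
    """Pick low- and high-NLL subsets from the same length deciles.

    examples: list of (length_in_tokens, nll) pairs (hypothetical format).
    Returns two lists of indices into `examples`.
    """
    per_bin = n_select // n_bins
    if per_bin < 1:
        raise ValueError("n_select must be at least n_bins")
    # Order examples by length, then cut that ordering into equal-sized bins.
    by_length = sorted(range(len(examples)), key=lambda i: examples[i][0])
    low, high = [], []
    for b in range(n_bins):
        start = b * len(by_length) // n_bins
        end = (b + 1) * len(by_length) // n_bins
        # Within one length bin, rank by NLL and take both extremes.
        bin_by_nll = sorted(by_length[start:end], key=lambda i: examples[i][1])
        if len(bin_by_nll) < 2 * per_bin:
            raise ValueError("length bin too small for disjoint subsets")
        low.extend(bin_by_nll[:per_bin])
        high.extend(bin_by_nll[-per_bin:])
    return low, high


# Synthetic pool: lengths and NLLs are random placeholders, not paper data.
rng = random.Random(0)
pool = [(rng.randint(500, 5000), rng.uniform(0.1, 2.0)) for _ in range(10000)]
low_idx, high_idx = length_matched_split(pool, n_select=800)
print(mean(pool[i][0] for i in low_idx), mean(pool[i][0] for i in high_idx))
```

Because both subsets draw the same number of examples from every length bin, their length distributions agree by construction, while their NLL values sit at opposite extremes.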

Weaknesses:
1. Extremely narrow experimental scope (one model, one dataset, one regime): All results use OLMo3-7B on Dolci-Think with 800 examples for 32 epochs (Section 3.4). The paper itself acknowledges this limitation (Section 5), but the significance of the contribution depends critically on generalizability. Whether the finding holds for larger models (e.g., 70B+), different CoT datasets, or different epoch/size configurations is entirely unknown. A single-model, single-dataset study that makes a general recommendation is insufficient for a top venue.
2. No significance testing beyond one bootstrap CI on the aggregate comparison: The paper reports a bootstrap 95% CI of [3.27, 20.36] for the aggregate High-NLL vs Low-NLL difference (Section 4.1), but it is computed from only 3 seeds in each of the Low-NLL and High-NLL conditions. With n=3 seeds, the resampling distribution is too coarse for the CI to be reliable. No per-benchmark significance tests are reported, and the high variance of Low-NLL (±12.6 on AIME 2024) makes the comparison fragile. This is a serious statistical weakness for a paper whose core contribution is an empirical comparison.
3. The trigram-repetition confound is characterized only correlationally, and the filtering experiment is underspecified: The paper identifies that low-NLL examples have higher trigram rates (0.457 vs 0.206, Section 5) and attempts filtering with a threshold of <0.3 (Table 2), but does not report how many examples survived filtering, what the resulting subset's NLL distribution looked like, or whether the filtered set remained length-matched. The claim that 'the limitation is fundamental to the selection strategy' (Abstract) is overstated given this single, underspecified intervention.
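
The concern in Weakness 2 can be made concrete with a small enumeration. The sketch below assumes the interval was obtained by resampling seed-level scores, which the paper may or may not have done. The function names and the per-seed scores are placeholders, not values from the paper. Resampling three values with replacement yields at most 10 distinct multisets per condition, so at most 100 distinct differences between two conditions.

```python
from itertools import combinations_with_replacement, product
from statistics import mean


def distinct_bootstrap_means(scores):
    """All distinct means reachable by resampling `scores` with replacement."""
    n = len(scores)
    resamples = combinations_with_replacement(scores, n)
    return sorted({round(mean(r), 10) for r in resamples})


def distinct_bootstrap_diffs(scores_a, scores_b):
    """All distinct differences between resampled means of two conditions."""
    means_a = distinct_bootstrap_means(scores_a)
    means_b = distinct_bootstrap_means(scores_b)
    return sorted({round(a - b, 10) for a, b in product(means_a, means_b)})


# Placeholder per-seed scores, NOT values from the paper.
high_nll = [0.0, 1.0, 10.0]
low_nll = [0.0, 3.0, 7.0]
print(len(distinct_bootstrap_means(high_nll)))
print(len(distinct_bootstrap_diffs(high_nll, low_nll)))
```

With so few reachable values, the 2.5% and 97.5% percentiles of the bootstrap distribution are set by a handful of extreme resamples, which is why the reported interval deserves caution.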

Must Fix Items:
1. Run additional seeds (at least 5 per condition) and report proper statistical tests (e.g., paired permutation tests on per-problem accuracy) for each benchmark individually, not only for the aggregate.
2. Characterize the trigram-filtered subset fully: report the number of surviving examples, the NLL distribution, and whether the filtered set is still length-matched. Consider a factorial design that crosses NLL level with trigram rate to causally separate the two factors.
3. Test on at least one additional model or dataset to provide evidence of generalizability beyond the single OLMo3-7B/Dolci-Think configuration.
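
As an illustration of the test requested in Must Fix Item 1, a paired sign-flip permutation test on per-problem accuracy could look like the sketch below. It assumes each problem has one accuracy value per condition (for example, averaged over seeds). The function name `paired_permutation_pvalue` and the example data are hypothetical.

```python
import random


def paired_permutation_pvalue(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean per-problem difference.

    acc_a, acc_b: per-problem accuracies for two conditions, same problem order.
    Under the null hypothesis the sign of each paired difference is arbitrary,
    so signs are flipped at random to build the reference distribution.
    """
    if len(acc_a) != len(acc_b) or not acc_a:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(flipped) / n >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Placeholder per-problem accuracies for one benchmark, not paper data.
high_nll_acc = [1.0, 1.0, 0.67, 1.0, 0.33, 1.0, 0.67, 1.0]
low_nll_acc = [0.67, 0.33, 0.33, 1.0, 0.0, 0.67, 0.33, 0.67]
print(paired_permutation_pvalue(high_nll_acc, low_nll_acc))
```

Running this once per benchmark, with a correction for multiple comparisons, would give the per-benchmark evidence that the aggregate CI alone does not provide.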

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.6 error=None