{
  "pdf": "57445040-9899-4828-aa1b-052808501d56.pdf",
  "title": "LENGTH-WEIGHTED LOSS DOES NOT EXPLAIN THE REPETITION ADVANTAGE IN LONG-COT SUPERVISED FINE-TUNING FARS",
  "elapsed": 508.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.5,
  "scores": [
    5.5
  ],
  "score_std": 0.0,
  "final_verdict": "Revise",
  "final_confidence": 0.7,
  "conference_scores": null,
  "strengths": [
    "Pre-registered success criteria with explicit thresholds (recovery ≥60% supports, 20–60% partial, <20% refutes; Section 3.4) is a methodological strength rarely seen in ML papers — it prevents post-hoc redefinition of what counts as 'recovery'.",
    "Systematic exploration of three distinct reweighting approaches (linear Llen, quadratic α=2, token-level tail weighting β=4.0; Table 2) strengthens the negative claim by showing it holds across different gradient redistribution strategies, not just the primary hypothesis.",
    "The termination rate analysis (90.1% for Condition B vs 46.1% for A and 46.7% for C; Table 1) provides a clear behavioral signal that the mechanism must involve something beyond gradient magnitude, narrowing the space of viable explanations.",
    "Clean experimental design: identical optimizer, step budget (51,200), learning rate, and seeds across conditions eliminates many confounds (Section 3.3, Section 4.1). Three random seeds per condition is better than single-run."
  ],
  "weaknesses": [
    "The hypothesis tested is a narrow strawman. The repetition advantage could operate through many mechanisms unrelated to per-sequence gradient weighting: (a) curriculum effects from seeing the same examples in different order across epochs, (b) implicit regularization from reduced data diversity, (c) optimization landscape benefits from correlated gradient directions across epochs, (d) better calibration of generation length/termination from repeated exposure. Refuting one specific loss-weighting hypothesis does not meaningfully 'eliminate gradient signal distribution' — it only eliminates one operationalization. The conclusion's claim that results 'eliminate gradient signal distribution as an explanation' (Abstract, Section 5) is overreaching.",
    "No formal statistical significance tests are reported. With 3 seeds × 30 problems (AIME) or 3 seeds × 198 problems (GPQA), the standard errors are non-trivial. The claim that Condition C (25.5% ± 0.4%) is 'statistically identical' to Condition A (25.6% ± 0.8%) requires a proper test (e.g., bootstrap, paired t-test across problems). The ± values appear to be across 3 seeds only, giving very limited power to detect moderate effects. A 1–2 point recovery could be real but undetectable with this design.",
    "C-Opt0 (quadratic weighting) simultaneously changes two variables: the loss weighting exponent (α=1→2) AND the learning rate (2e-5→5e-5). The dramatic degradation to 16.4% Acc@k could be caused by the learning rate change alone, making it impossible to attribute failure to 'stronger weighting.' This confound undermines the claim that 'aggressive upweighting that assigns 12× more gradient to the longest sequences not only failed to improve performance but actually degraded it' (Section 5).",
    "The proposed alternative explanation — 'memorization convergence through repeated exposure' (Abstract, Conclusion) — is asserted without direct evidence. The paper cites Kopiczko et al. observing that training token accuracy correlates with downstream performance, but correlation is not mechanism. No experiment in this paper tests memorization directly (e.g., measuring training loss convergence, probe tasks, or gradient similarity across epochs). The discussion section (Section 5) merely gestures at this without supporting experiments.",
    "Single model (OLMo3-7B), single dataset (Dolci), three benchmarks. The paper acknowledges this limitation (Section 5), but for a negative-result paper that aims to 'eliminate' a hypothesis, generalizability is critical. The repetition advantage may operate differently in different model architectures (e.g., models with different attention mechanisms or position encodings) or data distributions (e.g., shorter CoT traces where length variance is smaller)."
  ],
  "must_fix_items": [
    "De-confound C-Opt0: run quadratic weighting (α=2) with the original learning rate (2e-5), and separately run original weighting (α=1) with the higher learning rate (5e-5), to isolate which variable causes the degradation.",
    "Add statistical significance tests (bootstrap CI or paired permutation test on per-problem accuracies) for the key comparison: Condition C vs Condition A on each benchmark. Report p-values or confidence intervals for the recovery fraction.",
    "Tone down the conclusion: change 'eliminates gradient signal distribution as an explanation' to 'does not support one specific gradient signal redistribution mechanism.' The current language claims more than the evidence warrants."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.5,
      "verdict": "Revise",
      "confidence": 0.7,
      "strengths": [
        "Pre-registered success criteria with explicit thresholds (recovery ≥60% supports, 20–60% partial, <20% refutes; Section 3.4) is a methodological strength rarely seen in ML papers — it prevents post-hoc redefinition of what counts as 'recovery'.",
        "Systematic exploration of three distinct reweighting approaches (linear Llen, quadratic α=2, token-level tail weighting β=4.0; Table 2) strengthens the negative claim by showing it holds across different gradient redistribution strategies, not just the primary hypothesis.",
        "The termination rate analysis (90.1% for Condition B vs 46.1% for A and 46.7% for C; Table 1) provides a clear behavioral signal that the mechanism must involve something beyond gradient magnitude, narrowing the space of viable explanations.",
        "Clean experimental design: identical optimizer, step budget (51,200), learning rate, and seeds across conditions eliminates many confounds (Section 3.3, Section 4.1). Three random seeds per condition is better than single-run."
      ],
      "weaknesses": [
        "The hypothesis tested is a narrow strawman. The repetition advantage could operate through many mechanisms unrelated to per-sequence gradient weighting: (a) curriculum effects from seeing the same examples in different order across epochs, (b) implicit regularization from reduced data diversity, (c) optimization landscape benefits from correlated gradient directions across epochs, (d) better calibration of generation length/termination from repeated exposure. Refuting one specific loss-weighting hypothesis does not meaningfully 'eliminate gradient signal distribution' — it only eliminates one operationalization. The conclusion's claim that results 'eliminate gradient signal distribution as an explanation' (Abstract, Section 5) is overreaching.",
        "No formal statistical significance tests are reported. With 3 seeds × 30 problems (AIME) or 3 seeds × 198 problems (GPQA), the standard errors are non-trivial. The claim that Condition C (25.5% ± 0.4%) is 'statistically identical' to Condition A (25.6% ± 0.8%) requires a proper test (e.g., bootstrap, paired t-test across problems). The ± values appear to be across 3 seeds only, giving very limited power to detect moderate effects. A 1–2 point recovery could be real but undetectable with this design.",
        "C-Opt0 (quadratic weighting) simultaneously changes two variables: the loss weighting exponent (α=1→2) AND the learning rate (2e-5→5e-5). The dramatic degradation to 16.4% Acc@k could be caused by the learning rate change alone, making it impossible to attribute failure to 'stronger weighting.' This confound undermines the claim that 'aggressive upweighting that assigns 12× more gradient to the longest sequences not only failed to improve performance but actually degraded it' (Section 5).",
        "The proposed alternative explanation — 'memorization convergence through repeated exposure' (Abstract, Conclusion) — is asserted without direct evidence. The paper cites Kopiczko et al. observing that training token accuracy correlates with downstream performance, but correlation is not mechanism. No experiment in this paper tests memorization directly (e.g., measuring training loss convergence, probe tasks, or gradient similarity across epochs). The discussion section (Section 5) merely gestures at this without supporting experiments.",
        "Single model (OLMo3-7B), single dataset (Dolci), three benchmarks. The paper acknowledges this limitation (Section 5), but for a negative-result paper that aims to 'eliminate' a hypothesis, generalizability is critical. The repetition advantage may operate differently in different model architectures (e.g., models with different attention mechanisms or position encodings) or data distributions (e.g., shorter CoT traces where length variance is smaller)."
      ],
      "must_fix_items": [
        "De-confound C-Opt0: run quadratic weighting (α=2) with the original learning rate (2e-5), and separately run original weighting (α=1) with the higher learning rate (5e-5), to isolate which variable causes the degradation.",
        "Add statistical significance tests (bootstrap CI or paired permutation test on per-problem accuracies) for the key comparison: Condition C vs Condition A on each benchmark. Report p-values or confidence intervals for the recovery fraction.",
        "Tone down the conclusion: change 'eliminates gradient signal distribution as an explanation' to 'does not support one specific gradient signal redistribution mechanism.' The current language claims more than the evidence warrants."
      ],
      "conference_scores": null
    }
  ]
}