{
  "pdf": "e9598167-ca9a-417a-bca0-488f48d3c645.pdf",
  "title": "DOES IGRPO NEED A GOOD DRAFT? BEST-VS-WORST SELF-CONDITIONING ABLATION FOR RLVR MATH",
  "elapsed": 514.1,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.2,
  "scores": [
    5.2
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.72,
  "conference_scores": null,
  "strengths": [
    "Clean 3-condition ablation design isolating draft quality from draft presence: GRPO (no draft) vs iGRPO-best (best draft) vs iGRPO-worst-of-formatted (worst draft). Matched compute budget (8 rollouts/prompt), same model (DeepSeek-R1-Distill-Qwen-7B), same hyperparameters, and same training data make this a well-controlled experiment (Sections 2.2-2.3).",
    "Honest reporting of a counterintuitive negative result: worst-of-formatted drafts outperforming best-of-N selection challenges the core motivation of iGRPO's reward-based draft selection. The recovery ratio r_worst = 1.34 with 95% CI [1.21, 1.47] entirely above 1.0 is a meaningful statistical claim (Table 2, Eq. 3).",
    "Mechanistic analysis attempting to explain the surprising result: the 50% increase in gradient-active training groups (18.5% vs 12.3%, z = −22.36) is a concrete, verifiable hypothesis for why bad drafts outperform good ones — more diverse Stage-2 outcomes lead to more non-zero advantage groups (Table 3, Section 3.3).",
    "Draft length confound explicitly ruled out: KS test p = 0.988 confirming both selection strategies produce drafts of similar token length (~4095), which is important since worst drafts could plausibly have been shorter/degenerate (Table 3).",
    "Six-benchmark evaluation spanning different difficulty levels (MATH500, GSM8K, AMC23, Minerva, AIME24, AIME25), with harder benchmarks showing larger iGRPO gains, providing face validity that draft conditioning helps complex reasoning (Table 1)."
  ],
  "weaknesses": [
    "Only 2 random seeds per condition — critically insufficient for drawing strong conclusions. With n=2, no meaningful variance estimation or significance testing on the main accuracy comparisons is possible. The 95% CI on the recovery ratio [1.21, 1.47] likely comes from bootstrap over evaluation runs (64 for AIME, 8 for others), not from training seeds, making it a statement about evaluation noise only, not training stability. This is a severe limitation acknowledged only briefly in Section 5.",
    "Single model (DeepSeek-R1-Distill-Qwen-7B), single training configuration (131 steps, lr=1e-6), single domain (math). The core claim — 'draft quality is not necessary for iGRPO's benefit' — is a generalization from one model at one scale in one domain. Whether this holds for larger models, different training durations, or non-math tasks is entirely unknown. The conclusion overstates the scope (Section 5: 'iGRPO can be simplified by removing reward-based draft selection without sacrificing performance').",
    "Worst-of-formatted is not truly 'worst': it selects the worst among well-formatted drafts (Eq. 2). This conflates two variables: draft quality and draft format validity. A true worst condition (selecting the worst draft regardless of format) is not tested. The fallback to 'lowest-reward draft' when no formatted drafts exist further muddies the comparison. The paper never reports what fraction of selection episodes fall back, making it impossible to assess how often the 'worst-of-formatted' rule is actually applied vs the fallback.",
    "No significance tests on the main benchmark accuracy comparisons (Table 1). The 2.43 pp macro-average gap between iGRPO-worst (64.37%) and iGRPO-best (61.94%) could easily be within noise given 2 seeds. The paper reports z-tests for gradient-active groups (p < 10^{-100}) but not for the actual accuracy differences that underpin the central claim. The recovery ratio CI is computed on MATH500 only, not on the macro-average where the headline claim is made.",
    "131 training steps is an extremely short run. It is unclear whether the observed pattern (worst > best) is a transient early-training phenomenon or a stable result. The paper provides no learning curves or checkpoints to assess training dynamics. If gradient-active groups are higher early on but converge later, the conclusion would change entirely. This is a critical missing analysis for a paper whose core finding is about training dynamics (gradient signal).",
    "The gradient-active groups explanation, while plausible, is correlational — not causal. The paper shows that worst-of-formatted produces more gradient-active groups AND better accuracy, but does not demonstrate that the former causes the latter. A causal test would require controlling gradient-active groups independently of draft quality, which is not attempted."
  ],
  "must_fix_items": [
    "Report results with more than 2 seeds or, at a minimum, provide per-seed breakdowns and standard deviations for all benchmarks; with n=2, the main accuracy claims remain unvalidated.",
    "Provide learning curves across training steps to show whether worst>best is a stable result or a transient phenomenon at 131 steps.",
    "Report the fraction of episodes where worst-of-formatted falls back to lowest-reward draft (i.e., no formatted draft available), to clarify what the condition actually does in practice.",
    "Add significance tests (or, at a minimum, per-seed standard deviations) on the macro-average accuracy comparison between Conditions B and C (the two iGRPO variants); the headline 2.43 pp gap currently lacks any statistical validation.",
    "Tone down the generalization: change 'draft quality is not necessary for iGRPO's benefit' to 'draft quality is not necessary for iGRPO's benefit in this specific setting (DeepSeek-R1-Distill-Qwen-7B, 131 steps, math domain)'."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.2,
      "verdict": "Reject",
      "confidence": 0.72,
      "strengths": [
        "Clean 3-condition ablation design isolating draft quality from draft presence: GRPO (no draft) vs iGRPO-best (best draft) vs iGRPO-worst-of-formatted (worst draft). Matched compute budget (8 rollouts/prompt), same model (DeepSeek-R1-Distill-Qwen-7B), same hyperparameters, and same training data make this a well-controlled experiment (Sections 2.2-2.3).",
        "Honest reporting of a counterintuitive negative result: worst-of-formatted drafts outperforming best-of-N selection challenges the core motivation of iGRPO's reward-based draft selection. The recovery ratio r_worst = 1.34 with 95% CI [1.21, 1.47] entirely above 1.0 is a meaningful statistical claim (Table 2, Eq. 3).",
        "Mechanistic analysis attempting to explain the surprising result: the 50% increase in gradient-active training groups (18.5% vs 12.3%, z = −22.36) is a concrete, verifiable hypothesis for why bad drafts outperform good ones — more diverse Stage-2 outcomes lead to more non-zero advantage groups (Table 3, Section 3.3).",
        "Draft length confound explicitly ruled out: KS test p = 0.988 confirming both selection strategies produce drafts of similar token length (~4095), which is important since worst drafts could plausibly have been shorter/degenerate (Table 3).",
        "Six-benchmark evaluation spanning different difficulty levels (MATH500, GSM8K, AMC23, Minerva, AIME24, AIME25), with harder benchmarks showing larger iGRPO gains, providing face validity that draft conditioning helps complex reasoning (Table 1)."
      ],
      "weaknesses": [
        "Only 2 random seeds per condition — critically insufficient for drawing strong conclusions. With n=2, no meaningful variance estimation or significance testing on the main accuracy comparisons is possible. The 95% CI on the recovery ratio [1.21, 1.47] likely comes from bootstrap over evaluation runs (64 for AIME, 8 for others), not from training seeds, making it a statement about evaluation noise only, not training stability. This is a severe limitation acknowledged only briefly in Section 5.",
        "Single model (DeepSeek-R1-Distill-Qwen-7B), single training configuration (131 steps, lr=1e-6), single domain (math). The core claim — 'draft quality is not necessary for iGRPO's benefit' — is a generalization from one model at one scale in one domain. Whether this holds for larger models, different training durations, or non-math tasks is entirely unknown. The conclusion overstates the scope (Section 5: 'iGRPO can be simplified by removing reward-based draft selection without sacrificing performance').",
        "Worst-of-formatted is not truly 'worst': it selects the worst among well-formatted drafts (Eq. 2). This conflates two variables: draft quality and draft format validity. A true worst condition (selecting the worst draft regardless of format) is not tested. The fallback to 'lowest-reward draft' when no formatted drafts exist further muddies the comparison. The paper never reports what fraction of selection episodes fall back, making it impossible to assess how often the 'worst-of-formatted' rule is actually applied vs the fallback.",
        "No significance tests on the main benchmark accuracy comparisons (Table 1). The 2.43 pp macro-average gap between iGRPO-worst (64.37%) and iGRPO-best (61.94%) could easily be within noise given 2 seeds. The paper reports z-tests for gradient-active groups (p < 10^{-100}) but not for the actual accuracy differences that underpin the central claim. The recovery ratio CI is computed on MATH500 only, not on the macro-average where the headline claim is made.",
        "131 training steps is an extremely short run. It is unclear whether the observed pattern (worst > best) is a transient early-training phenomenon or a stable result. The paper provides no learning curves or checkpoints to assess training dynamics. If gradient-active groups are higher early on but converge later, the conclusion would change entirely. This is a critical missing analysis for a paper whose core finding is about training dynamics (gradient signal).",
        "The gradient-active groups explanation, while plausible, is correlational — not causal. The paper shows that worst-of-formatted produces more gradient-active groups AND better accuracy, but does not demonstrate that the former causes the latter. A causal test would require controlling gradient-active groups independently of draft quality, which is not attempted."
      ],
      "must_fix_items": [
        "Report results with more than 2 seeds or, at a minimum, provide per-seed breakdowns and standard deviations for all benchmarks; with n=2, the main accuracy claims remain unvalidated.",
        "Provide learning curves across training steps to show whether worst>best is a stable result or a transient phenomenon at 131 steps.",
        "Report the fraction of episodes where worst-of-formatted falls back to lowest-reward draft (i.e., no formatted draft available), to clarify what the condition actually does in practice.",
        "Add significance tests (or, at a minimum, per-seed standard deviations) on the macro-average accuracy comparison between Conditions B and C (the two iGRPO variants); the headline 2.43 pp gap currently lacks any statistical validation.",
        "Tone down the generalization: change 'draft quality is not necessary for iGRPO's benefit' to 'draft quality is not necessary for iGRPO's benefit in this specific setting (DeepSeek-R1-Distill-Qwen-7B, 131 steps, math domain)'."
      ],
      "conference_scores": null
    }
  ]
}
