{
  "pdf": "robust-innovation-kpo.pdf",
  "title": "INNOVATION SATURATION DOES NOT ROBUSTIFY KALMAN-FILTERED IMPORTANCE RATIOS IN LLM REINFORCEMENT LEARNING",
  "elapsed": 55,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 3,
    "contribution": 2,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Clear negative-result framing with transparent hypothesis testing: The paper pre-registers a 50% recovery threshold and reports that IS-KPO achieves only 6.6%, with a z-test yielding p≈0.16. This honest reporting of failure is scientifically valuable and avoids spin (Section 4.2, Table 1).",
    "Thorough diagnostic analysis explaining the mechanism failure: Table 2 and Figure 2 provide compelling evidence that the saturation mechanism never activates (clip fraction < 10⁻⁶) because κ√V ≈ 3.06 far exceeds max|δ| ≈ 0.5, despite heavy-tailed innovations (kurtosis 17–59). The diagnosis is specific and quantitative, not hand-wavy.",
    "Identification of a fundamental design tension with supporting evidence: The paper goes beyond 'it didn't work' to identify a structural incompatibility—activating clipping requires low V, but low V increases Kalman gain and destroys smoothing. Table 3's two optimization attempts (Optimization 0: 43.94%, Optimization 1: 43.58%) both worsen performance, supporting the claim this is not merely a tuning problem (Section 4.4, Table 3)."
  ],
  "weaknesses": [
    "Extremely narrow experimental scope—single base model, single dataset, single task domain: All experiments use Qwen3-4B-Base on DAPO-Math-17k with only 16 training steps. The paper claims a 'fundamental design tension' that 'cannot be resolved through parameter tuning,' but this strong claim rests on one model, one dataset, and two failed optimization attempts (Table 3). Whether the tension generalizes across models, datasets, or longer training horizons is unknown (Section 4.1).",
    "Very small absolute performance differences and limited statistical power: The core comparison is KPO-weak 47.81% vs IS-KPO-weak 48.59% on MATH-500—a 0.78 pp difference. With only 500 problems and the reported z≈1.4, the experiment is underpowered. The paper's own p≈0.16 means it cannot confidently claim IS-KPO fails either—it is indistinguishable from baseline. On AIME (30 problems each), IS-KPO actually underperforms KPO-weak, but with N=30 the confidence intervals are enormous (Table 1).",
    "The 'negative result' contribution is incremental relative to what is already understood about Kalman filtering: The observation that high V makes clipping thresholds loose and low V makes Kalman gain high is a straightforward consequence of Kalman filter equations (Eq. 1-2). That V controls both the clipping threshold σ_t and the Kalman gain K_t is algebraically transparent. The paper demonstrates this empirically, but the 'fundamental tension' is more of an algebraic identity than a discovery—K_t = P/(P+V) and σ_t = √(P+V) both depend on V, so of course adjusting V trades off between them (Sections 3.1-3.2, Eq. 2-3).",
    "No comparison against other robustification strategies: The paper tests only innovation saturation and two parameter variants. Alternative approaches—e.g., adaptive V, heavy-tailed measurement models, M-estimation, or simply using a different filtering architecture as the conclusion suggests—are not tested. This makes it hard to assess whether the failure is specific to innovation saturation or generic to all robustification attempts on KPO (Section 5)."
  ],
  "must_fix_items": [
    "Run experiments on at least one additional model (e.g., Qwen2.5-7B or a different architecture) to support the claim that the design tension is 'fundamental' rather than model-specific.",
    "Provide confidence intervals or standard deviations for all reported metrics, not just a single z-test for the MATH-500 comparison. AIME results with N=30 are essentially uninformative without error bars.",
    "Test at least one alternative robustification approach (e.g., adaptive V, Huber loss on innovations) to clarify whether the failure is specific to innovation saturation or inherent to the KPO framework."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear negative-result framing with transparent hypothesis testing: The paper pre-registers a 50% recovery threshold and reports that IS-KPO achieves only 6.6%, with a z-test yielding p≈0.16. This honest reporting of failure is scientifically valuable and avoids spin (Section 4.2, Table 1).",
        "Thorough diagnostic analysis explaining the mechanism failure: Table 2 and Figure 2 provide compelling evidence that the saturation mechanism never activates (clip fraction < 10⁻⁶) because κ√V ≈ 3.06 far exceeds max|δ| ≈ 0.5, despite heavy-tailed innovations (kurtosis 17–59). The diagnosis is specific and quantitative, not hand-wavy.",
        "Identification of a fundamental design tension with supporting evidence: The paper goes beyond 'it didn't work' to identify a structural incompatibility—activating clipping requires low V, but low V increases Kalman gain and destroys smoothing. Table 3's two optimization attempts (Optimization 0: 43.94%, Optimization 1: 43.58%) both worsen performance, supporting the claim this is not merely a tuning problem (Section 4.4, Table 3)."
      ],
      "weaknesses": [
        "Extremely narrow experimental scope—single base model, single dataset, single task domain: All experiments use Qwen3-4B-Base on DAPO-Math-17k with only 16 training steps. The paper claims a 'fundamental design tension' that 'cannot be resolved through parameter tuning,' but this strong claim rests on one model, one dataset, and two failed optimization attempts (Table 3). Whether the tension generalizes across models, datasets, or longer training horizons is unknown (Section 4.1).",
        "Very small absolute performance differences and limited statistical power: The core comparison is KPO-weak 47.81% vs IS-KPO-weak 48.59% on MATH-500—a 0.78 pp difference. With only 500 problems and the reported z≈1.4, the experiment is underpowered. The paper's own p≈0.16 means it cannot confidently claim IS-KPO fails either—it is indistinguishable from baseline. On AIME (30 problems each), IS-KPO actually underperforms KPO-weak, but with N=30 the confidence intervals are enormous (Table 1).",
        "The 'negative result' contribution is incremental relative to what is already understood about Kalman filtering: The observation that high V makes clipping thresholds loose and low V makes Kalman gain high is a straightforward consequence of Kalman filter equations (Eq. 1-2). That V controls both the clipping threshold σ_t and the Kalman gain K_t is algebraically transparent. The paper demonstrates this empirically, but the 'fundamental tension' is more of an algebraic identity than a discovery—K_t = P/(P+V) and σ_t = √(P+V) both depend on V, so of course adjusting V trades off between them (Sections 3.1-3.2, Eq. 2-3).",
        "No comparison against other robustification strategies: The paper tests only innovation saturation and two parameter variants. Alternative approaches—e.g., adaptive V, heavy-tailed measurement models, M-estimation, or simply using a different filtering architecture as the conclusion suggests—are not tested. This makes it hard to assess whether the failure is specific to innovation saturation or generic to all robustification attempts on KPO (Section 5)."
      ],
      "must_fix_items": [
        "Run experiments on at least one additional model (e.g., Qwen2.5-7B or a different architecture) to support the claim that the design tension is 'fundamental' rather than model-specific.",
        "Provide confidence intervals or standard deviations for all reported metrics, not just a single z-test for the MATH-500 comparison. AIME results with N=30 are essentially uninformative without error bars.",
        "Test at least one alternative robustification approach (e.g., adaptive V, Huber loss on innovations) to clarify whether the failure is specific to innovation saturation or inherent to the KPO framework."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 3,
        "contribution": 2,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}