{
  "pdf": "cusum-calibrated-rollback-controller.pdf",
  "title": "CUSUM-ϵ: FALSE-ALARM-CALIBRATED ROLLBACK THRESHOLDS FOR RUNTIME TRAINING STABILITY CONTROLLERS",
  "elapsed": 46.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 3,
    "contribution": 2,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Clear negative-result framing: The paper honestly reports that CUSUM-ϵ underperforms Or-ϵ despite theoretical expectations, providing a useful negative finding for the community (Section 1, Abstract). This is a refreshing departure from papers that only report positive results.",
    "Fair calibration protocol: Both controllers are calibrated to match nominal rollback rates (p0=0.2%), with actual rates within 20% of target (Table 1). This ensures the comparison isolates the effect of the decision rule rather than differing sensitivity levels, which is methodologically sound (Section 3.5).",
    "FIR partial reset ablation: The paper identifies and addresses the 'reset blind spot' in standard CUSUM, showing a 35% improvement (Table 3). This ablation helps isolate the source of CUSUM's disadvantage—initial detection delay vs. reset delay—providing clearer mechanistic understanding (Section 3.4, 4.3)."
  ],
  "weaknesses": [
    "Extremely narrow experimental scope: Only ResNet-18/CIFAR-10 with 250 training steps and synthetic gradient perturbations (ζ=300 amplification, 10-step window). No real-world training instability scenarios, no other architectures, no other datasets. The paper's own Section 4.4 acknowledges results are 'specific to large, immediately-detectable perturbations that cause innovation spikes >5σ.' This severely limits generalizability and practical impact.",
    "Modest contribution as a negative result: The finding that 'one-step thresholds beat sequential tests when perturbations are large and immediately detectable' is somewhat intuitive—of course immediate detection is better than delayed detection when the signal is obvious. The paper does not test the complementary regime (subtle, gradual drift) where CUSUM would theoretically shine, leaving the practical utility of CUSUM-ϵ entirely unvalidated (Section 4.4 explicitly defers this).",
    "Questionable statistical rigor: Only 20 random seeds are used. The paper claims 'non-overlapping standard deviations' as evidence of statistical significance (Section 4.2), which is not a proper significance test. No p-values, confidence intervals, or effect size analyses are reported. The standard deviations in Table 2 are large relative to the means (e.g., Or-ϵ peak excess: 1.85±0.92; CUSUM-ϵ: 3.47±0.81)—these do overlap on the low end of Or-ϵ's range (1.85-0.92=0.93) vs. high end of CUSUM-ϵ range (3.47-0.81=2.66), so the 'non-overlapping' claim appears incorrect.",
    "Short training horizon: 250 steps is extremely short for a training stability study. Real training runs span tens of thousands to millions of steps. The calibration is done over just 250 steps with Ncal=20 seeds (Section 3.5), raising concerns about whether the estimated statistics (μ0, σ0) are stable or representative of longer training dynamics."
  ],
  "must_fix_items": [
    "Correct the incorrect claim that standard deviations do not overlap in Table 2—Or-ϵ step perturbation peak excess 1.85±0.92 has range [0.93, 2.77] which overlaps with CUSUM-ϵ's [2.66, 4.28]. Proper statistical tests (e.g., Welch's t-test, bootstrap CI on the difference) must be reported.",
    "Add experiments in the gradual-drift regime where CUSUM is theoretically advantageous, or substantially soften claims about CUSUM's 'fundamental limitation' given that the regime tested is precisely the one most favorable to one-step thresholds."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear negative-result framing: The paper honestly reports that CUSUM-ϵ underperforms Or-ϵ despite theoretical expectations, providing a useful negative finding for the community (Section 1, Abstract). This is a refreshing departure from papers that only report positive results.",
        "Fair calibration protocol: Both controllers are calibrated to match nominal rollback rates (p0=0.2%), with actual rates within 20% of target (Table 1). This ensures the comparison isolates the effect of the decision rule rather than differing sensitivity levels, which is methodologically sound (Section 3.5).",
        "FIR partial reset ablation: The paper identifies and addresses the 'reset blind spot' in standard CUSUM, showing a 35% improvement (Table 3). This ablation helps isolate the source of CUSUM's disadvantage—initial detection delay vs. reset delay—providing clearer mechanistic understanding (Section 3.4, 4.3)."
      ],
      "weaknesses": [
        "Extremely narrow experimental scope: Only ResNet-18/CIFAR-10 with 250 training steps and synthetic gradient perturbations (ζ=300 amplification, 10-step window). No real-world training instability scenarios, no other architectures, no other datasets. The paper's own Section 4.4 acknowledges results are 'specific to large, immediately-detectable perturbations that cause innovation spikes >5σ.' This severely limits generalizability and practical impact.",
        "Modest contribution as a negative result: The finding that 'one-step thresholds beat sequential tests when perturbations are large and immediately detectable' is somewhat intuitive—of course immediate detection is better than delayed detection when the signal is obvious. The paper does not test the complementary regime (subtle, gradual drift) where CUSUM would theoretically shine, leaving the practical utility of CUSUM-ϵ entirely unvalidated (Section 4.4 explicitly defers this).",
        "Questionable statistical rigor: Only 20 random seeds are used. The paper claims 'non-overlapping standard deviations' as evidence of statistical significance (Section 4.2), which is not a proper significance test. No p-values, confidence intervals, or effect size analyses are reported. The standard deviations in Table 2 are large relative to the means (e.g., Or-ϵ peak excess: 1.85±0.92; CUSUM-ϵ: 3.47±0.81)—these do overlap on the low end of Or-ϵ's range (1.85-0.92=0.93) vs. high end of CUSUM-ϵ range (3.47-0.81=2.66), so the 'non-overlapping' claim appears incorrect.",
        "Short training horizon: 250 steps is extremely short for a training stability study. Real training runs span tens of thousands to millions of steps. The calibration is done over just 250 steps with Ncal=20 seeds (Section 3.5), raising concerns about whether the estimated statistics (μ0, σ0) are stable or representative of longer training dynamics."
      ],
      "must_fix_items": [
        "Correct the incorrect claim that standard deviations do not overlap in Table 2—Or-ϵ step perturbation peak excess 1.85±0.92 has range [0.93, 2.77] which overlaps with CUSUM-ϵ's [2.66, 4.28]. Proper statistical tests (e.g., Welch's t-test, bootstrap CI on the difference) must be reported.",
        "Add experiments in the gradual-drift regime where CUSUM is theoretically advantageous, or substantially soften claims about CUSUM's 'fundamental limitation' given that the regime tested is precisely the one most favorable to one-step thresholds."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 3,
        "contribution": 2,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}