{
  "pdf": "risk-controlled-dllm-early-exit.pdf",
  "title": "RISK-CONTROLLED EARLY EXIT DIFFUSION",
  "elapsed": 53.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 2,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "RC-Jot is the first application of conformal risk control to early exit in diffusion language models, filling a genuine gap: existing DLLM early exit methods (Jot, KLASS) use heuristic thresholds with no formal quality guarantees. The contribution is cleanly scoped and well-motivated (Section 1, bullet 1).",
    "The empirical comparison of calibration methods (Naive, CRC, UCB-HB) is informative and reveals practical differences. On HumanEval with small calibration sets, Naive shows 52% violation at ε=0.15 while UCB-HB shows only 1% violation — a striking demonstration that uncorrected threshold selection is unreliable (Table 2, ε=0.15 column).",
    "The monotonicity validation (Section 4.3, Figure 2) is a commendable empirical check of a core assumption. Reporting Spearman ρ = −1.0 on GSM8K and ρ = −0.408 on HumanEval with explicit discussion of violations provides transparency about when the assumption holds and when it is approximate."
  ],
  "weaknesses": [
    "The speedups are modest: 1.36× on GSM8K and 1.32× on HumanEval, while the original Jot achieves 2.01× and 1.19× respectively (Table 1, Table 2). At tighter risk budgets (ε ≤ 0.05 on GSM8K), all calibration methods fall back to full decoding (1.0× speedup), meaning the framework provides no acceleration when strict guarantees are needed. The practical value of a 1.36× speedup with 8.9% risk on GSM8K is questionable compared to simply running fewer diffusion steps without the calibration overhead.",
    "On GSM8K, Naive, CRC, and UCB-HB all produce identical threshold selections (τ=200 at ε=0.10, τ=256 at ε≤0.05), meaning the sophisticated UCB-HB calibration provides no advantage over naive selection when calibration data is abundant (Table 1). The claimed superiority of UCB-HB rests entirely on HumanEval with n=82 calibration samples — a very small and arguably insufficient dataset for drawing strong conclusions about calibration method comparison.",
    "The monotonicity assumption is violated on HumanEval (Section 4.3: ρ = −0.408, 3 minor violations). The paper dismisses these as 'small' and 'within confidence intervals,' but violations of monotonicity undermine the theoretical guarantee in Equation 3. The paper does not provide any theoretical analysis of how monotonicity violations affect the guarantee, nor does it propose a correction. This is a significant gap between the claimed distribution-free guarantee and actual empirical behavior.",
    "The cross-distribution calibration on GSM8K uses an ad-hoc margin of 0.05 (Section 4.1) with no theoretical justification for why this particular margin suffices. This margin is critical — without it, the distribution-free guarantee would not hold across train/test split — yet its selection is heuristic, undermining the 'distribution-free' claim. The paper does not analyze sensitivity to this margin choice."
  ],
  "must_fix_items": [
    "Provide theoretical or empirical analysis of how monotonicity violations (documented on HumanEval) affect the validity of the UCB-HB guarantee. Simply noting they are 'small' is insufficient for a paper claiming distribution-free guarantees.",
    "Justify the 0.05 cross-distribution margin on GSM8K theoretically or via sensitivity analysis; currently it is an unprincipled heuristic that the entire guarantee rests upon.",
    "Report statistical significance or confidence intervals on the speedup and violation rate metrics, especially given the very small HumanEval test set (n_test=82). A single violation on HumanEval corresponds to ~1.2%, making the '1% violation' claim sensitive to individual samples."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "RC-Jot is the first application of conformal risk control to early exit in diffusion language models, filling a genuine gap: existing DLLM early exit methods (Jot, KLASS) use heuristic thresholds with no formal quality guarantees. The contribution is cleanly scoped and well-motivated (Section 1, bullet 1).",
        "The empirical comparison of calibration methods (Naive, CRC, UCB-HB) is informative and reveals practical differences. On HumanEval with small calibration sets, Naive shows 52% violation at ε=0.15 while UCB-HB shows only 1% violation — a striking demonstration that uncorrected threshold selection is unreliable (Table 2, ε=0.15 column).",
        "The monotonicity validation (Section 4.3, Figure 2) is a commendable empirical check of a core assumption. Reporting Spearman ρ = −1.0 on GSM8K and ρ = −0.408 on HumanEval with explicit discussion of violations provides transparency about when the assumption holds and when it is approximate."
      ],
      "weaknesses": [
        "The speedups are modest: 1.36× on GSM8K and 1.32× on HumanEval, while the original Jot achieves 2.01× and 1.19× respectively (Table 1, Table 2). At tighter risk budgets (ε ≤ 0.05 on GSM8K), all calibration methods fall back to full decoding (1.0× speedup), meaning the framework provides no acceleration when strict guarantees are needed. The practical value of a 1.36× speedup with 8.9% risk on GSM8K is questionable compared to simply running fewer diffusion steps without the calibration overhead.",
        "On GSM8K, Naive, CRC, and UCB-HB all produce identical threshold selections (τ=200 at ε=0.10, τ=256 at ε≤0.05), meaning the sophisticated UCB-HB calibration provides no advantage over naive selection when calibration data is abundant (Table 1). The claimed superiority of UCB-HB rests entirely on HumanEval with n=82 calibration samples — a very small and arguably insufficient dataset for drawing strong conclusions about calibration method comparison.",
        "The monotonicity assumption is violated on HumanEval (Section 4.3: ρ = −0.408, 3 minor violations). The paper dismisses these as 'small' and 'within confidence intervals,' but violations of monotonicity undermine the theoretical guarantee in Equation 3. The paper does not provide any theoretical analysis of how monotonicity violations affect the guarantee, nor does it propose a correction. This is a significant gap between the claimed distribution-free guarantee and actual empirical behavior.",
        "The cross-distribution calibration on GSM8K uses an ad-hoc margin of 0.05 (Section 4.1) with no theoretical justification for why this particular margin suffices. This margin is critical — without it, the distribution-free guarantee would not hold across train/test split — yet its selection is heuristic, undermining the 'distribution-free' claim. The paper does not analyze sensitivity to this margin choice."
      ],
      "must_fix_items": [
        "Provide theoretical or empirical analysis of how monotonicity violations (documented on HumanEval) affect the validity of the UCB-HB guarantee. Simply noting they are 'small' is insufficient for a paper claiming distribution-free guarantees.",
        "Justify the 0.05 cross-distribution margin on GSM8K theoretically or via sensitivity analysis; currently it is an unprincipled heuristic that the entire guarantee rests upon.",
        "Report statistical significance or confidence intervals on the speedup and violation rate metrics, especially given the very small HumanEval test set (n_test=82). A single violation on HumanEval corresponds to ~1.2%, making the '1% violation' claim sensitive to individual samples."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 2,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}