{
  "pdf": "interval-matched-noise-quantization.pdf",
  "title": "INTERVAL-CALIBRATED NOISY QUANTIZATION:",
  "elapsed": 51.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.8,
  "scores": [
    4.8
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.8,
    "presentation": 3.2,
    "contribution": 2.5,
    "overall_rating": 4.8,
    "confidence": 3
  },
  "strengths": [
    "Practical parameter-free defense: The core contribution—deriving noise scale σ from quantization interval half-widths without any evaluation data—is a genuine practical advance over grid-search tuning. The double-median aggregation (Eq. 3-4) is simple, well-motivated, and requires only a single O(n) pass through model weights (Section 3.4), making it deployable in quantization libraries where per-model evaluation is infeasible.",
    "Strong defense effectiveness: The per-layer method achieves 98.77% code security vs. 33.4% without defense (Table 1), recovering nearly all of the security loss. It scores 0.6 percentage points above the grid-search oracle (98.17%), indicating that the interval-calibrated approach matches tuned performance without any tuning (Section 4.2).",
    "Diagnostic analysis provides principled justification: Figure 2 shows σ̂ = 3.89×10⁻⁴ falls within 22% of grid-search optimal σ* = 5.0×10⁻⁴, and Figure 3 shows σ̂ sits near the inflection point of the weight-change S-curve (~40% weights changed). This geometrically grounds the method beyond mere empirical fitting (Section 4.3)."
  ],
  "weaknesses": [
    "Extremely narrow experimental scope (single model, single attack, single quantization method): All results are on Phi-2 (2.78B) with the ELQ attack and LLM.int8() only. The paper acknowledges this limitation (Section 4.5) but does not validate on any other model (e.g., Llama, Qwen), any other quantization scheme (GPTQ, AWQ, and GGUF are mentioned but not tested), or any other attack variant. The 22% proximity claim between σ̂ and σ* is a single data point on one model; it is unknown whether this holds generally. This severely limits confidence in generalization.",
    "Limited baseline fairness and statistical rigor: Only 3 random seeds are used for stochastic methods, and the grid-search baseline searches over just 5 candidate σ values {10⁻⁵, 5×10⁻⁵, 10⁻⁴, 5×10⁻⁴, 10⁻³}, a very coarse grid that may miss the true optimum. The paper's own σ̂ = 3.89×10⁻⁴ is not even in this grid, suggesting a finer grid could yield a stronger baseline. No statistical significance tests (e.g., t-test, bootstrap CI) are reported despite the small sample (n=3), so the claimed 0.6pp gap between per-layer and grid-search is not supported by any significance test.",
    "Per-layer improvement is marginal and potentially not meaningful: Table 2(a) shows per-layer calibration improves security by only 0.37pp over global σ̂ (98.77% vs. 98.40%), and the one-standard-deviation ranges overlap (±0.25 vs. ±0.57). The '56% variance reduction' claim corresponds to the drop in standard deviation from 0.57 to 0.25 and rests on only n=3 runs. This improvement, while directionally consistent, may not be statistically significant and does not strongly justify the added complexity of per-layer noise."
  ],
  "must_fix_items": [
    "Add experiments on at least one additional model (e.g., Llama3.1-8B as referenced in Section 2) and at least one additional quantization scheme (GPTQ or AWQ) to demonstrate generalization of the interval-calibrated noise scale principle.",
    "Report statistical significance tests or bootstrap confidence intervals for the main comparisons, especially the 0.6pp gap between per-layer and grid-search (Table 1) and the 0.37pp per-layer improvement (Table 2a).",
    "Use a finer grid-search (e.g., logarithmically spaced with 15-20 points) to ensure the grid-search baseline is a fair competitor rather than an artificially weak one."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.8,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "Practical parameter-free defense: The core contribution—deriving noise scale σ from quantization interval half-widths without any evaluation data—is a genuine practical advance over grid-search tuning. The double-median aggregation (Eq. 3-4) is simple, well-motivated, and requires only a single O(n) pass through model weights (Section 3.4), making it deployable in quantization libraries where per-model evaluation is infeasible.",
        "Strong defense effectiveness: The per-layer method achieves 98.77% code security vs. 33.4% without defense (Table 1), recovering nearly all of the security loss. It scores 0.6 percentage points above the grid-search oracle (98.17%), indicating that the interval-calibrated approach matches tuned performance without any tuning (Section 4.2).",
        "Diagnostic analysis provides principled justification: Figure 2 shows σ̂ = 3.89×10⁻⁴ falls within 22% of grid-search optimal σ* = 5.0×10⁻⁴, and Figure 3 shows σ̂ sits near the inflection point of the weight-change S-curve (~40% weights changed). This geometrically grounds the method beyond mere empirical fitting (Section 4.3)."
      ],
      "weaknesses": [
        "Extremely narrow experimental scope (single model, single attack, single quantization method): All results are on Phi-2 (2.78B) with the ELQ attack and LLM.int8() only. The paper acknowledges this limitation (Section 4.5) but does not validate on any other model (e.g., Llama, Qwen), any other quantization scheme (GPTQ, AWQ, and GGUF are mentioned but not tested), or any other attack variant. The 22% proximity claim between σ̂ and σ* is a single data point on one model; it is unknown whether this holds generally. This severely limits confidence in generalization.",
        "Limited baseline fairness and statistical rigor: Only 3 random seeds are used for stochastic methods, and the grid-search baseline searches over just 5 candidate σ values {10⁻⁵, 5×10⁻⁵, 10⁻⁴, 5×10⁻⁴, 10⁻³}, a very coarse grid that may miss the true optimum. The paper's own σ̂ = 3.89×10⁻⁴ is not even in this grid, suggesting a finer grid could yield a stronger baseline. No statistical significance tests (e.g., t-test, bootstrap CI) are reported despite the small sample (n=3), so the claimed 0.6pp gap between per-layer and grid-search is not supported by any significance test.",
        "Per-layer improvement is marginal and potentially not meaningful: Table 2(a) shows per-layer calibration improves security by only 0.37pp over global σ̂ (98.77% vs. 98.40%), and the one-standard-deviation ranges overlap (±0.25 vs. ±0.57). The '56% variance reduction' claim corresponds to the drop in standard deviation from 0.57 to 0.25 and rests on only n=3 runs. This improvement, while directionally consistent, may not be statistically significant and does not strongly justify the added complexity of per-layer noise."
      ],
      "must_fix_items": [
        "Add experiments on at least one additional model (e.g., Llama3.1-8B as referenced in Section 2) and at least one additional quantization scheme (GPTQ or AWQ) to demonstrate generalization of the interval-calibrated noise scale principle.",
        "Report statistical significance tests or bootstrap confidence intervals for the main comparisons, especially the 0.6pp gap between per-layer and grid-search (Table 1) and the 0.37pp per-layer improvement (Table 2a).",
        "Use a finer grid-search (e.g., logarithmically spaced with 15-20 points) to ensure the grid-search baseline is a fair competitor rather than an artificially weak one."
      ],
      "conference_scores": {
        "soundness": 2.8,
        "presentation": 3.2,
        "contribution": 2.5,
        "overall_rating": 4.8,
        "confidence": 3
      }
    }
  ]
}