Title: PHASEGUARD-KL: OUTPUT-DISSIMILARITY-TRIGGERED KL REGULARIZATION FOR EMERGENT MISALIGNMENT DEFENSE
PDF: 94192214-3b6a-43d9-ad83-0580db630411.pdf
Score: 4.0
Verdict: Reject
Confidence: 0.7
Elapsed: 394.0s

Strengths:
1. Honest self-refutation with pre-registered criteria (Section 3.3): The paper pre-registers two success criteria (Security EM misaligned rate ≤24.3% and OpSwap exact match ≥43.7%) before running experiments, and honestly reports that PhaseGuard-KL fails the OpSwap criterion (22.78% vs ≥43.7% threshold), refuting its own selectivity hypothesis. This is methodologically commendable and rare in ML publications.
2. Lambda sweep reveals fundamental tradeoff (Table 2, Figure 2): The systematic sweep across 5 KL coefficient values (0.01–0.1) demonstrates that no single λ satisfies both safety and utility criteria, with a sharp cliff between λ=0.02 (OpSwap 54.44%, Security EM 30.56%) and λ=0.03 (OpSwap 22.78%, Security EM 20.83%). This is a genuine empirical finding about the non-existence of a sweet spot.
3. Root cause analysis pinpoints mechanism failure (Section 4.4): The paper correctly identifies that JS divergence on canary prompts reaches 0.28–0.33 within the first 5 steps for both Security EM and OpSwap, explaining why the trigger fires identically at step 20 for both tasks. The diagnosis 'detects distribution shift magnitude but not distribution shift intent' is precise and actionable for future work.
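
The tradeoff described in Strengths 1 and 2 can be checked directly against the pre-registered thresholds. The sketch below is illustrative only: it uses the two sweep points quoted in this review (the other three points of Table 2 are not reproduced here), and the names `sweep` and `check` are hypothetical.

```python
# Pre-registered thresholds as quoted in this review (paper Section 3.3), in percent.
MAX_SECURITY_EM = 24.3  # Security EM misaligned rate must be <= this
MIN_OPSWAP = 43.7       # OpSwap exact match must be >= this

# Only the two sweep points quoted in this review; the remaining three are omitted.
sweep = {
    0.02: {"opswap": 54.44, "security_em": 30.56},
    0.03: {"opswap": 22.78, "security_em": 20.83},
}

def check(point):
    """Return (safety_ok, utility_ok) for one sweep point."""
    safety_ok = point["security_em"] <= MAX_SECURITY_EM
    utility_ok = point["opswap"] >= MIN_OPSWAP
    return safety_ok, utility_ok

results = {lam: check(point) for lam, point in sweep.items()}
for lam, (safety_ok, utility_ok) in results.items():
    print(f"lambda={lam}: safety={'pass' if safety_ok else 'fail'}, "
          f"utility={'pass' if utility_ok else 'fail'}")
```

On these two points, each λ passes exactly one criterion, which is the cliff the review refers to.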

Weaknesses:
1. Trivial core idea beneath packaging (Section 3.2): Stripped of its branding ('PhaseGuard-KL', 'canary prompts', 'dissimilarity monitor', 'level-shift detector'), the method is simply: compute JS divergence on a fixed prompt set during fine-tuning; if it exceeds a threshold, turn on KL penalty. The hypothesis that distribution shift magnitude distinguishes malicious from benign fine-tuning is implausible on its face: any fine-tuning that changes behavior will shift output distributions. The negative result is therefore unsurprising, not insightful.
2. Insufficient statistical rigor: Security EM reports 3 seeds with high variance (PhaseGuard-KL: 20.83% ± 7.22%; the mean ± 2 × 7.22 band is approximately [6.4%, 35.3%] and straddles the ≤24.3% threshold, so passing that criterion is not statistically robust). OpSwap uses only 1 seed (Section 4.1: 'OpSwap uses 1 seed due to large expected effect sizes'), providing zero variance information. No significance tests (t-tests, bootstrap, or otherwise) are reported for any comparison. The lambda sweep (Table 2) draws conclusions from 5 points with no error bars on OpSwap. The 'phase transition' claim between λ=0.02 and 0.03 rests on 2 adjacent data points from a single-seed experiment.
3. Narrow experimental scope — single model, two benchmarks: Only Qwen2.5-7B-Instruct is tested (Section 4.1). The generalization of the negative result to other model sizes (e.g., 1.5B, 72B), architectures (e.g., Llama, Mistral), or fine-tuning methods (e.g., full fine-tuning instead of LoRA) is entirely unknown. Only 2 benchmarks are used, both from the same source (Kaczér et al., 2025). The canary prompts are fixed at 24 with no analysis of how prompt selection affects discriminability. The negative result may be an artifact of this particular model-benchmark combination rather than a fundamental finding.
4. Missing critical analysis of canary prompt design (Section 3.2): The 24 canary prompts are described as 'safety-relevant' and 'benign user questions designed to probe out-of-domain safety regressions,' but no analysis is provided of what makes them safety-relevant vs. general distribution probes. The paper does not test whether safety-specific canary prompts (e.g., prompts explicitly about harmful requests) would be more discriminative than the current set. Without this analysis, the negative result applies only to this particular canary design, not to the class of output-dissimilarity approaches.
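
The mechanism summarized in Weakness 1 can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it replaces the 'level-shift detector' with a plain threshold, and the toy distributions, `THRESHOLD`, and all function names are assumptions made for the example.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def kl_penalty_coefficient(base, tuned, threshold, lam):
    """Return (coefficient, mean JS): lam if mean canary JS exceeds threshold, else 0."""
    mean_js = sum(js_divergence(p, q) for p, q in zip(base, tuned)) / len(base)
    return (lam if mean_js > threshold else 0.0), mean_js

# Toy output distributions for a single canary prompt (hypothetical numbers).
base = [[0.8, 0.1, 0.1]]
shift_a = [[0.1, 0.8, 0.1]]  # stands in for a harmful fine-tune
shift_b = [[0.1, 0.1, 0.8]]  # stands in for a benign fine-tune

THRESHOLD = 0.2  # hypothetical trigger level
LAMBDA = 0.03    # one of the swept KL coefficients

coef_a, js_a = kl_penalty_coefficient(base, shift_a, THRESHOLD, LAMBDA)
coef_b, js_b = kl_penalty_coefficient(base, shift_b, THRESHOLD, LAMBDA)
print(f"shift A: JS={js_a:.3f}, coefficient={coef_a}")
print(f"shift B: JS={js_b:.3f}, coefficient={coef_b}")
```

Both shifts move the same probability mass to a different token, so the monitor sees the same divergence and applies the same penalty. A magnitude-only signal carries no information about which direction the model moved.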

Must Fix Items:
1. Add variance estimates and significance tests for all comparisons, especially the lambda sweep OpSwap results — currently single-seed with no error bars, making the 'phase transition' claim between λ=0.02 and 0.03 unsupported.
2. Report at least 3 seeds for OpSwap experiments; the 'large expected effect sizes' justification is circular when the paper's conclusion depends on exact numerical thresholds.
3. Test at least one additional model (e.g., Llama-3-8B) to establish whether the negative result generalizes beyond Qwen2.5-7B-Instruct, or explicitly scope the claim to this specific model.
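
As a starting point for Must Fix item 1, a Student-t confidence interval on the 3-seed Security EM mean can be computed as sketched below. The sketch assumes the reported ± 7.22 is the sample standard deviation across seeds, which the review does not confirm; `t_interval` is a hypothetical helper, and the critical value is hard-coded to avoid external dependencies.

```python
import math

# Two-sided 95% Student-t critical value for 2 degrees of freedom (n = 3 seeds).
T_CRIT_DF2 = 4.302653

def t_interval(mean, sd, n, t_crit):
    """Confidence interval for a mean, given sample SD and sample size."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Reported PhaseGuard-KL Security EM result, in percent.
# ASSUMPTION: 7.22 is the sample standard deviation over the 3 seeds.
lo, hi = t_interval(mean=20.83, sd=7.22, n=3, t_crit=T_CRIT_DF2)

THRESHOLD = 24.3  # pre-registered Security EM criterion (<= 24.3%)
straddles = lo < THRESHOLD < hi
print(f"95% CI: [{lo:.1f}%, {hi:.1f}%], threshold inside interval: {straddles}")
```

Under that assumption the interval contains the 24.3% threshold, so three seeds cannot establish that the criterion is met. The same computation cannot be run for OpSwap until more than one seed is reported.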

Runs:
- run=1 score=4.0 verdict=Reject confidence=0.7 error=None