Title: DIFFERENTIALLY PRIVATE SPECTRAL MONITOR LOGS FOR HALLUCINATION DETECTION: A COMPARATIVE STUDY OF WISHART AND GAUSSIAN MECHANISMS FARS
PDF: dp-spectral-activation-logging.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 234.5s

Strengths:
1. Clear and focused research question: the paper identifies a concrete, previously unstudied problem (DP mechanisms for spectral monitor logs) and provides a principled comparative study of Wishart vs Gaussian mechanisms for covariance matrix privatization (Section 1, contributions list).
2. Honest reporting of negative results: the paper transparently reports that (a) privacy leakage is minimal even without DP (canary-ID attack accuracy of 0.74% vs 0.50% chance, Table 1), and (b) DP noise causes unacceptable utility degradation at reasonable ε (Section 4.4, Figure 3). This is scientifically valuable even though the findings are largely negative.
3. Well-designed SNR analysis (Section 4.4, Figure 3): the signal-to-noise ratio decomposition shows why DP fails here. The K=10 covariance eigenvalues (~0.6) are overwhelmed by Wishart noise at ε≤5 (SNR=0.12 at ε=1). This provides genuine mechanistic insight rather than a bare report of empirical failure.
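A back-of-the-envelope reading of the numbers in strength 3, assuming (this is the reviewer's assumption, not a definition taken from the paper) that SNR is the ratio of a typical signal eigenvalue to the typical noise magnitude. The symbols \lambda_{\text{signal}} and \sigma_{\text{noise}} are introduced here for illustration only:

```latex
\mathrm{SNR} \approx \frac{\lambda_{\text{signal}}}{\sigma_{\text{noise}}}
\quad\Longrightarrow\quad
\sigma_{\text{noise}} \approx \frac{\lambda_{\text{signal}}}{\mathrm{SNR}}
= \frac{0.6}{0.12} = 5
```

Under that assumed definition, the noise at ε=1 is roughly eight times larger than the eigenvalues it is added to, which is consistent with the near-chance AUROC reported at that budget.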

Weaknesses:
1. Extremely narrow experimental scope: only one model (OPT-6.7B), one dataset (SQuAD v2.0), one K value (K=10), and only ε=1 for the Gaussian mechanism. The paper claims Wishart 'strictly dominates' Gaussian DP (Abstract, Section 4.5), but this claim rests on a single ε point. Figure 2a shows Wishart at ε=10 achieving 59.4% AUROC, still 4.3 points below the 63.7% baseline. Because no Gaussian results are reported beyond ε=1, the gap between mechanisms may narrow at higher ε, which would undermine the 'strict dominance' claim.
2. Weak threat model undermines the paper's premise: the canary-ID attack yields near-chance accuracy even without DP (0.74% vs 0.50%, Table 1), meaning the privacy problem the paper sets out to solve barely exists under this threat model. This raises the question of whether the entire DP application is motivated, since the attack surface is essentially nil. The paper acknowledges this but does not explore stronger threat models (e.g., membership inference with known training data, embedding inversion attacks) that might show real leakage.
3. Over-packaging of a limited contribution: the title includes 'FARS' (an acronym for the generating system, not a method), and the paper bills itself as 'the first comparative study' when the comparison is straightforward: Wishart preserves PSD structure, Gaussian does not. The core insight (PSD projection destroys roughly 5 of the 10 eigenvalues at ε=1, Table 1 discussion) is well-known from the DP covariance literature (Jiang et al., 2015; Dong et al., 2022). The paper's main finding is essentially: 'applying standard DP mechanisms to low-SNR statistics doesn't work,' which is unsurprising.
4. No statistical significance tests: Table 1 reports ±standard deviations but no confidence intervals or significance tests for the key AUROC comparison (55.1±4.7 vs 50.7±2.4). The one-standard-deviation ranges overlap (50.4 to 59.8 vs 48.3 to 53.1), so the 4.4-point difference may not be statistically significant. This is a critical gap for a paper whose main claim is Wishart superiority.
5. Missing Gaussian mechanism across ε values: Figure 2 only shows Wishart across ε∈{0.5,1,2,5,10}. Gaussian DP is reported only at ε=1 (Table 1). Without Gaussian results at multiple ε values, the reader cannot assess whether the Wishart advantage persists or diminishes at different privacy budgets, making the 'strict dominance' claim unsupported.
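The significance concern in weakness 4 can be roughly checked from the summary statistics alone. The sketch below computes Welch's unequal-variance t statistic and its Welch-Satterthwaite degrees of freedom from the Table 1 means and standard deviations. The number of runs per mechanism is not stated in this review, so `n_runs` is a hypothetical parameter, and the helper name `welch_t` is illustrative. The test also treats the two sets of runs as independent, which may not match the paper's setup.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, df) for Welch's unequal-variance t-test from summary stats."""
    # Squared standard errors of each sample mean.
    se2_a = sd_a ** 2 / n_a
    se2_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Means and standard deviations are the Table 1 values quoted above.
# n_runs is NOT given in the review; these values are hypothetical.
for n_runs in (3, 5, 10):
    t, df = welch_t(55.1, 4.7, n_runs, 50.7, 2.4, n_runs)
    print(f"n_runs={n_runs}: t={t:.2f}, df={df:.1f}")
```

The resulting t and df can be compared against a t-distribution table or passed to a statistics library to obtain a p-value. If the two mechanisms were evaluated on the same seeds or splits, a paired test on per-run differences would be the more appropriate choice.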

Must Fix Items:
1. Add statistical significance tests (e.g., paired t-test or bootstrap CI) for the key AUROC comparison between Wishart and Gaussian at ε=1, given overlapping error bars (55.1±4.7 vs 50.7±2.4).
2. Report Gaussian mechanism results across the same ε range as Wishart (ε∈{0.5,1,2,5,10}) to substantiate or weaken the 'strict dominance' claim.
3. Test at least one additional model (e.g., LLaMA-7B) or dataset to assess generalizability of findings, given that all results come from a single model-dataset combination.
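For must-fix item 1, a paired bootstrap confidence interval could look like the following minimal sketch. It assumes the authors have per-run AUROC values for both mechanisms evaluated on matched seeds or splits. The function name `paired_bootstrap_ci` and the score lists are hypothetical placeholders, not results from the paper.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Placeholder per-run AUROCs (made up for illustration; NOT from the paper).
wishart_auroc = [0.56, 0.49, 0.61, 0.52, 0.58]
gaussian_auroc = [0.51, 0.50, 0.53, 0.47, 0.52]

mean_diff, lower, upper = paired_bootstrap_ci(wishart_auroc, gaussian_auroc)
print(f"mean diff={mean_diff:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would support the claimed Wishart advantage at that ε. With only a handful of runs the percentile bootstrap is coarse, so reporting the number of runs alongside the interval would help readers judge it.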

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None