Title: SILENCE-CONDITIONAL OUTPUT SUPPRESSION FOR TRAINING-FREE WHISPER HALLUCINATION MITIGATION FARS
PDF: whisper-calm-nospeech-probe.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.80
Elapsed: 44.8s

Strengths:
1. The method is genuinely training-free and inference-time only, requiring no fine-tuning, external VAD, or model modification — just a threshold check on p_no_speech already computed during standard Whisper inference. This is a practical and deployable contribution (Section 3.2, steps 1–4).
2. Clean and honest experimental reporting: the paper transparently reports that a 60.1% hallucination rate remains (not over-claiming), provides a per-class breakdown revealing failure modes (Table 2), and includes false positive rates on LibriSpeech (Table 1). The 0.19% FP rate on test-clean and 0% on test-other is a concrete and meaningful result.
3. The ablation study (Section 4.4) cleanly disentangles the suppression policy from decoder head masking, showing Skip-Only produces identical results to the full method at τ=0.6. This eliminates a plausible confound and strengthens the causal claim that p_no_speech is the effective mechanism, not head masking.
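
For reference, the suppression policy discussed in Strengths 1 and 3 can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `apply_silence_suppression`, the strict `>` comparison, and the default τ=0.6 (the value mentioned for the ablation) are assumptions, since the exact rule in Section 3.2 is not reproduced in this review.

```python
def apply_silence_suppression(transcript: str, p_no_speech: float, tau: float = 0.6) -> str:
    """Return an empty transcript when the no-speech probability exceeds tau.

    Hypothetical helper: assumes p_no_speech has already been read from the
    model's standard inference pass and lies in [0, 1].
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if not 0.0 <= p_no_speech <= 1.0:
        raise ValueError("p_no_speech must lie in [0, 1]")
    # Assumed rule: suppress all output when the model itself signals non-speech.
    return "" if p_no_speech > tau else transcript


# Illustrative inputs only; these are not values from the paper.
print(repr(apply_silence_suppression("thank you for watching", p_no_speech=0.92)))
print(repr(apply_silence_suppression("hello world", p_no_speech=0.03)))
```

Under this reading, the Skip-Only condition corresponds to applying only this check, with no decoder head masking.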

Weaknesses:
1. The core idea is extremely simple (a single if-then threshold on a signal the model already produces), and the paper provides neither an analysis of why p_no_speech fails on speech-like sounds nor any theoretical or empirical characterization of the p_no_speech distribution. The contribution is essentially: read a pre-computed probability and threshold it. This is closer to an engineering note than a research contribution (Section 3.2).
2. The hallucination reduction is modest: from 100% to 60.1% at τ=0.3, meaning about three in five non-speech clips still hallucinate. The method almost entirely fails on speech-like environmental sounds (trigger rate of 1.5% on street music and 3.9% on children playing; Table 2), which constitute a large and important subset of real-world non-speech audio. The paper acknowledges this limitation but does not explore solutions or even whether per-class thresholds could help.
3. Limited evaluation scope: only one model (Whisper-large-v3), one non-speech dataset (UrbanSound8K), and one speech dataset (LibriSpeech) are evaluated, with no actual silence recordings, music datasets, or diverse real-world scenarios. The UrbanSound8K classes are environmental sounds, so the method's behavior on pure silence is not tested at all (Section 4.1). Pure silence is arguably the most important case for hallucination mitigation, given the concern in Koenecke et al. (2024) about aphasia patients with longer non-vocal segments.
4. The Always-Mask baseline (Condition B) is poorly implemented or explained: the paper states it 'does not reduce hallucination rate in our pipeline, as the HuggingFace Transformers implementation always produces non-empty output regardless of head masking.' This means the baseline comparison against Wang et al. (2025) is uninformative — the paper is not actually reproducing Wang et al.'s result but reporting a broken implementation. This undermines the comparative claims (Table 1, Section 4.2).
5. No statistical significance testing: results are reported as single numbers without confidence intervals or significance tests. With 8,732 UrbanSound8K clips and ~2,600 LibriSpeech utterances, variance estimates are feasible and expected (Table 1).
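
To make Weakness 5 concrete, the following sketch computes a Wilson score interval for a reported rate. It assumes each clip is an independent Bernoulli trial; the function name `wilson_interval` is hypothetical, and the counts in the usage lines are illustrative approximations back-computed from the percentages and dataset sizes quoted above, not figures taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only (assumed, not reported in the paper):
# about 60.1% of 8,732 non-speech clips still hallucinating.
print(wilson_interval(5248, 8732))
# Zero false positives on an assumed ~2,600 speech utterances.
print(wilson_interval(0, 2600))
```

When zero events are observed, the upper bound reduces to z²/(n + z²), which is roughly 0.15% for n = 2,600. A reported 0% FP rate is therefore still compatible with a small nonzero true rate, which is why intervals matter for Table 1.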

Must Fix Items:
1. Evaluate on pure silence / no-audio inputs, the most important use case given the motivation around aphasia patients with non-vocal segments.
2. Report confidence intervals or statistical significance for all metrics in Table 1.
3. Properly implement or explain the Always-Mask baseline so the comparison to Wang et al. (2025) is meaningful: either reproduce their exact setup or clearly explain why results differ.

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.8 error=None