{
  "pdf": "whisper-calm-nospeech-probe.pdf",
  "title": "SILENCE-CONDITIONAL OUTPUT SUPPRESSION FOR TRAINING-FREE WHISPER HALLUCINATION MITIGA-TION FARS",
  "elapsed": 44.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.8,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 4
  },
  "strengths": [
    "The method is genuinely training-free and inference-time only, requiring no fine-tuning, external VAD, or model modification — just a threshold check on p_no_speech already computed during standard Whisper inference. This is a practical and deployable contribution (Section 3.2, steps 1–4).",
    "Clean and honest experimental reporting: the paper transparently reports that 60.1% hallucination rate remains (not over-claiming), provides per-class breakdown revealing failure modes (Table 2), and includes false positive rates on LibriSpeech (Table 1). The 0.19% FP rate on test-clean and 0% on test-other is a concrete and meaningful result.",
    "The ablation study (Section 4.4) cleanly disentangles the suppression policy from decoder head masking, showing Skip-Only produces identical results to the full method at τ=0.6. This eliminates a plausible confound and strengthens the causal claim that p_no_speech is the effective mechanism, not head masking."
  ],
  "weaknesses": [
    "The core idea is extremely simple — a single if-then threshold on an already-existing model signal — and the paper does not provide any analysis of why p_no_speech fails on speech-like sounds, nor any theoretical or empirical characterization of the p_no_speech distribution. The contribution is essentially: read a pre-computed logit and threshold it. This is closer to an engineering note than a research contribution (Section 3.2).",
    "The hallucination reduction is modest: from 100% to 60.1% at τ=0.3, meaning nearly 2/3 of non-speech clips still hallucinate. The method completely fails on speech-like environmental sounds (1.5% trigger on street music, 3.9% on children playing — Table 2), which constitute a large and important subset of real-world non-speech audio. The paper acknowledges this limitation but does not explore solutions or even whether per-class thresholds could help.",
    "Limited evaluation scope: only one model (Whisper-large-v3), one non-speech dataset (UrbanSound8K), and one speech dataset (LibriSpeech). No evaluation on actual silence recordings, music datasets, or diverse real-world scenarios. The UrbanSound8K classes are environmental sounds — the method's behavior on pure silence (which is arguably the most important case for hallucination mitigation, per Koenecke et al. 2024's concern about aphasia patients with longer non-vocal segments) is not tested at all (Section 4.1).",
    "The Always-Mask baseline (Condition B) is poorly implemented or explained: the paper states it 'does not reduce hallucination rate in our pipeline, as the HuggingFace Transformers implementation always produces non-empty output regardless of head masking.' This means the baseline comparison against Wang et al. (2025) is uninformative — the paper is not actually reproducing Wang et al.'s result but reporting a broken implementation. This undermines the comparative claims (Table 1, Section 4.2).",
    "No statistical significance testing: results are reported as single numbers without confidence intervals or significance tests. With 8,732 UrbanSound8K clips and ~2,600 LibriSpeech utterances, variance estimates are feasible and expected (Table 1)."
  ],
  "must_fix_items": [
    "Evaluate on pure silence / no-audio inputs — the most important use case given the motivation around aphasia patients with non-vocal segments.",
    "Report confidence intervals or statistical significance for all metrics in Table 1.",
    "Properly implement or explain the Always-Mask baseline so the comparison to Wang et al. (2025) is meaningful — either reproduce their exact setup or clearly explain why results differ."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.8,
      "strengths": [
        "The method is genuinely training-free and inference-time only, requiring no fine-tuning, external VAD, or model modification — just a threshold check on p_no_speech already computed during standard Whisper inference. This is a practical and deployable contribution (Section 3.2, steps 1–4).",
        "Clean and honest experimental reporting: the paper transparently reports that 60.1% hallucination rate remains (not over-claiming), provides per-class breakdown revealing failure modes (Table 2), and includes false positive rates on LibriSpeech (Table 1). The 0.19% FP rate on test-clean and 0% on test-other is a concrete and meaningful result.",
        "The ablation study (Section 4.4) cleanly disentangles the suppression policy from decoder head masking, showing Skip-Only produces identical results to the full method at τ=0.6. This eliminates a plausible confound and strengthens the causal claim that p_no_speech is the effective mechanism, not head masking."
      ],
      "weaknesses": [
        "The core idea is extremely simple — a single if-then threshold on an already-existing model signal — and the paper does not provide any analysis of why p_no_speech fails on speech-like sounds, nor any theoretical or empirical characterization of the p_no_speech distribution. The contribution is essentially: read a pre-computed logit and threshold it. This is closer to an engineering note than a research contribution (Section 3.2).",
        "The hallucination reduction is modest: from 100% to 60.1% at τ=0.3, meaning nearly 2/3 of non-speech clips still hallucinate. The method completely fails on speech-like environmental sounds (1.5% trigger on street music, 3.9% on children playing — Table 2), which constitute a large and important subset of real-world non-speech audio. The paper acknowledges this limitation but does not explore solutions or even whether per-class thresholds could help.",
        "Limited evaluation scope: only one model (Whisper-large-v3), one non-speech dataset (UrbanSound8K), and one speech dataset (LibriSpeech). No evaluation on actual silence recordings, music datasets, or diverse real-world scenarios. The UrbanSound8K classes are environmental sounds — the method's behavior on pure silence (which is arguably the most important case for hallucination mitigation, per Koenecke et al. 2024's concern about aphasia patients with longer non-vocal segments) is not tested at all (Section 4.1).",
        "The Always-Mask baseline (Condition B) is poorly implemented or explained: the paper states it 'does not reduce hallucination rate in our pipeline, as the HuggingFace Transformers implementation always produces non-empty output regardless of head masking.' This means the baseline comparison against Wang et al. (2025) is uninformative — the paper is not actually reproducing Wang et al.'s result but reporting a broken implementation. This undermines the comparative claims (Table 1, Section 4.2).",
        "No statistical significance testing: results are reported as single numbers without confidence intervals or significance tests. With 8,732 UrbanSound8K clips and ~2,600 LibriSpeech utterances, variance estimates are feasible and expected (Table 1)."
      ],
      "must_fix_items": [
        "Evaluate on pure silence / no-audio inputs — the most important use case given the motivation around aphasia patients with non-vocal segments.",
        "Report confidence intervals or statistical significance for all metrics in Table 1.",
        "Properly implement or explain the Always-Mask baseline so the comparison to Wang et al. (2025) is meaningful — either reproduce their exact setup or clearly explain why results differ."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 4
      }
    }
  ]
}