Title: PREMATURE SPEECH EOS IS NOT A DOMINANT FAILURE MODE IN QWEN2.5-OMNI: AN EMPIRICAL
PDF: exomni-longform-speech-consistency.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 47.0s

Strengths:
1. Honest reporting of a negative result: The paper transparently reports that TLC-AS degrades rather than improves performance (WER increases from 5.86% to 9.05%), which is a valuable contribution to the community by preventing others from pursuing this ineffective intervention (Table 1).
2. Well-designed ablation structure with three conditions (C0, C1, C2) that cleanly isolates the effect of cap raising versus EOS suppression, allowing clear attribution of whether premature EOS or hard-cap truncation is the issue (Section 3.3).
3. Bucketed analysis by text length (Table 2) provides nuance, showing that even in the Long bucket (120-160 words, n=23) where TLC-AS should theoretically help most, it still degrades WER from 7.36% to 9.95%, strengthening the negative conclusion.

Weaknesses:
1. Extremely narrow scope: Only one model (Qwen2.5-Omni-7B) and one benchmark (VoiceBench CommonEval, 200 samples) are tested. The paper's title makes a strong general claim ('Premature Speech EOS Is NOT A Dominant Failure Mode') but the evidence only covers a single model-dataset pair. Other architectures (Moshi, GLM-4-Voice, LLaMA-Omni) where the Thinker-Talker coupling does not exist are not evaluated at all (Section 4.1).
2. Insufficient sample size for the most relevant buckets: The Very Long bucket has only n=3 samples, and the Long bucket has only n=23 samples. The paper acknowledges this for Very Long ('limited sample size makes conclusions unreliable') but does not address statistical power for the Long bucket, where only 1 early-stop instance is observed (Table 2). With such small n, both the 0.5% early-stop rate and the 4.35% rate in the Long bucket (1 of 23) are unreliable estimates.
3. No statistical significance testing: WER differences between conditions are reported without confidence intervals or significance tests. The mean WER increase from 5.86% to 9.05% is presented as a key finding, but without the variance or a proper test (e.g., paired bootstrap on per-sample WER), it is unclear whether this difference is statistically meaningful. This is a serious methodological gap for a study whose core contribution is an empirical finding.
4. Over-packaged contribution: TLC-AS is a straightforward decode-time intervention (mask EOS until a token floor is reached), yet it is presented with a named acronym, a dedicated method section, and a formula (Equation 1) for what is essentially `ceil(word_count / wps * token_rate)`. The actual contribution is the negative finding, not the method, but the paper structure foregrounds the method as if it were novel. The calibration procedure (20-sample median WPS) is also underspecified: there is no analysis of calibration stability or of sensitivity to this parameter.
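
The small-sample concern in Weakness 2 can be made concrete with a binomial confidence interval. The sketch below computes a Wilson score interval for an observed rate. The function name `wilson_interval` and the choice of the Wilson method are illustrative assumptions, not something the paper reports; the only inputs taken from the review are the counts 1 and 23.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Long bucket as described in the review: 1 early stop out of 23 samples.
low, high = wilson_interval(1, 23)
print(f"early-stop rate 1/23: 95% interval = [{low:.3f}, {high:.3f}]")
```

For 1 of 23, the interval spans roughly 0.8% to 21%, which illustrates why a point estimate of 4.35% carries little information at this sample size.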

Must Fix Items:
1. Add statistical significance tests or confidence intervals for all reported WER differences, especially the key C1 vs C2 comparison.
2. Broaden evaluation to at least one additional omni-modal model without a Thinker-Talker architecture (e.g., Moshi or GLM-4-Voice) to justify the general claim in the title, or significantly narrow the title and claims.
3. Report per-sample WER distributions or variance measures, not just mean/median, to allow readers to assess the reliability of the 5.86%→9.05% increase.
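
As a minimal sketch of what Must Fix items 1 and 3 ask for, a paired bootstrap over per-sample WER could look like the following. The function name `paired_bootstrap_ci`, its parameters, and the placeholder numbers are hypothetical; the per-sample values would come from the authors' own evaluation outputs. Note that this resamples the mean of per-sample WER, which is not identical to corpus-level WER.

```python
import random


def paired_bootstrap_ci(wer_a, wer_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (b - a)."""
    if len(wer_a) != len(wer_b) or not wer_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    # Pairing: both conditions are scored on the same utterance.
    diffs = [b - a for a, b in zip(wer_a, wer_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


# Placeholder per-sample WERs (NOT data from the paper), same utterance order.
wer_c1 = [0.00, 0.05, 0.10, 0.02, 0.08, 0.00]
wer_c2 = [0.02, 0.05, 0.18, 0.01, 0.15, 0.03]
diff, lo, hi = paired_bootstrap_ci(wer_c1, wer_c2, n_resamples=2000)
print(f"mean diff = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

If the resulting interval excludes zero, the degradation is unlikely to be a sampling artifact. Reporting the interval alongside the per-sample differences would address both items at once.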

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None