{
  "pdf": "exomni-longform-speech-consistency.pdf",
  "title": "PREMATURE SPEECH EOS IS NOT A DOMINANT FAILURE MODE IN QWEN2.5-OMNI: AN EMPIRICAL",
  "elapsed": 47,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Honest reporting of a negative result: The paper transparently reports that TLC-AS degrades rather than improves performance (WER increases from 5.86% to 9.05%), which is valuable to the community because it spares others from pursuing this ineffective intervention (Table 1).",
    "Well-designed ablation structure with three conditions (C0, C1, C2) that cleanly isolates the effect of cap raising versus EOS suppression, allowing clear attribution of whether premature EOS or hard-cap truncation is the issue (Section 3.3).",
    "Bucketed analysis by text length (Table 2) provides nuance, showing that even in the Long bucket (120-160 words, n=23) where TLC-AS should theoretically help most, it still degrades WER from 7.36% to 9.95%, strengthening the negative conclusion."
  ],
  "weaknesses": [
    "Extremely narrow scope: Only one model (Qwen2.5-Omni-7B) and one benchmark (VoiceBench CommonEval, 200 samples) are tested. The paper's title makes a strong general claim ('Premature Speech EOS Is NOT A Dominant Failure Mode') but the evidence only covers a single model-dataset pair. Other architectures (Moshi, GLM-4-Voice, LLaMA-Omni) where the Thinker-Talker coupling does not exist are not evaluated at all (Section 4.1).",
    "Insufficient sample size for the most relevant buckets: The Very Long bucket has only n=3 samples, and the Long bucket has only n=23 samples. The paper acknowledges this for Very Long ('limited sample size makes conclusions unreliable') but does not address the statistical power issue for the Long bucket, where only 1 early-stop instance is observed (Table 2). With such small n, the 0.5% early-stop rate and the 4.35% rate in the Long bucket are unreliable estimates.",
    "No statistical significance testing: WER differences between conditions are reported without confidence intervals or significance tests. The mean WER increase from 5.86% to 9.05% is presented as a key finding, but without knowing the variance or conducting a proper test (e.g., paired bootstrap on per-sample WER), it is unclear whether this difference is statistically meaningful. This is a serious methodological gap for a study whose core contribution is an empirical finding.",
    "Over-packaged contribution: TLC-AS is a straightforward decode-time intervention (mask EOS until a token floor is reached), yet it is presented with a named acronym, a dedicated method section, and a formula (Equation 1) for what is essentially `ceil(word_count / wps * token_rate)`. The actual contribution is the negative finding, not the method, but the paper structure foregrounds the method as if it were novel. The calibration procedure (20-sample median WPS) is also underspecified: there is no analysis of calibration stability or sensitivity to this parameter."
  ],
  "must_fix_items": [
    "Add statistical significance tests or confidence intervals for all reported WER differences, especially the key C1 vs C2 comparison.",
    "Broaden evaluation to at least one additional omni-modal model without a Thinker-Talker architecture (e.g., Moshi or GLM-4-Voice) to justify the general claim in the title, or significantly narrow the title and claims.",
    "Report per-sample WER distributions or variance measures, not just mean/median, to allow readers to assess the reliability of the 5.86%→9.05% increase."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Honest reporting of a negative result: The paper transparently reports that TLC-AS degrades rather than improves performance (WER increases from 5.86% to 9.05%), which is valuable to the community because it spares others from pursuing this ineffective intervention (Table 1).",
        "Well-designed ablation structure with three conditions (C0, C1, C2) that cleanly isolates the effect of cap raising versus EOS suppression, allowing clear attribution of whether premature EOS or hard-cap truncation is the issue (Section 3.3).",
        "Bucketed analysis by text length (Table 2) provides nuance, showing that even in the Long bucket (120-160 words, n=23) where TLC-AS should theoretically help most, it still degrades WER from 7.36% to 9.95%, strengthening the negative conclusion."
      ],
      "weaknesses": [
        "Extremely narrow scope: Only one model (Qwen2.5-Omni-7B) and one benchmark (VoiceBench CommonEval, 200 samples) are tested. The paper's title makes a strong general claim ('Premature Speech EOS Is NOT A Dominant Failure Mode') but the evidence only covers a single model-dataset pair. Other architectures (Moshi, GLM-4-Voice, LLaMA-Omni) where the Thinker-Talker coupling does not exist are not evaluated at all (Section 4.1).",
        "Insufficient sample size for the most relevant buckets: The Very Long bucket has only n=3 samples, and the Long bucket has only n=23 samples. The paper acknowledges this for Very Long ('limited sample size makes conclusions unreliable') but does not address the statistical power issue for the Long bucket, where only 1 early-stop instance is observed (Table 2). With such small n, the 0.5% early-stop rate and the 4.35% rate in the Long bucket are unreliable estimates.",
        "No statistical significance testing: WER differences between conditions are reported without confidence intervals or significance tests. The mean WER increase from 5.86% to 9.05% is presented as a key finding, but without knowing the variance or conducting a proper test (e.g., paired bootstrap on per-sample WER), it is unclear whether this difference is statistically meaningful. This is a serious methodological gap for a study whose core contribution is an empirical finding.",
        "Over-packaged contribution: TLC-AS is a straightforward decode-time intervention (mask EOS until a token floor is reached), yet it is presented with a named acronym, a dedicated method section, and a formula (Equation 1) for what is essentially `ceil(word_count / wps * token_rate)`. The actual contribution is the negative finding, not the method, but the paper structure foregrounds the method as if it were novel. The calibration procedure (20-sample median WPS) is also underspecified: there is no analysis of calibration stability or sensitivity to this parameter."
      ],
      "must_fix_items": [
        "Add statistical significance tests or confidence intervals for all reported WER differences, especially the key C1 vs C2 comparison.",
        "Broaden evaluation to at least one additional omni-modal model without a Thinker-Talker architecture (e.g., Moshi or GLM-4-Voice) to justify the general claim in the title, or significantly narrow the title and claims.",
        "Report per-sample WER distributions or variance measures, not just mean/median, to allow readers to assess the reliability of the 5.86%→9.05% increase."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}