Title: STEP-DOWN BRIDGE GUIDANCE SCHEDULING FOR DUAL-CFG IN VIDEO-AUDIO DIFFUSION FARS Analemma
PDF: mova-bridge-guidance-schedule.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 49.7s

Strengths:
1. The paper provides a mechanistic motivation via norm analysis of bridge vs. text guidance terms (Section 3.2, Figure 2, Figure 3), showing the bridge-to-text ratio increases from ~1.04 to ~1.47 across denoising steps. This is a concrete, quantitative observation that grounds the scheduling proposal rather than relying on hand-waving.
2. The Step-Up control experiment (Table 1, Section 4.3) is a well-designed ablation that demonstrates timing matters: using the same guidance values in reversed order yields 1.9% worse WER. This rules out the trivial explanation that the improvement comes from merely averaging different guidance values.
3. The method is training-free and adds negligible computational overhead (a single cosine evaluation per step, Section 3.4), making it immediately applicable to existing dual-CFG models without retraining. This is a practical contribution with low adoption barrier.
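
For orientation, the schedule under review can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's implementation. It uses the hyperparameters quoted in this review (k*=12, s_high=3.5, s_low=1.5), assumes a hard switch at step k*, assumes 0-indexed steps, and uses a hypothetical total of 20 denoising steps. The cosine term mentioned in Section 3.4 is not reproduced, because this review does not state its form. The Step-Up control is modelled as the same sequence reversed, following the description of Table 1.

```python
def step_down_scale(k, k_star=12, s_high=3.5, s_low=1.5):
    """Bridge guidance scale at denoising step k.

    Assumption: a hard switch from s_high to s_low at step k_star,
    with k counted from 0. The paper's exact transition may differ.
    """
    return s_high if k < k_star else s_low


NUM_STEPS = 20  # hypothetical step count, not taken from the paper

# Step-Down: strong bridge guidance early, weaker guidance late.
step_down = [step_down_scale(k) for k in range(NUM_STEPS)]

# Step-Up control: the same values applied in reversed order.
step_up = step_down[::-1]

if __name__ == "__main__":
    print("step-down:", step_down)
    print("step-up:  ", step_up)
```

Both schedules contain the same multiset of guidance values, which is the property the Step-Up ablation relies on to isolate the effect of timing.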

Weaknesses:
1. The WER improvements are extremely small: the best reported gain over the constant baseline is 1.5% relative (1.123 vs. 1.140 in Table 1), an absolute reduction of 0.017 on an already very low WER (~1.1). A difference of this size is within the noise margin of ASR evaluation systems. No statistical significance tests (e.g., confidence intervals, paired t-tests, bootstrap) are reported to confirm that the differences are real rather than due to sampling variance. This triggers HF_NO_SIGNIFICANCE concerns.
2. The evaluation is narrow and potentially overfitted to a single model and benchmark. Only MOVA-360p is tested, on a single benchmark (Verse-Bench set3), with a single seed (seed 42, Section 4.1). The hyperparameters (k*=12, s_high=3.5, s_low=1.5) appear hand-tuned to this specific setup. No generalization to other dual-CFG models, other benchmarks, or non-speech prompts is demonstrated. The paper title claims general applicability to 'dual-CFG in video-audio diffusion', but the evidence covers only one model on one benchmark.
3. The norm analysis in Section 3.2 reports ratios that are inconsistent across the paper. The abstract says the ratio goes from 1.04 to 1.47; the Figure 3 caption says the ratio increases from ~0.35 to ~1.57, with a mean of 1.47 for k≥12; the main text says the ratio starts at ~0.5. These discrepancies in the early-step ratio (1.04 vs. 0.35 vs. 0.5) undermine confidence in the reported mechanistic analysis and suggest the numbers may be cherry-picked or poorly measured. Additionally, n=10 is a very small sample for a norm analysis.

Must Fix Items:
1. Add statistical significance tests for all reported WER, AV-A, and LSE-C comparisons. With only 100 prompts and absolute WER differences of 0.017, it is essential to show these are not due to chance. Report confidence intervals or p-values.
2. Resolve the inconsistent norm ratio values reported across the abstract (1.04), Figure 3 caption (~0.35), and main text (~0.5). These cannot all be correct for the same measurement.
3. Evaluate on at least one additional model or benchmark to support the general claim in the title. Currently the evidence supports only 'Step-Down scheduling for MOVA-360p on Verse-Bench speech prompts,' not the broader claim of applicability to dual-CFG video-audio diffusion in general.
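
One way to address item 1 is a paired bootstrap over prompts, sketched below. This is an illustration under stated assumptions, not a prescribed protocol. It assumes per-prompt metric values are available for both schedules on the same 100 prompts, and that the mean of per-prompt values is an acceptable summary (a corpus-level WER would need the resampling applied to error and word counts instead). The function name `paired_bootstrap_ci` and the synthetic inputs are hypothetical placeholders, not results from the paper.

```python
import random


def paired_bootstrap_ci(baseline, method, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (baseline - method).

    Prompts are resampled with replacement; each resample keeps the
    baseline/method pairing for a prompt intact.
    """
    if len(baseline) != len(method):
        raise ValueError("baseline and method must cover the same prompts")
    rng = random.Random(seed)
    diffs = [b - m for b, m in zip(baseline, method)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    # Synthetic placeholder scores for 100 prompts; NOT data from the paper.
    rng = random.Random(42)
    constant = [rng.gauss(1.14, 0.4) for _ in range(100)]
    step_down = [c - rng.gauss(0.017, 0.3) for c in constant]
    mean_diff, lower, upper = paired_bootstrap_ci(constant, step_down)
    print(f"mean diff = {mean_diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

If the resulting interval contains zero, the reported difference cannot be distinguished from sampling variance at the chosen level. The same routine applies to AV-A and LSE-C when per-prompt values are available.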

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None