Title: TIME-VARYING MUTUAL INFORMATION DECODING FOR MITIGATING VISUAL FORGETTING IN VISION-LANGUAGE MODELS FARS
PDF: mi-grounded-decoding-visual-forgetting.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 60.7s

Strengths:
1. The paper identifies and directly addresses a real phenomenon: visual forgetting during long chain-of-thought (CoT) generation in VLMs. The PDM-H trajectory analysis (Figure 2) shows that reliance on visual information decays over generation steps, which supports the problem motivation (Section 4.4).
2. The proposed method is training-free and requires no input modification, making it practical to deploy. The dual-forward-pass design with confidence-based gating (Equation 4) is a clean and interpretable mechanism that avoids interfering with high-confidence predictions (Section 3.4).
3. The method transfers across model types. Improvements are shown on both a reasoning-specialized model (VLAA-Thinker-7B, +1.51pp on HallusionBench) and an instruction-tuned model (Qwen2.5-VL-7B-Instruct, +3.34pp on an MMStar subset), suggesting the approach is not tied to a single model (Tables 1, 3).

Weaknesses:
1. The absolute improvements are very small: +1.51pp on HallusionBench (66.25→67.76) and only +0.62pp over visual replay. The MMStar score actually decreases slightly (62.13→62.07, roughly one question on the 1500-item set). Margins this small are plausibly within sampling noise for benchmarks of this size, and no statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests) are reported to rule out random variation (Tables 1, 2).
2. The computational overhead of dual forward passes (~2× inference cost) is acknowledged but inadequately analyzed. No wall-clock timing measurements are provided, and the trade-off between 2× cost and marginal accuracy gains is not quantified, making it difficult to assess practical value (Section 5).
3. The ablation study is weak and inconsistent with the main results. The ablation in Table 4 uses a 200-item MMStar subset (reporting 64.0% and 65.0%), while the main results in Table 1 use the full 1500-item MMStar (reporting 62.13% and 62.07%), so the two sets of numbers are not directly comparable. In addition, the adaptive MI schedule underperforms fixed-gamma on HallusionBench (67.85 vs 68.11), undermining the claim that the time-varying schedule is superior (Table 4).
4. Key hyperparameters (λ=0.005, α=0.8, t0=0, γmax=5.0) are selected via grid search on a held-out validation set, but no sensitivity analysis is provided. The method's dependence on these four hyperparameters raises concerns about fragility and reproducibility, since small changes could flip the already marginal gains (Section 4.1).
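
A rough check of the noise concern in Weakness 1, using only the numbers quoted above. This is a minimal sketch that assumes each benchmark question is an independent Bernoulli trial (a reviewer-side simplification, not something the paper states); the helper name `binomial_se_pp` is hypothetical.

```python
import math

def binomial_se_pp(accuracy_pct: float, n_items: int) -> float:
    """Standard error of an accuracy estimate, in percentage points.

    Assumes independent per-question outcomes (a simplification).
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)

# Full MMStar as quoted in the review: 1500 items, baseline 62.13%.
se_full = binomial_se_pp(62.13, 1500)

# Ablation subset as quoted in the review: 200 items, 64.0%.
se_subset = binomial_se_pp(64.0, 200)

# How many questions the 62.13 -> 62.07 change corresponds to on 1500 items.
questions_flipped = round((62.13 - 62.07) / 100.0 * 1500)

print(f"SE on full set:   {se_full:.2f}pp")
print(f"SE on 200 subset: {se_subset:.2f}pp")
print(f"Questions behind the MMStar change: {questions_flipped}")
```

Under this assumption the standard error is about 1.25pp on the full set and about 3.39pp on the 200-item subset, so both the one-question MMStar change and the 1.0pp gap between the subset scores fall well inside one standard error. An unpaired standard error ignores the fact that both systems answer the same questions, so a paired test on per-question outcomes remains the appropriate check.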

Must Fix Items:
1. Add statistical significance tests for all reported improvements (bootstrap CI or McNemar's test on per-question accuracy) to confirm that the marginal gains are not noise.
2. Provide a proper ablation with consistent evaluation sets: the Table 4 ablation uses a 200-item MMStar subset while Table 1 uses the full 1500-item benchmark, so the two cannot be compared. Re-run the ablations on the full benchmark, or clearly explain why a subset was used.
3. Add hyperparameter sensitivity analysis for λ, α, t0, and γmax to demonstrate the method is not fragile to these choices.
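
The tests requested in Must Fix item 1 can be sketched as follows. This is an illustrative standard-library implementation of an exact McNemar test and a paired percentile bootstrap, not the authors' evaluation code; the function names and the toy correctness vectors are hypothetical, and real use would require per-question correctness flags for the baseline and the proposed method on the same items.

```python
import math
import random

def mcnemar_exact(baseline_correct, method_correct):
    """Two-sided exact McNemar p-value from paired per-question correctness."""
    pairs = list(zip(baseline_correct, method_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, method wrong
    c = sum(1 for x, y in pairs if not x and y)  # baseline wrong, method right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Binomial(n, 0.5) tail probability up to the smaller discordant count.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

def paired_bootstrap_ci(baseline_correct, method_correct,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference, in percentage points."""
    rng = random.Random(seed)
    # Per-question difference: +1 method gains, -1 method loses, 0 tie.
    diffs = [int(y) - int(x) for x, y in zip(baseline_correct, method_correct)]
    n = len(diffs)
    means = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic toy data for illustration only (not from the paper).
toy_baseline = [True] * 6 + [False] * 4
toy_method = [True] * 5 + [False] + [True] * 2 + [False] * 2

p_value = mcnemar_exact(toy_baseline, toy_method)
ci = paired_bootstrap_ci(toy_baseline, toy_method)
print(f"McNemar exact p = {p_value:.3f}, 95% CI for gain = [{ci[0]:.1f}, {ci[1]:.1f}]pp")
```

Both procedures operate on paired outcomes, so they account for the two systems being evaluated on the same questions. A reported gain would be convincing only if the confidence interval excludes zero or the McNemar p-value is small.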

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None