Title: CONVERGESTOP: INFERENCE-TIME CONVERGENCE-BASED HALTING FOR GENERATIVE TEXT EMBEDDINGS FARS
PDF: convergence-stopping-generative-embeddings.pdf
Score: 3.4
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 85.7s

Strengths:
1. Training-free approach: ConvergeStop requires no additional training or model modification, making it immediately applicable to any generative embedding model. This is a practical advantage over learned halting mechanisms like ACT (Graves, 2016) or PonderNet (Banino et al., 2021). (Section 3.3, Equation 2)
2. Above the fixed-K Pareto frontier: ConvergeStop outperforms compute-matched fixed-K baselines on both datasets (+0.20 nDCG@10 on SciFact at a K=9-equivalent budget, +0.53 on FiQA2018 at a K=19-equivalent budget), suggesting that adaptive halting provides value beyond uniformly reducing iterations. (Table 1, Section 4.3)
3. Premise validation is methodologically sound: The paper validates both premises (embedding convergence with Spearman ρ=1.0 median, ranking stability with Jaccard@10) before evaluating the method, establishing that the approach is grounded in observable phenomena. (Section 4.2)
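
For reference, the ranking-stability check named in item 3 can be sketched as follows. This is a minimal illustration, assuming Jaccard@10 compares the set of top-10 document IDs retrieved at an intermediate depth against the set retrieved at the final depth; the paper's exact definition is not quoted in this review, and the function name and document IDs are hypothetical.

```python
from typing import Sequence


def jaccard_at_k(ranking_a: Sequence[str], ranking_b: Sequence[str], k: int = 10) -> float:
    """Jaccard overlap between the top-k items of two rankings (order within top-k is ignored)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        # Two empty rankings are treated as identical (an assumption of this sketch).
        return 1.0
    return len(top_a & top_b) / len(union)


# Hypothetical rankings: the early-depth top-10 differs from the final top-10 by one document.
early = [f"d{i}" for i in range(10)]      # d0..d9
final = [f"d{i}" for i in range(1, 11)]   # d1..d10
overlap = jaccard_at_k(early, final)      # 9 shared documents out of 11 distinct ones
```

A value near 1.0 at an early depth would support the ranking-stability premise; values well below 1.0 would indicate that halting early changes the retrieved set.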

Weaknesses:
1. Extremely limited evaluation scope: Only 2 datasets are evaluated, both with tiny query sets (50 queries each, split 10 dev / 40 test). With 40 test queries, each query contributes up to 100/40 = 2.5 points to the mean nDCG@10, so a single query can swing the result by ~2.5 points. The reported differences (e.g., 78.33 vs 77.97 = +0.36 on SciFact) are therefore well within the noise margin. No error bars, confidence intervals, or statistical significance tests are reported. (Table 1, Section 4.1)
2. Minimal adaptive behavior: The halting depth distributions (Figure 3) are narrow and unimodal: 70% of SciFact queries halt at k*=9, 40% of FiQA2018 at k*=18. This means ConvergeStop is effectively selecting a near-fixed K per dataset rather than meaningfully adapting per-query. The Spearman correlation between k* and query difficulty is non-significant (ρ=0.13, p=0.42), confirming the method does not differentiate by input complexity. (Section 4.6, Figure 3)
3. Dataset-dependent savings undermine generality: The method achieves 55% savings on SciFact but only 7% on FiQA2018. The paper acknowledges this but does not investigate what dataset properties predict early convergence, leaving the practical applicability unclear. For datasets where embeddings converge late, ConvergeStop provides negligible savings. (Section 4.3, Section 5)
4. Threshold selection relies on a dev set with only 10 queries: Equation 3 describes grid search over r and p on the development split, but with only 10 queries, the selected threshold is almost certainly overfit to those specific queries. The paper does not analyze threshold sensitivity or stability across different dev splits. (Section 3.4, Section 4.1)
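
The sample-size concern in items 1 and 4 comes down to simple arithmetic, sketched below. The sketch assumes nDCG@10 is reported on a 0-100 scale (consistent with values such as 78.33) and that the reported score is an unweighted mean over queries; the function name is hypothetical.

```python
def max_single_query_swing(n_queries: int, scale: float = 100.0) -> float:
    """Largest change in the mean score if one query moves from the worst to the best value."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    return scale / n_queries


test_swing = max_single_query_swing(40)  # test split: 40 queries
dev_swing = max_single_query_swing(10)   # dev split: 10 queries

# Reported SciFact gap from Table 1, as quoted in item 1.
reported_gap = 78.33 - 77.97
gap_is_below_one_query = reported_gap < test_swing
```

On the 10-query dev split the same bound is four times larger, which is why a threshold tuned there is likely to be unstable.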

Must Fix Items:
1. Add statistical significance tests or confidence intervals for all reported nDCG@10 comparisons, given the extremely small test sets (40 queries). Without this, claims of 'quality parity' or 'outperforming baselines' are not justified.
2. Evaluate on additional datasets (at least 3-5 more from BEIR/MTEB) to establish whether the 55% savings on SciFact is representative or cherry-picked. The current 2-dataset evaluation is insufficient for a systems/efficiency paper.
3. Report threshold sensitivity analysis: how do results change with different τ values, and how stable is the dev-set-derived threshold across different random splits?
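
One way to address item 1 is a paired bootstrap over per-query scores. The sketch below is a suggestion from this review, not something taken from the paper: it assumes per-query nDCG@10 values are available for both systems on the same queries, and the function name, defaults, and example scores are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean per-query difference (a - b).

    Returns (observed_mean_diff, ci_low, ci_high).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    # Pair the systems query by query so that query difficulty cancels out.
    diffs: List[float] = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample queries with replacement and record the mean difference each time.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low_idx = int((alpha / 2) * n_resamples)
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(diffs) / n, means[low_idx], means[high_idx]


# Hypothetical per-query scores for 40 test queries (synthetic, not from the paper).
_rng = random.Random(1)
adaptive = [_rng.uniform(0.0, 100.0) for _ in range(40)]
fixed_k = [min(100.0, max(0.0, s + _rng.gauss(0.0, 5.0))) for s in adaptive]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(adaptive, fixed_k)
```

If the resulting interval contains zero, a claim of outperforming the baseline is not supported at that confidence level. Repeating the same procedure over several random 10/40 dev/test splits would also speak to the split-stability question in item 3.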

Runs:
- run=1 score=3.4 verdict=Strong Reject confidence=0.6 error=None