Title: QUOTEVERIFY: INFERENCE-TIME QUOTE-BACKED
PDF: quote-backed-citation-verification.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 49.4s

Strengths:
1. The paper identifies a real and practically important problem: citation quality in deep research agent outputs, where hallucinated references and fabricated quotes are common. The problem is well-motivated with concrete prior evidence (Li et al., 2025 reporting 27-79% match rates) — Section 1.
2. The pipeline is modular and requires no training, making it immediately applicable as a wrapper around any base report generator. This is a practical engineering contribution with clear deployment value — Section 3.
3. The paper is unusually honest about its own findings: it openly reports that the structured citation format (Prompt-Only) drives most gains, not the verification pipeline itself (+22.5pp from Standard→Prompt-Only, versus -3.8pp on GPT-4o and +0.9pp on Gemini-2.5-Pro from Prompt-Only→QuoteVerify) — Table 1 and Section 4.2. This transparency is commendable and rare.
4. The error analysis identifying quote validity as the primary bottleneck (18-28% valid quotes even with successful fetches) is a meaningful empirical finding that informs future work — Section 4.4, Figure 2.

Weaknesses:
1. The central empirical claim is misleading: the abstract and introduction tout '+18.7pp and +12.5pp improvements' for QuoteVerify, but Table 1 shows QuoteVerify actually performs *worse* than Prompt-Only on GPT-4o (53.1% vs 56.9%) and only marginally better on Gemini-2.5-Pro (39.2% vs 38.3%). The statistically significant improvements hold only against the Standard baseline, not against the fairer Prompt-Only comparison. This is a serious framing issue: the paper's title contribution (the verification pipeline) adds almost no value over simply asking the model to provide evidence quotes — Table 1.
2. Evaluation is conducted on only 20 prompts from ReportBench, an extremely small sample. Increasing the number of bootstrap resamples (10K here) does not compensate for having only 20 paired observations, so statistical power is limited and the results may not generalize. The authors acknowledge this is a 'pilot subset' but still present the findings as conclusive — Section 4.1.
3. The quality-coverage tradeoff is severe: QuoteVerify reduces reference recall by 62-66% and cited statements by 38-53% compared to Standard baselines. A citation verification system that removes the majority of citations is of questionable practical utility — the improved precision comes at an extreme coverage cost that is not adequately discussed as a limitation — Table 1, Section 4.2.
4. The ablation study (Table 2) reveals a paradox: removing the NLI entailment gate *improves* match rate by +5.7pp (58.8% vs 53.1%). The paper frames this as the gate 'preventing irrelevant quotes,' but the primary metric (Match Rate) worsens with the gate. This suggests the entailment gate is over-filtering and removing valid citations, undermining the pipeline's value — Table 2.
5. No statistical significance is reported for the comparison between QuoteVerify and Prompt-Only. The paper only tests significance vs. Standard. Given that Prompt-Only outperforms or matches QuoteVerify, this omission is conspicuous and suggests the authors may have cherry-picked the comparison that yields significance — Section 4.2.
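
The percentage-point gaps cited in weaknesses 1 and 4 can be re-derived from the match rates quoted in this review. The sketch below uses only those quoted figures; the dictionary layout and the helper name `delta_pp` are illustrative assumptions, not anything taken from the paper.

```python
# Match rates (%) exactly as quoted in this review from Table 1.
# The data layout and helper name are hypothetical, chosen for illustration.
reported = {
    "GPT-4o": {"Prompt-Only": 56.9, "QuoteVerify": 53.1},
    "Gemini-2.5-Pro": {"Prompt-Only": 38.3, "QuoteVerify": 39.2},
}


def delta_pp(new: float, old: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


# Gain (or loss) from adding the verification stages on top of Prompt-Only.
deltas = {
    model: delta_pp(rates["QuoteVerify"], rates["Prompt-Only"])
    for model, rates in reported.items()
}

# Table 2 ablation as quoted: match rate without the NLI gate vs. full pipeline.
gate_delta = delta_pp(58.8, 53.1)

for model, d in deltas.items():
    print(f"{model}: QuoteVerify vs Prompt-Only = {d:+.1f}pp")
print(f"Removing the NLI gate: {gate_delta:+.1f}pp")
```

Both the Prompt-Only comparison and the ablation point the same way: on GPT-4o, the configurations with less filtering score higher on Match Rate.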

Must Fix Items:
1. Reframe the main claim honestly: the primary finding is that structured citation prompts (requiring evidence quotes) improve match rates, while the verification pipeline itself provides marginal or negative additional benefit. The abstract and title overclaim the contribution of the verification stages.
2. Report statistical significance for QuoteVerify vs. Prompt-Only, not just vs. Standard. This is the critical comparison for assessing the pipeline's contribution beyond prompt engineering.
3. Address the severe recall/coverage tradeoff quantitatively: discuss whether a system that cuts reference recall by 62-66% and cited statements by 38-53% relative to Standard, in exchange for a match-rate gain of 12.5-18.7pp, is practically deployable, and propose concrete mitigations.
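
The comparison requested in item 2 could reuse the paired bootstrap that the paper already applies against the Standard baseline. What follows is a minimal sketch, assuming per-prompt match rates are available for both systems on the same prompts. The function name `paired_bootstrap`, its parameters, and the choice of a percentile interval are illustrative assumptions, not the authors' implementation.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap over prompts for the mean difference (A minus B).

    scores_a, scores_b: per-prompt match rates for two systems, aligned so
    that index i refers to the same prompt in both lists (an assumption).
    Returns (observed_mean_diff, (ci_low, ci_high), frac_le_zero).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")

    rng = random.Random(seed)
    n = len(scores_a)
    # Pairing: work with per-prompt differences so prompts are resampled jointly.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        # Resample prompts with replacement and recompute the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # 95% percentile interval over the resampled mean differences.
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]
    # Share of resamples in which A does not beat B (one-sided evidence).
    frac_le_zero = sum(m <= 0 for m in means) / n_resamples
    return observed, (ci_low, ci_high), frac_le_zero
```

Called with the 20 per-prompt match rates for QuoteVerify and Prompt-Only, an interval that straddles zero would indicate that the pipeline's contribution beyond the prompt format cannot be distinguished from noise at this sample size.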

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None