Title: ENTAILMENT-CHECKLIST SCORING: AN API-FREE ALTERNATIVE TO LLM-BASED DENSE VIDEO CAPTION EVALUATION
PDF: ff1e3cca-8067-42bb-92cb-a490034103e7.pdf
Score: 5.2
Verdict: Reject
Confidence: 0.78
Elapsed: 162.7s

Strengths:
1. Clear problem motivation with empirical evidence of failure modes: the paper shows quantitatively that BERTScore has a negative Spearman correlation (−0.052) and that embedding-only methods invert the system ranking (Kendall −1.0) despite a moderate Spearman correlation (0.409). This is a concrete, evidence-backed gap that justifies the work (Table 1, Section 4.2).
2. Honest and informative error analysis: the manual inspection of 150 disagreements (Section 4.4) finds that 77.3% stem from partial coverage and breaks the remaining errors into interpretable categories (world knowledge 6.7%, discourse reasoning 8.0%, stylistic paraphrases 8.0%). This is more informative than the common practice of omitting error analysis altogether.
3. Retrieve-then-verify pipeline yields genuine efficiency gains: Table 2 shows a 4.8× wall-clock speedup (1,861s vs 8,874s) and a 6.6× reduction in NLI passes (357K vs 2.36M) at a cost of only 0.04 absolute Spearman correlation (0.330 vs 0.290, roughly 12% in relative terms). The retrieval ablation is well designed and the cost-accuracy tradeoff is clearly quantified.
4. Correct system ranking achieved where alternatives fail: ECS is the only API-free method whose system-level ranking matches Gemini (Kendall +1.0, Table 1). This matters in practice, since a metric that inverts the system ordering is worse than useless. The result is subject to the N=2 caveat raised in Weakness 2.

Weaknesses:
1. Trivial core idea: using NLI for entailment verification is a direct application of existing tools, and the 'insight' that keypoint coverage = entailment is straightforward. The pipeline (bge-m3 retrieval + DeBERTa NLI) assembles two off-the-shelf models with no architectural innovation. The closest prior work, SummaC (Laban et al., 2021), already demonstrated sentence-level NLI aggregation for consistency detection. ECS is essentially SummaC applied to a different task with a retrieval pre-filter added (Section 3.2–3.3).
2. Evaluation on a single benchmark with only two systems poses a severe generalization risk: OmniDCBench contains just two captioning systems (Section 4.1). With N=2, system-level Kendall can only be +1.0 or −1.0, so a metric that ordered the two systems at random would match Gemini half the time. The +1.0 shows only that ECS ranked this one pair correctly, not that it is robust, which makes the headline 'correct system ranking' result statistically meaningless. There are no other benchmarks, no additional systems, and no cross-domain validation.
3. Threshold calibration uses Gemini labels, a circular dependency that undermines the API-free claim: Equation 3 and Section 3.4 explicitly calibrate per-dimension thresholds using Gemini supervision on 215 held-out clips. The paper argues that 'calibration requires LLM labels only during development', but this means: (a) API access is needed to set up ECS, violating the 'API-free' promise for new domains and benchmarks; (b) thresholds are tuned to match Gemini, so good agreement is partially tautological; (c) if Gemini is updated or replaced, the thresholds become stale. The 'API-free' framing is misleading: the method is API-free at inference time only.
4. Low absolute agreement with Gemini undermines practical utility: ECS achieves only 0.511 keypoint F1 and 67.6% pairwise agreement at the generous ≥0.10 gap threshold (Table 1, Section 4.2). A keypoint F1 of 0.511 indicates weak agreement on individual keypoint decisions, and roughly a third (32.4%) of comparative judgments disagree with Gemini. The paper celebrates the correct system ranking, but at the instance level the metric is unreliable: a researcher using ECS would frequently get the wrong answer about which caption is better for a given clip.
5. No statistical significance testing anywhere: all results are reported as point estimates without confidence intervals, standard errors, or significance tests. With N=2 systems for Kendall, this is especially critical. For keypoint F1 on 46,577 items, a bootstrap CI would be trivial to compute but is absent. The 77.3% partial-coverage figure from the 150-sample error analysis has no CI either.

Must Fix Items:
1. Add at least one additional benchmark or captioning system to make the system-level ranking claims non-trivial; N=2 Kendall is uninformative and potentially misleading.
2. Report confidence intervals or bootstrap significance tests for all metrics, especially the system-level Kendall and keypoint F1.
3. Clearly qualify the 'API-free' claim: the method requires Gemini-labeled data for threshold calibration and is API-free at inference time only. Discuss how thresholds would be set for new benchmarks without LLM supervision.

Runs:
- run=1 score=5.2 verdict=Reject confidence=0.78 error=None
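
A minimal sketch of the bootstrap interval requested in Must Fix item 2, assuming per-keypoint binary labels are available for both Gemini (reference) and ECS (prediction). The function names, the synthetic labels, and the 85% agreement rate in the demo are illustrative assumptions, not taken from the paper.

```python
import random


def f1_score(gold, pred):
    """F1 for binary labels, where 1 means the keypoint is judged covered."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for F1, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(f1_score([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1_score(gold, pred), lo, hi


# Synthetic stand-in data (hypothetical): 500 keypoints, ~85% label agreement.
demo_rng = random.Random(1)
gold = [demo_rng.randint(0, 1) for _ in range(500)]
pred = [g if demo_rng.random() < 0.85 else 1 - g for g in gold]

point, lo, hi = bootstrap_f1_ci(gold, pred, n_resamples=500)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Keypoints from the same clip are likely correlated, so resampling whole clips rather than individual keypoints would probably give a wider and more honest interval.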
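
To make the N=2 concern in Weakness 2 concrete, the sketch below enumerates every Kendall tau value a metric can attain against a fixed reference ranking. It uses a plain tau-a implementation written for this note (the helper names are hypothetical) and assumes rankings without ties.

```python
from itertools import combinations, permutations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs, no ties."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)


def attainable_taus(n_systems):
    """All tau values reachable by some untied ranking of n_systems systems."""
    reference = list(range(n_systems))
    return sorted(
        {kendall_tau(reference, list(p)) for p in permutations(reference)}
    )


for n in (2, 3, 4):
    print(n, attainable_taus(n))
```

With two systems the only attainable values are −1.0 and +1.0, so a randomly ordered pair matches the reference half the time. With four systems there are seven attainable values, and a perfect match by chance has probability 1/24.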
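
The 77.3% partial-coverage figure from the error analysis can be given a Wilson score interval, sketched below. The count of 116 out of 150 is inferred here from the reported percentage (0.773 × 150 ≈ 116) and is an assumption; the paper's exact count should be used if it differs.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 116 of 150 disagreements attributed to partial coverage.
low, high = wilson_interval(116, 150)
print(f"partial coverage: {116 / 150:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With a sample of 150, the interval spans well over ten percentage points, which is the kind of uncertainty Weakness 5 asks the authors to report.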