Title: OCR-ANCHOR RERANKING: WHEN BEST-OF-N SELECTION FAILS DUE TO CANDIDATE HOMOGENEITY FARS Analemma
PDF: dd1d3d52-acc2-4dc0-9125-f4c8a039e03d.pdf
Score: 3.5
Verdict: Reject
Confidence: 0.7
Elapsed: 222.1s

Strengths:
1. Honest negative-result reporting: The paper transparently reports that its proposed OCR-Anchor Reranking method fails to improve over baselines, including presenting Table 2 diagnostic statistics (90.6% identity rate, 89.1% avg coverage) that clearly explain the failure mechanism. This is more informative than selectively reporting only positive outcomes.
2. Well-structured diagnostic analysis (Section 4.3): The paper traces the failure cascade from low temperature → candidate homogeneity (90.6% identity) → coverage saturation (median 96%) → default to candidate 0 (96% of pages) → method equivalent to N=1 baseline. This causal chain is clearly articulated with supporting statistics from Table 2.
3. Clean experimental protocol with multiple baselines: Table 1 compares five selection strategies (N=1, Self-Score, Random, Anchor Coverage, Consensus) across 7 benchmark categories with 3 random seeds (42,43,44) and reports mean±std. The 0.3-point band result (82.0–82.3%) makes the null finding credible rather than a marginal effect buried in noise.
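
For reference, the homogeneity diagnostic praised above can be sketched in a few lines. This is a hedged illustration only: the review does not state how the paper defines its identity rate, so the pairwise exact-match definition and the function name `pairwise_identity_rate` are assumptions.

```python
from itertools import combinations


def pairwise_identity_rate(candidates: list[str]) -> float:
    """Fraction of candidate pairs that are exact string matches.

    Assumed definition: the paper's identity rate may be computed differently.
    """
    pairs = list(combinations(candidates, 2))
    if not pairs:
        # Zero or one candidate: nothing to compare, treat as fully homogeneous.
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical page with N=4 candidates, three of which are identical.
page_candidates = ["Total: 10", "Total: 10", "Total: 10", "Total: 1O"]
rate = pairwise_identity_rate(page_candidates)
```

When this rate approaches 1.0 on most pages, any selector has almost nothing to choose between, which is the failure mechanism the paper reports.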

Weaknesses:
1. Method never demonstrated working under any condition: The entire evaluation uses a single temperature (0.1) that produces 90.6% identical candidates (Table 2). This experimental design guarantees failure before the method is even tested. There are no experiments at higher temperatures (e.g., 0.5, 0.8, 1.0), where candidate diversity might exist and the selection signal could be informative. The paper proposes a reranking method but never shows it functioning in conditions where reranking is possible; this is a fundamental gap, not a finding.
2. Trivial core contribution once the packaging is stripped away: OCR-Anchor Reranking reduces to: (1) extract high-confidence tokens from PaddleOCR, (2) check which VLM candidate contains the most of them. This is a one-paragraph heuristic, not a method. The 'broader implication' that best-of-N requires candidate diversity is tautological (selection requires something to select among) and has been widely understood in the sampling literature. The paper packages this as a contribution, but the actual informational content is minimal.
3. Single model, single benchmark, no significance tests: Only olmOCR-2-7B-1025 is tested, on olmOCR-Bench only, at temperature 0.1 only, with N=8. No other VLMs (e.g., Qwen2-VL, InternVL, GOT-OCR), OCR benchmarks, temperatures, or values of N are evaluated. The 3-seed standard deviations (0.1–0.9 pp in Table 1) are reported, but no paired significance tests are conducted to confirm that the 0.3-pp band is statistically indistinguishable. With only 3 seeds and no formal test, the null-result claim rests on informal inspection.
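
To make Weaknesses 1 and 2 concrete, the two-step heuristic can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the anchors are assumed to be already extracted as plain strings, matching is assumed to be a substring check, ties are assumed to resolve to the lowest index, and the names `anchor_coverage` and `select_by_coverage` are hypothetical.

```python
def anchor_coverage(candidate: str, anchors: list[str]) -> float:
    """Fraction of anchor tokens found as substrings of the candidate."""
    if not anchors:
        return 1.0  # no anchors: every candidate trivially covers them
    hits = sum(1 for anchor in anchors if anchor in candidate)
    return hits / len(anchors)


def select_by_coverage(candidates: list[str], anchors: list[str]) -> tuple[int, list[float]]:
    """Return the index of the highest-coverage candidate and all scores.

    max() returns the first maximal element, so ties fall back to the
    lowest index (assumed tie-breaking rule).
    """
    scores = [anchor_coverage(c, anchors) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores


anchors = ["Invoice", "2024", "Total"]  # hypothetical high-confidence OCR tokens

# Homogeneous candidates (low temperature): every score ties, index 0 wins.
homogeneous = ["Invoice 2024 Total 10"] * 3
best_same, scores_same = select_by_coverage(homogeneous, anchors)

# Diverse candidates: coverage now separates them and index 1 wins.
diverse = ["Invoice 2O24", "Invoice 2024 Total 10"]
best_div, scores_div = select_by_coverage(diverse, anchors)
```

The homogeneous case shows why the method collapses to the N=1 baseline at temperature 0.1, and the diverse case is the regime the paper never evaluates.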

Must Fix Items:
1. Add at least one temperature condition above 0.1 (e.g., 0.5, 0.8) to demonstrate whether OCR-Anchor Reranking works when candidate diversity exists. Without this, the paper has proposed a method and shown it fails under conditions that guarantee failure, which is not a meaningful evaluation.
2. Conduct formal significance tests (e.g., paired bootstrap or Wilcoxon signed-rank on per-page scores) to confirm the 0.3-pp band in Table 1 is statistically indistinguishable. An informal mean±std comparison over three seeds is insufficient for a null-result claim.
3. Evaluate on at least one additional VLM and/or one additional document OCR benchmark to establish whether the homogeneity finding generalizes beyond olmOCR-2-7B-1025 on olmOCR-Bench.
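
As an illustration of Must Fix Item 2, a paired bootstrap over per-page score differences could look like the sketch below. It is a hedged example using only the Python standard library; the function name `paired_bootstrap_ci`, the percentile-interval construction, and the per-page scores are assumptions for illustration, not values from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-page difference (a - b).

    Pages are resampled with replacement; each resample keeps the pairing
    between the two methods by working on per-page differences.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-page scores for two selection strategies.
baseline = [0.80, 0.90, 0.70, 0.85, 0.95, 0.60]
reranked = [0.80, 0.90, 0.70, 0.85, 0.95, 0.60]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(reranked, baseline)
```

An interval that contains zero is consistent with no difference, but it supports an "indistinguishable" claim only if it is also narrow relative to a margin the authors state in advance.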

Runs:
- run=1 score=3.5 verdict=Reject confidence=0.7 error=None