Title: REINK: A TRAINING-FREE INFERENCE WRAPPER FOR ROBUST CHART QUESTION ANSWERING UNDER VISUAL DEGRADATIONS FARS
PDF: reinked-ocr-view-chart-robustness.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 108.8s

Strengths:
1. Clear and focused problem formulation: the paper identifies a specific, real-world failure mode (text illegibility under visual corruption) for VLMs on chart QA, and proposes a targeted solution. The motivation is well-grounded in prior work showing 25-30pp accuracy drops (Section 1, citing Shin et al. 2025; Mukhopadhyay et al. 2024).
2. Well-designed control experiment: the scrambled-text baseline (Condition B) is a strong experimental control that isolates whether gains come from correct text semantics versus spatial layout cues. The +8.50pp gap between ReInk (C) and scrambled (B) provides convincing evidence that text content drives improvement, not just visual contrast or layout (Table 1).
3. Honest and informative ablation study: the OCR-as-text ablation reveals that spatial rendering provides only +0.64pp over plain text injection, and bounding box outlines add only +0.10pp. The clean-chart sanity check (+0.13pp) confirms no prompt confound. These ablations are transparent about the limited marginal value of the spatial rendering component, which strengthens credibility (Table 2).

Weaknesses:
1. Extremely limited experimental scope: only one VLM (Qwen2.5-VL-7B-Instruct), one benchmark (ChartQAPro), and two corruption types (defocus blur, pixelate) at a single severity level. The paper's own conclusion acknowledges this, but it fundamentally undermines the generalizability claims. A 'training-free inference wrapper' should demonstrate model-agnostic utility across at least 2-3 VLMs (Section 4.1, 5).
2. The core contribution is thin and arguably over-packaged: ReInk is essentially 'run OCR, put text back on an image, give both images to VLM.' The ablation in Table 2 shows that the spatial rendering (the paper's namesake and primary technical novelty) contributes only +0.64pp over simply providing OCR text as a string in the prompt. The simplest possible baseline (OCR text in prompt) therefore captures over 90% of the benefit (7.86 of the 8.50pp gain, taking 19.45% as the reference), which raises the question of whether the spatial rendering pipeline is the right contribution, as opposed to a simpler text-injection approach (Table 2, OCR-as-text: 27.31% vs ReInk: 27.95%).
3. Low absolute performance ceiling: even with ReInk, accuracy is only 27.95% on ChartQAPro-Corrupted, meaning the model still fails on ~72% of questions. The paper does not sufficiently discuss what the remaining failure modes are, or how much of the gap is due to OCR errors versus VLM reasoning failures. The 2.8:1 rescue-to-hurt ratio is presented positively, but 4.72% of questions are actively hurt by ReInk, which is non-trivial (Figure 3, Section 4.4).
4. No statistical significance testing: the paper reports point estimates (e.g., 27.95% vs 19.45%) without confidence intervals or significance tests. For A-SC4, standard deviations are reported (±1.61 to ±9.43), but no such uncertainty quantification is provided for the main conditions A, B, C. This makes it impossible to assess whether the +0.64pp difference between ReInk and OCR-as-text is meaningful or within noise (Table 1, Table 2).
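The percentages in weaknesses 2 and 3 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes 19.45% is the reference accuracy behind the +8.50pp gap, and it treats the rounded 2.8:1 ratio as exact in order to back out a rescue rate that this review does not quote directly. The function and variable names are hypothetical.

```python
# Accuracies quoted in this review, in percent.
REINK = 27.95
OCR_AS_TEXT = 27.31
REFERENCE = 19.45  # assumption: the comparison point behind the +8.50pp gap

# Rescue/hurt figures quoted in this review.
RESCUE_TO_HURT_RATIO = 2.8  # rounded in the source, treated as exact here
HURT_RATE = 4.72            # percent of questions made worse by ReInk


def share_of_gain(method_acc, full_acc, reference_acc):
    """Fraction of the full method's gain that a simpler method retains."""
    return (method_acc - reference_acc) / (full_acc - reference_acc)


def implied_rescue_rate(ratio, hurt_rate):
    """Rescue rate implied by a rescue-to-hurt ratio and a hurt rate."""
    return ratio * hurt_rate


full_gain = REINK - REFERENCE
text_only_share = share_of_gain(OCR_AS_TEXT, REINK, REFERENCE)
rescue_rate = implied_rescue_rate(RESCUE_TO_HURT_RATIO, HURT_RATE)
net_change = rescue_rate - HURT_RATE

print(f"full gain over reference: {full_gain:.2f}pp")
print(f"share retained by OCR-as-text: {text_only_share:.1%}")
print(f"implied rescue rate: {rescue_rate:.2f}%")
print(f"net change (rescued minus hurt): {net_change:.2f}pp")
```

Under these assumptions the net of rescued minus hurt questions lands close to the reported +8.50pp, which suggests the quoted figures are internally consistent.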

Must Fix Items:
1. Add statistical significance testing or confidence intervals for all main results, especially the small differences in ablation (ReInk vs OCR-as-text: +0.64pp).
2. Evaluate on at least one additional VLM to substantiate the 'model-agnostic' claim, or remove/rephrase that claim.
3. Analyze failure cases on the ~72% of questions ReInk still gets wrong: what fraction are due to OCR errors vs. VLM reasoning failures vs. questions that don't require text reading? This would inform practitioners about the ceiling of OCR-based approaches.
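For must-fix item 1, one option is a paired analysis on per-question correctness, since both conditions are evaluated on the same questions. The sketch below shows an exact McNemar test and a paired bootstrap confidence interval using only the Python standard library. It is not the paper's evaluation code: the per-question outcomes are synthetic, the counts (1000 questions, 10 vs 4 discordant pairs) are invented for illustration, and all function names are hypothetical.

```python
import math
import random


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value for paired 0/1 outcomes."""
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(a_correct, b_correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (in pp) for accuracy(a) - accuracy(b)."""
    rng = random.Random(seed)
    # Resample per-question differences so the pairing is preserved.
    diffs = [a - b for a, b in zip(a_correct, b_correct)]
    n = len(diffs)
    means = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic, hypothetical outcomes for 1000 questions (1 = correct).
# 270 both correct, 10 only condition A, 4 only condition B, 716 neither.
cond_a = [1] * 270 + [1] * 10 + [0] * 4 + [0] * 716
cond_b = [1] * 270 + [0] * 10 + [1] * 4 + [0] * 716

observed_diff = 100.0 * (sum(cond_a) - sum(cond_b)) / len(cond_a)
p_value = mcnemar_exact(cond_a, cond_b)
ci_low, ci_high = paired_bootstrap_ci(cond_a, cond_b)

print(f"observed difference: {observed_diff:+.2f}pp")
print(f"exact McNemar p-value: {p_value:.4f}")
print(f"95% bootstrap CI: [{ci_low:+.2f}, {ci_high:+.2f}]pp")
```

A paired test is used here because only the questions where the two conditions disagree carry information about which one is better. With sub-1pp gaps such as ReInk vs OCR-as-text, the number of discordant questions largely determines whether the difference can be distinguished from noise.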

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None