{
  "pdf": "reinked-ocr-view-chart-robustness.pdf",
  "title": "REINK: A TRAINING-FREE INFERENCE WRAPPER FOR ROBUST CHART QUESTION ANSWERING UNDER VI-SUAL DEGRADATIONS FARS",
  "elapsed": 108.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Clear and focused problem formulation: the paper identifies a specific, real-world failure mode (text illegibility under visual corruption) for VLMs on chart QA, and proposes a targeted solution. The motivation is well-grounded in prior work showing 25-30pp accuracy drops (Section 1, citing Shin et al. 2025; Mukhopadhyay et al. 2024).",
    "Well-designed control experiment: the scrambled-text baseline (Condition B) is a strong experimental control that isolates whether gains come from correct text semantics versus spatial layout cues. The +8.50pp gap between ReInk (C) and scrambled (B) provides convincing evidence that text content drives improvement, not just visual contrast or layout (Table 1).",
    "Honest and informative ablation study: the OCR-as-text ablation reveals that spatial rendering provides only +0.64pp over plain text injection, and bounding box outlines add only +0.10pp. The clean-chart sanity check (+0.13pp) confirms no prompt confound. These ablations are transparent about the limited marginal value of the spatial rendering component, which strengthens credibility (Table 2)."
  ],
  "weaknesses": [
    "Extremely limited experimental scope: only one VLM (Qwen2.5-VL-7B-Instruct), one benchmark (ChartQAPro), and two corruption types (defocus blur, pixelate) at a single severity level. The paper's own conclusion acknowledges this, but it fundamentally undermines the generalizability claims. A 'training-free inference wrapper' should demonstrate model-agnostic utility across at least 2-3 VLMs (Section 4.1, 5).",
    "The core contribution is thin and arguably over-packaged: ReInk is essentially 'run OCR, put text back on an image, give both images to VLM.' The ablation in Table 2 shows that the spatial rendering (the paper's namesake and primary technical novelty) contributes only +0.64pp over simply providing OCR text as a string in the prompt. This means the simplest possible baseline (OCR text in prompt) captures 95% of the benefit, raising questions about whether the spatial rendering pipeline is the right contribution versus a simpler text-injection approach (Table 2, OCR-as-text: 27.31% vs ReInk: 27.95%).",
    "Low absolute performance ceiling: even with ReInk, accuracy is only 27.95% on ChartQAPro-Corrupted, meaning the model still fails on ~72% of questions. The paper does not sufficiently discuss what the remaining failure modes are or whether OCR errors vs. VLM reasoning failures account for the gap. The 2.8:1 rescue-to-hurt ratio is presented positively but 4.72% of questions are actively hurt by ReInk, which is non-trivial (Figure 3, Section 4.4).",
    "No statistical significance testing: the paper reports point estimates (e.g., 27.95% vs 19.45%) without confidence intervals or significance tests. For A-SC4, standard deviations are reported (±1.61 to ±9.43), but no such uncertainty quantification is provided for the main conditions A, B, C. This makes it impossible to assess whether the +0.64pp difference between ReInk and OCR-as-text is meaningful or within noise (Table 1, Table 2)."
  ],
  "must_fix_items": [
    "Add statistical significance testing or confidence intervals for all main results, especially the small differences in ablation (ReInk vs OCR-as-text: +0.64pp).",
    "Evaluate on at least one additional VLM to substantiate the 'model-agnostic' claim, or remove/rephrase that claim.",
    "Analyze failure cases on the ~72% of questions ReInk still gets wrong: what fraction are due to OCR errors vs. VLM reasoning failures vs. questions that don't require text reading? This would inform practitioners about the ceiling of OCR-based approaches."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear and focused problem formulation: the paper identifies a specific, real-world failure mode (text illegibility under visual corruption) for VLMs on chart QA, and proposes a targeted solution. The motivation is well-grounded in prior work showing 25-30pp accuracy drops (Section 1, citing Shin et al. 2025; Mukhopadhyay et al. 2024).",
        "Well-designed control experiment: the scrambled-text baseline (Condition B) is a strong experimental control that isolates whether gains come from correct text semantics versus spatial layout cues. The +8.50pp gap between ReInk (C) and scrambled (B) provides convincing evidence that text content drives improvement, not just visual contrast or layout (Table 1).",
        "Honest and informative ablation study: the OCR-as-text ablation reveals that spatial rendering provides only +0.64pp over plain text injection, and bounding box outlines add only +0.10pp. The clean-chart sanity check (+0.13pp) confirms no prompt confound. These ablations are transparent about the limited marginal value of the spatial rendering component, which strengthens credibility (Table 2)."
      ],
      "weaknesses": [
        "Extremely limited experimental scope: only one VLM (Qwen2.5-VL-7B-Instruct), one benchmark (ChartQAPro), and two corruption types (defocus blur, pixelate) at a single severity level. The paper's own conclusion acknowledges this, but it fundamentally undermines the generalizability claims. A 'training-free inference wrapper' should demonstrate model-agnostic utility across at least 2-3 VLMs (Section 4.1, 5).",
        "The core contribution is thin and arguably over-packaged: ReInk is essentially 'run OCR, put text back on an image, give both images to VLM.' The ablation in Table 2 shows that the spatial rendering (the paper's namesake and primary technical novelty) contributes only +0.64pp over simply providing OCR text as a string in the prompt. This means the simplest possible baseline (OCR text in prompt) captures 95% of the benefit, raising questions about whether the spatial rendering pipeline is the right contribution versus a simpler text-injection approach (Table 2, OCR-as-text: 27.31% vs ReInk: 27.95%).",
        "Low absolute performance ceiling: even with ReInk, accuracy is only 27.95% on ChartQAPro-Corrupted, meaning the model still fails on ~72% of questions. The paper does not sufficiently discuss what the remaining failure modes are or whether OCR errors vs. VLM reasoning failures account for the gap. The 2.8:1 rescue-to-hurt ratio is presented positively but 4.72% of questions are actively hurt by ReInk, which is non-trivial (Figure 3, Section 4.4).",
        "No statistical significance testing: the paper reports point estimates (e.g., 27.95% vs 19.45%) without confidence intervals or significance tests. For A-SC4, standard deviations are reported (±1.61 to ±9.43), but no such uncertainty quantification is provided for the main conditions A, B, C. This makes it impossible to assess whether the +0.64pp difference between ReInk and OCR-as-text is meaningful or within noise (Table 1, Table 2)."
      ],
      "must_fix_items": [
        "Add statistical significance testing or confidence intervals for all main results, especially the small differences in ablation (ReInk vs OCR-as-text: +0.64pp).",
        "Evaluate on at least one additional VLM to substantiate the 'model-agnostic' claim, or remove/rephrase that claim.",
        "Analyze failure cases on the ~72% of questions ReInk still gets wrong: what fraction are due to OCR errors vs. VLM reasoning failures vs. questions that don't require text reading? This would inform practitioners about the ceiling of OCR-based approaches."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}