{
  "pdf": "424d9225-e893-4a73-b97e-5c9c85c010e2.pdf",
  "title": "DEEP-LAYER ATTENTION PRUNING VISION-",
  "elapsed": 246.2,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.0,
  "scores": [
    5.0
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.78,
  "conference_scores": null,
  "strengths": [
    "The core finding—that attention from layer 12 of InternVL2.5-8B is far more semantically informative than shallow-layer attention for token pruning—is clearly demonstrated with a large margin: Raw MIS at L12 achieves 66.32% vs L4's 38.70% (Table 2), an absolute +27.62 point gap that is unlikely to be a statistical fluke even without error bars.",
    "The diagnostic analysis in Table 3 (Spearman ρ ≈0.14–0.18 across all layers) provides a concrete explanation for why ratio-based debiasing fails: the positional bias that D2Pruner assumes simply does not exist in this model. This is a useful empirical refutation of a widely-held assumption, even if the diagnostic sample size (30 images) is small.",
    "Elimination of offline calibration is a practical engineering benefit. D2Pruner requires 1000 COCO images to compute bias priors; this method requires none, which genuinely simplifies deployment and avoids domain-transfer concerns (Section 3.3, Figure 1).",
    "The ablation in Table 2 is systematic: it tests ratio debiasing at both L4 and L12, with multiple Ks values, plus a weighted combination variant, and all uniformly underperform raw attention. The consistency of the negative result strengthens the 'no positional bias' conclusion."
  ],
  "weaknesses": [
    "The core contribution is extremely thin: the method is 'use layer 12 instead of layer 2, skip debiasing.' This is essentially a hyperparameter choice (which layer to extract attention from) combined with an existing algorithm (MIS from D2Pruner). The layer-12 selection was presumably found via search, but no layer-sweep experiment is reported—readers cannot see the full sensitivity curve or know whether L10 or L14 would perform equally well or better.",
    "Single-model, single-task evaluation is a critical limitation acknowledged in the paper (Section 5) but not mitigated. All results are on InternVL2.5-8B and RefCOCO-family grounding. Whether deep-layer attention is superior on other VLMs (LLaVA, Qwen-VL), other architectures (SigLIP encoders, different LLM backbones), or other tasks (VQA, captioning, document understanding) is entirely unknown. This severely limits generalizability claims.",
    "No statistical significance testing: Table 1 reports point estimates on ~57K samples but with no error bars, confidence intervals, or significance tests. The +11.29 gap over D2Pruner is likely significant at this sample size, but the smaller ablation comparisons in Table 2 (e.g., Weighted Combo at -2.08 vs Raw) are not validated. This is a standard expectation for empirical NLP/CV papers.",
    "The 'ratio debiasing fails' contribution is a strawman: Equation 1 (Amid/Ashallow) is the authors' own proposal, not an existing method. Showing one's own failed idea does not constitute a contribution—it is a negative result on a method nobody previously advocated. D2Pruner's actual debiasing uses an offline prior, not a per-instance shallow-layer ratio, so the refutation does not directly address D2Pruner's mechanism.",
    "Only one retention ratio (10%) is evaluated. The 92% retention of no-pruning performance at 10% tokens is impressive, but the paper provides no scaling curve: how does the L12 vs L2 advantage change at 20%, 30%, or 50% retention? If the gap collapses at moderate retention, the practical significance diminishes."
  ],
  "must_fix_items": [
    "Add a layer-sweep experiment (at minimum L2 through L16 in steps of 2) to show the full sensitivity curve and justify the L12 choice rather than presenting it as an oracle selection.",
    "Add evaluation on at least one additional model (e.g., LLaVA-1.5-7B or Qwen2-VL) and one additional task type (e.g., VQAv2 or TextVQA) to substantiate generalizability beyond a single model-task combination.",
    "Report error bars or confidence intervals on all main-table results, or at minimum on the key pairwise comparisons (Ours vs D2Pruner)."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.0,
      "verdict": "Reject",
      "confidence": 0.78,
      "strengths": [
        "The core finding—that attention from layer 12 of InternVL2.5-8B is far more semantically informative than shallow-layer attention for token pruning—is clearly demonstrated with a large margin: Raw MIS at L12 achieves 66.32% vs L4's 38.70% (Table 2), an absolute +27.62 point gap that is unlikely to be a statistical fluke even without error bars.",
        "The diagnostic analysis in Table 3 (Spearman ρ ≈0.14–0.18 across all layers) provides a concrete explanation for why ratio-based debiasing fails: the positional bias that D2Pruner assumes simply does not exist in this model. This is a useful empirical refutation of a widely-held assumption, even if the diagnostic sample size (30 images) is small.",
        "Elimination of offline calibration is a practical engineering benefit. D2Pruner requires 1000 COCO images to compute bias priors; this method requires none, which genuinely simplifies deployment and avoids domain-transfer concerns (Section 3.3, Figure 1).",
        "The ablation in Table 2 is systematic: it tests ratio debiasing at both L4 and L12, with multiple Ks values, plus a weighted combination variant, and all uniformly underperform raw attention. The consistency of the negative result strengthens the 'no positional bias' conclusion."
      ],
      "weaknesses": [
        "The core contribution is extremely thin: the method is 'use layer 12 instead of layer 2, skip debiasing.' This is essentially a hyperparameter choice (which layer to extract attention from) combined with an existing algorithm (MIS from D2Pruner). The layer-12 selection was presumably found via search, but no layer-sweep experiment is reported—readers cannot see the full sensitivity curve or know whether L10 or L14 would perform equally well or better.",
        "Single-model, single-task evaluation is a critical limitation acknowledged in the paper (Section 5) but not mitigated. All results are on InternVL2.5-8B and RefCOCO-family grounding. Whether deep-layer attention is superior on other VLMs (LLaVA, Qwen-VL), other architectures (SigLIP encoders, different LLM backbones), or other tasks (VQA, captioning, document understanding) is entirely unknown. This severely limits generalizability claims.",
        "No statistical significance testing: Table 1 reports point estimates on ~57K samples but with no error bars, confidence intervals, or significance tests. The +11.29 gap over D2Pruner is likely significant at this sample size, but the smaller ablation comparisons in Table 2 (e.g., Weighted Combo at -2.08 vs Raw) are not validated. This is a standard expectation for empirical NLP/CV papers.",
        "The 'ratio debiasing fails' contribution is a strawman: Equation 1 (Amid/Ashallow) is the authors' own proposal, not an existing method. Showing one's own failed idea does not constitute a contribution—it is a negative result on a method nobody previously advocated. D2Pruner's actual debiasing uses an offline prior, not a per-instance shallow-layer ratio, so the refutation does not directly address D2Pruner's mechanism.",
        "Only one retention ratio (10%) is evaluated. The 92% retention of no-pruning performance at 10% tokens is impressive, but the paper provides no scaling curve: how does the L12 vs L2 advantage change at 20%, 30%, or 50% retention? If the gap collapses at moderate retention, the practical significance diminishes."
      ],
      "must_fix_items": [
        "Add a layer-sweep experiment (at minimum L2 through L16 in steps of 2) to show the full sensitivity curve and justify the L12 choice rather than presenting it as an oracle selection.",
        "Add evaluation on at least one additional model (e.g., LLaVA-1.5-7B or Qwen2-VL) and one additional task type (e.g., VQAv2 or TextVQA) to substantiate generalizability beyond a single model-task combination.",
        "Report error bars or confidence intervals on all main-table results, or at minimum on the key pairwise comparisons (Ours vs D2Pruner)."
      ],
      "conference_scores": null
    }
  ]
}
