{
  "pdf": "escaped-markup-judges-format-spoofing.pdf",
  "title": "ESCAPED MARKUP: PREVENTING VERDICT SPOOFING IN STRUCTURED MULTIMODAL LLM JUDGES FARS Analemma",
  "elapsed": 53.6,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.5,
    "presentation": 2.8,
    "contribution": 2.2,
    "overall_rating": 4.2,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies a concrete and practically relevant vulnerability, format-spoofing attacks on structured multimodal LLM judges, that is distinct from prior work on semantic prompt injection. The 66.59% conditional ASR on VL-RewardBench (Table 1) gives clear evidence of the problem's severity, and the distinction between structural marker collision and imperative prompt injection (Section 3.2) is well articulated.",
    "The NL-only control experiment (Section 4.4) is a strong methodological choice. By showing that the markup spoof achieves an ASRcond only 4.12pp higher than the natural-language variant (66.59% vs 62.47%), the paper honestly reveals that much of the attack's power comes from semantic persuasion rather than structural exploitation. This self-limiting finding adds credibility and avoids overstating the novelty of the structural attack.",
    "The ablation study (Table 2) is informative and leads to a genuinely surprising finding: tags-only and boxed-only partial variants achieve 41.19pp and 41.65pp ASRcond reduction respectively, comparable to or slightly better than the full escaping defense (39.36pp). This challenges the intuitive assumption that the full pipeline would dominate and correctly attributes the primary defense mechanism to content redaction rather than structural escaping."
  ],
  "weaknesses": [
    "The evaluation is extremely narrow: single model (Qwen2.5-VL-7B-Instruct), single benchmark (VL-RewardBench, N=1,247), single attack family (multi-block structural), and a single baseline (Spotlighting). The paper acknowledges this in Section 5 but does not mitigate it. The defense (regex-based redaction of verdict/quality phrases) is template-specific and may not generalize to judges with different output formats, different languages, or adversarially obfuscated verdict phrases. Without multi-model or multi-benchmark validation, it is unclear whether the 39.36pp reduction is robust or an artifact of this particular judge–benchmark combination.",
    "The baseline comparison is unfair and weak. Spotlighting (base64 encoding) was designed for text-only GPT-family models (Hines et al., 2024), not 7B multimodal judges. Showing it fails with 18.8% ParseFail on a model it was never intended for is predictable and does not constitute a meaningful comparison. The paper lacks comparisons with more relevant defenses: StruQ and SecAlign (both discussed in Related Work but not evaluated), simple input truncation, or even a naive 'strip all angle brackets' heuristic. The absence of these baselines makes it impossible to assess whether the proposed four-step pipeline offers advantages over simpler alternatives.",
    "The ablation reveals a fundamental tension that the paper does not adequately address. Since tags-only (which includes verdict redaction per the pipeline description) achieves a 41.19pp reduction while full escaping achieves only 39.36pp, the additional steps (boxed removal, quality-assertion redaction) appear to add no benefit and may even slightly hurt. The paper's explanation of 'interaction effects' (Section 4.3) is vague and does not identify a mechanism. If the full defense is worse than the partial variants, this raises the question of whether the four-step design is well motivated or whether a simpler tags+verdict-redaction approach would be preferable. The paper should either explain the interaction mechanism or simplify the proposed defense."
  ],
  "must_fix_items": [
    "Add at least one more baseline defense (e.g., StruQ, SecAlign, or a simple heuristic like 'strip all content after \\boxed' or 'truncate candidate response at reserved markers') to make the comparison meaningful rather than just beating a predictably-failed Spotlighting.",
    "Explain or resolve the ablation anomaly where full escaping (39.36pp reduction) performs worse than both partial variants (~41pp reduction). If adding steps hurts, the four-step pipeline design needs justification or simplification.",
    "Evaluate on at least one additional judge model or benchmark to demonstrate generalizability beyond a single model–benchmark pair."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies a concrete and practically relevant vulnerability, format-spoofing attacks on structured multimodal LLM judges, that is distinct from prior work on semantic prompt injection. The 66.59% conditional ASR on VL-RewardBench (Table 1) gives clear evidence of the problem's severity, and the distinction between structural marker collision and imperative prompt injection (Section 3.2) is well articulated.",
        "The NL-only control experiment (Section 4.4) is a strong methodological choice. By showing that the markup spoof achieves an ASRcond only 4.12pp higher than the natural-language variant (66.59% vs 62.47%), the paper honestly reveals that much of the attack's power comes from semantic persuasion rather than structural exploitation. This self-limiting finding adds credibility and avoids overstating the novelty of the structural attack.",
        "The ablation study (Table 2) is informative and leads to a genuinely surprising finding: tags-only and boxed-only partial variants achieve 41.19pp and 41.65pp ASRcond reduction respectively, comparable to or slightly better than the full escaping defense (39.36pp). This challenges the intuitive assumption that the full pipeline would dominate and correctly attributes the primary defense mechanism to content redaction rather than structural escaping."
      ],
      "weaknesses": [
        "The evaluation is extremely narrow: single model (Qwen2.5-VL-7B-Instruct), single benchmark (VL-RewardBench, N=1,247), single attack family (multi-block structural), and a single baseline (Spotlighting). The paper acknowledges this in Section 5 but does not mitigate it. The defense (regex-based redaction of verdict/quality phrases) is template-specific and may not generalize to judges with different output formats, different languages, or adversarially obfuscated verdict phrases. Without multi-model or multi-benchmark validation, it is unclear whether the 39.36pp reduction is robust or an artifact of this particular judge–benchmark combination.",
        "The baseline comparison is unfair and weak. Spotlighting (base64 encoding) was designed for text-only GPT-family models (Hines et al., 2024), not 7B multimodal judges. Showing it fails with 18.8% ParseFail on a model it was never intended for is predictable and does not constitute a meaningful comparison. The paper lacks comparisons with more relevant defenses: StruQ and SecAlign (both discussed in Related Work but not evaluated), simple input truncation, or even a naive 'strip all angle brackets' heuristic. The absence of these baselines makes it impossible to assess whether the proposed four-step pipeline offers advantages over simpler alternatives.",
        "The ablation reveals a fundamental tension that the paper does not adequately address. Since tags-only (which includes verdict redaction per the pipeline description) achieves a 41.19pp reduction while full escaping achieves only 39.36pp, the additional steps (boxed removal, quality-assertion redaction) appear to add no benefit and may even slightly hurt. The paper's explanation of 'interaction effects' (Section 4.3) is vague and does not identify a mechanism. If the full defense is worse than the partial variants, this raises the question of whether the four-step design is well motivated or whether a simpler tags+verdict-redaction approach would be preferable. The paper should either explain the interaction mechanism or simplify the proposed defense."
      ],
      "must_fix_items": [
        "Add at least one more baseline defense (e.g., StruQ, SecAlign, or a simple heuristic like 'strip all content after \\boxed' or 'truncate candidate response at reserved markers') to make the comparison meaningful rather than just beating a predictably-failed Spotlighting.",
        "Explain or resolve the ablation anomaly where full escaping (39.36pp reduction) performs worse than both partial variants (~41pp reduction). If adding steps hurts, the four-step pipeline design needs justification or simplification.",
        "Evaluate on at least one additional judge model or benchmark to demonstrate generalizability beyond a single model–benchmark pair."
      ],
      "conference_scores": {
        "soundness": 2.5,
        "presentation": 2.8,
        "contribution": 2.2,
        "overall_rating": 4.2,
        "confidence": 3
      }
    }
  ]
}