Title: ISOLATED SOLVE-THEN-JUDGE: A SIMPLE DEFENSE AGAINST CANDIDATE-RESPONSE PROMPT INJECTION FOR MULTIMODAL LLM JUDGES FARS
PDF: isolated-solve-then-judge-injection.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.60
Elapsed: 47.4s

Strengths:
1. Clean experimental design with three well-controlled conditions (A/B/C) that isolate the effect of information isolation from prompt engineering and additional compute. The comparison of B vs C directly tests whether isolation per se provides defense benefit, which is a methodologically sound approach (Section 3.4, Figure 1).
2. Transparent reporting of trade-offs and limitations: the paper honestly reports the 10.8pp clean accuracy degradation (Table 1), category-dependent effectiveness with reasoning tasks showing anomalous behavior where non-isolated control outperforms isolated defense (Table 2), and authority impersonation attacks remaining at 63.6% ASR_cond even against the defended system (Table 3). This level of candor is commendable.
3. Corruption analysis and failure analysis provide mechanistic insight into how the defense works and why it fails. The finding that 98.7% of failures occur when the attacked response is in position 2 reveals a position bias vulnerability, and the observation that weak self-answer anchors (<30 chars) correlate with 37% of failures offers concrete directions for improvement (Section 4.5, 4.6).

Weaknesses:
1. The core idea — generating a model's own answer before judging candidates — is conceptually straightforward and has direct precedent in Lin et al. (2025), who proposed self-reference-guided evaluation. The paper acknowledges this but positions its contribution as applying the idea to security. However, the 'isolated solve-then-judge' is essentially: (1) answer the question first, then (2) judge with that answer as reference. The incremental novelty over prior self-reference work and standard multi-pass prompting is limited (Section 2.4, 3.3).
2. The isolated contribution of information isolation is only 4pp (B: 29.33% vs C: 33.37% ASR_cond), meaning the vast majority of the 62pp defense comes from prompt engineering (delimiters, warnings) and the two-pass structure rather than isolation itself. This undermines the paper's central claim that 'information isolation' is the key defense mechanism. The paper's title and framing overstate the role of isolation relative to the data (Table 1, Section 4.2).
3. The defense is evaluated on only one model (Qwen2.5-VL-7B-Instruct) and one benchmark (VL-RewardBench). No scaling analysis across model sizes, no evaluation on other VLMs (e.g., InternVL, LLaVA), and no evaluation on text-only LLM judge scenarios. This severely limits the generalizability of claims (Section 4.1). Additionally, only three hand-crafted attack variants are tested — no optimization-based attacks (e.g., JudgeDeceiver from Shi et al. 2025a, cited in related work) are evaluated, which is a significant gap given these are more threatening in practice.

Must Fix Items:
1. Evaluate on at least one additional model or benchmark to demonstrate generalizability of the defense.
2. Test against optimization-based attacks (e.g., JudgeDeceiver) rather than only hand-crafted prompt injection variants, as these represent a more realistic and challenging threat model.
3. Reframe the narrative to accurately reflect that information isolation contributes only 4pp of the 62pp defense; the primary benefit comes from the two-pass structure and prompt engineering. The current framing is misleading.

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.6 error=None