Title: AUDITING HARDENING LIVEMEDBENCH’S RUBRIC GRADER AGAINST PROMPT INJECTION:
PDF: a060820b-1cdc-4cca-b93d-80e0a5ca8689.pdf
Score: 5.2
Verdict: Reject
Confidence: 0.72
Elapsed: 319.8s

Strengths:
1. Negative-result framing is honest and valuable: the paper explicitly reports that hardening causes net harm (−6.42% benign drift, CI excludes zero) with no security benefit, and it does not oversell the hardening strategy. This is a commendable contribution to the literature on when security interventions are counterproductive (Section 4.2, Table 1).
2. Statistical rigor with bootstrap CIs and explicit significance testing: all inflation deltas report 95% bootstrap CIs with 1000 resamples, and significance is determined by CI exclusion of zero rather than point-estimate comparisons (Section 3.5, Table 1). This is a higher standard than many papers in this space.
3. Clear threat model and attack surface identification (Sections 3.2–3.3): the paper correctly identifies two concrete vulnerabilities, untrusted data interpolation and permissive fallback parsing, and designs payloads targeting each surface. The 99.78% fallback rate (Section 3.1) is a striking empirical finding that motivates Payload 3.
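
For readers less familiar with the significance criterion praised in Strength 2, the following is a minimal sketch of a percentile bootstrap CI with 1000 resamples and a "CI excludes zero" check. It is illustrative only: the function names, the choice of the percentile method, and resampling at the level of per-case score deltas are assumptions, not details confirmed by the paper.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case score deltas.

    Assumption: `deltas` holds one (treated - baseline) score difference
    per case; the paper may resample a different unit.
    """
    rng = random.Random(seed)
    n = len(deltas)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def excludes_zero(ci):
    """True when the whole interval lies strictly on one side of zero."""
    lower, upper = ci
    return lower > 0 or upper < 0
```

A delta is called significant under this criterion only when `excludes_zero(bootstrap_ci(deltas))` is true, which is stricter than comparing point estimates.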

Weaknesses:
1. Extremely narrow attack evaluation: only three hand-crafted payloads are tested against a single judge model (Qwen2.5-72B-Instruct). The paper's own related work cites optimization-based attacks (JudgeDeceiver, Shi et al. 2025) achieving >30% success rates on other judges, and Zhao et al. (2025) showing that 'master key' tokens elicit false positives. The Limitations section acknowledges this gap (Section 5.1), but the core claim of 'natural robustness' cannot rest on three hand-crafted payloads and one model. The negative result may simply reflect inadequate attack sophistication rather than inherent rubric-grader robustness (Sections 3.3, 4.2, 5.1).
2. Benign drift root-cause analysis is speculative and incomplete. Section 4.3 attributes the −6.42% drift to the evidence verification gate (Layer 4), but the only supporting evidence is a 0.22% abstention rate on clean inputs, which cannot account for a 6.42% score drop. The other three hardening layers (untrusted data framing, schema-constrained output, strict parsing) likely also contribute. The prompt changes in Layer 1 are a particular concern because they alter the judge's interpretation context. No ablation isolates each layer's contribution, so the drift attribution remains unverified (Section 4.3, Table 1).
3. Limited evaluation scope. The paper claims to audit LiveMedBench's rubric grader specifically, but the 150-case subset (921 criteria) is small and the answer generator is a single small model (Qwen2.5-7B-Instruct). The 7B model's responses may be short and low-quality enough that the 72B judge naturally dilutes or ignores injection payloads. In that case the observed robustness would reflect evaluated content that gives injection little room to operate, not robustness of rubric grading itself. No analysis of response length or payload-to-response ratio is provided (Section 4.1).
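
The arithmetic behind Weakness 2 can be made explicit with a rough back-of-the-envelope check. The sketch below rests on strong assumptions that the paper does not confirm: criteria are equally weighted, the drift is measured in absolute points on the fraction of criteria met, the 0.22% abstention rate applies to the same 921 criteria, and each abstention costs at most one criterion. The function name is hypothetical.

```python
def abstention_gap(n_criteria, abstention_rate, observed_drift):
    """Compare criteria touched by abstention with criteria implied by the drift.

    Assumes equal criterion weights and drift expressed as a fraction of
    criteria (both are assumptions, not the paper's stated scoring rule).
    """
    abstained = abstention_rate * n_criteria
    implied = abs(observed_drift) * n_criteria
    return {
        "abstained_criteria": abstained,
        "criteria_implied_by_drift": implied,
        # Share of the drift that abstention alone cannot account for.
        "unexplained_share": 1 - abstained / implied,
    }


gap = abstention_gap(n_criteria=921, abstention_rate=0.0022, observed_drift=-0.0642)
```

Under these assumptions, abstention touches about 2 of the 921 criteria, while the drift corresponds to about 59, so over 96% of the drop would need another explanation. A different scoring rule would change the numbers but is unlikely to close a gap of this size.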

Must Fix Items:
1. Add per-layer ablation of the hardening strategy to identify which layer(s) cause the −6.42% benign drift. Without this, the attribution to Layer 4 is speculative and the practical recommendation (whether to deploy partial hardening) is unsupported.
2. Test at least one optimization-based or adaptive attack (e.g., JudgeDeceiver-style) rather than only hand-crafted payloads. The current attack evaluation is too weak to support the 'natural robustness' claim with confidence.
3. Report response-length statistics and payload-to-response ratios to rule out the confound that short/low-quality responses from Qwen2.5-7B-Instruct simply dilute injection effectiveness.
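
The statistics requested in Must Fix item 3 are inexpensive to compute. The following is a minimal sketch, assuming whitespace-delimited word counts as the length unit (the judge's tokenizer would be a more faithful choice) and hypothetical function names; it is not the authors' pipeline.

```python
from statistics import mean, median


def length_stats(responses):
    """Summary statistics of response lengths in whitespace-delimited words."""
    lengths = [len(r.split()) for r in responses]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def payload_to_response_ratio(payload, response):
    """Payload length divided by response length, both in words."""
    response_len = len(response.split())
    if response_len == 0:
        # An empty response makes the ratio undefined; flag it as infinite.
        return float("inf")
    return len(payload.split()) / response_len
```

Reporting these per payload, alongside the inflation deltas, would show whether attack effectiveness varies with how much of the judged text the payload occupies.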

Runs:
- run=1 score=5.2 verdict=Reject confidence=0.72 error=None