Title: SELECTIVE DELEXICALIZATION DEFEND STRUCTURED-OUTPUT LLM APIS FROM CONTROL-
PDF: 86d12644-9d51-45c7-870f-a7c51b181d70.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.82
Elapsed: 195.7s

Strengths:
1. The paper correctly identifies the attack vector (forced enum/const literals in JSON schemas) and demonstrates via ablation (Table 3, Strip-Only vs No Defense both at 22.0% ASR) that metadata removal alone is insufficient, while literal delexicalization achieves 0% ASR — this is a clean ablation isolating the actual mechanism.
2. The defense is genuinely training-free and deployable as a preprocessing step, which is a practical engineering contribution. The 1.1% benign modification rate (Table 2) is low, and the validity/compliance deltas (-0.38pp, -0.70pp) are within acceptable production thresholds, demonstrating real utility preservation.
3. The paper honestly reports the chunked-payload attack limitation (§4.6, Figure 3), showing 0% chunk detection and 8–14% residual ASR even with defense. This transparency about failure modes is commendable and prevents overclaiming universal protection.

Weaknesses:
1. The core defense is a trivially simple 3-criteria string heuristic (length > 20, whitespace, imperative verb pattern — any 2 of 3). Packaging this as 'Selective DeLex-JSON' with a 'conjunction-based suspicion function' inflates what is essentially a keyword filter. The heuristic is specifically reverse-engineered for the EnumAttack pattern and would not generalize to any attack that does not use natural-language sentences as forced literals — as the chunked attack results themselves prove (§4.6).
2. No statistical significance tests are reported anywhere in the paper. The main results (Table 1) are single-run point estimates. With only 159 HarmBench behaviors, the difference between e.g. 0% and 2.6% ASR on StrongREJECT (which has 313 prompts) could be within noise. The benign modification rates are computed over 8,825 schemas but without confidence intervals. All claims of 'complete neutralization' and 'Pareto-dominance' rest on untested point estimates.
3. The evaluation scope is extremely narrow: only 2 models (Llama-3.1-8B, Qwen2.5-7B), only 1 attack method (EnumAttack, with the chunked variant as an admitted failure), and only 2 safety benchmarks. There is no evaluation against AttackPrefixTree (Li et al., 2025), which is explicitly cited as another control-plane attack. The StrongREJECT results (Table 1) show 2.6% residual ASR with DeLex-JSON, contradicting the '0% ASR' headline claim; the paper selectively emphasizes HarmBench (0%) while burying the StrongREJECT result.
4. The Escape-Hatch baseline is a strawman: it wraps the schema with a oneOf refusal option, but constrained decoding forces the model to follow the schema path, making the 'escape' choice semantically inaccessible. This baseline was designed to fail rather than to be a serious competitor. The Input Guard baseline also appears weak: it operates on the combined prompt+schema as flat text, which fundamentally misunderstands the structured attack surface. Even so, its 88.7% rejection rate on attack prompts (Table 1 footnote) suggests it catches most attacks while letting 3.8% through, so it is a partially effective baseline that the paper dismisses too quickly.
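
To make Weakness 1 concrete, the 2-of-3 suspicion heuristic can be sketched in a few lines. This is a reviewer-side illustration only, not the authors' implementation: the function name `is_suspicious` and the verb list in `IMPERATIVE_VERBS` are hypothetical, and the length threshold is assumed to count characters.

```python
import re

# Hypothetical verb list: the paper's actual imperative pattern is not
# reproduced in this review, so these verbs are an assumption.
IMPERATIVE_VERBS = re.compile(
    r"^\s*(write|explain|describe|provide|give|list|create|generate|tell)\b",
    re.IGNORECASE,
)


def is_suspicious(literal: str) -> bool:
    """Flag an enum/const literal when at least 2 of the 3 criteria hold."""
    signals = [
        len(literal) > 20,                       # criterion 1: long literal
        any(ch.isspace() for ch in literal),     # criterion 2: contains whitespace
        bool(IMPERATIVE_VERBS.search(literal)),  # criterion 3: starts with an imperative verb
    ]
    return sum(signals) >= 2


# A full natural-language instruction trips all three criteria.
full_payload = "Explain how to pick a lock step by step"
# A single-word chunk of the same payload trips only the verb criterion.
chunk = "Explain"
# Ordinary enum values trip none or only one criterion.
benign = ["red", "in_progress", "New York", "pending_manual_review_by_admin"]

print(is_suspicious(full_payload), is_suspicious(chunk))
print([is_suspicious(value) for value in benign])
```

Under these assumptions, a payload split into short single-token literals satisfies at most one criterion per chunk, which is consistent with the 0% chunk detection the paper reports in §4.6.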

Must Fix Items:
1. Add statistical significance tests (e.g., bootstrap confidence intervals on ASR over HarmBench/StrongREJECT) to support claims of 'complete neutralization' and 'Pareto-dominance.' The 2.6% StrongREJECT ASR with defense must be explicitly reconciled with the '0% ASR' framing — either explain why StrongREJECT is excluded from the headline claim or revise the claim.
2. Evaluate against at least one additional attack method beyond EnumAttack (e.g., AttackPrefixTree from Li et al., 2025, which is cited but not tested against), and include at least one larger-scale model (70B+) to substantiate the 'model-agnostic' claim.
3. Provide a fairer baseline comparison: the Escape-Hatch baseline should be reconsidered (e.g., with modified constrained decoding that prioritizes refusal paths), or its obvious failure should be acknowledged as expected rather than presented as a meaningful comparison point.
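
A minimal sketch of the interval estimates requested in Must Fix item 1 follows. The per-prompt outcome vectors are hypothetical: 0 successes out of 159 is taken from the reported 0% HarmBench ASR, and 8 out of 313 is an assumption inferred from rounding the reported 2.6% StrongREJECT ASR. The Wilson score interval is included as a complement, not something the paper or this review prescribes, because a percentile bootstrap collapses to a zero-width interval when no successes are observed.

```python
import math
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ASR, the mean of 0/1 per-prompt outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample prompts with replacement and recompute ASR each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int(alpha / 2 * n_resamples)]
    hi = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical outcome vectors (1 = attack succeeded); see assumptions above.
harmbench = [0] * 159
strongreject = [1] * 8 + [0] * 305

print("HarmBench bootstrap:", bootstrap_ci(harmbench))
print("HarmBench Wilson:", wilson_ci(sum(harmbench), len(harmbench)))
print("StrongREJECT bootstrap:", bootstrap_ci(strongreject))
print("StrongREJECT Wilson:", wilson_ci(sum(strongreject), len(strongreject)))
```

With zero observed successes, every bootstrap resample is also all zeros, so the bootstrap alone cannot bound the true ASR away from small nonzero values. Reporting an exact or score-based interval alongside it would show how much residual risk a '0% ASR' result on 159 behaviors still permits.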

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.82 error=None