Title: ALIGNDEFTOK: TRAINING-FREE TRANSFER OF DEFENSIVETOKENS EMBEDDING-SPACE ALIGN-
PDF: transferable-defensive-tokens.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 52.9s

Strengths:
1. Addresses a practical deployment problem: DefensiveTokens require ~16 GPU-hours per model, and AlignDefTok offers 133-285× compute savings, which is a clear operational benefit (Section 3.3, Table 1).
2. The Orthogonal Procrustes alignment is norm-preserving by construction (Equations 1-3), and the paper empirically confirms norm preservation to within 1e-5 error (Section 3.3, Section 4.4). This is a principled choice given that high-norm embeddings are critical for defense effectiveness.
3. The tiny-adapt ablation (Table 2b, Figure 2) shows Procrustes initialization converges 4× faster than random initialization (25 vs 100 steps to <5% ASR), demonstrating the value of the alignment as initialization even when direct transfer is insufficient.
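
Note on Strength 2: the norm-preservation property follows directly from the orthogonality of the Procrustes solution. The sketch below illustrates this on synthetic data with NumPy. It is not the paper's code; the array names, sizes, and the stand-in "defensive tokens" are assumptions made for illustration only.

```python
import numpy as np

def procrustes_rotation(src, tgt):
    """Return the orthogonal W that minimizes ||src @ W - tgt||_F."""
    # SVD of the cross-covariance between the two embedding tables.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
d = 16                                               # toy embedding dimension (assumption)
src = rng.normal(size=(200, d))                      # hypothetical source-model embeddings
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
tgt = src @ true_rot                                 # hypothetical target-model embeddings

w = procrustes_rotation(src, tgt)

# Stand-in high-norm "defensive tokens" living in the source space.
tokens = 50.0 * rng.normal(size=(5, d))
moved = tokens @ w

# Largest change in per-token L2 norm caused by the transfer.
norm_error = np.max(
    np.abs(np.linalg.norm(moved, axis=1) - np.linalg.norm(tokens, axis=1))
)
print("max norm change:", norm_error)
```

Because W is orthogonal, each token's norm changes only by floating-point round-off, which is consistent with the sub-1e-5 error the paper reports.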

Weaknesses:
1. Extremely narrow experimental scope: only two models are tested (Llama-3-8B-Instruct and Llama-3.1-8B-Instruct), and they are closely related siblings that share the same tokenizer and embedding dimension. The paper acknowledges this limitation, but it severely undermines the generalizability claims. There are no experiments on models from different families, on models of different sizes, or even on different fine-tunes of the same base (Section 4.1).
2. Procrustes alignment provides negligible improvement over Direct Copy: Table 2a shows Direct Copy achieves 0.0% ASR for 3.1→3 (identical to Procrustes) and 34.6% vs 33.7% for 3→3.1 (a difference of only 0.9 percentage points). This means the core methodological contribution, Orthogonal Procrustes alignment, adds almost nothing beyond naive token copying for these nearly identical models. The paper's title and framing overstate this contribution.
3. No statistical significance testing: results are reported as single runs, with no confidence intervals, error bars, p-values, or multiple seeds. ASR measured on only 208 test pairs (AlpacaFarm) with string-match detection is noisy: a single flipped example moves ASR by about 0.5 percentage points, so the difference between 33.7% and 34.6% (Table 2a) or between 1.9% and 2.9% (Table 2b) amounts to roughly two examples and may not be statistically meaningful.
4. Only one benchmark (AlpacaFarm) is used for evaluation, with only one attack detection method (string match). The paper does not evaluate on any other benchmark (e.g., BIPIA for prompt injection, or jailbreak-oriented suites such as HarmBench and DeepInception), nor does it assess utility preservation (task performance on clean inputs without attacks), which is critical for any defense method.
5. The 'asymmetric transfer difficulty' analysis (Section 4.4) is speculative and lacks a mechanistic explanation. The claim that Llama-3.1's embedding space is 'more receptive' is not supported by any diagnostic experiments (e.g., measuring how far the learned rotation deviates from the identity, analyzing the singular value spectrum of the cross-covariance matrix, or probing embedding-space geometry).
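
Note on Weakness 3: the sampling noise at n = 208 can be made concrete with Wilson score intervals. The sketch below uses only the Python standard library. The per-row counts are an assumption: they are reconstructed by rounding the reported ASR times 208, since the review does not have access to raw counts.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

N = 208  # number of test pairs stated in the review

# ASSUMED counts: round(reported ASR * 208); not taken from the paper.
counts = {
    "33.7% row (Table 2a)": 70,
    "34.6% row (Table 2a)": 72,
    "1.9% row (Table 2b)": 4,
    "2.9% row (Table 2b)": 6,
}

intervals = {name: wilson_interval(k, N) for name, k in counts.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name}: ASR={counts[name] / N:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Under these assumed counts, each interval contains the point estimate of the row it is being compared against, so neither pair of results can be separated on a single run.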

Must Fix Items:
1. Add multiple benchmark evaluations beyond AlpacaFarm and report utility preservation on clean inputs.
2. Report results with multiple random seeds and statistical significance tests; single-run results on 208 samples are unreliable for small ASR differences.
3. Test on at least one additional model pair (e.g., different fine-tunes, different sizes within Llama family, or Mistral variants) to substantiate the generalizability claim.
4. Provide deeper analysis of why Procrustes adds nearly nothing over Direct Copy, and clarify whether the method's value is primarily in the tiny-adapt initialization rather than the alignment itself.
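
Note on Must Fix item 4: one inexpensive diagnostic would be to report how far the learned rotation is from the identity, alongside the alignment residual with and without the rotation. The sketch below shows what such a report could look like on synthetic data. The function name `alignment_diagnostics`, the metric names, and the toy "sibling" embeddings are hypothetical and are not taken from the paper.

```python
import numpy as np

def alignment_diagnostics(src, tgt):
    """Compare Direct Copy (identity map) with the Procrustes rotation."""
    u, s, vt = np.linalg.svd(src.T @ tgt)
    w = u @ vt                      # orthogonal Procrustes solution
    d = w.shape[0]
    tgt_norm = np.linalg.norm(tgt)
    return {
        # 0 when W is the identity; grows as the spaces are rotated apart.
        "rotation_offset": np.linalg.norm(w - np.eye(d)) / np.sqrt(d),
        # Relative residual if source embeddings are copied unchanged.
        "copy_residual": np.linalg.norm(src - tgt) / tgt_norm,
        # Relative residual after applying the rotation.
        "procrustes_residual": np.linalg.norm(src @ w - tgt) / tgt_norm,
        # Spectrum of the cross-covariance matrix.
        "singular_values": s,
    }

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 32))                 # toy source embeddings
tgt = src + 0.01 * rng.normal(size=src.shape)    # toy near-identical sibling

diag = alignment_diagnostics(src, tgt)
for key in ("rotation_offset", "copy_residual", "procrustes_residual"):
    print(f"{key}: {diag[key]:.5f}")
```

If the real Llama-3 and Llama-3.1 embedding tables show a rotation offset near zero and nearly equal residuals, that would explain why Procrustes and Direct Copy perform alike and would support reframing the contribution around the tiny-adapt initialization.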

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None