Title: HAZARD-SIGNATURE TOMBSTONES: COMMIT-TIME FORGET LOCKOUT FOR LLM AGENT MEMORY FARS
PDF: hazard-signature-forget-lockout.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 55.6s

Strengths:
1. Clear and well-defined threat model: The paraphrase re-injection attack is a realistic and previously unaddressed vulnerability in LLM agent memory systems. The paper convincingly demonstrates that naive ID-delete is counterproductive (PRP@3 rises from 0.67 to 0.94), providing strong motivation for the proposed solution (Table 1, Section 4.4).
2. Elegant architectural insight—commit-time vs retrieval-time: The distinction between blocking at write time (preventing index pollution) vs filtering at retrieval time (slot wasting) is a meaningful systems-level contribution. The backfill ablation (Figure 2) cleanly quantifies this: retrieval-time fuzzy filtering returns only 1.0 results/query even with δ=6 backfill, while HST returns 3.0, a 3× improvement.
3. Strong empirical results on the primary benchmark: HST achieves PRP@3=0.0, Benign Recall=1.0, WriteBlockRate=1.0, and only 3% false positives on benign records (Table 1). The fuzzy set-containment matching raising HS-stability from 30% to 92% is a clean ablation that explains the method's effectiveness (Section 4.6).
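
The slot-wasting effect noted in Strength 2 can be illustrated with a toy sketch. It is not the paper's implementation: the record tuples, similarity scores, and the helper names `top_k`, `retrieve_with_filter`, and `commit` are all hypothetical, and the counts are chosen to show the mechanism, not to reproduce Figure 2.

```python
from typing import List, Tuple

# (record id, similarity to the query, matches a tombstone?) -- all values hypothetical
Record = Tuple[str, float, bool]


def top_k(index: List[Record], k: int) -> List[Record]:
    """Return the k records with the highest similarity score."""
    return sorted(index, key=lambda r: r[1], reverse=True)[:k]


def retrieve_with_filter(index: List[Record], k: int, delta: int) -> List[Record]:
    """Retrieval-time defence: fetch k + delta candidates, then drop tombstone matches."""
    candidates = top_k(index, k + delta)
    return [r for r in candidates if not r[2]][:k]


def commit(index: List[Record], record: Record) -> bool:
    """Commit-time defence: refuse the write, so the index is never polluted."""
    if record[2]:
        return False
    index.append(record)
    return True


benign = [("b0", 0.95, False), ("b1", 0.50, False), ("b2", 0.40, False), ("b3", 0.30, False)]
paraphrases = [(f"p{i}", 0.90 - 0.01 * i, True) for i in range(8)]

# Retrieval-time filtering: paraphrases are already indexed and crowd the candidate window.
polluted_index = benign + paraphrases
filtered = retrieve_with_filter(polluted_index, k=3, delta=6)

# Commit-time lockout: the same writes are attempted, but paraphrases never enter the index.
clean_index: List[Record] = []
for rec in benign + paraphrases:
    commit(clean_index, rec)
clean = top_k(clean_index, 3)

print(len(filtered), len(clean))  # prints "1 3"
```

In this toy setup the eight high-scoring paraphrases occupy eight of the nine candidate slots, so filtering leaves one usable result, while the commit-time index still fills all three.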

Weaknesses:
1. Extremely narrow and artificial evaluation: The entire experimental evaluation uses a single scenario—100 benign + 10 poisoned records from the MemoryGraft corpus, with only 12 evaluation queries and 5 paraphrases per seed (50 total). There is no evaluation across different agent types, different attack strategies (e.g., adversarial paraphrases designed to evade the specific hazard taxonomy), or different hazard categories. The hazard taxonomy (Equation 1) contains only 5 labels for data-analysis agents, making generalization claims unsupported (Section 3.2).
2. No statistical significance or robustness analysis: All results appear to come from a single experimental run. The paper reports no confidence intervals, no repeated runs over multiple seeds, and no variance. The 3% false positive rate is given without any analysis of how it varies across benign record distributions, and the 92% HS-stability figure is given without variance across paraphrases. This is a serious concern for a security-focused paper where robustness claims are central (Table 1, Section 4.4).
3. Hazard signature extraction is brittle and unanalyzed: The method relies on an LLM (DeepSeek-V3.2 at temperature=0) to classify records into a fixed 5-label taxonomy. The paper provides no analysis of: (a) the accuracy or consistency of the LLM labeler against ground-truth labels; (b) sensitivity to the wording of the extraction prompt; (c) adversarial robustness, since an attacker who knows the taxonomy can write paraphrases that avoid triggering any hazard label. The 'deterministic decoding' claim (temperature=0) fixes the decoding strategy for one prompt and one model; it does not guarantee stable labels across different prompt phrasings or model versions. This is the single most critical component, yet it receives almost no empirical validation (Section 3.2).
4. Fuzzy set-containment matching is overly broad and poorly justified: The matching criterion (Equation 3) blocks a write when Hw ⊆ Ht or Ht ⊆ Hw. If a deleted record has signature {skip validation, remote exec}, then any write carrying either label alone is blocked (Hw ⊆ Ht), and any write carrying both labels plus additional ones is also blocked (Ht ⊆ Hw). This breadth plausibly accounts for the 3% false positive rate, and it also means that a single deleted record whose signature is one common label (e.g., {skip validation}) would block every write whose signature contains that label. The paper does not analyze how the false positive rate scales with the number of tombstones or the granularity of the taxonomy.
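
The breadth of the criterion in Weakness 4 can be checked with a minimal sketch of the containment test as this review reads Equation 3. The function name `is_blocked`, the label strings, and the extra label `other label` are illustrative assumptions, not identifiers from the paper.

```python
from typing import FrozenSet, Iterable

Signature = FrozenSet[str]


def is_blocked(write_sig: Signature, tombstones: Iterable[Signature]) -> bool:
    """Block the write if H_w is a subset of H_t or H_t is a subset of H_w for any tombstone."""
    return any(write_sig <= t or t <= write_sig for t in tombstones)


# One tombstone with the two-label signature used in the example above.
pair_tombstone = [frozenset({"skip validation", "remote exec"})]

# Either label alone is a subset of the tombstone signature.
single_label_blocked = is_blocked(frozenset({"skip validation"}), pair_tombstone)

# Both labels plus an extra (hypothetical) label is a superset of the tombstone signature.
superset_blocked = is_blocked(
    frozenset({"skip validation", "remote exec", "other label"}), pair_tombstone
)

# One shared label plus a different label is neither a subset nor a superset.
partial_overlap_blocked = is_blocked(
    frozenset({"skip validation", "other label"}), pair_tombstone
)

# A tombstone holding a single common label blocks every write that contains it.
common_tombstone = [frozenset({"skip validation"})]
common_label_blocked = is_blocked(
    frozenset({"skip validation", "other label"}), common_tombstone
)

print(single_label_blocked, superset_blocked, partial_overlap_blocked, common_label_blocked)
# prints "True True False True"
```

The last case is the scaling concern: the same write passes against the two-label tombstone but is blocked once a one-label tombstone exists.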

Must Fix Items:
1. Add statistical significance: report results across multiple random seeds and/or multiple paraphrase generation attempts; include confidence intervals for PRP@3, Benign Recall, and false positive rate.
2. Evaluate adversarial paraphrases: test against paraphrases specifically crafted to evade the 5-label hazard taxonomy (e.g., paraphrases that omit explicit hazard keywords). This is critical for a security paper.
3. Analyze the LLM-based hazard signature extraction component: report classification consistency, prompt sensitivity, and at minimum discuss adversarial manipulation of the labeler.
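
As one way to approach Must Fix Item 1, the sketch below computes a percentile bootstrap confidence interval for a per-record rate. The helper `bootstrap_ci` and its parameters are hypothetical, and the outcome vector is reconstructed from the reported 3% false positive rate on 100 benign records, which is an assumption about the underlying counts. It captures sampling variability over records only; variability across seeds and paraphrase generation runs would still require repeated experiments.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-record outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(outcomes, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Assumed counts: 3 of 100 benign writes blocked (1 = false positive, 0 = accepted).
false_positive_flags = [1] * 3 + [0] * 97

point_estimate = mean(false_positive_flags)
lo, hi = bootstrap_ci(false_positive_flags)
print(f"false positive rate = {point_estimate:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The same function applies to per-query PRP@3 and Benign Recall outcomes if the authors release them.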

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None