Title: PERSISTENT DEMO-POOL POISONING ATTACKS ON ONLINE LLM LOG PARSERS FARS Analemma
PDF: adaparser-online-demo-poisoning.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 52.1s

Strengths:
1. Novel and timely attack surface identification: The paper identifies a real security vulnerability in online LLM log parsers that use self-generated ICL (SG-ICL), where the monotonically growing demo pool creates a persistent attack surface. This is a meaningful contribution as it highlights a previously unexamined risk in systems like AdaParser. (Section 1, Section 2.2)
2. Well-designed control experiment: The inclusion of a random noise control condition (C2) that shows negligible impact (1.50pp FTA drop on BGL, +2.83pp on Thunderbird) provides strong evidence that the degradation is specific to the targeted over-generalization strategy rather than any generic log injection. (Table 1, Section 3.2)
3. Insightful ablation on damage mechanisms: The SG-ICL ablation (Table 2a) showing only 0.67pp difference between with-SG-ICL and without-SG-ICL conditions, combined with Thunderbird's 0% ICL contamination yet 13.33pp FTA drop, provides compelling evidence that trie poisoning—not demo-pool contamination—is the dominant persistence mechanism. This is a useful finding for future defense design. (Table 2, Section 3.4)

Weaknesses:
1. Extremely narrow evaluation scope (only 2 datasets and 1 target system): The entire experimental evaluation is limited to BGL and Thunderbird from LogHub-2.0 and targets only AdaParser with GPT-4o-mini. The paper claims to identify a 'fundamental security vulnerability in online LLM log parsers' (plural), but never tests any other parser (e.g., LILAC, LLMParser, LUNAR, MicLog, all mentioned in related work) or any other LLM backbone. This severely limits the generality of the claimed contribution. (Section 3.1, Table 1)
2. No defense evaluation or discussion of mitigation feasibility: The conclusion briefly mentions 'potential defenses such as demo pool pruning, trie validation, and anomaly detection' (Section 5) but provides no experimental or analytical evidence that any of these would work. For a security paper, the absence of even a preliminary defense analysis makes the contribution one-sided and limits its practical impact. A simple experiment showing that, for example, periodic trie validation or demo-pool pruning reduces attack effectiveness would substantially strengthen the paper.
3. Statistical significance concerns: Results are averaged over only 3 random seeds with no standard deviations, confidence intervals, or significance tests reported anywhere in the paper. Given the high variance typical of LLM-based parsing and the small number of seeds, it is unclear whether the observed differences (especially the 0.67pp SG-ICL ablation difference and the +2.83pp Thunderbird C2 improvement) are statistically meaningful. (Section 3.1, Tables 1 and 2)
4. Overselling of a straightforward attack as a fundamental vulnerability: The attack itself, injecting logs with variable-looking tokens to induce over-generalized templates, is conceptually simple. The 'over-generalization strategy' described in Section 2.3 amounts to replacing static tokens with variable-looking values, which essentially exploits a known weakness of template-based parsing (high wildcard coverage). The paper wraps this in language like 'fundamental security vulnerability' and 'novel attack surface', but the core mechanism is an expected failure mode of any system that trusts its own outputs without validation.
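
To make the reporting requested in Weakness 3 concrete, the following is a minimal sketch of a paired, per-seed comparison with a 95% confidence interval. The function name `paired_diff_ci` and the per-seed scores are illustrative placeholders, not values or code from the paper; the t critical values are hard-coded for small seed counts so the sketch needs only the standard library.

```python
import math
import statistics

# Two-sided 95% t critical values, keyed by number of seeds n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def paired_diff_ci(baseline, treated):
    """Return (mean difference, CI low, CI high) for paired per-seed scores.

    Each index is assumed to be the same random seed in both conditions,
    so the per-seed difference removes seed-level variance.
    """
    if len(baseline) != len(treated):
        raise ValueError("conditions must share the same seeds")
    diffs = [b - t for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    std = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n] * std / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    # Hypothetical FTA scores (percent) for 3 seeds; NOT taken from the paper.
    clean = [80.0, 82.0, 81.0]
    attacked = [79.0, 82.5, 80.5]
    mean, low, high = paired_diff_ci(clean, attacked)
    print(f"mean drop = {mean:.2f}pp, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the interval contains zero, the difference is not distinguishable from seed noise at the 5% level. With only 3 seeds the critical value is large (4.303), which is why sub-1pp differences such as the 0.67pp ablation gap need either more seeds or explicit variance reporting.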

Must Fix Items:
1. Report standard deviations or confidence intervals over random seeds and conduct statistical significance tests (e.g., paired t-test or bootstrap) for all claimed differences in Tables 1 and 2.
2. Evaluate on at least one or two additional datasets (e.g., HDFS, Spark from LogHub-2.0) and, ideally, on at least one additional online LLM log parser to support the generalization claim.
3. Provide at least a preliminary evaluation of one proposed defense (e.g., periodic trie validation checking for templates with excessive wildcards) to demonstrate the attack is mitigable and to give practical guidance.
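
As an illustration of the kind of preliminary defense meant in item 3, the sketch below flags templates whose share of wildcard tokens exceeds a threshold. It assumes LogHub-style templates that use `<*>` as the placeholder; the function names and the 0.8 threshold are hypothetical choices for illustration and are not part of AdaParser or of the paper.

```python
WILDCARD = "<*>"  # assumption: LogHub-style variable placeholder


def wildcard_ratio(template: str) -> float:
    """Fraction of whitespace-separated tokens that contain a wildcard."""
    tokens = template.split()
    if not tokens:
        return 0.0
    return sum(WILDCARD in tok for tok in tokens) / len(tokens)


def flag_overgeneralized(templates, max_ratio=0.8):
    """Return templates whose wildcard ratio exceeds max_ratio.

    max_ratio is a hypothetical threshold; a real evaluation would need
    to tune it per dataset and report false positives on clean logs.
    """
    return [t for t in templates if wildcard_ratio(t) > max_ratio]


if __name__ == "__main__":
    # Made-up templates for illustration only.
    candidates = [
        "Connection from <*> closed",
        "<*> <*> <*> <*> <*>",
        "<*> <*> <*> error",
    ]
    for template in flag_overgeneralized(candidates):
        print("suspicious:", template)
```

A check like this could run periodically over the trie and the demo pool; reporting how much it reduces the FTA drop, and how many legitimate templates it rejects, would give the defense discussion an empirical footing.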

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None