{
  "pdf": "logrules-induction-poisoning.pdf",
  "title": "POISONING LLM-INDUCED RULE REPOSITORIES VIA INDIRECT PROMPT INJECTION FARS Analemma",
  "elapsed": 56.8,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Identifies a novel attack surface: induction-stage poisoning in LLM-based log parsing systems is a genuinely new threat vector that has not been studied before. The insight that LogRules' two-stage architecture creates a persistent vulnerability—where corrupted rules propagate to all downstream parsing—is well-motivated (Section 3.1, paragraph on rule repository persistence).",
    "Systematic evaluation design: 108 configurations (3 payloads × 4 budgets × 3 datasets × 3 seeds) provide reasonable coverage. The finding that instruction-style payload D scales monotonically with poisoning budget while E and F do not is a meaningful empirical result that distinguishes payload effectiveness (Table 1, Figure 2).",
    "Honest reporting of defense failures: The canary-based defense is proposed and its failure modes are thoroughly documented—2.8% detection on BGL due to false acceptance, and 0% recovery on HDFS because Rsafe itself has 0% PA. This transparency about limitations is commendable (Table 3, Section 4.4)."
  ],
  "weaknesses": [
    "Extremely low baseline parsing accuracy undermines attack significance: Clean baselines are only 34.8% (BGL), 17.3% (Linux), and 15.1% (HDFS). A 15.1pp drop from 15.1% clean PA on HDFS means going from ~15% to ~0%, which is already near floor. The attack's absolute impact is small when baseline performance is negligible. The paper does not discuss whether these low baselines make the attack practically meaningful (Section 4.1, Metrics paragraph).",
    "Single model and small induction set limit generalizability: Only Qwen2.5-7B-Instruct is used for both induction and deduction, and the induction set has only K=10 examples (with k up to 7 poisoned—70% poisoning rate). Real deployments would likely use larger induction sets and different models. The Appendix A cross-model analysis shows LLaMA-3-8B-Instruct has baseline PA of only 1.7%, making attack evaluation meaningless. The paper does not test with GPT-4o-mini as originally described in LogRules (Section 4.1, Models paragraph; Appendix A).",
    "Weak defense contribution with no alternative explored: The canary-based defense achieves only 42.6% overall detection and fails completely on 2 out of 3 datasets. The paper proposes no improved defense and does not compare against even simple alternatives like input sanitization, instruction hierarchy prompting, or example-level anomaly detection. The defense section reads more like a negative result than a constructive contribution (Table 3, Section 4.4).",
    "Inconsistent attack results for 2 of 3 payloads: Payloads E and F actually improve parsing accuracy on average (mean PA drop of −1.61pp and −3.57pp respectively). Only payload D works, and even it fails at k=1 on BGL (−8.40pp). The paper's claim of 'broad applicability' is overstated given that 2/3 payloads are ineffective (Table 1, Table 2).",
    "Threat model realism is questionable: The attacker controls the raw log content but not the template label, and must insert payloads into existing variable fields. The paper does not demonstrate that such payloads would survive real log ingestion pipelines (which often sanitize or truncate fields), nor does it discuss how an attacker would ensure their poisoned logs are selected for the induction set specifically (Section 3.2)."
  ],
  "must_fix_items": [
    "Address the low baseline PA issue: Discuss whether attacking a system with 15-35% accuracy has practical significance, and test on datasets or configurations where LogRules achieves higher baseline performance.",
    "Test with larger induction sets (K>10) and more realistic poisoning ratios (e.g., k=5 out of K=100): The current 70% maximum poisoning ratio is unrealistic for deployment scenarios.",
    "Provide at least one viable defense or mitigation strategy: The current defense fails on 2/3 datasets; proposing input sanitization, prompt hardening, or ensemble-based induction as alternatives would strengthen the contribution."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Identifies a novel attack surface: induction-stage poisoning in LLM-based log parsing systems is a genuinely new threat vector that has not been studied before. The insight that LogRules' two-stage architecture creates a persistent vulnerability—where corrupted rules propagate to all downstream parsing—is well-motivated (Section 3.1, paragraph on rule repository persistence).",
        "Systematic evaluation design: 108 configurations (3 payloads × 4 budgets × 3 datasets × 3 seeds) provide reasonable coverage. The finding that instruction-style payload D scales monotonically with poisoning budget while E and F do not is a meaningful empirical result that distinguishes payload effectiveness (Table 1, Figure 2).",
        "Honest reporting of defense failures: The canary-based defense is proposed and its failure modes are thoroughly documented—2.8% detection on BGL due to false acceptance, and 0% recovery on HDFS because Rsafe itself has 0% PA. This transparency about limitations is commendable (Table 3, Section 4.4)."
      ],
      "weaknesses": [
        "Extremely low baseline parsing accuracy undermines attack significance: Clean baselines are only 34.8% (BGL), 17.3% (Linux), and 15.1% (HDFS). A 15.1pp drop from 15.1% clean PA on HDFS means going from ~15% to ~0%, which is already near floor. The attack's absolute impact is small when baseline performance is negligible. The paper does not discuss whether these low baselines make the attack practically meaningful (Section 4.1, Metrics paragraph).",
        "Single model and small induction set limit generalizability: Only Qwen2.5-7B-Instruct is used for both induction and deduction, and the induction set has only K=10 examples (with k up to 7 poisoned—70% poisoning rate). Real deployments would likely use larger induction sets and different models. The Appendix A cross-model analysis shows LLaMA-3-8B-Instruct has baseline PA of only 1.7%, making attack evaluation meaningless. The paper does not test with GPT-4o-mini as originally described in LogRules (Section 4.1, Models paragraph; Appendix A).",
        "Weak defense contribution with no alternative explored: The canary-based defense achieves only 42.6% overall detection and fails completely on 2 out of 3 datasets. The paper proposes no improved defense and does not compare against even simple alternatives like input sanitization, instruction hierarchy prompting, or example-level anomaly detection. The defense section reads more like a negative result than a constructive contribution (Table 3, Section 4.4).",
        "Inconsistent attack results for 2 of 3 payloads: Payloads E and F actually improve parsing accuracy on average (mean PA drop of −1.61pp and −3.57pp respectively). Only payload D works, and even it fails at k=1 on BGL (−8.40pp). The paper's claim of 'broad applicability' is overstated given that 2/3 payloads are ineffective (Table 1, Table 2).",
        "Threat model realism is questionable: The attacker controls the raw log content but not the template label, and must insert payloads into existing variable fields. The paper does not demonstrate that such payloads would survive real log ingestion pipelines (which often sanitize or truncate fields), nor does it discuss how an attacker would ensure their poisoned logs are selected for the induction set specifically (Section 3.2)."
      ],
      "must_fix_items": [
        "Address the low baseline PA issue: Discuss whether attacking a system with 15-35% accuracy has practical significance, and test on datasets or configurations where LogRules achieves higher baseline performance.",
        "Test with larger induction sets (K>10) and more realistic poisoning ratios (e.g., k=5 out of K=100): The current 70% maximum poisoning ratio is unrealistic for deployment scenarios.",
        "Provide at least one viable defense or mitigation strategy: The current defense fails on 2/3 datasets; proposing input sanitization, prompt hardening, or ensemble-based induction as alternatives would strengthen the contribution."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}