{
  "pdf": "key-search-bypasses-encrypted-activation-monitors.pdf",
  "title": "KEY-SEARCH ATTACKS BYPASS ENCRYPTED ACTIVA-TION MONITORS FARS Analemma",
  "elapsed": 60.3,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.8,
  "scores": [
    3.8
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.4,
    "presentation": 2.7,
    "contribution": 2.2,
    "overall_rating": 3.8,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies a genuinely important and under-explored vulnerability at the intersection of privacy-preserving inference and safety monitoring. The tension between key diversity (for privacy) and key-search attack resistance is a real architectural concern that deployers of systems like OSNIP need to understand. Section 1 clearly articulates the problem and the threat model is well-defined in Section 3.2.",
    "The experimental results are systematic and well-structured. Table 1 provides a clean sweep over K∈{1,...,512} showing monotonic TPR degradation from 84.9% to 16.2%, which is compelling quantitative evidence for the attack's effectiveness. The log-linear scaling observed in Figure 2 is a useful characterization of attack behavior.",
    "The ablation on key diversity (Section 4.3, Figure 3) is the paper's most insightful contribution. Showing that removing diversity regularization (λ2=0) reduces attack effectiveness by 3.5–5.0× (from 25.0pp to 5.0pp TPR drop at K=64) directly establishes the causal mechanism. The correlation analysis in Figure 4 (r=0.458, p<1e-13) further corroborates that per-prompt score variance drives bypass probability."
  ],
  "weaknesses": [
    "The paper's central claim is significantly undercut by its own results. Table 2 shows that the 'effective' key-search attack (25.0pp TPR drop) only works when the encryptor already violates both utility (KL=0.031 vs target 0.02) and privacy (ASR@10=0.526 vs target 0.20) constraints. The standard encryptor that satisfies these constraints is largely resistant (only 5.0pp drop). This means the attack primarily exploits a misconfigured system, not a fundamental vulnerability in well-designed schemes. The paper's title and framing ('Bypass Encrypted Activation Monitors') significantly overstate the practical impact relative to what the evidence supports.",
    "The threat model in Section 3.2 assumes the attacker can observe monitor scores for each key attempt, but the paper does not adequately justify or analyze this assumption. If the server does not expose monitor scores (a common deployment choice), the attack degrades to blind sampling with no selection advantage. The paper mentions this in the Limitations section but does not quantify how much the attack degrades under more realistic threat models. This is a critical gap because the entire attack mechanism depends on score observability.",
    "The paper evaluates on only one model (Qwen2.5-7B-Instruct), one dataset (HarmBench), and one encryptor architecture (OSNIP-style MLP). The generalizability is unknown. Different model sizes, architectures, or monitor designs (e.g., ensemble monitors, multi-layer monitors) could produce very different score variance profiles and thus very different attack effectiveness. The 'Standard' encryptor (λ2=0) already violates the utility constraint (KL=0.039 > 0.02 in Table 2), meaning no configuration actually satisfies all three desiderata simultaneously—raising the question of whether the tradeoff is even realizable in practice.",
    "The monitor architecture is extremely simple (Linear(3584,32)→ReLU→Linear(32,1)→Sigmoid, trained on only 240+240 samples for 5000 epochs with heavy weight decay of 1.0). This is likely an under-capacity monitor that may have unusually high score variance across keys precisely because it is poorly calibrated. A more robust monitor design could naturally reduce score variance and resist key-search attacks without any changes to the encryptor, making the vulnerability partly an artifact of the monitor design choice rather than a fundamental property of the architecture."
  ],
  "must_fix_items": [
    "The title and abstract framing should be revised to reflect that key-search attacks primarily bypass monitors only when the encryptor is configured with high key diversity that already violates utility and privacy constraints. The current framing implies a more general vulnerability than the evidence supports.",
    "Provide analysis or experiments for the case where monitor scores are not observable to the attacker (score-free threat model), as this is a more restrictive but also more realistic deployment assumption.",
    "Address the inconsistency in Table 2 where the 'Standard' encryptor (λ2=0) already violates the utility constraint (KL=0.039 > 0.02). If no configuration satisfies all three constraints simultaneously, this should be explicitly discussed as it changes the nature of the claimed tradeoff."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.8,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies a genuinely important and under-explored vulnerability at the intersection of privacy-preserving inference and safety monitoring. The tension between key diversity (for privacy) and key-search attack resistance is a real architectural concern that deployers of systems like OSNIP need to understand. Section 1 clearly articulates the problem and the threat model is well-defined in Section 3.2.",
        "The experimental results are systematic and well-structured. Table 1 provides a clean sweep over K∈{1,...,512} showing monotonic TPR degradation from 84.9% to 16.2%, which is compelling quantitative evidence for the attack's effectiveness. The log-linear scaling observed in Figure 2 is a useful characterization of attack behavior.",
        "The ablation on key diversity (Section 4.3, Figure 3) is the paper's most insightful contribution. Showing that removing diversity regularization (λ2=0) reduces attack effectiveness by 3.5–5.0× (from 25.0pp to 5.0pp TPR drop at K=64) directly establishes the causal mechanism. The correlation analysis in Figure 4 (r=0.458, p<1e-13) further corroborates that per-prompt score variance drives bypass probability."
      ],
      "weaknesses": [
        "The paper's central claim is significantly undercut by its own results. Table 2 shows that the 'effective' key-search attack (25.0pp TPR drop) only works when the encryptor already violates both utility (KL=0.031 vs target 0.02) and privacy (ASR@10=0.526 vs target 0.20) constraints. The standard encryptor that satisfies these constraints is largely resistant (only 5.0pp drop). This means the attack primarily exploits a misconfigured system, not a fundamental vulnerability in well-designed schemes. The paper's title and framing ('Bypass Encrypted Activation Monitors') significantly overstate the practical impact relative to what the evidence supports.",
        "The threat model in Section 3.2 assumes the attacker can observe monitor scores for each key attempt, but the paper does not adequately justify or analyze this assumption. If the server does not expose monitor scores (a common deployment choice), the attack degrades to blind sampling with no selection advantage. The paper mentions this in the Limitations section but does not quantify how much the attack degrades under more realistic threat models. This is a critical gap because the entire attack mechanism depends on score observability.",
        "The paper evaluates on only one model (Qwen2.5-7B-Instruct), one dataset (HarmBench), and one encryptor architecture (OSNIP-style MLP). The generalizability is unknown. Different model sizes, architectures, or monitor designs (e.g., ensemble monitors, multi-layer monitors) could produce very different score variance profiles and thus very different attack effectiveness. The 'Standard' encryptor (λ2=0) already violates the utility constraint (KL=0.039 > 0.02 in Table 2), meaning no configuration actually satisfies all three desiderata simultaneously—raising the question of whether the tradeoff is even realizable in practice.",
        "The monitor architecture is extremely simple (Linear(3584,32)→ReLU→Linear(32,1)→Sigmoid, trained on only 240+240 samples for 5000 epochs with heavy weight decay of 1.0). This is likely an under-capacity monitor that may have unusually high score variance across keys precisely because it is poorly calibrated. A more robust monitor design could naturally reduce score variance and resist key-search attacks without any changes to the encryptor, making the vulnerability partly an artifact of the monitor design choice rather than a fundamental property of the architecture."
      ],
      "must_fix_items": [
        "The title and abstract framing should be revised to reflect that key-search attacks primarily bypass monitors only when the encryptor is configured with high key diversity that already violates utility and privacy constraints. The current framing implies a more general vulnerability than the evidence supports.",
        "Provide analysis or experiments for the case where monitor scores are not observable to the attacker (score-free threat model), as this is a more restrictive but also more realistic deployment assumption.",
        "Address the inconsistency in Table 2 where the 'Standard' encryptor (λ2=0) already violates the utility constraint (KL=0.039 > 0.02). If no configuration satisfies all three constraints simultaneously, this should be explicitly discussed as it changes the nature of the claimed tradeoff."
      ],
      "conference_scores": {
        "soundness": 2.4,
        "presentation": 2.7,
        "contribution": 2.2,
        "overall_rating": 3.8,
        "confidence": 3
      }
    }
  ]
}