{
  "pdf": "canary-controlled-safe-interleaving.pdf",
  "title": "CANARY-CONTROLLED SAFE-DATA INTERLEAVING FOR REDUCING EMERGENT MISALIGNMENT",
  "elapsed": 56.5,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4,
  "scores": [
    4
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.4,
    "presentation": 2.8,
    "contribution": 2.2,
    "overall_rating": 4,
    "confidence": 3
  },
  "strengths": [
    "The closed-loop adaptive interleaving concept is a natural and well-motivated improvement over fixed-ratio interleaving, grounded in the empirically observed phase-transition dynamics of emergent misalignment (Turner et al., 2025). The idea of using canary prompts as a real-time proxy for EM risk and closing the loop with a threshold controller is a principled engineering contribution to safety-focused training. (Section 3, Algorithm 1)",
    "The ablation study is well-designed and provides meaningful insight into which components matter. The Fixed p = p̄ variant (Table 2) cleanly isolates adaptive timing from total safe-data volume, showing 36% worse EM suppression despite identical average ratio. The Delayed (D=5) variant demonstrates feedback timeliness matters (59% degradation). These ablations go beyond mere endpoint comparisons and provide mechanistic understanding. (Table 2, Section 4.3)",
    "Cross-seed variance reduction is a practically important finding. The canary-controlled method achieves std=0.16 vs std=0.63 for Fixed 5% (Table 1), a 4× improvement in consistency. For safety-critical deployments, predictable behavior across random seeds is valuable. (Table 1, Section 4.2)"
  ],
  "weaknesses": [
    "The core proxy signal—canary prompts—has only weak correlation with the target metric (Pearson r=0.31, Section 4.4). This raises fundamental questions about whether the controller is actually responding to EM risk or merely applying more safe data on average (13% vs 5%). The ablation showing Fixed p = p̄ at 7.18% vs Full at 5.26% confirms some adaptive-timing benefit, but the dominant factor may simply be the higher average interleaving ratio. A fairer baseline would be Fixed 10% or Fixed 15% interleaving to disentangle ratio from adaptivity. The absence of this comparison is a significant gap. (Section 4.4, Table 1, Table 2)",
    "Evaluation is limited to a single benchmark (Security EM) on a single model (Qwen2.5-7B-Instruct). The paper acknowledges this limitation (Section 4.5) but does not provide any evidence of generalization. Given that the method relies on canary prompts whose correlation with EM risk is already weak (r=0.31), it is unclear whether the approach would transfer to other EM-inducing datasets, model families, or scales. This severely limits confidence in the contribution's generality. (Section 4.1, Section 4.5)",
    "The paper was generated by an automated research system (explicitly stated in the abstract). While the work appears technically coherent, this origin raises concerns about depth of insight. For instance, the discussion of the weak canary-general correlation (r=0.31) is acknowledged but not deeply analyzed—what are the failure modes? When does the canary signal lead the controller astray? How sensitive is the method to the choice of canary prompts? These questions, critical for a safety-focused method, are not explored. (Abstract, Section 4.4, Section 4.5)"
  ],
  "must_fix_items": [
    "Add baselines with higher fixed interleaving ratios (e.g., 10%, 15%, 20%) to fairly assess whether the improvement comes from adaptivity or simply from more safe data on average. Without this, the 25% relative improvement claim over Fixed 5% is misleading if Fixed 15% would achieve comparable or better results.",
    "Report statistical significance tests (e.g., bootstrap confidence intervals or t-tests) for the main comparisons. With only 3 seeds, the difference between 5.39±0.16 and 7.15±0.63 may or may not be statistically significant. The paper cites no p-values or confidence intervals beyond standard deviations.",
    "Provide analysis of canary prompt selection sensitivity—how results change with different canary sets, canary set sizes, or canary prompt designs. Given r=0.31 correlation, this is critical for assessing robustness of the risk signal."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "The closed-loop adaptive interleaving concept is a natural and well-motivated improvement over fixed-ratio interleaving, grounded in the empirically observed phase-transition dynamics of emergent misalignment (Turner et al., 2025). The idea of using canary prompts as a real-time proxy for EM risk and closing the loop with a threshold controller is a principled engineering contribution to safety-focused training. (Section 3, Algorithm 1)",
        "The ablation study is well-designed and provides meaningful insight into which components matter. The Fixed p = p̄ variant (Table 2) cleanly isolates adaptive timing from total safe-data volume, showing 36% worse EM suppression despite identical average ratio. The Delayed (D=5) variant demonstrates feedback timeliness matters (59% degradation). These ablations go beyond mere endpoint comparisons and provide mechanistic understanding. (Table 2, Section 4.3)",
        "Cross-seed variance reduction is a practically important finding. The canary-controlled method achieves std=0.16 vs std=0.63 for Fixed 5% (Table 1), a 4× improvement in consistency. For safety-critical deployments, predictable behavior across random seeds is valuable. (Table 1, Section 4.2)"
      ],
      "weaknesses": [
        "The core proxy signal—canary prompts—has only weak correlation with the target metric (Pearson r=0.31, Section 4.4). This raises fundamental questions about whether the controller is actually responding to EM risk or merely applying more safe data on average (13% vs 5%). The ablation showing Fixed p = p̄ at 7.18% vs Full at 5.26% confirms some adaptive-timing benefit, but the dominant factor may simply be the higher average interleaving ratio. A fairer baseline would be Fixed 10% or Fixed 15% interleaving to disentangle ratio from adaptivity. The absence of this comparison is a significant gap. (Section 4.4, Table 1, Table 2)",
        "Evaluation is limited to a single benchmark (Security EM) on a single model (Qwen2.5-7B-Instruct). The paper acknowledges this limitation (Section 4.5) but does not provide any evidence of generalization. Given that the method relies on canary prompts whose correlation with EM risk is already weak (r=0.31), it is unclear whether the approach would transfer to other EM-inducing datasets, model families, or scales. This severely limits confidence in the contribution's generality. (Section 4.1, Section 4.5)",
        "The paper was generated by an automated research system (explicitly stated in the abstract). While the work appears technically coherent, this origin raises concerns about depth of insight. For instance, the discussion of the weak canary-general correlation (r=0.31) is acknowledged but not deeply analyzed—what are the failure modes? When does the canary signal lead the controller astray? How sensitive is the method to the choice of canary prompts? These questions, critical for a safety-focused method, are not explored. (Abstract, Section 4.4, Section 4.5)"
      ],
      "must_fix_items": [
        "Add baselines with higher fixed interleaving ratios (e.g., 10%, 15%, 20%) to fairly assess whether the improvement comes from adaptivity or simply from more safe data on average. Without this, the 25% relative improvement claim over Fixed 5% is misleading if Fixed 15% would achieve comparable or better results.",
        "Report statistical significance tests (e.g., bootstrap confidence intervals or t-tests) for the main comparisons. With only 3 seeds, the difference between 5.39±0.16 and 7.15±0.63 may or may not be statistically significant. The paper cites no p-values or confidence intervals beyond standard deviations.",
        "Provide analysis of canary prompt selection sensitivity—how results change with different canary sets, canary set sizes, or canary prompt designs. Given r=0.31 correlation, this is critical for assessing robustness of the risk signal."
      ],
      "conference_scores": {
        "soundness": 2.4,
        "presentation": 2.8,
        "contribution": 2.2,
        "overall_rating": 4,
        "confidence": 3
      }
    }
  ]
}