{
  "pdf": "syntax-diversified-unlearning-leakk.pdf",
  "title": "SYNTAX-DIVERSIFIED UNLEARNING: EVALUATING DATA-SIDE INTERVENTIONS FOR REDUCING WORST-CASE LEAKAGE",
  "elapsed": 50.1,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Pre-specified success criteria with clear thresholds (≥20% leak@32 reduction, ≥0.10 Relearn SR reduction, ≤3% utility drop) prevent moving the goalposts after seeing results, which is commendable scientific rigor for a negative-result paper (Section 3.4, Table 3).",
    "Transparent reporting of per-seed variance (Table 2) reveals that the aggregate improvement is within noise: seed 456 shows worse leak@32 at T=1.0 for diversified (0.35 vs 0.15), and the best baseline seed (123) achieves leak@32=0.10 comparable to the best diversified seed. This honest disaggregation strengthens the negative conclusion.",
    "Statistical significance testing is conducted (p=0.62 for leak@32, p=0.69 for Relearn SR with n=3 seeds, Table 3), and the authors correctly conclude the improvements are not significant rather than claiming a positive finding from noise.",
    "The paper clearly contextualizes its negative result against prior work (Yoon et al., 2026) that motivated the hypothesis, and points toward representation-level interventions as a needed future direction (Section 5), which is constructive for the field."
  ],
  "weaknesses": [
    "The augmentation is extremely small: only 53 paraphrases added to 400 original entries (13% expansion, Section 3.3), with a 25% fallback rate to deterministic templates rather than truly diverse paraphrases. This makes it difficult to conclude that data-side interventions are fundamentally insufficient; the intervention tested is too weak to support such a strong claim. The paper itself acknowledges this limitation only briefly in Section 5.",
    "Only 20 target queries are used for augmentation and evaluation, and with n=3 seeds the statistical power is very low. Non-significant p-values (0.62, 0.69) from so small a sample cannot by themselves establish the absence of an effect. The paper cannot reliably distinguish between 'the intervention has no effect' and 'the study lacks power to detect a real effect', a critical distinction for a negative-result paper (Table 3).",
    "The contribution is thin: the core idea (paraphrase the forget set) is straightforward, the augmentation pipeline produces minimal diversity (avg 2.6 paraphrases per query), and the negative result is unsurprising given the weak intervention. The paper does not explore alternative diversification strategies, different augmentation scales, or ablations (e.g., what if 100% of queries had diverse paraphrases rather than 25% falling back to templates?) (Section 3.3).",
    "The paper is generated by an automated research system (explicitly stated in the abstract), which raises concerns about the depth of analysis, novelty of hypothesis generation, and whether the experimental design was optimized for rigorous testing or for rapid execution. The methodology feels like a first-pass exploration rather than a thoroughly investigated research question.",
    "Single benchmark (TOFU forget10) and single model (Llama-3.2-1B-Instruct) severely limit generalizability. Whether the negative result holds for larger models, different unlearning methods (e.g., representation-based approaches), or other benchmarks (MUSE, WMDP) is entirely unknown (Sections 3.1, 4.1)."
  ],
  "must_fix_items": [
    "The claim 'data-side interventions alone are insufficient' (Abstract, Section 5) is too strong given the weak augmentation tested (13% expansion, 25% template fallback). This should be softened to acknowledge that the specific intervention tested was insufficient, not the entire class of data-side approaches.",
    "Statistical power must be addressed: with n=3 seeds and high variance, the study cannot distinguish 'no effect' from 'underpowered study.' Either increase the number of seeds, acknowledge this as a fundamental limitation of the negative claim, or compute minimum detectable effect sizes.",
    "The 25% fallback rate (5/20 queries using deterministic templates, Section 3.3) means a quarter of the 'diversified' condition isn't actually diversified. This contamination weakens the intervention and should be analyzed separately or excluded."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Pre-specified success criteria with clear thresholds (≥20% leak@32 reduction, ≥0.10 Relearn SR reduction, ≤3% utility drop) prevent moving the goalposts after seeing results, which is commendable scientific rigor for a negative-result paper (Section 3.4, Table 3).",
        "Transparent reporting of per-seed variance (Table 2) reveals that the aggregate improvement is within noise: seed 456 shows worse leak@32 at T=1.0 for diversified (0.35 vs 0.15), and the best baseline seed (123) achieves leak@32=0.10 comparable to the best diversified seed. This honest disaggregation strengthens the negative conclusion.",
        "Statistical significance testing is conducted (p=0.62 for leak@32, p=0.69 for Relearn SR with n=3 seeds, Table 3), and the authors correctly conclude the improvements are not significant rather than claiming a positive finding from noise.",
        "The paper clearly contextualizes its negative result against prior work (Yoon et al., 2026) that motivated the hypothesis, and points toward representation-level interventions as a needed future direction (Section 5), which is constructive for the field."
      ],
      "weaknesses": [
        "The augmentation is extremely small: only 53 paraphrases added to 400 original entries (13% expansion, Section 3.3), with a 25% fallback rate to deterministic templates rather than truly diverse paraphrases. This makes it difficult to conclude that data-side interventions are fundamentally insufficient; the intervention tested is too weak to support such a strong claim. The paper itself acknowledges this limitation only briefly in Section 5.",
        "Only 20 target queries are used for augmentation and evaluation, and with n=3 seeds the statistical power is very low. Non-significant p-values (0.62, 0.69) from so small a sample cannot by themselves establish the absence of an effect. The paper cannot reliably distinguish between 'the intervention has no effect' and 'the study lacks power to detect a real effect', a critical distinction for a negative-result paper (Table 3).",
        "The contribution is thin: the core idea (paraphrase the forget set) is straightforward, the augmentation pipeline produces minimal diversity (avg 2.6 paraphrases per query), and the negative result is unsurprising given the weak intervention. The paper does not explore alternative diversification strategies, different augmentation scales, or ablations (e.g., what if 100% of queries had diverse paraphrases rather than 25% falling back to templates?) (Section 3.3).",
        "The paper is generated by an automated research system (explicitly stated in the abstract), which raises concerns about the depth of analysis, novelty of hypothesis generation, and whether the experimental design was optimized for rigorous testing or for rapid execution. The methodology feels like a first-pass exploration rather than a thoroughly investigated research question.",
        "Single benchmark (TOFU forget10) and single model (Llama-3.2-1B-Instruct) severely limit generalizability. Whether the negative result holds for larger models, different unlearning methods (e.g., representation-based approaches), or other benchmarks (MUSE, WMDP) is entirely unknown (Sections 3.1, 4.1)."
      ],
      "must_fix_items": [
        "The claim 'data-side interventions alone are insufficient' (Abstract, Section 5) is too strong given the weak augmentation tested (13% expansion, 25% template fallback). This should be softened to acknowledge that the specific intervention tested was insufficient, not the entire class of data-side approaches.",
        "Statistical power must be addressed: with n=3 seeds and high variance, the study cannot distinguish 'no effect' from 'underpowered study.' Either increase the number of seeds, acknowledge this as a fundamental limitation of the negative claim, or compute minimum detectable effect sizes.",
        "The 25% fallback rate (5/20 queries using deterministic templates, Section 3.3) means a quarter of the 'diversified' condition isn't actually diversified. This contamination weakens the intervention and should be analyzed separately or excluded."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}