Title: HEADROLLBACK: POST-TASK ATTENTION HEAD
PDF: f427337f-6766-4d81-bdde-092bd5a0d260.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.78
Elapsed: 260.6s

Strengths:
1. The paper is honest about its statistical weakness: Section 4.2 explicitly states 'improvements are not statistically significant at p < 0.05 with N = 9 runs due to limited statistical power,' and the Limitations section reiterates this. This transparency is commendable and rare.
2. The ablation comparing HeadRollback with HighDisruptionOnly (Table 1, Figure 2) is informative: Jaccard similarity below 0.10 shows that the importance weighting qualitatively changes which heads are selected, and the 3–7× higher I_old for HeadRollback-selected heads (Figure 2b) provides mechanistic evidence that the formula targets historically important heads rather than just the most-changed ones.
3. The method is genuinely replay-free and operates at task boundaries without modifying training dynamics (Algorithm 1). The minimal state requirement (one scalar per head plus one LoRA checkpoint) is a practical advantage over replay-based methods. Table 2 shows that current-task accuracy is preserved (+0.12pp mean change) while earlier tasks recover +1.9 to +4.9pp, supporting the 'unintentionally disrupted' hypothesis.

Weaknesses:
1. No statistical significance, by the paper's own admission (Section 4.2). N=9 runs (3 orders × 3 seeds) yields extremely low power. The reported Cohen's d of 0.40–0.67 is small-to-medium; even at d = 0.67, roughly 26 samples per group are needed for 80% power, far more than the 9 available. The 7/9 win rate is consistent with chance under the null (one-sided binomial test against 0.5: p ≈ 0.09, borderline at best). The core claim of improvement is therefore not supported statistically.
2. Missing all substantive continual learning baselines. The paper cites EWC, LwF, GEM, DER++, O-LoRA, MIGU, MoFO, Merge-before-Forget, SSR, and FOREVER in Related Work (Section 2), yet compares only against Vanilla LoRA and a single ablation (HighDisruptionOnly). Zero-shot and self-consistency prompting are not comparable baselines because they involve no fine-tuning. Without EWC, LwF, O-LoRA, or any other regularization-based alternative, it is impossible to assess whether HeadRollback offers any advantage over even the simplest forgetting mitigations.
3. Single benchmark, single tiny model, single task type. The evaluation uses only 5 text classification tasks on Qwen3-0.6B-Base (0.6B parameters). There are no larger models (1.5B, 7B, 14B), no generation tasks, no other domains, and no longer task sequences (K>5). The B-row-only rollback is acknowledged to be approximate when A drifts (Section 3.4: 'approximately restores the effective LoRA update ΔW for selected heads when A is relatively stable across tasks'), and this drift accumulates with more tasks, which makes the 5-task setting particularly favorable. Generalizability is entirely untested.
4. Limited novelty once the packaging is stripped away. The core contribution is a heuristic scoring formula s(h) = Δ(h) · I_old(h) / (I_new(h) + ε) (Equation 5) and a top-p% selection with row rollback (Equation 6). The three-signal framing ('disruption magnitude, historical importance, new-task importance') dresses up a straightforward weighted product. The rollback operation itself simply overwrites B rows from a saved checkpoint, which is essentially a selective parameter reset. The novelty is incremental: gradient-based importance tracking and selective rollback are well-established ideas, and combining them in a product formula with a median-based ε is a design choice, not a conceptual advance.

Must Fix Items:
1. Add at least 2–3 continual learning baselines (EWC, O-LoRA, or LwF at minimum) to Table 1. Without these, the comparison is incomplete and the contribution cannot be calibrated against the field.
2. Run a properly powered experiment or acknowledge that the results are preliminary. With Cohen's d ≈ 0.5, roughly 64 runs are needed for adequate power, against the N=9 reported. At minimum, report paired t-test or Wilcoxon signed-rank p-values and confidence intervals for the OP/BWT differences.
3. Evaluate on at least one additional model scale (e.g., 1.5B or 7B) and one additional benchmark or task type (generation, NLU, or longer task sequences with K≥8) to demonstrate generalizability beyond the single favorable setting.

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.78 error=None
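
Note on the statistics in Weakness 1 and Must Fix item 2: the one-sided binomial p-value for a 7/9 win rate, and the order of magnitude of the sample size needed at d ≈ 0.5, can be checked with a short standard-library sketch. The function names are illustrative, and the sample-size helper assumes a two-sample, two-sided test under the normal approximation, which may not be the test design the paper would actually use; exact t-based calculations come out slightly higher.

```python
from math import ceil, comb
from statistics import NormalDist


def binomial_one_sided_p(wins: int, n: int, p0: float = 0.5) -> float:
    """P(X >= wins) for X ~ Binomial(n, p0): one-sided test against chance."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(wins, n + 1))


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sample, two-sided test.

    Assumption: normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


p_value = binomial_one_sided_p(7, 9)  # 46/512, about 0.09
runs_needed = n_per_group(0.5)        # 63 under the normal approximation

print(f"one-sided binomial p for 7/9 wins: {p_value:.4f}")
print(f"approx. runs per group at d = 0.5: {runs_needed}")
```

Under these assumptions the sketch reproduces p ≈ 0.09 for the win rate and a requirement in the low sixties per group at d = 0.5, consistent with the figures quoted above.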
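
Note on Weakness 4: the scoring and rollback steps as the review describes them (Equations 5 and 6) are simple enough to sketch in a few lines, which illustrates the point about limited novelty. This is a hypothetical reconstruction, not the authors' code. In particular, taking ε as the median of I_new, mapping head h to a contiguous block of `head_dim` rows of B, and all function names are assumptions made for illustration.

```python
from statistics import median


def headrollback_scores(delta, i_old, i_new):
    """s(h) = delta(h) * I_old(h) / (I_new(h) + eps), one score per head.

    Assumption: eps is the median of I_new (the review only says 'median-based').
    """
    eps = median(i_new)
    return [d * o / (n + eps) for d, o, n in zip(delta, i_old, i_new)]


def select_top_heads(scores, p):
    """Indices of the top-p fraction of heads by score (at least one head)."""
    k = max(1, round(len(scores) * p))
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def rollback_rows(b_current, b_checkpoint, heads, head_dim):
    """Overwrite the B rows of the selected heads with the checkpointed rows.

    Assumption: head h owns rows [h * head_dim, (h + 1) * head_dim) of B.
    """
    restored = [row[:] for row in b_current]
    for h in heads:
        for r in range(h * head_dim, (h + 1) * head_dim):
            restored[r] = b_checkpoint[r][:]
    return restored


# Toy example with 4 heads; head 0 changed the most, but head 2 scores highest
# because it was historically important and matters little for the new task.
delta = [0.9, 0.2, 0.8, 0.1]
i_old = [0.1, 0.5, 0.9, 0.3]
i_new = [0.8, 0.1, 0.2, 0.4]

scores = headrollback_scores(delta, i_old, i_new)
selected = select_top_heads(scores, p=0.25)

b_now = [[1.0] for _ in range(8)]   # current LoRA B, 4 heads x head_dim 2
b_ckpt = [[0.0] for _ in range(8)]  # checkpoint saved before the new task
b_restored = rollback_rows(b_now, b_ckpt, selected, head_dim=2)
print(selected, b_restored)
```

In the toy example the head with the largest Δ is not the one selected, which mirrors the low Jaccard overlap with HighDisruptionOnly noted under Strengths.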