Title: KL-TIME REPLAY: FUNCTION-SPACE DRIFT MONITORING FOR CONTINUAL LEARNING IN LLMS FARS Analemma
PDF: kl-drift-replay-scheduling.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 52.2s

Strengths:
1. Clear and well-motivated research question: the paper correctly identifies that parameter-space drift (as used in FOREVER) may not directly reflect behavioral forgetting, and proposes a function-space alternative. The intuition that 'large parameter motion may occur in directions irrelevant to past-task behavior, while small parameter changes in sensitive directions can cause substantial output changes' (Section 3.2) is sound and well-articulated.
2. The correlation analysis (Figure 2) is informative: showing within-task Pearson >0.97 but full-trajectory Pearson = 0.46, Spearman = 0.58 provides concrete evidence that KL drift and parameter norms diverge at task boundaries. This is the most analytically interesting finding in the paper and genuinely adds understanding of when and why the two signals differ.
3. The paper is transparent about its automated origin (abstract WARNING) and provides public code. The compute parity constraint (±10% replay budget matching) is a reasonable design choice for fair comparison against FOREVER (Section 3.5).
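
Illustration for Strength 2: a minimal sketch of how two signals can correlate almost perfectly within each task yet only weakly over the full trajectory. The values are synthetic and are not taken from the paper. The sketch assumes, purely for illustration, that the parameter-norm signal restarts at each task boundary while the KL signal carries a level shift across it; the paper's actual mechanism may differ. All variable names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic toy trajectories (assumption: NOT the paper's measurements).
param_norm_task1 = [1.0, 2.0, 3.0]
kl_drift_task1 = [1.0, 2.0, 3.0]
param_norm_task2 = [1.0, 2.0, 3.0]     # norm restarts at the task boundary
kl_drift_task2 = [11.0, 12.0, 13.0]    # KL signal shifts level at the boundary

within_1 = pearson(param_norm_task1, kl_drift_task1)
within_2 = pearson(param_norm_task2, kl_drift_task2)
pooled = pearson(param_norm_task1 + param_norm_task2,
                 kl_drift_task1 + kl_drift_task2)

print(f"within task 1: {within_1:.3f}")
print(f"within task 2: {within_2:.3f}")
print(f"full trajectory: {pooled:.3f}")
```

In this toy case the within-task correlations are essentially 1 while the pooled correlation is far lower, which is the qualitative pattern the paper reports in Figure 2.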

Weaknesses:
1. Marginal contribution over FOREVER: KL-Time achieves OP 0.704 vs FOREVER's 0.708 (Table 1), a gap well within one standard deviation (±0.015 vs ±0.014), so the two methods are effectively equivalent. The '14.24% trigger divergence' is presented as evidence of a meaningful difference, but if different scheduling decisions yield equivalent outcomes, this undermines rather than supports the claim of practical advantage. The paper shows that the signals differ but does not demonstrate that the function-space signal leads to better or more efficient scheduling in any measurable way.
2. Severely limited experimental scope: only one benchmark (5-task Standard CL), one base model (Qwen3-0.6B), and one LoRA configuration (r=8). There is no evaluation on generative tasks, longer task sequences, larger models, or other CL benchmarks (e.g., Long Sequence benchmarks, SuperNI, OASST). Moreover, the ablation (Table 2) shows that current-task anchors outperform the proposed past-task anchors (OP 0.713 vs 0.704, with 37% fewer replay events), which suggests that the core proposal, monitoring past-task drift, is suboptimal compared to a simpler alternative.
3. Statistical significance is not reported: the paper reports means and standard deviations but never conducts statistical tests (t-test, Wilcoxon, etc.) to confirm whether the differences between methods are significant. With n=9 runs and overlapping error bars across all methods, it is likely that no method significantly outperforms another. This triggers HF_NO_SIGNIFICANCE concerns. In addition, the 14.24% divergence figure is judged against a threshold that is never justified (why 10%?), and 2 of 9 runs fail it (Figure 3).
4. The Ebbinghaus schedule adaptation is mechanical and lacks justification: mapping drift units to 'virtual days' via a 24-step warm-up window (Section 3.5) is an arbitrary engineering choice. The ρ scaling factor is calibrated post-hoc to match FOREVER's compute budget, which circularly ensures parity rather than allowing KL-Time to find its own optimal schedule. This makes the comparison more about matching compute than about the intrinsic quality of the drift signal.

Must Fix Items:
1. Add statistical significance tests (paired t-test or bootstrap) for all pairwise comparisons in Table 1 and Table 2, so that claims of 'comparable performance' can be distinguished from demonstrated equivalence.
2. Explain why past-task anchors (the proposed method) are outperformed by current-task anchors in the ablation (Table 2), and either revise the default or provide a principled argument for past-task anchors despite worse performance.
3. Evaluate on at least one additional benchmark (e.g., a longer task sequence or a generative/sequence-level task) to demonstrate generalizability beyond a single 5-task classification setup.
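
For Must Fix Item 1, the requested analysis could look roughly like the sketch below. It assumes that per-run OP scores are available for each method and that runs are paired (for example, by seed or task order); the paper does not state this, so the pairing is an assumption. The function names `paired_t_statistic` and `paired_bootstrap_ci` are hypothetical helpers written for this illustration, and the input values are placeholders rather than results from the paper.

```python
import math
import random
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

def paired_bootstrap_ci(a, b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference a - b."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi

# Placeholder per-run scores (assumption: NOT the paper's numbers).
method_a = [3.0, 5.0, 7.0]
method_b = [2.0, 3.0, 4.0]

t_stat = paired_t_statistic(method_a, method_b)
mean_diff, ci_lo, ci_hi = paired_bootstrap_ci(method_a, method_b)
print(f"t = {t_stat:.3f}, mean diff = {mean_diff:.3f}, "
      f"95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

With n=9 runs, a confidence interval on the paired difference that includes zero would support 'no detectable difference' but not equivalence; an equivalence claim would additionally need a pre-specified margin.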

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None