Title: ORDER-ROBUSTNESS AUDIT OF GRADIENT MASKING METHODS FOR CONTINUAL LEARNING IN LLMS FARS Analemma
PDF: 172d7903-0e05-4048-b8be-ca21dcdabf49.pdf
Score: 3.5
Verdict: Reject
Confidence: 0.85
Elapsed: 64.2s

Strengths:
1. Important research question: whether continual learning method rankings generalize across task orderings is a genuine and under-explored concern, well-motivated by Section 1 and Related Work (Section 2, para 3 on task order effects).
2. Transparent reporting of MIGU sanity check failure (Table 1: +3.35 deviation from published 44.08), with explicit acknowledgment in Section 4.2 rather than hiding the discrepancy.
3. Seed-level paired comparison (Table 3) with per-seed breakdowns provides some transparency into variance, and FGGM's higher variance (std=1.06 vs MIGU's 0.13) is a real and informative finding about instability.
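
The sanity-check arithmetic behind Strength 2 (and behind Weaknesses 1 and 2 below) can be reproduced with a few lines. This is a minimal sketch using only the numbers quoted in this review from Table 1 and the authors' ±2.0 tolerance; the function names `sanity_check` and `ranking` are illustrative, not taken from the paper.

```python
# Values quoted in this review from Table 1 of the paper.
# TOLERANCE is the authors' own +/-2.0 threshold as cited in the review.
TOLERANCE = 2.0
PUBLISHED = {"MIGU": 44.08, "FGGM": 46.00}
REPRODUCED = {"MIGU": 47.43, "FGGM": 45.84}


def sanity_check(published, reproduced, tolerance):
    """Return {method: (deviation, passed)} comparing reproduced to published."""
    report = {}
    for method, pub in published.items():
        # Round to two decimals to match the precision reported in the table.
        deviation = round(reproduced[method] - pub, 2)
        report[method] = (deviation, abs(deviation) <= tolerance)
    return report


def ranking(scores):
    """Return method names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    for method, (dev, ok) in sanity_check(PUBLISHED, REPRODUCED, TOLERANCE).items():
        print(f"{method}: deviation {dev:+.2f} -> {'pass' if ok else 'FAIL'}")
    print("published ranking: ", ranking(PUBLISHED))
    print("reproduced ranking:", ranking(REPRODUCED))
```

Under these numbers MIGU misses the tolerance while FGGM stays within it, and the default-order ranking already differs between the published and reproduced values.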

Weaknesses:
1. FATAL: The claimed 'ranking reversal' is illusory. In the authors' own default-order results (Table 1), MIGU (47.43) already outperforms FGGM (45.84) by 1.59 points. The claim that FGGM outperforms MIGU on the default order rests on the published values (46.00 vs 44.08), not on the authors' reproduction, which already reverses that ranking. There is therefore no 'reversal' between orders: MIGU wins on both orders in the authors' hands, and the paper's central narrative collapses.
2. MIGU sanity check fails by +3.35 points (Table 1: 47.43 vs published 44.08), exceeding the authors' own ±2.0 tolerance. This means the MIGU implementation being compared is not the same MIGU as in the original paper. The authors' MIGU is boosted by an implementation artifact (DeepSpeed ZeRO-2 hook vs Accelerate), which makes all MIGU-vs-FGGM comparisons untrustworthy: the 'ranking reversal' could simply be an implementation bug inflating MIGU.
3. Only a single alternative ordering (Order 2) is tested. The paper title promises an 'audit', and although the discussion (Section 5, Limitations) acknowledges the single-ordering limitation, one reordering cannot support general claims about 'order-robustness.' With 8 tasks, there are 8! = 40,320 possible orderings; testing 2 of them is insufficient to characterize robustness.
4. No formal statistical significance tests are reported. The authors claim 'statistical separation' in Section 4.4 based on non-overlapping 1-σ intervals, but n=3 seeds is far too small for reliable variance estimation, and no t-test, permutation test, or bootstrap is conducted. This is a hard fail under HF_NO_SIGNIFICANCE.
5. The mechanistic analysis (Section 4.5, Figure 2) is purely correlational. The authors acknowledge this (Section 5, Limitations, para 3) but still present it as identifying a 'mechanism.' Low Jaccard similarity between consecutive task masks describes what happens; it does not explain why it would cause worse performance. No intervention (e.g., artificially increasing mask overlap) is attempted to test causality.
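
For reference, the descriptive quantity discussed in Weakness 5 can be computed as follows. This is a minimal sketch that assumes each task's gradient mask is available as a flat boolean sequence of equal length; the paper's actual mask representation is not specified in this review, and the names `jaccard` and `consecutive_jaccard` are hypothetical.

```python
from typing import List, Sequence


def jaccard(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two equal-length boolean masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Assumption: two empty masks are treated as identical (similarity 1.0).
    return intersection / union if union else 1.0


def consecutive_jaccard(masks: Sequence[Sequence[bool]]) -> List[float]:
    """Similarity between each task's mask and the next task's mask."""
    return [jaccard(m1, m2) for m1, m2 in zip(masks, masks[1:])]


if __name__ == "__main__":
    # Toy masks for three tasks over four parameters (illustrative only).
    toy_masks = [
        [True, True, False, False],
        [False, True, True, False],
        [False, True, True, False],
    ]
    print(consecutive_jaccard(toy_masks))
```

A computation like this only measures overlap. Testing the causal claim would additionally require varying the overlap while holding everything else fixed and observing the effect on performance.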

Must Fix Items:
1. Acknowledge that MIGU already outperforms FGGM on the default order in the reproduced implementation (47.43 vs 45.84 in Table 1), which eliminates the claimed 'ranking reversal' narrative. The paper must either (a) fix the MIGU implementation to match published values and re-run all experiments, or (b) reframe the contribution entirely away from 'ranking reversal.'
2. Conduct formal significance tests (paired t-test or bootstrap) on the 3-seed Order 2 results rather than relying on non-overlapping 1-σ intervals with n=3.
3. Test at least 3–5 additional orderings to support any claim about 'order-robustness'; a single alternative ordering is insufficient.
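
Must Fix Items 2 and 3 can be sketched with the standard library alone. The code below computes a paired t statistic and an exact sign-flip permutation p-value on per-seed differences, and samples additional task orderings. The per-seed differences and task names are hypothetical placeholders, not values from the paper, and all function names are illustrative. Note that with n=3 the exact sign-flip test has only 2^3 = 8 sign assignments, so its smallest possible two-sided p-value is 2/8 = 0.25, and a two-sided paired t-test at the 0.05 level with 2 degrees of freedom needs |t| > 4.303; more seeds would be needed for a test with reasonable power.

```python
import itertools
import math
import random
import statistics


def paired_t_statistic(diffs):
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d (df = n - 1).

    Raises ZeroDivisionError if all differences are identical.
    """
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation p-value for the mean difference."""
    observed = abs(statistics.fmean(diffs))
    count = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        flipped = abs(statistics.fmean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
        total += 1
    return count / total


def sample_orderings(tasks, k, seed=0):
    """Sample k distinct orderings of distinct tasks, excluding the given order."""
    tasks = list(tasks)
    if k > math.factorial(len(tasks)) - 1:
        raise ValueError("not enough distinct orderings available")
    rng = random.Random(seed)
    seen = {tuple(tasks)}
    orderings = []
    while len(orderings) < k:
        perm = tuple(rng.sample(tasks, len(tasks)))
        if perm not in seen:
            seen.add(perm)
            orderings.append(perm)
    return orderings


if __name__ == "__main__":
    # HYPOTHETICAL per-seed differences (method A minus method B), 3 seeds.
    diffs = [1.2, 1.5, 0.9]
    print("t statistic (df=2):", round(paired_t_statistic(diffs), 3))
    print("sign-flip p-value: ", sign_flip_p_value(diffs))

    # HYPOTHETICAL task names standing in for the paper's 8 tasks.
    tasks = [f"task_{i}" for i in range(1, 9)]
    print("total orderings:", math.factorial(len(tasks)))
    for order in sample_orderings(tasks, k=4, seed=0):
        print(order)
```

Fixing the sampling seed keeps the chosen orderings reproducible across methods, so every method is evaluated on the same set of orders.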

Runs:
- run=1 score=3.5 verdict=Reject confidence=0.85 error=None