Title: COUNTERFACTUAL GATE SUPERVISION DOES NOT FIX GATING CREDIT ASSIGNMENT IN ENGRAM-STYLE CONDITIONAL MEMORY
PDF: d708e179-f280-4ada-9ba2-6cc4b6521772.pdf
Score: 5.0
Verdict: Reject
Confidence: 0.72
Elapsed: 83.0s

Strengths:
1. Honest negative-result reporting: the paper clearly states CGS fails to fix gating credit assignment and does not try to spin the negative outcome. The oracle-AUC degradation (0.549→0.528, Table 1) and the persistence of the hot/cold flip pathology are reported transparently (Section 3.3, Section 3.5).
2. Well-designed iso-compute control (Section 3.1, Table 1): the iso-compute baseline at 5134 steps separates the effect of extra computation from that of improved supervision. This is the strongest experimental contribution: the iso-compute baseline achieves better val loss (4.4521) than CGS (4.4665) without oracle-AUC degradation (0.5475 vs 0.5282), which attributes CGS's apparent gain to computation rather than supervision.
3. Per-layer analysis (Table 2, Figure 2): the breakdown showing CGS degrades oracle-AUC at all three Engram layers (Layer 2: 0.556→0.536, Layer 4: 0.550→0.530, Layer 6: 0.541→0.519) provides useful diagnostic information. The consistent degradation across layers strengthens the negative-result claim.
4. Reproducibility: 3 random seeds (42, 123, 456), publicly available code (https://gitlab.com/fars-a/counterfactual-gate-supervision), and detailed hyperparameters (M=4, λ=0.5, EMA decay 0.99, 500-step warmup) in Section 3.1 facilitate replication.

Weaknesses:
1. Near-random oracle-AUC for ALL conditions trivializes the calibration claim: the baseline oracle-AUC is 0.549 (barely above random 0.5), and CGS degrades it to 0.528. The gate was never meaningfully calibrated to begin with — the 'degradation' is between two nearly-random regimes. This is not acknowledged in the paper, which discusses oracle-AUC changes as if they were meaningful discriminative signals (Section 3.3, Section 3.4). A gate at 0.549 AUC is essentially not functioning as intended, making the problem framing and the intervention's failure less informative than presented.
2. No statistical significance tests: all comparisons (Table 1, Table 2) rely on point estimates ± standard deviations without reporting p-values, confidence intervals, or effect sizes. For example, is the val-loss difference between CGS (4.4665±0.0170) and baseline (4.4813±0.0112) statistically significant? The overlapping ±1 SD ranges suggest it may not be. The oracle-AUC 'degradation' from 0.5490±0.0039 to 0.5282±0.0009 has non-overlapping ±1 SD ranges, but the practical significance of moving between two near-random AUC values is unclear without a proper test.
3. Trivial core method with limited novelty: CGS amounts to running two extra forward passes (gate forced on/off), computing loss differences, and applying BCE supervision. The 'counterfactual' framing is standard ablation-style analysis repackaged as supervision. This is an obvious intervention — the question of whether per-token counterfactual signals from an undertrained model would be too noisy could have been predicted analytically before running experiments (Section 2.3, Equations 3-4).
4. Single scale, single architecture, single benchmark: all experiments use GPT-2 (12L, d=768, 333M params), Engram at layers 2/4/6, FineWeb-Edu 100M tokens, 5000 steps. No scaling behavior is explored. The authors acknowledge this in Section 3.5 but do not provide any evidence that the negative result would hold at larger scales. Given that the noise hypothesis depends on undertrained embeddings, larger-scale experiments where memory tables are more thoroughly trained could yield different outcomes — yet this obvious direction is not even partially investigated.
5. Missing diagnostic analysis of the counterfactual signal: the paper hypothesizes that ∆ℓ is 'too noisy' (Section 3.5) but provides no quantitative evidence. There is no analysis of the distribution of ∆ℓ values (mean, variance, fraction of positive/negative), no signal-to-noise ratio computation, no correlation between ∆ℓ and gate activations, and no analysis of how the supervision target y_t changes over training. Without this, the failure diagnosis remains speculative and the paper's contribution is limited to 'it didn't work' without explaining why in measurable terms.
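
The diagnostics requested in Weakness 5 could be computed along the following lines. This is a minimal sketch, not the authors' code: the function name, the argument names, and the sign convention (∆ℓ = loss with gate forced off minus loss with gate forced on, so positive means the memory helps) are assumptions made for illustration, and the SNR definition |mean| / std is one simple choice among several.

```python
import math
from statistics import fmean, pstdev


def delta_loss_diagnostics(loss_gate_off, loss_gate_on, gate):
    """Summarize per-token counterfactual loss differences.

    Assumed convention (hypothetical): delta = loss_off - loss_on,
    so delta > 0 means opening the gate lowered the loss.
    `gate` holds the model's own gate activations for the same tokens.
    """
    delta = [off - on for off, on in zip(loss_gate_off, loss_gate_on)]
    mean = fmean(delta)
    std = pstdev(delta)  # population standard deviation
    frac_positive = sum(d > 0 for d in delta) / len(delta)
    # Simple signal-to-noise ratio: |mean| relative to spread.
    snr = abs(mean) / std if std > 0 else float("inf")

    # Pearson correlation between delta and gate activation.
    g_mean = fmean(gate)
    cov = sum((d - mean) * (g - g_mean) for d, g in zip(delta, gate))
    ss_d = sum((d - mean) ** 2 for d in delta)
    ss_g = sum((g - g_mean) ** 2 for g in gate)
    denom = math.sqrt(ss_d * ss_g)
    corr = cov / denom if denom > 0 else float("nan")

    return {
        "mean": mean,
        "std": std,
        "frac_positive": frac_positive,
        "snr": snr,
        "corr_with_gate": corr,
    }
```

Logging these summaries at regular checkpoints would show whether the signal is noisy throughout training or only early on, which is the distinction the 'too noisy' hypothesis depends on.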

Must Fix Items:
1. Add statistical significance tests for all claimed comparisons (val-loss differences, oracle-AUC degradation) — at minimum paired t-tests across the 3 seeds or bootstrap confidence intervals.
2. Provide diagnostic analysis of the counterfactual signal: distribution of ∆ℓ values, signal-to-noise ratio, correlation between ∆ℓ and gate activations, and how these evolve during training. Without this, the 'too noisy' hypothesis is unsupported speculation.
3. Acknowledge that the baseline oracle-AUC of 0.549 is near-random and discuss the implications: the gate is not meaningfully calibrated even without CGS, which reframes the problem and the intervention's failure.
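
For Must Fix item 1, a paired t-test across the 3 seeds can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the inputs are per-seed metric values that the paper would need to report (only means and standard deviations are quoted in this review), and the p-value uses the closed-form CDF of Student's t with 2 degrees of freedom, so it applies only to exactly three paired runs.

```python
import math
from statistics import fmean, stdev


def paired_t(a, b):
    """Paired t statistic for matched runs (same seed in a and b).

    Returns (t, degrees_of_freedom).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return fmean(diffs) / (sd / math.sqrt(n)), n - 1


def two_sided_p_df2(t):
    """Two-sided p-value for Student's t with exactly 2 degrees of freedom.

    Uses the closed form F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    which gives p = 1 - |t| / sqrt(t^2 + 2).
    """
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)
```

With only two degrees of freedom the test has little power, so a non-significant result would not by itself show that the conditions are equivalent; bootstrap confidence intervals over tokens or additional seeds would complement it.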

Runs:
- run=1 score=5.0 verdict=Reject confidence=0.72 error=None