Title: SUBJECT-IDENTITY REMOVAL DOES NOT IMPROVE FROZEN EEG FOUNDATION MODEL TRANSFER: NEGATIVE RESULT FARS
PDF: inlp-subject-nullspace-eeg-linear-probe.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 50.7s

Strengths:
1. Honest and transparent reporting of a negative result. The paper explicitly states its hypothesis was refuted, provides a pre-defined success criterion (+2 pp) that was not met, and avoids spin. This is valuable for the community as it prevents others from pursuing the same dead-end approach (Sections 1, 4.2).
2. Rigorous experimental design with appropriate controls. The inclusion of PCA-k controls (Table 1, Section 4.3) is a notable strength, allowing the authors to distinguish between the failure of subject-identity removal specifically versus dimension reduction generally. The nested cross-validation for hyperparameter selection (Section 3.5) prevents information leakage from the held-out subject.
3. Insightful analysis of why INLP self-nullifies. Figure 2 and Section 4.4 provide a compelling mechanistic explanation: inner CV selects minimal intervention (1-3 iterations in 77.8% of folds, C=0.01 in 92.6%), suggesting subject identity and task signal are entangled in overlapping subspaces. This goes beyond simply reporting 'it didn't work' to explaining why, which has scientific value.
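
For readers unfamiliar with the procedure discussed in Strength 3, the following is a minimal sketch of iterative nullspace projection, not the paper's implementation. It assumes a least-squares linear probe in place of the regularized classifier the paper tunes (the C hyperparameter has no counterpart here), and the function name `inlp_projection` and its arguments are hypothetical. Each iteration fits a linear probe for subject identity on the already-projected embeddings and removes the directions that probe uses, so `n_iters` plays the role of the iteration count that inner CV kept small.

```python
import numpy as np

def inlp_projection(X, subject_ids, n_iters=3):
    """Sketch of INLP: return (P, directions), where P is a (d, d) orthogonal
    projection removing directions that linearly predict subject identity.

    Assumption: a least-squares probe stands in for the paper's classifier.
    """
    X = np.asarray(X, dtype=float)
    ids = np.asarray(subject_ids)
    subjects = np.unique(ids)
    # Centered one-hot subject targets; centering makes an intercept unnecessary.
    Y = (ids[:, None] == subjects[None, :]).astype(float)
    Y -= Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    d = X.shape[1]
    P = np.eye(d)
    directions = []
    for _ in range(n_iters):
        # Fit the probe on embeddings with previously found directions removed.
        W, *_ = np.linalg.lstsq(Xc @ P, Y, rcond=None)  # shape (d, n_subjects)
        U, s, _ = np.linalg.svd(W, full_matrices=False)
        basis = U[:, s > 1e-10 * s.max()] if s.size and s.max() > 0 else U[:, :0]
        if basis.shape[1] == 0:
            break  # nothing linearly decodable is left
        directions.append(basis)
        # Re-orthonormalize all removed directions so P stays an exact projection.
        Q, _ = np.linalg.qr(np.hstack(directions))
        P = np.eye(d) - Q @ Q.T
    return P, directions
```

Applying `X @ P` yields the debiased embeddings. With more iterations, more directions are removed, which is where task-relevant signal can be lost if the two subspaces overlap.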

Weaknesses:
1. Extremely narrow experimental scope: only one dataset (BNCI2014001, 9 subjects), one encoder (CBraMod), one task (4-class motor imagery), and one debiasing method (INLP). The conclusion that 'subject identity is entangled with task-discriminative signal in frozen EEG-FM embeddings' (Section 5) is a strong generalization from a single data point. Other EEG-FMs (BENDR, LaBraM, BIOT), other tasks (ERP, SSVEP), and other datasets could yield different entanglement patterns (Sections 4.1, 5).
2. The EA + Flatten baseline (56.27%) is itself a surprisingly strong result and arguably the most novel finding, yet it is treated as a side result rather than rigorously characterized. The paper reports that it exceeds fine-tuning (53.03%) by +3.24 pp, but the fine-tuning number comes from a different study (Liu et al., 2026) with potentially different preprocessing, hyperparameters, or evaluation protocol. No direct apples-to-apples comparison under identical conditions is provided, making this claim unreliable (Table 1, Section 4.2).
3. No statistical significance testing is reported anywhere. The +0.02 pp improvement of INLP-CV over EA baseline is described as 'within standard deviation,' but no formal test (paired t-test, Wilcoxon signed-rank, McNemar, etc.) is conducted. Given only 9 subjects × 3 seeds = 27 data points with high variance (std ~7-8%), it is impossible to assess whether even the EA baseline's superiority over fine-tuning is statistically meaningful (Table 1, Table 2).
4. The paper is auto-generated by an automated research system (abstract footnote), and the contribution feels correspondingly mechanical: apply one existing method (INLP) to one existing encoder (CBraMod) on one existing dataset, and observe that it does not help. The per-subject analysis (Table 2) shows mixed effects with no clear pattern, but no deeper investigation (e.g., analyzing subject identity decodability per subject, embedding space geometry, or alternative projection strategies) is attempted.

Must Fix Items:
1. Add formal statistical significance tests for all comparisons in Table 1 and Table 2 (e.g., paired t-test or Wilcoxon signed-rank across subjects/seeds). Without this, no conclusion about whether INLP provides zero benefit is statistically justified.
2. Run fine-tuning under the exact same experimental conditions (same EA preprocessing, same LOSO protocol, same random seeds) rather than citing a reported number from a different study, to make the EA baseline comparison fair.
3. Extend experiments to at least one additional dataset/encoder combination to assess generalizability of the negative result, or substantially weaken the generalizing language in the conclusion.
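
As one way to address Must Fix item 1, the sketch below runs an exact two-sided sign-flip permutation test on paired per-subject accuracies. It is an illustration under stated assumptions, not a prescribed analysis: the function name `paired_sign_flip_test` is hypothetical, and it assumes seeds are averaged within each subject first, since the 3 seeds of one subject are not independent samples. With 9 subjects there are only 2^9 = 512 sign patterns, so the test can be enumerated exactly using the standard library alone. A Wilcoxon signed-rank or paired t-test would be a reasonable alternative.

```python
from itertools import product
from statistics import mean

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Returns (mean of a - b, p-value). Each pair should be one subject,
    with seeds already averaged (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    # Under the null, each paired difference is equally likely to be +d or -d,
    # so enumerate every sign assignment and count those at least as extreme.
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if stat >= observed - 1e-12:  # small tolerance for float ties
            extreme += 1
    return mean(diffs), extreme / total

# Hypothetical placeholder accuracies for 9 subjects (NOT values from the paper).
inlp_cv = [0.60, 0.40, 0.70, 0.50, 0.45, 0.48, 0.62, 0.66, 0.65]
ea_base = [0.59, 0.41, 0.70, 0.51, 0.44, 0.48, 0.63, 0.65, 0.65]
mean_diff, p_value = paired_sign_flip_test(inlp_cv, ea_base)
```

Note that the smallest p-value this test can return with 9 subjects is 2/512, so it has limited resolution. That limit is itself a reason to report effect sizes and confidence intervals alongside any p-value.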

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None