Title: ORTHOGONAL JUNK: GRADIENT-ORTHOGONALITY DATA SELECTION FOR CONTINUAL PRE-TRAINING ON LOW-QUALITY DATA
PDF: orthogonal-junk-pretraining.pdf
Score: 3.2
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 59.3s

Strengths:
1. Honest reporting of negative results: The paper transparently reports that its proposed method (Orthogonal Junk) lowers ARC-Challenge accuracy by 5.12pp relative to random selection and trails a simple perplexity baseline on RULER by 8.05pp. This level of honesty is uncommon and valuable for the community (Section 3.2, Table 1).
2. Identification of a critical confound: The data repetition analysis (Table 3, Figure 2) convincingly shows that, for ARC, repetition rate matters more than selection quality: moving the repetition rate from 9.57× to 2.31× coincides with ARC accuracy moving from 15.36% to 23.21%, while RULER remains flat. This is a genuine methodological insight for any subset selection approach that faces token-budget mismatches (Section 3.4).
3. Clear problem formulation and theoretical grounding: The constrained optimization formulation (Eq. 2) and first-order gradient interference analysis (Eq. 3) provide a principled motivation for the orthogonality hypothesis, properly connecting to GEM/PCGrad literature. The three-stage pipeline is well-specified with concrete design choices (Section 2.1–2.3).
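
For readers without the paper at hand, the first-order argument behind the orthogonality hypothesis (Strength 3) can be sketched as below. This is the generic Taylor-expansion form, not a reproduction of the paper's Eq. 3. The symbols are assumptions introduced here: $\theta$ for the parameters, $\eta$ for the learning rate, $L_a$ for the anchor loss, $L_s$ for the loss on a candidate sample, and $g_a$, $g_s$ for their gradients.

```latex
% One plain gradient step on a candidate sample s:
\theta' = \theta - \eta\, g_s, \qquad g_s = \nabla_\theta L_s(\theta)

% First-order change in the anchor (capability) loss:
L_a(\theta') = L_a(\theta) - \eta\, \langle g_a, g_s \rangle + O(\eta^2),
\qquad g_a = \nabla_\theta L_a(\theta)

% Orthogonality removes the first-order term:
\langle g_a, g_s \rangle = 0 \;\Longrightarrow\; L_a(\theta') - L_a(\theta) = O(\eta^2)
```

The guarantee is first-order only, holds for a single plain gradient step, and applies only to the parameters over which $g_a$ and $g_s$ are actually computed.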

Weaknesses:
1. The core hypothesis is empirically falsified by the paper's own results, yet presented as a contribution: The gradient orthogonality hypothesis (Section 2.2) predicts that orthogonal samples should cause 'minimal interference with existing model capabilities,' but Orthogonal Junk lowers ARC-Challenge accuracy by 5.12pp relative to random selection and is worse than the unfiltered baseline. The method fails on its primary objective, and its modest RULER gain (+2.89pp) is dwarfed by the perplexity baseline's gain (+10.94pp). Presenting 'Orthogonal Junk' as a contribution when it underperforms simple baselines overstates what the paper delivers (Table 1, Section 3.2).
2. Gradients computed on only 21.25% of parameters undermine the test of the orthogonality hypothesis: The authors compute gradients only on the LM head and embedding layers (Section 2.3, Stage 1), yet acknowledge in the conclusion that 'gradient orthogonality computed at the LM head may not capture the full dynamics of capability preservation across all model layers.' This fundamental limitation is not analyzed. No ablation compares full-parameter gradients with partial gradients, so it is impossible to determine whether the negative results stem from the hypothesis itself or from the computational approximation (Section 5, Section 2.3).
3. Extremely narrow experimental scope with only one model and two benchmarks: All experiments use a single 1B-parameter model (Llama-3.2-1B-Instruct) and only two evaluation benchmarks (ARC-Challenge and RULER). The original brain-rot study used Llama-3-8B, and there is no evidence the findings generalize. No other standard benchmarks are reported. Notably, the anchor gradient is built from GSM8K and MMLU, yet performance on these two tasks is never measured (Section 3.1, Table 1).
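
To make the object of Weakness 2 concrete, here is a minimal sketch of orthogonality-based selection over flattened gradient vectors. It is an illustration under stated assumptions, not the paper's implementation: the scoring rule (smallest absolute cosine with the anchor gradient) and the names `cosine` and `select_most_orthogonal` are hypothetical, and the toy vectors stand in for whichever parameter subset the gradients are taken over.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot(u, v) / (nu * nv)


def select_most_orthogonal(sample_grads, anchor_grad, k):
    """Indices of the k samples with the smallest |cos| to the anchor.

    Assumed scoring rule: |cos| near 0 means nearly orthogonal.
    """
    scored = [(abs(cosine(g, anchor_grad)), i) for i, g in enumerate(sample_grads)]
    scored.sort()
    return [i for _, i in scored[:k]]


# Toy data (hypothetical): a 3-dimensional "gradient" per sample.
anchor = [1.0, 0.0, 0.0]
grads = [
    [1.0, 0.0, 0.0],   # aligned with the anchor
    [0.0, 1.0, 0.0],   # exactly orthogonal
    [-1.0, 0.1, 0.0],  # conflicting with the anchor
    [0.1, 0.0, 1.0],   # nearly orthogonal
]
selected = select_most_orthogonal(grads, anchor, k=2)
print(selected)  # [1, 3]
```

Because the score depends only on the coordinates included in the vectors, dropping most parameters from the gradient can change which samples look orthogonal, which is why the missing ablation matters.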

Must Fix Items:
1. Evaluate on MMLU and GSM8K: the anchor gradient is computed from these datasets (Section 2.3), yet performance on them is never reported, so it remains unknown whether orthogonality preserves the very capabilities used to define the anchor.
2. Ablate gradient scope: test whether full-parameter gradient orthogonality changes the results, since the paper admits the LM-head-only approximation may be insufficient (Section 5). Without this, the negative result cannot be attributed to the hypothesis rather than to the approximation.
3. Test on at least one additional model scale (e.g., 3B or 8B) to assess generalizability, given the original brain-rot study used 8B.
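
One inexpensive diagnostic for Must Fix item 2, before any retraining, is to measure how much the selected subset changes when the gradient scope is widened. The sketch below is a suggestion under assumptions, not a procedure from the paper: the helper names, the absolute-cosine scoring rule, and the toy vectors are all hypothetical. It computes the Jaccard overlap between the top-k selection under a partial (head-only) gradient and under the concatenated full gradient.

```python
import math


def abs_cosine(u, v):
    """|cosine similarity|; 0.0 if either vector is all zeros."""
    d = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else abs(d) / (nu * nv)


def top_k_orthogonal(grads, anchor, k):
    """Set of indices of the k samples most orthogonal to the anchor."""
    order = sorted(range(len(grads)), key=lambda i: abs_cosine(grads[i], anchor))
    return set(order[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def scope_agreement(head_grads, body_grads, head_anchor, body_anchor, k):
    """Overlap between head-only and full-parameter selections."""
    partial = top_k_orthogonal(head_grads, head_anchor, k)
    # Full gradient = head coordinates followed by the remaining coordinates.
    full_grads = [h + b for h, b in zip(head_grads, body_grads)]
    full = top_k_orthogonal(full_grads, head_anchor + body_anchor, k)
    return jaccard(partial, full)


# Toy data (hypothetical): 2 "head" coordinates and 1 "body" coordinate.
head_grads = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
body_grads = [[1.0], [0.0], [0.0]]
head_anchor, body_anchor = [1.0, 0.0], [1.0]

agree_k1 = scope_agreement(head_grads, body_grads, head_anchor, body_anchor, k=1)
agree_k2 = scope_agreement(head_grads, body_grads, head_anchor, body_anchor, k=2)
print(agree_k1, agree_k2)  # 0.0 1.0
```

A low overlap on real gradients would indicate that the head-only approximation selects materially different data, which would strengthen the case for the full-parameter ablation. A high overlap would point back at the hypothesis itself.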

Runs:
- run=1 score=3.2 verdict=Strong Reject confidence=0.6 error=None