Title: POST-HOC TOP-p EXPERT ROUTING FOR DYNAMIC COMPUTE ALLOCATION IN MIXTURE-OF-EXPERTS LANGUAGE MODELS
PDF: 5d45f4f6-682a-49ca-99b3-dcb613f1d9a7.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.75
Elapsed: 98.2s

Strengths:
1. The paper identifies a genuine and practical problem: fixed top-k routing in MoE models is wasteful for easy tokens and insufficient for hard ones, and proposes a training-free solution (top-p routing) that requires no retraining or architectural changes—just modifying the expert selection rule at inference time. This is a clean and potentially useful contribution for practitioners who want to adapt compute allocation without modifying pretrained models (Sections 3.2–3.3).
2. The domain-adaptive behavior is a genuine empirical finding: when calibrated for avg k=4 on WikiText-2, top-p routing automatically increases to k=6.04 on GSM8K (+54%), achieving 87.87% accuracy vs 81.88% for static top-4. Recovering about 76% of the gap to full top-8 (89.77%; 5.99 of 7.89 points, which the paper reports as 80%) at roughly 75% of the compute is a non-trivial and practically relevant result (Table 1, Section 4.2).
3. The sensitivity analysis (Figure 2) is honest and informative: it shows that top-p dominates static top-k at extreme sparsity (avg k≈2, 12.28 vs 30.14 perplexity) but crosses over around k=3–4. This crossover behavior is important for understanding when the method is actually beneficial, and the authors deserve credit for not hiding this limitation (Section 4.4).
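
For reference, the selection rule discussed in the strengths above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the router produces one logit per expert that is passed through a softmax, and the function names `softmax` and `top_p_experts`, the optional `max_k` cap, and the two toy logit vectors are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_experts(router_logits, p, max_k=None):
    """Return the smallest set of expert indices whose cumulative
    router probability reaches p (optionally capped at max_k experts)."""
    probs = softmax(router_logits)
    # Visit experts from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
        if max_k is not None and len(chosen) >= max_k:
            break
    return chosen


# Toy router outputs (assumed values, 8 experts for brevity).
confident = [6.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]  # peaked distribution
uncertain = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]  # near-uniform distribution

n_confident = len(top_p_experts(confident, p=0.9))
n_uncertain = len(top_p_experts(uncertain, p=0.9))
print(n_confident, n_uncertain)
```

The peaked distribution needs far fewer experts than the near-uniform one at the same threshold, which is the behavior the paper relies on. Whether the selected experts' weights are renormalized afterwards is not specified in this review and is left out of the sketch.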

Weaknesses:
1. The core idea is trivially derived from nucleus sampling (Holtzman et al., 2020), applied to expert routing instead of token generation. The paper itself acknowledges this analogy (Section 3.2, 'analogous to nucleus sampling in text generation'). Huang et al. (2024) already train MoE models with confidence-threshold (top-p) routing; the only difference here is the 'post-hoc' application. The intellectual novelty is minimal: the paper repurposes an existing technique from decoding to routing without any new algorithmic insight (Sections 2, 3.2).
2. Single model, single architecture evaluation: all results are on Qwen3-30B-A3B only. No evaluation on other MoE architectures (Mixtral, DeepSeek, Switch Transformer) which have different expert counts, training objectives, and router designs. The paper's own limitations section acknowledges this (Section 4.6), but it severely undermines the generality claims. Whether top-p routing works on models with fewer experts (e.g., Mixtral's 8 experts, top-2) or different load-balancing strategies is completely unknown.
3. No statistical significance tests: all reported numbers appear to be single-run results. The GSM8K improvement from 81.88% to 87.87% (6 percentage points on 1319 test examples) could have meaningful variance across runs. No confidence intervals, no standard deviations, no multiple seeds. Similarly, the perplexity numbers on WikiText-2 are single values. For a paper that claims 'emergent domain-adaptive behavior,' statistical validation is essential (Table 1).
4. The perplexity penalty at matched compute (+0.25 vs static top-4) reveals that top-p routing is not Pareto-optimal for language modeling. More critically, the paper frames the GSM8K improvement as 'domain-adaptive,' but the mechanism is simply that router entropy is higher on GSM8K tokens, causing more experts to be selected. Whether this is genuinely 'adaptive' or merely a side effect of distribution shift in softmax outputs is not disentangled. The router was never trained to allocate experts dynamically; its confidence signal is a byproduct of fixed-top-k training, and the paper's own analysis shows it is 'weak' (86% of maximum entropy, Section 4.3). The 'emergent' framing overstates what is essentially a mechanical effect: higher-entropy router distributions cause more experts to be selected under top-p, which is inevitable rather than emergent.
5. The paper claims 'recovering 80% of the performance gap to full top-8' (Section 4.2), but this framing is misleading. The gap from top-4 to top-8 on GSM8K is 7.89 points (89.77% - 81.88%). Top-p closes 5.99 of those points, which is about 76% of the gap rather than 80%, and it does so while using 6.04 experts on GSM8K. That is 75.5% of top-8's compute, not a small overhead. A fairer comparison would be static top-6, which is not reported. Without this baseline, we cannot tell whether top-p's dynamic allocation actually outperforms simply using a fixed larger k, which would be the most natural alternative (Table 1, Section 4.2).
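
The arithmetic behind the gap-recovery and compute figures in this review can be checked directly from the numbers quoted from Table 1. The sketch below uses only those reported values; the variable names are illustrative.

```python
# GSM8K accuracies (percent) and average expert counts as quoted in this review.
top4_acc, topp_acc, top8_acc = 81.88, 87.87, 89.77
topp_avg_k, full_k, calibrated_k = 6.04, 8, 4

gap = top8_acc - top4_acc            # accuracy gap between static top-4 and top-8
recovered = topp_acc - top4_acc      # points closed by top-p routing
gap_fraction = recovered / gap       # share of the gap that top-p recovers

compute_vs_top8 = topp_avg_k / full_k               # active experts relative to top-8
increase_vs_nominal = topp_avg_k / calibrated_k - 1  # relative to the nominal k=4 target

print(f"gap: {gap:.2f} points, recovered: {recovered:.2f} points")
print(f"fraction of gap recovered: {gap_fraction:.1%}")
print(f"compute vs top-8: {compute_vs_top8:.1%}")
print(f"increase over nominal k=4: {increase_vs_nominal:.1%}")
```

Measured against the nominal target of k=4, the increase in active experts comes out near 51%. The +54% quoted from the paper presumably uses the measured WikiText-2 average, which is not stated in this review.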

Must Fix Items:
1. Add static top-6 and static top-7 baselines on GSM8K to enable fair comparison: does top-p at avg k=6.04 on GSM8K outperform static top-6? Without this, the '80% gap recovery' claim is uninterpretable because a fixed top-6 baseline might achieve similar or better accuracy at the same compute cost.
2. Report statistical significance: run multiple seeds or bootstrap confidence intervals for all main results (Table 1), especially the GSM8K accuracy numbers. A 6-point gap on ~1300 examples needs variance quantification.
3. Evaluate on at least one additional MoE architecture (e.g., Mixtral-8x7B with 8 experts and top-2 routing) to assess whether the findings generalize beyond Qwen3-30B-A3B's specific 128-expert / top-8 configuration.
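
As a rough indication of the variance quantification requested in item 2, the sketch below computes normal-approximation (Wald) 95% intervals from the reported GSM8K accuracies and the 1319-example test set. It is an assumption-laden stand-in: it captures only test-set sampling variability, treats the two systems as independent, and says nothing about seed-to-seed variance. A paired bootstrap over per-example outcomes, which requires data not available in this review, would be the more appropriate analysis. The helper name `wald_interval` is hypothetical.

```python
import math


def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    accuracy is a fraction in [0, 1]; n is the number of test examples.
    """
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


n_examples = 1319  # GSM8K test set size, as cited in this review
top4 = 0.8188      # static top-4 accuracy
topp = 0.8787      # top-p accuracy

lo4, hi4 = wald_interval(top4, n_examples)
lop, hip = wald_interval(topp, n_examples)

# Standard error of the difference, treating the two runs as independent
# (a conservative simplification; the same test items are used for both).
se_diff = math.sqrt(top4 * (1 - top4) / n_examples + topp * (1 - topp) / n_examples)
z_score = (topp - top4) / se_diff

print(f"static top-4: [{lo4:.4f}, {hi4:.4f}]")
print(f"top-p:        [{lop:.4f}, {hip:.4f}]")
print(f"difference z-score (independent approximation): {z_score:.2f}")
```

Under these assumptions each accuracy carries an uncertainty of a little under two percentage points, so the intervals are worth reporting alongside Table 1 even if the headline gap survives.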

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.75 error=None