Title: CAP-AND-SPILL: TWO-PASS CUDA-GRAPH MOE DISPATCH WITHOUT WORST-CASE PADDING
FARS PDF: cap-and-spill-cudagraph-moe-dispatch.pdf
Score: 4.5
Verdict: Reject
Confidence: 0.60
Elapsed: 49.2s

Strengths:
1. Clear and well-motivated problem formulation: the paper identifies a real and measurable inefficiency in CUDA-graph-captured MoE dispatch (88% padding waste from worst-case buffer allocation) and provides concrete distributional evidence (Cmax/µ = 8.4×, Q99 = 16 vs Cmax = 43) from Mixtral-8x7B routing traces (Section 3.2, Figure 2).
2. Empirically validated latency reduction with rigorous measurement: 33.9% mean latency reduction (1077 µs → 712 µs) on 8×A100 NVLink, based on 200,000 measurements (5 restarts × 200 steps × 200 iterations), and bitwise equality verification across 50 dispatch steps (Table 1, Section 4.2).
3. Insightful unconditional execution finding: the counterintuitive result that always executing both passes outperforms conditional execution (712 µs vs 949 µs, 25% improvement) because CPU-GPU synchronization overhead (~163 µs) exceeds the cost of an empty second pass is a valuable engineering insight (Section 3.4).

Weaknesses:
1. Extremely narrow experimental scope: evaluation is limited to a single model (Mixtral-8x7B), a single hardware configuration (8×A100 NVLink), and a single node. No evaluation on other MoE architectures (e.g., DeepSeek-V2, Switch Transformer), different GPU counts, multi-node Ethernet/InfiniBand setups, or different batch sizes. The paper itself acknowledges this limitation (Section 4.6) but does nothing to address it, making generalizability claims unsupported.
2. No end-to-end inference throughput measurement: the paper only measures dispatch latency in isolation. The 33.9% dispatch latency reduction does not directly translate to end-to-end inference speedup, since the dispatch phase is only one component of the full MoE forward pass. The overhead breakdown (Figure 3) shows Pack dominates at ~80%, but without showing the fraction of total inference time that dispatch occupies, the practical impact is unclear.
3. Optimal quantile (Q99) is model- and workload-specific with no principled selection method: Table 2 shows the optimum at q=0.99 for Mixtral-8x7B, but there is no theoretical or algorithmic framework for selecting the quantile for other models. An 'adaptive quantile selection based on runtime statistics' is listed as future work (Section 4.6), which means the key hyperparameter requires empirical tuning per deployment, undermining the generality of the approach.
4. Over-packaging concern: the core idea (use a quantile-based buffer size and handle overflow in a second pass) is a straightforward two-pass engineering technique. The paper wraps this in extensive formalization (Section 3.1 problem formulation, named 'Cap-and-Spill' algorithm) but the actual algorithmic novelty is thin: it is essentially 'cap at Q99, spill the rest,' which is a natural application of statistical quantile analysis to buffer sizing.

Must Fix Items:
1. Provide end-to-end inference throughput numbers (tokens/sec or latency per token) to demonstrate that the dispatch-only latency reduction translates to meaningful practical speedup.
2. Evaluate on at least one additional MoE model or routing distribution to support generalizability claims beyond Mixtral-8x7B.
3. Provide a principled method or analysis for quantile selection (not just empirical sweep) that would allow practitioners to determine the right C without per-model tuning.

Runs:
- run=1 score=4.5 verdict=Reject confidence=0.6 error=None
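
To make the quantitative points above easier to check, the following is a minimal Python sketch, not the paper's code. It recomputes the reported percentages from the figures quoted in this review, illustrates the 'cap at Q99, spill the rest' split on made-up per-expert token counts, and shows why the dispatch-only gain needs an end-to-end measurement. All function names, the sample counts, and the 10% dispatch share are hypothetical. The padding-waste formula (1 - µ/Cmax) is an assumption about how the paper defines waste, although it does reproduce the quoted 88%.

```python
from typing import List, Tuple


def relative_reduction(baseline_us: float, new_us: float) -> float:
    """Fractional latency reduction when going from baseline_us to new_us."""
    return (baseline_us - new_us) / baseline_us


def padding_waste(cmax_over_mean: float) -> float:
    """Average padded fraction of a buffer sized for the worst case.

    Assumption: waste = 1 - mean_load / Cmax, with the ratio Cmax / mean given.
    """
    return 1.0 - 1.0 / cmax_over_mean


def split_cap_and_spill(counts: List[int], cap: int) -> Tuple[List[int], List[int]]:
    """Split per-expert token counts into a capped first pass and a spill pass."""
    first_pass = [min(c, cap) for c in counts]
    spill_pass = [max(c - cap, 0) for c in counts]
    return first_pass, spill_pass


def projected_end_to_end_reduction(dispatch_fraction: float,
                                   dispatch_reduction: float) -> float:
    """Amdahl-style estimate: only the dispatch share of step time shrinks."""
    if not 0.0 <= dispatch_fraction <= 1.0:
        raise ValueError("dispatch_fraction must lie in [0, 1]")
    return dispatch_fraction * dispatch_reduction


# Figures quoted in the review.
print(f"{relative_reduction(1077, 712):.1%}")  # 33.9%
print(f"{relative_reduction(949, 712):.1%}")   # 25.0%
print(f"{padding_waste(8.4):.1%}")             # 88.1%

# Hypothetical per-expert counts; cap C = 16 matches the reported Q99.
first, spill = split_cap_and_spill([5, 3, 43, 16, 2, 7, 4, 1], cap=16)
print(first)  # [5, 3, 16, 16, 2, 7, 4, 1]
print(spill)  # [0, 0, 27, 0, 0, 0, 0, 0]

# Hypothetical: if dispatch were 10% of end-to-end step time.
print(f"{projected_end_to_end_reduction(0.10, relative_reduction(1077, 712)):.1%}")  # 3.4%
```

Under the assumed 10% dispatch share, the projected end-to-end reduction is about 3.4%, which is why Must Fix Item 1 asks for the measured share and end-to-end throughput and does not accept an extrapolation from dispatch latency alone.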