Title: QUERY-CONDITIONED MARGINALS FOR OT-BASED CONTEXT COMPRESSION: AN EMPIRICAL INVESTIGATION FARS
PDF: 57051a89-70e0-4a0d-a8a0-1778b14b4d40.pdf
Score: 3.0
Verdict: Reject
Confidence: 0.85
Elapsed: 229.0s

Strengths:
1. Honest reporting of negative result: The paper transparently reports that QCap-OT produces results statistically indistinguishable from the baseline and that their ComprExIT re-implementation fails catastrophically (2.47% vs 68.08% F1). This honesty is rare and valuable for the community (Abstract, Section 4.2, Section 5).
2. Rigorous statistical testing: The authors use paired bootstrap testing with 10,000 resamples and report confidence intervals for the mean delta, rather than relying on point estimates alone (Section 4.2). This exceeds the statistical rigor of many compression papers.
3. Reproducibility contribution: The authors document three specific architectural discrepancies found during re-implementation (per-layer projections, column normalization, L2 normalization in alignment MLP) and release their code and checkpoints, providing concrete value for future researchers (Section 4.4, footnote 1).
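
For readers unfamiliar with the test praised in Strength 2, the following is a minimal sketch of a paired bootstrap over per-example score differences. It is not the authors' code: the function name `paired_bootstrap`, the percentile-interval construction, and the toy scores are illustrative assumptions; only the 10,000-resample count comes from the paper as described above.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    Assumption: scores_a[i] and scores_b[i] are per-example metrics (e.g. F1)
    of two systems on the same example i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equal-length score lists")
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    rng = random.Random(seed)

    # Resample example indices with replacement; keeping each (a, b) pair
    # together is what makes the test "paired".
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)
        means.append(sum(sample) / n)
    means.sort()

    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    # Share of resamples in which system a is not better than system b.
    frac_not_better = sum(m <= 0 for m in means) / n_resamples
    return {
        "mean_delta": sum(deltas) / n,
        "ci": (lo, hi),
        "frac_not_better": frac_not_better,
    }


# Hypothetical per-example F1 scores for two systems on five examples.
result = paired_bootstrap([0.5, 0.0, 1.0, 0.3, 0.0], [0.4, 0.0, 1.0, 0.2, 0.1])
print(result)
```

A confidence interval that straddles zero, as the paper reports for QCap-OT against its baseline, means the resampled mean delta changes sign and the two systems cannot be separated on that benchmark.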

Weaknesses:
1. Fatal confound: The entire experimental evaluation is conducted in a floor-performance regime (2.47% F1 on SQuAD, near-zero EM). When compressed representations carry almost no extractable QA information, no method — however well-designed — can demonstrate a meaningful effect. The paper's own conclusion acknowledges this (Section 4.4: 'The query-conditioned marginal reweighting mechanism cannot be meaningfully validated when the base compression produces representations that carry minimal extractable information'). This makes the core experimental contribution vacuous.
2. Trivial core contribution once the packaging is stripped away: QCap-OT reduces to a single equation (Eq. 6): $\tilde{\rho}_t \propto \rho_t \cdot \exp(\beta \cdot s_t)$, i.e., the learned marginal is multiplied by an exponential of the query similarity and then re-normalized. This is a standard importance-reweighting trick, presented with no theoretical justification for why exponential scaling (vs. linear, squared, etc.) is appropriate, no analysis of how β affects the transport plan, and no sensitivity study (β is fixed at 1.0 throughout; Section 3.3). The 'key insight' that modifying marginals changes the transport plan (Section 3.2) is a trivial consequence of constrained optimization, not a research contribution.
3. Unfair baseline comparison: QueryTopK uses different checkpoints (sft v6) than QCap-OT (sft v11), as the authors themselves note (Section 4.3). Despite this, they draw the conclusion that 'OT-based soft aggregation provides meaningful information preservation that hard selection cannot replicate' — a claim that cannot be supported when the compared methods use different trained checkpoints. The 6× F1 gap could be entirely attributable to checkpoint differences rather than OT coordination.
4. Extremely narrow experimental scope: Only 1 model (Llama-3.2-1B), 1 compression ratio (4×), 2 benchmarks (SQuAD, HotpotQA), 1 β value (1.0), 3 seeds. No hyperparameter sensitivity analysis for β, no exploration of different query embedding strategies beyond mean-pooling, no multi-query scenario evaluation (which is ComprExIT's stated advantage — Section 3.2 acknowledges caching is still possible but never tests it), and no analysis of the marginal distributions before/after reweighting to verify the mechanism is actually doing anything.
5. The paper's framing is self-contradictory: The title proposes 'Query-Conditioned Marginals' as a method, but the paper's own evidence shows the method has zero effect. The conclusion frames this as a 'negative result' about QCap-OT, but the more accurate framing is that the experiment was never capable of testing the hypothesis. A negative result requires a working testbed; a failed testbed yields an inconclusive result, not a negative one.
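
To make Weakness 2 concrete, the sketch below applies the Eq. 6 reweighting to a toy marginal and feeds both the original and the reweighted marginal into a plain Sinkhorn solver. Everything apart from the form of Eq. 6 is an assumption made for illustration: the solver, the choice of which side of the plan carries the token marginal, the uniform slot marginal, and all numbers are hypothetical and are not taken from the paper or from ComprExIT.

```python
import math


def reweight_marginal(rho, sims, beta=1.0):
    """Eq. 6 as quoted in the review: rho_t * exp(beta * s_t), renormalized."""
    shift = max(beta * s for s in sims)  # subtract the max exponent for stability
    weights = [r * math.exp(beta * s - shift) for r, s in zip(rho, sims)]
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.5, n_iters=200):
    """Entropic OT via plain Sinkhorn scaling (an assumed stand-in solver)."""
    n, m = len(row_marginal), len(col_marginal)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy problem: 4 source tokens transported to 2 compressed slots.
rho = [0.25, 0.25, 0.25, 0.25]    # learned token marginal (assumed uniform)
sims = [0.9, 0.1, 0.1, 0.1]       # query similarity s_t per token (made up)
cost = [[0.0, 1.0], [0.2, 0.8], [0.8, 0.2], [1.0, 0.0]]
slots = [0.5, 0.5]                # slot marginal (assumed uniform)

rho_tilde = reweight_marginal(rho, sims, beta=1.0)
plan_base = sinkhorn_plan(cost, rho, slots)
plan_query = sinkhorn_plan(cost, rho_tilde, slots)

print("reweighted marginal:", [round(x, 4) for x in rho_tilde])
print("token 0 mass, base plan: ", round(sum(plan_base[0]), 4))
print("token 0 mass, query plan:", round(sum(plan_query[0]), 4))
```

Because the row sums of the plan are constrained to equal the row marginal, any change to that marginal necessarily changes the plan. This is the sense in which the 'key insight' follows directly from the constraints rather than from anything specific to QCap-OT.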

Must Fix Items:
1. Resolve the reproducibility gap with ComprExIT before evaluating QCap-OT. Without a working base system, no claim about QCap-OT's effectiveness (or lack thereof) is empirically valid. Contact the original authors for code or clarification on the three identified architectural discrepancies.
2. Conduct β sensitivity analysis: sweep β across multiple values (e.g., 0.1, 0.5, 1.0, 2.0, 5.0, 10.0) and report how the marginal distributions change. This is the one experiment that could be informative even in a low-performance regime — if β has no effect on the transport plan structure, that is a meaningful finding; if it does, it suggests the reweighting is mechanically sound but the downstream decoder cannot exploit it.
3. Fix the QueryTopK comparison to use the same checkpoint (sft v11) as QCap-OT, or remove the comparative claim about OT coordination superiority.
4. Analyze and visualize the actual marginal distributions ($\rho$ vs. $\tilde{\rho}$) to verify the reweighting mechanism is operating as intended. Report KL divergence or total variation distance between original and reweighted marginals as a function of β.
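
The diagnostic requested in items 2 and 4 is cheap to run and does not depend on downstream QA accuracy. A minimal sketch follows, assuming the Eq. 6 form of the reweighting; the helper names, the uniform starting marginal, and the similarity scores are hypothetical placeholders for values the authors would extract from their trained model.

```python
import math


def reweight_marginal(rho, sims, beta):
    """Eq. 6 as quoted in the review: rho_t * exp(beta * s_t), renormalized."""
    shift = max(beta * s for s in sims)  # subtract the max exponent for stability
    weights = [r * math.exp(beta * s - shift) for r, s in zip(rho, sims)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def beta_sweep(rho, sims, betas):
    """Return (beta, KL(rho_tilde || rho), TV(rho_tilde, rho)) for each beta."""
    rows = []
    for beta in betas:
        rho_tilde = reweight_marginal(rho, sims, beta)
        rows.append((beta,
                     kl_divergence(rho_tilde, rho),
                     total_variation(rho_tilde, rho)))
    return rows


# Hypothetical inputs: uniform learned marginal, made-up similarity scores.
rho = [0.25, 0.25, 0.25, 0.25]
sims = [0.9, 0.4, 0.1, -0.2]

# The beta grid is the one suggested in Must Fix item 2.
for beta, kl, tv in beta_sweep(rho, sims, (0.1, 0.5, 1.0, 2.0, 5.0, 10.0)):
    print(f"beta={beta:5.1f}  KL={kl:.4f}  TV={tv:.4f}")
```

If both distances stay near zero at β = 1.0 on the real similarity scores, the reweighted marginal is effectively the original one, which would by itself explain a null result without any reference to the floor-performance regime.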

Runs:
- run=1 score=3.0 verdict=Reject confidence=0.85 error=None