Title: SKETCH-GATED TRACE CLUSTERING FOR ACCELERATING INTER-TRACE REDUNDANCY PRUNING FARS Analemma
PDF: sketch-gated-trace-clustering.pdf
Score: 2.5
Verdict: Strong Reject
Confidence: 0.80
Elapsed: 46.3s

Strengths:
1. The ablation study with Random-Gated control is methodologically honest and revealing. Table 2 shows that random cluster selection achieves nearly identical performance (896±1 prompts vs 897, identical accuracy), which provides a clear negative result about LSH candidate quality. This kind of self-critical ablation is uncommon and valuable for the community.
2. The paper explicitly acknowledges that Self-Consistency outperforms both DeepPrune and Sketch-Gated (76.5% vs 70.6% overall in Table 1), which shows intellectual honesty about the accuracy-efficiency trade-off of judge-based clustering versus simple voting.
3. The banding sensitivity analysis (Section 4.4, Figure 2) provides a useful parameter sweep showing the exponential relationship between band count B and fallback rate, offering practical guidance for deployment configurations.
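
For readers unfamiliar with the banding trade-off mentioned in strength 3, the following is a minimal sketch of the textbook LSH banding approximation, not the paper's implementation. It assumes each band has a fixed number of SimHash bits (`rows_per_band`, a hypothetical parameter name) and treats bits as independent. It gives only the per-pair probability of missing a collision, which is a proxy for the fallback rate discussed in the paper rather than the quantity itself.

```python
import math


def simhash_bit_agreement(cosine: float) -> float:
    """Probability that one SimHash bit agrees for two vectors.

    Standard random-hyperplane result: 1 - theta / pi, where theta is
    the angle between the vectors.
    """
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding drift
    return 1.0 - math.acos(cosine) / math.pi


def miss_probability(p_bit: float, rows_per_band: int, bands: int) -> float:
    """Probability that a pair collides in none of the bands.

    Assumes independent bits and a fixed band width, so each extra band
    multiplies the miss probability by the same factor (1 - p_bit ** r),
    i.e. the decay is exponential in the band count.
    """
    return (1.0 - p_bit ** rows_per_band) ** bands


if __name__ == "__main__":
    p = simhash_bit_agreement(0.8)  # illustrative similarity, not from the paper
    for b in (1, 2, 4, 8, 16):
        print(b, miss_probability(p, rows_per_band=8, bands=b))
```

Under these assumptions the miss probability shrinks by a constant factor per added band, which is consistent with the exponential relationship the review credits to Figure 2.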

Weaknesses:
1. The evaluation scale is critically small: only 17 problems in total (5 AIME24 + 3 AIME25 + 9 GPQA). This is far too few data points to draw statistically reliable conclusions about the method's effectiveness. With only 17 binary outcomes, a single problem flipping changes overall accuracy by ~6 percentage points (1/17 ≈ 5.9pp). No confidence intervals or statistical significance tests are reported (HF_NO_SIGNIFICANCE concern). The per-benchmark accuracies in Table 1 (80.0%, 66.7%, 77.8%) rest on 5, 3, and 9 problems respectively, which is too few for a meaningful comparison between methods.
2. The paper's own ablation (Table 2) fatally undermines its core contribution: SimHash-LSH provides no benefit over random candidate selection. The paper is essentially proposing a trivial 'reduce K1 from 10 to 3 when candidates exist' heuristic, wrapped in unnecessary LSH machinery. The adaptive K1 mechanism alone could be described in 2 paragraphs; the entire SimHash + LSH + banding apparatus is decorative. This constitutes significant over-packaging.
3. DeepPrune itself already underperforms Self-Consistency (70.6% vs 76.5% in Table 1), meaning Sketch-Gated is optimizing a method that is already worse than the simplest baseline. The paper does not address why one should use DeepPrune/Sketch-Gated at all given this accuracy gap, nor does it analyze when/where DeepPrune's clustering actually helps versus hurts.
4. The Random-Gated ablation uses only 3 random seeds on 17 problems, yielding a standard deviation of 0.0 on accuracy and ±1 on prompts. This is too few seeds and too few problems to robustly conclude that random and sketch-based gating are equivalent; the result may simply reflect an evaluation so small that there is no room for variation.
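
To make the sample-size concern in weaknesses 1 and 4 concrete, here is a minimal sketch that computes 95% Wilson score intervals for the two overall accuracies. The counts 12/17 and 13/17 are inferred from the reported 70.6% and 76.5% and are an assumption; the paper's per-problem results are not available here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    n = 17  # total problems, as stated in the review
    # Counts below are inferred from the reported percentages (assumption).
    for name, correct in (("DeepPrune / Sketch-Gated", 12), ("Self-Consistency", 13)):
        lo, hi = wilson_interval(correct, n)
        print(f"{name}: {correct}/{n} = {correct / n:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
    print(f"one problem flipping moves accuracy by {1 / n:.3f}")
```

With n = 17 each interval is several tens of percentage points wide and the two intervals overlap substantially, so the reported gap cannot be separated from noise at this scale.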

Must Fix Items:
1. Scale the evaluation to at least 100 problems and report proper statistical significance tests (bootstrap confidence intervals, paired tests) to support the claims about accuracy preservation and prompt reduction.
2. Either (a) remove or substantially de-emphasize the SimHash-LSH component given the ablation showing it adds nothing over random gating, reframing the contribution honestly as 'adaptive K1 reduction', or (b) demonstrate scenarios at larger scale where semantic filtering genuinely outperforms random gating.
3. Address why DeepPrune/Sketch-Gated should be preferred over Self-Consistency given the roughly 6pp accuracy gap (70.6% vs 76.5%), and provide wall-clock time comparisons rather than only prompt counts.
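
The statistical analysis requested in item 1 could be sketched as follows. This is an illustration under stated assumptions, not a prescribed protocol: `correct_a` and `correct_b` are hypothetical per-problem 0/1 correctness lists for two methods evaluated on the same problems, and the function names are invented for this example.

```python
import math
import random


def paired_bootstrap_ci(correct_a, correct_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b).

    Resamples problems with replacement and keeps the pairing, so
    per-problem difficulty cancels out of the comparison.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    n = len(diffs)
    means = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def mcnemar_exact_p(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar (sign) test on discordant problem counts.

    only_a: problems solved by method A but not B; only_b: the reverse.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant problems, no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Best case for a one-problem gap: a single discordant problem.
    print(mcnemar_exact_p(1, 0))  # 1.0
```

If the 12-vs-13 gap came from a single discordant problem, the exact test gives p = 1.0, which illustrates why a much larger problem set is needed before either accuracy preservation or an accuracy gap can be claimed.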

Runs:
- run=1 score=2.5 verdict=Strong Reject confidence=0.8 error=None