Title: TASK-AWARE EARLY TERMINATION FOR HNSW VIA LABEL-HISTOGRAM STABILIZATION FARS Analemma
PDF: vss-label-stability-early-exit.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.60
Elapsed: 52.3s

Strengths:
1. Clear and well-motivated core insight: label distributions stabilize earlier than exact neighbor IDs during HNSW traversal because labels are a coarser signal. This is logically sound and empirically confirmed in Figure 2, which shows Label Recall@100 already saturated at ef=100 (0.8479) while Synthetic Recall@100 keeps improving up to ef=1500 (0.9957): the remaining gap between ef=100 and ef=1500 is 0.04pp for Label Recall versus 0.58pp for Synthetic Recall.
2. Training-free and lightweight method: Algorithm 1 uses only O(K) histogram computation via numpy.bincount per checkpoint, with <5µs overhead reported in Section 3.3. No offline training or model is required, making it immediately deployable and avoiding the generalization risks of learned approaches like LAET.
3. Strong latency improvements with negligible recall loss: Table 1 shows 58.6% p50 latency reduction and 55.9% p99 latency reduction vs fixed ef=1500, with Label Recall@100 drop of only 0.01pp (0.8483→0.8482). The 19.5% p99 latency improvement over ID-stability (B=50) demonstrates concrete benefit from the label-based criterion.
4. Robustness to query difficulty: Table 2 stratifies by label margin and shows the method does not disproportionately harm hard (low-margin) queries. For hard queries, Label Recall@100 is identical to the fixed baseline (0.7514), and the p99 advantage is concentrated there (5.532ms vs 6.993ms for ID-stability).
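
For reference, the stopping rule described in Strength 2 can be sketched as below. This is the reviewer's reading, not the authors' Algorithm 1: it assumes the stability test is an L1 distance between consecutive label histograms, that warmup and patience are counted in checkpoints, and that the τ=4/K threshold corresponds to tolerating two label swaps among the K candidates. The function names and the max_swaps parameter are hypothetical.

```python
import numpy as np


def label_counts(labels, num_classes):
    """Label histogram of the current top-K candidates, O(K) via bincount."""
    return np.bincount(labels, minlength=num_classes)


def first_stable_checkpoint(label_snapshots, num_classes,
                            max_swaps=2, warmup=2, patience=2):
    """Return the 0-based checkpoint index at which search would stop, or None.

    label_snapshots: one integer label array (length K) per checkpoint.
    Assumption: a checkpoint is "stable" when the L1 distance between its
    count histogram and the previous one is at most 2 * max_swaps. In
    normalized form this is 2 * max_swaps / K, i.e. 4/K for max_swaps=2.
    """
    prev = None
    stable_run = 0
    for idx, labels in enumerate(label_snapshots):
        counts = label_counts(labels, num_classes)
        # Stability is only evaluated once the warmup checkpoints are done.
        if idx >= warmup and prev is not None:
            drift = int(np.abs(counts - prev).sum())
            stable_run = stable_run + 1 if drift <= 2 * max_swaps else 0
            if stable_run >= patience:
                return idx
        prev = counts
    return None
```

Under these assumptions, a query whose top-K labels never change stops at checkpoint index 3, which with B=50 corresponds to ef=200.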

Weaknesses:
1. Single-dataset, single-task evaluation: The entire experimental evaluation is limited to ImageNet-EVA02 with classification (Label Recall@100), a limitation the conclusion itself acknowledges. The paper includes no retrieval tasks (e.g., text retrieval with relevance labels), no face recognition (Face Recall as defined in Iceberg), and no variation in dataset scale (e.g., billion-scale like DiskANN). The generalizability claim is unsupported by evidence beyond one benchmark.
2. Marginal improvement over ID-stability baseline: The headline 58.6% p50 latency reduction is primarily against the fixed ef=1500 baseline, not against the fair comparison (ID-stability B=50). Against ID-stability (B=50), p50 latency is virtually identical (1.405ms vs 1.397ms), and the only meaningful advantage is 19.5% lower p99 (4.514ms vs 5.605ms). Label Recall@100 is identical (0.8482), and Synthetic Recall@100 is slightly worse (0.9941 vs 0.9944). The actual contribution over the strongest fair baseline is narrow.
3. Nearly all queries exit at minimum ef: The mean exit ef of 201.4 (Table 1) means almost all queries terminate at the earliest possible point: warmup takes 2 checkpoints × B=50 (ef=100), and patience=2 requires 2 further stable checkpoints, so the earliest exit is at ef=200. For high-margin queries, mean exit ef is 200.0 (Table 2), the absolute minimum. This raises the question of whether the sophisticated histogram monitoring is doing much beyond acting as a warmup+patience timer, especially since ID-stability also exits near the same point for easy queries.
4. Threshold derivation lacks empirical sensitivity analysis: The threshold τ=4/K is derived from first principles (2 label swaps), but no ablation study tests alternative thresholds (e.g., τ=2/K, τ=6/K, τ=8/K). Without this, it is unclear whether the method is robust to threshold choice or whether the specific value is critical. Similarly, checkpoint interval B=50 and patience=2 are set without justification or ablation.
5. No comparison against learned or adaptive ef methods: Despite discussing LAET, Ada-ef, and DARTH in Related Work, no experimental comparison is provided. While these optimize for Recall@K rather than Label Recall, a comparison would clarify whether the training-free advantage outweighs potential gains from learned approaches, especially since the task-aware metric changes the optimization landscape.
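
The arithmetic behind Weaknesses 2 to 4 can be checked with a few lines. This is a reviewer-side sketch using only numbers quoted above; the helper names are hypothetical. It assumes warmup and patience are both counted in checkpoints of size B, that each label swap moves 1/K of mass out of one bin and into another, and that in each latency pair quoted in Weakness 2 the first value is the label-based method and the second is ID-stability.

```python
def min_exit_ef(B, warmup, patience):
    """Earliest ef at which the search can stop (assumed checkpoint semantics)."""
    return (warmup + patience) * B


def swap_l1_distance(swaps, K):
    """L1 distance between normalized histograms after `swaps` label swaps."""
    return 2 * swaps / K


def relative_reduction(baseline, candidate):
    """Fractional latency reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


# Weakness 3: minimum exit point with B=50, warmup=2, patience=2.
print(min_exit_ef(B=50, warmup=2, patience=2))

# Weakness 4: two swaps at K=100 give the 4/K threshold.
print(swap_l1_distance(swaps=2, K=100))

# Weakness 2: p99 and p50 of the label method vs ID-stability (B=50).
print(round(100 * relative_reduction(5.605, 4.514), 1))
print(round(100 * relative_reduction(1.397, 1.405), 1))
```

With the reported numbers, the minimum exit ef is 200, the p99 reduction rounds to 19.5%, and the p50 "reduction" is slightly negative, meaning the label-based criterion is marginally slower at the median.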

Must Fix Items:
1. Add at least one additional dataset/task beyond ImageNet classification (e.g., text retrieval, face recognition) to substantiate generalizability claims.
2. Provide ablation study for threshold τ, checkpoint interval B, and patience parameter to demonstrate robustness or sensitivity of the proposed method.
3. Clarify the practical significance of the contribution over ID-stability (B=50) since p50 latency is virtually identical and the main advantage is in p99 tail latency.

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.6 error=None