{
  "pdf": "d789fe03-13a7-441d-b026-edbb6fc4b8c1.pdf",
  "title": "PAIRED MEDIAN-OF-MEANS REWARDS FOR ROBUST CONFIGURATION SELECTION IN VECTOR SEARCH BENCHMARKING",
  "elapsed": 56.9,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 5.3,
  "scores": [
    5.3
  ],
  "score_std": 0.0,
  "final_verdict": "Reject",
  "final_confidence": 0.75,
  "conference_scores": null,
  "strengths": [
    "Well-defined problem with strong empirical motivation: the paper identifies a real and measurable issue—ANNS benchmarking noise with CV 4.6–4.9% exceeding the 2% reliability threshold (Section 4.2)—and demonstrates its practical consequence: 0% top-1 accuracy for all standard estimators on GIST-960 (Table 1). This is a compelling failure case that justifies the work.",
    "Clean ablation design that honestly reveals component contributions: Table 2 decomposes Paired-MoM into pairing vs. MoM vs. baseline, and the paper transparently reports that MoM actually hurts at the chosen budget (τ drops from 0.925→0.910 on SIFT-128, 0.840→0.817 on GIST-960). This honesty about a negative result for a named component is a strength.",
    "Budget sensitivity analysis (Section 4.5, Figure 2) shows scaling behavior across budgets 6–100, demonstrating that the advantage of paired execution compounds with more samples rather than vanishing. The claim that τ ≥ 0.90 is reached at budget 30 on SIFT-128 is concrete and verifiable."
  ],
  "weaknesses": [
    "The paper's own ablation (Table 2, Section 4.4) shows that MoM, the titular component, hurts performance at the evaluated budget, and that the log transform provides 'negligible' benefit (τ difference <0.001). The actual contribution is paired execution alone, which is a straightforward application of the differential/paired measurement already well known in systems benchmarking (Duet, Bulej et al., 2020). The paper effectively packages a single technique (paired measurement) under the name of a three-component method (Paired-MoM) in which two of the three components are shown not to help. This is over-packaging.",
    "Critical ranking-vs-selection contradiction on SIFT-128 (Table 1): paired methods achieve higher Kendall τ (0.925) but worse regret (4.18 vs. 1.53) and worse top-1 accuracy (69.3% vs. 86.2%) compared to unpaired mean. In most practical configuration selection tasks, the goal is to identify the best configuration (top-1), not to rank all 32. The paper's own Discussion (Section 4.6) concedes this but does not explain why practitioners should adopt a method that makes the primary task harder in the moderate-noise regime that covers most real deployments.",
    "Only 2 datasets are evaluated, both with HNSW only, and no statistical significance tests are reported anywhere. The 500 bootstrap replicates provide point estimates but no confidence intervals or hypothesis tests on the key metrics (τ, top-1, regret). The paper claims a '+83%' improvement in τ on GIST-960 (0.458→0.840) without reporting whether this difference is statistically significant or what the bootstrap CI is. With only 2 datasets and 32 configurations each, the generalizability claim is unsupported."
  ],
  "must_fix_items": [
    "Report confidence intervals or standard errors on all metrics in Tables 1 and 2 from the 500 bootstrap replicates. Without this, the claimed improvements (e.g., +83% τ) cannot be assessed for statistical significance.",
    "Acknowledge explicitly in the title and abstract that MoM provides no benefit at the evaluated budgets and that the contribution is paired execution. The current framing ('Paired Median-of-Means') elevates a non-contributing component to titular status, which is misleading given the paper's own evidence.",
    "Add at least one more dataset or algorithm family (e.g., IVF-PQ from FAISS, or a third dataset like DEEP-1B) to support generalizability beyond 2 HNSW-only experiments."
  ],
  "runs": [
    {
      "run": 1,
      "score": 5.3,
      "verdict": "Reject",
      "confidence": 0.75,
      "strengths": [
        "Well-defined problem with strong empirical motivation: the paper identifies a real and measurable issue—ANNS benchmarking noise with CV 4.6–4.9% exceeding the 2% reliability threshold (Section 4.2)—and demonstrates its practical consequence: 0% top-1 accuracy for all standard estimators on GIST-960 (Table 1). This is a compelling failure case that justifies the work.",
        "Clean ablation design that honestly reveals component contributions: Table 2 decomposes Paired-MoM into pairing vs. MoM vs. baseline, and the paper transparently reports that MoM actually hurts at the chosen budget (τ drops from 0.925→0.910 on SIFT-128, 0.840→0.817 on GIST-960). This honesty about a negative result for a named component is a strength.",
        "Budget sensitivity analysis (Section 4.5, Figure 2) shows scaling behavior across budgets 6–100, demonstrating that the advantage of paired execution compounds with more samples rather than vanishing. The claim that τ ≥ 0.90 is reached at budget 30 on SIFT-128 is concrete and verifiable."
      ],
      "weaknesses": [
        "The paper's own ablation (Table 2, Section 4.4) shows that MoM, the titular component, hurts performance at the evaluated budget, and that the log transform provides 'negligible' benefit (τ difference <0.001). The actual contribution is paired execution alone, which is a straightforward application of the differential/paired measurement already well known in systems benchmarking (Duet, Bulej et al., 2020). The paper effectively packages a single technique (paired measurement) under the name of a three-component method (Paired-MoM) in which two of the three components are shown not to help. This is over-packaging.",
        "Critical ranking-vs-selection contradiction on SIFT-128 (Table 1): paired methods achieve higher Kendall τ (0.925) but worse regret (4.18 vs. 1.53) and worse top-1 accuracy (69.3% vs. 86.2%) compared to unpaired mean. In most practical configuration selection tasks, the goal is to identify the best configuration (top-1), not to rank all 32. The paper's own Discussion (Section 4.6) concedes this but does not explain why practitioners should adopt a method that makes the primary task harder in the moderate-noise regime that covers most real deployments.",
        "Only 2 datasets are evaluated, both with HNSW only, and no statistical significance tests are reported anywhere. The 500 bootstrap replicates provide point estimates but no confidence intervals or hypothesis tests on the key metrics (τ, top-1, regret). The paper claims a '+83%' improvement in τ on GIST-960 (0.458→0.840) without reporting whether this difference is statistically significant or what the bootstrap CI is. With only 2 datasets and 32 configurations each, the generalizability claim is unsupported."
      ],
      "must_fix_items": [
        "Report confidence intervals or standard errors on all metrics in Tables 1 and 2 from the 500 bootstrap replicates. Without this, the claimed improvements (e.g., +83% τ) cannot be assessed for statistical significance.",
        "Acknowledge explicitly in the title and abstract that MoM provides no benefit at the evaluated budgets and that the contribution is paired execution. The current framing ('Paired Median-of-Means') elevates a non-contributing component to titular status, which is misleading given the paper's own evidence.",
        "Add at least one more dataset or algorithm family (e.g., IVF-PQ from FAISS, or a third dataset like DEEP-1B) to support generalizability beyond 2 HNSW-only experiments."
      ],
      "conference_scores": null
    }
  ]
}
