Title: INTERFACE-ROOTED REPO MAPS TOKEN-
PDF: featurebench-repomap-token-budget.pdf
Score: 2.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 53.7s

Strengths:
1. Honest reporting of a negative result: The paper clearly refutes its own hypothesis (≥25% token reduction) and reports that IR-RepoMap actually increases token consumption by 20.8% on comparable tasks (Table 2). This transparency is commendable and avoids the common publication bias toward positive results. Evidence: Section 4.2, Table 2 mean ratio of 1.208.
2. Well-designed experimental control: The inclusion of a Random-Map null control that samples random files formatted identically isolates whether any benefit comes from structured import-closure content versus simply adding prefix text. This is a methodologically sound design choice. Evidence: Section 3.3, 'Random-Map control' description; Table 1 Random-Map row.
3. Per-task granular analysis with honest variance reporting: Table 2 and Figure 2 provide per-task breakdowns showing extreme variance (std=274.6%, range −89% to +889%), which prevents misleading aggregate claims and reveals the task-dependent nature of the intervention. Evidence: Table 2, Figure 2 scatter plot.

Weaknesses:
1. Floor effect renders quality comparison impossible: Both IR-RepoMap and Baseline achieve a 0% resolved rate due to proot-based containers (vs 6.7% with Docker in the FeatureBench paper). The primary outcome metric (task success) is therefore entirely uninformative, and the only measurable effect is on token consumption, a secondary metric. The experiment cannot assess whether the repo map helps or hurts task quality. Evidence: Table 1 (0% resolved for both conditions), Section 4.1 ('0% resolved rate versus 6.7% reported'), Section 5 Limitations.
2. Very small effective sample size (N=13 comparable tasks): Out of 26 total tasks, only 13 produced comparable API completions for both conditions. With N=13 and extreme variance (std=274.6%), no statistically meaningful conclusion can be drawn. The 20.8% mean increase could easily be driven by a few outlier tasks. No statistical significance test is reported. Evidence: Table 2 (13 rows), Section 4.2. This triggers HF_NO_SIGNIFICANCE: no confidence intervals, p-values, or effect size measures are provided for the central claim.
3. Minimal novelty in the method itself: IR-RepoMap consists of four steps: (1) regex-extract paths from problem statements, (2) BFS on the import graph with depth=3, (3) extract signatures via AST, (4) prepend to prompt. Each component is straightforward and well-established, and the method has no adaptive or learned component. Although the paper presents the pipeline as a contribution, it amounts to an engineering recipe with fixed hyperparameters (D=3, B=1500) that are not ablated. Evidence: Section 3.2, four stages described with no novelty beyond the combination.
4. Incomplete Random-Map control analysis: The Random-Map control lacks token metrics ('–' in Table 1), undermining the ability to isolate whether token changes are due to structured content vs. prefix length. The paper acknowledges this in Section 5 but does not address it experimentally. Evidence: Table 1 (Mean Input Tokens '–' for Random-Map), Section 4.4 ('lacks token metrics for direct comparison').
5. No hyperparameter sensitivity analysis: The depth bound D=3 and token budget B=1500 are fixed without justification or ablation. The negative result could simply reflect poor hyperparameter choices rather than a fundamental limitation of static repo maps. Evidence: Section 3.2 Stage 3 ('maximum depth D = 3', 'token budget B = 1500'), no ablation study anywhere in the paper.
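
For reference, the four-step pipeline summarized in weakness 3 can be sketched in a few dozen lines, which illustrates why the reviewer regards it as an engineering recipe. This is a minimal sketch under assumptions, not the paper's implementation: the repository is modelled as an in-memory dict from file path to source text, every function name is hypothetical, tokens are approximated by whitespace-separated words, and modules are mapped to paths by replacing dots with slashes. The defaults max_depth=3 and budget=1500 mirror the D and B values quoted above.

```python
import ast
import re
from collections import deque


def extract_paths(problem_statement):
    """Pull Python file paths out of the task text with a regex."""
    return re.findall(r"[\w./-]+\.py\b", problem_statement)


def imports_of(source):
    """Return the module names imported anywhere in a source string."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


def import_closure(roots, repo, max_depth=3):
    """Breadth-first search over the import graph, bounded by max_depth."""
    seen = []
    queue = deque((root, 0) for root in roots if root in repo)
    while queue:
        path, depth = queue.popleft()
        if path in seen:
            continue
        seen.append(path)
        if depth == max_depth:
            continue  # do not expand files at the depth bound
        for module in imports_of(repo[path]):
            candidate = module.replace(".", "/") + ".py"  # assumed layout
            if candidate in repo and candidate not in seen:
                queue.append((candidate, depth + 1))
    return seen


def signatures(source):
    """Top-level function and class signatures, extracted via the AST."""
    out = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            out.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            out.append(f"class {node.name}")
    return out


def build_repo_map(problem_statement, repo, max_depth=3, budget=1500):
    """Assemble the map text, stopping once the token budget would be exceeded."""
    lines, used = [], 0
    roots = extract_paths(problem_statement)
    for path in import_closure(roots, repo, max_depth):
        for line in [f"# {path}"] + signatures(repo[path]):
            cost = len(line.split())  # crude whitespace token proxy (assumption)
            if used + cost > budget:
                return "\n".join(lines)
            lines.append(line)
            used += cost
    return "\n".join(lines)


def prepend_map(problem_statement, repo, **kwargs):
    """Prepend the repo map to the original prompt."""
    return build_repo_map(problem_statement, repo, **kwargs) + "\n\n" + problem_statement
```

Because max_depth and budget are plain parameters in a sketch like this, sweeping them as requested in weakness 5 would require no change to the pipeline itself.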

Must Fix Items:
1. Add statistical significance tests (e.g., paired t-test or Wilcoxon signed-rank on the 13 per-task token differences) with confidence intervals for the mean token change. Without this, the 20.8% increase claim is not statistically grounded (HF_NO_SIGNIFICANCE).
2. Report token metrics for the Random-Map control to properly isolate structure vs. length effects; currently this control is incomplete and uninformative for the primary outcome (token consumption).
3. Ablate at least the depth bound D and token budget B to demonstrate the negative result is not an artifact of specific hyperparameter choices.
4. Address the floor effect: either run experiments on proper Docker infrastructure so resolved rate is above zero, or explicitly reframe the contribution as only about token consumption (not task quality) and acknowledge this as a severe limitation.
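
The analysis requested in must-fix item 1 could look roughly like the following. This is a sketch under stated assumptions: it uses only the Python standard library, computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign patterns (feasible for N=13, which gives 8192 patterns), and adds a percentile bootstrap confidence interval for the mean paired difference. The function names and the token counts in the example are hypothetical placeholders, not values from the paper.

```python
import itertools
import random
import statistics


def exact_wilcoxon_signed_rank(diffs):
    """Return (W+, exact two-sided p-value) for paired differences."""
    d = [x for x in diffs if x != 0]  # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute values and give each its average rank.
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    centre = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - centre)
    # Enumerate every assignment of signs to ranks (2**n patterns).
    extreme = sum(
        1
        for signs in itertools.product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(ranks, signs) if s) - centre) >= observed - 1e-9
    )
    return w_plus, extreme / 2 ** n


def bootstrap_mean_ci(values, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical per-task input-token counts, NOT taken from the paper.
    baseline = [1200, 950, 3000, 800, 1500]
    ir_repomap = [1400, 900, 3600, 1000, 1450]
    diffs = [b - a for a, b in zip(baseline, ir_repomap)]
    w_plus, p_value = exact_wilcoxon_signed_rank(diffs)
    low, high = bootstrap_mean_ci(diffs)
    print(f"W+ = {w_plus}, p = {p_value:.4f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

Given the reported range of −89% to +889%, a rank-based test is likely a safer choice than a paired t-test, since a single extreme task cannot dominate the statistic. Running the same code on log ratios rather than raw differences would be one way to put large and small tasks on a comparable scale.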

Runs:
- run=1 score=2.8 verdict=Strong Reject confidence=0.6 error=None