Title: FARKAS DUAL RAYS DO NOT IMPROVE LLM-BASED OPTIMIZATION MODEL REPAIR FARS Analemma
PDF: 701b7578-4d6d-411d-af1c-bb094de99da9.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.72
Elapsed: 321.0s

Strengths:
1. Honest and transparent reporting of a negative result. The paper explicitly states that DualRayRank produces identical results to baseline IIS-TopK (1/31 each, same instance), and that the truncation regime shows 0/16 repair for all methods. This level of candor is commendable and avoids the common pitfall of cherry-picking conditions to fabricate a positive signal (Sections 4.2, 5).
2. Well-controlled experimental design for the primary comparison. The controlled conditions (same 7B model, K=5, greedy decoding, identical prompt budget) properly isolate the effect of constraint ranking from confounding factors like model size or sampling strategy. The three-way comparison (IIS-TopK, DualRay-TopK, DualRay+Weights) systematically tests both ranking signal and explicit weight information (Table 1, top section).
3. The Best-of-2 vs. repair comparison is a valuable and unexpected finding. Showing that simple inference scaling (65.12% Pass@1 with 2 samples) outperforms all repair methods including 10× larger models (58.86%) provides actionable guidance for the field: regeneration dominates repair on this benchmark (Table 2, Figure 2).

Weaknesses:
1. Extremely small sample size (n=31 infeasible instances, n=16 in truncation regime) renders all comparisons statistically uninformative. With a repair rate of 1/31 vs 1/31, the exact (Clopper-Pearson) 95% binomial confidence interval for each method is [0.08%, 16.7%]; the two intervals are identical and very wide. No significance tests (Fisher's exact, bootstrap, or otherwise) are reported. The claim that dual-ray ranking 'does not improve' repair is unsupported at this sample size; the study is simply underpowered to detect an effect (Table 1, Section 4.2). HF_NO_SIGNIFICANCE applies.
2. The core methodological contribution (DualRayRank) is trivially defined: sort constraints by |y_i| and take the top K. This is a one-line sorting operation with no algorithmic novelty, and it is the most obvious possible use of multiplier magnitudes. The three conditions (baseline, ranking-only, ranking+weights) test only the most straightforward variants, with no exploration of alternative encodings, aggregation strategies, or prompt designs that might leverage the dual-ray structure more effectively (Section 3.3).
3. Extended comparison (Table 1 bottom) confounds model capacity, context budget (K), sampling strategy, and feedback method simultaneously. The 72B/K=10/16×2 configuration changes four variables at once relative to the controlled baseline, making it impossible to attribute the 7/31 repair rate to any specific factor. The paper acknowledges this confound qualitatively (Section 4.2) but does not provide the necessary ablations (e.g., 72B with K=5, or 7B with K=10 and 16×2 sampling) to disentangle the effects.
4. Single benchmark (MAMO-Optimization), single model family (Qwen2.5-Instruct), and single solver (HiGHS) severely limit generalizability. The paper's own Limitations section acknowledges this, but the negative-result framing ('Farkas dual rays do not improve…') implies a general claim that the evidence cannot support. Different benchmarks, model families (GPT-4, Claude, Llama), or solvers (Gurobi, CPLEX) might produce different IIS distributions and dual-ray characteristics that affect ranking quality (Section 5).
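
The interval quoted in Weakness 1 can be checked with a short script. The following is a minimal sketch, assuming the exact (Clopper-Pearson) construction and using only the Python standard library; the function names `binom_cdf` and `clopper_pearson` are illustrative and do not come from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""

    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: smallest p with P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: largest p with P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Repair counts taken from the review: 1 repaired instance out of 31.
low, high = clopper_pearson(1, 31)
print(f"95% CI for 1/31: [{100 * low:.2f}%, {100 * high:.2f}%]")
```

Because both methods repair exactly 1 of 31 instances, the same interval applies to each, which is why the comparison cannot separate them.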

Must Fix Items:
1. Add statistical significance tests for all pairwise comparisons. With n=31, even Fisher's exact test or bootstrap confidence intervals on the difference would be informative. Reporting raw counts without any statistical framework is insufficient for a scientific claim, especially a negative one.
2. Deconfound the extended comparison by providing ablations that vary one factor at a time: (a) 7B with K=10, (b) 7B with 16×2 sampling and K=5, (c) 32B/72B with K=5 and single greedy decoding. Without these, the extended results are uninterpretable.
3. Soften the universal negative claim in the title and abstract. 'Do Not Improve' implies a general finding, but the evidence supports only 'do not improve on MAMO-Optimization with Qwen2.5-7B, measured on n=31 infeasible instances'. The current framing overstates the scope of the evidence.
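
As an illustration of the test requested in Must Fix Item 1, the sketch below computes a two-sided Fisher's exact p-value from repaired/unrepaired counts. It is a standard-library-only approximation of what a statistics package would report, not the authors' analysis; the helper name `fisher_exact_two_sided` is hypothetical, and the counts are the ones cited in this review.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(x_min, x_max + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Controlled comparison: IIS-TopK 1/31 repaired vs DualRay-TopK 1/31 repaired.
p_controlled = fisher_exact_two_sided(1, 30, 1, 30)
print(f"controlled comparison p-value: {p_controlled:.3f}")

# Extended comparison: 1/31 vs 7/31. This contrast is confounded (model size,
# K, sampling, feedback), so any p-value here does not isolate a single factor.
p_extended = fisher_exact_two_sided(1, 30, 7, 24)
print(f"extended comparison p-value: {p_extended:.3f}")
```

Reporting such p-values alongside the raw counts would make explicit how little the controlled comparison can distinguish between the two methods.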

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.72 error=None