{
  "pdf": "quantaalpha-multiple-testing-controls.pdf",
  "title": "DEFLATED-RANKICIR: MULTIPLE-TESTING-AWARE FACTOR SELECTION FOR LLM-DRIVEN ALPHA MIN-FARS",
  "elapsed": 49.9,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.2,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "The paper identifies a real and important problem: multiple testing bias in LLM-driven alpha mining factor selection, which existing systems (AlphaAgent, QuantaAlpha, R&D-Agent-Quant) do not address. This is a well-motivated gap in the literature (Section 1, Section 2).",
    "The bootstrap-based standard error estimation is a practical and well-justified technical choice. The evidence that analytical SE produces near-constant estimates (CV ≈0.5%) while bootstrap SE creates meaningful rank differentiation (Spearman correlation 0.83 vs 0.9994) is compelling and clearly presented (Section 3.4).",
    "The paper is transparent about statistical insignificance of its main results. Table 3 explicitly shows all 90% CIs include zero, and Section 4.4 honestly discusses limitations. This level of candor is commendable and rare in quantitative finance papers."
  ],
  "weaknesses": [
    "The core contribution is an incremental adaptation: replacing Sharpe ratio with RankICIR in the existing DSR framework (Bailey & de Prado, 2014). The adaptation is straightforward—substitute one ratio for another in Equation 1—and the paper does not introduce new statistical methodology or theoretical results beyond this substitution (Section 3.2–3.3).",
    "The empirical results are not statistically significant. All 90% confidence intervals for IR differences include zero (Table 3), and the improvement over RankICIR baseline is only 3.3% in IR (1.717 vs 1.662). With only 2 factor swaps between Pool B and Pool C (Jaccard 0.923), the practical impact is marginal. The paper itself acknowledges this fundamental limitation (Section 4.4).",
    "The experimental setup has limited generalizability: only one market (CSI300), one factor mining system (QuantaAlpha), only 70 candidate factors (a relatively small pool for multiple testing to be a serious concern), and a selection ratio K/M = 50/70 = 71.4% which leaves very little room for selection effects to manifest. With 70 candidates and 50 selected, even random selection retains most factors (Table 1, Section 4.1).",
    "The ablation study in Table 2 shows a paradoxical result: 'no correction' (C3, N̂=1) performs worse than the baseline methods A and B, which also do no correction. C3 achieves IR=1.572, worse than Pool A (1.615) and Pool B (1.662), yet C3 uses the same ranking as RankICIR with N̂=1 (which should reduce DSR to a monotonic transform of RankICIR). This inconsistency is unexplained and undermines the interpretation that bootstrap SE alone drives improvement (Table 2, Section 4.3)."
  ],
  "must_fix_items": [
    "Explain the ablation anomaly: C3 (N̂=1, no multiple-testing correction) produces IR=1.572, worse than Pool B (RankICIR, also no correction) at IR=1.662. If N̂=1 eliminates the DSR correction, C3 should be equivalent to ranking by bootstrap-SE-adjusted RankICIR, not worse than uncorrected RankICIR. This inconsistency must be resolved.",
    "Demonstrate results on at least one additional market/benchmark or with a larger candidate pool where M >> K to show the method's value scales with the severity of the multiple testing problem.",
    "Report statistical significance at conventional levels or acknowledge that the empirical contribution cannot be distinguished from noise, and adjust claims accordingly."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The paper identifies a real and important problem: multiple testing bias in LLM-driven alpha mining factor selection, which existing systems (AlphaAgent, QuantaAlpha, R&D-Agent-Quant) do not address. This is a well-motivated gap in the literature (Section 1, Section 2).",
        "The bootstrap-based standard error estimation is a practical and well-justified technical choice. The evidence that analytical SE produces near-constant estimates (CV ≈0.5%) while bootstrap SE creates meaningful rank differentiation (Spearman correlation 0.83 vs 0.9994) is compelling and clearly presented (Section 3.4).",
        "The paper is transparent about statistical insignificance of its main results. Table 3 explicitly shows all 90% CIs include zero, and Section 4.4 honestly discusses limitations. This level of candor is commendable and rare in quantitative finance papers."
      ],
      "weaknesses": [
        "The core contribution is an incremental adaptation: replacing Sharpe ratio with RankICIR in the existing DSR framework (Bailey & de Prado, 2014). The adaptation is straightforward—substitute one ratio for another in Equation 1—and the paper does not introduce new statistical methodology or theoretical results beyond this substitution (Section 3.2–3.3).",
        "The empirical results are not statistically significant. All 90% confidence intervals for IR differences include zero (Table 3), and the improvement over RankICIR baseline is only 3.3% in IR (1.717 vs 1.662). With only 2 factor swaps between Pool B and Pool C (Jaccard 0.923), the practical impact is marginal. The paper itself acknowledges this fundamental limitation (Section 4.4).",
        "The experimental setup has limited generalizability: only one market (CSI300), one factor mining system (QuantaAlpha), only 70 candidate factors (a relatively small pool for multiple testing to be a serious concern), and a selection ratio K/M = 50/70 = 71.4% which leaves very little room for selection effects to manifest. With 70 candidates and 50 selected, even random selection retains most factors (Table 1, Section 4.1).",
        "The ablation study in Table 2 shows a paradoxical result: 'no correction' (C3, N̂=1) performs worse than the baseline methods A and B, which also do no correction. C3 achieves IR=1.572, worse than Pool A (1.615) and Pool B (1.662), yet C3 uses the same ranking as RankICIR with N̂=1 (which should reduce DSR to a monotonic transform of RankICIR). This inconsistency is unexplained and undermines the interpretation that bootstrap SE alone drives improvement (Table 2, Section 4.3)."
      ],
      "must_fix_items": [
        "Explain the ablation anomaly: C3 (N̂=1, no multiple-testing correction) produces IR=1.572, worse than Pool B (RankICIR, also no correction) at IR=1.662. If N̂=1 eliminates the DSR correction, C3 should be equivalent to ranking by bootstrap-SE-adjusted RankICIR, not worse than uncorrected RankICIR. This inconsistency must be resolved.",
        "Demonstrate results on at least one additional market/benchmark or with a larger candidate pool where M >> K to show the method's value scales with the severity of the multiple testing problem.",
        "Report statistical significance at conventional levels or acknowledge that the empirical contribution cannot be distinguished from noise, and adjust claims accordingly."
      ],
      "conference_scores": {
        "soundness": 2.2,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}