{
  "pdf": "vidvec-adaptive-rerank-budget.pdf",
  "title": "ADAPTIVE RERANK BUDGETING FOR VIDEO-TEXT RETRIEVAL VIA LAYER-DISAGREEMENT ROUTING FARS Analemma",
  "elapsed": 58.7,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "Training-free design: The entire method requires no additional training, relying solely on forward hooks and Jaccard distance computation. This is a practical advantage for deployment, as confirmed in Section 3.5 where the authors note 'VidVec-RouteK is entirely training-free' and that the only overhead is extracting embeddings from two extra layers via forward hooks.",
    "The cross-layer disagreement signal is a principled and novel confidence measure: Rather than relying on the margin between top-1 and top-2 embedding scores (which only captures local ranking information), the Jaccard distance between top-m sets across layers captures structural disagreement in the model's internal representations. Equation (1) provides a clean formalization. This connects to prior work on intermediate-layer representations (Skean et al., 2025; Bolya et al., 2025) but applies the insight to a new problem (adaptive retrieval budgeting).",
    "Counter-intuitive and notable finding: On MSR-VTT, the adaptive method with avg-K=30.9 achieves R@1=53.2, which exceeds the fixed K=100 baseline (R@1=52.5) by +0.7 (Table 1). This suggests that reranking irrelevant candidates deep in the list can introduce noise, which is an interesting empirical observation worth reporting, even if the authors do not deeply analyze why this occurs."
  ],
  "weaknesses": [
    "Extremely limited experimental scope — only 2 benchmarks and 1 model: All experiments are conducted on VideoLLaMA3-7B with MSR-VTT 1k-A and DiDeMo. The method's generality to other backbone models (e.g., InternVideo2, VideoPrism), other MLLM sizes, or other retrieval domains (image-text, document retrieval) is entirely untested. The authors themselves acknowledge this limitation in Section 5, but it severely undermines confidence in the generality of the claimed contribution. A method published at ICLR should demonstrate broader applicability.",
    "Tiny and potentially non-significant improvements: The primary claim is +0.9 R@1 over margin routing on MSR-VTT (53.2 vs 52.3) and +1.5 R@1 on DiDeMo (53.3 vs 51.8). These differences are small on benchmarks with only ~1000 test queries, where +0.9 R@1 corresponds to roughly 9 queries. No statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests) are reported anywhere in the paper. On MSR-VTT, the 'Disagreement Binary' variant already achieves 52.8 R@1, so the 3-tier improvement over binary is only +0.4 (roughly 4 queries), which may well be within sampling noise. HF_NO_SIGNIFICANCE applies.",
    "Threshold calibration and hyperparameter fragility: The routing thresholds τ1 and τ2 are 'calibrated on a validation set to achieve a target average budget (avg-K ≈ 30)' (Section 3.4). The paper does not disclose the actual threshold values, which validation set was used (a split of the MSR-VTT training data, or a separate set), or how sensitive results are to threshold choices. The layer set L = {20, 24, 27} is hand-picked, and while the ablation in Table 2 tests {18, 24, 27}, this is still a narrow perturbation. The top-m parameter (m=20) and the budget levels {10, 60, 100} are also fixed without justification. The claimed 'training-free' advantage is somewhat undermined by this threshold tuning.",
    "The paper was generated by an automated research system (stated in the abstract): 'WARNING: This paper was generated by an automated research system.' This raises concerns about the depth of scientific insight, the care taken in experimental design, and whether the paper represents genuine intellectual contribution versus automated hypothesis-testing over a narrow design space. While this does not inherently invalidate the results, it contextualizes the incremental nature of the contribution and the lack of deeper analysis (e.g., why does selective reranking outperform full reranking? what types of queries are mis-routed?)."
  ],
  "must_fix_items": [
    "Report statistical significance tests (bootstrap CIs or paired tests) for all R@1 comparisons, especially the +0.9 and +1.5 claims over margin routing, and the +0.7 claim over fixed K=100 on MSR-VTT.",
    "Disclose the actual threshold values τ1 and τ2, the validation set used for calibration, and conduct sensitivity analysis showing how results vary with different threshold choices.",
    "Evaluate on at least one additional model backbone (e.g., a different MLLM or different size) to demonstrate that the cross-layer disagreement signal is not an artifact of VideoLLaMA3-7B's architecture."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Training-free design: The entire method requires no additional training, relying solely on forward hooks and Jaccard distance computation. This is a practical advantage for deployment, as confirmed in Section 3.5 where the authors note 'VidVec-RouteK is entirely training-free' and that the only overhead is extracting embeddings from two extra layers via forward hooks.",
        "The cross-layer disagreement signal is a principled and novel confidence measure: Rather than relying on the margin between top-1 and top-2 embedding scores (which only captures local ranking information), the Jaccard distance between top-m sets across layers captures structural disagreement in the model's internal representations. Equation (1) provides a clean formalization. This connects to prior work on intermediate-layer representations (Skean et al., 2025; Bolya et al., 2025) but applies the insight to a new problem (adaptive retrieval budgeting).",
        "Counter-intuitive and notable finding: On MSR-VTT, the adaptive method with avg-K=30.9 achieves R@1=53.2, which exceeds the fixed K=100 baseline (R@1=52.5) by +0.7 (Table 1). This suggests that reranking irrelevant candidates deep in the list can introduce noise, which is an interesting empirical observation worth reporting, even if the authors do not deeply analyze why this occurs."
      ],
      "weaknesses": [
        "Extremely limited experimental scope — only 2 benchmarks and 1 model: All experiments are conducted on VideoLLaMA3-7B with MSR-VTT 1k-A and DiDeMo. The method's generality to other backbone models (e.g., InternVideo2, VideoPrism), other MLLM sizes, or other retrieval domains (image-text, document retrieval) is entirely untested. The authors themselves acknowledge this limitation in Section 5, but it severely undermines confidence in the generality of the claimed contribution. A method published at ICLR should demonstrate broader applicability.",
        "Tiny and potentially non-significant improvements: The primary claim is +0.9 R@1 over margin routing on MSR-VTT (53.2 vs 52.3) and +1.5 R@1 on DiDeMo (53.3 vs 51.8). These differences are small on benchmarks with only ~1000 test queries, where +0.9 R@1 corresponds to roughly 9 queries. No statistical significance tests (e.g., bootstrap confidence intervals, paired t-tests) are reported anywhere in the paper. On MSR-VTT, the 'Disagreement Binary' variant already achieves 52.8 R@1, so the 3-tier improvement over binary is only +0.4 (roughly 4 queries), which may well be within sampling noise. HF_NO_SIGNIFICANCE applies.",
        "Threshold calibration and hyperparameter fragility: The routing thresholds τ1 and τ2 are 'calibrated on a validation set to achieve a target average budget (avg-K ≈ 30)' (Section 3.4). The paper does not disclose the actual threshold values, which validation set was used (a split of the MSR-VTT training data, or a separate set), or how sensitive results are to threshold choices. The layer set L = {20, 24, 27} is hand-picked, and while the ablation in Table 2 tests {18, 24, 27}, this is still a narrow perturbation. The top-m parameter (m=20) and the budget levels {10, 60, 100} are also fixed without justification. The claimed 'training-free' advantage is somewhat undermined by this threshold tuning.",
        "The paper was generated by an automated research system (stated in the abstract): 'WARNING: This paper was generated by an automated research system.' This raises concerns about the depth of scientific insight, the care taken in experimental design, and whether the paper represents genuine intellectual contribution versus automated hypothesis-testing over a narrow design space. While this does not inherently invalidate the results, it contextualizes the incremental nature of the contribution and the lack of deeper analysis (e.g., why does selective reranking outperform full reranking? what types of queries are mis-routed?)."
      ],
      "must_fix_items": [
        "Report statistical significance tests (bootstrap CIs or paired tests) for all R@1 comparisons, especially the +0.9 and +1.5 claims over margin routing, and the +0.7 claim over fixed K=100 on MSR-VTT.",
        "Disclose the actual threshold values τ1 and τ2, the validation set used for calibration, and conduct sensitivity analysis showing how results vary with different threshold choices.",
        "Evaluate on at least one additional model backbone (e.g., a different MLLM or different size) to demonstrate that the cross-layer disagreement signal is not an artifact of VideoLLaMA3-7B's architecture."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}