{
  "pdf": "sevir-quantile-remap-calibration.pdf",
  "title": "QUANTILE REMAP CALIBRATION FOR PRECIPITATION NOWCASTING FARS Analemma",
  "elapsed": 58.9,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.5,
  "scores": [
    3.5
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.8,
    "contribution": 1.8,
    "overall_rating": 3.5,
    "confidence": 3
  },
  "strengths": [
    "The near-miss analysis (Table 1) provides concrete evidence that 38.6% of false negatives at threshold 219 are intensity near-misses (predictions in [181, 219)), directly motivating the quantile remap approach. This diagnostic contribution is valuable regardless of the method itself, as it quantifies a specific failure mode in deterministic nowcasting models.",
    "QRC is a genuinely training-free, simple post-hoc method requiring only empirical CDF computation on a validation set. The implementation is straightforward (K=1024 bins, linear interpolation) and can be applied to any deterministic model without retraining, giving it strong practical utility and reproducibility. Code is publicly available.",
    "The CSI improvements are substantial and statistically significant with bootstrap CIs reported: CSI-M-POOL16 improves from 0.4660 to 0.5249 (95% CI [+0.0583, +0.0599]) and CSI-219-POOL16 from 0.2083 to 0.2692 (95% CI [+0.0624, +0.0650]). The comparison against affine calibration and isotonic regression (Table 2, Table 3) demonstrates that nonlinear quantile mapping is necessary, not just shifting/scaling."
  ],
  "weaknesses": [
    "QRC is a direct application of quantile mapping (QM), a well-established bias-correction technique from statistical meteorology. Pulkkinen et al. (2019) is cited, but the paper does not differentiate its methodological novelty from that prior work. The paper frames QRC as a new contribution, but applying QM to deep learning nowcasting outputs is a straightforward, incremental adaptation. The equation in Section 3.2, f_QRC(x) = F_Y^{-1}(F_hat_Y(x)), is the standard quantile mapping formula, with no modification beyond applying it to this specific output type.",
    "The blending parameter α=0.75 is selected via grid search on the test set (Section 3.3: 'We select α via grid search on the test set to maximize CSI-M-POOL16 while constraining CRPS degradation'). This is a serious methodological concern: tuning a hyperparameter on the test set constitutes data leakage and inflates the reported performance. α should instead be selected on a held-out validation split. This raises HF_DATA_LEAK concerns.",
    "QRC corrects only marginal (pixel-wise) intensity distributions while ignoring spatial structure entirely. The paper acknowledges this limitation in Section 4.6 but does not quantify the spatial degradation. Table 2 shows FSS-219-16 dropping from 0.5680 (uncalibrated) to 0.5269 (QRC original), a 7.2% relative degradation in spatial skill, yet this is not discussed as a serious concern. HSS-avg also drops from 0.5695 to 0.4904 (QRC original), a 13.9% relative degradation, indicating that the method harms overall skill at non-extreme thresholds.",
    "The 'closes 104% of the gap to CasCast' claim (Abstract, Section 4.3) is misleadingly framed. CasCast is a fundamentally different approach (cascaded diffusion) that produces probabilistic forecasts and optimizes different objectives. Comparing on CSI-M-POOL16 alone, while CasCast achieves far better CRPS (0.0202 vs. 0.0288) and provides full probabilistic outputs, makes the comparison selective and unfair. The 104% figure is cherry-picked from one metric.",
    "Only one base model (EarthFormer) and one dataset (SEVIR) are evaluated. There is no evidence that QRC generalizes to other architectures (ConvLSTM, PredRNN, DGMR), other datasets, or other domains. The method's generality claim ('can be applied to any deterministic nowcasting model') is unsupported by experiments."
  ],
  "must_fix_items": [
    "The blending hyperparameter α is tuned on the test set (Section 3.3), which constitutes data leakage. The authors must select α on a separate validation split or via cross-validation and re-report the results.",
    "The 'closes 104% of gap to CasCast' framing must be contextualized with the full metric picture — CasCast achieves substantially better CRPS (0.0202 vs 0.0288) and provides probabilistic forecasts. The current presentation is misleading.",
    "Evaluate on at least one additional base model or dataset to support the generality claim, or significantly soften the claim that QRC applies to 'any deterministic nowcasting model.'"
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.5,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "The near-miss analysis (Table 1) provides concrete evidence that 38.6% of false negatives at threshold 219 are intensity near-misses (predictions in [181, 219)), directly motivating the quantile remap approach. This diagnostic contribution is valuable regardless of the method itself, as it quantifies a specific failure mode in deterministic nowcasting models.",
        "QRC is a genuinely training-free, simple post-hoc method requiring only empirical CDF computation on a validation set. The implementation is straightforward (K=1024 bins, linear interpolation) and can be applied to any deterministic model without retraining, giving it strong practical utility and reproducibility. Code is publicly available.",
        "The CSI improvements are substantial and statistically significant with bootstrap CIs reported: CSI-M-POOL16 improves from 0.4660 to 0.5249 (95% CI [+0.0583, +0.0599]) and CSI-219-POOL16 from 0.2083 to 0.2692 (95% CI [+0.0624, +0.0650]). The comparison against affine calibration and isotonic regression (Table 2, Table 3) demonstrates that nonlinear quantile mapping is necessary, not just shifting/scaling."
      ],
      "weaknesses": [
        "QRC is a direct application of quantile mapping (QM), a well-established bias-correction technique from statistical meteorology. Pulkkinen et al. (2019) is cited, but the paper does not differentiate its methodological novelty from that prior work. The paper frames QRC as a new contribution, but applying QM to deep learning nowcasting outputs is a straightforward, incremental adaptation. The equation in Section 3.2, f_QRC(x) = F_Y^{-1}(F_hat_Y(x)), is the standard quantile mapping formula, with no modification beyond applying it to this specific output type.",
        "The blending parameter α=0.75 is selected via grid search on the test set (Section 3.3: 'We select α via grid search on the test set to maximize CSI-M-POOL16 while constraining CRPS degradation'). This is a serious methodological concern: tuning a hyperparameter on the test set constitutes data leakage and inflates the reported performance. α should instead be selected on a held-out validation split. This raises HF_DATA_LEAK concerns.",
        "QRC corrects only marginal (pixel-wise) intensity distributions while ignoring spatial structure entirely. The paper acknowledges this limitation in Section 4.6 but does not quantify the spatial degradation. Table 2 shows FSS-219-16 dropping from 0.5680 (uncalibrated) to 0.5269 (QRC original), a 7.2% relative degradation in spatial skill, yet this is not discussed as a serious concern. HSS-avg also drops from 0.5695 to 0.4904 (QRC original), a 13.9% relative degradation, indicating that the method harms overall skill at non-extreme thresholds.",
        "The 'closes 104% of the gap to CasCast' claim (Abstract, Section 4.3) is misleadingly framed. CasCast is a fundamentally different approach (cascaded diffusion) that produces probabilistic forecasts and optimizes different objectives. Comparing on CSI-M-POOL16 alone, while CasCast achieves far better CRPS (0.0202 vs. 0.0288) and provides full probabilistic outputs, makes the comparison selective and unfair. The 104% figure is cherry-picked from one metric.",
        "Only one base model (EarthFormer) and one dataset (SEVIR) are evaluated. There is no evidence that QRC generalizes to other architectures (ConvLSTM, PredRNN, DGMR), other datasets, or other domains. The method's generality claim ('can be applied to any deterministic nowcasting model') is unsupported by experiments."
      ],
      "must_fix_items": [
        "The blending hyperparameter α is tuned on the test set (Section 3.3), which constitutes data leakage. The authors must select α on a separate validation split or via cross-validation and re-report the results.",
        "The 'closes 104% of gap to CasCast' framing must be contextualized with the full metric picture — CasCast achieves substantially better CRPS (0.0202 vs 0.0288) and provides probabilistic forecasts. The current presentation is misleading.",
        "Evaluate on at least one additional base model or dataset to support the generality claim, or significantly soften the claim that QRC applies to 'any deterministic nowcasting model.'"
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.8,
        "contribution": 1.8,
        "overall_rating": 3.5,
        "confidence": 3
      }
    }
  ]
}