{
  "pdf": "self-anchored-temporal-filtering.pdf",
  "title": "SELF-ANCHORED TEMPORAL FILTERING FOR LLM-FREE TEMPORAL-AWARE MEMORY RETRIEVAL FARS Analemma",
  "elapsed": 47.1,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 4.2,
  "scores": [
    4.2
  ],
  "score_std": 0,
  "final_verdict": "Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.4,
    "presentation": 2.6,
    "contribution": 2.2,
    "overall_rating": 4.2,
    "confidence": 3
  },
  "strengths": [
    "Practical zero-LLM-call solution: SATF eliminates the need for expensive GPT-4o API calls (127 calls in the evaluation) while achieving better NDCG@10 (0.683 vs 0.566), as shown in Table 1. This is a genuine efficiency gain with clear deployment value.",
    "Soft reranking avoids over-filtering: The design choice of soft score interpolation (Eq. 4) rather than hard filtering is well-motivated and empirically validated. GPT-4o's hard filtering degrades R@10 from 0.795 to 0.701 while SATF maintains 0.795 (Table 1), demonstrating that the soft-boost approach preserves recall.",
    "Do-no-harm analysis is commendable: Table 3 explicitly checks that SATF does not degrade non-temporal queries, with a worst-case R@10 change of only -0.014. This is a rigorous evaluation practice often omitted in similar papers."
  ],
  "weaknesses": [
    "Single-benchmark evaluation is a major limitation: The entire experimental evaluation is conducted only on LongMemEval (Section 4.1), with no validation on other temporal retrieval benchmarks (e.g., TempReason, TimeQA, SituatedQA). The authors themselves acknowledge this in the conclusion, but it severely limits the generalizability of the claims.",
    "Hyperparameter sensitivity analysis is incomplete and internally inconsistent: Figure 2 references a 'confidence gating threshold γ' and 'window size fraction' that are never defined in the method section (Section 3). The method section defines N=30, σ=15, α=10 as hyperparameters, but the sensitivity analysis uses different parameters (γ and window fraction). This disconnect raises questions about whether the reported sensitivity corresponds to the actual method described.",
    "Baseline comparison is narrow and potentially unfair: The only temporal-aware baseline is GPT-4o time-range filtering from the LongMemEval benchmark itself. No comparison with other temporal reranking methods (e.g., learning-to-rank with temporal features, TILDE-style temporal query expansion, or simpler statistical baselines like recency-weighted scoring) is provided. A straightforward baseline like 'boost by recency (1/time_ago)' would help isolate how much of SATF's gain comes from multi-peak Gaussian kernels vs. simple temporal priors.",
    "The 'self-anchored' insight may be circular: The method uses top-N retrieval results to define temporal anchors, then re-ranks those same results. If the initial retrieval is already temporally correlated (which it should be for semantically similar items about the same topic), the Gaussian kernel will naturally boost items from the same cluster. The paper does not analyze how much of the improvement comes from genuine temporal reasoning vs. simply amplifying existing semantic clusters. An ablation comparing against a simple recency prior would clarify this."
  ],
  "must_fix_items": [
    "Resolve inconsistency between hyperparameters defined in Section 3.5 (N, σ, α) and those analyzed in Section 4.5/Figure 2 (γ, window fraction). Either update the method section to include γ and window fraction, or revise the sensitivity analysis to cover the actual hyperparameters.",
    "Add at least one simpler temporal baseline (e.g., recency-weighted scoring or single-peak Gaussian at the mean timestamp of top-N results) to isolate the contribution of multi-peak kernels vs. basic temporal priors.",
    "Report statistical significance (e.g., bootstrap confidence intervals or paired t-tests) for the main results in Table 1. With n=127 temporal queries, it is feasible and necessary to confirm the improvements are not due to chance."
  ],
  "runs": [
    {
      "run": 1,
      "score": 4.2,
      "verdict": "Reject",
      "confidence": 0.6,
      "strengths": [
        "Practical zero-LLM-call solution: SATF eliminates the need for expensive GPT-4o API calls (127 calls in the evaluation) while achieving better NDCG@10 (0.683 vs 0.566), as shown in Table 1. This is a genuine efficiency gain with clear deployment value.",
        "Soft reranking avoids over-filtering: The design choice of soft score interpolation (Eq. 4) rather than hard filtering is well-motivated and empirically validated. GPT-4o's hard filtering degrades R@10 from 0.795 to 0.701 while SATF maintains 0.795 (Table 1), demonstrating that the soft-boost approach preserves recall.",
        "Do-no-harm analysis is commendable: Table 3 explicitly checks that SATF does not degrade non-temporal queries, with a worst-case R@10 change of only -0.014. This is a rigorous evaluation practice often omitted in similar papers."
      ],
      "weaknesses": [
        "Single-benchmark evaluation is a major limitation: The entire experimental evaluation is conducted only on LongMemEval (Section 4.1), with no validation on other temporal retrieval benchmarks (e.g., TempReason, TimeQA, SituatedQA). The authors themselves acknowledge this in the conclusion, but it severely limits the generalizability of the claims.",
        "Hyperparameter sensitivity analysis is incomplete and internally inconsistent: Figure 2 references a 'confidence gating threshold γ' and 'window size fraction' that are never defined in the method section (Section 3). The method section defines N=30, σ=15, α=10 as hyperparameters, but the sensitivity analysis uses different parameters (γ and window fraction). This disconnect raises questions about whether the reported sensitivity corresponds to the actual method described.",
        "Baseline comparison is narrow and potentially unfair: The only temporal-aware baseline is GPT-4o time-range filtering from the LongMemEval benchmark itself. No comparison with other temporal reranking methods (e.g., learning-to-rank with temporal features, TILDE-style temporal query expansion, or simpler statistical baselines like recency-weighted scoring) is provided. A straightforward baseline like 'boost by recency (1/time_ago)' would help isolate how much of SATF's gain comes from multi-peak Gaussian kernels vs. simple temporal priors.",
        "The 'self-anchored' insight may be circular: The method uses top-N retrieval results to define temporal anchors, then re-ranks those same results. If the initial retrieval is already temporally correlated (which it should be for semantically similar items about the same topic), the Gaussian kernel will naturally boost items from the same cluster. The paper does not analyze how much of the improvement comes from genuine temporal reasoning vs. simply amplifying existing semantic clusters. An ablation comparing against a simple recency prior would clarify this."
      ],
      "must_fix_items": [
        "Resolve inconsistency between hyperparameters defined in Section 3.5 (N, σ, α) and those analyzed in Section 4.5/Figure 2 (γ, window fraction). Either update the method section to include γ and window fraction, or revise the sensitivity analysis to cover the actual hyperparameters.",
        "Add at least one simpler temporal baseline (e.g., recency-weighted scoring or single-peak Gaussian at the mean timestamp of top-N results) to isolate the contribution of multi-peak kernels vs. basic temporal priors.",
        "Report statistical significance (e.g., bootstrap confidence intervals or paired t-tests) for the main results in Table 1. With n=127 temporal queries, it is feasible and necessary to confirm the improvements are not due to chance."
      ],
      "conference_scores": {
        "soundness": 2.4,
        "presentation": 2.6,
        "contribution": 2.2,
        "overall_rating": 4.2,
        "confidence": 3
      }
    }
  ]
}