{
  "pdf": "asa-delexgate-schema-churn.pdf",
  "title": "CANONICAL SCHEMA VIEWS ACTIVATION",
  "elapsed": 87.7,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.2,
    "presentation": 2.5,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Clear negative-result framing with honest reporting: The paper openly reports that DelexGate-ASA fails, providing specific quantitative evidence — 15.2 percentage point AST accuracy drop from 0.768 to 0.616 on clean schemas (Table 1). This kind of honest negative-result reporting is valuable for the community and avoids over-packaging.",
    "The finding that ASA (Recalibrate) ≈ ASA (Reuse) under churn is a practically useful observation: The near-identical performance (AST accuracy differs by <0.003 across all churned conditions, Table 1) demonstrates that ASA steering vectors are inherently robust to lexical churn, which has direct implications for practitioners — no need for recalibration when schemas change.",
    "Well-motivated research question: The problem of schema churn affecting activation steering is a legitimate and under-studied concern. The hypothesis that canonicalization could create schema-invariant representations is reasonable given prior work in dialogue state tracking (Rastogi et al., 2017; 2019), making the negative result informative."
  ],
  "weaknesses": [
    "The canonicalization approach is naively designed, making the negative result somewhat unsurprising: Replacing semantically meaningful parameter names like 'location' and 'unit' with generic placeholders like 'arg_0_0' and 'arg_0_1' obviously removes critical information that LLMs use for argument mapping. The paper itself acknowledges this in Section 4.2 ('canonicalization replaces meaningful parameter names with generic placeholders, removing semantic cues'). A negative result from such an aggressive delexicalization strategy has limited instructive value — a more nuanced approach (e.g., preserving semantic similarity via synonym replacement or embedding-based canonicalization) would have made the failure more insightful.",
    "Extremely limited experimental scope — single small model, single benchmark: All experiments use only Qwen2.5-1.5B-Instruct on BFCL v3 (620 examples). No larger models (e.g., 7B, 72B variants), no other benchmarks, no other model families. The observation that 'prompt-only baseline achieves the highest AST accuracy across all conditions' (Section 4.3) actually undermines the relevance of the entire study — if ASA itself provides no benefit for this model/benchmark, then studying ASA's robustness to churn is of questionable value. This critical limitation is acknowledged but not addressed.",
    "Inconsistency in implementation details between main text and appendix: Section 3.3 states 'layer L=18, threshold τ=0.7, and steering strength α=3.0,' but Appendix A states 'Steering vectors are computed at layer 12 with a steering strength of α=1.0. The linear probe threshold is set to 0.5.' These are substantial discrepancies that affect reproducibility and raise questions about which configuration was actually used. This constitutes a reproducibility concern (HF_NON_REPRODUCIBLE candidate).",
    "Limited novelty — the core insight (LLMs rely on parameter names for argument mapping) is well-established: Prior work like Hammer (Lin et al., 2024, cited in Section 2) already demonstrates that 'models being misled by function and parameter names' and explores function masking techniques. The finding that removing names hurts performance is largely confirmatory rather than revelatory. The paper's third contribution ('ASA assets are inherently robust to lexical churn') is the most novel but is a relatively narrow empirical observation from a single model."
  ],
  "must_fix_items": [
    "Resolve the contradictory hyperparameter values between Section 3.3 (layer 18, α=3.0, τ=0.7) and Appendix A (layer 12, α=1.0, τ=0.5). Report which was actually used and why the discrepancy exists.",
    "Address the elephant in the room: if prompt-only outperforms ASA on AST accuracy (the primary metric), the paper needs to either (a) demonstrate ASA's value on a setting where it actually helps, or (b) explicitly reframe the contribution as 'even when ASA helps in other settings, here is what happens under churn.' Without this, the churn-robustness question is being studied in a regime where ASA itself is not beneficial.",
    "Evaluate on at least one additional model scale (e.g., Qwen2.5-7B) to assess whether findings generalize, or explicitly limit claims to the 1.5B regime."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear negative-result framing with honest reporting: The paper openly reports that DelexGate-ASA fails, providing specific quantitative evidence — 15.2 percentage point AST accuracy drop from 0.768 to 0.616 on clean schemas (Table 1). This kind of honest negative-result reporting is valuable for the community and avoids over-packaging.",
        "The finding that ASA (Recalibrate) ≈ ASA (Reuse) under churn is a practically useful observation: The near-identical performance (AST accuracy differs by <0.003 across all churned conditions, Table 1) demonstrates that ASA steering vectors are inherently robust to lexical churn, which has direct implications for practitioners — no need for recalibration when schemas change.",
        "Well-motivated research question: The problem of schema churn affecting activation steering is a legitimate and under-studied concern. The hypothesis that canonicalization could create schema-invariant representations is reasonable given prior work in dialogue state tracking (Rastogi et al., 2017; 2019), making the negative result informative."
      ],
      "weaknesses": [
        "The canonicalization approach is naively designed, making the negative result somewhat unsurprising: Replacing semantically meaningful parameter names like 'location' and 'unit' with generic placeholders like 'arg_0_0' and 'arg_0_1' obviously removes critical information that LLMs use for argument mapping. The paper itself acknowledges this in Section 4.2 ('canonicalization replaces meaningful parameter names with generic placeholders, removing semantic cues'). A negative result from such an aggressive delexicalization strategy has limited instructive value — a more nuanced approach (e.g., preserving semantic similarity via synonym replacement or embedding-based canonicalization) would have made the failure more insightful.",
        "Extremely limited experimental scope — single small model, single benchmark: All experiments use only Qwen2.5-1.5B-Instruct on BFCL v3 (620 examples). No larger models (e.g., 7B, 72B variants), no other benchmarks, no other model families. The observation that 'prompt-only baseline achieves the highest AST accuracy across all conditions' (Section 4.3) actually undermines the relevance of the entire study — if ASA itself provides no benefit for this model/benchmark, then studying ASA's robustness to churn is of questionable value. This critical limitation is acknowledged but not addressed.",
        "Inconsistency in implementation details between main text and appendix: Section 3.3 states 'layer L=18, threshold τ=0.7, and steering strength α=3.0,' but Appendix A states 'Steering vectors are computed at layer 12 with a steering strength of α=1.0. The linear probe threshold is set to 0.5.' These are substantial discrepancies that affect reproducibility and raise questions about which configuration was actually used. This constitutes a reproducibility concern (HF_NON_REPRODUCIBLE candidate).",
        "Limited novelty — the core insight (LLMs rely on parameter names for argument mapping) is well-established: Prior work like Hammer (Lin et al., 2024, cited in Section 2) already demonstrates that 'models being misled by function and parameter names' and explores function masking techniques. The finding that removing names hurts performance is largely confirmatory rather than revelatory. The paper's third contribution ('ASA assets are inherently robust to lexical churn') is the most novel but is a relatively narrow empirical observation from a single model."
      ],
      "must_fix_items": [
        "Resolve the contradictory hyperparameter values between Section 3.3 (layer 18, α=3.0, τ=0.7) and Appendix A (layer 12, α=1.0, τ=0.5). Report which was actually used and why the discrepancy exists.",
        "Address the elephant in the room: if prompt-only outperforms ASA on AST accuracy (the primary metric), the paper needs to either (a) demonstrate ASA's value on a setting where it actually helps, or (b) explicitly reframe the contribution as 'even when ASA helps in other settings, here is what happens under churn.' Without this, the churn-robustness question is being studied in a regime where ASA itself is not beneficial.",
        "Evaluate on at least one additional model scale (e.g., Qwen2.5-7B) to assess whether findings generalize, or explicitly limit claims to the 1.5B regime."
      ],
      "conference_scores": {
        "soundness": 2.2,
        "presentation": 2.5,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}