{
  "pdf": "finauditing-arelle-symbolic-baseline.pdf",
  "title": "EXECUTABLE FINMR: ARELLE-BASED SYMBOLIC BASELINES AN EXECUTABILITY AUDIT XBRL MATHEMATICAL REASONING",
  "elapsed": 51.4,
  "runs_mode": 1,
  "valid_runs": 1,
  "avg_score": 3.2,
  "scores": [
    3.2
  ],
  "score_std": 0,
  "final_verdict": "Strong Reject",
  "final_confidence": 0.6,
  "conference_scores": {
    "soundness": 2.3,
    "presentation": 2.5,
    "contribution": 1.8,
    "overall_rating": 3.2,
    "confidence": 3
  },
  "strengths": [
    "Clear and testable hypothesis: The paper posits that FinMR tests XBRL tooling rather than mathematical reasoning, and provides strong empirical support via a 3-5x performance gap between symbolic and neural approaches (Section 1, Table 1: 42.17% vs 13.86%).",
    "Executability audit provides actionable diagnostic value: The finding that only 58.73% of instances are executable, with 64.2% of failures due to missing external taxonomy dependencies (Figure 2), directly identifies a concrete, fixable problem with the benchmark rather than making vague claims about benchmark quality.",
    "Ablation study cleanly isolates a critical component: Table 3 shows that without href rewriting, 0% of instances are executable, conclusively demonstrating that FinMR queries are not self-contained XBRL artifacts — this is a sharp and informative result.",
    "Per-DQC rule breakdown reveals where symbolic execution genuinely excels vs. where it fails: Arelle outperforms regex by +16.7pp and +24.0pp on DQC 0015 and 0117 respectively, but underperforms by -31.8pp on DQC 0126 (Table 2). This honest disaggregation prevents over-claiming."
  ],
  "weaknesses": [
    "The Regex Message-Only baseline outperforms the proposed Arelle baseline overall (44.58% vs 42.17% on the full set, 74.36% vs 71.79% on the executable subset; Table 1), which fundamentally undermines the paper's narrative that symbolic XBRL execution is the right approach. The paper's own primary baseline is beaten by simple pattern matching on DQC messages, suggesting the benchmark may have even less reasoning depth than claimed.",
    "The DQC 0126 calculation error rate of 47.06% on the executable subset (Table 2) is alarmingly high for a symbolic system operating on valid XBRL packages. The paper attributes this to 'implementation issues in calculation linkbase traversal: selecting the correct link role and handling incomplete child fact sets' (Section 3.3), but the issue is never resolved or analyzed in depth. A symbolic system failing nearly half the time on one-third of the data raises questions about engineering completeness.",
    "The paper is auto-generated ('WARNING: This paper was generated by an automated research system', Abstract), which raises concerns about the depth of analysis, novelty assessment, and reproducibility of the engineering work. No human authorship is claimed, making it impossible to assess expertise or accountability for the Arelle integration details.",
    "Limited novelty in the core insight: The observation that domain-specific symbolic tools outperform general LLMs on structured domain tasks is well-established (MRKL, PAL, Program-of-Thoughts; all cited in Section 4). The contribution is primarily an engineering effort (wrapping Arelle) plus a benchmark audit, not a new method or theoretical insight.",
    "The 50-instance self-consistency (SC) LLM baseline (gpt-4.1, k=4) is too small and uses a different model from the one behind the published LLM results, so the comparison is indirect. The '29-instance intersection' comparison (Section 3.2) is particularly weak as a controlled experiment: the sample is tiny and no statistical test is reported."
  ],
  "must_fix_items": [
    "Address why the Regex baseline outperforms Arelle overall and what this implies for the 'FinMR tests tooling not reasoning' claim: if regex on message text beats symbolic XBRL execution, FinMR may test neither tooling nor reasoning, only text extraction from leaky DQC messages.",
    "Resolve or analyze in depth the 47.06% calculation error rate on DQC 0126: a symbolic system failing nearly half its calculations on valid inputs suggests either the system is incomplete or the benchmark's calculation rules are ambiguous; either way this must be diagnosed.",
    "Report statistical significance tests (for example, confidence intervals or a paired test on per-instance outcomes) for the LLM comparison, especially the 29-instance intersection result."
  ],
  "runs": [
    {
      "run": 1,
      "score": 3.2,
      "verdict": "Strong Reject",
      "confidence": 0.6,
      "strengths": [
        "Clear and testable hypothesis: The paper posits that FinMR tests XBRL tooling rather than mathematical reasoning, and provides strong empirical support via a 3-5x performance gap between symbolic and neural approaches (Section 1, Table 1: 42.17% vs 13.86%).",
        "Executability audit provides actionable diagnostic value: The finding that only 58.73% of instances are executable, with 64.2% of failures due to missing external taxonomy dependencies (Figure 2), directly identifies a concrete, fixable problem with the benchmark rather than making vague claims about benchmark quality.",
        "Ablation study cleanly isolates a critical component: Table 3 shows that without href rewriting, 0% of instances are executable, conclusively demonstrating that FinMR queries are not self-contained XBRL artifacts — this is a sharp and informative result.",
        "Per-DQC rule breakdown reveals where symbolic execution genuinely excels vs. where it fails: Arelle outperforms regex by +16.7pp and +24.0pp on DQC 0015 and 0117 respectively, but underperforms by -31.8pp on DQC 0126 (Table 2). This honest disaggregation prevents over-claiming."
      ],
      "weaknesses": [
        "The Regex Message-Only baseline outperforms the proposed Arelle baseline overall (44.58% vs 42.17% on the full set, 74.36% vs 71.79% on the executable subset; Table 1), which fundamentally undermines the paper's narrative that symbolic XBRL execution is the right approach. The paper's own primary baseline is beaten by simple pattern matching on DQC messages, suggesting the benchmark may have even less reasoning depth than claimed.",
        "The DQC 0126 calculation error rate of 47.06% on the executable subset (Table 2) is alarmingly high for a symbolic system operating on valid XBRL packages. The paper attributes this to 'implementation issues in calculation linkbase traversal: selecting the correct link role and handling incomplete child fact sets' (Section 3.3), but the issue is never resolved or analyzed in depth. A symbolic system failing nearly half the time on one-third of the data raises questions about engineering completeness.",
        "The paper is auto-generated ('WARNING: This paper was generated by an automated research system', Abstract), which raises concerns about the depth of analysis, novelty assessment, and reproducibility of the engineering work. No human authorship is claimed, making it impossible to assess expertise or accountability for the Arelle integration details.",
        "Limited novelty in the core insight: The observation that domain-specific symbolic tools outperform general LLMs on structured domain tasks is well-established (MRKL, PAL, Program-of-Thoughts; all cited in Section 4). The contribution is primarily an engineering effort (wrapping Arelle) plus a benchmark audit, not a new method or theoretical insight.",
        "The 50-instance self-consistency (SC) LLM baseline (gpt-4.1, k=4) is too small and uses a different model from the one behind the published LLM results, so the comparison is indirect. The '29-instance intersection' comparison (Section 3.2) is particularly weak as a controlled experiment: the sample is tiny and no statistical test is reported."
      ],
      "must_fix_items": [
        "Address why the Regex baseline outperforms Arelle overall and what this implies for the 'FinMR tests tooling not reasoning' claim: if regex on message text beats symbolic XBRL execution, FinMR may test neither tooling nor reasoning, only text extraction from leaky DQC messages.",
        "Resolve or analyze in depth the 47.06% calculation error rate on DQC 0126: a symbolic system failing nearly half its calculations on valid inputs suggests either the system is incomplete or the benchmark's calculation rules are ambiguous; either way this must be diagnosed.",
        "Report statistical significance tests (for example, confidence intervals or a paired test on per-instance outcomes) for the LLM comparison, especially the 29-instance intersection result."
      ],
      "conference_scores": {
        "soundness": 2.3,
        "presentation": 2.5,
        "contribution": 1.8,
        "overall_rating": 3.2,
        "confidence": 3
      }
    }
  ]
}