Title: TEMPLATELEAK: A TEMPLATE-DISJOINT EVALUATION AUDIT OF COMMONFORMS FORM FIELD DETECTION
PDF: commonforms-template-disjoint-eval.pdf
Score: 3.5
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 44.2s

Strengths:
1. The paper addresses a legitimate and important concern in document understanding evaluation (template overlap between train and test splits) and provides a rigorous statistical framework (MinHash/LSH clustering plus permutation testing) to audit it. The method is principled, and pre-registering the decision criteria (Section 2.4) is good practice that prevents post-hoc rationalization of results.
2. The threshold sensitivity analysis (Section 3.4, Table 3) strengthens the main claim by showing the 'Refute' conclusion holds across four similarity thresholds (τ = 0.50 to 0.95), with all z-scores negative and all p-values > 0.72. This demonstrates robustness to the key hyperparameter choice.
3. The paper is transparent about the nuanced finding: despite refuting template leakage, it honestly reports the substantial mAP gap of +12.9 between the Overlap-Test and Novel-Test slices (Table 1) and discusses confounding factors (Section 3.5), such as differences in field count and field size, rather than hiding this inconvenient result.

Weaknesses:
1. The contribution is narrow and incremental: the paper applies well-established techniques (MinHash/LSH from Broder 1997, permutation testing) to a single dataset (CommonForms) and reaches a single negative conclusion (no leakage). The framework itself is not novel, since MinHash clustering and permutation tests are standard tools, and the main result is a null finding. Null findings can be valuable, but the paper does not demonstrate the framework's utility beyond this one dataset, which limits its generalizability claims (Section 5 mentions 'future work' but provides no evidence).
2. The quantized field-layout token representation (Section 2.1) may not adequately capture template similarity. Using B=32 bins for spatial quantization and representing pages as multisets of 5-tuples ignores document-level structure such as field ordering, semantic groupings, and visual formatting cues. Two forms with identical templates but slightly different field counts or minor positional shifts could fall below the Jaccard threshold, while two unrelated forms with sparse, similarly-positioned fields could be clustered together. No validation of this representation's fidelity to human-judged template similarity is provided.
3. The permutation test has a conceptual issue: it shuffles document IDs while preserving split sizes, but the null distribution reflects random splitting of the existing document pool. If CommonForms itself has low template diversity (many documents from few templates), random splitting would naturally produce high overlap fractions in the null, making it easy for the observed split to appear 'below chance.' The paper does not analyze template diversity in the corpus, so the negative z-score could reflect corpus properties rather than careful split construction. Additionally, with only N=1000 permutations, the resolution of the p-value is limited (granularity of 0.001).
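
The sensitivity raised in weakness 2 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 5-tuple is taken to be (x-bin, y-bin, width-bin, height-bin, field type), coordinates are assumed normalized to [0, 1], and the example form is hypothetical. The paper's actual tokenization may differ.

```python
from collections import Counter

def quantize_tokens(fields, bins=32):
    """Map fields to a multiset of quantized 5-tuples.

    Assumed layout (not confirmed by the paper): each field is
    (x, y, w, h, field_type) with coordinates normalized to [0, 1].
    """
    def q(v):
        # Clamp so that v == 1.0 falls in the last bin.
        return min(int(v * bins), bins - 1)
    return Counter((q(x), q(y), q(w), q(h), t) for x, y, w, h, t in fields)

def multiset_jaccard(a, b):
    """Jaccard similarity on multisets: sum of min counts over sum of max counts."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

# Hypothetical four-field form and a copy shifted by 2% of the page size.
template = [
    (0.10, 0.10, 0.30, 0.03, "text"),
    (0.10, 0.20, 0.30, 0.03, "text"),
    (0.10, 0.30, 0.03, 0.03, "checkbox"),
    (0.50, 0.30, 0.03, 0.03, "checkbox"),
]
shifted = [(x + 0.02, y + 0.02, w, h, t) for x, y, w, h, t in template]

sim = multiset_jaccard(quantize_tokens(template), quantize_tokens(shifted))
print(f"Jaccard after a 2% shift: {sim:.3f}")
```

Under these assumptions, a shift smaller than one bin width (1/32 of the page) still pushes most fields across a bin boundary, so the same template can score below even the loosest threshold of τ = 0.50.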

Must Fix Items:
1. Validate the quantized field-layout token representation against human-judged template similarity, or at minimum discuss its limitations and failure modes.
2. Report template diversity statistics for the CommonForms corpus (e.g., number of distinct clusters, cluster size distribution) to contextualize the negative z-score finding.
3. Increase the number of permutations (e.g., N=10000) for finer-grained p-value estimation, or justify why N=1000 is sufficient.
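
Must-fix items 2 and 3 can be sketched together on a synthetic corpus. This is not the paper's implementation: the overlap statistic (fraction of test documents whose cluster also contains a train document), the upper-tailed p-value, and the corpus itself (nine 100-document templates plus 100 singletons) are assumptions chosen to show how low template diversity shapes the null distribution.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def overlap_fraction(clusters, is_test):
    """Fraction of test docs whose cluster also appears in train (assumed statistic)."""
    train_clusters = {c for c, t in zip(clusters, is_test) if not t}
    test_clusters = [c for c, t in zip(clusters, is_test) if t]
    return sum(c in train_clusters for c in test_clusters) / len(test_clusters)

def permutation_test(clusters, is_test, n_perm=1000, seed=0):
    """Shuffle split labels (split sizes preserved) and compare to the observed overlap."""
    rng = random.Random(seed)
    observed = overlap_fraction(clusters, is_test)
    labels = list(is_test)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        null.append(overlap_fraction(clusters, labels))
    mu, sd = mean(null), pstdev(null)
    z = (observed - mu) / sd if sd > 0 else float("nan")
    # Upper-tailed p-value with the usual +1 correction; its floor is 1 / (n_perm + 1).
    p = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, mu, z, p

# Synthetic low-diversity corpus: clusters 0..8 hold 100 docs each, 9..108 are singletons.
clusters = [i // 100 for i in range(900)] + list(range(9, 109))
# Hypothetical split: hold out all of template 8 and every singleton as test.
is_test = [c >= 8 for c in clusters]

# Cluster size distribution, the kind of diversity statistic requested in item 2.
size_histogram = Counter(Counter(clusters).values())
print("cluster size -> number of clusters:", dict(size_histogram))

observed, mu, z, p = permutation_test(clusters, is_test, n_perm=1000)
print(f"observed={observed:.3f} null_mean={mu:.3f} z={z:.1f} p={p:.3f}")
print("smallest attainable p:", 1 / (1000 + 1))
```

In this toy setting the null mean is high simply because most documents share a handful of templates, so almost any deliberately constructed split yields a strongly negative z-score. Reporting the cluster size distribution alongside the test would let readers judge how much of the observed z-score reflects corpus structure.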

Runs:
- run=1 score=3.5 verdict=Strong Reject confidence=0.6 error=None