Title: SCAFFOLDSWAP: ARE DISCRETE SPEECH UNITS NECESSARY AS A TEMPORAL SCAFFOLD FOR AUDIO-DRIVEN 3D FACIAL ANIMATION?
PDF: exomni-scaffold-swap-ablation.pdf
Score: 3.8
Verdict: Strong Reject
Confidence: 0.60
Elapsed: 48.7s

Strengths:
1. Controlled experimental design: The paper holds the decoder architecture (TCN), training procedure (optimizer, schedule, epochs), prosody features (F0, energy), and evaluation protocol constant across all three conditions, which cleanly isolates the speech representation as the only variable. This is a meaningful methodological contribution in a field where cross-paper comparisons are confounded by architectural differences (Sections 3.2, 3.3, 3.4).
2. Useful finding for practitioners: Phoneme+timing recovers roughly 90% of the improvement that discrete units achieve over the SSL baseline (9.0% vs 10.2% on BIWI, 5.5% vs 5.9% on VOCASET). Production systems that already run a forced-alignment pipeline can therefore make an informed decision about whether the added complexity of discrete tokenization is worthwhile (Table 1, Section 5).
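3. Discretization ablation cleanly isolates the bottleneck effect: Comparing HuBERT-continuous (9.494) with HuBERT+k-means (8.080) under an identical encoder backend indicates that the improvement comes from quantization itself rather than from the underlying representation quality. Note that the quoted 17.5% is the gap measured relative to the discrete result (1.414 / 8.080); measured as an error reduction relative to the continuous baseline it is 14.9% (1.414 / 9.494). This is a non-obvious finding that supports the information bottleneck hypothesis (Table 2).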
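
The percentages quoted in Strengths 2 and 3 can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers cited in this review; the function names are illustrative, not taken from the paper.

```python
def reduction_vs_baseline(baseline: float, improved: float) -> float:
    """Error reduction expressed as a fraction of the baseline error."""
    return (baseline - improved) / baseline


def gap_vs_improved(baseline: float, improved: float) -> float:
    """Same absolute gap, expressed as a fraction of the improved error."""
    return (baseline - improved) / improved


# Values as quoted from Table 2 in this review.
hubert_continuous = 9.494
hubert_kmeans = 8.080

print(f"{reduction_vs_baseline(hubert_continuous, hubert_kmeans):.1%}")  # 14.9%
print(f"{gap_vs_improved(hubert_continuous, hubert_kmeans):.1%}")        # 17.5%

# Share of the discrete-unit improvement over SSL that phoneme+timing recovers,
# using the relative improvements quoted from Table 1.
recovered_biwi = 9.0 / 10.2
recovered_vocaset = 5.5 / 5.9
print(f"BIWI: {recovered_biwi:.1%}, VOCASET: {recovered_vocaset:.1%}")
```

The two conventions differ by more than two percentage points here, so the paper should state which one it reports.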

Weaknesses:
1. Statistical significance claim is flawed: The paper calls the gap between discrete units and phoneme+timing 'statistically significant' because the gap exceeds one standard deviation of the discrete-units condition by a factor of 7.1 (BIWI) and 5.2 (VOCASET). Comparing a gap to the standard deviation of only one group is not a valid statistical test. A proper test (e.g., paired t-test, bootstrap, or Welch's t-test) must account for the variability of both conditions. With only 3 seeds, even a proper test would have very low power. The claim of 'rejecting the scaffold equivalence hypothesis' is therefore overstated (Section 4.2).
2. Extremely limited evaluation scope: The paper evaluates on only two small datasets (BIWI: 40 sentences, 6 speakers, 0.33 hours; VOCASET: 40 sentences, 12 speakers, 0.56 hours) using a single metric (LVE). There is no evaluation of upper-face motion or expression quality, no subjective perceptual study (e.g., a user study), and no other standard metric such as FDD or MVE. The practical significance of a 0.5-1.3% LVE improvement is unclear without perceptual validation (Section 4.1, Table 1).
3. Decoder architecture may not be representative: The shared TCN decoder is a simple architecture (5 conv blocks, kernel size 3). The conclusions about scaffold effectiveness may not transfer to the transformer-based or diffusion-based decoders used in state-of-the-art systems (FaceFormer, CodeTalker, FaceDiffuser). A TCN may benefit disproportionately from discrete inputs because of its limited receptive field, whereas transformer architectures with attention might close the gap between continuous and discrete conditioning. The paper does not discuss this limitation (Section 3.3).
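
To make the receptive-field concern in Weakness 3 concrete, the sketch below computes the temporal receptive field of a stack of 1D convolutions. This review does not state the dilation schedule or the number of convolutions per block, so both are assumptions here; the two configurations shown are only plausible bounds, not the paper's actual setup.

```python
def receptive_field(kernel_size: int, dilations: list[int], convs_per_block: int = 1) -> int:
    """Receptive field, in frames, of stacked stride-1 dilated 1D convolutions.

    Each convolution with kernel size k and dilation d widens the
    receptive field by (k - 1) * d frames.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


# Assumption A: 5 blocks, kernel size 3, no dilation.
undilated = receptive_field(3, [1, 1, 1, 1, 1])

# Assumption B: 5 blocks, kernel size 3, dilation doubling per block.
doubling = receptive_field(3, [1, 2, 4, 8, 16])

print(undilated, doubling)  # 11 63
```

Under assumption A the decoder sees only 11 frames of context, which would make it heavily dependent on the temporal structure already present in its input. The paper should report the actual schedule so readers can judge how far the result depends on this.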

Must Fix Items:
1. Replace the ad-hoc 'gap exceeds one standard deviation' significance test with a proper statistical test (e.g., paired t-test across seeds, or bootstrap confidence intervals on the performance difference). With only 3 seeds, acknowledge low statistical power honestly rather than claiming hypothesis rejection.
2. Add at least one supplementary evaluation metric beyond LVE (e.g., Max Vertex Error, FDD, or a small-scale perceptual study) to assess whether the 0.5-1.3% LVE gap between discrete units and phoneme+timing is perceptually meaningful.
3. Discuss the architectural generalizability limitation: acknowledge that findings with a TCN decoder may not transfer to attention-based or diffusion-based architectures, and ideally add at least one alternative decoder as a robustness check.
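
For Must Fix item 1, a minimal sketch of a paired comparison across seeds follows. The per-seed LVE values are placeholders invented for illustration and are NOT taken from the paper. The paired form assumes both conditions were trained with the same three seeds; if they were not, Welch's t-test would be the more appropriate choice.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 2 degrees of freedom (n = 3 seeds).
T_CRIT_DF2 = 4.303


def paired_summary(a: list[float], b: list[float]) -> tuple[float, float, float]:
    """Return (mean difference, 95% CI half-width, t statistic) for paired samples."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, T_CRIT_DF2 * std_err, mean_diff / std_err


# Placeholder per-seed LVE values (hypothetical, same seeds in both conditions).
phoneme_timing = [8.20, 8.35, 8.10]
discrete_units = [8.05, 8.15, 8.04]

mean_diff, half_width, t_stat = paired_summary(phoneme_timing, discrete_units)
print(f"mean gap = {mean_diff:.3f} +/- {half_width:.3f}, t = {t_stat:.2f}")
print("significant at 5%:", abs(t_stat) > T_CRIT_DF2)
```

With these placeholder values the gap is several times the seed-to-seed spread of the discrete-units condition, yet the 95% confidence interval still includes zero. This is the situation the paper's ad-hoc criterion cannot detect, and it illustrates why three seeds provide very little power.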

Runs:
- run=1 score=3.8 verdict=Strong Reject confidence=0.6 error=None