Title: COPY-THEN-INPAINT: IMPROVING TEMPORAL CONSISTENCY IN MULTI-STEP GUI GENERATION VIA SELECTIVE REGION EDITING FARS
PDF: 79415f26-c79f-4fa8-94aa-27ab56b73177.pdf
Score: 4.2
Verdict: Reject
Confidence: 0.72
Elapsed: 327.7s

Strengths:
1. Clean three-stage pipeline design with clear motivation: the observation that most GUI actions modify only small regions is valid and well-articulated (Section 3.1, paragraph 2), and the pipeline directly exploits this domain property through masked inpainting and pixel compositing.
2. Well-designed ablation framework: the shuffled-mask condition (C) is a thoughtful control that isolates semantic alignment from mere area reduction (Table 1, Section 4.3), and the dilation ablation (Table 2) reveals a meaningful tradeoff between boundary coherence and task completion, with statistical significance reported (p=0.004 for GOAL degradation).
3. Transparent reporting of negative results: the authors honestly report that the method fails on English GUIs (CONS −0.4, p=0.87 in Section 4.5), which is unusual and commendable. This finding actually provides more insight than the positive aggregate result, as it reveals that the pipeline's effectiveness is contingent on VLM localization accuracy in structured vs. complex layouts.
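
For reference, the copy-then-composite step credited in Strength 1 can be sketched as below. This is an illustrative sketch only: the exact form of the paper's Equation (1) is not reproduced in this review, so the blend `m * new + (1 - m) * old` is an assumption, and `composite` and `box_to_mask` are hypothetical helper names. Frames are small grayscale grids so the example stays dependency-free.

```python
# Hypothetical sketch of a masked blend (assumed form of the paper's Equation (1)).

def box_to_mask(height, width, box):
    """Rasterize an (x0, y0, x1, y1) box into a 0/1 grid; x1 and y1 are exclusive."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]

def composite(prev_frame, inpainted, mask):
    """Copy prev_frame where mask == 0 and take inpainted pixels where mask == 1."""
    return [
        [m * new + (1 - m) * old for old, new, m in zip(prev_row, new_row, mask_row)]
        for prev_row, new_row, mask_row in zip(prev_frame, inpainted, mask)
    ]

# Toy 3x4 frames: the previous frame is uniform, the inpainted frame differs everywhere.
prev_frame = [[10] * 4 for _ in range(3)]
inpainted = [[99] * 4 for _ in range(3)]

# Assumed predicted change region covering two pixels in the middle row.
mask = box_to_mask(3, 4, (1, 1, 3, 2))
next_frame = composite(prev_frame, inpainted, mask)

for row in next_frame:
    print(row)
```

Only the two masked pixels take inpainted values; every other pixel is copied unchanged, which is the property the temporal-consistency claim rests on.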

Weaknesses:
1. Trivial core contribution, overstated in the framing: the pipeline is 'VLM predicts bounding box → inpaint masked region → paste back,' which is standard practice in image editing. Equation (1) is a simple masked blend. The related work already describes DiffEdit (Couairon et al., 2022), which performs automatic mask generation followed by inpainting. The paper adds a VLM-based mask predictor and applies it to GUI trajectories, which is an engineering adaptation rather than a methodological contribution. The abstract's language ('significantly improves,' 'essential') inflates what is essentially a straightforward composition of existing components.
2. No comparison to existing GUI world models: the related work discusses ViMo (Luo et al., 2025), gWorld (Koh et al., 2026), and MobileDreamer (Cao et al., 2026), but none appear as baselines. The only baseline is the self-constructed full-mask condition (A), which is a strawman — regenerating the entire frame with the same inpainting model. Without comparison to published GUI trajectory generation methods, it is impossible to assess whether the approach is competitive or merely better than an obviously suboptimal baseline.
3. The aggregate CONS improvement (+5.7) is misleading: it is driven entirely by the Chinese subset (+11.8), while the English subset shows no significant effect (−0.4, p=0.87, Section 4.5). Reporting the aggregate as a significant improvement (p<0.01) masks the fact that the method provides no benefit on half the benchmark, so the headline claim is not representative. This is a severe limitation for a method claimed to address 'temporal drift in multi-step GUI generation' generically.
4. No mask prediction quality analysis: the entire pipeline depends on the VLM's ability to accurately predict change regions, yet the paper provides zero quantitative evaluation of mask quality — no IoU with ground-truth change regions, no precision/recall, no failure case analysis. Section 4.5 speculates about 'structured layouts' vs. 'complex UI elements' but offers no evidence. Without this, it is impossible to determine whether the bottleneck is the mask predictor or the inpainting model, or to assess the pipeline's robustness.
5. Single benchmark, single model, single evaluation protocol: the evaluation uses only GEBench Type 2 (n=200), only Qwen-Image-Edit for generation, and only GPT-4o as a judge. There is no cross-model generalization test (e.g., with a different inpainting backbone), no alternative GUI benchmark, and no human evaluation to validate the LLM-as-judge scores. The n=200 sample size across 4 subsets (50 each) limits statistical power for subset-level claims.
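
The arithmetic behind Weakness 3 can be checked directly. The sketch below uses only the per-language CONS deltas quoted above; the 100/100 sample split is an assumption inferred from "4 subsets (50 each)" and "half the benchmark", not a figure confirmed by the paper.

```python
# Per-language CONS deltas as quoted in this review (Section 4.5 of the paper).
deltas = {"Chinese": 11.8, "English": -0.4}

# Assumption: two 50-sample subsets per language, i.e. 100 samples each.
counts = {"Chinese": 100, "English": 100}

def weighted_mean(values, weights):
    """Sample-weighted mean of per-subset deltas."""
    total = sum(weights.values())
    return sum(values[k] * weights[k] for k in values) / total

aggregate = weighted_mean(deltas, counts)
print(round(aggregate, 1))  # prints 5.7
```

Under that assumption the reported +5.7 is exactly the equal-weight mean of +11.8 and −0.4, so the aggregate carries no information beyond the Chinese-subset gain.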

Must Fix Items:
1. Report mask prediction quality (IoU or similar) against ground-truth change regions to quantify the VLM's localization accuracy and identify failure modes.
2. Add at least one comparison to a published GUI world model baseline (ViMo, gWorld, or MobileDreamer) rather than only the self-constructed full-mask strawman.
3. Disaggregate the headline CONS claim by subset — the abstract and introduction should not present the +5.7 figure as representative when it is driven entirely by Chinese GUIs, with English GUIs showing no improvement. The aggregate claim is misleading without this qualification.
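
One possible shape for the mask-quality analysis requested in item 1 is sketched below. It is a suggestion, not the authors' protocol: it assumes ground-truth change regions can be approximated by the tight bounding box of pixels that differ between consecutive ground-truth frames, and `change_box`, `box_iou`, and `box_precision_recall` are hypothetical helper names.

```python
# Hypothetical mask-quality check: predicted box vs. pixel-diff ground truth.
# Boxes are (x0, y0, x1, y1) with x1 and y1 exclusive.

def box_area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def box_iou(pred, truth):
    inter = box_area(intersection(pred, truth))
    union = box_area(pred) + box_area(truth) - inter
    return inter / union if union else 0.0

def box_precision_recall(pred, truth):
    """Precision: share of the predicted box that truly changed; recall: share of the change covered."""
    inter = box_area(intersection(pred, truth))
    precision = inter / box_area(pred) if box_area(pred) else 0.0
    recall = inter / box_area(truth) if box_area(truth) else 0.0
    return precision, recall

def change_box(prev_frame, next_frame):
    """Tight bounding box of differing pixels, or None if the frames are identical."""
    coords = [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(prev_frame, next_frame))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

# Toy 4x4 frames: a 2x2 block changes between steps.
prev_frame = [[0] * 4 for _ in range(4)]
next_frame = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]

truth = change_box(prev_frame, next_frame)
pred = (1, 1, 4, 3)  # assumed VLM prediction that overshoots by one column
iou = box_iou(pred, truth)
precision, recall = box_precision_recall(pred, truth)
```

Reporting these per subset (Chinese vs. English) would show whether the English-GUI failure originates in localization or in the inpainting model.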

Runs:
- run=1 score=4.2 verdict=Reject confidence=0.72 error=None