To better understand model performance beyond value prediction, we also analyze click accuracy on forms with varying numbers of fields. Click accuracy declines as the number of fields increases, a trend we attribute largely to higher visual complexity and denser layouts. Value prediction is reported for all field types, but we report click accuracy only for String and Description fields, because click accuracy on other types, such as checkboxes and dropdowns, remains consistently close to zero. Even on this restricted subset, click accuracy is low for every model, which highlights how difficult precise pixel-level interaction and spatial grounding remain, including for simple text inputs.
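The per-field-count breakdown described above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code used for the benchmark: the record layout (`num_fields`, `field_type`, `click`, `box`) is hypothetical, and the hit criterion (the predicted click lands inside the target field's bounding box) is an assumption, since the exact criterion is not stated in this section.

```python
from collections import defaultdict

# Assumption: a click is correct when it falls inside the target field's
# bounding box, given as (x_min, y_min, x_max, y_max) in pixels.
def click_hit(click, box):
    x, y = click
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

# Only text-input field types are scored, mirroring the filtering above.
TEXT_TYPES = {"String", "Description"}

def click_accuracy_by_field_count(records):
    """Return {num_fields: accuracy} over String/Description fields only."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        if r["field_type"] not in TEXT_TYPES:
            continue  # checkboxes, dropdowns, etc. are excluded
        totals[r["num_fields"]] += 1
        hits[r["num_fields"]] += click_hit(r["click"], r["box"])
    return {n: hits[n] / totals[n] for n in sorted(totals)}

# Toy records invented for illustration; these are not results from the paper.
records = [
    {"num_fields": 5, "field_type": "String",
     "click": (12, 12), "box": (10, 10, 50, 20)},
    {"num_fields": 5, "field_type": "Description",
     "click": (70, 40), "box": (10, 30, 50, 60)},
    {"num_fields": 5, "field_type": "Checkbox",
     "click": (0, 0), "box": (5, 5, 8, 8)},
    {"num_fields": 20, "field_type": "String",
     "click": (0, 0), "box": (10, 10, 50, 20)},
]
accuracy = click_accuracy_by_field_count(records)
print(accuracy)
```

Grouping by the raw field count keeps the sketch simple; with real data, one would more likely bucket counts into ranges so that each group contains enough forms for a stable estimate.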
We further explore how field count and field type interact, since the two often co-occur in real-world forms and jointly contribute to task difficulty. To capture this interaction, we conduct a 3D analysis of model performance as a function of both variables. Models perform best when both field count and type complexity are low, and worst when both are high, suggesting a strong compounding effect. This pattern supports the design of our benchmark, which intentionally includes forms with both numerous and diverse fields to stress-test the spatial reasoning, alignment, and grounding capabilities of current MLLMs.
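One way to tabulate the joint effect described above is to average correctness within each (field-count bin, field type) cell; the resulting grid is what a 3D bar or surface plot would display. The sketch below is an assumption about how such a grid could be computed, not the authors' analysis code: the bin edges, the bin labels, and the record fields (`num_fields`, `field_type`, `correct`) are all hypothetical.

```python
from collections import defaultdict

def joint_accuracy(records, count_bins):
    """Mean correctness per (field-count bin, field type) cell."""
    def bin_label(n):
        # Map a raw field count to the first bin whose range contains it.
        for label, (lo, hi) in count_bins.items():
            if lo <= n <= hi:
                return label
        raise ValueError(f"field count {n} falls outside all bins")

    cells = defaultdict(list)
    for r in records:
        key = (bin_label(r["num_fields"]), r["field_type"])
        cells[key].append(r["correct"])
    # Booleans sum as 0/1, so sum/len is the fraction of correct predictions.
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}

# Hypothetical bin edges (inclusive), chosen only for this example.
count_bins = {"few": (1, 10), "many": (11, 50)}

# Toy records invented for illustration; these are not results from the paper.
records = [
    {"num_fields": 3, "field_type": "String", "correct": True},
    {"num_fields": 4, "field_type": "String", "correct": True},
    {"num_fields": 5, "field_type": "Checkbox", "correct": False},
    {"num_fields": 5, "field_type": "Checkbox", "correct": True},
    {"num_fields": 30, "field_type": "String", "correct": True},
    {"num_fields": 30, "field_type": "String", "correct": False},
    {"num_fields": 40, "field_type": "Checkbox", "correct": False},
]
grid = joint_accuracy(records, count_bins)
for cell, acc in sorted(grid.items()):
    print(cell, acc)
```

In this toy grid, the cell with few fields and a simple type scores highest and the cell with many fields and a harder type scores lowest, which is the shape a compounding effect would produce. Comparing the drop along each axis separately against the drop along the diagonal is a simple check for whether the two factors interact or merely add.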