FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

1National University of Singapore, 2Wuhan University, 3Zhejiang University, 4Nanjing University

# indicates equal contribution.

Figure 1: Overview of the form-filling task and its challenges. Compared to general GUI tasks, form-filling involves more diverse applications and demands higher semantic understanding, layout flexibility, and interaction complexity.

Abstract

Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click," existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory—an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.

Benchmark Construction

We developed a high-fidelity browser-based interaction platform using Python and Flask to evaluate the performance of form-filling agents. This interactive platform enables users to complete various form-filling tasks directly within a web interface, while the backend system performs real-time evaluation by automatically comparing submitted values against gold-standard field annotations. The platform comprises 20 web forms spanning eight real-world domains, including academia, finance, healthcare, and information technology. It supports a diverse range of input types, such as text fields, dropdown menus, date pickers, checkboxes, file uploads, and numerical inputs.
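As a concrete illustration, below is a minimal sketch of how such a backend comparison might look in Flask; the route name, field schema, and normalization rule are hypothetical stand-ins, not the platform's actual implementation.

from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical gold-standard annotations for one form instance.
GOLD = {
    "full_name": "Jane Doe",
    "email": "jane.doe@example.com",
    "start_date": "2025-06-01",
}

def normalize(value):
    # Simple case- and whitespace-insensitive comparison; real field types
    # (dates, checkboxes, file uploads) would need type-specific rules.
    return " ".join(str(value).split()).lower()

@app.route("/submit", methods=["POST"])
def submit():
    submitted = request.form  # maps field name -> submitted value
    correct = sum(
        normalize(submitted.get(field, "")) == normalize(gold)
        for field, gold in GOLD.items()
    )
    # Return a real-time score for this submission.
    return jsonify({"field_accuracy": correct / len(GOLD),
                    "total_fields": len(GOLD)})

if __name__ == "__main__":
    app.run(debug=True)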

To simulate the complexity of real-world deployment environments, the platform incorporates a wide variety of page layouts, font styles, and color schemes. It also supports multi-page forms and modular field definitions, allowing for flexible configuration of diverse evaluation scenarios. This platform provides a realistic and compositional UI environment for large-scale, reproducible evaluation of both human annotators and multimodal models.
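For intuition, a modular field definition could look like the following sketch; the schema, keys, and values are hypothetical and only illustrate the idea of composing multi-page forms from typed field specifications.

# Hypothetical form specification; the platform's actual schema may differ.
# Pages group fields, and each field declares a type, options, and whether
# it is required, so new evaluation scenarios can be composed from parts.
FORM_SPEC = {
    "form_id": "job_application",
    "pages": [
        {
            "title": "Applicant Info",
            "fields": [
                {"name": "full_name", "type": "text", "required": True},
                {"name": "start_date", "type": "date", "required": True},
                {"name": "position", "type": "dropdown",
                 "options": ["Research Engineer", "Data Analyst"]},
            ],
        },
        {
            "title": "Attachments",
            "fields": [
                {"name": "resume_file", "type": "file", "required": False},
            ],
        },
    ],
}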

Demo video of the interaction platform.

Additionally, we constructed a dataset of 1,250 form-filling instances using a generative approach. Each instance consists of an input document or descriptive text paired with structured field-value annotations. The inputs include real or LLM-generated resumes, academic papers, leave requests, registration forms, and other free-form text formats. We first sample gold-standard field values from the form templates, and then prompt a large language model to generate natural language inputs that implicitly contain these values, thereby simulating realistic user behavior. For example, in job application scenarios, we generate detailed Markdown-formatted resumes, while in academic contexts, metadata is extracted from real research papers.
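A minimal sketch of this sample-then-verbalize pipeline is given below; the field template, prompt wording, and the commented-out LLM call are assumptions for illustration rather than the exact prompts used.

import random

# Hypothetical value pools for a job-application form template.
TEMPLATE = {
    "full_name": ["Jane Doe", "Wei Zhang", "Amir Khan"],
    "position": ["Research Engineer", "Data Analyst"],
    "years_experience": [str(n) for n in range(1, 15)],
}

def sample_gold():
    # Step 1: sample gold-standard field values from the form template.
    return {field: random.choice(pool) for field, pool in TEMPLATE.items()}

def build_prompt(gold):
    # Step 2: ask an LLM for a document that conveys the sampled values
    # implicitly, instead of listing them as explicit key-value pairs.
    facts = "; ".join(f"{k}: {v}" for k, v in gold.items())
    return ("Write a realistic Markdown resume that naturally conveys the "
            f"following facts without enumerating them: {facts}")

gold = sample_gold()
prompt = build_prompt(gold)
# resume_text = llm.generate(prompt)  # any LLM client; call is illustrative
# Each dataset instance then pairs resume_text with the gold annotations.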

The final dataset comprises 13,800 field-value pairs, encompassing a wide variety of field types and input modalities. It presents significant challenges for models in terms of language understanding and layout reasoning. Detailed statistics of the dataset are provided in Table 1.

| Category | Form Type | Fields | Samples | Total Fields | # Field Types |
|---|---|---|---|---|---|
| Academic & Research | Job Application for University Positions | 4 | 50 | 200 | 2 |
| | Grant or Research Funding Application | 6 | 50 | 300 | 5 |
| | Paper Submission Form | 7 | 50 | 300 | 3 |
| | Student Course Registration Form | 8 | 50 | 400 | 4 |
| | Scholarship Application for Students | 16 | 50 | 800 | 4 |
| Professional & Business | Startup Funding Application | 18 | 50 | 900 | 6 |
| | Real Estate Rental Application | 22 | 50 | 1,100 | 6 |
| | Educational Workshop Registration | 17 | 50 | 850 | 4 |
| | Association Membership Application | 20 | 50 | 1,000 | 6 |
| Arts & Creative | Art Exhibition Submission Form | 11 | 50 | 550 | 6 |
| | Literary Magazine Submission Form | 11 | 50 | 550 | 5 |
| | Conference Speaker Application Form | 14 | 50 | 700 | 6 |
| Technology & Software | Bug Reporting Form | 10 | 50 | 500 | 4 |
| | IT Support Request Form | 11 | 50 | 550 | 5 |
| Finance & Banking | Personal Loan Application Form | 7 | 50 | 350 | 3 |
| | Bank Account Opening Form | 5 | 50 | 250 | 3 |
| | Financial Planning Consultation Form | 6 | 50 | 300 | 4 |
| Healthcare & Medical | Patient Consent for Surgery | 8 | 50 | 400 | 3 |
| | Medical Research Study Enrollment | 8 | 50 | 400 | 4 |
| | Health Insurance Claim Form | 10 | 50 | 400 | 5 |
| | | 9 | 50 | 450 | 6 |
| | | 11 | 50 | 550 | 4 |
| | | 14 | 50 | 700 | 6 |
| Construction & Manufacturing | Project Bid Submission Form | 13 | 50 | 650 | 5 |
| | Manufacturing Order Form | 13 | 50 | 650 | 5 |
| Overall | | 279 | 1,250 | 13,800 | 9 |

Table 1: Form and field statistics across domains. "Fields" is the number of fields per form, "Samples" the number of generated instances, and "Total Fields" the resulting number of field-value pairs. The last column counts the distinct field types each form uses, out of nine: Text, Long Text, Number, Date, Selection, Time, File, URL, and Required.

Lightweight MLLM-driven Framework

To enable automated execution and evaluation of form-filling tasks, we develop a lightweight MLLM-driven framework, shown in Figure 2. The system has three main components: a web-based form frontend, a backend scorer, and an agent execution module. Given a form and an input document (e.g., a resume), the MLLM generates a sequence of GUI actions, such as Click(x, y) and Type(text), to fill the form. These actions are executed automatically using tools like PyAutoGUI. After submission, the backend scorer compares the filled entries with ground-truth annotations and produces a detailed evaluation report, including field accuracy, action match rate, and overall task score. This framework supports scalable and fine-grained analysis of model performance in realistic form-filling tasks.
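As a rough sketch of the execution step, the snippet below parses model-emitted actions and replays them with PyAutoGUI; the action string format shown is an assumption about the model output, not a fixed specification.

import re
import pyautogui

# Example action sequence as an MLLM might emit it (format assumed).
ACTIONS = [
    "Click(412, 287)",
    "Type(Jane Doe)",
    "Click(412, 340)",
    "Type(jane.doe@example.com)",
]

CLICK = re.compile(r"Click\((\d+),\s*(\d+)\)")
TYPE = re.compile(r"Type\((.*)\)", re.DOTALL)

def execute(action):
    if m := CLICK.fullmatch(action):
        # Move to the predicted screen coordinates and left-click.
        pyautogui.click(int(m.group(1)), int(m.group(2)))
    elif m := TYPE.fullmatch(action):
        # Type into whichever field currently has keyboard focus.
        pyautogui.write(m.group(1), interval=0.02)
    else:
        raise ValueError(f"Unrecognized action: {action}")

for action in ACTIONS:
    execute(action)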

Figure 2: Overview of the form-filling task system. The platform takes a form and a resume as input, prompting an MLLM to generate GUI actions (Click, Type) for form completion. Execution and scoring modules evaluate task performance.

Model Performance

We evaluate several state-of-the-art MLLMs (e.g., GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet) via their public APIs without task-specific fine-tuning, using an automated browser interface implemented with PyAutoGUI on a Windows platform. This setup enables fair assessment of the models' inherent capabilities in visual grounding, spatial reasoning, and field-value alignment. The evaluation includes both Atomic tasks (single field types) and Episodic tasks (end-to-end form filling), measuring fine-grained interaction and multi-step reasoning performance. We report two metrics: Click (UI interaction accuracy) and Value (field content accuracy), with BLEU used for generative Description fields.
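To make the two metrics concrete, the sketch below shows one plausible way to score them, assuming each field's on-screen bounding box is known; the BLEU call uses NLTK as one possible implementation, and the normalization rule is an assumption.

from nltk.translate.bleu_score import sentence_bleu

def click_correct(click_xy, bbox):
    # A click counts as correct if it lands inside the target field's
    # bounding box, given as (left, top, right, bottom) in pixels.
    x, y = click_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def value_correct(predicted, gold):
    # Exact match after whitespace and case normalization, suitable for
    # discrete fields (strings, dates, dropdown selections).
    norm = lambda s: " ".join(s.split()).lower()
    return norm(predicted) == norm(gold)

def description_score(predicted, gold):
    # Free-form Description fields are scored with BLEU instead.
    return sentence_bleu([gold.split()], predicted.split())

print(click_correct((415, 290), (400, 280, 600, 310)))  # True
print(value_correct("Jane  Doe", "jane doe"))            # True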

| Model | String | Drop-down List | Checkbox | Radio Button | Description | Date | Check | Episodic |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 2.2 / 17.5 | 0.0 / 30.7 | 0.0 / 31.3 | 0.0 / 10.0 | 8.8 / 0.84 | 0.0 / 2.8 | 0.0 / 9.8 | 0.9 / 11.3 |
| Gemini 2.5 Pro | 0.9 / 98.7 | 0.0 / 99.0 | 0.0 / 76.1 | 0.0 / 52.6 | 8.1 / 0.72 | 0.0 / 99.7 | 0.0 / 79.8 | 0.4 / 70.7 |
| Claude 3.7 Sonnet | 0.0 / 95.2 | 0.0 / 66.2 | 0.0 / 72.0 | 0.0 / 97.9 | 0.0 / 0.70 | 0.0 / 99.1 | 0.0 / 55.7 | 0.0 / 58.0 |
| Qwen-VL-Max | 4.6 / 97.1 | 1.7 / 98.9 | 0.0 / 91.8 | 0.0 / 99.0 | 11.1 / 0.74 | 0.0 / 98.6 | 0.0 / 71.4 | 1.1 / 72.7 |
| Grok 3 | 0.0 / 96.2 | 0.0 / 91.2 | 0.0 / 92.4 | 0.0 / 98.1 | 5.9 / 0.71 | 0.0 / 97.9 | 0.0 / 75.5 | 0.0 / 70.7 |
| Doubao-vision-pro-32k | 0.0 / 94.2 | 0.0 / 89.7 | 0.0 / 38.6 | 0.0 / 96.9 | 0.0 / 0.51 | 0.0 / 92.1 | 0.0 / 69.9 | 0.0 / 64.7 |

Table 2: Atomic-level and episodic-level evaluation of MLLMs across different field types. Each cell reports Clk. / Val.; for Description fields, Val. is BLEU. GPT-4o often refuses execution despite strong capabilities, resulting in lower atomic scores. The relatively high Clk. accuracy on Description fields stems from their large input area, which tolerates less precise clicks. Episodic results measure end-to-end form completion accuracy.

Existing MLLMs still struggle to complete form-filling tasks reliably. Improving spatial reasoning and field alignment remains critical for enabling practical, GUI-driven agents in office automation scenarios.

To better understand model performance beyond value prediction, we also analyze click accuracy on forms with varying numbers of fields. Click accuracy declines as the number of fields increases—a trend largely driven by higher visual complexity and denser layouts. Unlike value prediction, which applies across all field types, we report click accuracy only for String and Description fields, since model performance on other types like checkboxes and dropdowns remains consistently close to zero. Even with this filtering, click accuracy across all models remains low, highlighting the difficulty of precise pixel-level interaction and spatial grounding, even for simple text inputs.

We further explore how field count and field type interact, as they often co-occur in real-world scenarios and jointly contribute to task difficulty. To capture this, we conduct a 3D analysis that examines model performance across both variables. Models perform best when both field count and type complexity are low, and worst when both are high—suggesting a strong compounding effect. This pattern supports the design of our benchmark, which intentionally includes forms with both numerous and diverse fields to stress-test spatial reasoning, alignment, and grounding capabilities in current MLLMs.

Figure 3: Value accuracy across varying field counts (smoothed with a window size of 3).
Figure 4: Value accuracy across varying field types.
Figure 5: Click accuracy across varying field counts (smoothed with a window size of 3).
Figure 6: Click accuracy across varying field types.
Figure 7: Click accuracy under joint variation of field count and field types for Claude 3.7 Sonnet.
Figure 8: Click accuracy under joint variation of field count and field types for Qwen-VL-Max.

BibTeX

@misc{li2025formfactory,
      title={FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents}, 
      author={Bobo Li and Yuheng Wang and Hao Fei and Juncheng Li and Wei Ji and Mong-Li Lee and Wynne Hsu},
      year={2025},
      eprint={2506.01520},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01520}, 
}