FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

1National University of Singapore, 2Wuhan University, 3Zhejiang University, 4Nanjing University

# indicates equal contribution.

Figure 1: Overview of the form-filling task and its challenges. Compared to general GUI tasks, form-filling involves more diverse applications and demands higher semantic understanding, layout flexibility, and interaction complexity.

Abstract

Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click," existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory—an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.

Benchmark Construction

We developed a high-fidelity browser-based interaction platform using Python and Flask to evaluate the performance of form-filling agents. This interactive platform enables users to complete various form-filling tasks directly within a web interface, while the backend system performs real-time evaluation by automatically comparing submitted values against gold-standard field annotations. The platform comprises 20 web forms spanning eight real-world domains, including academia, finance, healthcare, and information technology. It supports a diverse range of input types, such as text fields, dropdown menus, date pickers, checkboxes, file uploads, and numerical inputs.
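As a concrete illustration, below is a minimal sketch of how such a backend comparison might look in Flask; the route name, field schema, and normalization rule are hypothetical stand-ins, not the platform's actual implementation.

from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical gold-standard annotations for one form instance.
GOLD = {
    "full_name": "Jane Doe",
    "email": "jane.doe@example.com",
    "start_date": "2025-06-01",
}

def normalize(value):
    # Simple case- and whitespace-insensitive comparison; real field types
    # (dates, checkboxes, file uploads) would need type-specific rules.
    return " ".join(str(value).split()).lower()

@app.route("/submit", methods=["POST"])
def submit():
    submitted = request.form  # maps field name -> submitted value
    correct = sum(
        normalize(submitted.get(field, "")) == normalize(gold)
        for field, gold in GOLD.items()
    )
    # Return a real-time score for this submission.
    return jsonify({"field_accuracy": correct / len(GOLD),
                    "total_fields": len(GOLD)})

if __name__ == "__main__":
    app.run(debug=True)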

To simulate the complexity of real-world deployment environments, the platform incorporates a wide variety of page layouts, font styles, and color schemes. It also supports multi-page forms and modular field definitions, allowing for flexible configuration of diverse evaluation scenarios. This platform provides a realistic and compositional UI environment for large-scale, reproducible evaluation of both human annotators and multimodal models.
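For intuition, a modular field definition could look like the following sketch; the schema, keys, and values are hypothetical and only illustrate the idea of composing multi-page forms from typed field specifications.

# Hypothetical form specification; the platform's actual schema may differ.
# Pages group fields, and each field declares a type, options, and whether
# it is required, so new evaluation scenarios can be composed from parts.
FORM_SPEC = {
    "form_id": "job_application",
    "pages": [
        {
            "title": "Applicant Info",
            "fields": [
                {"name": "full_name", "type": "text", "required": True},
                {"name": "start_date", "type": "date", "required": True},
                {"name": "position", "type": "dropdown",
                 "options": ["Research Engineer", "Data Analyst"]},
            ],
        },
        {
            "title": "Attachments",
            "fields": [
                {"name": "resume_file", "type": "file", "required": False},
            ],
        },
    ],
}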

Demo video of the interaction platform.

Additionally, we constructed a dataset of 1,250 form-filling instances using a generative approach. Each instance consists of an input document or descriptive text paired with structured field-value annotations. The inputs include real or LLM-generated resumes, academic papers, leave requests, registration forms, and other free-form text formats. We first sample gold-standard field values from the form templates, and then prompt a large language model to generate natural language inputs that implicitly contain these values, thereby simulating realistic user behavior. For example, in job application scenarios, we generate detailed Markdown-formatted resumes, while in academic contexts, metadata is extracted from real research papers.
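A minimal sketch of this sample-then-verbalize pipeline is given below; the field template, prompt wording, and the commented-out LLM call are assumptions for illustration rather than the exact prompts used.

import random

# Hypothetical value pools for a job-application form template.
TEMPLATE = {
    "full_name": ["Jane Doe", "Wei Zhang", "Amir Khan"],
    "position": ["Research Engineer", "Data Analyst"],
    "years_experience": [str(n) for n in range(1, 15)],
}

def sample_gold():
    # Step 1: sample gold-standard field values from the form template.
    return {field: random.choice(pool) for field, pool in TEMPLATE.items()}

def build_prompt(gold):
    # Step 2: ask an LLM for a document that conveys the sampled values
    # implicitly, instead of listing them as explicit key-value pairs.
    facts = "; ".join(f"{k}: {v}" for k, v in gold.items())
    return ("Write a realistic Markdown resume that naturally conveys the "
            f"following facts without enumerating them: {facts}")

gold = sample_gold()
prompt = build_prompt(gold)
# resume_text = llm.generate(prompt)  # any LLM client; call is illustrative
# Each dataset instance then pairs resume_text with the gold annotations.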

The final dataset comprises 13,800 field-value pairs, encompassing a wide variety of field types and input modalities. It presents significant challenges for models in terms of language understanding and layout reasoning. Detailed statistics of the dataset are provided in Table 1.

| Category | Form Type | Fields | Samples | Total Fields | # Field Types |
|---|---|---|---|---|---|
| Academic & Research | Job Application for University Positions | 4 | 50 | 200 | 2 |
| | Grant or Research Funding Application | 6 | 50 | 300 | 5 |
| | Paper Submission Form | 7 | 50 | 300 | 3 |
| | Student Course Registration Form | 8 | 50 | 400 | 4 |
| | Scholarship Application for Students | 16 | 50 | 800 | 4 |
| Professional & Business | Startup Funding Application | 18 | 50 | 900 | 6 |
| | Real Estate Rental Application | 22 | 50 | 1,100 | 6 |
| | Educational Workshop Registration | 17 | 50 | 850 | 4 |
| | Association Membership Application | 20 | 50 | 1,000 | 6 |
| Arts & Creative | Art Exhibition Submission Form | 11 | 50 | 550 | 6 |
| | Literary Magazine Submission Form | 11 | 50 | 550 | 5 |
| | Conference Speaker Application Form | 14 | 50 | 700 | 6 |
| Technology & Software | Bug Reporting Form | 10 | 50 | 500 | 4 |
| | IT Support Request Form | 11 | 50 | 550 | 5 |
| Finance & Banking | Personal Loan Application Form | 7 | 50 | 350 | 3 |
| | Bank Account Opening Form | 5 | 50 | 250 | 3 |
| | Financial Planning Consultation Form | 6 | 50 | 300 | 4 |
| Healthcare & Medical | Patient Consent for Surgery | 8 | 50 | 400 | 3 |
| | Medical Research Study Enrollment | 8 | 50 | 400 | 4 |
| | Health Insurance Claim Form | 10 | 50 | 400 | 5 |
| | | 9 | 50 | 450 | 6 |
| | | 11 | 50 | 550 | 4 |
| | | 14 | 50 | 700 | 6 |
| Construction & Manufacturing | Project Bid Submission Form | 13 | 50 | 650 | 5 |
| | Manufacturing Order Form | 13 | 50 | 650 | 5 |
| Overall | | 279 | 1,250 | 13,800 | 9 |

Table 1: Form and field statistics across domains. "Fields" is the number of fields per form, "Samples" the number of generated instances, and "Total Fields" the resulting number of field-value pairs. The last column counts the distinct field types each form uses, out of nine: Text, Long Text, Number, Date, Selection, Time, File, URL, and Required.

Lightweight MLLM-driven Framework

To enable automated execution and evaluation of form-filling tasks, we develop a lightweight MLLM-driven framework, shown in Figure 2. The system has three main components: a web-based form frontend, a backend scorer, and an agent execution module. Given a form and an input document (e.g., a resume), the MLLM generates a sequence of GUI actions, such as Click(x, y) and Type(text), to fill the form. These actions are executed automatically using tools like PyAutoGUI. After submission, the backend scorer compares the filled entries with ground-truth annotations and produces a detailed evaluation report, including field accuracy, action match rate, and overall task score. This framework supports scalable and fine-grained analysis of model performance in realistic form-filling tasks.
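As a rough sketch of the execution step, the snippet below parses model-emitted actions and replays them with PyAutoGUI; the action string format shown is an assumption about the model output, not a fixed specification.

import re
import pyautogui

# Example action sequence as an MLLM might emit it (format assumed).
ACTIONS = [
    "Click(412, 287)",
    "Type(Jane Doe)",
    "Click(412, 340)",
    "Type(jane.doe@example.com)",
]

CLICK = re.compile(r"Click\((\d+),\s*(\d+)\)")
TYPE = re.compile(r"Type\((.*)\)", re.DOTALL)

def execute(action):
    if m := CLICK.fullmatch(action):
        # Move to the predicted screen coordinates and left-click.
        pyautogui.click(int(m.group(1)), int(m.group(2)))
    elif m := TYPE.fullmatch(action):
        # Type into whichever field currently has keyboard focus.
        pyautogui.write(m.group(1), interval=0.02)
    else:
        raise ValueError(f"Unrecognized action: {action}")

for action in ACTIONS:
    execute(action)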

Figure 2: Overview of the form-filling task system. The platform takes a form and a resume as input, prompting an MLLM to generate GUI actions (Click, Type) for form completion. Execution and scoring modules evaluate task performance.

Model Performance

We evaluate several state-of-the-art MLLMs (e.g., GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet) via their public APIs without task-specific fine-tuning, using an automated browser interface implemented with PyAutoGUI on a Windows platform. This setup enables fair assessment of the models' inherent capabilities in visual grounding, spatial reasoning, and field-value alignment. The evaluation includes both Atomic tasks (single field types) and Episodic tasks (end-to-end form filling), measuring fine-grained interaction and multi-step reasoning performance. We report two metrics: Click (UI interaction accuracy) and Value (field content accuracy), with BLEU used for generative Description fields.
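To make the two metrics concrete, the sketch below shows one plausible way to score them, assuming each field's on-screen bounding box is known; the BLEU call uses NLTK as one possible implementation, and the normalization rule is an assumption.

from nltk.translate.bleu_score import sentence_bleu

def click_correct(click_xy, bbox):
    # A click counts as correct if it lands inside the target field's
    # bounding box, given as (left, top, right, bottom) in pixels.
    x, y = click_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def value_correct(predicted, gold):
    # Exact match after whitespace and case normalization, suitable for
    # discrete fields (strings, dates, dropdown selections).
    norm = lambda s: " ".join(s.split()).lower()
    return norm(predicted) == norm(gold)

def description_score(predicted, gold):
    # Free-form Description fields are scored with BLEU instead.
    return sentence_bleu([gold.split()], predicted.split())

print(click_correct((415, 290), (400, 280, 600, 310)))  # True
print(value_correct("Jane  Doe", "jane doe"))            # True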

| Model | String | Drop-down List | Checkbox | Radio Button | Description | Date | Check | Episodic |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | 2.2 / 17.5 | 0.0 / 30.7 | 0.0 / 31.3 | 0.0 / 10.0 | 8.8 / 0.84 | 0.0 / 2.8 | 0.0 / 9.8 | 0.9 / 11.3 |
| Gemini 2.5 Pro | 0.9 / 98.7 | 0.0 / 99.0 | 0.0 / 76.1 | 0.0 / 52.6 | 8.1 / 0.72 | 0.0 / 99.7 | 0.0 / 79.8 | 0.4 / 70.7 |
| Claude 3.7 Sonnet | 0.0 / 95.2 | 0.0 / 66.2 | 0.0 / 72.0 | 0.0 / 97.9 | 0.0 / 0.70 | 0.0 / 99.1 | 0.0 / 55.7 | 0.0 / 58.0 |
| Qwen-VL-Max | 4.6 / 97.1 | 1.7 / 98.9 | 0.0 / 91.8 | 0.0 / 99.0 | 11.1 / 0.74 | 0.0 / 98.6 | 0.0 / 71.4 | 1.1 / 72.7 |
| Grok 3 | 0.0 / 96.2 | 0.0 / 91.2 | 0.0 / 92.4 | 0.0 / 98.1 | 5.9 / 0.71 | 0.0 / 97.9 | 0.0 / 75.5 | 0.0 / 70.7 |
| Doubao-vision-pro-32k | 0.0 / 94.2 | 0.0 / 89.7 | 0.0 / 38.6 | 0.0 / 96.9 | 0.0 / 0.51 | 0.0 / 92.1 | 0.0 / 69.9 | 0.0 / 64.7 |

Table 2: Atomic-level and episodic-level evaluation of MLLMs across different field types. Each cell reports Clk. / Val.; for Description fields, Val. is BLEU. GPT-4o often refuses execution despite strong capabilities, resulting in lower atomic scores. The relatively high Clk. accuracy on Description fields stems from their large input area, which tolerates less precise clicks. Episodic results measure end-to-end form completion accuracy.

Existing MLLMs still struggle to complete form-filling tasks reliably. Improving spatial reasoning and field alignment remains critical for enabling practical, GUI-driven agents in office automation scenarios.

To better understand model performance beyond value prediction, we also analyze click accuracy on forms with varying numbers of fields. Click accuracy declines as the number of fields increases—a trend largely driven by higher visual complexity and denser layouts. Unlike value prediction, which applies across all field types, we report click accuracy only for String and Description fields, since model performance on other types like checkboxes and dropdowns remains consistently close to zero. Even with this filtering, click accuracy across all models remains low, highlighting the difficulty of precise pixel-level interaction and spatial grounding, even for simple text inputs.

We further explore how field count and field type interact, as they often co-occur in real-world scenarios and jointly contribute to task difficulty. To capture this, we conduct a 3D analysis that examines model performance across both variables. Models perform best when both field count and type complexity are low, and worst when both are high—suggesting a strong compounding effect. This pattern supports the design of our benchmark, which intentionally includes forms with both numerous and diverse fields to stress-test spatial reasoning, alignment, and grounding capabilities in current MLLMs.

Figure 3: Value accuracy across varying field counts (smoothed with a window size of 3).
Figure 4: Value accuracy across varying field types.
Figure 5: Click accuracy across varying field counts (smoothed with a window size of 3).
Figure 6: Click accuracy across varying field types.
Figure 7: Click accuracy under joint variation of field count and field types for Claude 3.7 Sonnet.
Figure 8: Click accuracy under joint variation of field count and field types for Qwen-VL-Max.

BibTeX

@misc{li2025formfactory,
      title={FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents}, 
      author={Bobo Li and Yuheng Wang and Hao Fei and Juncheng Li and Wei Ji and Mong-Li Lee and Wynne Hsu},
      year={2025},
      eprint={2506.01520},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01520}, 
}