WebClick — Multimodal benchmark for web browsing
WebClick is a multimodal benchmark dataset designed to assess the ability of models and agents to understand and navigate web interfaces. It contains screenshots annotated with natural language instructions and the corresponding target click regions.
Contents: 1,639 PNG/JPEG images, text instructions, bounding box coordinates in JSON
License: Apache 2.0
Description
The WebClick dataset contains 1,639 annotated website screenshots, each paired with a natural language instruction and a precise bounding box. The data comes from real tasks performed by human users and agents, covering web browsing, online shopping, and calendar management.
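A minimal loading sketch is shown below. It assumes the dataset is published on the Hugging Face Hub; the repository ID, split name, and field names are illustrative assumptions, not confirmed by this card.

```python
# Hypothetical loading example with the Hugging Face `datasets` library.
# Repo ID, split, and field names are assumptions; check the hub page for the real ones.
from datasets import load_dataset

ds = load_dataset("Hcompany/WebClick", split="test")  # hypothetical repo ID and split

sample = ds[0]
print(sample.keys())          # e.g. image, instruction, bounding-box fields (assumed)
print(sample["instruction"])  # natural language instruction (assumed field name)
```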
What is this dataset for?
- Evaluate multimodal models' understanding of user interfaces
- Test the ability to localize the correct click target from a natural language instruction (see the evaluation sketch after this list)
- Develop and benchmark intelligent agents for automated web browsing
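The sketch below illustrates one common way to score click grounding: a prediction counts as correct when the predicted (x, y) point falls inside the annotated bounding box. The box format (pixel x_min, y_min, x_max, y_max) and the helper names are assumptions, not the benchmark's official scoring code.

```python
# Point-in-box accuracy sketch for click grounding (assumed box format: x_min, y_min, x_max, y_max).
from typing import Iterable, Tuple

def click_in_box(point: Tuple[float, float], box: Tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click point lies inside the annotated box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def click_accuracy(predictions: Iterable[Tuple[float, float]],
                   boxes: Iterable[Tuple[float, float, float, float]]) -> float:
    """Fraction of predicted clicks that land inside their target boxes."""
    pairs = list(zip(predictions, boxes))
    hits = sum(click_in_box(p, b) for p, b in pairs)
    return hits / len(pairs) if pairs else 0.0

# Example: one hit, one miss -> 0.5
print(click_accuracy([(120, 45), (300, 200)], [(100, 30, 150, 60), (10, 10, 50, 50)]))
```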
Can it be enriched or improved?
This dataset can be enriched with additional annotations, such as complex interactive elements or multi-page contexts. Integrating data from other web environments would also improve model robustness.
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- Web agent developers
- R&D, UX and automated navigation teams
🔧 Compatible tools
- PyTorch
- TensorFlow
- Hugging Face
- Visual annotation tools
💡 Tip
Use advanced spatial grounding techniques to maximize click-localization accuracy; a small coordinate-conversion helper is sketched below.
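Many grounding models predict clicks in normalized [0, 1] coordinates. The generic helper below (an assumption, not part of the WebClick distribution) maps such predictions back to the screenshot's pixel space before they are compared against the annotated bounding box.

```python
# Convert a normalized (0-1) click prediction to pixel coordinates, clamped to the image bounds.
def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    x = min(max(round(x_norm * width), 0), width - 1)
    y = min(max(round(y_norm * height), 0), height - 1)
    return x, y

print(to_pixels(0.42, 0.87, 1920, 1080))  # -> (806, 940)
```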
Frequently Asked Questions
What data is provided in WebClick?
Website screenshots, natural language instructions, and precise coordinates of bounding boxes.
Is this dataset suitable for creating intelligent agents for web browsing?
Yes, it can be used to train and evaluate agents that understand instructions and interact with web interfaces.
What are the usage scenarios covered by WebClick?
Agent-assisted browsing, online shopping, calendar management, and other complex web interactions.