WebClick — Multimodal benchmark for web browsing
WebClick is a multimodal benchmark dataset designed to assess the ability of models and agents to understand and navigate web interfaces. It contains screenshots annotated with natural language instructions and the corresponding target click regions.
Contents: 1,639 PNG/JPEG images, text instructions, bounding box coordinates in JSON
License: Apache 2.0
Description
The WebClick dataset contains 1,639 annotated website screenshots, each paired with a natural language instruction and a precise bounding box. The data comes from real tasks performed by human users and agents, covering web browsing, online shopping, and calendar management.
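A minimal loading sketch is shown below. It assumes the dataset is published on the Hugging Face Hub; the repository ID, split name, and field names are illustrative assumptions, not confirmed by this card.

```python
# Hypothetical loading example with the Hugging Face `datasets` library.
# Repo ID, split, and field names are assumptions; check the hub page for the real ones.
from datasets import load_dataset

ds = load_dataset("Hcompany/WebClick", split="test")  # hypothetical repo ID and split

sample = ds[0]
print(sample.keys())          # e.g. image, instruction, bounding-box fields (assumed)
print(sample["instruction"])  # natural language instruction (assumed field name)
```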
What is this dataset for?
- Evaluate multimodal models' understanding of user interfaces
- Test the ability to localize the correct click target from a natural language instruction (see the evaluation sketch after this list)
- Develop and benchmark intelligent agents for automated web browsing
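The sketch below illustrates one common way to score click grounding: a prediction counts as correct when the predicted (x, y) point falls inside the annotated bounding box. The box format (pixel x_min, y_min, x_max, y_max) and the helper names are assumptions, not the benchmark's official scoring code.

```python
# Point-in-box accuracy sketch for click grounding (assumed box format: x_min, y_min, x_max, y_max).
from typing import Iterable, Tuple

def click_in_box(point: Tuple[float, float], box: Tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click point lies inside the annotated box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def click_accuracy(predictions: Iterable[Tuple[float, float]],
                   boxes: Iterable[Tuple[float, float, float, float]]) -> float:
    """Fraction of predicted clicks that land inside their target boxes."""
    pairs = list(zip(predictions, boxes))
    hits = sum(click_in_box(p, b) for p, b in pairs)
    return hits / len(pairs) if pairs else 0.0

# Example: one hit, one miss -> 0.5
print(click_accuracy([(120, 45), (300, 200)], [(100, 30, 150, 60), (10, 10, 50, 50)]))
```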
Can it be enriched or improved?
This dataset can be enriched with additional annotations, such as complex interactive elements or multi-page contexts. Integrating data from other web environments would also improve model robustness.
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- Web agent developers
- R&D, UX and automated navigation teams
🔧 Compatible tools
- PyTorch
- TensorFlow
- Hugging Face
- Visual annotation tools
💡 Tip
Use advanced spatial grounding techniques to maximize click-localization accuracy; a small coordinate-conversion helper is sketched below.
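Many grounding models predict clicks in normalized [0, 1] coordinates. The generic helper below (an assumption, not part of the WebClick distribution) maps such predictions back to the screenshot's pixel space before they are compared against the annotated bounding box.

```python
# Convert a normalized (0-1) click prediction to pixel coordinates, clamped to the image bounds.
def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    x = min(max(round(x_norm * width), 0), width - 1)
    y = min(max(round(y_norm * height), 0), height - 1)
    return x, y

print(to_pixels(0.42, 0.87, 1920, 1080))  # -> (806, 940)
```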
Frequently Asked Questions
What data is provided in WebClick?
Website screenshots, natural language instructions, and precise coordinates of bounding boxes.
Is this dataset suitable for creating intelligent agents for web browsing?
Yes, it can be used to train and evaluate agents that understand instructions and interact with web interfaces.
What are the usage scenarios covered by WebClick?
Agent-assisted browsing, online shopping, calendar management, and other complex web interactions.