Webcode2m

Webcode2m is a multimodal dataset combining screenshots of web page designs with their HTML/CSS code and associated layout information. It aims to improve the automatic generation of web code.

Size

3,171,024 instances (PNG images with HTML/CSS code as text), stored as Parquet files (~1.1 TB)
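
As a rough planning aid, the figures above imply an average of roughly a third of a megabyte per instance. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a uniform instance size, neither of which the source states; the function names are illustrative only.

```python
# Back-of-the-envelope sizing from the figures listed above.
# Assumptions (not from the dataset authors): 1 TB = 10**12 bytes, and every
# instance takes roughly the same amount of storage.

TOTAL_INSTANCES = 3_171_024
TOTAL_BYTES = 1.1e12  # "~1.1 TB" is approximate


def average_instance_bytes() -> float:
    """Mean storage footprint of one instance, in bytes."""
    return TOTAL_BYTES / TOTAL_INSTANCES


def instances_for_budget(budget_gb: float) -> int:
    """Approximate number of instances that fit in `budget_gb` gigabytes."""
    return int(budget_gb * 1e9 / average_instance_bytes())


if __name__ == "__main__":
    print(f"average instance size: {average_instance_bytes() / 1e6:.2f} MB")
    print(f"instances in a 50 GB subset: {instances_for_budget(50):,}")
```

An estimate like this can help decide how large a subset to download before committing to the full dataset.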

Licence

CC BY 4.0

Description

Webcode2m is a large dataset of more than 3 million real-world examples, each pairing a web design image with its corresponding HTML/CSS code and layout data (bounding boxes and element hierarchy). It makes it possible to train multimodal models that generate front-end code from a design image.
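
To make the three modalities concrete, here is a minimal sketch of how one instance might be represented in Python. The class and field names (`LayoutNode`, `WebInstance`, `bbox`, `children`) and the bounding-box convention are hypothetical, chosen for illustration; the dataset's actual Parquet schema may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class LayoutNode:
    """One element of the page layout: a tag, its bounding box, its children."""
    tag: str
    bbox: tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    children: list[LayoutNode] = field(default_factory=list)


@dataclass
class WebInstance:
    """Hypothetical view of one instance: screenshot, code, and layout tree."""
    image_png: bytes
    html: str
    layout: LayoutNode


def walk(node: LayoutNode, depth: int = 0) -> Iterator[tuple[int, LayoutNode]]:
    """Yield (depth, node) pairs in pre-order, i.e. document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# A toy instance; the bytes and markup are placeholders, not real data.
example = WebInstance(
    image_png=b"\x89PNG...",
    html="<body><h1>Hi</h1><p>Text</p></body>",
    layout=LayoutNode("body", (0, 0, 800, 600), [
        LayoutNode("h1", (0, 0, 800, 80)),
        LayoutNode("p", (0, 80, 800, 40)),
    ]),
)

if __name__ == "__main__":
    for depth, node in walk(example.layout):
        print("  " * depth + f"{node.tag} {node.bbox}")
```

Keeping the layout as a tree rather than a flat list of boxes preserves the hierarchy, which is the part of the annotation that mirrors the nesting of the HTML itself.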

What is this dataset for?

Training multimodal models for automatic web code generation
Developing AI-assisted front-end design tools
Testing the robustness of multimodal large language models (MLLMs) in the visual and textual understanding of interfaces

Can it be enriched or improved?

Yes. The dataset could be enriched through more thorough filtering of sensitive content, the addition of language variants, or documentation of the CSS styles it contains to better guide training.
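
As one illustration of the filtering idea above, the sketch below extracts the visible text from an HTML string with Python's standard `html.parser` module and flags pages containing any term from a blocklist. The blocklist contents and the function names are placeholders; a real pipeline would likely need a curated term list or a trained classifier.

```python
import re
from html.parser import HTMLParser
from typing import Iterable


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's text content with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_flagged(html: str, blocklist: Iterable[str]) -> bool:
    """True if any lowercase blocklist term appears as a word in the text."""
    words = set(re.findall(r"\w+", visible_text(html).lower()))
    return not words.isdisjoint(blocklist)


if __name__ == "__main__":
    page = "<body><h1>Welcome</h1><p>Play at our Casino</p></body>"
    print(is_flagged(page, {"casino"}))  # placeholder term, for illustration
```

Matching on visible text rather than raw markup avoids false positives from class names or script contents, at the cost of missing terms that appear only in attributes such as `alt` text.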

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐✩✩✩ (Large and requires significant computing resources)
🧼 Need for cleaning	⭐⭐⭐✩✩ (Moderate – filtering of potentially inappropriate content needed)
🏷️ Annotation richness	⭐⭐⭐⭐⭐ (Excellent – image, code, layout, and linguistic metadata)
📜 Commercial license	✅ Yes (CC BY 4.0)
👨‍💻 Beginner friendly	⚠️ No – recommended for advanced users
🔁 Fine-tuning ready	✅ Well suited to training MLLMs
🌍 Cultural diversity	🈳 Good – covers 20 major web languages