Webcode2m
Webcode2m is a multimodal dataset combining screenshots of web page designs with their HTML/CSS code and associated layout information. It aims to improve the automatic generation of web code.
3,171,024 instances; PNG images; HTML/CSS code as text; Parquet files (~1.1 TB)
CC BY 4.0
Description
Webcode2m is a large dataset of over 3 million real-world examples that combine web design screenshots, their corresponding HTML/CSS code, and layout data (bounding boxes, element hierarchy). It makes it possible to train multimodal models capable of generating front-end code from a design image.
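A minimal sketch of browsing a few samples with Hugging Face Datasets (listed among the compatible tools below). The repository ID and the field layout are assumptions, not confirmed by this card; check the dataset's hub page for the exact schema.

```python
# Minimal sketch: streaming a few samples with Hugging Face Datasets.
# The repository ID below is hypothetical -- replace it with the real one.
from datasets import load_dataset

ds = load_dataset("xcodemind/webcode2m", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect the actual fields (image, code, layout, ...)
```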
What is this dataset for?
- Training multimodal models for automatic web code generation (see the data-loading sketch after this list)
- Developing AI-assisted front-end design tools
- Testing the robustness of multimodal LLMs (MLLMs) at visual and textual understanding of user interfaces
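As a minimal sketch of the training use case, the snippet below wraps screenshot/code pairs in a PyTorch Dataset. The field names ("image", "text") are assumptions about the schema, not part of this card.

```python
# Minimal sketch: exposing (screenshot, HTML/CSS) pairs to a PyTorch training loop.
# Field names are assumptions; adapt them to the real schema of the dataset.
from torch.utils.data import Dataset


class Webcode2mPairs(Dataset):
    """Yields (screenshot, HTML/CSS code) pairs from a loaded split."""

    def __init__(self, hf_split, transform=None):
        self.split = hf_split          # e.g. a Hugging Face Datasets split
        self.transform = transform     # e.g. torchvision transforms for the PNG

    def __len__(self):
        return len(self.split)

    def __getitem__(self, idx):
        row = self.split[idx]
        image = row["image"]           # PIL image decoded from the PNG
        code = row["text"]             # HTML/CSS source as a string
        if self.transform is not None:
            image = self.transform(image)
        return image, code
```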
Can it be enriched or improved?
Yes. The dataset could be enriched with more thorough filtering of sensitive content, the addition of linguistic variants, or documentation of the various CSS styles present, in order to better guide training.
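As one illustration of the extra-filtering idea above, the sketch below applies a simple keyword filter over the code field. The repository ID, the field name "text", and the keyword list are all assumptions made for the example.

```python
# Minimal sketch: an additional keyword-based content filter before training.
# Repository ID, field name, and keywords are illustrative assumptions only.
from datasets import load_dataset

BLOCKLIST = ("casino", "adult", "gambling")   # illustrative keywords only


def is_clean(row):
    return not any(word in row["text"].lower() for word in BLOCKLIST)


ds = load_dataset("xcodemind/webcode2m", split="train[:1000]")  # small slice for the demo
ds_clean = ds.filter(is_clean)                                  # keep only rows that pass
print(len(ds), "->", len(ds_clean))
```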
🔎 In summary
🧠 Recommended for
- Front-end AI researchers
- MLLM developers
- User interface generation projects
🔧 Compatible tools
- PyTorch
- TensorFlow
- Hugging Face Datasets
- Vision Transformer
- Diffusers
💡 Tip
Prefer the purified (filtered) version to avoid exposing models to inappropriate content during training.
Frequently Asked Questions
Does this dataset contain sensitive or inappropriate data?
Yes, despite filtering, a small amount of inappropriate content may remain. A purified version is available.
What are the languages covered by this dataset?
It covers 20 main languages, including French, English, Chinese, Arabic, Spanish, Japanese, and more.
What is the total size of the dataset?
Approximately 1.1 TB in total, including images, code, and metadata stored in Parquet format.
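Because the full dataset is around 1.1 TB, it is often more practical to inspect a single Parquet shard before downloading everything. A minimal sketch with pyarrow follows; the file name is hypothetical.

```python
# Minimal sketch: inspecting one Parquet shard without loading the full ~1.1 TB.
# The file name is hypothetical; point it at any shard you have downloaded.
import pyarrow.parquet as pq

table = pq.read_table("webcode2m-train-00000.parquet")
print(table.schema)    # list the columns (image bytes, code, layout metadata, ...)
print(table.num_rows)  # number of instances in this shard
```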