Webcode2m

Webcode2m is a multimodal dataset combining screenshots of web page designs with their HTML/CSS code and associated layout information. It aims to improve the automatic generation of web code.

Size

3,171,024 instances (PNG images plus HTML/CSS code as text), stored as Parquet files (~1.1 TB)

Licence

CC BY 4.0

Description

Webcode2m is a large dataset containing over 3 million real-world examples that pair web design images with their corresponding HTML/CSS code and layout data (bounding boxes, hierarchy). It makes it possible to train multimodal models that generate front-end code from a design image.

What is this dataset for?

  • Training multimodal models for the automatic generation of web code
  • Developing AI-assisted front-end design tools
  • Testing the robustness of MLLMs in the visual and textual understanding of interfaces
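As a rough illustration of the first use case, the sketch below streams records with Hugging Face Datasets and turns each one into an image-to-code training pair. The column names `image` and `text`, the prompt wording, and the helper names are assumptions made for this example, not something this page confirms. The repository id is deliberately left as a parameter and should be taken from the dataset's download page.

```python
from typing import Any, Dict, Iterator

# Assumed column names; check them against the actual dataset schema.
IMAGE_COL = "image"  # screenshot of the page design (assumption)
CODE_COL = "text"    # HTML/CSS source as a string (assumption)

# Hypothetical instruction text, chosen only for illustration.
PROMPT = "Generate the HTML/CSS code for this web page design."


def stream_records(repo_id: str, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Iterate over records lazily instead of downloading ~1.1 TB up front."""
    # Imported here so the helper below stays usable without the library.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


def to_training_pair(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw record to an (image, prompt) input and a code target."""
    return {
        "image": record[IMAGE_COL],
        "prompt": PROMPT,
        "target": record[CODE_COL],
    }
```

Streaming keeps local storage needs small, at the cost of sequential access; a full download may still be preferable for repeated training epochs.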

Can it be enriched or improved?

Yes. The dataset could be enriched through more thorough filtering of sensitive content, the addition of further language variants, or documentation of the CSS styles it contains to better guide training.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐✩✩✩ (Large; requires significant computing resources)
🧼 Need for cleaning | ⭐⭐⭐✩✩ (Moderate: filtering of potentially inappropriate content is needed)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (Excellent: image, code, layout, and language metadata)
📜 Commercial license | ✅ Yes (CC BY 4.0)
👨‍💻 Beginner friendly | ⚠️ No, recommended for advanced users
🔁 Fine-tuning ready | ✅ Well suited to training MLLMs
🌍 Cultural diversity | 🈳 Good, covers 20 main web languages

🧠 Recommended for

  • Front-end AI researchers
  • MLLM developers
  • User interface generation projects

🔧 Compatible tools

  • PyTorch
  • TensorFlow
  • Hugging Face Datasets
  • Vision Transformer
  • Diffusers

💡 Tip

Use the purified version to avoid inappropriate content during training.

Frequently Asked Questions

Does this dataset contain sensitive or inappropriate data?

Yes, despite filtering, a small amount of inappropriate content may remain. A purified version is available.
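Where the purified version is not an option, an extra filtering pass can be layered on top. The following is a minimal sketch of a keyword blocklist applied to each record's code, assuming the HTML/CSS is stored in a `text` field. The field name, the helper name, and the blocklist terms are all hypothetical, and this is not how the purified version was produced.

```python
import re
from typing import Any, Callable, Iterable, Mapping


def make_blocklist_filter(
    terms: Iterable[str], code_field: str = "text"
) -> Callable[[Mapping[str, Any]], bool]:
    """Build a predicate that is True for records free of blocklisted words.

    `code_field` is an assumed column name holding the HTML/CSS source.
    """
    terms = [t for t in terms if t]
    if not terms:
        # Nothing to block: every record passes.
        return lambda record: True

    # Whole-word, case-insensitive match on any blocklisted term.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def is_clean(record: Mapping[str, Any]) -> bool:
        return pattern.search(record.get(code_field, "")) is None

    return is_clean


# Usage with a hypothetical blocklist:
is_clean = make_blocklist_filter(["casino", "betting"])
kept = [r for r in [{"text": "<p>Hello</p>"}, {"text": "<a>Best CASINO</a>"}] if is_clean(r)]
```

A keyword match is a coarse heuristic: it misses content in images and can reject harmless pages, so it complements rather than replaces a curated subset.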

What are the languages covered by this dataset?

It covers 20 main languages, including French, English, Chinese, Arabic, Spanish, and Japanese.

What is the total size of the dataset?

Approximately 1.1 TB in total, including images, code, and metadata in Parquet format.
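For storage planning, the figures above give a rough average size per instance. The short calculation below is a back-of-the-envelope sketch that assumes decimal terabytes (1 TB = 10^12 bytes) and a uniform instance size, neither of which is stated on this page; the function names are illustrative.

```python
TOTAL_BYTES = 1.1e12       # ~1.1 TB, assuming decimal units
NUM_INSTANCES = 3_171_024  # instance count listed for the dataset


def average_instance_bytes(
    total_bytes: float = TOTAL_BYTES, num_instances: int = NUM_INSTANCES
) -> float:
    """Average on-disk size of one instance, in bytes."""
    return total_bytes / num_instances


def instances_within_budget(budget_bytes: float) -> int:
    """Approximate number of instances that fit in a given disk budget."""
    return int(budget_bytes // average_instance_bytes())


if __name__ == "__main__":
    avg = average_instance_bytes()
    print(f"average instance size: about {avg / 1e3:.0f} kB")
    print(f"instances in a 50 GB budget: about {instances_within_budget(50e9):,}")
```

The average works out to a few hundred kilobytes per instance, so even a modest subset occupies tens of gigabytes.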
