Jack of All Trades (JAT) Dataset
The Jack of All Trades (JAT) dataset is a large and varied corpus intended for training multimodal generalist AI models. It combines free text, annotated images, reinforcement learning (RL) demonstrations, and image-caption pairs.
Over 258 million examples, 1.07 TB, Parquet format (text, images, RL demonstrations, captions)
License: Apache 2.0
Description
The Jack of All Trades (JAT) dataset is a diverse, large-scale collection designed for training generalist artificial intelligence models. It combines several data subdomains: free text, annotated images, demonstrations from reinforcement learning agents, and image-caption pairs. This richness and variety make the corpus a robust basis for multimodal AI research.
What is this dataset for?
- Train AI agents that can understand and produce text and image content
- Leverage RL demonstrations to learn complex behaviors
- Test and develop multi-task, multi-input AI architectures (a minimal loading sketch follows this list)
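All of these use cases start from loading the corpus. Below is a minimal sketch using the Hugging Face `datasets` library; the repository id and subset name are assumptions and should be checked against the dataset's page on the Hub.

```python
# Minimal loading sketch with the Hugging Face `datasets` library.
# The repository id and subset name below are assumptions; check the
# dataset's Hub page for the exact identifiers.
from datasets import load_dataset

# Load one hypothetical image-caption subset of the corpus.
dataset = load_dataset("jat-project/jat-dataset", "conceptual-captions")

# Inspect the first training example to see which fields it exposes.
example = dataset["train"][0]
print(example.keys())
```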
Can it be enriched or improved?
Yes, the dataset can be enriched with other types of content (audio, video) or refined by selecting thematic subsets. Additional annotations can also be added to support supervised or semi-supervised training, and adaptation to specific languages or contexts is possible as well.
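As one illustration of such refinement, the sketch below filters a text subset and attaches a simple derived annotation with the `datasets` API; the repository id, subset name, and column name are assumptions.

```python
# Sketch of refining a loaded split: keep a thematic slice and attach an
# extra annotation field. The repository id, subset name, and "text" column
# are assumptions; adapt them to the subset you actually load.
from datasets import load_dataset

dataset = load_dataset("jat-project/jat-dataset", "wikipedia")
train = dataset["train"]

# Keep only reasonably short documents (hypothetical "text" column).
short_docs = train.filter(lambda ex: len(ex["text"]) < 2000)

# Add a simple derived annotation, e.g. a character count, for later selection.
annotated = short_docs.map(lambda ex: {"n_chars": len(ex["text"])})
print(annotated)
```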
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- General agent developers
- Reinforcement learning (RL) labs
🔧 Compatible tools
- PyTorch
- Hugging Face Transformers
- RLlib
- TensorFlow
- LangChain
💡 Tip
For best results, start by fine-tuning on specific subsets before scaling to the full corpus.
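A minimal sketch of this tip, assuming the `datasets` library and hypothetical repository and subset names:

```python
# Sketch of the tip above: validate a training loop on one small subset
# before scaling to the full corpus. The repository id, subset name, and
# slice size are assumptions, not recommendations from the dataset authors.
from datasets import load_dataset

subset = load_dataset("jat-project/jat-dataset", "metaworld-assembly")

# Work on a small slice first, then grow once the pipeline is validated.
small_train = subset["train"].select(range(1_000))
print(small_train)
```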
Frequently Asked Questions
Is this dataset suitable for training a multitasking model?
Yes, it was designed for exactly that: the variety of formats and domains supports multi-task and multimodal training.
Is it possible to use only part of the dataset?
Yes, each sub-dataset is accessible independently, which allows targeted selection according to training needs.
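For example, the available sub-datasets can be listed and loaded individually with the `datasets` library; the repository id below is an assumption.

```python
# Sketch of discovering the independently accessible sub-datasets.
# The repository id is an assumption; `get_dataset_config_names` is part
# of the Hugging Face `datasets` library.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("jat-project/jat-dataset")
print(configs)

# Each configuration can then be loaded on its own, e.g. the first one:
first = load_dataset("jat-project/jat-dataset", configs[0])
```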
What infrastructure is recommended to use this dataset?
A machine with one or more GPUs and ample storage is recommended to process the roughly 1 TB of data efficiently.
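If local storage is limited, streaming mode in the `datasets` library avoids downloading the full corpus; the identifiers below are assumptions.

```python
# Sketch of streaming mode, which iterates over the Parquet shards without
# downloading the full ~1 TB corpus. The repository and subset ids are
# assumptions.
from datasets import load_dataset

stream = load_dataset(
    "jat-project/jat-dataset",  # assumed repository id
    "atari-pong",               # assumed subset name
    streaming=True,
)

# Pull a single example lazily from the training stream.
first_example = next(iter(stream["train"]))
print(type(first_example))
```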