Jack of All Trades (JAT) Dataset
The Jack of All Trades (JAT) dataset is a large and varied corpus intended for training multimodal generalist AI models. It combines free text, annotated images, reinforcement learning (RL) demonstrations, and image-caption pairs.
Over 258 million examples, 1.07 TB, Parquet format (text, images, RL demonstrations, captions)
License: Apache 2.0
Description
The Jack of All Trades (JAT) dataset is a diverse, large-scale collection designed for training generalist artificial intelligence models. It combines several data subdomains: free text, annotated images, demonstrations from reinforcement learning agents, and image-caption pairs. This richness and variety make the corpus a robust basis for multimodal AI research.
What is this dataset for?
- Train AI agents that can understand and produce text and image content
- Leverage RL demonstrations to learn complex behaviors
- Test and develop multi-task, multi-input AI architectures (a minimal loading sketch follows this list)
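All of these use cases start from loading the corpus. Below is a minimal sketch using the Hugging Face `datasets` library; the repository id and subset name are assumptions and should be checked against the dataset's page on the Hub.

```python
# Minimal loading sketch with the Hugging Face `datasets` library.
# The repository id and subset name below are assumptions; check the
# dataset's Hub page for the exact identifiers.
from datasets import load_dataset

# Load one hypothetical image-caption subset of the corpus.
dataset = load_dataset("jat-project/jat-dataset", "conceptual-captions")

# Inspect the first training example to see which fields it exposes.
example = dataset["train"][0]
print(example.keys())
```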
Can it be enriched or improved?
Yes, the dataset can be enriched with other types of content (audio, video) or refined by selecting thematic subsets. Additional annotations can also be added to support supervised or semi-supervised training, and adaptation to specific languages or contexts is possible as well.
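As one illustration of such refinement, the sketch below filters a text subset and attaches a simple derived annotation with the `datasets` API; the repository id, subset name, and column name are assumptions.

```python
# Sketch of refining a loaded split: keep a thematic slice and attach an
# extra annotation field. The repository id, subset name, and "text" column
# are assumptions; adapt them to the subset you actually load.
from datasets import load_dataset

dataset = load_dataset("jat-project/jat-dataset", "wikipedia")
train = dataset["train"]

# Keep only reasonably short documents (hypothetical "text" column).
short_docs = train.filter(lambda ex: len(ex["text"]) < 2000)

# Add a simple derived annotation, e.g. a character count, for later selection.
annotated = short_docs.map(lambda ex: {"n_chars": len(ex["text"])})
print(annotated)
```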
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- General agent developers
- Reinforcement learning (RL) labs
🔧 Compatible tools
- PyTorch
- Hugging Face Transformers
- RLlib
- TensorFlow
- LangChain
💡 Tip
For best results, start by fine-tuning on specific subsets before scaling to the full corpus.
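A minimal sketch of this tip, assuming the `datasets` library and hypothetical repository and subset names:

```python
# Sketch of the tip above: validate a training loop on one small subset
# before scaling to the full corpus. The repository id, subset name, and
# slice size are assumptions, not recommendations from the dataset authors.
from datasets import load_dataset

subset = load_dataset("jat-project/jat-dataset", "metaworld-assembly")

# Work on a small slice first, then grow once the pipeline is validated.
small_train = subset["train"].select(range(1_000))
print(small_train)
```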
Frequently Asked Questions
Is this dataset suitable for training a multitasking model?
Yes, it was designed for exactly that: the variety of formats and domains supports multi-task and multimodal training.
Is it possible to use only part of the dataset?
Yes, each sub-dataset is accessible independently, which allows targeted selection according to training needs.
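For example, the available sub-datasets can be listed and loaded individually with the `datasets` library; the repository id below is an assumption.

```python
# Sketch of discovering the independently accessible sub-datasets.
# The repository id is an assumption; `get_dataset_config_names` is part
# of the Hugging Face `datasets` library.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("jat-project/jat-dataset")
print(configs)

# Each configuration can then be loaded on its own, e.g. the first one:
first = load_dataset("jat-project/jat-dataset", configs[0])
```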
What infrastructure is recommended to use this dataset?
A machine with one or more GPUs and ample storage is recommended to process the roughly 1 TB of data efficiently.
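If local storage is limited, streaming mode in the `datasets` library avoids downloading the full corpus; the identifiers below are assumptions.

```python
# Sketch of streaming mode, which iterates over the Parquet shards without
# downloading the full ~1 TB corpus. The repository and subset ids are
# assumptions.
from datasets import load_dataset

stream = load_dataset(
    "jat-project/jat-dataset",  # assumed repository id
    "atari-pong",               # assumed subset name
    streaming=True,
)

# Pull a single example lazily from the training stream.
first_example = next(iter(stream["train"]))
print(type(first_example))
```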