UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the most iconic resources for the machine learning community. Created at the University of California, Irvine, it brings together hundreds of public datasets used for experimenting, teaching, and benchmarking machine learning algorithms.

Size

Several hundred datasets of various sizes, in CSV, ARFF, and other formats

Licence

Free for academic use. For commercial use, check the licence terms of each individual dataset.

Description

The UCI repository includes:

Several hundred datasets classified by task type (classification, regression, clustering)
Various formats: CSV, ARFF, TXT, etc.
Metadata for each dataset (source, description, variable types, etc.)
A simple interface to explore, download, and use files directly
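
As an illustration of using a downloaded file directly: comma-separated files from the repository often have no header row, and the column names are listed in the dataset's metadata instead. The sketch below uses only the Python standard library and reads such text into dictionaries. The sample rows follow the layout of the Iris data file; the column names and the `load_rows` helper are assumptions made for this example, not part of the repository.

```python
import csv
import io

# Sample text in the layout of the Iris data file (no header row).
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumed column names; in practice they come from the dataset's metadata.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def load_rows(text_stream, columns):
    """Parse header-less CSV text into dicts, converting numeric fields to float."""
    rows = []
    for record in csv.reader(text_stream):
        if not record:  # csv.reader yields an empty list for blank lines
            continue
        row = dict(zip(columns, record))
        for name in columns[:-1]:  # every column except the final label is numeric
            row[name] = float(row[name])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE), COLUMNS)
print(len(rows), rows[0]["species"])
```

For a real download, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="")`.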

What is this repository for?

It is used for:

Experimenting and testing machine learning models
Validating tabular data processing pipelines
Training supervised models on concrete cases (classification, regression)
Teaching data science and machine learning algorithms
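
To make the first and third uses concrete, here is a minimal sketch of a supervised experiment in plain Python: a 1-nearest-neighbour classifier evaluated with leave-one-out validation. The nine rows are in the style of the Iris dataset. The function names are hypothetical, and the technique was chosen for brevity, not because the repository prescribes it.

```python
import math

# A few rows in the style of the Iris dataset: four measurements and a class label.
DATA = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([4.7, 3.2, 1.3, 0.2], "Iris-setosa"),
    ([7.0, 3.2, 4.7, 1.4], "Iris-versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([6.9, 3.1, 4.9, 1.5], "Iris-versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "Iris-virginica"),
    ([5.8, 2.7, 5.1, 1.9], "Iris-virginica"),
    ([7.1, 3.0, 5.9, 2.1], "Iris-virginica"),
]


def predict_1nn(train, features):
    """Return the label of the training row closest to `features` (Euclidean distance)."""
    nearest = min(train, key=lambda row: math.dist(row[0], features))
    return nearest[1]


def leave_one_out_accuracy(data):
    """Hold out each row in turn, predict it from the rest, and return the hit rate."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every row except the held-out one
        if predict_1nn(train, features) == label:
            hits += 1
    return hits / len(data)


accuracy = leave_one_out_accuracy(DATA)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With so few rows the score says little about the method. The point is that a small tabular dataset lets the whole loop, from loading to evaluation, run in a fraction of a second.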

Can it be enriched or improved?

Yes, this resource can be enriched:

By offering cleaned or pre-processed versions of the most popular datasets
By annotating certain datasets with secondary tasks (for example, anomaly detection)
By cross-referencing UCI datasets with real-world sources for hybrid use cases
By creating explanatory notebooks or standardized benchmarks on the most widely used datasets
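
As a sketch of what a cleaned version might involve: some files in the repository mark missing values with a `?` and pad fields with spaces. The example below normalizes such records using plain Python. The marker, the helper names, and the sample rows are assumptions for illustration; the rows are invented and not taken from any dataset.

```python
MISSING_MARKER = "?"  # assumption: the marker used by the file being cleaned


def clean_record(record, marker=MISSING_MARKER):
    """Strip whitespace from each field and map the missing-value marker to None."""
    cleaned = []
    for field in record:
        value = field.strip()
        cleaned.append(None if value == marker else value)
    return cleaned


def drop_incomplete(records, marker=MISSING_MARKER):
    """Clean every record, then keep only those with no missing fields."""
    cleaned = [clean_record(record, marker) for record in records]
    return [record for record in cleaned if None not in record]


# Invented rows for illustration only.
raw = [
    ["34", " clerk", " yes"],
    ["51", " ?", " no"],
]

complete = drop_incomplete(raw)
print(complete)
```

Dropping incomplete rows is the simplest policy. A published cleaned version would need to document whichever policy it applies, since imputing values instead can change benchmark results.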

Source: UCI Machine Learning Repository

Frequently Asked Questions

Is the repository still relevant despite the emergence of more modern sources?

Yes, it remains a reference for learning, rapid validation of algorithms and educational projects. Its diversity and simplicity make it an ideal starting point.

Can these datasets be used in production?

Not directly. Most are small and intended for experimentation or teaching. For production projects, use data that is more representative of the target domain.

Are there newer alternatives?

Yes, platforms like Kaggle Datasets, OpenML, or Hugging Face Datasets offer modern datasets that are often larger or annotated for specific tasks.

Similar datasets

Medical

CheXpert Dataset

Audio

DCASE Challenge Dataset

Text

MultiNLI (Multi-Genre Natural Language Inference Corpus)