UCF101

UCF101 is an openly available dataset and a standard reference in video analysis. It includes more than 13,000 clips of human actions such as running, jumping, cooking or playing sports, and it remains one of the most widely used benchmarks for training and evaluating action recognition models.

Size

13,320 videos classified into 101 categories of human actions, in AVI format

Licence

Free for academic use, licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)

Description

The dataset contains:

13,320 short videos (about 7 seconds on average)
101 action classes (sports, daily actions, social interactions...)
Videos from YouTube, with a realistic, unfiltered background
25 groups per class, used to build standardized training/testing splits
Video data in AVI format, 320×240 pixels at 25 fps
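
The groups matter in practice: clips in the same group tend to share the same source footage, so a random clip-level split can leak near-duplicates from training into testing. The sketch below shows one way to keep whole groups together. It assumes the usual file naming `v_<Class>_g<group>_c<clip>.avi`; the helper names `parse_name` and `split_by_group` and the choice of held-out groups are illustrative only. For benchmark numbers, the train/test lists distributed with the dataset should be preferred.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable, List, NamedTuple, Set, Tuple

# Assumed naming convention: v_<ClassName>_g<group>_c<clip>.avi
NAME_RE = re.compile(r"^v_(?P<label>[A-Za-z0-9]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")


class Clip(NamedTuple):
    label: str
    group: int
    clip: int


def parse_name(path: str) -> Clip:
    """Extract class label, group id and clip id from a UCF101-style file name."""
    name = PurePosixPath(path).name  # tolerate a "ClassName/" directory prefix
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    return Clip(match["label"], int(match["group"]), int(match["clip"]))


def split_by_group(paths: Iterable[str], test_groups: Set[int]) -> Tuple[List[str], List[str]]:
    """Send every clip of a held-out group to the test set, the rest to training."""
    train, test = [], []
    for path in paths:
        target = test if parse_name(path).group in test_groups else train
        target.append(path)
    return train, test


# Illustrative usage: hold out groups 1 to 7 (an assumption, not the official list).
files = [
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c02.avi",
    "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi",
]
train_files, test_files = split_by_group(files, test_groups=set(range(1, 8)))
```

Splitting on the group id rather than on individual clips keeps the evaluation closer to how the benchmark is normally reported.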

Each video shows a single main action, making the supervised classification task easier.

What is this dataset for?

UCF101 is used for:

Training human action recognition models (3D CNNs, RNNs, video transformers)
Validation of embedded vision systems (robots, security cameras, etc.)
Pre-training video models that are later fine-tuned for event detection
Research on spatiotemporal architectures (SlowFast, TimeSformer, VideoMAE)
Behavioral analysis in a general public or surveillance context
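
Most of the model families listed above expect a fixed number of frames per clip, while UCF101 clips vary in length. A common preprocessing step, not something the dataset itself prescribes, is uniform temporal sampling: split the video into equal segments and take one frame from each. The sketch below illustrates the index arithmetic only; the function name `sample_frame_indices` and the clip length of 16 are assumptions made for the example.

```python
from typing import List


def sample_frame_indices(num_frames: int, clip_len: int) -> List[int]:
    """Pick `clip_len` frame indices spread evenly over a video of `num_frames` frames.

    The video is cut into `clip_len` equal segments and the centre frame of each
    segment is selected. Videos shorter than `clip_len` get repeated indices.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    segment = num_frames / clip_len
    return [min(num_frames - 1, int((i + 0.5) * segment)) for i in range(clip_len)]


# A clip of about 7 seconds at 25 fps has roughly 175 frames.
indices = sample_frame_indices(num_frames=175, clip_len=16)
```

The selected indices can then be passed to whichever video decoder is in use to read only the frames the model needs.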

Can it be enriched or improved?

Yes, in particular via:

Adding finer annotations (multiple actions per clip, exact start and end times)
Converting the videos to HDF5 or TFRecord to speed up ingestion
Using those annotations to train temporal segmentation or multi-label detection models
Pairing the videos with audio or text data for multimodal approaches
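
To make the idea of finer annotations concrete, the sketch below shows one possible representation: each annotation is a labelled time segment, and segments are expanded into per-frame multi-hot vectors that a temporal segmentation or multi-label model could train on. UCF101 does not ship such annotations; the `Segment` schema, the `frame_labels` helper and the example segments are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

FPS = 25  # frame rate stated for UCF101


@dataclass(frozen=True)
class Segment:
    """Hypothetical fine-grained annotation: one action over a time span."""
    label: str
    start_s: float  # inclusive, in seconds
    end_s: float    # exclusive, in seconds


def frame_labels(segments: Sequence[Segment], num_frames: int,
                 classes: Sequence[str], fps: float = FPS) -> List[List[int]]:
    """Expand time segments into one multi-hot vector per frame."""
    index = {name: i for i, name in enumerate(classes)}
    grid = [[0] * len(classes) for _ in range(num_frames)]
    for seg in segments:
        first = max(0, round(seg.start_s * fps))
        last = min(num_frames, round(seg.end_s * fps))
        for frame in range(first, last):
            grid[frame][index[seg.label]] = 1
    return grid


# Made-up example: two overlapping actions in a 3-second clip.
classes = ["JumpingJack", "WalkingWithDog"]
segments = [Segment("JumpingJack", 0.0, 2.0), Segment("WalkingWithDog", 1.0, 3.0)]
labels = frame_labels(segments, num_frames=75, classes=classes)
```

Overlapping segments simply set more than one entry in a frame's vector, which is what distinguishes this from the single-label setup of the original dataset.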

🔗 Source: UCF101 Dataset (official)

Frequently Asked Questions

Does UCF101 contain sound?

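Partly. According to the original UCF101 technical report, audio is preserved for the clips of 51 of the 101 classes; the remaining clips are silent. If audio matters for your task, check each file for an audio stream, or combine UCF101 with a dataset such as Kinetics.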

Is the dataset suitable for real-time detection?

Partially. The clips are short and trimmed to a single action, which suits classification. Real-time detection on continuous streams requires adaptations, or a dataset of untrimmed videos such as ActivityNet.
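
One common adaptation is to run a clip classifier over a sliding window of recent frames. The sketch below shows only the buffering logic and makes several assumptions: `classify` stands in for any model trained on UCF101-style clips, and the window and stride values are arbitrary. It is not a recipe from the dataset authors.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def sliding_predictions(
    frames: Iterable[Frame],
    classify: Callable[[List[Frame]], str],  # placeholder for a trained clip model
    window: int = 16,
    stride: int = 8,
) -> Iterator[Tuple[int, str]]:
    """Yield (index of last frame in window, predicted label) over a frame stream."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    buffer: Deque[Frame] = deque(maxlen=window)
    for i, frame in enumerate(frames):
        buffer.append(frame)
        # Predict on the first full window, then once every `stride` frames.
        if len(buffer) == window and (i + 1 - window) % stride == 0:
            yield i, classify(list(buffer))


# Toy usage: integers stand in for frames, and the "model" echoes the first frame.
results = list(sliding_predictions(range(10), lambda clip: str(clip[0]), window=4, stride=2))
```

A smaller stride gives more frequent predictions at the cost of more model calls; the trade-off depends on the latency budget of the target system.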

Is there a newer or extended version?

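Not of UCF101 itself, which is already an extension of the earlier UCF50 dataset. Among related benchmarks, HMDB51 is smaller and more difficult (fewer examples, more noise), and Kinetics-600/700 offers a much larger volume for similar tasks.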

Similar datasets

Google Speech Commands (Audio)
HowTo100M (Multimodal)
ESC-50, Environmental Sound Classification (Audio)