One-Hot Encoding
One-hot encoding is a technique for representing categorical data as binary vectors. Each category is mapped to a unique vector whose length equals the number of categories, with a 1 in exactly one position and 0s elsewhere.
Background
Machine learning models typically require numerical input. Since categorical variables lack inherent numerical meaning, one-hot encoding provides a way to include them without introducing false ordinal relationships. It is widely used in natural language processing, recommendation systems, and computer vision.
Example
Consider a feature “color” with three categories (red, green, blue):
- red → [1, 0, 0]
- green → [0, 1, 0]
- blue → [0, 0, 1]
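As a concrete illustration, here is a minimal sketch of this mapping using pandas. The DataFrame and its values are invented for the example; `pd.get_dummies` expands the column into one binary indicator column per category.

```python
import pandas as pd

# Hypothetical toy dataset with the "color" feature from the example above.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```

Note that the indicator columns are ordered alphabetically by category, so each row contains exactly one 1, matching the vectors listed above.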
Strengths and challenges
- ✅ Straightforward and preserves categorical meaning.
- ✅ Prevents unintended ordering of categories.
- ❌ Can lead to very high-dimensional feature spaces.
- ❌ Less efficient than embeddings for large vocabularies.
One-hot encoding is one of the most fundamental preprocessing techniques in machine learning. By representing categories as sparse binary vectors, it ensures that algorithms interpret each category as distinct and unrelated, rather than mistakenly inferring numerical order.
Its simplicity makes it a default choice in many workflows, especially for categorical features in tabular datasets and for representing tokens in early natural language processing systems. However, its main limitation is the curse of dimensionality: as the number of categories grows, the resulting vectors become extremely large and sparse, which increases memory usage and can slow down training.
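The sparsity also suggests the standard mitigation: since each row contains exactly one nonzero entry, storing only the positions of the 1s reduces memory from O(rows × categories) to O(rows). The sketch below contrasts a dense one-hot matrix with a sparse one; the integer codes and vocabulary size are invented for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical integer-coded categories: 6 rows over a 5-category vocabulary.
codes = np.array([0, 2, 4, 1, 2, 0])
num_categories = 5

# Dense one-hot: index rows of the identity matrix. Fine for small
# vocabularies, but needs one stored cell per (row, category) pair.
dense = np.eye(num_categories, dtype=np.uint8)[codes]

# Sparse one-hot: store only the coordinates of the 1s.
# Each row has exactly one nonzero, so storage grows with rows, not columns.
sparse_onehot = sparse.csr_matrix(
    (np.ones(len(codes), dtype=np.uint8),   # data: the 1s themselves
     (np.arange(len(codes)), codes)),       # (row, column) coordinates
    shape=(len(codes), num_categories),
)

# Both representations encode the same matrix.
assert (sparse_onehot.toarray() == dense).all()
```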
Modern alternatives address these issues in different ways: embeddings map categories into lower-dimensional continuous spaces that can capture similarities between categories, while the hashing trick bounds dimensionality by assigning categories to a fixed number of buckets. Still, one-hot encoding remains widely used in smaller-scale tasks or as a baseline method due to its clarity and universality.
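To make the hashing trick concrete, here is a minimal sketch. The bucket count and the choice of MD5 are illustrative assumptions, not a fixed recipe; any stable hash function works (Python's built-in `hash()` is avoided because it is salted per process).

```python
import hashlib

def hashed_index(category: str, num_buckets: int = 32) -> int:
    """Map a category string to one of num_buckets slots (the hashing trick).

    Distinct categories may collide in the same bucket, trading a little
    accuracy for a feature dimension that is bounded by num_buckets.
    """
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# The feature dimension stays at num_buckets no matter how many
# distinct categories appear in the data.
for color in ("red", "green", "blue", "turquoise"):
    print(color, "->", hashed_index(color))
```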