Struggling with high variable cardinality in your categorical data?

Dec 12, 2024

Categorical variables have labels instead of numbers. For example, the variable city is categorical, with values like London, Bristol and Manchester.

Some categorical variables have low cardinality, that is, they only take a small number of different categories. Other variables can have high cardinality, for example postcode or employment status.

Highly cardinal variables are challenging in machine learning, because many of their categories are rare and may appear only in the train set or only in the test set. Thus, the model can overfit to rare categories seen during training, or won't know how to handle categories it never saw.
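A quick way to see the problem is to compare the categories in each split. This is a minimal sketch with made-up postcodes; the variable names are illustrative only.

```python
# Hypothetical postcodes, invented purely for illustration.
train_postcodes = ["BS1", "BS1", "M1", "M1", "M1", "E1", "SW3"]
test_postcodes = ["BS1", "M1", "N7", "SW3", "LS2"]

# Categories the model will meet at prediction time without ever having seen them.
only_in_test = set(test_postcodes) - set(train_postcodes)

# Categories the model learned from but that never show up in the test set.
only_in_train = set(train_postcodes) - set(test_postcodes)

print(sorted(only_in_test))
print(sorted(only_in_train))
```

If either set is large relative to the number of categories, the encodings discussed next are worth considering.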

Fortunately, there are many ways in which we can encode our variables to tackle high cardinality. Here's a summary:

1️⃣ One-Hot Encoding of Frequent Categories: Add binary variables only for frequent categories. This avoids overfitting and explosion of the feature space.

💡 With this method, rare categories are treated together as one additional group.
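Here is a library-free sketch of the idea, assuming we keep the `top_n` most frequent categories learned from the train set. The function name, the `prefix` argument and the toy data are illustrative assumptions, not the API of any particular library.

```python
from collections import Counter

def one_hot_frequent(train_values, values, top_n=2, prefix="city"):
    """One-hot encode only the top_n most frequent training categories."""
    # Learn the frequent categories from the training data only.
    frequent = [cat for cat, _ in Counter(train_values).most_common(top_n)]
    rows = []
    for value in values:
        # Rare or unseen categories get 0 in every column,
        # so together they form one implicit extra group.
        rows.append({f"{prefix}_{cat}": int(value == cat) for cat in frequent})
    return rows

train = ["London", "London", "London", "Bristol", "Bristol", "Manchester", "Leeds"]
test = ["London", "Leeds", "York"]  # "York" never appears in train

encoded = one_hot_frequent(train, test)
for city, row in zip(test, encoded):
    print(city, row)
```

Whatever the number of distinct cities, the output has only `top_n` columns, and an unseen city like York is handled without errors.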

2️⃣ Count/Frequency Encoding: Replace categories with their counts or proportions. This works well when the count of the categories aligns with your business or data logic.

A common example occurs in sales forecasting: the item’s count hints at its popularity. Hence, count encoding makes a useful predictor.

💡 With this method, categories with similar counts are given similar treatment, or even grouped together, reducing the cardinality.
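A minimal sketch of count and frequency encoding, using only the standard library. The function names and the choice to map unseen categories to 0 are assumptions made for this example.

```python
from collections import Counter

def fit_count_encoding(train_values, as_frequency=False):
    """Learn a category -> count (or proportion) mapping from the train set."""
    counts = Counter(train_values)
    if as_frequency:
        total = len(train_values)
        return {cat: n / total for cat, n in counts.items()}
    return dict(counts)

def apply_encoding(values, mapping, default=0):
    """Replace each category by its learned value; unseen categories get `default`."""
    return [mapping.get(value, default) for value in values]

# Toy sales data: each entry is one sold item.
train_items = ["apple", "apple", "apple", "pear", "pear", "kiwi", "fig"]

count_map = fit_count_encoding(train_items)
encoded_items = apply_encoding(["apple", "kiwi", "fig", "mango"], count_map)
print(encoded_items)
```

Note that kiwi and fig receive the same value because they have the same count, so the model can no longer tell them apart. That is the grouping effect described above.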

3️⃣ Mean or Target Encoding: Replace categories with a blend of the posterior and prior target expectation.

In plain English, this means that the encoding of infrequent categories is pulled towards the global target mean, so they end up being treated almost collectively as one category.
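One common way to implement this blend is a weighted average where the weight grows with the category's count. The sketch below assumes the weight `n / (n + smoothing)`; other weighting schemes exist, and the function name and the `smoothing` parameter are illustrative choices.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Blend each category's target mean with the global mean (the prior)."""
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    mapping = {}
    for cat, n in counts.items():
        category_mean = sums[cat] / n
        # Frequent categories get a weight near 1 and keep their own mean;
        # rare categories get a weight near 0 and fall back to the prior.
        weight = n / (n + smoothing)
        mapping[cat] = weight * category_mean + (1 - weight) * prior
    return mapping, prior

# Toy data: "A" is frequent (40 rows, target mean 0.75), "B" appears once with target 0.
cats = ["A"] * 40 + ["B"]
ys = [1] * 30 + [0] * 10 + [0]

mapping, prior = fit_target_encoding(cats, ys)
# Categories never seen in training can simply receive the prior.
encoded = [mapping.get(c, prior) for c in ["A", "B", "C"]]
print(encoded)
```

To limit target leakage, learn the mapping from the train set only, as done here, and then apply it to the test set.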

4️⃣ Grouping Rare Categories: Merge infrequent categories into one group and encode them collectively with any method of your choosing.

Infrequent categories are difficult because there are too few observations to say anything certain about their patterns and distributions. Hence, treating them as a single category helps avoid overfitting and streamlines production pipelines.
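A minimal sketch of rare-category grouping, assuming a category counts as frequent when it covers at least 5% of the train set. The threshold, the `"Rare"` label and the function names are illustrative assumptions.

```python
from collections import Counter

def fit_frequent_categories(train_values, min_frequency=0.05):
    """Return the categories whose share of the train set is at least min_frequency."""
    total = len(train_values)
    counts = Counter(train_values)
    return {cat for cat, n in counts.items() if n / total >= min_frequency}

def group_rare(values, frequent, rare_label="Rare"):
    """Keep frequent categories and replace everything else with one label."""
    return [value if value in frequent else rare_label for value in values]

train = (["London"] * 50 + ["Bristol"] * 30 + ["Manchester"] * 17
         + ["Leeds"] * 2 + ["York"] * 1)

frequent = fit_frequent_categories(train)
grouped = group_rare(["London", "Leeds", "Bath"], frequent)
print(grouped)
```

The grouped variable can then be passed to any encoder. Categories that are new at prediction time, like Bath here, land in the same group as the rare ones, which keeps the pipeline from failing in production.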

🎁 And now relevant links:

Feature-Engine's category encoders.
Our course on feature engineering.
Our Python Feature Engineering Cookbook.


How do you tackle high cardinality? Share in the comments 👇

Train in Data
