Which Scaling Method is Best for Rescaling My Variables?
This is a question I get often in my course, so I thought I'd put together some general guidelines.
Standardization:
Standardization consists of subtracting the mean from the values and then dividing by the standard deviation.
Use standardization when models require the variables to be centered at zero and data is not sparse (centering sparse data will destroy its sparse nature).
Standardization is sensitive to outliers, and the resulting z-scores (another name for the standardized values) do not correct skewness: a highly skewed variable remains skewed after standardization.
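As a minimal sketch, standardization can be applied with scikit-learn's StandardScaler (the data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values
X = np.array([[10.0], [20.0], [30.0], [40.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardization, the feature is centered at zero with unit variance
print(X_scaled.mean(), X_scaled.std())
```

The scaler learns the mean and standard deviation with `fit`, so the same transformation can later be applied to new data with `transform`.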
Scaling to the minimum and maximum:
This re-scaling method involves subtracting the minimum value from the variable and then dividing by the value range.
It is suitable for variables with very small standard deviations, for models that do not require data to be centered at zero, and when we want to preserve zero entries in sparse data, such as one-hot encoded variables.
Note that this scaling method is sensitive to outliers.
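A minimal sketch with scikit-learn's MinMaxScaler, using made-up values, shows the variable being mapped to the [0, 1] interval:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature values
X = np.array([[1.0], [5.0], [9.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# The minimum maps to 0, the maximum to 1, and zeros stay zeros
# when the original minimum is 0
print(X_scaled.ravel())
```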
Robust scaling:
In robust scaling, we subtract the median from the values and divide the result by the interquartile range.
Robust scaling is a suitable alternative to standardization when models require the variables to be centered and the data contains outliers.
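A minimal sketch with scikit-learn's RobustScaler, on made-up data containing one outlier, shows why it is robust: the median and interquartile range are barely affected by the extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature values; 100.0 is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# The bulk of the data is centered around zero; the outlier
# remains visible but does not distort the scaling of the rest
print(X_scaled.ravel())
```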
Mean normalization:
In mean normalization, we subtract the mean from the values and divide the result by the value range (maximum minus minimum). It is a suitable alternative for models that need the variables to be centered at zero.
This method is sensitive to outliers and not a suitable option for sparse data, as it will destroy the sparse nature.
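Scikit-learn does not ship a dedicated transformer for mean normalization, but the formula is simple enough to sketch by hand; the helper name and data below are hypothetical:

```python
import numpy as np

def mean_normalize(x):
    # Subtract the mean and divide by the value range (max - min),
    # so the result is centered at zero with range 1
    return (x - x.mean()) / (x.max() - x.min())

# Hypothetical feature values
x = np.array([2.0, 4.0, 6.0, 8.0])
x_norm = mean_normalize(x)
print(x_norm)
```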
Want to learn more about re-scaling techniques?
Check out my course and book:
🎓 Course on Feature Engineering
📘Python Feature Engineering Cookbook
Before I finish my post, I'd like to send a quick reminder: this is your last chance to get the Second Edition of the Feature Selection in Machine Learning book at a 20% discount!
What’s New in the Second Edition?
This edition is packed with exciting updates:
More details & deeper explanations for every selection technique.
Clearer diagrams for better understanding.
Two new feature selection methods: feature selection using random probes, and Maximum Relevance Minimum Redundancy (MRMR).
How to Get It:
Visit Train in Data or simply click the button below to get your copy! 👇
Select the $24.99 price option.
Enter coupon code SECONDEDITION2025 at checkout.
Offer expires 28th February 2025 – don’t miss out!
Ready to enhance your skills?
Our specializations, courses and books are here to assist you:
Advanced Machine Learning (specialization)
Forecasting with Machine Learning (course)