How to identify outliers in data science?
Discover the three most commonly used methods to identify outliers in your data.
Hey there!
In case you missed it, last week I released the Second Edition of my book, Feature Selection in Machine Learning!
To celebrate, I'm giving a 20% discount until the end of February!
What's New?
This edition's got a bunch of cool updates: more details, clearer diagrams, and deeper dives into every feature selection method. Plus, I've added two additional methods:
Feature selection using random probes
Maximum Relevance Minimum Redundancy (MRMR)
How to Get It:
Head over to Train in Data, choose the $24.99 price, and pop in the coupon code SECONDEDITION2025 at checkout.
Offer ends 28th February 2025. Discount only available on Train in Data's website.
How to identify outliers in data science?
Outliers are data points that diverge significantly from the rest of the values in a variable. In plain English, univariate outliers are values that are extremely low or extremely high compared to the bulk of the observations, and that appear with very low frequency.
With that definition in mind, there are three main ways to identify univariate outliers in a variable.
Interquartile range (IQR) proximity rule
The interquartile range (IQR) proximity rule states that outliers are the data points that lie below the 1st quartile minus 1.5 times the IQR, or above the 3rd quartile plus 1.5 times the IQR.
▶️ If the variable is normally distributed, these limits coincide, roughly, with the mean plus and minus 3 times the standard deviation, which is known to enclose about 99.7% of the observations.
▶️ But the beauty of the IQR rule is that it is non-parametric, and hence it can be used to find outliers in non-normally distributed variables.
▶️ Another plus of the IQR rule is that it creates asymmetric boundaries, which works better for asymmetric distributions; see the short sketch below.
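As a quick illustration, here is a minimal sketch of the IQR proximity rule with pandas (the toy series is made up for the example):

```python
import pandas as pd

# Toy data with two extreme values (illustrative only).
s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```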
Mean and standard deviation
If the variable is normally distributed, the mean plus and minus 3 times the standard deviation has traditionally been used as the limits beyond which we'd consider a value to be an outlier.
▶️ The Gaussian distribution is very well studied, and these values are reasonable and are used as cut-off points in many statistical tests.
▶️ But the mean and the standard deviation are themselves heavily impacted by outliers, so using this method to detect outliers somewhat defeats the purpose; a minimal version is sketched below nonetheless.
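For completeness, a sketch of the mean and standard deviation rule on the same kind of toy series (the data and the factor of 3 are just the conventional example):

```python
import pandas as pd

s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

mean, std = s.mean(), s.std()

lower_limit = mean - 3 * std
upper_limit = mean + 3 * std

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```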
Median absolute deviation
A better method consists in using the median absolute deviation (MAD). It works like the mean and standard deviation rule, but instead of the mean we use the median, and instead of the standard deviation we use the MAD, which is the median of the absolute deviations of each value from the median.
▶️ The median is known to be robust to outliers, so it is clearly a good candidate for outlier identification.
▶️ It is, in general, the recommended method, but it has some limitations: first, it produces symmetric boundaries, which is not ideal for very asymmetric distributions, and second, it does not work if 50% or more of the variable's values are identical, because the MAD then becomes zero. A minimal sketch follows.
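A minimal sketch of the MAD rule; note that scaling the MAD by 1.4826 (the usual consistency constant, which makes it comparable to the standard deviation under normality) is a common convention, not something stated in the text above:

```python
import pandas as pd

s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

median = s.median()
mad = (s - median).abs().median()

# 1.4826 makes the MAD comparable to the standard deviation for normal data.
lower_limit = median - 3 * 1.4826 * mad
upper_limit = median + 3 * 1.4826 * mad

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```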
You can identify outliers with any of these methods using Feature-engine:
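For example, a minimal sketch with Feature-engine's OutlierTrimmer (the dataframe and column name are hypothetical, and the "mad" option is only available in recent Feature-engine versions):

```python
import pandas as pd
from feature_engine.outliers import OutlierTrimmer

# Hypothetical dataframe with one numerical variable.
df = pd.DataFrame({"x": [2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0]})

# IQR proximity rule: drop rows beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR.
trimmer = OutlierTrimmer(capping_method="iqr", tail="both", fold=1.5, variables=["x"])
df_clean = trimmer.fit_transform(df)

# Swap capping_method to "gaussian" (mean ± fold * std) or, in recent
# versions, "mad" to apply the other two rules instead.
```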
To learn more about how to identify them and preprocess them, check out:
Our course
Our book
Ready to enhance your data science skills?
Stop the aimless internet browsing. Start learning today with meticulously crafted courses that offer a robust curriculum and help you build skills with focus and efficiency.
Forecasting specialization (course)
Interpreting Machine Learning Models (course)