How to identify outliers in data science?
Discover the three most commonly used methods to identify outliers in your data.
Hey there!
In case you missed it, last week I released the Second Edition of my book, Feature Selection in Machine Learning!
To celebrate, I'm giving a 20% discount until the end of February!
What's New?
This edition's got a bunch of cool updates: more details, clearer diagrams, and deeper dives into every feature selection method. Plus, I've added two additional methods:
Feature selection using random probes
Maximum Relevance Minimum Redundancy (MRMR)
How to Get It:
Head over to Train in Data, choose the $24.99 price, and pop in the coupon code SECONDEDITION2025 at checkout.
Offer ends 28th February 2025. Discount only available on Train in Data's website.
How to identify outliers in data science?
Outliers are data points that diverge significantly from the rest of the values in a variable. In plain English, univariate outliers are values that are extremely low or extremely high compared to the bulk of the observations, and that appear with very low frequency.
With that definition in mind, there are three main ways to identify univariate outliers in a variable.
Interquartile range (IQR) proximity rule
The interquartile range (IQR) proximity rule states that outliers are the data points that lie below the 1st quartile minus 1.5 times the IQR, or above the 3rd quartile plus 1.5 times the IQR.
▶️ If the variable is normally distributed, these limits coincide, roughly, with the mean plus and minus 3 times the standard deviation, which is known to enclose about 99.7% of the observations.
▶️ But the beauty of the IQR rule is that it is non-parametric, and hence it can be used to find outliers in non-normally distributed variables.
▶️ Another plus of the IQR rule is that it creates asymmetric boundaries, which works better for asymmetric distributions; see the short sketch below.
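As a quick illustration, here is a minimal sketch of the IQR proximity rule with pandas (the toy series is made up for the example):

```python
import pandas as pd

# Toy data with two extreme values (illustrative only).
s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```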
Mean and standard deviation
If the variable is normally distributed, the mean plus and minus 3 times the standard deviation has traditionally been used as the limits beyond which we'd consider a value to be an outlier.
▶️ The Gaussian distribution is very well studied, and these values are reasonable and are used as cut-off points in many statistical tests.
▶️ But the mean and the standard deviation are themselves heavily impacted by outliers, so using this method to detect outliers somewhat defeats the purpose; a minimal version is sketched below nonetheless.
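For completeness, a sketch of the mean and standard deviation rule on the same kind of toy series (the data and the factor of 3 are just the conventional example):

```python
import pandas as pd

s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

mean, std = s.mean(), s.std()

lower_limit = mean - 3 * std
upper_limit = mean + 3 * std

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```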
Median absolute deviation
A better method consists in using the median absolute deviation (MAD). It works like the mean and standard deviation rule, but instead of the mean we use the median, and instead of the standard deviation we use the MAD, which is the median of the absolute deviations of each value from the median.
▶️ The median is known to be robust to outliers, so it is clearly a good candidate for outlier identification.
▶️ It is, in general, the recommended method, but it has some limitations: first, it produces symmetric boundaries, which is not ideal for very asymmetric distributions, and second, it does not work if 50% or more of the variable's values are identical, because the MAD then becomes zero. A minimal sketch follows.
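A minimal sketch of the MAD rule; note that scaling the MAD by 1.4826 (the usual consistency constant, which makes it comparable to the standard deviation under normality) is a common convention, not something stated in the text above:

```python
import pandas as pd

s = pd.Series([2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0])

median = s.median()
mad = (s - median).abs().median()

# 1.4826 makes the MAD comparable to the standard deviation for normal data.
lower_limit = median - 3 * 1.4826 * mad
upper_limit = median + 3 * 1.4826 * mad

outliers = s[(s < lower_limit) | (s > upper_limit)]
print(outliers)
```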
You can identify outliers with any of these methods using Feature-engine:
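For example, a minimal sketch with Feature-engine's OutlierTrimmer (the dataframe and column name are hypothetical, and the "mad" option is only available in recent Feature-engine versions):

```python
import pandas as pd
from feature_engine.outliers import OutlierTrimmer

# Hypothetical dataframe with one numerical variable.
df = pd.DataFrame({"x": [2.1, 2.5, 2.8, 3.0, 3.2, 3.5, 3.9, 4.1, 12.0, 15.0]})

# IQR proximity rule: drop rows beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR.
trimmer = OutlierTrimmer(capping_method="iqr", tail="both", fold=1.5, variables=["x"])
df_clean = trimmer.fit_transform(df)

# Swap capping_method to "gaussian" (mean ± fold * std) or, in recent
# versions, "mad" to apply the other two rules instead.
```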
To learn more about how to identify them and preprocess them, check out:
Our course
Our book
Ready to enhance your data science skills?
Stop the aimless internet browsing. Start learning today with meticulously crafted courses that offer a robust curriculum and help you build skills with focus and efficiency.
Forecasting specialization (course)
Interpreting Machine Learning Models (course)