If you are using the predict() method of any scikit-learn classifier, you are doing it wrong!
Scikit-learn classifiers, like logistic regression, decision trees and random forests, have 3 key methods:
👉 fit(): triggers the learning of the model's parameters
👉 predict_proba(): outputs the probability of each class for every sample (for binary classification, an array with 2 columns; for multiclass, one column per class).
👉 predict(): returns the predicted class.
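A minimal sketch of the three methods in action, using a toy dataset generated with make_classification (the data here is synthetic, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

clf = LogisticRegression().fit(X, y)  # fit(): learn the model's parameters

proba = clf.predict_proba(X)   # shape (200, 2): one probability column per class
labels = clf.predict(X)        # class labels, thresholded at 0.5 internally

print(proba.shape)
print(labels[:5])
```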
predict() uses the arbitrary 😱 threshold of 0.5 😱 to predict the class. That’s why I never used it in my (data science) working life, and I thought that went without saying.
But last year I saw a talk by one of scikit-learn's developers, where he explained why you should not use 0.5 as a threshold, and that got me thinking 💭… Maybe this is a common mistake?
The thing is, a threshold of 0.5 could work if the classes are perfectly balanced ⚖️ , which happens… almost never in real world datasets.
And even if the classes were balanced, we'd normally optimize the model for a specific metric: for example, to reduce the number of false positives or false negatives, or to maximize accuracy. That means moving the threshold up or down as needed.
👉 That makes the output of predict() useless, in almost every case.
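To make the point concrete, here is a sketch (on a synthetic, imbalanced dataset) showing that predict() is just predict_proba() cut at 0.5, and how lowering the threshold yourself changes the trade-off:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: ~90% negatives, ~10% positives
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression().fit(X, y)

proba_pos = clf.predict_proba(X)[:, 1]        # probability of the positive class
default = (proba_pos >= 0.5).astype(int)      # effectively what predict() returns
custom = (proba_pos >= 0.2).astype(int)       # lower threshold: more predicted positives,
                                              # fewer false negatives

print("positives at 0.5:", default.sum(), "| at 0.2:", custom.sum())
```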
In short, if you use a threshold of 0.5 without any further analysis, you are not taking advantage of the full potential of the model, and might be drawing wrong conclusions about the data. In fact, this is what happened to the people who developed SMOTE!!!
The good news is: scikit-learn has now released ☀️ 2 new meta-estimators ☀️ that wrap any classifier:
▶️ One that allows you to set the threshold to any value you wish: FixedThresholdClassifier.
▶️ One that automatically finds the best threshold based on a performance metric: TunedThresholdClassifierCV.
This means that now, you can change or find the optimal threshold for a given classifier by using additional tools from the scikit-learn ecosystem.
Before I leave, I wanted to share an opportunity that I thought you might enjoy:
Get Your Hands on 21 Machine Learning Books from Packt!
You can get a bundle with 21 books from Packt on Machine Learning, including my Python Feature Engineering Cookbook, at a price of your choosing. And what's more... a fraction of what you pay can be donated to charity.
Have you heard of Humble Bundle? Me neither, until recently. It's a platform that offers bundles of books, where you choose what you pay and also where your money goes, including to charity. Check it out!
Ready to enhance your data science skills?
Stop aimless internet browsing. Start learning today with carefully crafted courses that offer a robust curriculum and help you build skills with focus and efficiency.
Forecasting specialization (course)
Interpreting Machine Learning Models (course)