When it comes to scaling data for machine learning models, I am frequently asked:
Can we not fit scalers on the full dataset? Why do we need to split into train and test sets? I mean, what if the test set has a value greater than the maximum value in the training set? Why make our lives complicated?
Well, not so fast.
🔍 Let's compare the two approaches:
Training Set Only: 🛠️
📚 Fitting scalers (and feature engineering techniques in general) on the training set alone prevents data leakage, so the test set can't influence the parameters the model is trained with.
🎯 It ensures the model learns patterns that generalize well to unseen data.
🛡️ It treats the test set as a true representation of real-world data post-deployment.
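A minimal sketch of this approach with scikit-learn's MinMaxScaler (the synthetic data and split sizes below are illustrative assumptions, not from any real project):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic feature: 200 values drawn from a normal distribution.
rng = np.random.default_rng(42)
X = rng.normal(loc=10, scale=3, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
scaler.fit(X_train)                    # learn min/max from the training split ONLY
X_train_s = scaler.transform(X_train)  # guaranteed inside [0, 1]
X_test_s = scaler.transform(X_test)    # may fall outside [0, 1] -- and that's fine
```

The key line is `scaler.fit(X_train)`: the test split is transformed with parameters it never influenced, mirroring what happens to truly unseen data after deployment.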
Full Dataset Approach: 🔄
🚫 Risks incorporating test set information into the training process, potentially overestimating model performance.
🧐 Extreme values in the test set would shape the scaler's parameters, hiding how the model actually copes with out-of-range data.
📉 A significant number of observations with values outside the training set range could indicate distribution shift, and you want to find out about this sooner rather than later (as in, not when your model is finally deployed). Splitting the dataset would have given you this hint.
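A quick way to surface that hint, sketched on made-up numbers (the arrays below are purely illustrative):

```python
import numpy as np

train = np.array([2.0, 5.0, 7.5, 9.0])   # values the scaler was fitted on
test = np.array([1.0, 6.0, 12.0, 14.0])  # held-out values

# Fraction of test observations outside the training range.
lo, hi = train.min(), train.max()
outside = ((test < lo) | (test > hi)).mean()
print(f"{outside:.0%} of test values fall outside the training range")
# → 75% of test values fall outside the training range
```

A fraction this large would be a strong prompt to investigate distribution shift before going anywhere near production.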
The Goal: 🎯
Even if we fit the scaler (or the entire pipeline, for that matter) on the full dataset, future data, the data we will actually score with the model once it is deployed, may still contain values outside the range seen in the training and testing sets. And believe me, production is not the right moment to find out.
If you split the data, you’ll encounter these challenges in a representative way during the development phase, and hopefully you can put mechanisms in place to tackle them or minimize their impact when they happen.
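One such mechanism, as a sketch: scikit-learn's MinMaxScaler takes a `clip` option that caps out-of-range values at scoring time so scaled features stay inside the expected interval (the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[2.0], [5.0], [9.0]])
X_new = np.array([[14.0]])  # unseen value beyond the training maximum

# clip=True caps transformed values at the edges of the feature range.
scaler = MinMaxScaler(clip=True).fit(X_train)
print(scaler.transform(X_new))  # → [[1.]] instead of ~1.71
```

Clipping is a blunt instrument (it discards information about how extreme the value was), so in practice you might pair it with monitoring that logs or alerts on out-of-range inputs.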
In summary, while the full dataset approach may seem intuitive, prioritizing the train-test split principle reinforces model robustness and generalization capabilities.
Ready to enhance your data science skills?
Stop aimless internet browsing. Start learning today with meticulously crafted courses offering a robust curriculum, fostering skill development with steadfast focus and efficiency.
Forecasting specialization (course)
Interpreting Machine Learning Models (course)