
Significantly Improve Your Grid-Search Results With These Parameters | by Tomer Gabay | Dec, 2022


Grid search over any machine learning pipeline step using an EstimatorSwitch

Photo by Héctor J. Rivas on Unsplash

A very common step in building a machine learning model is to grid search over a classifier's parameters on the train set, using cross-validation, to find the most optimal parameters. What is less known is that you can also grid search over virtually any pipeline step, such as feature engineering steps. E.g. which imputation strategy works best for numerical values? Mean, median or arbitrary? Which categorical encoding method should you use? One-hot encoding, or maybe ordinal?

In this article, I'll guide you through the steps needed to answer such questions in your own machine learning projects using grid searches.

To install all the required Python packages for this article:

pip install extra-datascience-tools feature-engine

The dataset

Let's consider the following very simple public domain dataset I created, which has two columns: last_grade and course_passed. The last grade column contains the grade the student achieved on their final exam, and the course passed column is a boolean column: True if the student passed the course, False if the student failed it. Can we build a model that predicts whether a student passed the course based on their last grade?

Let's first explore the dataset:

import pandas as pd

df = pd.read_csv('last_grades.csv')
df.isna().sum()

OUTPUT
last_grade 125
course_passed 0
dtype: int64

Our target variable course_passed has no nan values, so there is no need to drop rows here.

Of course, to prevent any data leakage we should split our dataset into a train and test set before continuing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[['last_grade']],
    df['course_passed'],
    random_state=42,
)

Because most machine learning models don't allow nan values, we must consider different imputation strategies. Of course, normally you would start with EDA (exploratory data analysis) to determine whether the nan values are MAR (Missing At Random), MCAR (Missing Completely At Random) or MNAR (Missing Not At Random). A good article that explains the differences between these can be found here.

Instead of analyzing why the last grade is missing for some students, we're simply going to grid search over different imputation strategies, to illustrate how to grid search over any pipeline step, such as this feature engineering step.

Let's explore the distribution of the independent variable last_grade:

import seaborn as sns

sns.histplot(data=X_train, x='last_grade')

Distribution of last_grade (Image by Author)

It looks like the last grades are normally distributed, with a mean of ~6.5 and values between ~3 and ~9.5.
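If you prefer numbers over a plot, a quick summary confirms this (a minimal check; the exact statistics of course depend on your data and split):

# Summary statistics of the last grades in the train set
X_train['last_grade'].describe()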

Let's also look at the distribution of the target variable to determine which scoring metric to use:

y_train.value_counts()

OUTPUT
True     431
False    412
Name: course_passed, dtype: int64

The target variable is roughly equally divided, which means we can use scikit-learn's default scorer for classification tasks: the accuracy score. If the target variable is unequally divided, accuracy is not a suitable metric; use e.g. F1 instead.
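As a sketch of what that would look like: GridSearchCV accepts a scoring parameter, so with an imbalanced target you could optimize F1 instead (model and param_grid here refer to the objects we define in the next section):

from sklearn.model_selection import GridSearchCV

# Optimize F1 instead of the default accuracy; assumes a binary target.
# `model` and `param_grid` are defined in the next section.
gridsearch = GridSearchCV(model, param_grid=param_grid, scoring="f1")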

Grid searching

Next, we're going to set up the model and the grid search, and run it by optimizing only the classifier's parameters, which is how I see most data scientists use a grid search. For now we'll use feature-engine's MeanMedianImputer to impute the mean, and scikit-learn's DecisionTreeClassifier to predict the target variable.

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from feature_engine.imputation import MeanMedianImputer

model = Pipeline(
    [
        ("meanmedianimputer", MeanMedianImputer(imputation_method="mean")),
        ("tree", DecisionTreeClassifier()),
    ]
)

param_grid = [
    {"tree__max_depth": [None, 2, 5]},
]

gridsearch = GridSearchCV(model, param_grid=param_grid)
gridsearch.fit(X_train, y_train)

pd.DataFrame(gridsearch.cv_results_).loc[:,
    ['rank_test_score',
     'mean_test_score',
     'param_tree__max_depth']
].sort_values('rank_test_score')

Results from the code above (Image by Author)

As we can see from the table above, using GridSearchCV we found that we can improve the accuracy of the model by ~0.55 simply by changing the max_depth of the DecisionTreeClassifier from its default value None to 5. This clearly illustrates the positive impact grid searching can have.
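You can also read the winning configuration straight off the fitted grid search, using scikit-learn's standard attributes:

# Best hyperparameters and the corresponding mean cross-validated accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)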

However, we don't know whether imputing the missing last_grades with the mean is actually the best imputation strategy. What we can do is grid search over three different imputation strategies using extra-datascience-tools' EstimatorSwitch:

  • Mean imputation
  • Median imputation
  • Arbitrary number imputation (by default 999 for feature-engine's ArbitraryNumberImputer).
from feature_engine.imputation import (
    ArbitraryNumberImputer,
    MeanMedianImputer,
)
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch

# create a pipeline with two imputation strategies
model = Pipeline(
    [
        ("meanmedianimputer", EstimatorSwitch(MeanMedianImputer())),
        ("arbitraryimputer", EstimatorSwitch(ArbitraryNumberImputer())),
        ("tree", DecisionTreeClassifier()),
    ]
)

# specify the parameter grid for the classifier
classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]

# specify the parameter grid for feature engineering
feature_param_grid = [
    {
        "meanmedianimputer__apply": [True],
        "meanmedianimputer__estimator__imputation_method": ["mean", "median"],
        "arbitraryimputer__apply": [False],
    },
    {
        "meanmedianimputer__apply": [False],
        "arbitraryimputer__apply": [True],
    },
]

# join the parameter grids together
model_param_grid = [
    {
        **classifier_params,
        **feature_params,
    }
    for feature_params in feature_param_grid
    for classifier_params in classifier_param_grid
]

Some important things to notice here:

  • We enclosed both imputers in the Pipeline within extra-datascience-tools' EstimatorSwitch because we don't want to use both imputers at the same time: after the first imputer has transformed X, there will be no nan values left for the second imputer to transform.
  • We split the parameter grid into a classifier parameter grid and a feature engineering parameter grid. At the bottom of the code, we join these two grids together so that every feature engineering grid is combined with every classifier grid, because we want to try a max_depth of None, 2 and 5 for both the ArbitraryNumberImputer and the MeanMedianImputer (see the sketch after this list).
  • We use a list of dictionaries instead of a single dictionary in the feature parameter grid, so that we prevent the MeanMedianImputer and the ArbitraryNumberImputer from being applied at the same time. Using the apply parameter of EstimatorSwitch we can simply switch one of the two imputers on or off. Of course, you could also run the code twice, once with the first imputer commented out and once with the second imputer commented out. However, that would lead to errors in our parameter grid, so we would need to adjust that one as well, and the results of the different imputation strategies wouldn't be available in the same grid search cv results, which makes them much harder to compare.
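As a quick sanity check (the sketch mentioned above), printing the joined grid shows that each imputer is combined with every max_depth, while the two imputers are never active at the same time:

for params in model_param_grid:
    print(params)

# {'tree__max_depth': [None, 2, 5], 'meanmedianimputer__apply': [True],
#  'meanmedianimputer__estimator__imputation_method': ['mean', 'median'],
#  'arbitraryimputer__apply': [False]}
# {'tree__max_depth': [None, 2, 5], 'meanmedianimputer__apply': [False],
#  'arbitraryimputer__apply': [True]}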

Let's look at the new results:

gridsearch = GridSearchCV(model, param_grid=model_param_grid)
gridsearch.fit(X_train, y_train)

pd.DataFrame(gridsearch.cv_results_).loc[:,
    ['rank_test_score',
     'mean_test_score',
     'param_tree__max_depth',
     'param_meanmedianimputer__estimator__imputation_method']
].sort_values('rank_test_score')

Grid-search results on feature engineering (Image by Author)

We now see a new best model: the decision tree with a max_depth of 2, using the ArbitraryNumberImputer. We improved the accuracy by 1.4% by implementing a different imputation strategy! And as a welcome bonus, our tree depth has shrunk to 2, which makes the model easier to interpret.
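Finally, it's good practice to confirm the winning pipeline on the held-out test set. GridSearchCV refits the best estimator on the full train set by default, so scoring it is a one-liner:

# Accuracy of the refitted best pipeline on the held-out test set
test_accuracy = gridsearch.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")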

Of course, grid searching can already take quite some time, and by grid searching not only over the classifier but also over other pipeline steps, the grid search can take even longer. There are a few methods to keep the extra time it takes to a minimum:

  • First grid search over the classifier's parameters and then over other steps such as feature engineering steps, or vice versa, depending on the situation.
  • Use extra-datascience-tools' filter_tried_params to prevent duplicate parameter settings across grid searches.
  • Use scikit-learn's HalvingGridSearchCV or HalvingRandomSearchCV instead of a GridSearchCV (both are still in the experimental phase; see the sketch after this list).
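For that last option, note that the halving searches still live behind scikit-learn's experimental import flag. A minimal sketch with HalvingGridSearchCV, reusing the pipeline and grid from above:

# The halving searches are experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(model, param_grid=model_param_grid)
halving_search.fit(X_train, y_train)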

Besides using grid searching to optimize a classifier such as a decision tree, we saw that you can actually optimize virtually any step of a machine learning pipeline using extra-datascience-tools' EstimatorSwitch, e.g. by grid searching over the imputation strategy. Other pipeline steps worth grid searching over, besides the imputation strategy and the classifier itself, include feature engineering choices such as the categorical encoding method mentioned at the start of this article.
