Early stopping feature selection
Model performance during sequential forward feature selection often reaches a plateau before all features are selected. To save time, it makes sense to stop the feature selection process once the relevant features have been selected. A variety of stopping criteria might make sense, depending on the use case.
Here we show how to implement and use a custom early stopping criterion.
[2]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from eobox.ml import ffgs
Dataset
Let us define a dataset with useful and uninformative features.
[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target
Full selection
Let's start by running through the full selection process and visualizing the feature learning curve.
[4]:
knn = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1])
fsel = fsel.fit(X, y)
plot_sfs(fsel.get_metric_dict())
[4]:
(<Figure size 432x288 with 1 Axes>,
<AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)
Custom early stopping function
Example implementation
To implement early stopping, we need a function that takes the metric dictionary (the output of ForwardFeatureGroupSelection.get_metric_dict()) as input and returns a boolean indicating whether to stop (True) or continue (False) the feature selection. This function is evaluated internally after every iteration, with the metric dictionary as filled up to that point.
For example, a stopping criterion could be to stop if the mean performance does not increase by more than 0.5% (p) for 3 (n) successive iterations.
[5]:
p = 0.005
n = 3
One way to implement this is as follows (note that sfs_metrics_to_dataframe will be available at execution time and can be used inside the custom function).
[6]:
def early_stop(metrics_dict, p=0.005, n=3, verbosity=0):
    # Convert the metric dictionary into a data frame with one row per iteration.
    metrics_df = ffgs.sfs_metrics_to_dataframe(metrics_dict)
    # Relative change of the mean score compared to the previous iteration.
    metrics_df["perf_change"] = metrics_df["avg_score"].pct_change()
    # True where the improvement is at most p (the NaN in the first row compares as False).
    metrics_df["perf_change_let_p"] = metrics_df["perf_change"] <= p
    # Length of the current run of successive iterations with an improvement <= p.
    metrics_df["perf_change_let_p_n_successive"] = (
        metrics_df["perf_change_let_p"]
        .groupby(metrics_df["perf_change_let_p"].eq(0).cumsum())
        .cumsum()
        .tolist()
    )
    stop = metrics_df.loc[metrics_df.index[-1], "perf_change_let_p_n_successive"] >= n
    if verbosity > 0 and stop:
        print("Stopping criteria met!")
    if verbosity > 1 and stop:
        display(metrics_df[["avg_score", "perf_change", "perf_change_let_p", "perf_change_let_p_n_successive"]])
    return stop
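To make the rule itself easier to follow, here is a minimal sketch of the same criterion in plain Python, without pandas or the eobox helper. It is not part of eobox: the name early_stop_plain is hypothetical, and the sketch assumes that the metric dictionary is keyed by sortable iteration numbers and that every entry carries an "avg_score" value. The scores used to exercise it are example values.

```python
def early_stop_plain(metrics_dict, p=0.005, n=3):
    """Return True if the last n relative score changes were all <= p."""
    # Order the mean scores by iteration key (assumed sortable, e.g. 1, 2, 3, ...).
    scores = [metrics_dict[k]["avg_score"] for k in sorted(metrics_dict)]
    # Relative change between consecutive iterations (assumes non-zero scores).
    changes = [(cur - prev) / prev for prev, cur in zip(scores, scores[1:])]
    # Fewer than n changes observed so far: there cannot be a run of length n yet.
    if len(changes) < n:
        return False
    return all(change <= p for change in changes[-n:])


# Example scores, one per iteration.
scores = [0.960000, 0.966667, 0.973333, 0.973333, 0.966667, 0.926667]
metrics = {i: {"avg_score": s} for i, s in enumerate(scores, start=1)}

# Evaluate the criterion on the dictionary as it grows, one iteration at a time.
stops = [
    early_stop_plain({k: metrics[k] for k in range(1, i + 1)})
    for i in range(1, len(scores) + 1)
]
print(stops)  # [False, False, False, False, False, True]
```

Checking only the last n changes is equivalent to counting the current run of small improvements, because the run must end at the most recent iteration to trigger a stop.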
Note that the metrics dictionary will have one more element after every iteration, i.e. after every additionally selected feature. The following shows the early stopping evaluation process:
[7]:
full_metric_dict = fsel.get_metric_dict()
metric_dict_after_iter_i = {}
for i, last_md_key in enumerate(full_metric_dict.keys()):
    print(f"i={i}", end=" - ")
    metric_dict_after_iter_i[last_md_key] = full_metric_dict[last_md_key]
    stop = early_stop(metric_dict_after_iter_i, p=0.005, n=3, verbosity=2)
    #print(f"STOP: {stop}")
    if stop:
        break
i=0 - i=1 - i=2 - i=3 - i=4 - i=5 - Stopping criteria met!
| iter | avg_score | perf_change | perf_change_let_p | perf_change_let_p_n_successive |
|---|---|---|---|---|
| 1 | 0.960000 | NaN | False | 0 |
| 2 | 0.966667 | 0.006944 | False | 0 |
| 3 | 0.973333 | 0.006897 | False | 0 |
| 4 | 0.973333 | 0.000000 | True | 1 |
| 5 | 0.966667 | -0.006849 | True | 2 |
| 6 | 0.926667 | -0.041379 | True | 3 |
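The run counter in the last column is the least obvious part of the function. The following sketch isolates that idiom on the avg_score values from the table above, using only pandas. The variable names are illustrative, and the explicit cast to int is an addition here to keep the cumulative sum numeric.

```python
import pandas as pd

p = 0.005
avg_score = pd.Series(
    [0.960000, 0.966667, 0.973333, 0.973333, 0.966667, 0.926667],
    index=pd.RangeIndex(1, 7, name="iter"),
)

# Relative change to the previous iteration; the first entry is NaN.
change = avg_score.pct_change()
# NaN <= p evaluates to False, so the first iteration never counts towards a run.
below = change <= p
# Every False starts a new group, so each run of True values shares one group id.
run_id = (~below).cumsum()
# Cumulative sum within each group gives the length of the current run.
run_length = below.astype(int).groupby(run_id).cumsum()

print(run_length.tolist())  # [0, 0, 0, 1, 2, 3]
```

The selection stops as soon as the last value of this counter reaches n, which happens at iteration 6 for n = 3.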
Usage
As we can see from the output and the plot, the last iterations are no longer run.
[8]:
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1], verbose=True)
fsel = fsel.fit(X, y, custom_early_stop=early_stop)
plot_sfs(fsel.get_metric_dict())
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s finished
Features: 1/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.1s finished
Features: 3/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished
Features: 6/8
Stopping early due to custom early stopping criteria.
[8]:
(<Figure size 432x288 with 1 Axes>,
<AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)