Early stopping feature selection

Model performance during sequential forward feature selection often reaches a plateau before all features have been selected. To save time, it makes sense to stop the selection process once the relevant features have been selected. A variety of stopping criteria could be sensible.

Here we show how to implement and use a custom early stopping criterion.

[2]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

from eobox.ml import ffgs

Dataset

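Let us define a dataset with informative and uninformative features: the four iris measurements plus four noise columns, each drawn uniformly from the value range of the corresponding iris feature.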

[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target

Full selection

Let's start by running through the full selection process and visualizing the feature learning curve.

[4]:
knn = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1])
fsel = fsel.fit(X, y)
plot_sfs(fsel.get_metric_dict())
[4]:
(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)
[Figure: performance versus number of features for the full selection]

Custom early stopping function

Example implementation

To implement early stopping, we need a function that takes the metric dictionary (the output of ForwardFeatureGroupSelection.get_metric_dict()) as input and returns a boolean that indicates whether to stop (True) or continue (False) the feature selection. This function is evaluated internally after every iteration, with the metric dictionary as filled up to that point.

For example, a stopping criterion could be to stop if the mean performance does not increase by more than 0.5% (p) in 3 (n) successive iterations.

[5]:
p = 0.005
n = 3

One way to implement this is as follows (note that sfs_metrics_to_dataframe will be available at execution time and can be used inside the custom function).

[6]:
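def early_stop(metrics_dict, p=0.005, n=3, verbosity=0):
    """Return True if the mean score rose by at most p in each of the last n iterations."""
    metrics_df = ffgs.sfs_metrics_to_dataframe(metrics_dict)
    # Relative change of the mean score with respect to the previous iteration.
    metrics_df["perf_change"] = metrics_df["avg_score"].pct_change()
    # True where the change is less than or equal to p (the NaN of the first row compares as False).
    metrics_df["perf_change_let_p"] = metrics_df["perf_change"] <= p

    # Count successive True values; the counter resets to 0 at every False.
    reset_groups = metrics_df["perf_change_let_p"].eq(0).cumsum()
    metrics_df["perf_change_let_p_n_successive"] = (
        metrics_df["perf_change_let_p"].groupby(reset_groups).cumsum().tolist()
    )
    # Stop once the count in the most recent iteration has reached n.
    stop = metrics_df.loc[metrics_df.index[-1], "perf_change_let_p_n_successive"] >= n

    if verbosity > 0 and stop:
        print("Stopping criteria met!")
    if verbosity > 1 and stop:
        # display is provided by the notebook environment.
        display(metrics_df[["avg_score", "perf_change", "perf_change_let_p", "perf_change_let_p_n_successive"]])
    return stop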
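The counting step in early_stop is compact but not easy to read. The following is a minimal, dependency-free sketch of the same rule, not the library's implementation: the helper name successive_small_changes is hypothetical, and the scores are the rounded avg_score values of the first six iterations of this example.

```python
import math


def successive_small_changes(scores, p=0.005):
    """For each iteration, count how many consecutive iterations ending there
    had a relative score change of at most p (hypothetical helper)."""
    counts = []
    run = 0
    prev = None
    for score in scores:
        # The first iteration has no predecessor, so its change is undefined.
        # Assumes non-zero scores; a zero score would raise ZeroDivisionError.
        change = math.nan if prev is None else score / prev - 1
        # NaN <= p is False, so the first iteration always resets the counter.
        run = run + 1 if change <= p else 0
        counts.append(run)
        prev = score
    return counts


avg_scores = [0.960000, 0.966667, 0.973333, 0.973333, 0.966667, 0.926667]
counts = successive_small_changes(avg_scores, p=0.005)
stop = counts[-1] >= 3
print(counts, stop)
```

Any improvement larger than p resets the counter, so the criterion is only met after n iterations in a row with little or no gain.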

Note that the metrics dictionary gains one element after every iteration, i.e. after every additionally selected feature. The following replays the full selection step by step to show how the early stopping function is evaluated:

[7]:
full_metric_dict = fsel.get_metric_dict()

metric_dict_after_iter_i = {}
for i, last_md_key in enumerate(full_metric_dict.keys()):
    print(f"i={i}", end=" - ")
    metric_dict_after_iter_i[last_md_key] = full_metric_dict[last_md_key]
    stop = early_stop(metric_dict_after_iter_i, p=0.005, n=3, verbosity=2)
    #print(f"STOP: {stop}")
    if stop:
        break
i=0 - i=1 - i=2 - i=3 - i=4 - i=5 - Stopping criteria met!
      avg_score  perf_change  perf_change_let_p  perf_change_let_p_n_successive
iter
1      0.960000          NaN              False                               0
2      0.966667     0.006944              False                               0
3      0.973333     0.006897              False                               0
4      0.973333     0.000000               True                               1
5      0.966667    -0.006849               True                               2
6      0.926667    -0.041379               True                               3
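Other criteria can follow the same contract of taking the metric dictionary and returning a boolean. As a hedged sketch, the function below stops as soon as the latest mean score reaches a target value. The name stop_at_target_score and the target of 0.97 are hypothetical, and the code assumes that the dictionary is keyed by increasing iteration numbers and that each entry holds an "avg_score" value, which matches the columns shown above but is not confirmed for every version of the library. The toy dictionary is built by hand from the first four mean scores.

```python
def stop_at_target_score(metrics_dict, target=0.97):
    """Return True once the most recent iteration reaches the target score
    (hypothetical criterion)."""
    # Assumption: keys increase with every iteration and each value
    # contains an "avg_score" entry.
    last_key = max(metrics_dict)
    return metrics_dict[last_key]["avg_score"] >= target


# Replay a hand-built metric dictionary one iteration at a time.
toy_metrics = {}
decisions = []
for k, score in enumerate([0.960000, 0.966667, 0.973333, 0.973333], start=1):
    toy_metrics[k] = {"avg_score": score}
    decisions.append(stop_at_target_score(toy_metrics))
    if decisions[-1]:
        break

print(decisions)
```

If these assumptions hold, such a function could presumably be passed to fit through custom_early_stop in the same way as early_stop.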

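Usage

To use the criterion, pass the function to fit via custom_early_stop. As the output and the plot show, the selection stops after six of the eight features and the remaining iterations are not run.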

[8]:
fsel = ffgs.ForwardFeatureGroupSelection(knn, k_features=X.shape[1], verbose=True)
fsel = fsel.fit(X, y, custom_early_stop=early_stop)
plot_sfs(fsel.get_metric_dict())
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.1s finished
Features: 1/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.1s finished
Features: 3/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s finished
Features: 6/8
Stopping early due to custom early stopping criteria.
[8]:
(<Figure size 432x288 with 1 Axes>,
 <AxesSubplot:xlabel='Number of Features', ylabel='Performance'>)
[Figure: performance versus number of features with early stopping after 6 of 8 features]