SequentialFeatureSelector analysis utilities

The functions shown here facilitate the analysis of feature selection results. Here they are applied to a trained ForwardFeatureGroupSelection (FFGS) instance, but they can also be used with a trained SequentialFeatureSelector (SFS) instance.

[2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

# from mlxtend import feature_selection

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris


from eobox.ml import ffgs

Dataset

[3]:
iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target

Feature selection

[4]:
mod = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(mod,
                                         k_features=X.shape[1],
                                         scoring='accuracy',
                                         verbose=1)
fgroups = [0, 0, 1, 1, 2, 3, 3, 3]
#fgroups = None
fsel = fsel.fit(X, y, custom_feature_names=X.columns, fgroups=fgroups)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Features: 8/8

SFS - metrics dataframes

Default

[5]:
sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, None)
sfs_metrics_df
[5]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm))
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm))
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,)
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2)
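
The std_dev, std_err and ci_bound columns can be related to the per-fold scores with a few lines of NumPy and SciPy. The sketch below uses the fold scores of iteration 1 as they are printed in the wide dataframe further down. The formulas are an assumption about how the statistics are computed (population standard deviation, standard error of the mean, and a 95 % bound from a t-distribution with as many degrees of freedom as there are folds); they agree with the table to the printed precision, but the ffgs implementation itself is not shown here.

```python
import numpy as np
from scipy import stats

# Fold accuracies of iteration 1, copied from the cv_score_* columns (rounded as printed).
cv_scores = np.array([0.966667, 0.966667, 0.933333, 0.966667, 1.0])

avg_score = cv_scores.mean()
std_dev = cv_scores.std()        # population standard deviation (ddof=0)
std_err = stats.sem(cv_scores)   # standard error of the mean (uses ddof=1)

# Assumption: two-sided 95 % bound, t-distribution with len(cv_scores) degrees of freedom.
ci_bound = std_err * stats.t.ppf(0.975, len(cv_scores))

print(round(avg_score, 6), round(std_dev, 6), round(std_err, 6), round(ci_bound, 6))
```

Read this way, avg_score ± ci_bound gives a rough interval for the mean accuracy at each iteration, which helps judge whether the difference between two iterations is larger than the fold-to-fold noise.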

Wide dataframe

The cross-validation scores are added as extra columns: the dataframe keeps one row per selection iteration and gains K additional columns (cv_score_0, cv_score_1, ...), one for each of the K folds.

[6]:
sfs_metrics_df_wide = ffgs.sfs_metrics_to_dataframe(fsel, "wide")
sfs_metrics_df_wide
[6]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score_0 cv_score_1 cv_score_2 cv_score_3 cv_score_4
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667 0.966667 0.933333 0.966667 1.0
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667 0.966667 0.966667 0.966667 1.0
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667 1.000000 0.933333 0.866667 1.0
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667 0.933333 0.900000 0.933333 0.8
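
The wide layout can be approximated with plain pandas, which may help when the metrics come from another source. This is a minimal sketch on a hand-built toy dataframe (two iterations, values copied from the output above); it illustrates the reshaping idea and is not the code used by sfs_metrics_to_dataframe.

```python
import pandas as pd

# Toy stand-in for the default metrics dataframe: one list of fold scores per iteration.
df = pd.DataFrame(
    {
        "avg_score": [0.966667, 0.973333],
        "cv_scores": [
            [0.966667, 0.966667, 0.933333, 0.966667, 1.0],
            [0.966667, 0.966667, 0.966667, 0.966667, 1.0],
        ],
    },
    index=pd.Index([1, 2], name="iter"),
)

# Turn each list into one row of a new frame (one column per fold), then join it back.
folds = pd.DataFrame(df["cv_scores"].tolist(), index=df.index).add_prefix("cv_score_")
wide = df.join(folds)

print(wide.columns.tolist())
```

The wide form is convenient for line plots, because each fold becomes its own series.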

Long dataframe

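The cross-validation scores are added as extra rows: each selection iteration is repeated K times, once per fold, and the score of that fold is stored in the cv_score column.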

[7]:
sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, "long")
sfs_metrics_df_long
[7]:
iter fold n_features feature_idx avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score
1-0 1 0 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-1 1 1 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-2 1 2 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.933333
1-3 1 3 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667
1-4 1 4 2 (2, 3) 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 1.000000
2-0 2 0 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-1 2 1 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-2 2 2 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-3 2 3 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667
2-4 2 4 4 (0, 1, 2, 3) 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 1.000000
3-0 3 0 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667
3-1 3 1 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 1.000000
3-2 3 2 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.933333
3-3 3 3 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.866667
3-4 3 4 5 (0, 1, 2, 3, 4) 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 1.000000
4-0 4 0 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667
4-1 4 1 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.933333
4-2 4 2 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.900000
4-3 4 3 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.933333
4-4 4 4 8 (0, 1, 2, 3, 4, 5, 6, 7) 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.800000
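
The long layout corresponds to exploding the list of fold scores into one row per fold. The following is a minimal pandas sketch on a toy dataframe (values copied from the output above); the "iter-fold" index labels mimic the output shown, but the steps are an illustration rather than the ffgs implementation.

```python
import pandas as pd

# Toy stand-in for the default metrics dataframe: one list of fold scores per iteration.
df = pd.DataFrame(
    {
        "avg_score": [0.966667, 0.973333],
        "cv_scores": [
            [0.966667, 0.966667, 0.933333, 0.966667, 1.0],
            [0.966667, 0.966667, 0.966667, 0.966667, 1.0],
        ],
    },
    index=pd.Index([1, 2], name="iter"),
)

# One row per (iteration, fold): explode the lists and number the folds within each iteration.
long = (
    df.explode("cv_scores")
    .rename(columns={"cv_scores": "cv_score"})
    .reset_index()
)
long["fold"] = long.groupby("iter").cumcount()
long["cv_score"] = long["cv_score"].astype(float)  # explode leaves an object column

# Build "iter-fold" labels such as "1-0" for the index.
long.index = long["iter"].astype(str) + "-" + long["fold"].astype(str)

print(long[["iter", "fold", "avg_score", "cv_score"]])
```

The long form is the shape that seaborn expects for categorical plots such as boxplots and swarm plots.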

Plotting

For a forward feature group selection, it makes sense to add columns with the group IDs and group names, which can then be used as labels when plotting.

[8]:
fgroup_names = {0: "sepal", 1: "petal", 2: "noinfo-A", 3: "noinfo-B"}

sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="wide", fgroups=fgroups, fgroup_names=fgroup_names)
sfs_metrics_df
[8]:
n_features feature_idx cv_scores avg_score feature_names ci_bound std_dev std_err feature_idx_new feature_names_new cv_score_0 cv_score_1 cv_score_2 cv_score_3 cv_score_4 feature_groups feature_group_new feature_group_names feature_group_name_new
iter
1 2 (2, 3) [0.9666666666666667, 0.9666666666666667, 0.933... 0.966667 (petal length (cm), petal width (cm)) 0.027096 0.021082 0.010541 (2, 3) (petal length (cm), petal width (cm)) 0.966667 0.966667 0.933333 0.966667 1.0 (1,) 1 (petal,) petal
2 4 (0, 1, 2, 3) [0.9666666666666667, 0.9666666666666667, 0.966... 0.973333 (sepal length (cm), sepal width (cm), petal le... 0.017137 0.013333 0.006667 (0, 1) (sepal length (cm), sepal width (cm)) 0.966667 0.966667 0.966667 0.966667 1.0 (0, 1) 0 (petal, sepal) sepal
3 5 (0, 1, 2, 3, 4) [0.9666666666666667, 1.0, 0.9333333333333333, ... 0.953333 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (4,) (noinfo-0,) 0.966667 1.000000 0.933333 0.866667 1.0 (0, 1, 2) 2 (noinfo-A, petal, sepal) noinfo-A
4 8 (0, 1, 2, 3, 4, 5, 6, 7) [0.8666666666666667, 0.9333333333333333, 0.9, ... 0.886667 (sepal length (cm), sepal width (cm), petal le... 0.064122 0.049889 0.024944 (5, 6, 7) (noinfo-1, noinfo-3, noinfo-2) 0.866667 0.933333 0.900000 0.933333 0.8 (0, 1, 2, 3) 3 (noinfo-A, noinfo-B, petal, sepal) noinfo-B
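
The relation between feature_idx_new and feature_group_name_new can be spelled out with the fgroups list and the fgroup_names dictionary from above. The helper group_of below is hypothetical and only illustrates the lookup; it assumes that all features added in one iteration belong to the same group.

```python
fgroups = [0, 0, 1, 1, 2, 3, 3, 3]
fgroup_names = {0: "sepal", 1: "petal", 2: "noinfo-A", 3: "noinfo-B"}

# Feature indices added in each iteration, copied from the feature_idx_new column.
feature_idx_new = [(2, 3), (0, 1), (4,), (5, 6, 7)]


def group_of(feature_idx):
    """Return the single group ID shared by all given feature indices (hypothetical helper)."""
    groups = {fgroups[i] for i in feature_idx}
    if len(groups) != 1:
        raise ValueError(f"features {feature_idx} span several groups: {sorted(groups)}")
    return groups.pop()


group_ids = [group_of(idx) for idx in feature_idx_new]
group_names = [fgroup_names[g] for g in group_ids]

print(group_ids)
print(group_names)
```

The resulting lists line up with the feature_group_new and feature_group_name_new columns in the output above.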

Cross-validation accuracies and mean

[9]:
colnames_cv_scores = sfs_metrics_df.columns[sfs_metrics_df.columns.str.startswith("cv_score_")]
ax = sfs_metrics_df[colnames_cv_scores].plot(style='-o',
                                             xticks=sfs_metrics_df.index,
                                             figsize=(18, 6))
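# Draw the mean accuracy in black on the same axes as the per-fold scores.
sfs_metrics_df.avg_score.plot(ax=ax, style='-o', c="k")
# Label each iteration with the features it added; keep `ax` bound to the Axes.
_ = ax.set_xticklabels(sfs_metrics_df.feature_names_new)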
[Figure: cross-validation accuracy per fold and mean accuracy (black) at each selection iteration]

Cross-validation accuracies as boxplots

[10]:
sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="long", fgroups=fgroups, fgroup_names=fgroup_names)

fig, ax = plt.subplots(figsize=(18, 6))
ax = sns.boxplot(x="feature_group_name_new", y="cv_score", color="w", data=sfs_metrics_df_long)
ax = sns.swarmplot(x="feature_group_name_new", y="cv_score", hue="fold", data=sfs_metrics_df_long)
[Figure: boxplots of the cross-validation accuracies per newly added feature group, with the individual fold scores overlaid]