`SequentialFeatureSelector` analysis utilities¶

The functions shown here are meant to facilitate the analysis of feature selection results. Here they are applied to a trained ForwardFeatureGroupSelection (FFGS) instance, but they can also be used with a trained SequentialFeatureSelector (SFS) instance.

[2]:

%matplotlib inline
%load_ext autoreload
%autoreload 2

# from mlxtend import feature_selection

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris


from eobox.ml import ffgs

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Dataset¶

[3]:

iris = load_iris(as_frame=True)
X_informative = iris.data
np.random.seed(0)
X_uninformative = pd.DataFrame(np.random.uniform(X_informative.min(),
                                                 X_informative.max(),
                                                 size=X_informative.shape))
X_uninformative.columns = [f"noinfo-{i}" for i in range(X_uninformative.shape[1])]
X = pd.concat([X_informative, X_uninformative], axis=1)
y = iris.target

Feature selection¶

[4]:

mod = KNeighborsClassifier(n_neighbors=4)
fsel = ffgs.ForwardFeatureGroupSelection(mod,
                                         k_features=X.shape[1],
                                         scoring='accuracy',
                                         verbose=1)
fgroups = [0, 0, 1, 1, 2, 3, 3, 3]
#fgroups = None
fsel = fsel.fit(X, y, custom_feature_names=X.columns, fgroups=fgroups)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s finished
Features: 2/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s finished
Features: 4/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Features: 5/8[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
Features: 8/8
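
The progress log above (2/8, 4/8, 5/8, 8/8) shows that whole feature groups, not single features, are added in each iteration. The greedy idea behind this can be sketched as follows. This is only an illustration, not the eobox implementation: `forward_group_selection`, `score` and `toy_score` are hypothetical names, and the toy scores are made up instead of being cross-validated accuracies.

```python
def forward_group_selection(fgroups, score):
    """Greedily add the feature group that yields the best score.

    fgroups: list with one group id per feature column.
    score:   hypothetical callable mapping a tuple of feature
             indices to a number (higher is better).
    """
    remaining = sorted(set(fgroups))
    selected = []   # feature indices chosen so far
    history = []    # (group id, feature indices, score) per iteration
    while remaining:
        candidates = []
        for g in remaining:
            members = [i for i, gi in enumerate(fgroups) if gi == g]
            idx = tuple(sorted(selected + members))
            candidates.append((score(idx), g, idx))
        # Best score wins; ties go to the smaller group id.
        best_score, best_group, best_idx = max(
            candidates, key=lambda c: (c[0], -c[1])
        )
        selected = list(best_idx)
        remaining.remove(best_group)
        history.append((best_group, best_idx, best_score))
    return history


def toy_score(idx):
    # Made-up scoring: columns 2 and 3 help most, columns 0 and 1 help
    # a little, and every other column (4..7) costs a bit.
    gain = {0: 0.01, 1: 0.01, 2: 0.45, 3: 0.45}
    return sum(gain.get(i, -0.02) for i in idx)


history = forward_group_selection([0, 0, 1, 1, 2, 3, 3, 3], toy_score)
for group, idx, s in history:
    print(group, idx, round(s, 2))
```

With these toy scores the groups are picked in the order 1, 0, 2, 3, so the number of selected features grows as 2, 4, 5, 8, the same pattern as in the log.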

SFS - metrics dataframes¶

Default¶

[5]:

sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, None)
sfs_metrics_df

[5]:

	n_features	feature_idx	cv_scores	avg_score	feature_names	ci_bound	std_dev	std_err	feature_idx_new	feature_names_new
iter
1	2	(2, 3)	[0.9666666666666667, 0.9666666666666667, 0.933...	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))
2	4	(0, 1, 2, 3)	[0.9666666666666667, 0.9666666666666667, 0.966...	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))
3	5	(0, 1, 2, 3, 4)	[0.9666666666666667, 1.0, 0.9333333333333333, ...	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)
4	8	(0, 1, 2, 3, 4, 5, 6, 7)	[0.8666666666666667, 0.9333333333333333, 0.9, ...	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)
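
The `std_dev` and `std_err` columns can be checked by hand from the fold scores. The sketch below recomputes them for iteration 1 using only the standard library. That `std_dev` is a population standard deviation and `std_err` a standard error of the mean is an inference from the displayed numbers, not a statement about how the library computes them.

```python
import math
import statistics

# Fold accuracies of iteration 1, written as exact fractions of the
# values displayed in the table (0.9667 = 29/30, 0.9333 = 28/30).
cv_scores = [29 / 30, 29 / 30, 28 / 30, 29 / 30, 1.0]

avg_score = statistics.mean(cv_scores)
# Population standard deviation (divides by k).
std_dev = statistics.pstdev(cv_scores)
# Sample standard deviation (divides by k - 1), scaled by sqrt(k).
std_err = statistics.stdev(cv_scores) / math.sqrt(len(cv_scores))

print(round(avg_score, 6), round(std_dev, 6), round(std_err, 6))
```

The rounded results agree with row 1 of the table (0.966667, 0.021082, 0.010541). The ratio `ci_bound / std_err` in that row is roughly 2.57, which looks like a Student t multiplier for a 95% interval, but the exact definition is an assumption here.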

Wide dataframe¶

The cross-validation scores are expanded into additional columns: one row per iteration (that is, per selected feature or feature group) and K additional columns, one per fold.

[6]:

sfs_metrics_df_wide = ffgs.sfs_metrics_to_dataframe(fsel, "wide")
sfs_metrics_df_wide

[6]:

	n_features	feature_idx	cv_scores	avg_score	feature_names	ci_bound	std_dev	std_err	feature_idx_new	feature_names_new	cv_score_0	cv_score_1	cv_score_2	cv_score_3	cv_score_4
iter
1	2	(2, 3)	[0.9666666666666667, 0.9666666666666667, 0.933...	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.966667	0.966667	0.933333	0.966667	1.0
2	4	(0, 1, 2, 3)	[0.9666666666666667, 0.9666666666666667, 0.966...	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667	0.966667	0.966667	0.966667	1.0
3	5	(0, 1, 2, 3, 4)	[0.9666666666666667, 1.0, 0.9333333333333333, ...	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	0.966667	1.000000	0.933333	0.866667	1.0
4	8	(0, 1, 2, 3, 4, 5, 6, 7)	[0.8666666666666667, 0.9333333333333333, 0.9, ...	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.866667	0.933333	0.900000	0.933333	0.8
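
The wide layout shown above can be reproduced with plain pandas. The following is a minimal sketch on a made-up two-row frame; it only illustrates the reshaping and is not necessarily how `sfs_metrics_to_dataframe` does it internally.

```python
import pandas as pd

# Hypothetical miniature of the default metrics dataframe.
df = pd.DataFrame(
    {"avg_score": [0.95, 0.90], "cv_scores": [[0.9, 1.0], [0.8, 1.0]]},
    index=pd.Index([1, 2], name="iter"),
)

# Turn each list of fold scores into one column per fold,
# keeping the iteration index so the join lines up.
folds = pd.DataFrame(df["cv_scores"].tolist(), index=df.index)
folds.columns = [f"cv_score_{i}" for i in folds.columns]

wide = df.join(folds)
print(wide)
```

The original `cv_scores` column is kept, as in the table above, so the wide frame is a strict extension of the default one.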

Long dataframe¶

The cross-validation scores are expanded into additional rows: each iteration is repeated K times, once per fold.

[7]:

sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, "long")
sfs_metrics_df_long

[7]:

	iter	fold	n_features	feature_idx	avg_score	feature_names	ci_bound	std_dev	std_err	feature_idx_new	feature_names_new	cv_score
1-0	1	0	2	(2, 3)	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.966667
1-1	1	1	2	(2, 3)	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.966667
1-2	1	2	2	(2, 3)	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.933333
1-3	1	3	2	(2, 3)	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.966667
1-4	1	4	2	(2, 3)	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	1.000000
2-0	2	0	4	(0, 1, 2, 3)	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667
2-1	2	1	4	(0, 1, 2, 3)	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667
2-2	2	2	4	(0, 1, 2, 3)	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667
2-3	2	3	4	(0, 1, 2, 3)	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667
2-4	2	4	4	(0, 1, 2, 3)	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	1.000000
3-0	3	0	5	(0, 1, 2, 3, 4)	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	0.966667
3-1	3	1	5	(0, 1, 2, 3, 4)	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	1.000000
3-2	3	2	5	(0, 1, 2, 3, 4)	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	0.933333
3-3	3	3	5	(0, 1, 2, 3, 4)	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	0.866667
3-4	3	4	5	(0, 1, 2, 3, 4)	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	1.000000
4-0	4	0	8	(0, 1, 2, 3, 4, 5, 6, 7)	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.866667
4-1	4	1	8	(0, 1, 2, 3, 4, 5, 6, 7)	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.933333
4-2	4	2	8	(0, 1, 2, 3, 4, 5, 6, 7)	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.900000
4-3	4	3	8	(0, 1, 2, 3, 4, 5, 6, 7)	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.933333
4-4	4	4	8	(0, 1, 2, 3, 4, 5, 6, 7)	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.800000
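
The long layout shown above can be sketched with `DataFrame.explode`. The frame below is made up, and the index labels of the form `iter-fold` simply mimic the displayed output; the actual implementation may differ.

```python
import pandas as pd

# Hypothetical miniature of the default metrics dataframe.
df = pd.DataFrame(
    {"avg_score": [0.95, 0.90], "cv_scores": [[0.9, 1.0], [0.8, 1.0]]},
    index=pd.Index([1, 2], name="iter"),
)

# One row per (iteration, fold): explode the score lists into rows.
long = (
    df.explode("cv_scores")
    .rename(columns={"cv_scores": "cv_score"})
    .reset_index()
)
long["cv_score"] = long["cv_score"].astype(float)

# Number the folds within each iteration and build "iter-fold" labels.
long["fold"] = long.groupby("iter").cumcount()
long.index = long["iter"].astype(str) + "-" + long["fold"].astype(str)
print(long)
```

This shape is the one seaborn expects for plots such as box plots, where one column holds the value and other columns hold the grouping variables.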

Plotting¶

For a forward feature group selection, it helps to add columns with the group ids and group names for plotting.

[8]:

fgroup_names = {0: "sepal", 1: "petal", 2: "noinfo-A", 3: "noinfo-B"}

sfs_metrics_df = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="wide", fgroups=fgroups, fgroup_names=fgroup_names)
sfs_metrics_df

[8]:

	n_features	feature_idx	cv_scores	avg_score	feature_names	ci_bound	std_dev	std_err	feature_idx_new	feature_names_new	cv_score_0	cv_score_1	cv_score_2	cv_score_3	cv_score_4	feature_groups	feature_group_new	feature_group_names	feature_group_name_new
iter
1	2	(2, 3)	[0.9666666666666667, 0.9666666666666667, 0.933...	0.966667	(petal length (cm), petal width (cm))	0.027096	0.021082	0.010541	(2, 3)	(petal length (cm), petal width (cm))	0.966667	0.966667	0.933333	0.966667	1.0	(1,)	1	(petal,)	petal
2	4	(0, 1, 2, 3)	[0.9666666666666667, 0.9666666666666667, 0.966...	0.973333	(sepal length (cm), sepal width (cm), petal le...	0.017137	0.013333	0.006667	(0, 1)	(sepal length (cm), sepal width (cm))	0.966667	0.966667	0.966667	0.966667	1.0	(0, 1)	0	(petal, sepal)	sepal
3	5	(0, 1, 2, 3, 4)	[0.9666666666666667, 1.0, 0.9333333333333333, ...	0.953333	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(4,)	(noinfo-0,)	0.966667	1.000000	0.933333	0.866667	1.0	(0, 1, 2)	2	(noinfo-A, petal, sepal)	noinfo-A
4	8	(0, 1, 2, 3, 4, 5, 6, 7)	[0.8666666666666667, 0.9333333333333333, 0.9, ...	0.886667	(sepal length (cm), sepal width (cm), petal le...	0.064122	0.049889	0.024944	(5, 6, 7)	(noinfo-1, noinfo-3, noinfo-2)	0.866667	0.933333	0.900000	0.933333	0.8	(0, 1, 2, 3)	3	(noinfo-A, noinfo-B, petal, sepal)	noinfo-B
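
How the group columns relate to `fgroups` and `fgroup_names` can be sketched in a few lines. The helper names `groups_of` and `names_of` are hypothetical, and the alphabetical ordering of the names is inferred from the displayed table rather than taken from the library.

```python
fgroups = [0, 0, 1, 1, 2, 3, 3, 3]
fgroup_names = {0: "sepal", 1: "petal", 2: "noinfo-A", 3: "noinfo-B"}


def groups_of(feature_idx):
    """Sorted, de-duplicated group ids covered by the feature indices."""
    return tuple(sorted({fgroups[i] for i in feature_idx}))


def names_of(group_ids):
    """Group names in alphabetical order, as in the displayed table."""
    return tuple(sorted(fgroup_names[g] for g in group_ids))


feature_idx = (0, 1, 2, 3, 4)   # all features selected after iteration 3
feature_idx_new = (4,)          # features added in iteration 3

print(groups_of(feature_idx))
print(names_of(groups_of(feature_idx)))
print(names_of(groups_of(feature_idx_new))[0])
```

For iteration 3 this gives the group ids `(0, 1, 2)`, the names `('noinfo-A', 'petal', 'sepal')` and the new group name `noinfo-A`, matching the corresponding row above.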

Cross-validation accuracies and mean¶

[9]:

colnames_cv_scores = sfs_metrics_df.columns[sfs_metrics_df.columns.str.startswith("cv_score_")]
ax = sfs_metrics_df[colnames_cv_scores].plot(style='-o',
                                             xticks=sfs_metrics_df.index,
                                             figsize=(18, 6))
# Draw the mean score on the same axes and label each tick with the newly added features.
sfs_metrics_df.avg_score.plot(style='-o', c="k", ax=ax)
_ = ax.set_xticklabels(sfs_metrics_df.feature_names_new)

[Figure: per-fold cross-validation accuracies (cv_score_0 to cv_score_4) and their mean (black) per iteration; x-axis ticks labeled with the newly added features]

Cross-validation accuracies as boxplots¶

[10]:

sfs_metrics_df_long = ffgs.sfs_metrics_to_dataframe(fsel, explode_cv_scores="long", fgroups=fgroups, fgroup_names=fgroup_names)

fig, ax = plt.subplots(figsize=(18, 6))
ax = sns.boxplot(x="feature_group_name_new", y="cv_score", color="w", data=sfs_metrics_df_long)
ax = sns.swarmplot(x="feature_group_name_new", y="cv_score", hue="fold", data=sfs_metrics_df_long)

[Figure: box plots of the cross-validation accuracies per newly added feature group, with the individual fold scores overlaid as a swarm plot]
