.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/feature_selection/plot_feature_selection.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_feature_selection_plot_feature_selection.py: ============================ Univariate Feature Selection ============================ This notebook is an example of using univariate feature selection to improve classification accuracy on a noisy dataset. In this example, some noisy (non informative) features are added to the iris dataset. Support vector machine (SVM) is used to classify the dataset both before and after applying univariate feature selection. For each feature, we plot the p-values for the univariate feature selection and the corresponding weights of SVMs. With this, we will compare model accuracy and examine the impact of univariate feature selection on model weights. .. GENERATED FROM PYTHON SOURCE LINES 18-22 .. code-block:: Python # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 23-26 Generate sample data -------------------- .. GENERATED FROM PYTHON SOURCE LINES 26-43 .. code-block:: Python import numpy as np from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split # The iris dataset X, y = load_iris(return_X_y=True) # Some noisy data not correlated E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20)) # Add the noisy data to the informative features X = np.hstack((X, E)) # Split dataset to select feature and evaluate the classifier X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0) .. GENERATED FROM PYTHON SOURCE LINES 44-50 Univariate feature selection ---------------------------- Univariate feature selection with F-test for feature scoring. We use the default selection function to select the four most significant features. .. GENERATED FROM PYTHON SOURCE LINES 50-57 .. code-block:: Python from sklearn.feature_selection import SelectKBest, f_classif selector = SelectKBest(f_classif, k=4) selector.fit(X_train, y_train) scores = -np.log10(selector.pvalues_) scores /= scores.max() .. GENERATED FROM PYTHON SOURCE LINES 58-69 .. code-block:: Python import matplotlib.pyplot as plt X_indices = np.arange(X.shape[-1]) plt.figure(1) plt.clf() plt.bar(X_indices - 0.05, scores, width=0.2) plt.title("Feature univariate score") plt.xlabel("Feature number") plt.ylabel(r"Univariate score ($-Log(p_{value})$)") plt.show() .. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png :alt: Feature univariate score :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 70-73 In the total set of features, only the 4 of the original features are significant. We can see that they have the highest score with univariate feature selection. .. GENERATED FROM PYTHON SOURCE LINES 75-79 Compare with SVMs ----------------- Without univariate feature selection .. GENERATED FROM PYTHON SOURCE LINES 79-94 .. code-block:: Python from sklearn.pipeline import make_pipeline from sklearn.preprocessing import MinMaxScaler from sklearn.svm import LinearSVC clf = make_pipeline(MinMaxScaler(), LinearSVC()) clf.fit(X_train, y_train) print( "Classification accuracy without selecting features: {:.3f}".format( clf.score(X_test, y_test) ) ) svm_weights = np.abs(clf[-1].coef_).sum(axis=0) svm_weights /= svm_weights.sum() .. rst-class:: sphx-glr-script-out .. code-block:: none Classification accuracy without selecting features: 0.789 .. GENERATED FROM PYTHON SOURCE LINES 95-96 After univariate feature selection .. GENERATED FROM PYTHON SOURCE LINES 96-107 .. code-block:: Python clf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC()) clf_selected.fit(X_train, y_train) print( "Classification accuracy after univariate feature selection: {:.3f}".format( clf_selected.score(X_test, y_test) ) ) svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0) svm_weights_selected /= svm_weights_selected.sum() .. rst-class:: sphx-glr-script-out .. code-block:: none Classification accuracy after univariate feature selection: 0.868 .. GENERATED FROM PYTHON SOURCE LINES 108-128 .. code-block:: Python plt.bar( X_indices - 0.45, scores, width=0.2, label=r"Univariate score ($-Log(p_{value})$)" ) plt.bar(X_indices - 0.25, svm_weights, width=0.2, label="SVM weight") plt.bar( X_indices[selector.get_support()] - 0.05, svm_weights_selected, width=0.2, label="SVM weights after selection", ) plt.title("Comparing feature selection") plt.xlabel("Feature number") plt.yticks(()) plt.axis("tight") plt.legend(loc="upper right") plt.show() .. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png :alt: Comparing feature selection :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 129-134 Without univariate feature selection, the SVM assigns a large weight to the first 4 original significant features, but also selects many of the non-informative features. Applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features, and will thus improve classification. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 0.170 seconds) .. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/feature_selection/plot_feature_selection.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_feature_selection.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_feature_selection.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_feature_selection.zip ` .. include:: plot_feature_selection.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_