.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/multiclass/plot_multiclass_overview.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_multiclass_plot_multiclass_overview.py: =============================================== Overview of multiclass training meta-estimators =============================================== In this example, we discuss the problem of classification when the target variable is composed of more than two classes. This is called multiclass classification. In scikit-learn, all estimators support multiclass classification out of the box: the most sensible strategy was implemented for the end-user. The :mod:`sklearn.multiclass` module implements various strategies that one can use for experimenting or developing third-party estimators that only support binary classification. :mod:`sklearn.multiclass` includes OvO/OvR strategies used to train a multiclass classifier by fitting a set of binary classifiers (the :class:`~sklearn.multiclass.OneVsOneClassifier` and :class:`~sklearn.multiclass.OneVsRestClassifier` meta-estimators). This example will review them. .. GENERATED FROM PYTHON SOURCE LINES 22-26 .. code-block:: Python # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 27-33 The Yeast UCI dataset --------------------- In this example, we use a UCI dataset [1]_, generally referred as the Yeast dataset. We use the :func:`sklearn.datasets.fetch_openml` function to load the dataset from OpenML. .. GENERATED FROM PYTHON SOURCE LINES 33-37 .. code-block:: Python from sklearn.datasets import fetch_openml X, y = fetch_openml(data_id=181, as_frame=True, return_X_y=True) .. GENERATED FROM PYTHON SOURCE LINES 38-40 To know the type of data science problem we are dealing with, we can check the target for which we want to build a predictive model. .. GENERATED FROM PYTHON SOURCE LINES 40-42 .. code-block:: Python y.value_counts().sort_index() .. rst-class:: sphx-glr-script-out .. code-block:: none class_protein_localization CYT 463 ERL 5 EXC 35 ME1 44 ME2 51 ME3 163 MIT 244 NUC 429 POX 20 VAC 30 Name: count, dtype: int64 .. GENERATED FROM PYTHON SOURCE LINES 43-74 We see that the target is discrete and composed of 10 classes. We therefore deal with a multiclass classification problem. Strategies comparison --------------------- In the following experiment, we use a :class:`~sklearn.tree.DecisionTreeClassifier` and a :class:`~sklearn.model_selection.RepeatedStratifiedKFold` cross-validation with 3 splits and 5 repetitions. We compare the following strategies: * :class:~sklearn.tree.DecisionTreeClassifier can handle multiclass classification without needing any special adjustments. It works by breaking down the training data into smaller subsets and focusing on the most common class in each subset. By repeating this process, the model can accurately classify input data into multiple different classes. * :class:`~sklearn.multiclass.OneVsOneClassifier` trains a set of binary classifiers where each classifier is trained to distinguish between two classes. * :class:`~sklearn.multiclass.OneVsRestClassifier`: trains a set of binary classifiers where each classifier is trained to distinguish between one class and the rest of the classes. * :class:`~sklearn.multiclass.OutputCodeClassifier`: trains a set of binary classifiers where each classifier is trained to distinguish between a set of classes from the rest of the classes. The set of classes is defined by a codebook, which is randomly generated in scikit-learn. This method exposes a parameter `code_size` to control the size of the codebook. We set it above one since we are not interested in compressing the class representation. .. GENERATED FROM PYTHON SOURCE LINES 74-96 .. code-block:: Python import pandas as pd from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate from sklearn.multiclass import ( OneVsOneClassifier, OneVsRestClassifier, OutputCodeClassifier, ) from sklearn.tree import DecisionTreeClassifier cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0) tree = DecisionTreeClassifier(random_state=0) ovo_tree = OneVsOneClassifier(tree) ovr_tree = OneVsRestClassifier(tree) ecoc = OutputCodeClassifier(tree, code_size=2) cv_results_tree = cross_validate(tree, X, y, cv=cv, n_jobs=2) cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2) cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2) cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2) .. GENERATED FROM PYTHON SOURCE LINES 97-99 We can now compare the statistical performance of the different strategies. We plot the score distribution of the different strategies. .. GENERATED FROM PYTHON SOURCE LINES 99-116 .. code-block:: Python from matplotlib import pyplot as plt scores = pd.DataFrame( { "DecisionTreeClassifier": cv_results_tree["test_score"], "OneVsOneClassifier": cv_results_ovo["test_score"], "OneVsRestClassifier": cv_results_ovr["test_score"], "OutputCodeClassifier": cv_results_ecoc["test_score"], } ) ax = scores.plot.kde(legend=True) ax.set_xlabel("Accuracy score") ax.set_xlim([0, 0.7]) _ = ax.set_title( "Density of the accuracy scores for the different multiclass strategies" ) .. image-sg:: /auto_examples/multiclass/images/sphx_glr_plot_multiclass_overview_001.png :alt: Density of the accuracy scores for the different multiclass strategies :srcset: /auto_examples/multiclass/images/sphx_glr_plot_multiclass_overview_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 117-133 At a first glance, we can see that the built-in strategy of the decision tree classifier is working quite well. One-vs-one and the error-correcting output code strategies are working even better. However, the one-vs-rest strategy is not working as well as the other strategies. Indeed, these results reproduce something reported in the literature as in [2]_. However, the story is not as simple as it seems. The importance of hyperparameters search ---------------------------------------- It was later shown in [3]_ that the multiclass strategies would show similar scores if the hyperparameters of the base classifiers are first optimized. Here we try to reproduce such result by at least optimizing the depth of the base decision tree. .. GENERATED FROM PYTHON SOURCE LINES 133-163 .. code-block:: Python from sklearn.model_selection import GridSearchCV param_grid = {"max_depth": [3, 5, 8]} tree_optimized = GridSearchCV(tree, param_grid=param_grid, cv=3) ovo_tree = OneVsOneClassifier(tree_optimized) ovr_tree = OneVsRestClassifier(tree_optimized) ecoc = OutputCodeClassifier(tree_optimized, code_size=2) cv_results_tree = cross_validate(tree_optimized, X, y, cv=cv, n_jobs=2) cv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2) cv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2) cv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2) scores = pd.DataFrame( { "DecisionTreeClassifier": cv_results_tree["test_score"], "OneVsOneClassifier": cv_results_ovo["test_score"], "OneVsRestClassifier": cv_results_ovr["test_score"], "OutputCodeClassifier": cv_results_ecoc["test_score"], } ) ax = scores.plot.kde(legend=True) ax.set_xlabel("Accuracy score") ax.set_xlim([0, 0.7]) _ = ax.set_title( "Density of the accuracy scores for the different multiclass strategies" ) plt.show() .. image-sg:: /auto_examples/multiclass/images/sphx_glr_plot_multiclass_overview_002.png :alt: Density of the accuracy scores for the different multiclass strategies :srcset: /auto_examples/multiclass/images/sphx_glr_plot_multiclass_overview_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 164-205 We can see that once the hyperparameters are optimized, all multiclass strategies have similar performance as discussed in [3]_. Conclusion ---------- We can get some intuition behind those results. First, the reason for which one-vs-one and error-correcting output code are outperforming the tree when the hyperparameters are not optimized relies on fact that they ensemble a larger number of classifiers. The ensembling improves the generalization performance. This is a bit similar why a bagging classifier generally performs better than a single decision tree if no care is taken to optimize the hyperparameters. Then, we see the importance of optimizing the hyperparameters. Indeed, it should be regularly explored when developing predictive models even if techniques such as ensembling help at reducing this impact. Finally, it is important to recall that the estimators in scikit-learn are developed with a specific strategy to handle multiclass classification out of the box. So for these estimators, it means that there is no need to use different strategies. These strategies are mainly useful for third-party estimators supporting only binary classification. In all cases, we also show that the hyperparameters should be optimized. References ---------- .. [1] https://archive.ics.uci.edu/ml/datasets/Yeast .. [2] `"Reducing multiclass to binary: A unifying approach for margin classifiers." Allwein, Erin L., Robert E. Schapire, and Yoram Singer. Journal of machine learning research 1 Dec (2000): 113-141. `_. .. [3] `"In defense of one-vs-all classification." Journal of Machine Learning Research 5 Jan (2004): 101-141. `_. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 18.953 seconds) .. _sphx_glr_download_auto_examples_multiclass_plot_multiclass_overview.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/multiclass/plot_multiclass_overview.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_multiclass_overview.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_multiclass_overview.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_multiclass_overview.zip ` .. include:: plot_multiclass_overview.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_