.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/preprocessing/plot_target_encoder_cross_val.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py>`
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_preprocessing_plot_target_encoder_cross_val.py:


=======================================
Target Encoder's Internal Cross fitting
=======================================

.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` replaces each category of a categorical feature with
the shrunk mean of the target variable for that category. This method is useful
in cases where there is a strong relationship between the categorical feature
and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
an internal :term:`cross fitting` scheme to encode the training data to be used
by a downstream model. This scheme involves splitting the data into *k* folds
and encoding each fold using the encodings learned from the other *k-1* folds.
In this example, we demonstrate the importance of the cross fitting
procedure to prevent overfitting.

.. GENERATED FROM PYTHON SOURCE LINES 18-22

.. code-block:: Python


    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause


.. GENERATED FROM PYTHON SOURCE LINES 23-32

Create Synthetic Dataset
========================
For this example, we build a dataset with three categorical features:

* an informative feature with medium cardinality ("informative")
* an uninformative feature with medium cardinality ("shuffled")
* an uninformative feature with high cardinality ("near_unique")

First, we generate the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 32-57

.. code-block:: Python

    import numpy as np

    from sklearn.preprocessing import KBinsDiscretizer

    n_samples = 50_000

    rng = np.random.RandomState(42)
    y = rng.randn(n_samples)
    noise = 0.5 * rng.randn(n_samples)
    n_categories = 100

    kbins = KBinsDiscretizer(
        n_bins=n_categories,
        encode="ordinal",
        strategy="uniform",
        random_state=rng,
        subsample=None,
    )
    X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))

    # Remove the linear relationship between y and the bin index by permuting the
    # values of X_informative:
    permuted_categories = rng.permutation(n_categories)
    X_informative = permuted_categories[X_informative.astype(np.int32)]


.. GENERATED FROM PYTHON SOURCE LINES 58-60

The uninformative feature with medium cardinality is generated by permuting the
informative feature, which removes its relationship with the target:

.. GENERATED FROM PYTHON SOURCE LINES 60-62

.. code-block:: Python

    X_shuffled = rng.permutation(X_informative)


.. GENERATED FROM PYTHON SOURCE LINES 63-70

The uninformative feature with high cardinality is generated so that it is
independent of the target variable. We will show that target encoding without
:term:`cross fitting` causes catastrophic overfitting for the downstream
regressor. These high-cardinality features are basically unique identifiers
for samples, which should generally be removed from machine learning datasets.
In this example, we generate them to show how :class:`TargetEncoder`'s default
:term:`cross fitting` behavior mitigates the overfitting issue automatically.

.. GENERATED FROM PYTHON SOURCE LINES 70-74

.. code-block:: Python

    X_near_unique_categories = rng.choice(
        int(0.9 * n_samples), size=n_samples, replace=True
    ).reshape(-1, 1)


.. GENERATED FROM PYTHON SOURCE LINES 75-76

Finally, we assemble the dataset and perform a train-test split:

.. GENERATED FROM PYTHON SOURCE LINES 76-89

.. code-block:: Python

    import pandas as pd

    from sklearn.model_selection import train_test_split

    X = pd.DataFrame(
        np.concatenate(
            [X_informative, X_shuffled, X_near_unique_categories],
            axis=1,
        ),
        columns=["informative", "shuffled", "near_unique"],
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


.. GENERATED FROM PYTHON SOURCE LINES 90-98

Training a Ridge Regressor
==========================
In this section, we train a ridge regressor on the dataset with and without
encoding and explore the influence of the target encoder with and without the
internal :term:`cross fitting`. First, we see that the Ridge model trained on
the raw features performs poorly. This is because we permuted the order of the
informative feature, so `X_informative` is not informative in its raw form:

.. GENERATED FROM PYTHON SOURCE LINES 98-110

.. code-block:: Python

    import sklearn
    from sklearn.linear_model import Ridge

    # Configure transformers to always output DataFrames
    sklearn.set_config(transform_output="pandas")

    ridge = Ridge(alpha=1e-6, solver="lsqr", fit_intercept=False)

    raw_model = ridge.fit(X_train, y_train)
    print("Raw Model score on training set: ", raw_model.score(X_train, y_train))
    print("Raw Model score on test set: ", raw_model.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Raw Model score on training set:  0.0049896314219657345
    Raw Model score on test set:  0.00457762158146513


.. GENERATED FROM PYTHON SOURCE LINES 111-114

Next, we create a pipeline with the target encoder and ridge model. The pipeline
uses :meth:`TargetEncoder.fit_transform`, which uses :term:`cross fitting`. We
see that the model fits the data well and generalizes to the test set:

.. GENERATED FROM PYTHON SOURCE LINES 114-122

.. code-block:: Python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import TargetEncoder

    model_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)
    model_with_cf.fit(X_train, y_train)
    print("Model with CF on train set: ", model_with_cf.score(X_train, y_train))
    print("Model with CF on test set: ", model_with_cf.score(X_test, y_test))


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model with CF on train set:  0.8000184677460302
    Model with CF on test set:  0.7927845601690917


.. GENERATED FROM PYTHON SOURCE LINES 123-125

The coefficients of the linear model show that most of the weight is on the
feature at column index 0, which is the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 125-140

.. code-block:: Python

    import matplotlib.pyplot as plt
    import pandas as pd

    plt.rcParams["figure.constrained_layout.use"] = True

    coefs_cf = pd.Series(
        model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_
    ).sort_values()
    ax = coefs_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded with cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )


.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :alt: Target encoded with cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 141-148

While :meth:`TargetEncoder.fit_transform` uses an internal
:term:`cross fitting` scheme to learn encodings for the training set,
:meth:`TargetEncoder.transform` itself does not: it applies encodings learned
from the complete training set to the categorical features. Thus, we can use
:meth:`TargetEncoder.fit` followed by :meth:`TargetEncoder.transform` to disable
the :term:`cross fitting`. This encoding is then passed to the ridge model.

.. GENERATED FROM PYTHON SOURCE LINES 148-155

.. code-block:: Python

    target_encoder = TargetEncoder(random_state=0)
    target_encoder.fit(X_train, y_train)
    X_train_no_cf_encoding = target_encoder.transform(X_train)
    X_test_no_cf_encoding = target_encoder.transform(X_test)

    model_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)


.. GENERATED FROM PYTHON SOURCE LINES 156-158

We evaluate the model that did not use :term:`cross fitting` when encoding and
see that it overfits:

.. GENERATED FROM PYTHON SOURCE LINES 158-170

.. code-block:: Python

    print(
        "Model without CF on training set: ",
        model_no_cf.score(X_train_no_cf_encoding, y_train),
    )
    print(
        "Model without CF on test set: ",
        model_no_cf.score(
            X_test_no_cf_encoding,
            y_test,
        ),
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model without CF on training set:  0.858486250088675
    Model without CF on test set:  0.6338211367102258


.. GENERATED FROM PYTHON SOURCE LINES 171-175

The ridge model overfits because it assigns much more weight to the
uninformative extremely high-cardinality ("near_unique") and medium-cardinality
("shuffled") features than when the model used :term:`cross fitting` to encode
the features.

.. GENERATED FROM PYTHON SOURCE LINES 175-185

.. code-block:: Python

    coefs_no_cf = pd.Series(
        model_no_cf.coef_, index=model_no_cf.feature_names_in_
    ).sort_values()
    ax = coefs_no_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded without cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )


.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :alt: Target encoded without cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 186-195

Conclusion
==========
This example demonstrates the importance of :class:`TargetEncoder`'s internal
:term:`cross fitting`. It is important to use
:meth:`TargetEncoder.fit_transform` to encode training data before passing it
to a machine learning model.
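
To make the mechanism concrete, the following is a minimal pure-Python sketch of
cross fitting on a feature made of unique identifiers, similar in spirit to
"near_unique". It is not scikit-learn's implementation: it uses plain category
means without shrinkage, and the helper names (``category_means``,
``encode_full_fit``, ``encode_cross_fit``), the interleaved fold assignment and
the fallback to the mean of the other folds for unseen categories are all
assumptions made for illustration.

```python
import random
from collections import defaultdict


def category_means(categories, targets):
    """Map each category to the plain (unshrunk) mean target of its samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def encode_full_fit(categories, targets):
    """Encode every sample with means learned from all samples (no cross fitting)."""
    means = category_means(categories, targets)
    return [means[cat] for cat in categories]


def encode_cross_fit(categories, targets, n_folds=5):
    """Encode each fold with means learned from the other folds only."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Hypothetical fold assignment: every n_folds-th sample is held out.
        held_out = set(range(fold, n, n_folds))
        train_idx = [i for i in range(n) if i not in held_out]
        train_cats = [categories[i] for i in train_idx]
        train_y = [targets[i] for i in train_idx]
        means = category_means(train_cats, train_y)
        # Assumption: categories unseen in the other folds fall back to the
        # mean target of those folds.
        fallback = sum(train_y) / len(train_y)
        for i in held_out:
            encoded[i] = means.get(categories[i], fallback)
    return encoded


rng = random.Random(0)
n_samples = 1000
ids = list(range(n_samples))  # one category per sample
y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

leaky = encode_full_fit(ids, y)
cross_fitted = encode_cross_fit(ids, y)

print("full fit reproduces the target:", leaky == y)
print("distinct cross-fitted values:", len(set(cross_fitted)))
```

Without cross fitting, the encoded identifier column is a copy of the target, so
a downstream model can memorize the training set. With cross fitting, each
sample only sees statistics computed from other folds, so the identifier column
carries no information about its own target.
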
When a :class:`TargetEncoder` is a part of a
:class:`~sklearn.pipeline.Pipeline` and the pipeline is fitted, the pipeline
will correctly call :meth:`TargetEncoder.fit_transform` and use
:term:`cross fitting` when encoding the training data.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.306 seconds)


.. _sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/preprocessing/plot_target_encoder_cross_val.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_target_encoder_cross_val.ipynb <plot_target_encoder_cross_val.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_target_encoder_cross_val.py <plot_target_encoder_cross_val.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_target_encoder_cross_val.zip <plot_target_encoder_cross_val.zip>`


.. include:: plot_target_encoder_cross_val.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_