.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/gaussian_process/plot_gpr_co2.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_gaussian_process_plot_gpr_co2.py: ==================================================================================== Forecasting of CO2 level on Mona Loa dataset using Gaussian process regression (GPR) ==================================================================================== This example is based on Section 5.4.3 of "Gaussian Processes for Machine Learning" [1]_. It illustrates an example of complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly average atmospheric CO2 concentrations (in parts per million by volume (ppm)) collected at the Mauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to model the CO2 concentration as a function of the time :math:`t` and extrapolate for years after 2001. .. rubric:: References .. [1] `Rasmussen, Carl Edward. "Gaussian processes in machine learning." Summer school on machine learning. Springer, Berlin, Heidelberg, 2003 `_. .. GENERATED FROM PYTHON SOURCE LINES 21-27 .. code-block:: Python print(__doc__) # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 28-36 Build the dataset ----------------- We will derive a dataset from the Mauna Loa Observatory that collected air samples. We are interested in estimating the concentration of CO2 and extrapolate it for further year. First, we load the original dataset available in OpenML as a pandas dataframe. This will be replaced with Polars once `fetch_openml` adds a native support for it. .. GENERATED FROM PYTHON SOURCE LINES 36-41 .. code-block:: Python from sklearn.datasets import fetch_openml co2 = fetch_openml(data_id=41187, as_frame=True) co2.frame.head() .. raw:: html
year month day weight flag station co2
0 1958 3 29 4 0 MLO 316.1
1 1958 4 5 6 0 MLO 317.3
2 1958 4 12 4 0 MLO 317.6
3 1958 4 19 6 0 MLO 317.5
4 1958 4 26 2 0 MLO 316.4


.. GENERATED FROM PYTHON SOURCE LINES 42-44 First, we process the original dataframe to create a date column and select it along with the CO2 column. .. GENERATED FROM PYTHON SOURCE LINES 44-51 .. code-block:: Python import polars as pl co2_data = pl.DataFrame(co2.frame[["year", "month", "day", "co2"]]).select( pl.date("year", "month", "day"), "co2" ) co2_data.head() .. raw:: html
shape: (5, 2)
dateco2
datef64
1958-03-29316.1
1958-04-05317.3
1958-04-12317.6
1958-04-19317.5
1958-04-26316.4


.. GENERATED FROM PYTHON SOURCE LINES 52-54 .. code-block:: Python co2_data["date"].min(), co2_data["date"].max() .. rst-class:: sphx-glr-script-out .. code-block:: none (datetime.date(1958, 3, 29), datetime.date(2001, 12, 29)) .. GENERATED FROM PYTHON SOURCE LINES 55-58 We see that we get CO2 concentration for some days from March, 1958 to December, 2001. We can plot these raw information to have a better understanding. .. GENERATED FROM PYTHON SOURCE LINES 58-65 .. code-block:: Python import matplotlib.pyplot as plt plt.plot(co2_data["date"], co2_data["co2"]) plt.xlabel("date") plt.ylabel("CO$_2$ concentration (ppm)") _ = plt.title("Raw air samples measurements from the Mauna Loa Observatory") .. image-sg:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_001.png :alt: Raw air samples measurements from the Mauna Loa Observatory :srcset: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 66-69 We will preprocess the dataset by taking a monthly average and drop month for which no measurements were collected. Such a processing will have an smoothing effect on the data. .. GENERATED FROM PYTHON SOURCE LINES 69-83 .. code-block:: Python co2_data = ( co2_data.sort(by="date") .group_by_dynamic("date", every="1mo") .agg(pl.col("co2").mean()) .drop_nulls() ) plt.plot(co2_data["date"], co2_data["co2"]) plt.xlabel("date") plt.ylabel("Monthly average of CO$_2$ concentration (ppm)") _ = plt.title( "Monthly average of air samples measurements\nfrom the Mauna Loa Observatory" ) .. image-sg:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_002.png :alt: Monthly average of air samples measurements from the Mauna Loa Observatory :srcset: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 84-90 The idea in this example will be to predict the CO2 concentration in function of the date. We are as well interested in extrapolating for upcoming year after 2001. As a first step, we will divide the data and the target to estimate. The data being a date, we will convert it into a numeric. .. GENERATED FROM PYTHON SOURCE LINES 90-95 .. code-block:: Python X = co2_data.select( pl.col("date").dt.year() + pl.col("date").dt.month() / 12 ).to_numpy() y = co2_data["co2"].to_numpy() .. GENERATED FROM PYTHON SOURCE LINES 96-110 Design the proper kernel ------------------------ To design the kernel to use with our Gaussian process, we can make some assumption regarding the data at hand. We observe that they have several characteristics: we see a long term rising trend, a pronounced seasonal variation and some smaller irregularities. We can use different appropriate kernel that would capture these features. First, the long term rising trend could be fitted using a radial basis function (RBF) kernel with a large length-scale parameter. The RBF kernel with a large length-scale enforces this component to be smooth. An trending increase is not enforced as to give a degree of freedom to our model. The specific length-scale and the amplitude are free hyperparameters. .. GENERATED FROM PYTHON SOURCE LINES 110-114 .. code-block:: Python from sklearn.gaussian_process.kernels import RBF long_term_trend_kernel = 50.0**2 * RBF(length_scale=50.0) .. GENERATED FROM PYTHON SOURCE LINES 115-122 The seasonal variation is explained by the periodic exponential sine squared kernel with a fixed periodicity of 1 year. The length-scale of this periodic component, controlling its smoothness, is a free parameter. In order to allow decaying away from exact periodicity, the product with an RBF kernel is taken. The length-scale of this RBF component controls the decay time and is a further free parameter. This type of kernel is also known as locally periodic kernel. .. GENERATED FROM PYTHON SOURCE LINES 122-130 .. code-block:: Python from sklearn.gaussian_process.kernels import ExpSineSquared seasonal_kernel = ( 2.0**2 * RBF(length_scale=100.0) * ExpSineSquared(length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed") ) .. GENERATED FROM PYTHON SOURCE LINES 131-136 The small irregularities are to be explained by a rational quadratic kernel component, whose length-scale and alpha parameter, which quantifies the diffuseness of the length-scales, are to be determined. A rational quadratic kernel is equivalent to an RBF kernel with several length-scale and will better accommodate the different irregularities. .. GENERATED FROM PYTHON SOURCE LINES 136-140 .. code-block:: Python from sklearn.gaussian_process.kernels import RationalQuadratic irregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.0) .. GENERATED FROM PYTHON SOURCE LINES 141-146 Finally, the noise in the dataset can be accounted with a kernel consisting of an RBF kernel contribution, which shall explain the correlated noise components such as local weather phenomena, and a white kernel contribution for the white noise. The relative amplitudes and the RBF's length scale are further free parameters. .. GENERATED FROM PYTHON SOURCE LINES 146-152 .. code-block:: Python from sklearn.gaussian_process.kernels import WhiteKernel noise_kernel = 0.1**2 * RBF(length_scale=0.1) + WhiteKernel( noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5) ) .. GENERATED FROM PYTHON SOURCE LINES 153-154 Thus, our final kernel is an addition of all previous kernel. .. GENERATED FROM PYTHON SOURCE LINES 154-159 .. code-block:: Python co2_kernel = ( long_term_trend_kernel + seasonal_kernel + irregularities_kernel + noise_kernel ) co2_kernel .. rst-class:: sphx-glr-script-out .. code-block:: none 50**2 * RBF(length_scale=50) + 2**2 * RBF(length_scale=100) * ExpSineSquared(length_scale=1, periodicity=1) + 0.5**2 * RationalQuadratic(alpha=1, length_scale=1) + 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01) .. GENERATED FROM PYTHON SOURCE LINES 160-169 Model fitting and extrapolation ------------------------------- Now, we are ready to use a Gaussian process regressor and fit the available data. To follow the example from the literature, we will subtract the mean from the target. We could have used `normalize_y=True`. However, doing so would have also scaled the target (dividing `y` by its standard deviation). Thus, the hyperparameters of the different kernel would have had different meaning since they would not have been expressed in ppm. .. GENERATED FROM PYTHON SOURCE LINES 169-175 .. code-block:: Python from sklearn.gaussian_process import GaussianProcessRegressor y_mean = y.mean() gaussian_process = GaussianProcessRegressor(kernel=co2_kernel, normalize_y=False) gaussian_process.fit(X, y - y_mean) .. raw:: html
GaussianProcessRegressor(kernel=50**2 * RBF(length_scale=50) + 2**2 * RBF(length_scale=100) * ExpSineSquared(length_scale=1, periodicity=1) + 0.5**2 * RationalQuadratic(alpha=1, length_scale=1) + 0.1**2 * RBF(length_scale=0.1) + WhiteKernel(noise_level=0.01))
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.


.. GENERATED FROM PYTHON SOURCE LINES 176-183 Now, we will use the Gaussian process to predict on: - training data to inspect the goodness of fit; - future data to see the extrapolation done by the model. Thus, we create synthetic data from 1958 to the current month. In addition, we need to add the subtracted mean computed during training. .. GENERATED FROM PYTHON SOURCE LINES 183-193 .. code-block:: Python import datetime import numpy as np today = datetime.datetime.now() current_month = today.year + today.month / 12 X_test = np.linspace(start=1958, stop=current_month, num=1_000).reshape(-1, 1) mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True) mean_y_pred += y_mean .. GENERATED FROM PYTHON SOURCE LINES 194-210 .. code-block:: Python plt.plot(X, y, color="black", linestyle="dashed", label="Measurements") plt.plot(X_test, mean_y_pred, color="tab:blue", alpha=0.4, label="Gaussian process") plt.fill_between( X_test.ravel(), mean_y_pred - std_y_pred, mean_y_pred + std_y_pred, color="tab:blue", alpha=0.2, ) plt.legend() plt.xlabel("Year") plt.ylabel("Monthly average of CO$_2$ concentration (ppm)") _ = plt.title( "Monthly average of air samples measurements\nfrom the Mauna Loa Observatory" ) .. image-sg:: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_003.png :alt: Monthly average of air samples measurements from the Mauna Loa Observatory :srcset: /auto_examples/gaussian_process/images/sphx_glr_plot_gpr_co2_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 211-218 Our fitted model is capable to fit previous data properly and extrapolate to future year with confidence. Interpretation of kernel hyperparameters ---------------------------------------- Now, we can have a look at the hyperparameters of the kernel. .. GENERATED FROM PYTHON SOURCE LINES 218-220 .. code-block:: Python gaussian_process.kernel_ .. rst-class:: sphx-glr-script-out .. code-block:: none 44.8**2 * RBF(length_scale=51.6) + 2.64**2 * RBF(length_scale=91.5) * ExpSineSquared(length_scale=1.48, periodicity=1) + 0.536**2 * RationalQuadratic(alpha=2.89, length_scale=0.968) + 0.188**2 * RBF(length_scale=0.122) + WhiteKernel(noise_level=0.0367) .. GENERATED FROM PYTHON SOURCE LINES 221-229 Thus, most of the target signal, with the mean subtracted, is explained by a long-term rising trend for ~45 ppm and a length-scale of ~52 years. The periodic component has an amplitude of ~2.6ppm, a decay time of ~90 years and a length-scale of ~1.5. The long decay time indicates that we have a component very close to a seasonal periodicity. The correlated noise has an amplitude of ~0.2 ppm with a length scale of ~0.12 years and a white-noise contribution of ~0.04 ppm. Thus, the overall noise level is very small, indicating that the data can be very well explained by the model. .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 5.553 seconds) .. _sphx_glr_download_auto_examples_gaussian_process_plot_gpr_co2.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/gaussian_process/plot_gpr_co2.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_gpr_co2.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_gpr_co2.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_gpr_co2.zip ` .. include:: plot_gpr_co2.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_