.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/examples/general/plot_masked_data.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_examples_general_plot_masked_data.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_examples_general_plot_masked_data.py:


Working with Masked or Missing Data
-----------------------------------

There are cases where data needs to be filtered/masked before processing, for example a faulty
detector can result in problematic regions in measurements. In these cases, baseline correction on
the raw data could lead to severely incorrect fits. One such use of masking in literature is presented by
`Temmink, et al. <https://doi.org/10.1051/0004-6361/202348911>`_ for removing downward spikes in
mid-infrared data collected from the James Webb Space Telescope before performing baseline correction.
This example will detail  how to handle working with masked data for the various types of baseline
correction algorithms in pybaselines.

As a preface for this guide, the author of this library acknowledges that masking could potentially
be handled internally within each individual algorithm to make them "mask-aware". However, it would
be a large change to the codebase, and there are many edge cases that would make it a non-trivial endeavor.
Support for masking following the guidelines presented in this example could alternatively be added
as an :doc:`optimizer-type algorithm <../../../algorithms/optimizers>`, but, in the author's opinion,
making one single function that is expected to cover 60+ different algorithms would be fiddly at best
and very prone to bugs. Therefore, this guide is presented as a starting point for users to adapt for
targeting a spectific baseline correction algorithm or group of algorithms.

The various algorithms in pybaselines can be broadly grouped into three different categories
for how they handle masked data:

1. Methods that directly support masking by inputting the mask as weights, which includes
   all :doc:`classification <../../../algorithms/classification>` and
   :doc:`polynomial <../../../algorithms/polynomial>` methods except for :meth:`~.Baseline.loess`
   and :meth:`~.Baseline.quant_reg`. They are not NaN-aware, however, so if working with missing
   data, that has to be accounted for.
2. Methods that do iterative reweighting, such as
   :doc:`Whittaker smoothing methods <../../../algorithms/whittaker>`, most
   :doc:`spline methods <../../../algorithms/spline>`, :meth:`~.Baseline.loess` and
   :meth:`~.Baseline.quant_reg`. As to be covered later in this guide, it is relatively
   easy to emulate a "mask-aware" implementation of these algorithms by making use of the
   output weights in the parameter dictionary to perform weighted interpolation in masked regions.
3. All other methods. Most would not be able to have a "mask-aware" version and would instead
   rely simply on interpolation of the input data, so whether this was handled internally or
   externally in user-code, the results would be the same.

.. GENERATED FROM PYTHON SOURCE LINES 42-69

.. code-block:: Python


    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.interpolate import PchipInterpolator

    from pybaselines import Baseline
    from pybaselines.utils import gaussian, relative_difference
    from pybaselines._banded_utils import PenalizedSystem
    from pybaselines._weighting import _arpls


    x = np.linspace(500, 4000, 1000)
    signal = (
        + gaussian(x, 8, 650, 16)
        + gaussian(x, 9, 1100, 50)
        + gaussian(x, 8, 1350, 20)
        + gaussian(x, 11, 2800, 20)
        + gaussian(x, 8, 2900, 20)
        + gaussian(x, 5, 3400, 40)
    )
    baseline = 0.08 + 0.00004 * (x - 1000) + gaussian(x, 10, 1900, 800)
    rng = np.random.default_rng(123)
    noise = rng.normal(0, 0.1, len(x))
    y = signal + baseline + noise

    baseline_fitter = Baseline(x)


.. GENERATED FROM PYTHON SOURCE LINES 71-73

This example will emulate an issue with the detector such that a certain spectral region is
just noise.

.. GENERATED FROM PYTHON SOURCE LINES 73-82

.. code-block:: Python

    bad_region = (x > 2000) & (x < 2500)

    y_bad = y.copy()
    y_bad[bad_region] = rng.normal(0.5, 0.25, len(x[bad_region]))

    plt.plot(x, y, label='normal data')
    plt.plot(x, y_bad, label='problematic data')
    plt.legend()


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_001.png
   :alt: plot masked data
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 83-89

To start, the mask for fitting the data has to be made. This can be done by eye if fitting
a few datasets, or can be automated using some metric. Many
:doc:`classification methods <../../../algorithms/classification>` use different methods for
excluding positive peaks; for excluding negative peaks, see
`Temmink, et al. <https://doi.org/10.1051/0004-6361/202348911>`_ for an example. This example
will simply define the mask region by hand.

.. GENERATED FROM PYTHON SOURCE LINES 89-92

.. code-block:: Python

    fit_mask = (x < 1900) | (x > 2550)  # 1 in regions to fit, 0 in masked region


.. GENERATED FROM PYTHON SOURCE LINES 93-96

First, the algorithms that are already mask-aware to a certain extent will be covered.
For these algorithms, the mask needs to simply be input as ``weights``, and any NaN values
need to be replaced (with zeros, through interpolation, or using :func:`numpy.nan_to_num`).

.. GENERATED FROM PYTHON SOURCE LINES 96-120

.. code-block:: Python

    non_masked_std_distributionc = baseline_fitter.std_distribution(
        y_bad, half_window=20, num_std=3
    )[0]
    masked_std_distribution = baseline_fitter.std_distribution(
        y_bad, half_window=20, num_std=3, weights=fit_mask
    )[0]

    non_masked_imodpoly = baseline_fitter.imodpoly(y_bad, poly_order=5, num_std=0.1)[0]
    masked_imodpoly = baseline_fitter.imodpoly(y_bad, poly_order=5, num_std=0.1, weights=fit_mask)[0]

    _, (ax1, ax2) = plt.subplots(2, layout='constrained')
    ax1.set_title('std_distribution')
    ax2.set_title('imodpoly')

    ax1.plot(x, y_bad, label='data')
    ax1.plot(x, non_masked_std_distributionc, label='non-masked baseline')
    ax1.plot(x, masked_std_distribution, label='masked baseline')
    ax1.legend()

    ax2.plot(x, y_bad, label='data')
    ax2.plot(x, non_masked_imodpoly, label='non-masked baseline')
    ax2.plot(x, masked_imodpoly, label='masked baseline')
    ax2.legend()


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_002.png
   :alt: std_distribution, imodpoly
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 121-123

As seen in the plots above, the desired baselines can be obtained by simply
inputting the mask as weights.

.. GENERATED FROM PYTHON SOURCE LINES 125-138

For all other non-mask-aware algorithms, interpolation is required for either the input data
or the output baseline.
It is generally recommended to interpolate the input data **before** passing the data to
a baseline correction method rather than passing the masked data to the method and interpolating
after since a large majority of algorithms are
not x-aware, so the two regions on the edge of the mask would be assumed to be connected
for those algorithms, as shown below, and would subsequently cause edge effects. Knowing
which algorithms are x-aware and which are not is an implementation detail that is typically
not apparent without intimate knowledge of each algorithm, so assuming all algorithms are
not x-aware and interpolating before baseline correction is the safer route.

Note that as a generalization, polynomial, spline, and classifcation methods are x-aware
while Whittaker, morphological, and smoothing methods are not.

.. GENERATED FROM PYTHON SOURCE LINES 138-144

.. code-block:: Python

    _, (ax1, ax2) = plt.subplots(2, layout='constrained')
    ax1.plot(x[fit_mask], y[fit_mask])
    ax2.plot(y[fit_mask])
    ax1.set_title('What x-aware algorithms see')
    ax2.set_title('What non-x-aware algorithms see')


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_003.png
   :alt: What x-aware algorithms see, What non-x-aware algorithms see
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 145-146

For the next example, simple linear interpolation will be used.

.. GENERATED FROM PYTHON SOURCE LINES 146-156

.. code-block:: Python

    y_linear = np.interp(x, x[fit_mask], y_bad[fit_mask])
    plt.figure()
    plt.plot(x, y_bad, label='data')
    plt.plot(x, y_linear, label='linear interpolation')
    plt.fill_between(
        x[~fit_mask], 0, 1, color='green', alpha=0.1, transform=plt.gca().get_xaxis_transform(),
        label='masked region'
    )
    plt.legend()


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_004.png
   :alt: plot masked data
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 157-163

Next are algorithms that use iterative reweighting. While the majority of these
methods allow inputting weights, the input weights are just used for the first
iteration to jump-start the calculation and are ignored in subsequent calculations.
"Mask-aware" versions of these algorithms could be implemented, as shown below for
the arPLS algorithm. However, the "mask-aware" behavior can be closely approximated
simply by using weighted interpolation, to be shown below.

.. GENERATED FROM PYTHON SOURCE LINES 163-200

.. code-block:: Python


    def masked_arpls(y, mask=None, lam=1e5, diff_order=2, tol=1e-3, max_iter=50, weights=None):
        """A mask-aware version of the arpls algorithm."""
        len_y = len(y)
        if mask is None:
            mask = np.ones(len_y, dtype=bool)
            y_fit = y
        else:
            y_fit = 1 * y  # don't want to override the input y, so make a copy
            y_fit[~mask] = 0  # cover that case of nan values in y since 0 * nan = nan rather than 0
        if weights is None:
            weights = np.ones(len_y)
        else:
            weights = 1 * weights  # don't want to override the input weights, so make a copy
        weights[~mask] = 0
        whittaker_system = PenalizedSystem(len(y), lam=lam, diff_order=diff_order)
        for _ in range(max_iter):
            baseline = whittaker_system.solve(
                whittaker_system.add_diagonal(weights), weights * y_fit,
            )
            # need to ignore the problem regions in y since they would otherwise affect
            # the arpls weighting; could alternatively do:
            # _arpls(np.interp(x, x[mask], y[mask]), baseline) to approximate
            # the y-values, but it leads to a slightly different result
            calc_weights, exit_early = _arpls(y[mask], baseline[mask])
            if exit_early:
                break
            new_weights = np.zeros(len_y)
            new_weights[mask] = calc_weights
            if relative_difference(weights, new_weights) < tol:
                break
            weights = new_weights

        return baseline


.. GENERATED FROM PYTHON SOURCE LINES 201-206

For iteratively reweighted algorithms, the best way to approximate a fully "mask-aware"
version of the algorithms is to interpolate the input and get an approximated baseline. Then,
the output weights in the region to be excluded should be set to 0, and the method should be
called again for a single iteration, which can typically be done by setting the tolerance to
infinity.

.. GENERATED FROM PYTHON SOURCE LINES 206-223

.. code-block:: Python


    lam = 1e5
    non_masked_arpls = baseline_fitter.arpls(y_bad, lam=lam)[0]
    masked_arpls = masked_arpls(y_bad, mask=fit_mask, lam=lam)
    initial_fit, params = baseline_fitter.arpls(y_linear, lam=lam)
    params['weights'][~fit_mask] = 0
    weighted_arpls = baseline_fitter.arpls(y_linear, lam=lam, weights=params['weights'], tol=np.inf)[0]

    plt.figure()
    plt.plot(x, y_bad)
    plt.plot(x, non_masked_arpls, label='non-masked')
    plt.plot(x, masked_arpls, label='mask-aware')
    plt.plot(x, initial_fit, label='initial interpolated fit')
    plt.plot(x, weighted_arpls, '--', label='final weighted interpolation')

    plt.legend()


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_005.png
   :alt: plot masked data
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_005.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 224-227

For all other algorithms, the performance is directly tied to the interpolation since
information about the mask cannot be integrated into the method. For example, the difference
between using linear interpolation and PCHIP interpolation is shown below.

.. GENERATED FROM PYTHON SOURCE LINES 227-246

.. code-block:: Python

    y_pchip = PchipInterpolator(x[fit_mask], y[fit_mask])(x)
    _, (ax1, ax2) = plt.subplots(2, layout='constrained')
    ax1.set_title('Interpolated Data')
    ax2.set_title('Calculated Baselines using "mor"')

    ax1.plot(x, y_bad)
    ax1.plot(x, y_linear, label='linear interpolation')
    ax1.plot(x, y_pchip, label='PCHIP interpolation')
    ax1.legend()

    half_window = 35
    mor_linear = baseline_fitter.mor(y_linear, half_window=half_window)[0]
    mor_pchip = baseline_fitter.mor(y_pchip, half_window=half_window)[0]
    ax2.plot(x, y_bad)
    ax2.plot(x, mor_linear, label='linear interpolation')
    ax2.plot(x, mor_pchip, label='PCHIP interpolation')
    ax2.legend()

    plt.show()


.. image-sg:: /generated/examples/general/images/sphx_glr_plot_masked_data_006.png
   :alt: Interpolated Data, Calculated Baselines using "mor"
   :srcset: /generated/examples/general/images/sphx_glr_plot_masked_data_006.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.182 seconds)


.. _sphx_glr_download_generated_examples_general_plot_masked_data.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_masked_data.ipynb <plot_masked_data.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_masked_data.py <plot_masked_data.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_masked_data.zip <plot_masked_data.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_