pybaselines.api

Module Contents

Classes

Baseline

A class for all baseline correction algorithms.

class pybaselines.api.Baseline(x_data=None, check_finite=True, assume_sorted=False, output_dtype=None)[source]

A class for all baseline correction algorithms.

Contains all available baseline correction algorithms in pybaselines as methods to allow a single interface for easier usage.

Parameters:
x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 during the first function call, with the same length as the input data.

check_finite : bool, optional

If True (default), will raise an error if any values in input data are not finite. Setting to False will skip the check. Note that errors may occur if check_finite is False and the input data contains non-finite values.

assume_sorted : bool, optional

If False (default), will sort the input x_data values. Otherwise, the input is assumed to be sorted. Note that some functions may raise an error if x_data is not sorted.

output_dtype : type or numpy.dtype, optional

The dtype to cast the output array. Default is None, which uses the typing of the input data.

Attributes:
poly_order : int

The last polynomial order used for a polynomial algorithm. Initially is -1, denoting that no polynomial fitting has been performed.

pspline : pybaselines._spline_utils.PSpline or None

The PSpline object for setting up and solving penalized spline algorithms. Is None if no penalized spline setup has been performed.

vandermonde : numpy.ndarray or None

The Vandermonde matrix for solving polynomial equations. Is None if no polynomial setup has been performed.

whittaker_system : pybaselines._banded_utils.PenalizedSystem or None

The PenalizedSystem object for setting up and solving Whittaker-smoothing-based algorithms. Is None if no Whittaker setup has been performed.

x : numpy.ndarray or None

The x-values for the object. If initialized with None, then x is initialized during the first function call to have the same length as the input data and min and max values of -1 and 1, respectively.

x_domain : numpy.ndarray

The minimum and maximum values of x. If x_data is None during initialization, then set to numpy.array([-1, 1]).

property pentapy_solver

The integer or string designating which solver to use if using pentapy.

See pentapy.solve() for available options, although 1 or 2 are the most relevant options. Default is 2.

New in version 1.1.0.
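
The snippet below is a minimal usage sketch, not part of the class documentation; the x, y, and fitter names are hypothetical and the data is synthetic:

    import numpy as np

    from pybaselines import Baseline

    # synthetic data: a Gaussian peak on a sloped baseline with added noise
    x = np.linspace(0, 1000, 1000)
    true_baseline = 5 + 0.01 * x
    peak = 100 * np.exp(-(x - 500)**2 / 2000)
    y = peak + true_baseline + np.random.default_rng(0).normal(0, 0.1, x.size)

    fitter = Baseline(x_data=x)
    # every method returns the calculated baseline and a parameter dictionary
    baseline, params = fitter.asls(y, lam=1e6, p=0.01)
    corrected = y - baseline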

adaptive_minmax(data, poly_order=None, method='modpoly', weights=None, constrained_fraction=0.01, constrained_weight=100000.0, estimation_poly_order=2, method_kwargs=None)

Fits polynomials of different orders and uses the maximum values as the baseline.

Each polynomial order fit is done both unconstrained and constrained at the endpoints.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

poly_order : int or Sequence(int, int) or None, optional

The two polynomial orders to use for fitting. If a single integer is given, then will use the input value and one plus the input value. Default is None, which will do a preliminary fit using a polynomial of order estimation_poly_order and then select the appropriate polynomial orders according to [7].

method : {'modpoly', 'imodpoly'}, optional

The method to use for fitting each polynomial. Default is 'modpoly'.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

constrained_fraction : float or Sequence(float, float), optional

The fraction of points at the left and right edges to use for the constrained fit. Default is 0.01. If constrained_fraction is a sequence, the first item is the fraction for the left edge and the second is the fraction for the right edge.

constrained_weight : float or Sequence(float, float), optional

The weighting to give to the endpoints. Higher values ensure that the end points are fit, but can cause large fluctuations in the other sections of the polynomial. Default is 1e5. If constrained_weight is a sequence, the first item is the weight for the left edge and the second is the weight for the right edge.

estimation_poly_order : int, optional

The polynomial order used for estimating the baseline-to-signal ratio to select the appropriate polynomial orders if poly_order is None. Default is 2.

method_kwargs : dict, optional

Additional keyword arguments to pass to modpoly() or imodpoly(). These include tol, max_iter, use_original, mask_initial_peaks, and num_std.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'constrained_weights': numpy.ndarray, shape (N,)

    The weight array used for the endpoint-constrained fits.

  • 'poly_order': numpy.ndarray, shape (2,)

    An array of the two polynomial orders used for the fitting.

References

[7]

Cao, A., et al. A robust method for automated background subtraction of tissue fluorescence. Journal of Raman Spectroscopy, 2007, 38, 1199-1205.
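
A short sketch (reusing the hypothetical fitter and y from the class-level example above), letting the method choose the polynomial orders automatically:

    # poly_order=None triggers the order selection described in [7]
    baseline, params = fitter.adaptive_minmax(y, method='modpoly')
    low_order, high_order = params['poly_order']  # the two orders used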

airpls(data, lam=1000000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)

Adaptive iteratively reweighted penalized least squares (airPLS) baseline.

Parameters:
data : array-like

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.

amormol(data, half_window=None, tol=0.001, max_iter=200, pad_kwargs=None, **window_kwargs)

Iteratively averaging morphological and mollified (aMorMol) baseline.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

half_window : int, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 200.

pad_kwargs : dict, optional

A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Chen, H., et al. An Adaptive and Fully Automated Baseline Correction Method for Raman Spectroscopy Based on Morphological Operations and Mollifications. Applied Spectroscopy, 2019, 73(3), 284-293.

arpls(data, lam=100000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)

Asymmetrically reweighted penalized least squares smoothing (arPLS).

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Baek, S.J., et al. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst, 2015, 140, 250-257.

asls(data, lam=1000000.0, p=0.01, diff_order=2, max_iter=50, tol=0.001, weights=None)

Fits the baseline using asymmetric least squares (AsLS) fitting.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

p : float, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

References

Eilers, P. A Perfect Smoother. Analytical Chemistry, 2003, 75(14), 3631-3636.

Eilers, P., et al. Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report, 2005, 1(1).
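
As a hedged sketch (hypothetical fitter and y as above): larger lam values stiffen the baseline, while smaller p values push the fit further below the peaks:

    baseline, params = fitter.asls(y, lam=1e7, p=0.001)
    # the fit converged if the last tolerance value dropped below tol
    converged = params['tol_history'][-1] < 1e-3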

aspls(data, lam=100000.0, diff_order=2, max_iter=100, tol=0.001, weights=None, alpha=None)

Adaptive smoothness penalized least squares smoothing (asPLS).

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 100.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

alpha : array-like, shape (N,), optional

An array of values that control the local value of lam to better fit peak and non-peak regions. If None (default), then the initial values will be an array with size equal to N and all values set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'alpha': numpy.ndarray, shape (N,)

    The array of alpha values used for fitting the data in the final iteration.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Notes

The weighting uses an asymmetric coefficient (k in the asPLS paper) of 0.5 instead of the 2 listed in the asPLS paper. pybaselines uses the factor of 0.5 since it matches the results in Table 2 and Figure 5 of the asPLS paper more closely than the factor of 2 and fits noisy data much better.

References

Zhang, F., et al. Baseline correction for infrared spectra using adaptive smoothness parameter penalized least squares method. Spectroscopy Letters, 2020, 53(3), 222-233.
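
A brief sketch (hypothetical fitter and y as above) showing the extra alpha output:

    baseline, params = fitter.aspls(y, lam=1e5)
    # local adjustment of lam from the final iteration
    final_alpha = params['alpha']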

beads(data, freq_cutoff=0.005, lam_0=1.0, lam_1=1.0, lam_2=1.0, asymmetry=6.0, filter_type=1, cost_function=2, max_iter=50, tol=0.01, eps_0=1e-06, eps_1=1e-06, fit_parabola=True, smooth_half_window=None)

Baseline estimation and denoising with sparsity (BEADS).

Decomposes the input data into baseline and pure, noise-free signal by modeling the baseline as a low pass filter and by considering the signal and its derivatives as sparse [1].

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

freq_cutoff : float, optional

The cutoff frequency of the high pass filter, normalized such that 0 < freq_cutoff < 0.5. Default is 0.005.

lam_0 : float, optional

The regularization parameter for the signal values. Default is 1.0. Higher values give a higher penalty.

lam_1 : float, optional

The regularization parameter for the first derivative of the signal. Default is 1.0. Higher values give a higher penalty.

lam_2 : float, optional

The regularization parameter for the second derivative of the signal. Default is 1.0. Higher values give a higher penalty.

asymmetry : float, optional

A number greater than 0 that determines the weighting of negative values compared to positive values in the cost function. Default is 6.0, which gives negative values six times more impact on the cost function than positive values. Set to 1 for a symmetric cost function, or a value less than 1 to weigh positive values more.

filter_type : int, optional

An integer describing the high pass filter type. The order of the high pass filter is 2 * filter_type. Default is 1 (second order filter).

cost_function : {2, 1, "l1_v1", "l1_v2"}, optional

An integer or string indicating which approximation of the l1 (absolute value) penalty to use. 1 or "l1_v1" will use \(l(x) = \sqrt{x^2 + \text{eps\_1}}\) and 2 (default) or "l1_v2" will use \(l(x) = |x| - \text{eps\_1}\log{(|x| + \text{eps\_1})}\).

max_iter : int, optional

The maximum number of iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-2.

eps_0 : float, optional

The cutoff threshold between absolute loss and quadratic loss. Values in the signal with absolute value less than eps_0 will have quadratic loss. Default is 1e-6.

eps_1 : float, optional

A small, positive value used to prevent issues when the first or second order derivatives are close to zero. Default is 1e-6.

fit_parabola : bool, optional

If True (default), will fit a parabola to the data and subtract it before performing the beads fit as suggested in [2]. This ensures the endpoints of the fit data are close to 0, which is required by beads. If the data is already close to 0 on both endpoints, set fit_parabola to False.

smooth_half_window : int, optional

The half-window to use for smoothing the derivatives of the data with a moving average and full window size of 2 * smooth_half_window + 1. Smoothing can improve the convergence of the calculation, and make the calculation less sensitive to small changes in lam_1 and lam_2, as noted in the pybeads package [3]. Default is None, which will not perform any smoothing.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'signal': numpy.ndarray, shape (N,)

    The pure signal portion of the input data without noise or the baseline.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if asymmetry is less than 0.

Notes

The default lam_0, lam_1, and lam_2 values are good starting points for a dataset with 1000 points. Typically, smaller values are needed for larger datasets and larger values for smaller datasets.

When finding the best parameters for fitting, it is usually best to find the optimal freq_cutoff for the noise in the data before adjusting any other parameters since it has the largest effect [2].

References

[1]

Ning, X., et al. Chromatogram baseline estimation and denoising using sparsity (BEADS). Chemometrics and Intelligent Laboratory Systems, 2014, 139, 156-167.

[2]

Navarro-Huerta, J.A., et al. Assisted baseline subtraction in complex chromatograms using the BEADS algorithm. Journal of Chromatography A, 2017, 1507, 1-10.
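
A hedged sketch (hypothetical fitter and y as above); beads additionally returns the denoised signal, so a noise estimate can be recovered by subtraction:

    baseline, params = fitter.beads(y, freq_cutoff=0.005, lam_0=1.0,
                                    lam_1=1.0, lam_2=1.0)
    pure_signal = params['signal']
    noise_estimate = y - baseline - pure_signal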

collab_pls(data, average_dataset=True, method='asls', method_kwargs=None)

Collaborative Penalized Least Squares (collab-PLS).

Averages the data or the fit weights for an entire dataset to get more optimal results. Uses any Whittaker-smoothing-based or weighted spline algorithm.

Parameters:
data : array-like, shape (M, N)

An array with shape (M, N) where M is the number of entries in the dataset and N is the number of data points in each entry.

average_dataset : bool, optional

If True (default), will average the dataset before fitting to get the weighting. If False, will fit each individual entry in the dataset and then average the weights to get the weighting for the dataset.

method : str, optional

A string indicating the Whittaker-smoothing-based or weighted spline method to use for fitting the baseline. Default is 'asls'.

method_kwargs : dict, optional

A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.

Returns:
baselines : numpy.ndarray, shape (M, N)

An array of all of the baselines.

params : dict

A dictionary with the following items:

  • 'average_weights': numpy.ndarray, shape (N,)

    The weight array used to fit all of the baselines.

  • 'average_alpha': numpy.ndarray, shape (N,)

    Only returned if method is 'aspls' or 'pspline_aspls'. The alpha array used to fit all of the baselines for the aspls() or pspline_aspls() methods.

Additional items depend on the output of the selected method. Every other key will have a list of values, with each item corresponding to a fit.

Notes

If method is 'aspls' or 'pspline_aspls', collab_pls will also calculate the alpha array for the entire dataset in the same manner as the weights.

References

Chen, L., et al. Collaborative Penalized Least Squares for Background Correction of Multiple Raman Spectra. Journal of Analytical Methods in Chemistry, 2018, 2018.
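
A minimal sketch (hypothetical fitter and y as above; the three-row dataset is synthetic) of fitting a whole dataset with shared weights:

    dataset = np.vstack((y, 1.05 * y, 0.95 * y))  # shape (M, N) with M = 3
    baselines, params = fitter.collab_pls(
        dataset, average_dataset=True, method='asls',
        method_kwargs={'lam': 1e6, 'p': 0.01}
    )
    shared_weights = params['average_weights']  # one weight array for all rows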

corner_cutting(data, max_iter=100)

Iteratively removes corner points and creates a Bezier spline from the remaining points.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

max_iter : int, optional

The maximum number of iterations to try to remove corner points. Default is 100. Typically all corner points are removed in 10 to 20 iterations.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

dict

An empty dictionary, just to match the output of all other algorithms.

References

Liu, Y.J., et al. A Concise Iterative Method with Bezier Technique for Baseline Construction. Analyst, 2015, 140(23), 7984-7996.

custom_bc(data, method='asls', regions=((None, None),), sampling=1, lam=None, diff_order=2, method_kwargs=None)

Customized baseline correction for fine-tuned stiffness of the baseline at specific regions.

Divides the data into regions with a variable number of data points and then uses other baseline algorithms to fit the truncated data. Regions with fewer points effectively make the fit baseline stiffer in those regions.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

method : str, optional

A string indicating the algorithm to use for fitting the baseline; can be any non-optimizer algorithm in pybaselines. Default is 'asls'.

regions : array-like, shape (M, 2), optional

The two dimensional array containing the start and stop indices for each region of interest. Each region is defined as data[start:stop]. Default is ((None, None),), which will use all points.

sampling : int or array-like, optional

The sampling step size for each region defined in regions. If sampling is an integer, then all regions will use the same index step size; if sampling is an array-like, its length must be equal to M, the first dimension in regions. Default is 1, which will use all points.

lam : float or None, optional

The value for smoothing the calculated interpolated baseline using Whittaker smoothing, in order to reduce the kinks between regions. Default is None, which will not smooth the baseline; a value of 0 will also not perform smoothing.

diff_order : int, optional

The difference order used for Whittaker smoothing of the calculated baseline. Default is 2.

method_kwargs : dict, optional

A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.

Returns:
baseline : numpy.ndarray, shape (N,)

The baseline calculated with the optimum parameter.

params : dict

A dictionary with the following items:

  • 'x_fit': numpy.ndarray, shape (P,)

    The truncated x-values used for fitting the baseline.

  • 'y_fit': numpy.ndarray, shape (P,)

    The truncated y-values used for fitting the baseline.

Additional items depend on the output of the selected method.

Raises:
ValueError

Raised if regions is not two dimensional, if sampling is not the same length as regions.shape[0], if any values in sampling or regions are less than 1, if segments in regions overlap, or if any value in regions is greater than the length of the input data.

Notes

Uses Whittaker smoothing to smooth the transitions between regions rather than LOESS as used in [31].

Uses binning rather than direct truncation of the regions in order to get better results for noisy data.

References

[31]

Liland, K., et al. Customized baseline correction. Chemometrics and Intelligent Laboratory Systems, 2011, 109(1), 51-56.
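
A hedged sketch (hypothetical fitter and y as above; the region boundaries are arbitrary) that keeps every point in the first half of the data but stiffens the baseline in the second half by sampling only every 20th point there:

    baseline, params = fitter.custom_bc(
        y, method='asls', regions=((None, 500), (500, None)),
        sampling=(1, 20), lam=1e3, method_kwargs={'lam': 1e6}
    )
    truncated_x, truncated_y = params['x_fit'], params['y_fit']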

cwt_br(data, poly_order=5, scales=None, num_std=1.0, min_length=2, max_iter=50, tol=0.001, symmetric=False, weights=None, **pad_kwargs)

Continuous wavelet transform baseline recognition (CWT-BR) algorithm.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 5.

scales : array-like, optional

The scales at which to perform the continuous wavelet transform. Default is None.

num_std : float, optional

The number of standard deviations to include when thresholding. Default is 1.0.

min_length : int, optional

Any region of consecutive baseline points shorter than min_length is considered to be a false positive, and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

max_iter : int, optional

The maximum number of iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

symmetric : bool, optional

When fitting the identified baseline points with a polynomial, if symmetric is False (default), will add any point i as a baseline point where the fit polynomial is greater than the input data for N/100 consecutive points on both sides of point i. If symmetric is True, then it means that both positive and negative peaks exist and baseline points are not modified during the polynomial fitting.

weights : array-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'best_scale': scalar

    The scale at which the Shannon entropy of the continuous wavelet transform of the data is at a minimum.

Notes

Uses the standard deviation for determining outliers during polynomial fitting rather than the standard error as used in the reference since the number of standard errors to include when thresholding varies with data size while the number of standard deviations is independent of data size.

References

Bertinetto, C., et al. Automatic Baseline Recognition for the Correction of Large Sets of Spectra Using Continuous Wavelet Transform and Iterative Fitting. Applied Spectroscopy, 2014, 68(2), 155-164.

derpsalsa(data, lam=1000000.0, p=0.01, k=None, diff_order=2, max_iter=50, tol=0.001, weights=None, smooth_half_window=None, num_smooths=16, **pad_kwargs)

Derivative Peak-Screening Asymmetric Least Squares Algorithm (derpsalsa).

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

p : float, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.

k : float, optional

A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

smooth_half_window : int, optional

The half-window to use for smoothing the data before computing the first and second derivatives. Default is None, which will use len(data) / 200.

num_smooths : int, optional

The number of times to smooth the data before computing the first and second derivatives. Default is 16.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

References

Korepanov, V. Asymmetric least-squares baseline algorithm with peak screening for automatic processing of the Raman spectra. Journal of Raman Spectroscopy. 2020, 51(10), 2061-2065.

dietrich(data, smooth_half_window=None, num_std=3.0, interp_half_window=5, poly_order=5, max_iter=50, tol=0.001, weights=None, return_coef=False, min_length=2, **pad_kwargs)

Dietrich's method for identifying baseline regions.

Calculates the power spectrum of the data as the squared derivative of the data. Then baseline points are identified by iteratively removing points where the mean of the power spectrum is less than num_std times the standard deviation of the power spectrum.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

smooth_half_window : int, optional

The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 256. Set to 0 to not smooth the data.

num_std : float, optional

The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.

interp_half_window : int, optional

When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.

poly_order : int, optional

The polynomial order for fitting the identified baseline. Default is 5.

max_iter : int, optional

The maximum number of iterations for fitting a polynomial to the identified baseline. If max_iter is 0, the returned baseline will be just the linear interpolation of the baseline segments. Default is 50.

tol : float, optional

The exit criteria for fitting a polynomial to the identified baseline points. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

min_length : int, optional

Any region of consecutive baseline points shorter than min_length is considered to be a false positive, and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True and max_iter is greater than 0. The array of polynomial coefficients for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

  • 'tol_history': numpy.ndarray

    Only if max_iter is greater than 1. An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Notes

When choosing parameters, first choose a smooth_half_window that appropriately smooths the data, and then reduce num_std until no peak regions are included in the baseline. If no value of num_std works, change smooth_half_window and repeat.

If max_iter is 0, the baseline is simply a linear interpolation of the identified baseline points. Otherwise, a polynomial is iteratively fit through the baseline points, and the interpolated sections are replaced each iteration with the polynomial fit.

References

Dietrich, W., et al. Fast and Precise Automatic Baseline Correction of One- and Two-Dimensional NMR Spectra. Journal of Magnetic Resonance. 1991, 91, 1-11.
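
A short sketch (hypothetical fitter and y as above; the half-window is a guess that would need tuning per the note above):

    baseline, params = fitter.dietrich(y, smooth_half_window=10, num_std=3.0)
    is_baseline = params['mask']  # True where points were classified as baseline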

drpls(data, lam=100000.0, eta=0.5, max_iter=50, tol=0.001, weights=None, diff_order=2)

Doubly reweighted penalized least squares (drPLS) baseline.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

eta : float

A term for controlling the value of lam; should be between 0 and 1. Low values will produce smoother baselines, while higher values will more aggressively fit peaks. Default is 0.5.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

diff_order : int, optional

The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if eta is not between 0 and 1 or if diff_order is less than 2.

References

Xu, D. et al. Baseline correction method based on doubly reweighted penalized least squares, Applied Optics, 2019, 58, 3913-3920.

fabc(data, lam=1000000.0, scale=None, num_std=3.0, diff_order=2, min_length=2, weights=None, weights_as_mask=False, **pad_kwargs)

Fully automatic baseline correction (fabc).

Similar to Dietrich's method, except that the derivative is estimated using a continuous wavelet transform and the baseline is calculated using Whittaker smoothing through the identified baseline points.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

scale : int, optional

The scale at which to calculate the continuous wavelet transform. Should be approximately equal to the index-based full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.

num_std : float, optional

The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

min_length : int, optional

Any region of consecutive baseline points shorter than min_length is considered to be a false positive, and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

weights : array-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

weights_as_mask : bool, optional

If True, signifies that the input weights is the mask to use for fitting, which skips the continuous wavelet calculation and just smooths the input data. Default is False.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

Notes

The classification of baseline points is similar to dietrich(), except that this method approximates the first derivative using a continuous wavelet transform with the Haar wavelet, which is more robust than the numerical derivative in Dietrich's method.

References

Cobas, J., et al. A new general-purpose fully automatic baseline-correction procedure for 1D and 2D NMR data. Journal of Magnetic Resonance, 2006, 183(1), 145-151.
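
A hedged sketch (hypothetical fitter and y as above; the scale value is a placeholder for the peaks' index-based full-width-at-half-maximum) that also reuses the classification through weights_as_mask:

    baseline, params = fitter.fabc(y, lam=1e6, scale=20)
    # reuse the mask on similar data to skip the wavelet classification step
    baseline_2, _ = fitter.fabc(y, lam=1e6, weights=params['mask'],
                                weights_as_mask=True)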

fastchrom(data, half_window=None, threshold=None, min_fwhm=None, interp_half_window=5, smooth_half_window=None, weights=None, max_iter=100, min_length=2, **pad_kwargs)

Identifies baseline segments by thresholding the rolling standard deviation distribution.

Baseline points are identified as any point where the rolling standard deviation is less than the specified threshold. Peak regions are iteratively interpolated until the baseline is below the data.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

half_window : int, optional

The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.

threshold : float or Callable, optional

All points in the rolling standard deviation below threshold will be considered as baseline. Higher values will assign more points as baseline. Default is None, which will set the threshold as the 15th percentile of the rolling standard deviation. If threshold is Callable, it should take the rolling standard deviation as the only argument and output a float.

min_fwhm : int, optional

After creating the interpolated baseline, any region where the baseline is greater than the data for min_fwhm consecutive points will have an additional baseline point added and reinterpolated. Should be set to approximately the index-based full-width-at-half-maximum of the smallest peak. Default is None, which uses 2 * half_window.

interp_half_window : int, optional

When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.

smooth_half_window : int, optional

The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.

weights : array-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

max_iter : int, optional

The maximum number of iterations to attempt to fill in regions where the baseline is greater than the input data. Default is 100.

min_length : int, optional

Any region of consecutive baseline points shorter than min_length is considered to be a false positive, and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

Notes

Only covers the baseline correction from FastChrom, not its peak finding and peak grouping capabilities.

References

Johnsen, L., et al. An automated method for baseline correction, peak finding and peak grouping in chromatographic data. Analyst. 2013, 138, 3502-3511.

goldindec(data, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_indec', peak_ratio=0.5, alpha_factor=0.99, tol_2=0.001, tol_3=1e-06, max_iter_2=100, return_coef=False)

Fits a polynomial baseline using a non-quadratic cost function.

The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria for the fitting with a given threshold value. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations for fitting a threshold value. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

cost_function : str, optional

The non-quadratic cost function to minimize. Unlike penalized_poly(), this function only works with asymmetric cost functions, so the symmetry prefix ('a' or 'asymmetric') is optional (e.g., 'indec' and 'a_indec' are the same). Default is 'asymmetric_indec'. Available methods, and their associated reference, are:

  • 'asymmetric_indec' [25]

  • 'asymmetric_truncated_quadratic' [26]

  • 'asymmetric_huber' [26]

peak_ratio : float, optional

A value between 0 and 1 that designates how many points in the data belong to peaks. Values are valid within ~10% of the actual peak ratio. Default is 0.5.

alpha_factor : float, optional

A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.

tol_2 : float, optional

The exit criteria for the difference between the optimal up-down ratio (number of points above 0 in the residual compared to number of points below 0) and the up-down ratio for a given threshold value. Default is 1e-3.

tol_3 : float, optional

The exit criteria for the relative change in the threshold value. Default is 1e-6.

max_iter_2 : int, optional

The number of iterations for iterating between different threshold values. Default is 100.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray, shape (J, K)

An array containing the calculated tolerance values for each iteration of both threshold values and fit values. Index 0 are the tolerance values for the difference in up-down ratios, index 1 are the tolerance values for the relative change in the threshold, and indices >= 2 are the tolerance values for each fit. All values that were not used in fitting have values of 0. Shape J is 2 plus the number of iterations for the threshold to converge (related to max_iter_2, tol_2, tol_3), and shape K is the maximum of the number of iterations for the threshold and the maximum number of iterations for all of the fits of the various threshold values (related to max_iter and tol).

  • 'threshold': float

    The optimal threshold value. Could be used in penalized_poly() for fitting other similar data.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if alpha_factor or peak_ratio are not between 0 and 1, or if the specified cost function is symmetric.

References

[25]

Liu, J., et al. Goldindec: A Novel Algorithm for Raman Spectrum Baseline Correction. Applied Spectroscopy, 2015, 69(7), 834-842.

[26]

Mazet, V., et al. Background removal from spectra by designing and minimising a non-quadratic cost function. Chemometrics and Intelligent Laboratory Systems, 2005, 76(2), 121-133.
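
A brief sketch (hypothetical fitter and y as above; peak_ratio=0.2 assumes roughly a fifth of the points belong to peaks):

    baseline, params = fitter.goldindec(y, poly_order=3, peak_ratio=0.2)
    # the optimized threshold could seed penalized_poly() on similar data
    threshold = params['threshold']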

golotvin(data, half_window=None, num_std=2.0, sections=32, smooth_half_window=None, interp_half_window=5, weights=None, min_length=2, **pad_kwargs)

Golotvin's method for identifying baseline regions.

Divides the data into sections and takes the minimum standard deviation of all sections as the noise standard deviation for the entire data. Then classifies any point where the rolling max minus min is less than num_std * noise standard deviation as belonging to the baseline.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

half_window : int, optional

The half-window to use for the rolling maximum and rolling minimum calculations. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.

num_std : float, optional

The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 2.0.

sections : int, optional

The number of sections to divide the input data into for finding the minimum standard deviation. Default is 32.

smooth_half_window : int, optional

The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.

interp_half_window : int, optional

When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.

weights : array-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

min_length : int, optional

Any region of consecutive baseline points shorter than min_length is considered to be a false positive, and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

References

Golotvin, S., et al. Improved Baseline Recognition and Modeling of FT NMR Spectra. Journal of Magnetic Resonance. 2000, 146, 122-125.

iarpls(data, lam=100000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)

Improved asymmetrically reweighted penalized least squares smoothing (IarPLS).

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

diff_order : int, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Ye, J., et al. Baseline correction method based on improved asymmetrically reweighted penalized least squares for Raman spectrum. Applied Optics, 2020, 59, 10933-10943.

iasls(data, lam=1000000.0, p=0.01, lam_1=0.0001, max_iter=50, tol=0.001, weights=None, diff_order=2)

Fits the baseline using the improved asymmetric least squares (IAsLS) algorithm.

The algorithm considers both the first and second derivatives of the residual.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lam : float, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

p : float, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.

lam_1 : float, optional

The smoothing parameter for the first derivative of the residual. Default is 1e-4.

max_iter : int, optional

The max number of fit iterations. Default is 50.

tol : float, optional

The exit criteria. Default is 1e-3.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be set by fitting the data with a second order polynomial.

diff_order : int, optional

The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1 or if diff_order is less than 2.

References

He, S., et al. Baseline correction for raman spectra using an improved asymmetric least squares method, Analytical Methods, 2014, 6(12), 4402-4407.

imodpoly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=True, return_coef=False, num_std=1.0)

The improved modified polynomial (IModPoly) baseline algorithm.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

use_original : bool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [11] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [12].

mask_initial_peaks : bool, optional

If True (default), will mask any data where the initial baseline fit + the standard deviation of the residual is less than the measured data [13].

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

num_std : float, optional

The number of standard deviations to include when thresholding. Default is 1. Must be greater than or equal to 0.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if num_std is less than 0.

Notes

Algorithm originally developed in [13].

References

[11]

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[12]

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[13]

Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.
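
A hedged sketch (hypothetical fitter, x, and y as above) showing how the returned coefficients reconstruct the baseline:

    from numpy.polynomial.polynomial import Polynomial

    baseline, params = fitter.imodpoly(y, poly_order=3, return_coef=True)
    polynomial = Polynomial(params['coef'])
    assert np.allclose(polynomial(x), baseline)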

imor(data, half_window=None, tol=0.001, max_iter=200, **window_kwargs)

An Improved Morphological based (IMor) baseline algorithm.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

half_window : int, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 200.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Dai, L., et al. An Automated Baseline Correction Method Based on Iterative Morphological Operations. Applied Spectroscopy, 2018, 72(5), 731-739.
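
A usage sketch showing how window_kwargs can bound the half-window search (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    # leave half_window as None but bound the optimize_window() search
    baseline, params = fitter.imor(y, min_half_window=10, max_half_window=100)
    print(params['half_window'], len(params['tol_history']))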

interp_pts(data=None, baseline_points=(), interp_method='linear')

Creates a baseline by interpolating through input points.

Parameters:
dataarray-like, optional

The y-values. Not used by this function, but input is allowed for consistency with other functions.

baseline_pointsarray-like, shape (n, 2)

An array of ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)) values for each point representing the baseline.

interp_methodstr, optional

The method to use for interpolation. See scipy.interpolate.interp1d for all options. Default is 'linear', which connects each point with a line segment.

Returns:
baselinenumpy.ndarray, shape (N,)

The baseline array constructed from interpolating between each input baseline point.

dict

An empty dictionary, just to match the output of all other algorithms.

Raises:
ValueError

Raised if baseline_points does not contain at least two values, signifying one x-y point.

Notes

This method is only suggested for use within user-interfaces.

Regions of the baseline where x_data is less than the minimum x-value or greater than the maximum x-value in baseline_points will be assigned values of 0.
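
A sketch of the expected baseline_points layout (the anchor points here are made up, as if hand-picked in a user interface):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    fitter = Baseline(x_data=x)
    # (x, y) pairs on the baseline, shape (n, 2)
    anchors = [(0, 0.1), (300, 0.4), (700, 0.9), (1000, 1.3)]
    baseline, _ = fitter.interp_pts(baseline_points=anchors, interp_method='quadratic')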

ipsa(data, half_window=None, max_iter=500, tol=None, roi=None, original_criteria=False, **pad_kwargs)

Iterative Polynomial Smoothing Algorithm (IPSA).

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window to use for the smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use 4 times the output of optimize_window(); that value is not always a good choice, but it at least scales with the number of data points and gives a starting point for tuning the parameter.

max_iterint, optional

The maximum number of iterations. Default is 500.

tolfloat, optional

The exit criteria. Default is None, which uses 1e-3 if original_criteria is False, and 1 / (max(data) - min(data)) if original_criteria is True.

roislice or array-like, shape (N,), optional

The region of interest, such that np.asarray(data)[roi] gives the values for calculating the tolerance if original_criteria is True. Not used if original_criteria is False. Default is None, which uses all values in data.

original_criteriabool, optional

Whether to use the original exit criteria from the reference, which is difficult to use since it requires knowledge of how high the peaks should be after baseline correction. If False (default), then compares norm(new - old) / norm(old), where old is the previous iteration's baseline and new is the current iteration's baseline.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Wang, T., et al. Background Subtraction of Raman Spectra Based on Iterative Polynomial Smoothing. Applied Spectroscopy. 71(6) (2017) 1169-1179.
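
A sketch tying half_window to the peak width, as the parameter description suggests (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    # Gaussian sigma of 30 points -> FWHM ~ 2.355 * 30 ~ 71 points
    baseline, params = fitter.ipsa(y, half_window=71)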

irsqr(data, lam=100, quantile=0.05, num_knots=100, spline_degree=3, diff_order=3, max_iter=100, tol=1e-06, weights=None, eps=None)

Iterative Reweighted Spline Quantile Regression (IRSQR).

Fits the baseline using quantile regression with penalized splines.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e2.

quantilefloat, optional

The quantile at which to fit the baseline. Default is 0.05.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 3 (third order differential matrix). Typical values are 3, 2, or 1.

max_iterint, optional

The max number of fit iterations. Default is 100.

tolfloat, optional

The exit criteria. Default is 1e-6.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

epsfloat, optional

A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum-absolute-value of the fit each iteration multiplied by 1e-6.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if quantile is not between 0 and 1.

References

Han, Q., et al. Iterative Reweighted Quantile Regression Using Augmented Lagrangian Optimization for Baseline Correction. 2018 5th International Conference on Information Science and Control Engineering (ICISCE), 2018, 280-284.
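
A usage sketch (illustrative values; a lower quantile pushes the fit toward the bottom of the noise):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.irsqr(y, lam=1e4, quantile=0.02, num_knots=150)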

jbcd(data, half_window=None, alpha=0.1, beta=10.0, gamma=1.0, beta_mult=1.1, gamma_mult=0.909, diff_order=1, max_iter=20, tol=0.01, tol_2=0.001, robust_opening=True, **window_kwargs)

Joint Baseline Correction and Denoising (jbcd) Algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

alphafloat, optional

The regularization parameter that controls how close the baseline must fit the calculated morphological opening. Larger values make the fit more constrained to the opening and can make the baseline less smooth. Default is 0.1.

betafloat, optional

The regularization parameter that controls how smooth the baseline is. Larger values produce smoother baselines. Default is 1e1.

gammafloat, optional

The regularization parameter that controls how smooth the signal is. Larger values produce smoother baselines. Default is 1.

beta_multfloat, optional

The value that beta is multiplied by each iteration. Default is 1.1.

gamma_multfloat, optional

The value that gamma is multiplied by each iteration. Default is 0.909.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 1 (first order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The maximum number of iterations. Default is 20.

tolfloat, optional

The exit criteria for the change in the calculated signal. Default is 1e-2.

tol_2float, optional

The exit criteria for the change in the calculated baseline. Default is 1e-3.

robust_openingbool, optional

If True (default), the opening used to represent the initial baseline is the element-wise minimum between the morphological opening and the average of the morphological erosion and dilation of the opening, similar to mor(). If False, the opening is just the morphological opening, as used in the reference. The robust opening typically represents the baseline better.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

  • 'tol_history': numpy.ndarray, shape (K, 2)

An array containing the calculated tolerance values for each iteration. Index 0 holds the tolerance values for the relative change in the signal, and index 1 holds the tolerance values for the relative change in the baseline. The length of the array is the number of iterations completed, K. If the last values in the array are greater than the input tol or tol_2 values, then the function did not converge.

  • 'signal': numpy.ndarray, shape (N,)

    The pure signal portion of the input data without noise or the baseline.

References

Liu, H., et al. Joint Baseline-Correction and Denoising for Raman Spectra. Applied Spectroscopy, 2015, 69(9), 1013-1022.
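
A sketch showing the extra 'signal' output that sets jbcd apart (synthetic noisy data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.02, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.jbcd(y, half_window=30)
    denoised = params['signal']  # noise-free signal estimate, alongside the baseline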

loess(data, fraction=0.2, total_points=None, poly_order=1, scale=3.0, tol=0.001, max_iter=10, symmetric_weights=False, use_threshold=False, num_std=1, use_original=False, weights=None, return_coef=False, conserve_memory=True, delta=0.0)

Locally estimated scatterplot smoothing (LOESS).

Performs polynomial regression at each data point using the nearest points.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

fractionfloat, optional

The fraction of N data points to include for the fitting on each point. Default is 0.2. Not used if total_points is not None.

total_pointsint, optional

The total number of points to include for the fitting on each point. Default is None, which will use fraction * N to determine the number of points.

scalefloat, optional

A scale factor applied to the weighted residuals to control the robustness of the fit. Default is 3.0, as used in [16]. Note that the original loess procedure in [17] used a scale of ~4.05.

poly_orderint, optional

The polynomial order for fitting the baseline. Default is 1.

tolfloat, optional

The exit criteria. Default is 1e-3.

max_iterint, optional

The maximum number of iterations. Default is 10.

symmetric_weightsbool, optional

If False (default), will apply weighting asymmetrically, with residuals < 0 having a weight of 1, according to [16]. If True, will apply weighting the same for both positive and negative residuals, which is regular LOESS. If use_threshold is True, this parameter is ignored.

use_thresholdbool, optional

If False (default), will compute weights each iteration to perform the robust fitting, which is regular LOESS. If True, will apply a threshold on the data being fit each iteration, based on the maximum values of the data and the fit baseline, as proposed by [18], similar to the modpoly and imodpoly techniques.

num_stdfloat, optional

The number of standard deviations to include when thresholding. Default is 1, which is the value used for the imodpoly technique. Only used if use_threshold is True.

use_originalbool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [19] when choosing minimum values for thresholding. If True, will compare the baseline with the original y-values given by data [20]. Only used if use_threshold is True.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

return_coefbool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

conserve_memorybool, optional

If False, will cache the distance-weighted kernels for each value in x_data on the first iteration and reuse them on subsequent iterations to save time. The shape of the array of kernels is (len(x_data), total_points). If True (default), will recalculate the kernels each iteration, which uses very little memory but is slower. Can usually be set to False unless x_data and total_points are quite large and the function causes memory issues when caching the kernels. If numba is installed, there is no significant time difference since the calculations are sped up.

deltafloat, optional

If delta is > 0, will skip all but the last x-value in the range x_last + delta, where x_last is the last x-value to be fit using weighted least squares, and instead use linear interpolation to calculate the fit for those x-values (same behavior as in statsmodels [21] and Cleveland's original Fortran lowess implementation [22]). Fits all x-values if delta is <= 0. Default is 0.0. Note that x_data is scaled to fit in the range [-1, 1], so delta should likewise be scaled. For example, if the desired delta value was 0.01 * (max(x_data) - min(x_data)), then the correctly scaled delta would be 0.02 (ie. 0.01 * (1 - (-1))).

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data. Does NOT contain the individual distance-weighted kernels for each x-value.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (N, poly_order + 1)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial. If delta is > 0, the coefficients for any skipped x-value will all be 0.

Raises:
ValueError

Raised if the number of points per window for the fitting is less than poly_order + 1 or greater than the total number of points, or if the values in self.x are not strictly increasing.

Notes

The iterative, robust, aspect of the fitting can be achieved either through reweighting based on the residuals (the typical usage), or thresholding the fit data based on the residuals, as proposed by [18], similar to the modpoly and imodpoly techniques.

In baseline literature, this procedure is sometimes called "rbe", meaning "robust baseline estimate".

References

[16] (1,2)

Ruckstuhl, A.F., et al. Baseline subtraction using robust local regression estimation. J. Quantitative Spectroscopy and Radiative Transfer, 2001, 68, 179-193.

[17]

Cleveland, W. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 1979, 74(368), 829-836.

[18] (1,2)

Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.

[19]

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[20]

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[22]

https://www.netlib.org/go (lowess.f is the file).
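
A sketch of the delta scaling described above (synthetic data; parameter values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    # x is rescaled internally to [-1, 1], so skipping points within 1% of the
    # x-range means delta = 0.01 * (1 - (-1)) = 0.02
    baseline, params = fitter.loess(y, fraction=0.3, scale=3.0, delta=0.02)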

mixture_model(data, lam=100000.0, p=0.01, num_knots=100, spline_degree=3, diff_order=3, max_iter=50, tol=0.001, weights=None, symmetric=False, num_bins=None)

Considers the data as a mixture model composed of noise and peaks.

Weights are iteratively assigned by calculating the probability each value in the residual belongs to a normal distribution representing the noise.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Used to set the initial weights before performing expectation-maximization. Default is 1e-2.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 3 (third order differential matrix). Typical values are 2 or 3.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1, and then two iterations of reweighted least-squares are performed to provide starting weights for the expectation-maximization of the mixture model.

symmetricbool, optional

If False (default), the total mixture model will be composed of one normal distribution for the noise and one uniform distribution for positive non-noise residuals. If True, an additional uniform distribution will be added to the mixture model for negative non-noise residuals. Only need to set symmetric to True when peaks are both positive and negative.

num_binsint, optional, deprecated

Deprecated since version 1.1.0: num_bins is deprecated since it is no longer necessary for performing the expectation-maximization and will be removed in pybaselines version 1.3.0.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

References

de Rooi, J., et al. Mixture models for baseline estimation. Chemometric and Intelligent Laboratory Systems, 2012, 117, 56-60.

Ghojogh, B., et al. Fitting A Mixture Distribution to Data: Tutorial. arXiv preprint arXiv:1901.06708, 2019.
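
A usage sketch (illustrative values; symmetric stays False since the synthetic peak is positive-only):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.mixture_model(y, lam=1e5, p=0.01, symmetric=False)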

modpoly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=False, return_coef=False)

The modified polynomial (ModPoly) baseline algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

poly_orderint, optional

The polynomial order for fitting the baseline. Default is 2.

tolfloat, optional

The exit criteria. Default is 1e-3.

max_iterint, optional

The maximum number of iterations. Default is 250.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

use_originalbool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [8] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [9].

mask_initial_peaksbool, optional

If True, will mask any data where the initial baseline fit + the standard deviation of the residual is less than the measured data [10]. Default is False.

return_coefbool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Notes

Algorithm originally developed in [9] and then slightly modified in [8].

References

[8] (1,2)

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[9] (1,2)

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[10]

Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.
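
A sketch contrasting the two thresholding variants (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    baseline, params = fitter.modpoly(y, poly_order=3)  # compares per-iteration y-values [8]
    baseline2, _ = fitter.modpoly(y, poly_order=3, use_original=True)  # original ModPoly [9]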

mor(data, half_window=None, **window_kwargs)

A Morphological based (Mor) baseline algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

References

Perez-Pueyo, R., et al. Morphology-Based Automated Baseline Removal for Raman Spectra of Artistic Pigments. Applied Spectroscopy, 2010, 64, 595-600.
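
A sketch of the two ways to set the half window (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    baseline, params = fitter.mor(y, half_window=40)  # explicit half window
    # or bound the automatic optimize_window() search instead
    baseline2, params2 = fitter.mor(y, min_half_window=10, max_half_window=100)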

mormol(data, half_window=None, tol=0.001, max_iter=250, smooth_half_window=None, pad_kwargs=None, **window_kwargs)

Iterative morphological and mollified (MorMol) baseline.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

tolfloat, optional

The exit criteria. Default is 1e-3.

max_iterint, optional

The maximum number of iterations. Default is 250.

smooth_half_windowint, optional

The half-window to use for smoothing the data before performing the morphological operation. Default is None, which will use a value of 1 (ie. no smoothing).

pad_kwargsdict, optional

A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

References

Koch, M., et al. Iterative morphological and mollifier-based baseline correction for Raman spectra. J Raman Spectroscopy, 2017, 48(2), 336-342.
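
A usage sketch with pre-smoothing before the morphological step (synthetic noisy data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.02, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.mormol(y, half_window=40, smooth_half_window=10)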

mpls(data, half_window=None, lam=1000000.0, p=0.0, diff_order=2, tol=0.001, max_iter=50, weights=None, **window_kwargs)

The Morphological penalized least squares (MPLS) baseline algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in [4] are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the weights will be calculated following the procedure in [4].

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'half_window': int

    The half window used for the morphological calculations.

Raises:
ValueError

Raised if p is not between 0 and 1.

References

[4] (1,2)

Li, Zhong, et al. Morphological weighted penalized least squares for background correction. Analyst, 2013, 138, 4483-4492.

mpspline(data, half_window=None, lam=10000.0, lam_smooth=0.01, p=0.0, num_knots=100, spline_degree=3, diff_order=2, weights=None, pad_kwargs=None, **window_kwargs)

Morphology-based penalized spline baseline.

Identifies baseline points using morphological operations, and then uses weighted least-squares to fit a penalized spline to the baseline.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

lamfloat, optional

The smoothing parameter for the penalized spline when fitting the baseline. Larger values will create smoother baselines. Default is 1e4. Larger values are needed for larger num_knots.

lam_smoothfloat, optional

The smoothing parameter for the penalized spline when smoothing the input data. Default is 1e-2. Larger values are needed for noisy data or for larger num_knots.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in the reference are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the weights will be calculated following the procedure in the reference.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'half_window': int

    The half window used for the morphological calculations.

Raises:
ValueError

Raised if half_window is < 1, if lam or lam_smooth is <= 0, or if p is not between 0 and 1.

Notes

The optimal opening is calculated as the element-wise minimum of the opening and the average of the erosion and dilation of the opening. The reference used the erosion and dilation of the smoothed data, rather than the opening, which tends to overestimate the baseline.

Rather than setting knots at the intersection points of the optimal opening and the smoothed data as described in the reference, weights are assigned to 1 - p at the intersection points and p elsewhere. This simplifies the penalized spline calculation by allowing the use of equally spaced knots, but should otherwise give similar results as the reference algorithm.

References

Gonzalez-Vidal, J., et al. Automatic morphology-based cubic p-spline fitting methodology for smoothing and baseline-removal of Raman spectra. Journal of Raman Spectroscopy. 2017, 48(6), 878-883.
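
A sketch distinguishing the two smoothing parameters (synthetic noisy data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    # lam smooths the baseline fit; lam_smooth smooths the input data itself
    baseline, params = fitter.mpspline(y, lam=1e4, lam_smooth=1e-2)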

mwmv(data, half_window=None, smooth_half_window=None, pad_kwargs=None, **window_kwargs)

Moving window minimum value (MWMV) baseline.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

smooth_half_windowint, optional

The half-window to use for smoothing the data after performing the morphological operation. Default is None, which will use the same value as used for the morphological operation.

pad_kwargsdict, optional

A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

Notes

Performs poorly when the baseline is rapidly changing.

References

Yaroshchyk, P., et al. Automatic correction of continuum background in Laser-induced Breakdown Spectroscopy using a model-free algorithm. Spectrochimica Acta Part B, 2014, 99, 138-149.

noise_median(data, half_window=None, smooth_half_window=None, sigma=None, **pad_kwargs)

The noise-median method for baseline identification.

Assumes the baseline can be considered as the median value within a moving window, and the resulting baseline is then smoothed with a Gaussian kernel.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The index-based size to use for the median window. The total window size will range from [-half_window, ..., half_window] with size 2 * half_window + 1. Default is None, which will use twice the output from optimize_window(), which is a reasonable starting value.

smooth_half_windowint, optional

The half window to use for smoothing. Default is None, which will use the same value as half_window.

sigmafloat, optional

The standard deviation of the smoothing Gaussian kernel. Default is None, which will use (2 * smooth_half_window + 1) / 6.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated and smoothed baseline.

dict

An empty dictionary, just to match the output of all other algorithms.

References

Friedrichs, M., A model-free algorithm for the removal of baseline artifacts. J. Biomolecular NMR, 1995, 5, 147-153.
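
A usage sketch (synthetic noisy data; the window sizes are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.02, x.size)
    fitter = Baseline(x_data=x)
    baseline, _ = fitter.noise_median(y, half_window=50, smooth_half_window=30)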

optimize_extended_range(data, method='asls', side='both', width_scale=0.1, height_scale=1.0, sigma_scale=1.0 / 12.0, min_value=2, max_value=8, step=1, pad_kwargs=None, method_kwargs=None)

Extends data and finds the best parameter value for the given baseline method.

Adds additional data to the left and/or right of the input data, and then iterates through parameter values to find the best fit. Useful for calculating the optimum lam or poly_order value required to optimize other algorithms.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

methodstr, optional

A string indicating the Whittaker-smoothing-based, polynomial, or spline method to use for fitting the baseline. Default is 'asls'.

side{'both', 'left', 'right'}, optional

The side of the measured data to extend. Default is 'both'.

width_scalefloat, optional

The number of data points added to each side is width_scale * N. Default is 0.1.

height_scalefloat, optional

The height of the added Gaussian peak(s) is calculated as height_scale * max(data). Default is 1.

sigma_scalefloat, optional

The sigma value for the added Gaussian peak(s) is calculated as sigma_scale * width_scale * N. Default is 1/12, which will make the Gaussian span +- 6 sigma, making its total width about half of the added length.

min_valueint or float, optional

The minimum value for the lam or poly_order value to use with the indicated method. If using a polynomial method, min_value must be an integer. If using a Whittaker-smoothing-based method, min_value should be the exponent to raise to the power of 10 (eg. a min_value value of 2 designates a lam value of 10**2). Default is 2.

max_valueint or float, optional

The maximum value for the lam or poly_order value to use with the indicated method. If using a polynomial method, max_value must be an integer. If using a Whittaker-smoothing-based method, max_value should be the exponent to raise to the power of 10 (eg. a max_value value of 3 designates a lam value of 10**3). Default is 8.

stepint or float, optional

The step size for iterating the parameter value from min_value to max_value. If using a polynomial method, step must be an integer. Default is 1.

pad_kwargsdict, optional

A dictionary of options to pass to pad_edges() for padding the edges of the data when adding the extended left and/or right sections. Default is None, which will use an empty dictionary.

method_kwargsdict, optional

A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.

Returns:
baselinenumpy.ndarray, shape (N,)

The baseline calculated with the optimum parameter.

method_paramsdict

A dictionary with the following items:

  • 'optimal_parameter': int or float

    The lam or poly_order value that produced the lowest root-mean-squared-error.

  • 'min_rmse': float

    The minimum root-mean-squared-error obtained when using the optimal parameter.

Additional items depend on the output of the selected method.

Raises:
ValueError

Raised if side is not 'left', 'right', or 'both'.

TypeError

Raised if using a polynomial method and min_value, max_value, or step is not an integer.

ValueError

Raised if using a Whittaker-smoothing-based method and min_value, max_value, or step is greater than 100.

Notes

Based on the extended range penalized least squares (erPLS) method from [5]. The method proposed by [5] was for optimizing lambda only for the aspls method by extending only the right side of the spectrum. The method was modified by allowing extending either side following [6], and for optimizing lambda or the polynomial degree for all of the affected algorithms in pybaselines.

References

[5] (1,2)

Zhang, F., et al. An Automatic Baseline Correction Method Based on the Penalized Least Squares Method. Sensors, 2020, 20(7), 2015.

[6]

Krishna, H., et al. Range-independent background subtraction algorithm for recovery of Raman spectra of biological tissue. Journal of Raman Spectroscopy. 2012, 43(12), 1884-1894.
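
A sketch of a lam search for a Whittaker-based method (the choice of 'arpls' and the max_iter passed through method_kwargs are illustrative assumptions, not prescribed by the source):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    # tries lam = 10**2, 10**3, ..., 10**8 for arpls and keeps the best fit
    baseline, params = fitter.optimize_extended_range(
        y, method='arpls', min_value=2, max_value=8, step=1,
        method_kwargs={'max_iter': 50},
    )
    print(params['optimal_parameter'], params['min_rmse'])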

penalized_poly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_truncated_quadratic', threshold=None, alpha_factor=0.99, return_coef=False)

Fits a polynomial baseline using a non-quadratic cost function.

The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

poly_orderint, optional

The polynomial order for fitting the baseline. Default is 2.

tolfloat, optional

The exit criteria. Default is 1e-3.

max_iterint, optional

The maximum number of iterations. Default is 250.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

cost_functionstr, optional

The non-quadratic cost function to minimize. Must indicate symmetry of the method by prepending 'a' or 'asymmetric' for asymmetric loss, or 's' or 'symmetric' for symmetric loss. Default is 'asymmetric_truncated_quadratic'. Available methods, and their associated reference, are:

  • 'asymmetric_truncated_quadratic' [14]

  • 'symmetric_truncated_quadratic' [14]

  • 'asymmetric_huber' [14]

  • 'symmetric_huber' [14]

  • 'asymmetric_indec' [15]

  • 'symmetric_indec' [15]

thresholdfloat, optional

The threshold value for the loss method, where the function goes from quadratic loss (such as used for least squares) to non-quadratic. For symmetric loss methods, residual values with absolute value less than threshold will have quadratic loss. For asymmetric loss methods, residual values less than the threshold will have quadratic loss. Default is None, which sets threshold to one-tenth of the standard deviation of the input data.

alpha_factorfloat, optional

A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.

return_coefbool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if alpha_factor is not between 0 and 1.

Notes

In baseline literature, this procedure is sometimes called "backcor".

References

[14] (1,2,3,4)

Mazet, V., et al. Background removal from spectra by designing and minimising a non-quadratic cost function. Chemometrics and Intelligent Laboratory Systems, 2005, 76(2), 121-133.

[15] (1,2)

Liu, J., et al. Goldindec: A Novel Algorithm for Raman Spectrum Baseline Correction. Applied Spectroscopy, 2015, 69(7), 834-842.
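
A usage sketch selecting one of the listed cost functions (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    baseline, params = fitter.penalized_poly(
        y, poly_order=3, cost_function='asymmetric_huber'
    )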

poly(data, poly_order=2, weights=None, return_coef=False)

Computes a polynomial that fits the baseline of the data.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

poly_orderint, optional

The polynomial order for fitting the baseline. Default is 2.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

return_coefbool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Notes

To only fit regions without peaks, supply a weight array with zero values at the indices where peaks are located.
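
A sketch of the weighting trick from the note above (the peak region chosen here is illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x
    fitter = Baseline(x_data=x)
    weights = np.ones_like(y)
    weights[(x > 400) & (x < 600)] = 0  # exclude the known peak region from the fit
    baseline, params = fitter.poly(y, poly_order=3, weights=weights)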

psalsa(data, lam=100000.0, p=0.5, k=None, diff_order=2, max_iter=50, tol=0.001, weights=None)

Peaked Signal's Asymmetric Least Squares Algorithm (psalsa).

Similar to the asymmetric least squares (AsLS) algorithm, but applies an exponential decay weighting to values greater than the baseline to allow using a higher p value to better fit noisy data.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 0.5.

kfloat, optional

A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

Notes

The exit criteria for the original algorithm was to check whether the signs of the residuals do not change between two iterations, but the comparison of the l2 norms of the weight arrays between iterations is used instead to be more comparable to other Whittaker-smoothing-based algorithms.

References

Oller-Moreno, S., et al. Adaptive Asymmetric Least Squares baseline estimation for analytical instruments. 2014 IEEE 11th International Multi-Conference on Systems, Signals, and Devices, 2014, 1-5.
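
A usage sketch (illustrative values; k is set near the height at which points count as peaks, per the parameter description):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    # peak height is ~1 here, so k ~ 0.5 discounts points rising well above the noise
    baseline, params = fitter.psalsa(y, lam=1e5, p=0.5, k=0.5)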

pspline_airpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the airPLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

See also

Baseline.airpls

References

Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
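
The pspline_* methods share the same calling pattern; a sketch for pspline_airpls (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.pspline_airpls(y, lam=1e3, num_knots=100)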

pspline_arpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the arPLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

See also

Baseline.arpls

References

Baek, S.J., et al. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst, 2015, 140, 250-257.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.

pspline_asls(data, lam=1000.0, p=0.01, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the asymmetric least squares (AsLS) algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

See also

Baseline.asls

References

Eilers, P. A Perfect Smoother. Analytical Chemistry, 2003, 75(14), 3631-3636.

Eilers, P., et al. Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report, 2005, 1(1).

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.

pspline_aspls(data, lam=10000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=100, tol=0.001, weights=None, alpha=None)

A penalized spline version of the asPLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e4.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 100.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

alphaarray-like, shape (N,), optional

An array of values that control the local value of lam to better fit peak and non-peak regions. If None (default), then the initial values will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'alpha': numpy.ndarray, shape (N,)

    The array of alpha values used for fitting the data in the final iteration.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

See also

Baseline.aspls

Notes

The weighting uses an asymmetric coefficient (k in the asPLS paper) of 0.5 instead of the 2 listed in the asPLS paper. pybaselines uses the factor of 0.5 since it matches the results in Table 2 and Figure 5 of the asPLS paper more closely than the factor of 2 and fits noisy data much better.

References

Zhang, F., et al. Baseline correction for infrared spectra using adaptive smoothness parameter penalized least squares method. Spectroscopy Letters, 2020, 53(3), 222-233.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
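
A sketch highlighting the extra 'alpha' output (synthetic data; values are illustrative):

    import numpy as np
    from pybaselines.api import Baseline

    x = np.linspace(0, 1000, 1000)
    rng = np.random.default_rng(0)
    y = np.exp(-0.5 * ((x - 500) / 30)**2) + 0.002 * x + rng.normal(0, 0.01, x.size)
    fitter = Baseline(x_data=x)
    baseline, params = fitter.pspline_aspls(y, lam=1e4)
    alpha = params['alpha']  # per-point adjustment of lam from the final iteration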

pspline_derpsalsa(data, lam=100.0, p=0.01, k=None, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None, smooth_half_window=None, num_smooths=16, **pad_kwargs)

A penalized spline version of the derpsalsa algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e2.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given 1 - p weight. Default is 1e-2.

kfloat, optional

A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

smooth_half_windowint, optional

The half-window to use for smoothing the data before computing the first and second derivatives. Default is None, which will use len(data) / 200.

num_smoothsint, optional

The number of times to smooth the data before computing the first and second derivatives. Default is 16.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

References

Korepanov, V. Asymmetric least-squares baseline algorithm with peak screening for automatic processing of the Raman spectra. Journal of Raman Spectroscopy, 2020, 51(10), 2061-2065.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
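
A minimal usage sketch with an explicit k (all values illustrative); leaving k as None would instead use one-tenth of the standard deviation of the input data, as described above:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.pspline_derpsalsa(y, lam=1e2, p=0.01, k=10)  # k near the height separating peaks from noise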

pspline_drpls(data, lam=1000.0, eta=0.5, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the drPLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

etafloat, optional

A term for controlling the value of lam; should be between 0 and 1. Low values will produce smoother baselines, while higher values will more aggressively fit peaks. Default is 0.5.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if eta is not between 0 and 1 or if diff_order is less than 2.

See also

Baseline.drpls

References

Xu, D., et al. Baseline correction method based on doubly reweighted penalized least squares, Applied Optics, 2019, 58, 3913-3920.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
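
A minimal sketch comparing two eta values (illustrative only), following the description of eta above:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
smoother_baseline, _ = fitter.pspline_drpls(y, lam=1e3, eta=0.2)  # lower eta: smoother baseline
tighter_baseline, _ = fitter.pspline_drpls(y, lam=1e3, eta=0.8)  # higher eta: fits peaks more aggressively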

pspline_iarpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the IarPLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

See also

Baseline.iarpls

References

Ye, J., et al. Baseline correction method based on improved asymmetrically reweighted penalized least squares for Raman spectrum. Applied Optics, 2020, 59, 10933-10943.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
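
A minimal usage sketch (values illustrative), using tol_history to count the iterations performed:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.pspline_iarpls(y, lam=1e3)
num_iterations = len(params['tol_history'])  # iterations actually completed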

pspline_iasls(data, lam=10.0, p=0.01, lam_1=0.0001, num_knots=100, spline_degree=3, max_iter=50, tol=0.001, weights=None, diff_order=2)

A penalized spline version of the IAsLS algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e1.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given 1 - p weight. Default is 1e-2.

lam_1float, optional

The smoothing parameter for the first derivative of the residual. Default is 1e-4.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

diff_orderint, optional

The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1 or if diff_order is less than 2.

See also

Baseline.iasls

References

He, S., et al. Baseline correction for Raman spectra using an improved asymmetric least squares method, Analytical Methods, 2014, 6(12), 4402-4407.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
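
A minimal usage sketch (values illustrative), passing lam_1 to control the first-derivative penalty described above:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.pspline_iasls(y, lam=1e1, p=0.01, lam_1=1e-4)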

pspline_mpls(data, half_window=None, lam=1000.0, p=0.0, num_knots=100, spline_degree=3, diff_order=2, tol=0.001, max_iter=50, weights=None, **window_kwargs)

A penalized spline version of the morphological penalized least squares (MPLS) algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in [32] are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the weights will be calculated following the procedure in [32].

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable half-window size. If None (default), will be set to (len(data) - 1) // 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'half_window': int

    The half window used for the morphological calculations.

Raises:
ValueError

Raised if p is not between 0 and 1.

See also

Baseline.mpls

References

[32]

Li, Zhong, et al. Morphological weighted penalized least squares for background correction. Analyst, 2013, 138, 4483-4492.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
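
A minimal usage sketch (values illustrative), bounding the half-window optimization through window_kwargs and retrieving the chosen half window:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.pspline_mpls(y, lam=1e3, min_half_window=5, max_half_window=100)
chosen_half_window = params['half_window']  # value selected by optimize_window()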

pspline_psalsa(data, lam=1000.0, p=0.5, k=None, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)

A penalized spline version of the psalsa algorithm.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.

lamfloat, optional

The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.

pfloat, optional

The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given 1 - p weight. Default is 0.5.

kfloat, optional

A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().

num_knotsint, optional

The number of knots for the spline. Default is 100.

spline_degreeint, optional

The degree of the spline. Default is 3, which is a cubic spline.

diff_orderint, optional

The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

max_iterint, optional

The max number of fit iterations. Default is 50.

tolfloat, optional

The exit criteria. Default is 1e-3.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

Raises:
ValueError

Raised if p is not between 0 and 1.

See also

Baseline.psalsa

References

Oller-Moreno, S., et al. Adaptive Asymmetric Least Squares baseline estimation for analytical instruments. 2014 IEEE 11th International Multi-Conference on Systems, Signals, and Devices, 2014, 1-5.

Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
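
A minimal usage sketch (values illustrative); the explicit k below mirrors the documented default behavior when k is None:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
k = 0.1 * np.std(y)  # one-tenth of the standard deviation of the input data
baseline, params = fitter.pspline_psalsa(y, lam=1e3, p=0.5, k=k)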

quant_reg(data, poly_order=2, quantile=0.05, tol=1e-06, max_iter=250, weights=None, eps=None, return_coef=False)

Approximates the baseline of the data using quantile regression.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

poly_orderint, optional

The polynomial order for fitting the baseline. Default is 2.

quantilefloat, optional

The quantile at which to fit the baseline. Default is 0.05.

tolfloat, optional

The exit criteria. Default is 1e-6. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a lower value to get a good fit.

max_iterint, optional

The maximum number of iterations. Default is 250. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a higher value to ensure convergence.

weightsarray-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

epsfloat, optional

A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum absolute value of the fit each iteration, multiplied by 1e-6.

return_coefbool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if quantile is not between 0 and 1.

Notes

Application of quantile regression for baseline fitting is described in [23].

Performs quantile regression using iteratively reweighted least squares (IRLS) as described in [24].

References

[23]

Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.

[24]

Schnabel, S., et al. Simultaneous estimation of quantile curves using quantile sheets. AStA Advances in Statistical Analysis, 2013, 97, 77-87.
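
A minimal usage sketch (values illustrative), recreating the baseline from the returned coefficients as the 'coef' entry above suggests:

import numpy as np
from numpy.polynomial import Polynomial
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.quant_reg(y, poly_order=2, quantile=0.05, return_coef=True)
recreated = Polynomial(params['coef'])(x)  # should closely reproduce the returned baseline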

ria(data, half_window=None, max_iter=500, tol=0.01, side='both', width_scale=0.1, height_scale=1.0, sigma_scale=1.0 / 12.0, **pad_kwargs)

Range Independent Algorithm (RIA).

Adds additional data to the left and/or right of the input data, and then iteratively smooths until the area of the additional data is removed.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window to use for the smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use the output of optimize_window(); that output is not always ideal, but it scales with the number of data points and provides a starting point for tuning the parameter.

max_iterint, optional

The maximum number of iterations. Default is 500.

tolfloat, optional

The exit criteria. Default is 1e-2.

side{'both', 'left', 'right'}, optional

The side of the measured data to extend. Default is 'both'.

width_scalefloat, optional

The number of data points added to each side is width_scale * N. Default is 0.1.

height_scalefloat, optional

The height of the added Gaussian peak(s) is calculated as height_scale * max(data). Default is 1.

sigma_scalefloat, optional

The sigma value for the added Gaussian peak(s) is calculated as sigma_scale * width_scale * N. Default is 1/12, which will make the Gaussian span +- 6 sigma, making its total width about half of the added length.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data when adding the extended left and/or right sections.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function either did not converge (if the array length equals max_iter) or the areas of the smoothed extended regions exceeded their initial areas (if the array length is less than max_iter).

Raises:
ValueError

Raised if side is not 'left', 'right', or 'both'.

References

Krishna, H., et al. Range-independent background subtraction algorithm for recovery of Raman spectra of biological tissue. Journal of Raman Spectroscopy, 2012, 43(12), 1884-1894.
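
A minimal usage sketch (values illustrative); the half_window should roughly match the full-width-at-half-maximum of the features, as noted above:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.ria(y, half_window=20, side='both', width_scale=0.1)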

rolling_ball(data, half_window=None, smooth_half_window=None, pad_kwargs=None, **window_kwargs)

The rolling ball baseline algorithm.

Applies a minimum and then maximum moving window, and subsequently smooths the result, giving a baseline that resembles rolling a ball across the data.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

smooth_half_windowint, optional

The half-window to use for smoothing the data after performing the morphological operation. Default is None, which will use the same value as used for the morphological operation.

pad_kwargsdict, optional

A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable half-window size. If None (default), will be set to (len(data) - 1) // 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

References

Kneen, M.A., et al. Algorithm for fitting XRF, SEM and PIXE X-ray spectra backgrounds. Nuclear Instruments and Methods in Physics Research B, 1996, 109, 209-213.

Liland, K., et al. Optimal Choice of Baseline Correction for Multivariate Calibration of Spectra. Applied Spectroscopy, 2010, 64(9), 1007-1016.
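
A minimal usage sketch (values illustrative), with an explicit smoothing half-window:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.rolling_ball(y, half_window=50, smooth_half_window=25)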

rubberband(data, segments=1, lam=None, diff_order=2, weights=None, smooth_half_window=None, **pad_kwargs)

Identifies baseline points by fitting a convex hull to the bottom of the data.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

segmentsint or array-like[int], optional

Used to fit multiple convex hulls to the data to negate the effects of concave data. If the input is an integer, it sets the number of equally sized segments the data will be split into. If the input is an array-like, each integer in the array will be the index that splits two segments, which allows constructing unequally sized segments. Default is 1, which fits a single convex hull to the data.

lamfloat or None, optional

The smoothing parameter for interpolating the baseline points using Whittaker smoothing. Set to 0 or None to use linear interpolation instead. Default is None, which does not smooth.

diff_orderint, optional

The order of the differential matrix if using Whittaker smoothing. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.

weightsarray-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered potential baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

smooth_half_windowint or None, optional

The half window to use for smoothing the input data with a moving average before calculating the convex hull, which gives much better results for noisy data. Set to None (default) or 0 to not smooth the data.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

Raises:
ValueError

Raised if the number of segments per window for the fitting is less than poly_order + 1 or greater than the total number of points, or if the values in self.x are not strictly increasing.
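
A minimal usage sketch (values illustrative), using multiple segments and Whittaker smoothing of the hull points, then reading the baseline mask:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.rubberband(y, segments=4, lam=1e4, smooth_half_window=5)
baseline_points = params['mask']  # True where points were treated as baseline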

snip(data, max_half_window=None, decreasing=False, smooth_half_window=None, filter_order=2, **pad_kwargs)

Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP).

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

max_half_windowint or Sequence(int, int), optional

The maximum number of iterations. Should be set such that max_half_window is approximately (w-1)/2, where w is the index-based width of a feature or peak. max_half_window can also be a sequence of two integers for asymmetric peaks, with the first item corresponding to the max_half_window of the peak's left edge, and the second item for the peak's right edge [29]. Default is None, which will use the output from optimize_window(), which provides a reasonable starting value.

decreasingbool, optional

If False (default), will iterate through window sizes from 1 to max_half_window. If True, will reverse the order and iterate from max_half_window to 1, which gives a smoother baseline according to [29] and [30].

smooth_half_windowint, optional

The half window to use for smoothing the data. If smooth_half_window is greater than 0, will perform a moving average smooth on the data for each window, which gives better results for noisy data [29]. Default is None, which will not perform any smoothing.

filter_order{2, 4, 6, 8}, optional

If the measured data has a more complicated baseline consisting of other elements such as Compton edges, then a higher filter_order should be selected [29]. Default is 2, which works well for approximating a linear baseline.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

An empty dictionary, just to match the output of all other algorithms.

Raises:
ValueError

Raised if filter_order is not 2, 4, 6, or 8.

Warns:
UserWarning

Raised if max_half_window is greater than (len(data) - 1) // 2.

Notes

The algorithm was initially developed by [27], and this specific version of the algorithm is adapted from [28], [29], and [30].

If the data covers several orders of magnitude, better results can be obtained by first transforming the data with a log-log-square-root transform before using SNIP [28]:

transformed_data = np.log(np.log(np.sqrt(data + 1) + 1) + 1)

and the baseline can then be reverted to the original scale using the inverse transform (indexing the returned tuple to get the baseline array):

baseline = -1 + (np.exp(np.exp(snip(transformed_data)[0]) - 1) - 1)**2

A runnable sketch of this transform is given after the References below.

References

[27]

Ryan, C.G., et al. SNIP, A Statistics-Sensitive Background Treatment For The Quantitative Analysis Of PIXE Spectra In Geoscience Applications. Nuclear Instruments and Methods in Physics Research B, 1988, 934, 396-402.

[28]

Morháč, M., et al. Background elimination methods for multidimensional coincidence γ-ray spectra. Nuclear Instruments and Methods in Physics Research A, 1997, 401, 113-132.

[29]

Morháč, M., et al. Peak Clipping Algorithms for Background Estimation in Spectroscopic Data. Applied Spectroscopy, 2008, 62(1), 91-106.

[30]

Morháč, M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments and Methods in Physics Research A, 2009, 60, 478-487.
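
A runnable sketch of the log-log-square-root transform described in the Notes (synthetic data and parameter values are illustrative only):

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
transformed = np.log(np.log(np.sqrt(y + 1) + 1) + 1)
transformed_baseline, _ = fitter.snip(transformed, max_half_window=40, decreasing=True)
baseline = -1 + (np.exp(np.exp(transformed_baseline) - 1) - 1)**2  # invert the transform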

std_distribution(data, half_window=None, interp_half_window=5, fill_half_window=3, num_std=1.1, smooth_half_window=None, weights=None, **pad_kwargs)

Identifies baseline segments by analyzing the rolling standard deviation distribution.

The rolling standard deviations are split into two distributions, with the smaller distribution assigned to noise. Baseline points are then identified as any point where the rolling standard deviation is less than a multiple of the median of the noise's standard deviation distribution.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(); that output is not always ideal, but it scales with the number of data points and provides a starting point for tuning the parameter.

interp_half_windowint, optional

When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.

fill_half_windowint, optional

When a point is identified as a peak point, all points +- fill_half_window are likewise set as peak points. Default is 3.

num_stdfloat, optional

The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 1.1.

smooth_half_windowint, optional

The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.

weightsarray-like, shape (N,), optional

The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'mask': numpy.ndarray, shape (N,)

    The boolean array designating baseline points as True and peak points as False.

References

Wang, K.C., et al. Distribution-Based Classification Method for Baseline Correction of Metabolomic 1D Proton Nuclear Magnetic Resonance Spectra. Analytical Chemistry, 2013, 85, 1231-1239.
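
A minimal usage sketch (values illustrative), inverting the returned mask to locate peak points:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.std_distribution(y, half_window=15, num_std=1.1)
peak_points = ~params['mask']  # True where points were treated as peaks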

swima(data, min_half_window=3, max_half_window=None, smooth_half_window=None, **pad_kwargs)

Small-window moving average (SWiMA) baseline.

Computes an iterative moving average to smooth peaks and obtain the baseline.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

min_half_windowint, optional

The minimum half window value that must be reached before the exit criteria is considered. Can be increased to reduce the calculation time. Default is 3.

max_half_windowint, optional

The maximum number of iterations. Default is None, which will use (N - 1) / 2. Typically does not need to be specified.

smooth_half_windowint, optional

The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 50. Use a value of 0 or less to not smooth the data. See Notes below for more details.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': list(int)

    A list of the half windows at which the exit criteria was reached. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2.

  • 'converged': list(bool or None)

    A list of the convergence status. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2. Each convergence status is True if the main exit criteria was reached, False if the second exit criteria was reached, and None if max_half_window is reached before either exit criteria.

Notes

This algorithm requires the input data to be fairly smooth (noise-free), so it is recommended to either smooth the data beforehand or specify a smooth_half_window value. Non-smooth data can cause the exit criteria to be reached prematurely (this can be avoided by setting a larger min_half_window), while over-smoothed data can cause the exit criteria to be reached later than optimal.

The half-window at which convergence occurs is roughly close to the index-based full-width-at-half-maximum of a peak or feature, but can vary. Therefore, it is better to set a min_half_window that is smaller than expected so that the exit criteria is not missed.

If the main exit criteria is not reached on the initial fit, a Gaussian baseline (which is well handled by this algorithm) is added to the data, and the data is re-fit.

References

Schulze, H., et al. A Small-Window Moving Average-Based Fully Automated Baseline Estimation Method for Raman Spectra. Applied Spectroscopy, 2012, 66(7), 757-764.
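
A minimal usage sketch (values illustrative), checking the convergence status described in the Returns section:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.swima(y, min_half_window=3)
final_status = params['converged'][-1]  # True, False, or None per the Returns section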

tophat(data, half_window=None, **window_kwargs)

Estimates the baseline using a top-hat transformation (morphological opening).

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window used for the morphological opening. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.

**window_kwargs

Values for setting the half window used for the morphology operations. Items include:

  • 'increment': int

    The step size for iterating half windows. Default is 1.

  • 'max_hits': int

    The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.

  • 'window_tol': float

    The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.

  • 'max_half_window': int

    The maximum allowable half-window size. If None (default), will be set to (len(data) - 1) // 2.

  • 'min_half_window': int

    The minimum half-window size. If None (default), will be set to 1.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': int

    The half window used for the morphological calculations.

Notes

The actual top-hat transformation is defined as data - opening(data), where opening is the morphological opening operation. This function, however, returns opening(data), since that is technically the baseline defined by the operation.

References

Perez-Pueyo, R., et al. Morphology-Based Automated Baseline Removal for Raman Spectra of Artistic Pigments. Applied Spectroscopy, 2010, 64, 595-600.
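
A minimal usage sketch (values illustrative), showing both an explicit half-window and the optimized alternative bounded through window_kwargs:

import numpy as np
from pybaselines import Baseline

x = np.linspace(0, 1000, 1000)
y = 100 * np.exp(-(x - 500)**2 / 800) + 0.05 * x + 10  # synthetic peak plus sloped background
fitter = Baseline(x_data=x)
baseline, params = fitter.tophat(y, half_window=40)
auto_baseline, auto_params = fitter.tophat(y, min_half_window=10, max_half_window=200)
chosen = auto_params['half_window']  # half window selected by optimize_window()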