pybaselines.api
Module Contents
Classes
Baseline: A class for all baseline correction algorithms.
- class pybaselines.api.Baseline(x_data=None, check_finite=True, assume_sorted=False, output_dtype=None)[source]
A class for all baseline correction algorithms.
Contains all available baseline correction algorithms in pybaselines as methods to allow a single interface for easier usage.
- Parameters:
- x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 during the first function call with length equal to the input data length.
- check_finite : bool, optional
If True (default), will raise an error if any values in input data are not finite. Setting to False will skip the check. Note that errors may occur if check_finite is False and the input data contains non-finite values.
- assume_sorted : bool, optional
If False (default), will sort the input x_data values. Otherwise, the input is assumed to be sorted. Note that some functions may raise an error if x_data is not sorted.
- output_dtype : type or numpy.dtype, optional
The dtype to cast the output array. Default is None, which uses the typing of the input data.
- Attributes:
- poly_order : int
The last polynomial order used for a polynomial algorithm. Initially is -1, denoting that no polynomial fitting has been performed.
- pspline : pybaselines._spline_utils.PSpline or None
The PSpline object for setting up and solving penalized spline algorithms. Is None if no penalized spline setup has been performed.
- vandermonde : numpy.ndarray or None
The Vandermonde matrix for solving polynomial equations. Is None if no polynomial setup has been performed.
- whittaker_system : pybaselines._banded_utils.PenalizedSystem or None
The PenalizedSystem object for setting up and solving Whittaker-smoothing-based algorithms. Is None if no Whittaker setup has been performed.
- x : numpy.ndarray or None
The x-values for the object. If initialized with None, then x is initialized during the first function call to have the same length as the input data and min and max values of -1 and 1, respectively.
- x_domain : numpy.ndarray
The minimum and maximum values of x. If x_data is None during initialization, then set to numpy.array([-1, 1]).
- property pentapy_solver
The integer or string designating which solver to use if using pentapy. See pentapy.solve() for available options, although 1 or 2 are the most relevant options. Default is 2.
New in version 1.1.0.
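As a sketch of typical usage of this single-interface class (guarded so it only runs when pybaselines is importable; the asls parameters shown are illustrative, not recommendations):

```python
import importlib.util

import numpy as np

# Synthetic spectrum: a sloped baseline plus one Gaussian peak
x = np.linspace(100, 4200, 1000)
y = 10 + 0.003 * x + 50 * np.exp(-((x - 2000) / 30) ** 2)

if importlib.util.find_spec("pybaselines") is not None:
    from pybaselines import Baseline

    fitter = Baseline(x_data=x)  # one object reused for every algorithm
    baseline, params = fitter.asls(y, lam=1e6, p=0.01)
    corrected = y - baseline
```

The same fitter object can then call any other method below (e.g. fitter.arpls(y)) without re-validating x.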
- adaptive_minmax(data, poly_order=None, method='modpoly', weights=None, constrained_fraction=0.01, constrained_weight=100000.0, estimation_poly_order=2, method_kwargs=None)
Fits polynomials of different orders and uses the maximum values as the baseline.
Each polynomial order fit is done both unconstrained and constrained at the endpoints.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_order : int or Sequence(int, int) or None, optional
The two polynomial orders to use for fitting. If a single integer is given, then will use the input value and one plus the input value. Default is None, which will do a preliminary fit using a polynomial of order estimation_poly_order and then select the appropriate polynomial orders according to [7].
- method : {'modpoly', 'imodpoly'}, optional
The method to use for fitting each polynomial. Default is 'modpoly'.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- constrained_fraction : float or Sequence(float, float), optional
The fraction of points at the left and right edges to use for the constrained fit. Default is 0.01. If constrained_fraction is a sequence, the first item is the fraction for the left edge and the second is the fraction for the right edge.
- constrained_weight : float or Sequence(float, float), optional
The weighting to give to the endpoints. Higher values ensure that the end points are fit, but can cause large fluctuations in the other sections of the polynomial. Default is 1e5. If constrained_weight is a sequence, the first item is the weight for the left edge and the second is the weight for the right edge.
- estimation_poly_order : int, optional
The polynomial order used for estimating the baseline-to-signal ratio to select the appropriate polynomial orders if poly_order is None. Default is 2.
- method_kwargs : dict, optional
Additional keyword arguments to pass to modpoly() or imodpoly(). These include tol, max_iter, use_original, mask_initial_peaks, and num_std.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'constrained_weights': numpy.ndarray, shape (N,)
The weight array used for the endpoint-constrained fits.
- 'poly_order': numpy.ndarray, shape (2,)
An array of the two polynomial orders used for the fitting.
References
[7] Cao, A., et al. A robust method for automated background subtraction of tissue fluorescence. Journal of Raman Spectroscopy, 2007, 38, 1199-1205.
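The max-of-two-fits idea can be sketched with plain NumPy (minmax_poly_sketch is a hypothetical helper, not the library implementation; heavy endpoint weighting stands in here for the constrained fit):

```python
import numpy as np
from numpy.polynomial import polynomial as Poly

def minmax_poly_sketch(x, y, poly_order=2, constrained_fraction=0.01,
                       constrained_weight=1e5):
    """Fit unconstrained and endpoint-constrained polynomials; keep the maximum."""
    n_edge = max(1, int(len(y) * constrained_fraction))
    w = np.ones(len(y))
    w[:n_edge] = constrained_weight   # heavily weight the left edge
    w[-n_edge:] = constrained_weight  # and the right edge
    free_fit = Poly.polyval(x, Poly.polyfit(x, y, poly_order))
    constrained_fit = Poly.polyval(x, Poly.polyfit(x, y, poly_order, w=w))
    return np.maximum(free_fit, constrained_fit)

x = np.linspace(-1, 1, 500)
y = 0.2 * x ** 2 + np.exp(-((x - 0.3) / 0.05) ** 2)  # baseline + peak
baseline = minmax_poly_sketch(x, y)
```

Taking the elementwise maximum keeps whichever fit hugs the data more tightly at each point, which is the core of the adaptive min-max approach.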
- airpls(data, lam=1000000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)
Adaptive iteratively reweighted penalized least squares (airPLS) baseline.
- Parameters:
- data : array-like
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iter : int, optional
The max number of fit iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.
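A minimal dense re-implementation of the airPLS iteration shows the shape of the algorithm (illustrative only; the library uses banded solvers and additional safeguards):

```python
import numpy as np

def airpls_sketch(y, lam=1e4, max_iter=50, tol=1e-3):
    """Dense illustrative airPLS: zero weight on peaks, exponential weights below."""
    y = np.asarray(y, dtype=float)
    N = y.size
    D = np.diff(np.eye(N), n=2, axis=0)   # second-order difference matrix
    penalty = lam * D.T @ D
    w = np.ones(N)
    for i in range(1, max_iter + 1):
        # Solve (W + lam * D^T D) z = W y for the baseline z
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        residual = y - z
        neg = residual < 0
        neg_sum = abs(residual[neg].sum())
        if neg_sum / np.abs(y).sum() < tol:   # convergence check from the paper
            break
        w = np.zeros(N)
        w[neg] = np.exp(i * np.abs(residual[neg]) / neg_sum)
    return z

x = np.linspace(0, 1, 200)
y = 2 + x + 3 * np.exp(-((x - 0.5) / 0.03) ** 2)
z = airpls_sketch(y)
```

Points above the baseline (peaks) get zero weight, while points below it get exponentially increasing weights each iteration, which pulls the baseline under the peaks.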
- amormol(data, half_window=None, tol=0.001, max_iter=200, pad_kwargs=None, **window_kwargs)
Iteratively averaging morphological and mollified (aMorMol) baseline.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- half_window : int, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- tol : float, optional
The exit criteria. Default is 1e-3.
- max_iter : int, optional
The maximum number of iterations. Default is 200.
- pad_kwargs : dict, optional
A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Chen, H., et al. An Adaptive and Fully Automated Baseline Correction Method for Raman Spectroscopy Based on Morphological Operations and Mollifications. Applied Spectroscopy, 2019, 73(3), 284-293.
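The two building blocks, a morphological opening and a mollifier convolution, can be sketched in pure NumPy (illustrative primitives only; the full amormol iteration combines erosion/dilation averages and loops until tol is reached):

```python
import numpy as np

def rolling_extreme(y, window, func):
    """Sliding-window min or max with edge padding (a simple erosion/dilation)."""
    pad = window // 2
    padded = np.pad(y, pad, mode='edge')
    return np.array([func(padded[i:i + window]) for i in range(y.size)])

def opened_and_mollified(y, half_window):
    """Morphological opening (erosion then dilation) plus mollifier smoothing."""
    window = 2 * half_window + 1
    opened = rolling_extreme(rolling_extreme(y, window, np.min), window, np.max)
    # Compactly supported bump function, the classic mollifier kernel
    t = np.linspace(-1, 1, window)[1:-1]   # drop endpoints to avoid division by zero
    kernel = np.exp(-1.0 / (1.0 - t ** 2))
    kernel /= kernel.sum()
    return np.convolve(opened, kernel, mode='same')

x = np.linspace(0, 1, 300)
y = 1 + 0.5 * x + np.exp(-((x - 0.5) / 0.02) ** 2)
smoothed = opened_and_mollified(y, half_window=15)
```

The opening removes peaks narrower than the window, and the mollifier smooths the flat plateaus that opening leaves behind.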
- arpls(data, lam=100000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)
Asymmetrically reweighted penalized least squares smoothing (arPLS).
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iter : int, optional
The max number of fit iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Baek, S.J., et al. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst, 2015, 140, 250-257.
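A dense illustrative version of the arPLS loop, using the sigmoidal weight update from the paper (the library's implementation is banded and more defensive):

```python
import numpy as np

def arpls_sketch(y, lam=1e4, max_iter=50, tol=1e-3):
    """Dense illustrative arPLS with its generalized-logistic weight update."""
    y = np.asarray(y, dtype=float)
    N = y.size
    D = np.diff(np.eye(N), n=2, axis=0)   # second-order difference matrix
    penalty = lam * D.T @ D
    w = np.ones(N)
    for _ in range(max_iter):
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        d = y - z
        neg = d[d < 0]
        if neg.size < 2:
            break
        mean, std = neg.mean(), neg.std()
        # Weighting from the arPLS paper, clipped to avoid exp overflow
        exponent = np.clip(2.0 * (d - (2.0 * std - mean)) / std, -500, 500)
        w_new = 1.0 / (1.0 + np.exp(exponent))
        if np.linalg.norm(w_new - w) / np.linalg.norm(w) < tol:
            w = w_new
            break
        w = w_new
    return z

x = np.linspace(0, 1, 200)
y = 2 + x + 3 * np.exp(-((x - 0.5) / 0.03) ** 2)
z = arpls_sketch(y)
```

Because the weights depend on the statistics of the negative residuals, arPLS needs no asymmetry parameter p, unlike asls() below.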
- asls(data, lam=1000000.0, p=0.01, diff_order=2, max_iter=50, tol=0.001, weights=None)
Fits the baseline using asymmetric least squares (AsLS) fitting.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- p : float, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iter : int, optional
The max number of fit iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
References
Eilers, P. A Perfect Smoother. Analytical Chemistry, 2003, 75(14), 3631-3636.
Eilers, P., et al. Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report, 2005, 1(1).
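The AsLS iteration is compact enough to sketch densely with NumPy (illustrative; pybaselines solves the equivalent banded system far more efficiently):

```python
import numpy as np

def asls_sketch(y, lam=1e4, p=0.01, max_iter=50, tol=1e-3):
    """Dense illustrative AsLS: asymmetric weights p (above) and 1 - p (below)."""
    y = np.asarray(y, dtype=float)
    N = y.size
    D = np.diff(np.eye(N), n=2, axis=0)   # second-order difference matrix
    penalty = lam * D.T @ D
    w = np.ones(N)
    for _ in range(max_iter):
        # Solve (W + lam * D^T D) z = W y for the baseline z
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        w_new = np.where(y > z, p, 1 - p)   # asymmetric reweighting
        if np.linalg.norm(w_new - w) / np.linalg.norm(w) < tol:
            w = w_new
            break
        w = w_new
    return z, w

x = np.linspace(0, 1, 200)
y = 2 + x + 3 * np.exp(-((x - 0.5) / 0.03) ** 2)
baseline, w = asls_sketch(y)
```

With p much less than 0.5, points above the fit contribute almost nothing, so the smooth solution settles under the peaks.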
- aspls(data, lam=100000.0, diff_order=2, max_iter=100, tol=0.001, weights=None, alpha=None)
Adaptive smoothness penalized least squares smoothing (asPLS).
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iter : int, optional
The max number of fit iterations. Default is 100.
- tol : float, optional
The exit criteria. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- alpha : array-like, shape (N,), optional
An array of values that control the local value of lam to better fit peak and non-peak regions. If None (default), then the initial values will be an array with size equal to N and all values set to 1.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'alpha': numpy.ndarray, shape (N,)
The array of alpha values used for fitting the data in the final iteration.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
Notes
The weighting uses an asymmetric coefficient (k in the asPLS paper) of 0.5 instead of the 2 listed in the asPLS paper. pybaselines uses the factor of 0.5 since it matches the results in Table 2 and Figure 5 of the asPLS paper more closely than the factor of 2 and fits noisy data much better.
References
Zhang, F., et al. Baseline correction for infrared spectra using adaptive smoothness parameter penalized least squares method. Spectroscopy Letters, 2020, 53(3), 222-233.
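The note above about the asymmetric coefficient can be made concrete. The sketch below is one plausible reading of a sigmoidal asPLS-style weight update with k = 0.5, using the standard deviation of the negative residuals; the function name and exact scaling are illustrative assumptions, not the library's code:

```python
import numpy as np

def aspls_style_weights(residual, k=0.5):
    """Sigmoidal weights; k is the asymmetric coefficient discussed in the Notes."""
    neg = residual[residual < 0]
    std = neg.std() if neg.size else 1.0
    return 1.0 / (1.0 + np.exp(k * (residual - std) / std))

residual = np.array([-0.5, -0.1, 0.0, 0.2, 5.0])
w = aspls_style_weights(residual)
```

Large positive residuals (peaks) drive the weight toward 0, while negative residuals keep weights near 1; a smaller k makes that transition gentler, which is why it behaves better on noisy data.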
- beads(data, freq_cutoff=0.005, lam_0=1.0, lam_1=1.0, lam_2=1.0, asymmetry=6.0, filter_type=1, cost_function=2, max_iter=50, tol=0.01, eps_0=1e-06, eps_1=1e-06, fit_parabola=True, smooth_half_window=None)
Baseline estimation and denoising with sparsity (BEADS).
Decomposes the input data into baseline and pure, noise-free signal by modeling the baseline as a low pass filter and by considering the signal and its derivatives as sparse [1].
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- freq_cutoff : float, optional
The cutoff frequency of the high pass filter, normalized such that 0 < freq_cutoff < 0.5. Default is 0.005.
- lam_0 : float, optional
The regularization parameter for the signal values. Default is 1.0. Higher values give a higher penalty.
- lam_1 : float, optional
The regularization parameter for the first derivative of the signal. Default is 1.0. Higher values give a higher penalty.
- lam_2 : float, optional
The regularization parameter for the second derivative of the signal. Default is 1.0. Higher values give a higher penalty.
- asymmetry : float, optional
A number greater than 0 that determines the weighting of negative values compared to positive values in the cost function. Default is 6.0, which gives negative values six times more impact on the cost function than positive values. Set to 1 for a symmetric cost function, or a value less than 1 to weigh positive values more.
- filter_type : int, optional
An integer describing the high pass filter type. The order of the high pass filter is 2 * filter_type. Default is 1 (second order filter).
- cost_function : {2, 1, "l1_v1", "l1_v2"}, optional
An integer or string indicating which approximation of the l1 (absolute value) penalty to use. 1 or "l1_v1" will use \(l(x) = \sqrt{x^2 + \text{eps_1}}\) and 2 (default) or "l1_v2" will use \(l(x) = |x| - \text{eps_1}\log{(|x| + \text{eps_1})}\).
- max_iter : int, optional
The maximum number of iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-2.
- eps_0 : float, optional
The cutoff threshold between absolute loss and quadratic loss. Values in the signal with absolute value less than eps_0 will have quadratic loss. Default is 1e-6.
- eps_1 : float, optional
A small, positive value used to prevent issues when the first or second order derivatives are close to zero. Default is 1e-6.
- fit_parabola : bool, optional
If True (default), will fit a parabola to the data and subtract it before performing the beads fit as suggested in [2]. This ensures the endpoints of the fit data are close to 0, which is required by beads. If the data is already close to 0 on both endpoints, set fit_parabola to False.
- smooth_half_window : int, optional
The half-window to use for smoothing the derivatives of the data with a moving average and full window size of 2 * smooth_half_window + 1. Smoothing can improve the convergence of the calculation, and make the calculation less sensitive to small changes in lam_1 and lam_2, as noted in the pybeads package [3]. Default is None, which will not perform any smoothing.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'signal': numpy.ndarray, shape (N,)
The pure signal portion of the input data without noise or the baseline.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if asymmetry is less than 0.
Notes
The default lam_0, lam_1, and lam_2 values are good starting points for a dataset with 1000 points. Typically, smaller values are needed for larger datasets and larger values for smaller datasets.
When finding the best parameters for fitting, it is usually best to find the optimal freq_cutoff for the noise in the data before adjusting any other parameters since it has the largest effect [2].
References
[1] Ning, X., et al. Chromatogram baseline estimation and denoising using sparsity (BEADS). Chemometrics and Intelligent Laboratory Systems, 2014, 139, 156-167.
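The two penalty approximations selected by cost_function are direct transcriptions of the formulas above:

```python
import numpy as np

def l1_v1(x, eps_1=1e-6):
    """Smooth approximation of |x|: sqrt(x^2 + eps_1)."""
    return np.sqrt(x ** 2 + eps_1)

def l1_v2(x, eps_1=1e-6):
    """Smooth approximation of |x|: |x| - eps_1 * log(|x| + eps_1)."""
    return np.abs(x) - eps_1 * np.log(np.abs(x) + eps_1)

vals = np.array([-2.0, 0.0, 2.0])
approx_1, approx_2 = l1_v1(vals), l1_v2(vals)
```

Both are differentiable at zero, which is what allows the majorization-minimization solver to handle the otherwise non-smooth absolute-value penalty.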
- collab_pls(data, average_dataset=True, method='asls', method_kwargs=None)
Collaborative Penalized Least Squares (collab-PLS).
Averages the data or the fit weights for an entire dataset to get more optimal results. Uses any Whittaker-smoothing-based or weighted spline algorithm.
- Parameters:
- data : array-like, shape (M, N)
An array with shape (M, N) where M is the number of entries in the dataset and N is the number of data points in each entry.
- average_dataset : bool, optional
If True (default) will average the dataset before fitting to get the weighting. If False, will fit each individual entry in the dataset and then average the weights to get the weighting for the dataset.
- method : str, optional
A string indicating the Whittaker-smoothing-based or weighted spline method to use for fitting the baseline. Default is 'asls'.
- method_kwargs : dict, optional
A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.
- Returns:
- baselines : numpy.ndarray, shape (M, N)
An array of all of the baselines.
- params : dict
A dictionary with the following items:
- 'average_weights': numpy.ndarray, shape (N,)
The weight array used to fit all of the baselines.
- 'average_alpha': numpy.ndarray, shape (N,)
Only returned if method is 'aspls' or 'pspline_aspls'. The alpha array used to fit all of the baselines for the aspls() or pspline_aspls() methods.
Additional items depend on the output of the selected method. Every other key will have a list of values, with each item corresponding to a fit.
Notes
If method is 'aspls' or 'pspline_aspls', collab_pls will also calculate the alpha array for the entire dataset in the same manner as the weights.
References
Chen, L., et al. Collaborative Penalized Least Squares for Background Correction of Multiple Raman Spectra. Journal of Analytical Methods in Chemistry, 2018, 2018.
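The collaborative idea, derive one shared weight array (from the averaged dataset or from averaged per-entry weights) and reuse it for every entry, can be sketched with a tiny asymmetric polynomial fit standing in for a Whittaker method (all helper names here are hypothetical):

```python
import numpy as np
from numpy.polynomial import polynomial as Poly

def asym_poly_fit(y, weights=None, poly_order=3, iterations=10):
    """Tiny asymmetric weighted polynomial fit standing in for a Whittaker method."""
    x = np.linspace(-1, 1, y.size)
    if weights is None:
        weights = np.ones(y.size)
        for _ in range(iterations):
            z = Poly.polyval(x, Poly.polyfit(x, y, poly_order, w=weights))
            weights = np.where(y > z, 0.01, 0.99)  # asymmetric update
    z = Poly.polyval(x, Poly.polyfit(x, y, poly_order, w=weights))
    return z, weights

def collab_sketch(dataset, average_dataset=True):
    dataset = np.asarray(dataset, dtype=float)
    if average_dataset:
        # Fit the averaged dataset once to get shared weights
        _, shared_w = asym_poly_fit(dataset.mean(axis=0))
    else:
        # Fit every entry, then average the per-entry weights
        shared_w = np.mean([asym_poly_fit(row)[1] for row in dataset], axis=0)
    baselines = np.array([asym_poly_fit(row, weights=shared_w)[0] for row in dataset])
    return baselines, shared_w

x = np.linspace(-1, 1, 200)
dataset = [1 + 0.5 * x + np.exp(-((x - s) / 0.05) ** 2) for s in (0.0, 0.1, -0.1)]
baselines, shared_w = collab_sketch(dataset)
```

Sharing the weights keeps the peak/baseline classification consistent across spectra, which is the benefit the reference reports for related measurements.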
- corner_cutting(data, max_iter=100)
Iteratively removes corner points and creates a Bezier spline from the remaining points.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- max_iter : int, optional
The maximum number of iterations to try to remove corner points. Default is 100. Typically all corner points are removed in 10 to 20 iterations.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
An empty dictionary, just to match the output of all other algorithms.
References
Liu, Y.J., et al. A Concise Iterative Method with Bezier Technique for Baseline Construction. Analyst, 2015, 140(23), 7984-7996.
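A much-simplified cut loop illustrates the idea (this stand-in removes any interior point lying above the chord between its neighbors; the published method uses an area-based corner criterion and finishes with a Bezier spline through the survivors, both omitted here):

```python
import numpy as np

def chord_cut_sketch(x, y, max_iter=100):
    """Iteratively drop interior points that sit above the neighbor chord."""
    xs, ys = list(map(float, x)), list(map(float, y))
    for _ in range(max_iter):
        removed = False
        i = 1
        while i < len(xs) - 1:
            # Linear interpolation between the two neighbors at xs[i]
            t = (xs[i] - xs[i - 1]) / (xs[i + 1] - xs[i - 1])
            chord = ys[i - 1] + t * (ys[i + 1] - ys[i - 1])
            if ys[i] > chord:       # a "corner": cut it
                del xs[i], ys[i]
                removed = True
            else:
                i += 1
        if not removed:
            break
    return np.array(xs), np.array(ys)

x = np.linspace(0, 1, 100)
y = x ** 2 + np.exp(-((x - 0.5) / 0.05) ** 2)  # convex baseline + peak
xk, yk = chord_cut_sketch(x, y)
```

Peak points protrude above the local chords and are cut, while points on the underlying convex baseline survive; interpolating through the survivors then gives a baseline estimate.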
- custom_bc(data, method='asls', regions=((None, None),), sampling=1, lam=None, diff_order=2, method_kwargs=None)
Customized baseline correction for fine tuned stiffness of the baseline at specific regions.
Divides the data into regions with variable numbers of data points and then uses other baseline algorithms to fit the truncated data. Regions with fewer points effectively make the fit baseline stiffer in those regions.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- method : str, optional
A string indicating the algorithm to use for fitting the baseline; can be any non-optimizer algorithm in pybaselines. Default is 'asls'.
- regions : array-like, shape (M, 2), optional
The two dimensional array containing the start and stop indices for each region of interest. Each region is defined as data[start:stop]. Default is ((None, None),), which will use all points.
- sampling : int or array-like, optional
The sampling step size for each region defined in regions. If sampling is an integer, then all regions will use the same index step size; if sampling is an array-like, its length must be equal to M, the first dimension in regions. Default is 1, which will use all points.
- lam : float or None, optional
The value for smoothing the calculated interpolated baseline using Whittaker smoothing, in order to reduce the kinks between regions. Default is None, which will not smooth the baseline; a value of 0 will also not perform smoothing.
- diff_order : int, optional
The difference order used for Whittaker smoothing of the calculated baseline. Default is 2.
- method_kwargs : dict, optional
A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The baseline calculated with the optimum parameter.
- params : dict
A dictionary with the following items:
- 'x_fit': numpy.ndarray, shape (P,)
The truncated x-values used for fitting the baseline.
- 'y_fit': numpy.ndarray, shape (P,)
The truncated y-values used for fitting the baseline.
Additional items depend on the output of the selected method.
- Raises:
- ValueError
Raised if regions is not two dimensional, if sampling is not the same length as regions.shape[0], if any values in sampling or regions are less than 1, if segments in regions overlap, or if any value in regions is greater than the length of the input data.
Notes
Uses Whittaker smoothing to smooth the transitions between regions rather than LOESS as used in [31].
Uses binning rather than direct truncation of the regions in order to get better results for noisy data.
References
[31] Liland, K., et al. Customized baseline correction. Chemometrics and Intelligent Laboratory Systems, 2011, 109(1), 51-56.
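The region/sampling bookkeeping can be sketched as simple strided slicing (truncate_regions is a hypothetical helper; per the Notes above, the library actually bins each region rather than slicing):

```python
import numpy as np

def truncate_regions(x, y, regions=((None, None),), sampling=1):
    """Collect x_fit/y_fit by taking every `sampling`-th point in each region."""
    x, y = np.asarray(x), np.asarray(y)
    if np.ndim(sampling) == 0:
        sampling = [sampling] * len(regions)
    x_fit, y_fit = [], []
    for (start, stop), step in zip(regions, sampling):
        x_fit.append(x[start:stop:step])
        y_fit.append(y[start:stop:step])
    return np.concatenate(x_fit), np.concatenate(y_fit)

x = np.arange(100)
y = 0.1 * x
# Dense sampling in the first half, every 5th point in the second half
x_fit, y_fit = truncate_regions(x, y, regions=((0, 50), (50, 100)), sampling=(1, 5))
```

A larger step size in a region leaves fewer points for the underlying method to fit there, making the resulting baseline stiffer in that region.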
- cwt_br(data, poly_order=5, scales=None, num_std=1.0, min_length=2, max_iter=50, tol=0.001, symmetric=False, weights=None, **pad_kwargs)
Continuous wavelet transform baseline recognition (CWT-BR) algorithm.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_order : int, optional
The polynomial order for fitting the baseline. Default is 5.
- scales : array-like, optional
The scales at which to perform the continuous wavelet transform. Default is None.
- num_std : float, optional
The number of standard deviations to include when thresholding. Default is 1.0.
- min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures less points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
- max_iter : int, optional
The maximum number of iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-3.
- symmetric : bool, optional
When fitting the identified baseline points with a polynomial, if symmetric is False (default), will add any point i as a baseline point where the fit polynomial is greater than the input data for N/100 consecutive points on both sides of point i. If symmetric is True, then it means that both positive and negative peaks exist and baseline points are not modified during the polynomial fitting.
- weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'best_scale': scalar
The scale at which the Shannon entropy of the continuous wavelet transform of the data is at a minimum.
Notes
Uses the standard deviation for determining outliers during polynomial fitting rather than the standard error as used in the reference since the number of standard errors to include when thresholding varies with data size while the number of standard deviations is independent of data size.
References
Bertinetto, C., et al. Automatic Baseline Recognition for the Correction of Large Sets of Spectra Using Continuous Wavelet Transform and Iterative Fitting. Applied Spectroscopy, 2014, 68(2), 155-164.
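The 'best_scale' selection can be sketched by hand: compute a Ricker-wavelet transform at each candidate scale and keep the scale whose coefficient distribution has minimum Shannon entropy. This is a simplified reading of the selection step; the wavelet normalization and entropy details here are illustrative assumptions:

```python
import numpy as np

def ricker(points, scale):
    """Ricker (Mexican hat) wavelet sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    return (1 - (t / scale) ** 2) * np.exp(-t ** 2 / (2 * scale ** 2))

def best_scale_sketch(y, scales):
    """Return the scale minimizing the Shannon entropy of |CWT coefficients|."""
    entropies = []
    for scale in scales:
        wavelet = ricker(min(10 * int(scale) + 1, y.size), scale)
        coeffs = np.convolve(y, wavelet, mode='same')
        p = np.abs(coeffs)
        p = p / p.sum()
        p = p[p > 0]
        entropies.append(-np.sum(p * np.log(p)))
    return scales[int(np.argmin(entropies))]

x = np.linspace(0, 1, 400)
y = np.exp(-((x - 0.5) / 0.05) ** 2)
scale = best_scale_sketch(y, scales=np.arange(2, 40, 4))
```

Low entropy means the wavelet coefficients are concentrated at the peaks rather than spread across the whole signal, which is why the entropy minimum marks a scale well matched to the peak widths.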
- derpsalsa(data, lam=1000000.0, p=0.01, k=None, diff_order=2, max_iter=50, tol=0.001, weights=None, smooth_half_window=None, num_smooths=16, **pad_kwargs)
Derivative Peak-Screening Asymmetric Least Squares Algorithm (derpsalsa).
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- p : float, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- k : float, optional
A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().
- diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iter : int, optional
The max number of fit iterations. Default is 50.
- tol : float, optional
The exit criteria. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- smooth_half_window : int, optional
The half-window to use for smoothing the data before computing the first and second derivatives. Default is None, which will use len(data) / 200.
- num_smooths : int, optional
The number of times to smooth the data before computing the first and second derivatives. Default is 16.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
References
Korepanov, V. Asymmetric least-squares baseline algorithm with peak screening for automatic processing of the Raman spectra. Journal of Raman Spectroscopy. 2020, 51(10), 2061-2065.
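The repeated pre-smoothing before differentiation can be sketched directly (only this preparation step, not the full derpsalsa weighting; the helper name is hypothetical):

```python
import numpy as np

def smoothed_derivatives(y, smooth_half_window=None, num_smooths=16):
    """Repeatedly smooth with a moving average, then take first/second derivatives."""
    y = np.asarray(y, dtype=float)
    if smooth_half_window is None:
        smooth_half_window = max(1, y.size // 200)
    kernel = np.full(2 * smooth_half_window + 1, 1.0)
    kernel /= kernel.size
    smoothed = y
    for _ in range(num_smooths):
        smoothed = np.convolve(smoothed, kernel, mode='same')
    first = np.gradient(smoothed)
    second = np.gradient(first)
    return smoothed, first, second

rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, np.pi, 400)) + rng.normal(0, 0.1, 400)
smoothed, d1, d2 = smoothed_derivatives(y)
```

Derivatives amplify noise, so smoothing many times with a small window (rather than once with a large one) keeps the derivative-based peak screening from latching onto noise spikes.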
- dietrich(data, smooth_half_window=None, num_std=3.0, interp_half_window=5, poly_order=5, max_iter=50, tol=0.001, weights=None, return_coef=False, min_length=2, **pad_kwargs)
Dietrich's method for identifying baseline regions.
Calculates the power spectrum of the data as the squared derivative of the data. Then baseline points are identified by iteratively removing points where the mean of the power spectrum is less than num_std times the standard deviation of the power spectrum.
- Parameters:
- data : array-like, shape (N,)
The y-values of the measured data, with N data points.
- smooth_half_window : int, optional
The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 256. Set to 0 to not smooth the data.
- num_std : float, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.
- interp_half_window : int, optional
When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
- poly_order : int, optional
The polynomial order for fitting the identified baseline. Default is 5.
- max_iter : int, optional
The maximum number of iterations for fitting a polynomial to the identified baseline. If max_iter is 0, the returned baseline will be just the linear interpolation of the baseline segments. Default is 50.
- tol : float, optional
The exit criteria for fitting a polynomial to the identified baseline points. Default is 1e-3.
- weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- return_coef : bool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures less points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.
- Returns:
- baseline : numpy.ndarray, shape (N,)
The calculated baseline.
- params : dict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True and max_iter is greater than 0. The array of polynomial coefficients for the baseline, in increasing order. Can be used to create a polynomial using
numpy.polynomial.polynomial.Polynomial
.
- 'tol_history': numpy.ndarray
Only if max_iter is greater than 1. An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
Notes
When choosing parameters, first choose a smooth_half_window that appropriately smooths the data, and then reduce num_std until no peak regions are included in the baseline. If no value of num_std works, change smooth_half_window and repeat.
If max_iter is 0, the baseline is simply a linear interpolation of the identified baseline points. Otherwise, a polynomial is iteratively fit through the baseline points, and the interpolated sections are replaced each iteration with the polynomial fit.
References
Dietrich, W., et al. Fast and Precise Automatic Baseline Correction of One- and Two-Dimensional NMR Spectra. Journal of Magnetic Resonance. 1991, 91, 1-11.
- drpls(data, lam=100000.0, eta=0.5, max_iter=50, tol=0.001, weights=None, diff_order=2)
Doubly reweighted penalized least squares (drPLS) baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- etafloat, optional
A term for controlling the value of lam; should be between 0 and 1. Low values will produce smoother baselines, while higher values will more aggressively fit peaks. Default is 0.5.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if eta is not between 0 and 1 or if diff_order is less than 2.
References
Xu, D. et al. Baseline correction method based on doubly reweighted penalized least squares, Applied Optics, 2019, 58, 3913-3920.
- fabc(data, lam=1000000.0, scale=None, num_std=3.0, diff_order=2, min_length=2, weights=None, weights_as_mask=False, **pad_kwargs)
Fully automatic baseline correction (fabc).
Similar to Dietrich's method, except that the derivative is estimated using a continuous wavelet transform and the baseline is calculated using Whittaker smoothing through the identified baseline points.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- scaleint, optional
The scale at which to calculate the continuous wavelet transform. Should be approximately equal to the index-based full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from
optimize_window()
, which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- num_stdfloat, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- min_lengthint, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
- weightsarray-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- weights_as_maskbool, optional
If True, signifies that the input weights is the mask to use for fitting, which skips the continuous wavelet calculation and just smooths the input data. Default is False.
- **pad_kwargs
Additional keyword arguments to pass to
pad_edges()
for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
Notes
The classification of baseline points is similar to
dietrich()
, except that this method approximates the first derivative using a continuous wavelet transform with the Haar wavelet, which is more robust than the numerical derivative in Dietrich's method.
References
Cobas, J., et al. A new general-purpose fully automatic baseline-correction procedure for 1D and 2D NMR data. Journal of Magnetic Resonance, 2006, 183(1), 145-151.
- fastchrom(data, half_window=None, threshold=None, min_fwhm=None, interp_half_window=5, smooth_half_window=None, weights=None, max_iter=100, min_length=2, **pad_kwargs)
Identifies baseline segments by thresholding the rolling standard deviation distribution.
Baseline points are identified as any point where the rolling standard deviation is less than the specified threshold. Peak regions are iteratively interpolated until the baseline is below the data.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from
optimize_window()
, which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- thresholdfloat or Callable, optional
All points in the rolling standard deviation below threshold will be considered as baseline. Higher values will assign more points as baseline. Default is None, which will set the threshold as the 15th percentile of the rolling standard deviation. If threshold is Callable, it should take the rolling standard deviation as the only argument and output a float.
- min_fwhmint, optional
After creating the interpolated baseline, any region where the baseline is greater than the data for min_fwhm consecutive points will have an additional baseline point added and reinterpolated. Should be set to approximately the index-based full-width-at-half-maximum of the smallest peak. Default is None, which uses 2 * half_window.
- interp_half_windowint, optional
When interpolating between baseline segments, will use the average of
data[i-interp_half_window:i+interp_half_window+1]
, where i is the index of the peak start or end, to fit the linear segment. Default is 5.
- smooth_half_windowint, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
- weightsarray-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- max_iterint, optional
The maximum number of iterations to attempt to fill in regions where the baseline is greater than the input data. Default is 100.
- min_lengthint, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
- **pad_kwargs
Additional keyword arguments to pass to
pad_edges()
for padding the edges of the data to prevent edge effects from the moving average smoothing.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
Notes
Only covers the baseline correction from FastChrom, not its peak finding and peak grouping capabilities.
References
Johnsen, L., et al. An automated method for baseline correction, peak finding and peak grouping in chromatographic data. Analyst. 2013, 138, 3502-3511.
- goldindec(data, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_indec', peak_ratio=0.5, alpha_factor=0.99, tol_2=0.001, tol_3=1e-06, max_iter_2=100, return_coef=False)
Fits a polynomial baseline using a non-quadratic cost function.
The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- tolfloat, optional
The exit criteria for the fitting with a given threshold value. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations for fitting a threshold value. Default is 250.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- cost_functionstr, optional
The non-quadratic cost function to minimize. Unlike
penalized_poly()
, this function only works with asymmetric cost functions, so the symmetry prefix ('a' or 'asymmetric') is optional (e.g. 'indec' and 'a_indec' are the same). Default is 'asymmetric_indec'. Available methods, and their associated reference, are:
- peak_ratiofloat, optional
A value between 0 and 1 that designates how many points in the data belong to peaks. Values are valid within ~10% of the actual peak ratio. Default is 0.5.
- alpha_factorfloat, optional
A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.
- tol_2float, optional
The exit criteria for the difference between the optimal up-down ratio (number of points above 0 in the residual compared to number of points below 0) and the up-down ratio for a given threshold value. Default is 1e-3.
- tol_3float, optional
The exit criteria for the relative change in the threshold value. Default is 1e-6.
- max_iter_2float, optional
The number of iterations for iterating between different threshold values. Default is 100.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray, shape (J, K)
An array containing the calculated tolerance values for each iteration of both threshold values and fit values. Index 0 are the tolerance values for the difference in up-down ratios, index 1 are the tolerance values for the relative change in the threshold, and indices >= 2 are the tolerance values for each fit. All values that were not used in fitting have values of 0. Shape J is 2 plus the number of iterations for the threshold to converge (related to max_iter_2, tol_2, tol_3), and shape K is the maximum of the number of iterations for the threshold and the maximum number of iterations for all of the fits of the various threshold values (related to max_iter and tol).
- 'threshold': float
The optimal threshold value. Could be used in
penalized_poly()
for fitting other similar data.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using
numpy.polynomial.polynomial.Polynomial
.
- Raises:
- ValueError
Raised if alpha_factor or peak_ratio are not between 0 and 1, or if the specified cost function is symmetric.
References
[25] Liu, J., et al. Goldindec: A Novel Algorithm for Raman Spectrum Baseline Correction. Applied Spectroscopy, 2015, 69(7), 834-842.
- golotvin(data, half_window=None, num_std=2.0, sections=32, smooth_half_window=None, interp_half_window=5, weights=None, min_length=2, **pad_kwargs)
Golotvin's method for identifying baseline regions.
Divides the data into sections and takes the minimum standard deviation of all sections as the noise standard deviation for the entire data. Then classifies any point where the rolling max minus min is less than
num_std * noise standard deviation
as belonging to the baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window to use for the rolling maximum and rolling minimum calculations. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from
optimize_window()
, which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- num_stdfloat, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 2.0.
- sectionsint, optional
The number of sections to divide the input data into for finding the minimum standard deviation. Default is 32.
- smooth_half_windowint, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
- interp_half_windowint, optional
When interpolating between baseline segments, will use the average of
data[i-interp_half_window:i+interp_half_window+1]
, where i is the index of the peak start or end, to fit the linear segment. Default is 5.
- weightsarray-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- min_lengthint, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
- **pad_kwargs
Additional keyword arguments to pass to
pad_edges()
for padding the edges of the data to prevent edge effects from the moving average smoothing.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
References
Golotvin, S., et al. Improved Baseline Recognition and Modeling of FT NMR Spectra. Journal of Magnetic Resonance. 2000, 146, 122-125.
- iarpls(data, lam=100000.0, diff_order=2, max_iter=50, tol=0.001, weights=None)
Improved asymmetrically reweighted penalized least squares smoothing (IarPLS).
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Ye, J., et al. Baseline correction method based on improved asymmetrically reweighted penalized least squares for Raman spectrum. Applied Optics, 2020, 59, 10933-10943.
- iasls(data, lam=1000000.0, p=0.01, lam_1=0.0001, max_iter=50, tol=0.001, weights=None, diff_order=2)
Fits the baseline using the improved asymmetric least squares (IAsLS) algorithm.
The algorithm considers both the first and second derivatives of the residual.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- lam_1float, optional
The smoothing parameter for the first derivative of the residual. Default is 1e-4.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be set by fitting the data with a second order polynomial.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1 or if diff_order is less than 2.
References
He, S., et al. Baseline correction for raman spectra using an improved asymmetric least squares method, Analytical Methods, 2014, 6(12), 4402-4407.
- imodpoly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=True, return_coef=False, num_std=1.0)
The improved modified polynomial (IModPoly) baseline algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 250.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- use_originalbool, optional
If False (default), will compare the baseline of each iteration with the y-values of that iteration [11] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [12].
- mask_initial_peaksbool, optional
If True (default), will mask any data where the initial baseline fit + the standard deviation of the residual is less than measured data [13].
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- num_stdfloat, optional
The number of standard deviations to include when thresholding. Default is 1. Must be greater than or equal to 0.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using
numpy.polynomial.polynomial.Polynomial
.
- Raises:
- ValueError
Raised if num_std is less than 0.
Notes
Algorithm originally developed in [13].
References
[11]Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.
[12]Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.
- imor(data, half_window=None, tol=0.001, max_iter=200, **window_kwargs)
An Improved Morphological based (IMor) baseline algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using
optimize_window()
and window_kwargs.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 200.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Dai, L., et al. An Automated Baseline Correction Method Based on Iterative Morphological Operations. Applied Spectroscopy, 2018, 72(5), 731-739.
- interp_pts(data=None, baseline_points=(), interp_method='linear')
Creates a baseline by interpolating through input points.
- Parameters:
- dataarray-like, optional
The y-values. Not used by this function, but input is allowed for consistency with other functions.
- baseline_pointsarray-like, shape (n, 2)
An array of ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)) values for each point representing the baseline.
- interp_methodstr, optional
The method to use for interpolation. See
scipy.interpolate.interp1d
for all options. Default is 'linear', which connects each point with a line segment.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The baseline array constructed from interpolating between each input baseline point.
- dict
An empty dictionary, just to match the output of all other algorithms.
- Raises:
- ValueError
Raised if baseline_points does not contain at least two values, signifying one x-y point.
Notes
This method is only suggested for use within user-interfaces.
Regions of the baseline where x_data is less than the minimum x-value or greater than the maximum x-value in baseline_points will be assigned values of 0.
- ipsa(data, half_window=None, max_iter=500, tol=None, roi=None, original_criteria=False, **pad_kwargs)
Iterative Polynomial Smoothing Algorithm (IPSA).
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window to use for smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use 4 times the output of
optimize_window()
, which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- max_iterint, optional
The maximum number of iterations. Default is 500.
- tolfloat, optional
The exit criteria. Default is None, which uses 1e-3 if original_criteria is False, and
1 / (max(data) - min(data))
if original_criteria is True.
- roislice or array-like, shape (N,), optional
The region of interest, such that
np.asarray(data)[roi]
gives the values for calculating the tolerance if original_criteria is True. Not used if original_criteria is False. Default is None, which uses all values in data.
- original_criteriabool, optional
Whether to use the original exit criteria from the reference, which is difficult to use since it requires knowledge of how high the peaks should be after baseline correction. If False (default), then compares
norm(old, new) / norm(old)
, where old is the previous iteration's baseline, and new is the current iteration's baseline.
- **pad_kwargs
Additional keyword arguments to pass to
pad_edges()
for padding the edges of the data to prevent edge effects from convolution.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Wang, T., et al. Background Subtraction of Raman Spectra Based on Iterative Polynomial Smoothing. Applied Spectroscopy. 71(6) (2017) 1169-1179.
- irsqr(data, lam=100, quantile=0.05, num_knots=100, spline_degree=3, diff_order=3, max_iter=100, tol=1e-06, weights=None, eps=None)
Iterative Reweighted Spline Quantile Regression (IRSQR).
Fits the baseline using quantile regression with penalized splines.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 100.
- quantilefloat, optional
The quantile at which to fit the baseline. Default is 0.05.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 3 (third order differential matrix). Typical values are 3, 2, or 1.
- max_iterint, optional
The max number of fit iterations. Default is 100.
- tolfloat, optional
The exit criteria. Default is 1e-6.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- epsfloat, optional
A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum-absolute-value of the fit each iteration multiplied by 1e-6.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if quantile is not between 0 and 1.
References
Han, Q., et al. Iterative Reweighted Quantile Regression Using Augmented Lagrangian Optimization for Baseline Correction. 2018 5th International Conference on Information Science and Control Engineering (ICISCE), 2018, 280-284.
- jbcd(data, half_window=None, alpha=0.1, beta=10.0, gamma=1.0, beta_mult=1.1, gamma_mult=0.909, diff_order=1, max_iter=20, tol=0.01, tol_2=0.001, robust_opening=True, **window_kwargs)
Joint Baseline Correction and Denoising (jbcd) Algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using
optimize_window()
and window_kwargs.
- alphafloat, optional
The regularization parameter that controls how close the baseline must fit the calculated morphological opening. Larger values make the fit more constrained to the opening and can make the baseline less smooth. Default is 0.1.
- betafloat, optional
The regularization parameter that controls how smooth the baseline is. Larger values produce smoother baselines. Default is 1e1.
- gammafloat, optional
The regularization parameter that controls how smooth the signal is. Larger values produce a smoother signal. Default is 1.
- beta_multfloat, optional
The value that beta is multiplied by each iteration. Default is 1.1.
- gamma_multfloat, optional
The value that gamma is multiplied by each iteration. Default is 0.909.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 1 (first order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The maximum number of iterations. Default is 20.
- tolfloat, optional
The exit criteria for the change in the calculated signal. Default is 1e-2.
- tol_2float, optional
The exit criteria for the change in the calculated baseline. Default is 1e-3.
- robust_openingbool, optional
If True (default), the opening used to represent the initial baseline is the element-wise minimum between the morphological opening and the average of the morphological erosion and dilation of the opening, similar to mor(). If False, the opening is just the morphological opening, as used in the reference. The robust opening typically represents the baseline better.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
- 'tol_history': numpy.ndarray, shape (K, 2)
An array containing the calculated tolerance values for each iteration. Index 0 holds the tolerance values for the relative change in the signal, and index 1 holds the tolerance values for the relative change in the baseline. The length of the array is the number of iterations completed, K. If the last values in the array are greater than the input tol or tol_2 values, then the function did not converge.
- 'signal': numpy.ndarray, shape (N,)
The pure signal portion of the input data without noise or the baseline.
References
Liu, H., et al. Joint Baseline-Correction and Denoising for Raman Spectra. Applied Spectroscopy, 2015, 69(9), 1013-1022.
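The "robust opening" described for robust_opening can be sketched directly: a minimal, pure-Python illustration (not pybaselines' implementation) of grey-scale erosion, dilation, and opening, combined as the element-wise minimum of the opening and the average of the erosion and dilation of the opening.

```python
# Simplified sketch of the robust opening described above; assumes 1D data
# and truncated windows at the edges (pybaselines handles edges differently).

def erosion(y, half_window):
    """Moving-window minimum (grey-scale erosion)."""
    n = len(y)
    return [min(y[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]

def dilation(y, half_window):
    """Moving-window maximum (grey-scale dilation)."""
    n = len(y)
    return [max(y[max(0, i - half_window):min(n, i + half_window + 1)])
            for i in range(n)]

def opening(y, half_window):
    """Erosion followed by dilation."""
    return dilation(erosion(y, half_window), half_window)

def robust_opening(y, half_window):
    """Element-wise minimum of the opening and the average of the
    erosion and dilation of the opening."""
    opened = opening(y, half_window)
    eroded = erosion(opened, half_window)
    dilated = dilation(opened, half_window)
    return [min(o, 0.5 * (e + d))
            for o, e, d in zip(opened, eroded, dilated)]

# A flat baseline of 1.0 plus one narrow peak: the opening removes the peak
# because every window of width 2 * half_window + 1 contains baseline points.
y = [1.0] * 10 + [5.0, 9.0, 5.0] + [1.0] * 10
baseline = robust_opening(y, half_window=3)
```

The same opening is what half_window and window_kwargs control in the morphological methods below.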
- loess(data, fraction=0.2, total_points=None, poly_order=1, scale=3.0, tol=0.001, max_iter=10, symmetric_weights=False, use_threshold=False, num_std=1, use_original=False, weights=None, return_coef=False, conserve_memory=True, delta=0.0)
Locally estimated scatterplot smoothing (LOESS).
Performs polynomial regression at each data point using the nearest points.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- fractionfloat, optional
The fraction of N data points to include for the fitting on each point. Default is 0.2. Not used if total_points is not None.
- total_pointsint, optional
The total number of points to include for the fitting on each point. Default is None, which will use fraction * N to determine the number of points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 1.
- scalefloat, optional
A scale factor applied to the weighted residuals to control the robustness of the fit. Default is 3.0, as used in [16]. Note that the original loess procedure in [17] used a scale of ~4.05.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 10.
- symmetric_weightsbool, optional
If False (default), will apply weighting asymmetrically, with residuals < 0 having a weight of 1, according to [16]. If True, will apply weighting the same for both positive and negative residuals, which is regular LOESS. If use_threshold is True, this parameter is ignored.
- use_thresholdbool, optional
If False (default), will compute weights each iteration to perform the robust fitting, which is regular LOESS. If True, will apply a threshold on the data being fit each iteration, based on the maximum values of the data and the fit baseline, as proposed by [18], similar to the modpoly and imodpoly techniques.
- num_stdfloat, optional
The number of standard deviations to include when thresholding. Default is 1, which is the value used for the imodpoly technique. Only used if use_threshold is True.
- use_originalbool, optional
If False (default), will compare the baseline of each iteration with the y-values of that iteration [19] when choosing minimum values for thresholding. If True, will compare the baseline with the original y-values given by data [20]. Only used if use_threshold is True.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- conserve_memorybool, optional
If False, will cache the distance-weighted kernels for each value in x_data on the first iteration and reuse them on subsequent iterations to save time. The shape of the array of kernels is (len(x_data), total_points). If True (default), will recalculate the kernels each iteration, which uses very little memory but is slower. conserve_memory can usually be set to False unless x_data and total_points are quite large and caching the kernels causes memory issues. If numba is installed, there is no significant time difference since the calculations are sped up.
- deltafloat, optional
If delta is > 0, will skip all but the last x-value in the range x_last + delta, where x_last is the last x-value to be fit using weighted least squares, and instead use linear interpolation to calculate the fit for those x-values (same behavior as in statsmodels [21] and Cleveland's original Fortran lowess implementation [22]). Fits all x-values if delta is <= 0. Default is 0.0. Note that x_data is scaled to fit in the range [-1, 1], so delta should likewise be scaled. For example, if the desired delta value was 0.01 * (max(x_data) - min(x_data)), then the correctly scaled delta would be 0.02 (i.e. 0.01 * (1 - (-1))).
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data. Does NOT contain the individual distance-weighted kernels for each x-value.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'coef': numpy.ndarray, shape (N, poly_order + 1)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial. If delta is > 0, the coefficients for any skipped x-value will all be 0.
- Raises:
- ValueError
Raised if the number of points per window for the fitting is less than poly_order + 1 or greater than the total number of points, or if the values in self.x are not strictly increasing.
Notes
The iterative, robust, aspect of the fitting can be achieved either through reweighting based on the residuals (the typical usage), or thresholding the fit data based on the residuals, as proposed by [18], similar to the modpoly and imodpoly techniques.
In baseline literature, this procedure is sometimes called "rbe", meaning "robust baseline estimate".
References
[16] (1,2)Ruckstuhl, A.F., et al. Baseline subtraction using robust local regression estimation. J. Quantitative Spectroscopy and Radiative Transfer, 2001, 68, 179-193.
[17]Cleveland, W. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 1979, 74(368), 829-836.
[18] (1,2)Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.
[19]Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.
[20]Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.
[22]https://www.netlib.org/go (lowess.f is the file).
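The core LOESS step above (a local weighted polynomial fit at each point) can be sketched for a single point. This is a minimal illustration, not pybaselines' implementation: loess_point is a hypothetical helper that selects the total_points nearest neighbors and fits a 1st-order polynomial with tricube distance weights.

```python
# One LOESS evaluation: weighted polynomial fit over the nearest neighbors.
import numpy as np

def loess_point(x, y, x0, total_points, poly_order=1):
    # indices of the `total_points` x-values nearest to x0
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:total_points]
    d = dist[idx]
    # tricube kernel, scaled by the farthest included point
    w = (1 - (d / d.max())**3)**3
    # polyfit applies `w` to the residuals, so pass sqrt of the kernel weights
    coefs = np.polynomial.polynomial.polyfit(
        x[idx], y[idx], poly_order, w=np.sqrt(w)
    )
    return np.polynomial.polynomial.polyval(x0, coefs)

x = np.linspace(-1, 1, 101)
y = 2 + 3 * x  # exactly linear data is reproduced by a 1st-order local fit
estimate = loess_point(x, y, 0.25, total_points=21)
```

Repeating this at every x-value (plus the iterative robust reweighting or thresholding described above) gives the full baseline.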
- mixture_model(data, lam=100000.0, p=0.01, num_knots=100, spline_degree=3, diff_order=3, max_iter=50, tol=0.001, weights=None, symmetric=False, num_bins=None)
Considers the data as a mixture model composed of noise and peaks.
Weights are iteratively assigned by calculating the probability each value in the residual belongs to a normal distribution representing the noise.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Used to set the initial weights before performing expectation-maximization. Default is 1e-2.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 3 (third order differential matrix). Typical values are 2 or 3.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1, and then two iterations of reweighted least-squares are performed to provide starting weights for the expectation-maximization of the mixture model.
- symmetricbool, optional
If False (default), the total mixture model will be composed of one normal distribution for the noise and one uniform distribution for positive non-noise residuals. If True, an additional uniform distribution will be added to the mixture model for negative non-noise residuals. symmetric only needs to be set to True when the data contains both positive and negative peaks.
- num_binsint, optional, deprecated
Deprecated since version 1.1.0: num_bins is deprecated since it is no longer necessary for performing the expectation-maximization and will be removed in pybaselines version 1.3.0.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
References
de Rooi, J., et al. Mixture models for baseline estimation. Chemometric and Intelligent Laboratory Systems, 2012, 117, 56-60.
Ghojogh, B., et al. Fitting A Mixture Distribution to Data: Tutorial. arXiv preprint arXiv:1901.06708, 2019.
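The weight-assignment idea above can be illustrated with a single expectation step. This is a hedged sketch, not pybaselines' code: the mixture fractions and noise sigma are assumed fixed here, whereas the actual algorithm estimates them via expectation-maximization.

```python
# E-step sketch: the weight of each point is the posterior probability that
# its residual belongs to the noise (normal) component rather than the
# positive-residual (uniform) peak component.
import math

def posterior_noise_weights(residuals, sigma, fraction_noise=0.5):
    upper = max(residuals)  # support of the uniform peak component
    uniform_density = 1.0 / upper if upper > 0 else 0.0
    weights = []
    for r in residuals:
        normal = math.exp(-0.5 * (r / sigma)**2) / (sigma * math.sqrt(2 * math.pi))
        peak = uniform_density if r > 0 else 0.0
        numer = fraction_noise * normal
        denom = numer + (1 - fraction_noise) * peak
        weights.append(numer / denom if denom > 0 else 1.0)
    return weights

# small residuals -> weight near 1 (noise); large positive residual -> near 0 (peak)
w = posterior_noise_weights([0.01, -0.02, 5.0], sigma=0.05)
```

Points with weight near 1 pull the penalized spline toward them; peak points with weight near 0 are effectively ignored.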
- modpoly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=False, return_coef=False)
The modified polynomial (ModPoly) baseline algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 250.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- use_originalbool, optional
If False (default), will compare the baseline of each iteration with the y-values of that iteration [8] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [9].
- mask_initial_peaksbool, optional
If True, will mask any data where the initial baseline fit + the standard deviation of the residual is less than the measured data [10]. Default is False.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.
Notes
Algorithm originally developed in [9] and then slightly modified in [8].
References
[8] (1,2)Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.
[9] (1,2)Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.
[10]Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.
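The ModPoly loop can be sketched in a few lines. A simplified illustration (not pybaselines' implementation, which also supports weights and mask_initial_peaks): fit a polynomial, clip the data to the element-wise minimum of the data and the fit so peaks stop pulling the fit upward, and repeat until the fit stops changing.

```python
# Simplified ModPoly iteration on synthetic data.
import numpy as np

def modpoly_sketch(x, y, poly_order=2, tol=1e-3, max_iter=250):
    y_fit = np.asarray(y, dtype=float)
    baseline = y_fit
    for _ in range(max_iter):
        coefs = np.polynomial.polynomial.polyfit(x, y_fit, poly_order)
        new_baseline = np.polynomial.polynomial.polyval(x, coefs)
        # relative change in the fit, used as the exit criteria
        change = (np.linalg.norm(new_baseline - baseline)
                  / max(np.linalg.norm(baseline), 1e-12))
        baseline = new_baseline
        if change < tol:
            break
        y_fit = np.minimum(y_fit, baseline)  # suppress points above the fit
    return baseline

x = np.linspace(-1, 1, 200)
true_baseline = 1 + 0.5 * x
y = true_baseline + 4 * np.exp(-(x - 0.3)**2 / 0.005)  # linear baseline + peak
baseline = modpoly_sketch(x, y, poly_order=1)
```

With use_original=True the clipping would compare against the original data each iteration instead of the progressively clipped copy.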
- mor(data, half_window=None, **window_kwargs)
A Morphological based (Mor) baseline algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
References
Perez-Pueyo, R., et al. Morphology-Based Automated Baseline Removal for Raman Spectra of Artistic Pigments. Applied Spectroscopy, 2010, 64, 595-600.
- mormol(data, half_window=None, tol=0.001, max_iter=250, smooth_half_window=None, pad_kwargs=None, **window_kwargs)
Iterative morphological and mollified (MorMol) baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 250.
- smooth_half_windowint, optional
The half-window to use for smoothing the data before performing the morphological operation. Default is None, which will use a value of 1 (i.e. no smoothing).
- pad_kwargsdict, optional
A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
References
Koch, M., et al. Iterative morphological and mollifier-based baseline correction for Raman spectra. J Raman Spectroscopy, 2017, 48(2), 336-342.
- mpls(data, half_window=None, lam=1000000.0, p=0.0, diff_order=2, tol=0.001, max_iter=50, weights=None, **window_kwargs)
The Morphological penalized least squares (MPLS) baseline algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in [4] are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the weights will be calculated following the procedure in [4].
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'half_window': int
The half window used for the morphological calculations.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
References
[4] (1,2)Li, Xianchao, et al. Morphological weighted penalized least squares for background correction. Analyst, 2013, 138(16), 4483-4492.
- mpspline(data, half_window=None, lam=10000.0, lam_smooth=0.01, p=0.0, num_knots=100, spline_degree=3, diff_order=2, weights=None, pad_kwargs=None, **window_kwargs)
Morphology-based penalized spline baseline.
Identifies baseline points using morphological operations, and then uses weighted least-squares to fit a penalized spline to the baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- lamfloat, optional
The smoothing parameter for the penalized spline when fitting the baseline. Larger values will create smoother baselines. Default is 1e4. Larger values are needed for larger num_knots.
- lam_smoothfloat, optional
The smoothing parameter for the penalized spline when smoothing the input data. Default is 1e-2. Larger values are needed for noisy data or for larger num_knots.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in the reference are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the weights will be calculated following the procedure in the reference.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'half_window': int
The half window used for the morphological calculations.
- Raises:
- ValueError
Raised if half_window is < 1, if lam or lam_smooth is <= 0, or if p is not between 0 and 1.
Notes
The optimal opening is calculated as the element-wise minimum of the opening and the average of the erosion and dilation of the opening. The reference used the erosion and dilation of the smoothed data, rather than the opening, which tends to overestimate the baseline.
Rather than setting knots at the intersection points of the optimal opening and the smoothed data as described in the reference, weights are assigned to 1 - p at the intersection points and p elsewhere. This simplifies the penalized spline calculation by allowing the use of equally spaced knots, but should otherwise give similar results as the reference algorithm.
References
Gonzalez-Vidal, J., et al. Automatic morphology-based cubic p-spline fitting methodology for smoothing and baseline-removal of Raman spectra. Journal of Raman Spectroscopy. 2017, 48(6), 878-883.
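The weighting scheme from the Notes above is easy to picture: anchor points get weight 1 - p and everything else gets p. A tiny sketch (the anchor indices here are hypothetical intersection points, not computed by the actual morphological procedure):

```python
# Build the weight array used for the weighted penalized-spline fit.

def anchor_weights(num_points, anchor_indices, p=0.0):
    """Weight of 1 - p at anchor points, p elsewhere."""
    weights = [p] * num_points
    for i in anchor_indices:
        weights[i] = 1 - p
    return weights

# With the default p=0.0, only the anchor points contribute to the fit.
w = anchor_weights(10, anchor_indices=[0, 4, 9], p=0.0)
```

Using equally spaced knots plus this weight array replaces the reference's knots-at-intersections construction while giving similar results.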
- mwmv(data, half_window=None, smooth_half_window=None, pad_kwargs=None, **window_kwargs)
Moving window minimum value (MWMV) baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- smooth_half_windowint, optional
The half-window to use for smoothing the data after performing the morphological operation. Default is None, which will use the same value as used for the morphological operation.
- pad_kwargsdict, optional
A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
Notes
Performs poorly when the baseline is rapidly changing.
References
Yaroshchyk, P., et al. Automatic correction of continuum background in Laser-induced Breakdown Spectroscopy using a model-free algorithm. Spectrochimica Acta Part B, 2014, 99, 138-149.
- noise_median(data, half_window=None, smooth_half_window=None, sigma=None, **pad_kwargs)
The noise-median method for baseline identification.
Assumes the baseline can be considered as the median value within a moving window, and the resulting baseline is then smoothed with a Gaussian kernel.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The index-based size to use for the median window. The total window size will range from [-half_window, ..., half_window] with size 2 * half_window + 1. Default is None, which will use twice the output from optimize_window(), which is a reasonable starting value.
- smooth_half_windowint, optional
The half window to use for smoothing. Default is None, which will use the same value as half_window.
- sigmafloat, optional
The standard deviation of the smoothing Gaussian kernel. Default is None, which will use (2 * smooth_half_window + 1) / 6.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated and smoothed baseline.
- dict
An empty dictionary, just to match the output of all other algorithms.
References
Friedrichs, M., A model-free algorithm for the removal of baseline artifacts. J. Biomolecular NMR, 1995, 5, 147-153.
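The two steps above (moving median, then Gaussian smoothing with the default sigma of (2 * smooth_half_window + 1) / 6) can be sketched with numpy. This is a simplified illustration, not pybaselines' implementation: edges here use truncated median windows and zero-padded convolution rather than pad_edges().

```python
# Moving-window median followed by Gaussian smoothing.
import numpy as np

def noise_median_sketch(y, half_window, smooth_half_window=None, sigma=None):
    y = np.asarray(y, dtype=float)
    n = len(y)
    if smooth_half_window is None:
        smooth_half_window = half_window
    if sigma is None:
        sigma = (2 * smooth_half_window + 1) / 6  # the documented default
    # moving median; a narrow peak never dominates the window's median
    median = np.array([np.median(y[max(0, i - half_window):i + half_window + 1])
                       for i in range(n)])
    # normalized Gaussian smoothing kernel
    t = np.arange(-smooth_half_window, smooth_half_window + 1)
    kernel = np.exp(-0.5 * (t / sigma)**2)
    kernel /= kernel.sum()
    return np.convolve(median, kernel, mode='same')

rng = np.random.default_rng(0)
y = 2.0 + 0.01 * rng.standard_normal(200)
y[100] += 10  # one narrow peak the median ignores
baseline = noise_median_sketch(y, half_window=10)
```

Away from the edges the result tracks the flat baseline of 2.0 while the spike is rejected entirely.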
- optimize_extended_range(data, method='asls', side='both', width_scale=0.1, height_scale=1.0, sigma_scale=1.0 / 12.0, min_value=2, max_value=8, step=1, pad_kwargs=None, method_kwargs=None)
Extends data and finds the best parameter value for the given baseline method.
Adds additional data to the left and/or right of the input data, and then iterates through parameter values to find the best fit. Useful for calculating the optimum lam or poly_order value required to optimize other algorithms.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- methodstr, optional
A string indicating the Whittaker-smoothing-based, polynomial, or spline method to use for fitting the baseline. Default is 'asls'.
- side{'both', 'left', 'right'}, optional
The side of the measured data to extend. Default is 'both'.
- width_scalefloat, optional
The number of data points added to each side is width_scale * N. Default is 0.1.
- height_scalefloat, optional
The height of the added Gaussian peak(s) is calculated as height_scale * max(data). Default is 1.
- sigma_scalefloat, optional
The sigma value for the added Gaussian peak(s) is calculated as sigma_scale * width_scale * N. Default is 1/12, which will make the Gaussian span ±6 sigma, making its total width about half of the added length.
- min_valueint or float, optional
The minimum value for the lam or poly_order value to use with the indicated method. If using a polynomial method, min_value must be an integer. If using a Whittaker-smoothing-based method, min_value should be the exponent to raise to the power of 10 (eg. a min_value value of 2 designates a lam value of 10**2). Default is 2.
- max_valueint or float, optional
The maximum value for the lam or poly_order value to use with the indicated method. If using a polynomial method, max_value must be an integer. If using a Whittaker-smoothing-based method, max_value should be the exponent to raise to the power of 10 (eg. a max_value value of 3 designates a lam value of 10**3). Default is 8.
- stepint or float, optional
The step size for iterating the parameter value from min_value to max_value. If using a polynomial method, step must be an integer. Default is 1.
- pad_kwargsdict, optional
A dictionary of options to pass to pad_edges() for padding the edges of the data when adding the extended left and/or right sections. Default is None, which will use an empty dictionary.
- method_kwargsdict, optional
A dictionary of keyword arguments to pass to the selected method function. Default is None, which will use an empty dictionary.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The baseline calculated with the optimum parameter.
- method_paramsdict
A dictionary with the following items:
- 'optimal_parameter': int or float
The lam or poly_order value that produced the lowest root-mean-squared-error.
- 'min_rmse': float
The minimum root-mean-squared-error obtained when using the optimal parameter.
Additional items depend on the output of the selected method.
- Raises:
- ValueError
Raised if side is not 'left', 'right', or 'both'.
- TypeError
Raised if using a polynomial method and min_value, max_value, or step is not an integer.
- ValueError
Raised if using a Whittaker-smoothing-based method and min_value, max_value, or step is greater than 100.
Notes
Based on the extended range penalized least squares (erPLS) method from [5]. The method proposed by [5] was for optimizing lambda only for the aspls method by extending only the right side of the spectrum. The method was modified by allowing extending either side following [6], and for optimizing lambda or the polynomial degree for all of the affected algorithms in pybaselines.
References
[5] (1,2)Zhang, F., et al. An Automatic Baseline Correction Method Based on the Penalized Least Squares Method. Sensors, 2020, 20(7), 2015.
[6]Krishna, H., et al. Range-independent background subtraction algorithm for recovery of Raman spectra of biological tissue. Journal of Raman Spectroscopy. 2012, 43(12), 1884-1894.
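The exponent semantics of min_value/max_value/step can trip people up, so here is a small sketch of the grid construction and selection. The RMSE scores below are placeholder numbers; in the real method each score comes from fitting the extended data with the corresponding lam and measuring the error on the known added section.

```python
# For Whittaker-based methods, min_value/max_value/step are exponents of 10
# for lam; the grid below reproduces the defaults min_value=2, max_value=8.

def lam_grid(min_value=2, max_value=8, step=1):
    exponents = []
    value = min_value
    while value <= max_value:
        exponents.append(value)
        value += step
    return [10**e for e in exponents]

def best_parameter(values, rmse_scores):
    """Pick the value whose fit gave the lowest root-mean-squared-error."""
    best = min(range(len(values)), key=lambda i: rmse_scores[i])
    return values[best], rmse_scores[best]

lams = lam_grid(2, 8, 1)  # lam candidates from 1e2 to 1e8
# placeholder per-candidate RMSE values, one per lam
param, score = best_parameter(lams, [5.0, 2.5, 1.0, 0.8, 1.2, 3.0, 7.0])
```

For polynomial methods the same grid is used directly as poly_order values instead of exponents.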
- penalized_poly(data, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_truncated_quadratic', threshold=None, alpha_factor=0.99, return_coef=False)
Fits a polynomial baseline using a non-quadratic cost function.
The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- max_iterint, optional
The maximum number of iterations. Default is 250.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- cost_functionstr, optional
The non-quadratic cost function to minimize. Must indicate symmetry of the method by appending 'a' or 'asymmetric' for asymmetric loss, and 's' or 'symmetric' for symmetric loss. Default is 'asymmetric_truncated_quadratic'. Available methods, and their associated reference, are:
- thresholdfloat, optional
The threshold value for the loss method, where the function goes from quadratic loss (such as used for least squares) to non-quadratic. For symmetric loss methods, residual values with absolute value less than threshold will have quadratic loss. For asymmetric loss methods, residual values less than the threshold will have quadratic loss. Default is None, which sets threshold to one-tenth of the standard deviation of the input data.
- alpha_factorfloat, optional
A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.
- Raises:
- ValueError
Raised if alpha_factor is not between 0 and 1.
Notes
In baseline literature, this procedure is sometimes called "backcor".
References
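To make the role of threshold concrete, the asymmetric truncated quadratic cost can be sketched in a few lines of NumPy. The helper below is hypothetical and only illustrates the shape of the loss; it is not pybaselines' internal implementation:

```python
import numpy as np

def asymmetric_truncated_quadratic(residual, threshold):
    # Residuals below the threshold (baseline-like points, including all
    # negative residuals) get quadratic loss; larger positive residuals,
    # which likely belong to peaks, get a constant loss so they stop
    # influencing the fit.
    return np.where(residual < threshold, residual**2, threshold**2)

# The peak-like residual 3.0 is capped at threshold**2 == 1.0, while the
# small residuals keep their ordinary squared loss.
losses = asymmetric_truncated_quadratic(np.array([-0.5, 0.2, 3.0]), 1.0)
```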
- poly(data, poly_order=2, weights=None, return_coef=False)
Computes a polynomial that fits the baseline of the data.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.
Notes
To only fit regions without peaks, supply a weight array with zero values at the indices where peaks are located.
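The note above can be demonstrated with plain NumPy: zero weights at peak indices remove those points from the weighted least-squares fit, so the polynomial recovers the underlying baseline exactly when the remaining data is itself polynomial. This sketch uses numpy.polynomial directly rather than Baseline.poly:

```python
import numpy as np

# Synthetic data: a quadratic baseline plus a narrow Gaussian peak at x = 0.
x = np.linspace(-1, 1, 101)
y = (0.5 + 0.2 * x + 0.1 * x**2) + 2.0 * np.exp(-(x / 0.05) ** 2)

# Zero-weight the known peak region so it cannot pull the fit upward.
weights = np.ones_like(y)
weights[np.abs(x) < 0.2] = 0

# Weighted least-squares polynomial fit; coefficients in increasing order.
coef = np.polynomial.polynomial.polyfit(x, y, 2, w=weights)
baseline = np.polynomial.polynomial.polyval(x, coef)
```

Because every non-zero-weighted point lies exactly on the quadratic, coef recovers (0.5, 0.2, 0.1) to machine precision.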
- psalsa(data, lam=100000.0, p=0.5, k=None, diff_order=2, max_iter=50, tol=0.001, weights=None)
Peaked Signal's Asymmetric Least Squares Algorithm (psalsa).
Similar to the asymmetric least squares (AsLS) algorithm, but applies an exponential decay weighting to values greater than the baseline to allow using a higher p value to better fit noisy data.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e5.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 0.5.
- kfloat, optional
A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
Notes
The exit criteria for the original algorithm was to check whether the signs of the residuals do not change between two iterations, but the comparison of the l2 norms of the weight arrays between iterations is used instead to be more comparable to other Whittaker-smoothing-based algorithms.
References
Oller-Moreno, S., et al. Adaptive Asymmetric Least Squares baseline estimation for analytical instruments. 2014 IEEE 11th International Multi-Conference on Systems, Signals, and Devices, 2014, 1-5.
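The exponential decay weighting described above can be sketched as follows. psalsa_weights is a hypothetical helper mirroring the weighting scheme from the reference (points above the baseline get p * exp(-residual / k), points below get 1 - p); it is not pybaselines' internal code:

```python
import numpy as np

def psalsa_weights(y, z, p=0.5, k=1.0):
    # y: measured data; z: current baseline estimate.
    residual = y - z
    return np.where(
        residual > 0,
        p * np.exp(-residual / k),  # weight decays with peak height
        1 - p,                      # points at or below the baseline
    )
```

Small positive residuals (noise) keep a weight near p, while residuals much larger than k (peaks) get a weight near zero, which is what lets psalsa use a higher p than AsLS on noisy data.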
- pspline_airpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the airPLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
See also
References
Zhang, Z.M., et al. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst, 2010, 135(5), 1138-1146.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
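All of the pspline_* and Whittaker-based methods share the same lam/diff_order penalty structure. The minimal Whittaker-style smoother below illustrates that structure by solving (W + lam * D^T D) z = W y, where D is the diff_order-th difference matrix and W is a diagonal weight matrix; it is a sketch of the shared idea, not the penalized-spline system pybaselines actually constructs:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, weights, lam=1e3, diff_order=2):
    n = len(y)
    # Build the diff_order-th difference matrix by repeated first differences.
    D = sparse.eye(n, format='csr')
    for _ in range(diff_order):
        D = D[1:] - D[:-1]
    W = sparse.diags(weights)
    # Larger lam -> the D^T D penalty dominates -> smoother result.
    return spsolve((W + lam * (D.T @ D)).tocsc(), weights * y)
```

A useful sanity check: with diff_order=2 the penalty vanishes on constant and linear data, so such inputs are returned unchanged regardless of lam.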
- pspline_arpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the arPLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
See also
References
Baek, S.J., et al. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst, 2015, 140, 250-257.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_asls(data, lam=1000.0, p=0.01, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the asymmetric least squares (AsLS) algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
See also
References
Eilers, P. A Perfect Smoother. Analytical Chemistry, 2003, 75(14), 3631-3636.
Eilers, P., et al. Baseline correction with asymmetric least squares smoothing. Leiden University Medical Centre Report, 2005, 1(1).
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_aspls(data, lam=10000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=100, tol=0.001, weights=None, alpha=None)
A penalized spline version of the asPLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e4.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 100.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- alphaarray-like, shape (N,), optional
An array of values that control the local value of lam to better fit peak and non-peak regions. If None (default), then the initial values will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'alpha': numpy.ndarray, shape (N,)
The array of alpha values used for fitting the data in the final iteration.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
See also
Notes
The weighting uses an asymmetric coefficient (k in the asPLS paper) of 0.5 instead of the 2 listed in the asPLS paper. pybaselines uses the factor of 0.5 since it matches the results in Table 2 and Figure 5 of the asPLS paper closer than the factor of 2 and fits noisy data much better.
References
Zhang, F., et al. Baseline correction for infrared spectra using adaptive smoothness parameter penalized least squares method. Spectroscopy Letters, 2020, 53(3), 222-233.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_derpsalsa(data, lam=100.0, p=0.01, k=None, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None, smooth_half_window=None, num_smooths=16, **pad_kwargs)
A penalized spline version of the derpsalsa algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e2.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- kfloat, optional
A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- smooth_half_windowint, optional
The half-window to use for smoothing the data before computing the first and second derivatives. Default is None, which will use len(data) / 200.
- num_smoothsint, optional
The number of times to smooth the data before computing the first and second derivatives. Default is 16.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
See also
References
Korepanov, V. Asymmetric least-squares baseline algorithm with peak screening for automatic processing of the Raman spectra. Journal of Raman Spectroscopy. 2020, 51(10), 2061-2065.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_drpls(data, lam=1000.0, eta=0.5, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the drPLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- etafloat, optional
A term for controlling the value of lam; must be between 0 and 1. Low values will produce smoother baselines, while higher values will more aggressively fit peaks. Default is 0.5.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if eta is not between 0 and 1 or if diff_order is less than 2.
See also
References
Xu, D. et al. Baseline correction method based on doubly reweighted penalized least squares, Applied Optics, 2019, 58, 3913-3920.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_iarpls(data, lam=1000.0, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the IarPLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
See also
References
Ye, J., et al. Baseline correction method based on improved asymmetrically reweighted penalized least squares for Raman spectrum. Applied Optics, 2020, 59, 10933-10943.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_iasls(data, lam=10.0, p=0.01, lam_1=0.0001, num_knots=100, spline_degree=3, max_iter=50, tol=0.001, weights=None, diff_order=2)
A penalized spline version of the IAsLS algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e1.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 1e-2.
- lam_1float, optional
The smoothing parameter for the first derivative of the residual. Default is 1e-4.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 1. Default is 2 (second order differential matrix). Typical values are 2 or 3.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1 or if diff_order is less than 2.
See also
References
He, S., et al. Baseline correction for Raman spectra using an improved asymmetric least squares method. Analytical Methods, 2014, 6(12), 4402-4407.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_mpls(data, half_window=None, lam=1000.0, p=0.0, num_knots=100, spline_degree=3, diff_order=2, tol=0.001, max_iter=50, weights=None, **window_kwargs)
A penalized spline version of the morphological penalized least squares (MPLS) algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Anchor points identified by the procedure in [32] are given a weight of 1 - p, and all other points have a weight of p. Default is 0.0.
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the weights will be calculated following the procedure in [32].
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'half_window': int
The half window used for the morphological calculations.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
See also
References
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- pspline_psalsa(data, lam=1000.0, p=0.5, k=None, num_knots=100, spline_degree=3, diff_order=2, max_iter=50, tol=0.001, weights=None)
A penalized spline version of the psalsa algorithm.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points. Must not contain missing data (NaN) or Inf.
- lamfloat, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e3.
- pfloat, optional
The penalizing weighting factor. Must be between 0 and 1. Values greater than the baseline will be given p weight, and values less than the baseline will be given p - 1 weight. Default is 0.5.
- kfloat, optional
A factor that controls the exponential decay of the weights for baseline values greater than the data. Should be approximately the height at which a value could be considered a peak. Default is None, which sets k to one-tenth of the standard deviation of the input data. A large k value will produce similar results to asls().
- num_knotsint, optional
The number of knots for the spline. Default is 100.
- spline_degreeint, optional
The degree of the spline. Default is 3, which is a cubic spline.
- diff_orderint, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- max_iterint, optional
The max number of fit iterations. Default is 50.
- tolfloat, optional
The exit criteria. Default is 1e-3.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then the initial weights will be an array with size equal to N and all values set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- Raises:
- ValueError
Raised if p is not between 0 and 1.
See also
References
Oller-Moreno, S., et al. Adaptive Asymmetric Least Squares baseline estimation for analytical instruments. 2014 IEEE 11th International Multi-Conference on Systems, Signals, and Devices, 2014, 1-5.
Eilers, P., et al. Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2(6), 637-653.
- quant_reg(data, poly_order=2, quantile=0.05, tol=1e-06, max_iter=250, weights=None, eps=None, return_coef=False)
Approximates the baseline of the data using quantile regression.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- poly_orderint, optional
The polynomial order for fitting the baseline. Default is 2.
- quantilefloat, optional
The quantile at which to fit the baseline. Default is 0.05.
- tolfloat, optional
The exit criteria. Default is 1e-6. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a lower value to get a good fit.
- max_iterint, optional
The maximum number of iterations. Default is 250. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a higher value to ensure convergence.
- weightsarray-like, shape (N,), optional
The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.
- epsfloat, optional
A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum-absolute-value of the fit each iteration multiplied by 1e-6.
- return_coefbool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
- 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.
- Raises:
- ValueError
Raised if quantile is not between 0 and 1.
Notes
Application of quantile regression for baseline fitting is described in [23].
Performs quantile regression using iteratively reweighted least squares (IRLS) as described in [24].
References
[23] Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.
[24] Schnabel, S., et al. Simultaneous estimation of quantile curves using quantile sheets. AStA Advances in Statistical Analysis, 2013, 97, 77-87.
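The IRLS approach from [24] can be sketched with NumPy: each iteration reweights the residuals so that roughly a quantile fraction of the data ends up below the fit, then re-solves an ordinary weighted polynomial least-squares problem. quantile_polyfit is a hypothetical helper mirroring part of the quant_reg signature, not the pybaselines implementation:

```python
import numpy as np

def quantile_polyfit(x, y, poly_order=2, quantile=0.05, max_iter=250, tol=1e-6):
    poly = np.polynomial.polynomial
    coef = poly.polyfit(x, y, poly_order)  # start from least squares
    for _ in range(max_iter):
        fit = poly.polyval(x, coef)
        residual = y - fit
        # eps prevents division by zero for near-zero residuals, following
        # the eps default described above (1e-6 * max(|fit|)**2).
        eps = 1e-6 * np.abs(fit).max() ** 2
        # Asymmetric quantile weights, smoothed by eps.
        w = np.abs(quantile - (residual < 0)) / np.sqrt(residual**2 + eps)
        new_coef = poly.polyfit(x, y, poly_order, w=np.sqrt(w))
        if np.linalg.norm(new_coef - coef) < tol * max(np.linalg.norm(coef), 1e-12):
            coef = new_coef
            break
        coef = new_coef
    return coef
```

With quantile=0.05 and data whose noise is one-sided (like peaks on a baseline), the fit tracks the lower envelope of the data instead of its mean.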
- ria(data, half_window=None, max_iter=500, tol=0.01, side='both', width_scale=0.1, height_scale=1.0, sigma_scale=1.0 / 12.0, **pad_kwargs)
Range Independent Algorithm (RIA).
Adds additional data to the left and/or right of the input data, and then iteratively smooths until the area of the additional data is removed.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window to use for the smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use the output of optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- max_iterint, optional
The maximum number of iterations. Default is 500.
- tolfloat, optional
The exit criteria. Default is 1e-2.
- side{'both', 'left', 'right'}, optional
The side of the measured data to extend. Default is 'both'.
- width_scalefloat, optional
The number of data points added to each side is width_scale * N. Default is 0.1.
- height_scalefloat, optional
The height of the added Gaussian peak(s) is calculated as height_scale * max(data). Default is 1.
- sigma_scalefloat, optional
The sigma value for the added Gaussian peak(s) is calculated as sigma_scale * width_scale * N. Default is 1/12, which will make the Gaussian span +- 6 sigma, making its total width about half of the added length.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data when adding the extended left and/or right sections.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge (if the array length is equal to max_iter) or the areas of the smoothed extended regions exceeded their initial areas (if the array length is < max_iter).
- Raises:
- ValueError
Raised if side is not 'left', 'right', or 'both'.
References
Krishna, H., et al. Range-independent background subtraction algorithm for recovery of Raman spectra of biological tissue. Journal of Raman Spectroscopy, 2012, 43(12), 1884-1894.
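The extend-then-smooth idea can be sketched as follows. This is a loose illustration of the description above, not the pybaselines implementation: the exit criterion here simply watches for the area of the extended regions to stop changing, which simplifies the published criterion of removing the added area, and the edge handling (offsetting the added Gaussians by the edge values) is an assumption.

```python
import numpy as np

def ria_sketch(y, half_window=10, width_scale=0.1, height_scale=1.0,
               sigma_scale=1.0 / 12.0, max_iter=500, tol=1e-2):
    """Simplified RIA sketch: add Gaussian peaks to both sides, then
    iteratively smooth with a moving average until the area of the
    added regions stabilizes."""
    n = len(y)
    added = max(int(width_scale * n), 1)
    sigma = sigma_scale * added
    t = np.arange(added)
    gauss = height_scale * np.max(y) * np.exp(-0.5 * ((t - added / 2) / sigma) ** 2)
    # naive edge handling: offset the added peaks by the edge values
    extended = np.concatenate([gauss + y[0], y, gauss[::-1] + y[-1]])
    kernel = np.ones(2 * half_window + 1) / (2 * half_window + 1)
    smoothed = extended
    prev_area = smoothed[:added].sum() + smoothed[-added:].sum()
    for _ in range(max_iter):
        smoothed = np.convolve(smoothed, kernel, mode='same')
        area = smoothed[:added].sum() + smoothed[-added:].sum()
        if abs(prev_area - area) < tol * abs(prev_area):
            break
        prev_area = area
    return smoothed[added:added + n]
```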
- rolling_ball(data, half_window=None, smooth_half_window=None, pad_kwargs=None, **window_kwargs)
The rolling ball baseline algorithm.
Applies a minimum and then maximum moving window, and subsequently smooths the result, giving a baseline that resembles rolling a ball across the data.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphology functions. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- smooth_half_windowint, optional
The half-window to use for smoothing the data after performing the morphological operation. Default is None, which will use the same value as used for the morphological operation.
- pad_kwargsdict, optional
A dictionary of keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
References
Kneen, M.A., et al. Algorithm for fitting XRF, SEM and PIXE X-ray spectra backgrounds. Nuclear Instruments and Methods in Physics Research B, 1996, 109, 209-213.
Liland, K., et al. Optimal Choice of Baseline Correction for Multivariate Calibration of Spectra. Applied Spectroscopy, 2010, 64(9), 1007-1016.
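The three steps described above (moving minimum, moving maximum, moving-average smooth) can be sketched in pure numpy. This is a minimal illustration of the idea, not the pybaselines implementation; edge handling here (repeating the edge values) is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _rolling(y, size, func):
    # moving-window statistic with edge padding so the output keeps y's length
    padded = np.pad(y, size // 2, mode='edge')
    return func(sliding_window_view(padded, size), axis=1)

def rolling_ball_sketch(y, half_window, smooth_half_window=None):
    """Rolling-ball sketch: minimum filter (erosion), then maximum filter
    (dilation), then a moving-average smooth of the result."""
    if smooth_half_window is None:
        smooth_half_window = half_window
    size = 2 * half_window + 1
    eroded = _rolling(y, size, np.min)       # moving minimum
    opened = _rolling(eroded, size, np.max)  # moving maximum
    return _rolling(opened, 2 * smooth_half_window + 1, np.mean)
```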
- rubberband(data, segments=1, lam=None, diff_order=2, weights=None, smooth_half_window=None, **pad_kwargs)
Identifies baseline points by fitting a convex hull to the bottom of the data.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- segmentsint or array-like[int], optional
Used to fit multiple convex hulls to the data to negate the effects of concave data. If the input is an integer, it sets the number of equally sized segments the data will be split into. If the input is an array-like, each integer in the array will be the index that splits two segments, which allows constructing unequally sized segments. Default is 1, which fits a single convex hull to the data.
- lamfloat or None, optional
The smoothing parameter for interpolating the baseline points using Whittaker smoothing. Set to 0 or None to use linear interpolation instead. Default is None, which does not smooth.
- diff_orderint, optional
The order of the differential matrix if using Whittaker smoothing. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
- weightsarray-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered potential baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- smooth_half_windowint or None, optional
The half window to use for smoothing the input data with a moving average before calculating the convex hull, which gives much better results for noisy data. Set to None (default) or 0 to not smooth the data.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
- Raises:
- ValueError
Raised if the number of segments per window for the fitting is less than poly_order + 1 or greater than the total number of points, or if the values in self.x are not strictly increasing.
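The convex-hull fitting can be sketched with Andrew's monotone chain restricted to the lower hull. This is a single-segment illustration without the Whittaker smoothing, weighting, or pre-smoothing options above, and it is not the pybaselines implementation; it assumes x is strictly increasing.

```python
import numpy as np

def rubberband_sketch(x, y):
    """Lower convex hull of (x, y), linearly interpolated as the baseline,
    plus the boolean mask of baseline (hull) points."""
    hull = [0]
    for i in range(1, len(x)):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # pop k when it lies on or above the segment from j to i,
            # keeping only points that are convex from below
            cross = (x[k] - x[j]) * (y[i] - y[j]) - (y[k] - y[j]) * (x[i] - x[j])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    hull = np.array(hull)
    baseline = np.interp(x, x[hull], y[hull])
    mask = np.zeros(len(x), dtype=bool)
    mask[hull] = True
    return baseline, mask
```

Because the hull is convex, the interpolated baseline never rises above the data, which is why concave backgrounds need the segments option.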
- snip(data, max_half_window=None, decreasing=False, smooth_half_window=None, filter_order=2, **pad_kwargs)
Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP).
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- max_half_windowint or Sequence(int, int), optional
The maximum number of iterations. Should be set such that max_half_window is approximately (w - 1) / 2, where w is the index-based width of a feature or peak. max_half_window can also be a sequence of two integers for asymmetric peaks, with the first item corresponding to the max_half_window of the peak's left edge, and the second item for the peak's right edge [29]. Default is None, which will use the output from optimize_window(), which is an okay starting value.
- decreasingbool, optional
If False (default), will iterate through window sizes from 1 to max_half_window. If True, will reverse the order and iterate from max_half_window to 1, which gives a smoother baseline according to [29] and [30].
- smooth_half_windowint, optional
The half window to use for smoothing the data. If smooth_half_window is greater than 0, will perform a moving average smooth on the data for each window, which gives better results for noisy data [29]. Default is None, which will not perform any smoothing.
- filter_order{2, 4, 6, 8}, optional
If the measured data has a more complicated baseline consisting of other elements such as Compton edges, then a higher filter_order should be selected [29]. Default is 2, which works well for approximating a linear baseline.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
An empty dictionary, just to match the output of all other algorithms.
- Raises:
- ValueError
Raised if filter_order is not 2, 4, 6, or 8.
- Warns:
- UserWarning
Raised if max_half_window is greater than (len(data) - 1) // 2.
Notes
Algorithm initially developed by [27], and this specific version of the algorithm is adapted from [28], [29], and [30].
If data covers several orders of magnitude, better results can be obtained by first transforming the data using a log-log-square root transform before using SNIP [28]:
transformed_data = np.log(np.log(np.sqrt(data + 1) + 1) + 1)
and the baseline can then be reverted back to the original scale using the inverse transform:
baseline = -1 + (np.exp(np.exp(snip(transformed_data)) - 1) - 1)**2
References
[27]Ryan, C.G., et al. SNIP, A Statistics-Sensitive Background Treatment For The Quantitative Analysis Of PIXE Spectra In Geoscience Applications. Nuclear Instruments and Methods in Physics Research B, 1988, 34, 396-402.
[28]Morháč, M., et al. Background elimination methods for multidimensional coincidence γ-ray spectra. Nuclear Instruments and Methods in Physics Research A, 1997, 401, 113-132.
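The core clipping loop of SNIP can be sketched in a few lines. This is a bare illustration (filter_order 2, no smoothing, no LLS transform, no padding), not the pybaselines implementation: on each pass, every interior point is clipped down to the mean of its two neighbours m points away whenever that mean is lower.

```python
import numpy as np

def snip_sketch(y, max_half_window, decreasing=False):
    """Minimal SNIP sketch: iterate the clipping window from 1 up to
    max_half_window (or down from it if decreasing=True)."""
    baseline = np.asarray(y, dtype=float).copy()
    windows = (range(max_half_window, 0, -1) if decreasing
               else range(1, max_half_window + 1))
    for m in windows:
        # mean of the neighbours m points to each side, computed from the
        # previous pass before any values are overwritten
        means = 0.5 * (baseline[:-2 * m] + baseline[2 * m:])
        baseline[m:-m] = np.minimum(baseline[m:-m], means)
    return baseline
```

Since every update takes a minimum, the sketch can only lower points, so the result never exceeds the input data.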
- std_distribution(data, half_window=None, interp_half_window=5, fill_half_window=3, num_std=1.1, smooth_half_window=None, weights=None, **pad_kwargs)
Identifies baseline segments by analyzing the rolling standard deviation distribution.
The rolling standard deviations are split into two distributions, with the smaller distribution assigned to noise. Baseline points are then identified as any point where the rolling standard deviation is less than a multiple of the median of the noise's standard deviation distribution.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
- interp_half_windowint, optional
When interpolating between baseline segments, will use the average of data[i-interp_half_window:i+interp_half_window+1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
- fill_half_windowint, optional
When a point is identified as a peak point, all points +- fill_half_window are likewise set as peak points. Default is 3.
- num_stdfloat, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 1.1.
- smooth_half_windowint, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
- weightsarray-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all non-zero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- paramsdict
A dictionary with the following items:
- 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
References
Wang, K.C., et al. Distribution-Based Classification Method for Baseline Correction of Metabolomic 1D Proton Nuclear Magnetic Resonance Spectra. Analytical Chemistry. 2013, 85, 1231-1239.
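The classification step described above can be sketched as follows. This is an illustration only, not the pybaselines implementation: the noise level is estimated from the lower half of the rolling-standard-deviation distribution, which simplifies the published split into two distributions, and the interpolation/fill steps are omitted.

```python
import numpy as np

def std_mask_sketch(y, half_window, num_std=1.1):
    """Flag points whose rolling standard deviation is below num_std times
    an estimate of the noise standard deviation."""
    n = len(y)
    # rolling standard deviation with truncated windows at the edges
    rolled = np.array([
        np.std(y[max(0, i - half_window):i + half_window + 1]) for i in range(n)
    ])
    # take the smaller half of the std distribution as noise and use its median
    noise_level = np.median(np.sort(rolled)[:n // 2])
    return rolled < num_std * noise_level
```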
- swima(data, min_half_window=3, max_half_window=None, smooth_half_window=None, **pad_kwargs)
Small-window moving average (SWiMA) baseline.
Computes an iterative moving average to smooth peaks and obtain the baseline.
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- min_half_windowint, optional
The minimum half window value that must be reached before the exit criteria is considered. Can be increased to reduce the calculation time. Default is 3.
- max_half_windowint, optional
The maximum number of iterations. Default is None, which will use (N - 1) / 2. Typically does not need to be specified.
- smooth_half_windowint, optional
The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 50. Use a value of 0 or less to not smooth the data. See Notes below for more details.
- **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'half_window': list(int)
A list of the half windows at which the exit criteria was reached. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2.
- 'converged': list(bool or None)
A list of the convergence status. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2. Each convergence status is True if the main exit criteria was reached, False if the second exit criteria was reached, and None if max_half_window is reached before either exit criteria.
Notes
This algorithm requires the input data to be fairly smooth (noise-free), so it is recommended to either smooth the data beforehand, or specify a smooth_half_window value. Non-smooth data can cause the exit criteria to be reached prematurely (can be avoided by setting a larger min_half_window), while over-smoothed data can cause the exit criteria to be reached later than optimal.
The half-window at which convergence occurs is roughly close to the index-based full-width-at-half-maximum of a peak or feature, but can vary. Therefore, it is better to set a min_half_window that is smaller than expected to not miss the exit criteria.
If the main exit criteria is not reached on the initial fit, a Gaussian baseline (which is well handled by this algorithm) is added to the data, and the data is re-fit.
References
Schulze, H., et al. A Small-Window Moving Average-Based Fully Automated Baseline Estimation Method for Raman Spectra. Applied Spectroscopy, 2012, 66(7), 757-764.
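The overall shape of the iterative small-window loop can be sketched as below. This is a very loose illustration, not the pybaselines implementation: it repeatedly applies a 3-point moving average, keeps the pointwise minimum with the previous iterate, and stops once the area removed per iteration becomes negligible. The published exit criteria (and the re-fit with an added Gaussian) are more involved and are omitted here.

```python
import numpy as np

def swima_sketch(y, min_half_window=3, max_half_window=None, tol=1e-3):
    """Sketch of an iterative small-window moving average baseline."""
    if max_half_window is None:
        max_half_window = (len(y) - 1) // 2
    kernel = np.ones(3) / 3.0
    baseline = np.asarray(y, dtype=float).copy()
    for half_window in range(1, max_half_window + 1):
        # 3-point moving average with edge padding to keep the length
        padded = np.pad(baseline, 1, mode='edge')
        smoothed = np.convolve(padded, kernel, mode='valid')
        # area removed this iteration (only where the smooth lowered points)
        removed = np.maximum(baseline - smoothed, 0).sum()
        baseline = np.minimum(baseline, smoothed)
        if half_window >= min_half_window and removed < tol * abs(baseline.sum()):
            break
    return baseline
```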
- tophat(data, half_window=None, **window_kwargs)
Estimates the baseline using a top-hat transformation (morphological opening).
- Parameters:
- dataarray-like, shape (N,)
The y-values of the measured data, with N data points.
- half_windowint, optional
The half-window used for the morphological opening. If a value is input, then that value will be used. Default is None, which will optimize the half-window size using optimize_window() and window_kwargs.
- **window_kwargs
Values for setting the half window used for the morphology operations. Items include:
- 'increment': int
The step size for iterating half windows. Default is 1.
- 'max_hits': int
The number of consecutive half windows that must produce the same morphological opening before accepting the half window as the optimum value. Default is 1.
- 'window_tol': float
The tolerance value for considering two morphological openings as equivalent. Default is 1e-6.
- 'max_half_window': int
The maximum allowable window size. If None (default), will be set to (len(data) - 1) / 2.
- 'min_half_window': int
The minimum half-window size. If None (default), will be set to 1.
- Returns:
- baselinenumpy.ndarray, shape (N,)
The calculated baseline.
- dict
A dictionary with the following items:
- 'half_window': int
The half window used for the morphological calculations.
Notes
The actual top-hat transformation is defined as data - opening(data), where opening is the morphological opening operation. This function, however, returns opening(data), since that is technically the baseline defined by the operation.
References
Perez-Pueyo, R., et al. Morphology-Based Automated Baseline Removal for Raman Spectra of Artistic Pigments. Applied Spectroscopy, 2010, 64, 595-600.
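The note above (baseline = opening(data), top-hat = data - opening(data)) can be put in code form with a pure-numpy morphological opening. This is an illustrative sketch, not the pybaselines implementation; edge handling (repeating the edge values) is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _rolling(y, size, func):
    # moving-window statistic with edge padding so the output keeps y's length
    padded = np.pad(y, size // 2, mode='edge')
    return func(sliding_window_view(padded, size), axis=1)

def tophat_sketch(y, half_window):
    """Return (baseline, top-hat): the morphological opening of the data,
    and the data minus that opening."""
    size = 2 * half_window + 1
    baseline = _rolling(_rolling(y, size, np.min), size, np.max)  # opening
    return baseline, y - baseline
```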