pybaselines.polynomial

Module Contents

Functions

goldindec

Fits a polynomial baseline using a non-quadratic cost function.

imodpoly

The improved modified polynomial (IModPoly) baseline algorithm.

loess

Locally estimated scatterplot smoothing (LOESS).

modpoly

The modified polynomial (ModPoly) baseline algorithm.

penalized_poly

Fits a polynomial baseline using a non-quadratic cost function.

poly

Computes a polynomial that fits the baseline of the data.

quant_reg

Approximates the baseline of the data using quantile regression.

pybaselines.polynomial.goldindec(data, x_data=None, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_indec', peak_ratio=0.5, alpha_factor=0.99, tol_2=0.001, tol_3=1e-06, max_iter_2=100, return_coef=False)[source]

Fits a polynomial baseline using a non-quadratic cost function.

The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria for the fitting with a given threshold value. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations for fitting a threshold value. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

cost_function : str, optional

The non-quadratic cost function to minimize. Unlike penalized_poly(), this function only works with asymmetric cost functions, so the symmetry prefix ('a' or 'asymmetric') is optional (e.g. 'indec' and 'a_indec' are the same). Default is 'asymmetric_indec'. Available methods, and their associated reference, are:

  • 'asymmetric_indec'[18]

  • 'asymmetric_truncated_quadratic'[19]

  • 'asymmetric_huber'[19]

peak_ratio : float, optional

A value between 0 and 1 that designates how many points in the data belong to peaks. Values are valid within ~10% of the actual peak ratio. Default is 0.5.

alpha_factor : float, optional

A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.

tol_2 : float, optional

The exit criteria for the difference between the optimal up-down ratio (number of points above 0 in the residual compared to number of points below 0) and the up-down ratio for a given threshold value. Default is 1e-3.

tol_3 : float, optional

The exit criteria for the relative change in the threshold value. Default is 1e-6.

max_iter_2 : int, optional

The maximum number of iterations for iterating between different threshold values. Default is 100.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray, shape (J, K)

An array containing the calculated tolerance values for each iteration of both threshold values and fit values. Index 0 holds the tolerance values for the difference in up-down ratios, index 1 holds the tolerance values for the relative change in the threshold, and indices >= 2 hold the tolerance values for each fit. All entries not used in fitting are 0. Shape J is 2 plus the number of iterations for the threshold to converge (related to max_iter_2, tol_2, and tol_3), and shape K is the maximum of the number of threshold iterations and the maximum number of iterations across all fits at the various threshold values (related to max_iter and tol).

  • 'threshold': float

    The optimal threshold value. Could be used in penalized_poly() for fitting other similar data.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if alpha_factor or peak_ratio is not between 0 and 1, or if the specified cost function is symmetric.

References

[18]

Liu, J., et al. Goldindec: A Novel Algorithm for Raman Spectrum Baseline Correction. Applied Spectroscopy, 2015, 69(7), 834-842.

[19] (1,2)

Mazet, V., et al. Background removal from spectra by designing and minimising a non-quadratic cost function. Chemometrics and Intelligent Laboratory Systems, 2005, 76(2), 121-133.
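As a small illustration of the up-down ratio referenced by tol_2 (the count of positive residuals relative to negative residuals), using a made-up residual array:

```python
import numpy as np

# Hypothetical residual (data minus a fitted baseline).
residual = np.array([0.5, -0.2, 1.3, -0.1, 2.0, -0.3, 0.4, -0.2])

# Up-down ratio: number of points above 0 versus number below 0.
up_down = np.count_nonzero(residual > 0) / np.count_nonzero(residual < 0)
print(up_down)  # 1.0
```

goldindec searches for the threshold whose up-down ratio matches an optimal value derived from peak_ratio.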

pybaselines.polynomial.imodpoly(data, x_data=None, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=True, return_coef=False, num_std=1)[source]

The improved modified polynomial (IModPoly) baseline algorithm.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

use_original : bool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [4] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [5].

mask_initial_peaks : bool, optional

If True (default), will mask any data where the initial baseline fit + the standard deviation of the residual is less than measured data [6].

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

num_std : float, optional

The number of standard deviations to include when thresholding. Default is 1.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Notes

Algorithm originally developed in [6].

References

[4]

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[5]

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[6] (1,2)

Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.
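As an illustration only (not the pybaselines implementation, and omitting the weights and mask_initial_peaks options), the core IModPoly iteration can be sketched with plain numpy:

```python
import numpy as np

def imodpoly_sketch(y, x, poly_order=2, tol=1e-3, max_iter=250, num_std=1):
    # Iteratively refit after clipping the data at fit + num_std * std(residual).
    y_fit = np.asarray(y, dtype=float).copy()
    last_std = 0.0
    for _ in range(max_iter):
        coef = np.polynomial.polynomial.polyfit(x, y_fit, poly_order)
        baseline = np.polynomial.polynomial.polyval(x, coef)
        std = np.std(y_fit - baseline)
        y_fit = np.minimum(y_fit, baseline + num_std * std)
        if abs(std - last_std) / max(std, 1e-12) < tol:  # relative change in std
            break
        last_std = std
    return baseline

x = np.linspace(-1, 1, 200)
true_baseline = 1 + 0.5 * x + 2 * x**2
y = true_baseline + 5 * np.exp(-(x - 0.3)**2 / 0.001)  # baseline plus a sharp peak
baseline = imodpoly_sketch(y, x)
```

On this synthetic data, the sketch recovers the underlying quadratic while largely ignoring the peak.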

pybaselines.polynomial.loess(data, x_data=None, fraction=0.2, total_points=None, poly_order=1, scale=3.0, tol=0.001, max_iter=10, symmetric_weights=False, use_threshold=False, num_std=1, use_original=False, weights=None, return_coef=False, conserve_memory=True, delta=0.0)[source]

Locally estimated scatterplot smoothing (LOESS).

Performs polynomial regression at each data point using the nearest points.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

fraction : float, optional

The fraction of N data points to include for the fitting on each point. Default is 0.2. Not used if total_points is not None.

total_points : int, optional

The total number of points to include for the fitting on each point. Default is None, which will use fraction * N to determine the number of points.

scale : float, optional

A scale factor applied to the weighted residuals to control the robustness of the fit. Default is 3.0, as used in [9]. Note that the original loess procedure in [10] used a scale of ~4.05.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 1.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 10.

symmetric_weights : bool, optional

If False (default), will apply weighting asymmetrically, with residuals < 0 having a weight of 1, according to [9]. If True, will apply weighting the same for both positive and negative residuals, which is regular LOESS. If use_threshold is True, this parameter is ignored.

use_threshold : bool, optional

If False (default), will compute weights each iteration to perform the robust fitting, which is regular LOESS. If True, will apply a threshold on the data being fit each iteration, based on the maximum values of the data and the fit baseline, as proposed by [11], similar to the modpoly and imodpoly techniques.

num_std : float, optional

The number of standard deviations to include when thresholding. Default is 1, which is the value used for the imodpoly technique. Only used if use_threshold is True.

use_original : bool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [12] when choosing minimum values for thresholding. If True, will compare the baseline with the original y-values given by data [13]. Only used if use_threshold is True.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

conserve_memory : bool, optional

If False, will cache the distance-weighted kernels for each value in x_data on the first iteration and reuse them on subsequent iterations to save time. The shape of the array of kernels is (len(x_data), total_points). If True (default), will recalculate the kernels each iteration, which uses very little memory but is slower. Can usually be set to False unless x_data and total_points are quite large and the function causes memory issues when caching the kernels. If numba is installed, there is no significant time difference since the calculations are sped up.

delta : float, optional

If delta is > 0, will skip all but the last x-value in the range x_last + delta, where x_last is the last x-value to be fit using weighted least squares, and instead use linear interpolation to calculate the fit for those x-values (same behavior as in statsmodels [14] and Cleveland's original Fortran lowess implementation [15]). Fits all x-values if delta is <= 0. Default is 0.0. Note that x_data is scaled to fit in the range [-1, 1], so delta should likewise be scaled. For example, if the desired delta value were 0.01 * (max(x_data) - min(x_data)), then the correctly scaled delta would be 0.02 (i.e. 0.01 * (1 - (-1))).

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data. Does NOT contain the individual distance-weighted kernels for each x-value.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (N, poly_order + 1)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial. If delta is > 0, the coefficients for any skipped x-value will all be 0.

Raises:
ValueError

Raised if the number of points per window for the fitting is less than poly_order + 1 or greater than the total number of points.

Notes

The iterative, robust, aspect of the fitting can be achieved either through reweighting based on the residuals (the typical usage), or thresholding the fit data based on the residuals, as proposed by [11], similar to the modpoly and imodpoly techniques.

In baseline literature, this procedure is sometimes called "rbe", meaning "robust baseline estimate".

References

[9] (1,2)

Ruckstuhl, A.F., et al. Baseline subtraction using robust local regression estimation. J. Quantitative Spectroscopy and Radiative Transfer, 2001, 68, 179-193.

[10]

Cleveland, W. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 1979, 74(368), 829-836.

[11] (1,2)

Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.

[12]

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[13]

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[15]

https://www.netlib.org/go (lowess.f is the file).
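A minimal, non-robust sketch of the local regression step (a weighted polynomial fit over the nearest total_points neighbors at each x-value, using the standard tricube kernel; the exact kernel and the robustness loop in pybaselines are not reproduced here):

```python
import numpy as np

def loess_sketch(y, x, total_points=20, poly_order=1):
    # One non-robust LOESS pass: weighted polyfit in each point's neighborhood.
    fit = np.empty_like(y, dtype=float)
    for i, xi in enumerate(x):
        dist = np.abs(x - xi)
        idx = np.argsort(dist)[:total_points]   # nearest neighbors
        d = dist[idx] / dist[idx].max()         # scale distances to [0, 1]
        w = (1 - d**3) ** 3                     # tricube kernel
        # polyfit applies weights to the unsquared residuals, so pass sqrt(w).
        coef = np.polynomial.polynomial.polyfit(x[idx], y[idx], poly_order, w=np.sqrt(w))
        fit[i] = np.polynomial.polynomial.polyval(xi, coef)
    return fit

x = np.linspace(-1, 1, 100)
y = np.sin(3 * x)
smooth = loess_sketch(y, x)
```

With noise-free data the local linear fits closely track the input curve.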

pybaselines.polynomial.modpoly(data, x_data=None, poly_order=2, tol=0.001, max_iter=250, weights=None, use_original=False, mask_initial_peaks=False, return_coef=False)[source]

The modified polynomial (ModPoly) baseline algorithm.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

use_original : bool, optional

If False (default), will compare the baseline of each iteration with the y-values of that iteration [1] when choosing minimum values. If True, will compare the baseline with the original y-values given by data [2].

mask_initial_peaks : bool, optional

If True, will mask any data where the initial baseline fit + the standard deviation of the residual is less than measured data [3]. Default is False.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Notes

Algorithm originally developed in [2] and then slightly modified in [1].

References

[1] (1,2)

Gan, F., et al. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems, 2006, 82, 59-65.

[2] (1,2)

Lieber, C., et al. Automated method for subtraction of fluorescence from biological raman spectra. Applied Spectroscopy, 2003, 57(11), 1363-1367.

[3]

Zhao, J., et al. Automated Autofluorescence Background Subtraction Algorithm for Biomedical Raman Spectroscopy, Applied Spectroscopy, 2007, 61(11), 1225-1232.
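As an illustration only (not the pybaselines implementation, and omitting the weights and mask_initial_peaks options), the core ModPoly iteration can be sketched with plain numpy:

```python
import numpy as np

def modpoly_sketch(y, x, poly_order=2, tol=1e-3, max_iter=250):
    # Iteratively refit after taking the elementwise minimum of data and fit.
    y_fit = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        coef = np.polynomial.polynomial.polyfit(x, y_fit, poly_order)
        baseline = np.polynomial.polynomial.polyval(x, coef)
        new_y = np.minimum(y_fit, baseline)  # compare fit with current y-values
        # Relative change of the clipped data as a simple exit criterion.
        if np.linalg.norm(new_y - y_fit) / max(np.linalg.norm(y_fit), 1e-12) < tol:
            break
        y_fit = new_y
    return baseline

x = np.linspace(-1, 1, 200)
true_baseline = 0.2 + 0.5 * x**2
y = true_baseline + 3 * np.exp(-(x + 0.2)**2 / 0.002)  # one peak on a quadratic baseline
baseline = modpoly_sketch(y, x)
```

The repeated clipping pushes the fitted polynomial under the peak and toward the true baseline.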

pybaselines.polynomial.penalized_poly(data, x_data=None, poly_order=2, tol=0.001, max_iter=250, weights=None, cost_function='asymmetric_truncated_quadratic', threshold=None, alpha_factor=0.99, return_coef=False)[source]

Fits a polynomial baseline using a non-quadratic cost function.

The non-quadratic cost functions penalize residuals with larger values, giving a more robust fit compared to normal least-squares.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

tol : float, optional

The exit criteria. Default is 1e-3.

max_iter : int, optional

The maximum number of iterations. Default is 250.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

cost_function : str, optional

The non-quadratic cost function to minimize. Must indicate symmetry of the method by appending 'a' or 'asymmetric' for asymmetric loss, and 's' or 'symmetric' for symmetric loss. Default is 'asymmetric_truncated_quadratic'. Available methods, and their associated reference, are:

  • 'asymmetric_truncated_quadratic'[7]

  • 'symmetric_truncated_quadratic'[7]

  • 'asymmetric_huber'[7]

  • 'symmetric_huber'[7]

  • 'asymmetric_indec'[8]

  • 'symmetric_indec'[8]

threshold : float, optional

The threshold value for the loss method, where the function goes from quadratic loss (such as used for least squares) to non-quadratic. For symmetric loss methods, residual values with absolute value less than threshold will have quadratic loss. For asymmetric loss methods, residual values less than the threshold will have quadratic loss. Default is None, which sets threshold to one-tenth of the standard deviation of the input data.

alpha_factor : float, optional

A value between 0 and 1 that controls the value of the penalty. Default is 0.99. Typically should not need to change this value.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if alpha_factor is not between 0 and 1.

Notes

In baseline literature, this procedure is sometimes called "backcor".

References

[7] (1,2,3,4)

Mazet, V., et al. Background removal from spectra by designing and minimising a non-quadratic cost function. Chemometrics and Intelligent Laboratory Systems, 2005, 76(2), 121-133.

[8] (1,2)

Liu, J., et al. Goldindec: A Novel Algorithm for Raman Spectrum Baseline Correction. Applied Spectroscopy, 2015, 69(7), 834-842.
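As a rough illustration (not the actual half-quadratic minimization, which also involves alpha_factor), an asymmetric truncated quadratic loss can be mimicked by reweighting: residuals at or above the threshold receive zero weight on the next fit, while residuals below it keep full quadratic loss:

```python
import numpy as np

def penalized_poly_sketch(y, x, poly_order=2, threshold=None, tol=1e-3, max_iter=250):
    # Crude reweighting stand-in for an asymmetric truncated quadratic cost.
    y = np.asarray(y, dtype=float)
    if threshold is None:
        threshold = 0.1 * np.std(y)  # same default rule the docs describe
    w = np.ones_like(y)
    baseline = np.zeros_like(y)
    for _ in range(max_iter):
        coef = np.polynomial.polynomial.polyfit(x, y, poly_order, w=w)
        new_baseline = np.polynomial.polynomial.polyval(x, coef)
        residual = y - new_baseline
        w = np.where(residual < threshold, 1.0, 0.0)  # asymmetric truncation
        converged = np.abs(new_baseline - baseline).max() < tol
        baseline = new_baseline
        if converged:
            break
    return baseline

x = np.linspace(-1, 1, 200)
true_baseline = 1 + x
y = true_baseline + 4 * np.exp(-(x - 0.5)**2 / 0.002)  # linear baseline plus a peak
baseline = penalized_poly_sketch(y, x, poly_order=1)
```

After the first pass the peak points fall above the threshold and are excluded, so the fit settles onto the non-peak data.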

pybaselines.polynomial.poly(data, x_data=None, poly_order=2, weights=None, return_coef=False)[source]

Computes a polynomial that fits the baseline of the data.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Notes

To only fit regions without peaks, supply a weight array with zero values at the indices where peaks are located.
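This can be demonstrated with plain numpy, which mirrors what a weighted polynomial fit does (the peak location here is made up for the example):

```python
import numpy as np
from numpy.polynomial.polynomial import Polynomial

x = np.linspace(-1, 1, 100)
true_baseline = 2 - x + 0.5 * x**2
y = true_baseline.copy()
y[40:60] += 5  # a broad "peak" in the middle of the data

weights = np.ones_like(y)
weights[40:60] = 0  # exclude the peak region from the fit

# Weighted quadratic fit; zero-weight points do not influence the result.
coef = np.polynomial.polynomial.polyfit(x, y, 2, w=weights)
baseline = Polynomial(coef)(x)  # coefficients are in increasing order
```

Because the remaining points lie exactly on the quadratic, the fit recovers the true baseline.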

pybaselines.polynomial.quant_reg(data, x_data=None, poly_order=2, quantile=0.05, tol=1e-06, max_iter=250, weights=None, eps=None, return_coef=False)[source]

Approximates the baseline of the data using quantile regression.

Parameters:
data : array-like, shape (N,)

The y-values of the measured data, with N data points.

x_data : array-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

poly_order : int, optional

The polynomial order for fitting the baseline. Default is 2.

quantile : float, optional

The quantile at which to fit the baseline. Default is 0.05.

tol : float, optional

The exit criteria. Default is 1e-6. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a lower value to get a good fit.

max_iter : int, optional

The maximum number of iterations. Default is 250. For extreme quantiles (quantile < 0.01 or quantile > 0.99), may need to use a higher value to ensure convergence.

weights : array-like, shape (N,), optional

The weighting array. If None (default), then will be an array with size equal to N and all values set to 1.

eps : float, optional

A small value added to the square of the residual to prevent dividing by 0. Default is None, which uses the square of the maximum-absolute-value of the fit each iteration multiplied by 1e-6.

return_coef : bool, optional

If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.

Returns:
baseline : numpy.ndarray, shape (N,)

The calculated baseline.

params : dict

A dictionary with the following items:

  • 'weights': numpy.ndarray, shape (N,)

    The weight array used for fitting the data.

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.

  • 'coef': numpy.ndarray, shape (poly_order + 1,)

    Only if return_coef is True. The array of polynomial parameters for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.

Raises:
ValueError

Raised if quantile is not between 0 and 1.

Notes

Application of quantile regression for baseline fitting as described in [16].

Performs quantile regression using iteratively reweighted least squares (IRLS) as described in [17].

References

[16]

Komsta, Ł. Comparison of Several Methods of Chromatographic Baseline Removal with a New Approach Based on Quantile Regression. Chromatographia, 2011, 73, 721-731.

[17]

Schnabel, S., et al. Simultaneous estimation of quantile curves using quantile sheets. AStA Advances in Statistical Analysis, 2013, 97, 77-87.
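A hedged sketch of the IRLS approach with plain numpy (an illustration, not the pybaselines implementation): each iteration reweights the residuals by the quantile check-function weights, with eps guarding the division by zero following the default rule described above:

```python
import numpy as np

def quant_reg_sketch(y, x, poly_order=2, quantile=0.05, tol=1e-6, max_iter=250, eps=None):
    # Illustrative IRLS quantile polynomial regression.
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y)
    baseline = np.zeros_like(y)
    for _ in range(max_iter):
        # polyfit applies weights to the unsquared residuals, so pass sqrt(w).
        coef = np.polynomial.polynomial.polyfit(x, y, poly_order, w=np.sqrt(w))
        new_baseline = np.polynomial.polynomial.polyval(x, coef)
        residual = y - new_baseline
        # eps prevents division by zero; default mirrors the documented rule.
        eps_i = eps if eps is not None else (np.abs(new_baseline).max() ** 2) * 1e-6
        w = np.where(residual > 0, quantile, 1 - quantile) / np.sqrt(residual**2 + eps_i)
        converged = np.abs(new_baseline - baseline).max() < tol * max(np.abs(baseline).max(), 1)
        baseline = new_baseline
        if converged:
            break
    return baseline

x = np.linspace(-1, 1, 500)
noise = np.abs(np.random.default_rng(0).normal(0, 0.1, 500))  # one-sided "peaks"
y = 1 + 0.5 * x + noise
baseline = quant_reg_sketch(y, x, poly_order=1, quantile=0.05)
```

With quantile=0.05, the fitted line tracks the lower envelope of the data, leaving nearly all points above the baseline.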