pybaselines.classification
Module Contents
Functions
cwt_br: Continuous wavelet transform baseline recognition (CWTBR) algorithm.
dietrich: Dietrich's method for identifying baseline regions.
fabc: Fully automatic baseline correction (fabc).
fastchrom: Identifies baseline segments by thresholding the rolling standard deviation distribution.
golotvin: Golotvin's method for identifying baseline regions.
rubberband: Identifies baseline points by fitting a convex hull to the bottom of the data.
std_distribution: Identifies baseline segments by analyzing the rolling standard deviation distribution.
 pybaselines.classification.cwt_br(data, x_data=None, poly_order=5, scales=None, num_std=1.0, min_length=2, max_iter=50, tol=0.001, symmetric=False, weights=None, **pad_kwargs)[source]
Continuous wavelet transform baseline recognition (CWTBR) algorithm.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 poly_order : int, optional
The polynomial order for fitting the baseline. Default is 5.
 scales : array-like, optional
The scales at which to perform the continuous wavelet transform. Default is None.
 num_std : float, optional
The number of standard deviations to include when thresholding. Default is 1.0.
 min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
 max_iter : int, optional
The maximum number of iterations. Default is 50.
 tol : float, optional
The exit criteria. Default is 1e-3.
 symmetric : bool, optional
When fitting the identified baseline points with a polynomial, if symmetric is False (default), will add any point i as a baseline point where the fitted polynomial is greater than the input data for N/100 consecutive points on both sides of point i. If symmetric is True, then both positive and negative peaks are assumed to exist, and baseline points are not modified during the polynomial fitting.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
 'tol_history': numpy.ndarray
An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
 'best_scale': scalar
The scale at which the Shannon entropy of the continuous wavelet transform of the data is at a minimum.
Notes
Uses the standard deviation for determining outliers during polynomial fitting rather than the standard error used in the reference, since the number of standard errors to include when thresholding varies with data size, while the number of standard deviations is independent of data size.
References
Bertinetto, C., et al. Automatic Baseline Recognition for the Correction of Large Sets of Spectra Using Continuous Wavelet Transform and Iterative Fitting. Applied Spectroscopy, 2014, 68(2), 155-164.
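The scale selection behind 'best_scale' can be illustrated with plain numpy: take a crude continuous wavelet transform by convolving with a Ricker (Mexican hat) wavelet at several scales, and pick the scale whose coefficient distribution has the minimum Shannon entropy. This is only a sketch of the idea, not the library's implementation; ricker, shannon_entropy, and best_scale are illustrative helper names.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican hat) wavelet sampled at `points` positions for scale a."""
    x = np.arange(points) - (points - 1) / 2
    return (1 - (x / a) ** 2) * np.exp(-x**2 / (2 * a**2))

def shannon_entropy(coeffs):
    """Shannon entropy of the normalized absolute coefficient distribution."""
    p = np.abs(coeffs)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

def best_scale(y, scales):
    """Pick the CWT scale whose coefficients have minimum Shannon entropy."""
    entropies = [
        shannon_entropy(np.convolve(y, ricker(10 * a, a), mode='same'))
        for a in scales
    ]
    return scales[int(np.argmin(entropies))]

# a narrow Gaussian peak on a flat baseline
t = np.linspace(0, 10, 500)
signal = np.exp(-(t - 5) ** 2 / 0.05)
chosen = best_scale(signal, [2, 4, 8, 16])
```

A concentrated coefficient distribution (well-matched scale) gives low entropy, while a diffuse one (mismatched scale) gives high entropy, which is why the minimum is a reasonable selection criterion.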
 pybaselines.classification.dietrich(data, x_data=None, smooth_half_window=None, num_std=3.0, interp_half_window=5, poly_order=5, max_iter=50, tol=0.001, weights=None, return_coef=False, min_length=2, **pad_kwargs)[source]
Dietrich's method for identifying baseline regions.
Calculates the power spectrum of the data as the squared derivative of the data. Then baseline points are identified by iteratively removing points where the mean of the power spectrum is less than num_std times the standard deviation of the power spectrum.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 smooth_half_window : int, optional
The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 256. Set to 0 to not smooth the data.
 num_std : float, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.
 interp_half_window : int, optional
When interpolating between baseline segments, will use the average of data[i - interp_half_window:i + interp_half_window + 1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
 poly_order : int, optional
The polynomial order for fitting the identified baseline. Default is 5.
 max_iter : int, optional
The maximum number of iterations for fitting a polynomial to the identified baseline. If max_iter is 0, the returned baseline will be just the linear interpolation of the baseline segments. Default is 50.
 tol : float, optional
The exit criteria for fitting a polynomial to the identified baseline points. Default is 1e-3.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 return_coef : bool, optional
If True, will convert the polynomial coefficients for the fit baseline to a form that fits the input x_data and return them in the params dictionary. Default is False, since the conversion takes time.
 min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from smoothing.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
 'coef': numpy.ndarray, shape (poly_order + 1,)
Only if return_coef is True and max_iter is greater than 0. The array of polynomial coefficients for the baseline, in increasing order. Can be used to create a polynomial using numpy.polynomial.polynomial.Polynomial.
 'tol_history': numpy.ndarray
Only if max_iter is greater than 1. An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
Notes
When choosing parameters, first choose a smooth_half_window that appropriately smooths the data, and then reduce num_std until no peak regions are included in the baseline. If no value of num_std works, change smooth_half_window and repeat.
If max_iter is 0, the baseline is simply a linear interpolation of the identified baseline points. Otherwise, a polynomial is iteratively fit through the baseline points, and the interpolated sections are replaced each iteration with the polynomial fit.
References
Dietrich, W., et al. Fast and Precise Automatic Baseline Correction of One- and Two-Dimensional NMR Spectra. Journal of Magnetic Resonance. 1991, 91, 1-11.
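The classification step described above can be sketched in plain numpy: square the derivative to form the "power spectrum", then iteratively drop points exceeding the mean plus num_std standard deviations of the remaining points. This is only a sketch of the thresholding idea (smoothing, segment cleanup, interpolation, and polynomial fitting are omitted); dietrich_mask is an illustrative helper name.

```python
import numpy as np

def dietrich_mask(y, num_std=3.0, max_iter=50):
    """Classify baseline points by iteratively thresholding the squared
    derivative (the "power spectrum") at mean + num_std * std."""
    power = np.gradient(y) ** 2
    mask = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        mean = power[mask].mean()
        std = power[mask].std()
        new_mask = power < mean + num_std * std
        if np.array_equal(new_mask, mask):  # converged
            break
        mask = new_mask
    return mask

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
# linear baseline + narrow peak + small noise
y = 0.1 * x + np.exp(-(x - 5) ** 2 / 0.05) + rng.normal(0, 0.001, 500)
mask = dietrich_mask(y)
power = np.gradient(y) ** 2
```

Note that the power spectrum is small at a peak apex (the derivative crosses zero there), which is why the real method also removes short baseline segments via min_length and fills peak regions by interpolation.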
 pybaselines.classification.fabc(data, lam=1000000.0, scale=None, num_std=3.0, diff_order=2, min_length=2, weights=None, weights_as_mask=False, x_data=None, **pad_kwargs)[source]
Fully automatic baseline correction (fabc).
Similar to Dietrich's method, except that the derivative is estimated using a continuous wavelet transform and the baseline is calculated using Whittaker smoothing through the identified baseline points.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 lam : float, optional
The smoothing parameter. Larger values will create smoother baselines. Default is 1e6.
 scale : int, optional
The scale at which to calculate the continuous wavelet transform. Should be approximately equal to the index-based full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
 num_std : float, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 3.0.
 diff_order : int, optional
The order of the differential matrix. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
 min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 weights_as_mask : bool, optional
If True, signifies that the input weights is the mask to use for fitting, which skips the continuous wavelet calculation and just smooths the input data. Default is False.
 x_data : array-like, optional
The x-values. Not used by this function, but input is allowed for consistency with other functions.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution for the continuous wavelet transform.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
 'weights': numpy.ndarray, shape (N,)
The weight array used for fitting the data.
Notes
The classification of baseline points is similar to dietrich(), except that this method approximates the first derivative using a continuous wavelet transform with the Haar wavelet, which is more robust than the numerical derivative in Dietrich's method.
References
Cobas, J., et al. A new general-purpose fully automatic baseline-correction procedure for 1D and 2D NMR data. Journal of Magnetic Resonance, 2006, 183(1), 145-151.
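The Whittaker smoothing step through the classified baseline points can be sketched as a penalized least-squares solve, (W + lam * DᵀD) z = W y, where the diagonal weight matrix W holds the mask. A dense solve is used here for clarity (the library uses sparse/banded solvers); masked_whittaker is an illustrative helper name.

```python
import numpy as np

def masked_whittaker(y, mask, lam=1e6, diff_order=2):
    """Whittaker smoothing where only mask==True points carry weight:
    solves (W + lam * D^T D) z = W y with a dense matrix for clarity."""
    n = len(y)
    d = np.diff(np.eye(n), diff_order, axis=0)  # finite-difference matrix D
    w = mask.astype(float)
    return np.linalg.solve(np.diag(w) + lam * (d.T @ d), w * y)

# linear baseline plus a peak; the mask excludes the peak region
i = np.arange(200)
true_baseline = 0.001 * i
y = true_baseline + np.exp(-(i - 100.0) ** 2 / 20.0)
mask = np.ones(200, dtype=bool)
mask[80:121] = False
z = masked_whittaker(y, mask)
```

Because the second-order difference penalty vanishes for linear functions, the smoothed result tracks the linear baseline at the weighted points and interpolates smoothly under the masked-out peak.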
 pybaselines.classification.fastchrom(data, x_data=None, half_window=None, threshold=None, min_fwhm=None, interp_half_window=5, smooth_half_window=None, weights=None, max_iter=100, min_length=2, **pad_kwargs)[source]
Identifies baseline segments by thresholding the rolling standard deviation distribution.
Baseline points are identified as any point where the rolling standard deviation is less than the specified threshold. Peak regions are iteratively interpolated until the baseline is below the data.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 half_window : int, optional
The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
 threshold : float or Callable, optional
All points in the rolling standard deviation below threshold will be considered as baseline. Higher values will assign more points as baseline. Default is None, which will set the threshold as the 15th percentile of the rolling standard deviation. If threshold is Callable, it should take the rolling standard deviation as the only argument and output a float.
 min_fwhm : int, optional
After creating the interpolated baseline, any region where the baseline is greater than the data for min_fwhm consecutive points will have an additional baseline point added and reinterpolated. Should be set to approximately the index-based full-width-at-half-maximum of the smallest peak. Default is None, which uses 2 * half_window.
 interp_half_window : int, optional
When interpolating between baseline segments, will use the average of data[i - interp_half_window:i + interp_half_window + 1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
 smooth_half_window : int, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 max_iter : int, optional
The maximum number of iterations to attempt to fill in regions where the baseline is greater than the input data. Default is 100.
 min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
Notes
Only covers the baseline correction from FastChrom, not its peak finding and peak grouping capabilities.
References
Johnsen, L., et al. An automated method for baseline correction, peak finding and peak grouping in chromatographic data. Analyst. 2013, 138, 3502-3511.
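The core classification step can be sketched with numpy: compute the rolling standard deviation and mark every point below the 15th percentile of that distribution as baseline. The iterative interpolation and gap-filling stages are omitted; rolling_std and fastchrom_mask are illustrative helper names.

```python
import numpy as np

def rolling_std(y, half_window):
    """Rolling standard deviation over a window of 2 * half_window + 1
    points (simple O(N * window) loop for clarity)."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out[i] = y[lo:hi].std()
    return out

def fastchrom_mask(y, half_window=10, threshold=None):
    """Baseline wherever the rolling standard deviation is below the
    threshold (default: the 15th percentile of the rolling std)."""
    std = rolling_std(y, half_window)
    if threshold is None:
        threshold = np.percentile(std, 15)
    return std < threshold

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = np.exp(-(x - 5) ** 2 / 0.05) + rng.normal(0, 0.01, 500)
mask = fastchrom_mask(y)
```

Note that some noise is required for this approach: on perfectly noiseless data the rolling standard deviation of flat regions is exactly zero, and a percentile threshold degenerates.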
 pybaselines.classification.golotvin(data, x_data=None, half_window=None, num_std=2.0, sections=32, smooth_half_window=None, interp_half_window=5, weights=None, min_length=2, **pad_kwargs)[source]
Golotvin's method for identifying baseline regions.
Divides the data into sections and takes the minimum standard deviation of all sections as the noise standard deviation for the entire data. Then classifies any point where the rolling max minus min is less than num_std * (noise standard deviation) as belonging to the baseline.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 half_window : int, optional
The half-window to use for the rolling maximum and rolling minimum calculations. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
 num_std : float, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 2.0.
 sections : int, optional
The number of sections to divide the input data into for finding the minimum standard deviation. Default is 32.
 smooth_half_window : int, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
 interp_half_window : int, optional
When interpolating between baseline segments, will use the average of data[i - interp_half_window:i + interp_half_window + 1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 min_length : int, optional
Any region of consecutive baseline points less than min_length is considered to be a false positive and all points in the region are converted to peak points. A higher min_length ensures fewer points are falsely assigned as baseline points. Default is 2, which only removes lone baseline points.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
References
Golotvin, S., et al. Improved Baseline Recognition and Modeling of FT NMR Spectra. Journal of Magnetic Resonance. 2000, 146, 122-125.
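The classification idea can be sketched with numpy: estimate the noise level as the minimum standard deviation among equal sections, then mark as baseline every point whose local max-minus-min range is below num_std times that noise level. This sketch omits the interpolation and smoothing stages; golotvin_mask is an illustrative helper name.

```python
import numpy as np

def golotvin_mask(y, half_window=10, num_std=2.0, sections=32):
    """Noise level = minimum std among equal sections of the data;
    baseline wherever the rolling max-minus-min stays below
    num_std * noise."""
    noise = min(chunk.std() for chunk in np.array_split(y, sections))
    n = len(y)
    mask = np.empty(n, dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = y[lo:hi]
        mask[i] = (window.max() - window.min()) < num_std * noise
    return mask

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = np.exp(-(x - 5) ** 2 / 0.05) + rng.normal(0, 0.01, 500)
# the max-minus-min of a 21-point window of pure noise already spans
# several noise standard deviations, so num_std is set larger here
mask = golotvin_mask(y, num_std=8.0, sections=8)
```

Because the threshold compares a window's peak-to-peak range against the noise level, num_std interacts with half_window: wider windows sample larger noise ranges and need larger num_std.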
 pybaselines.classification.rubberband(data, x_data=None, segments=1, lam=None, diff_order=2, weights=None, smooth_half_window=None, **pad_kwargs)[source]
Identifies baseline points by fitting a convex hull to the bottom of the data.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 segments : int or array-like[int], optional
Used to fit multiple convex hulls to the data to negate the effects of concave data. If the input is an integer, it sets the number of equally sized segments the data will be split into. If the input is an array-like, each integer in the array will be the index that splits two segments, which allows constructing unequally sized segments. Default is 1, which fits a single convex hull to the data.
 lam : float or None, optional
The smoothing parameter for interpolating the baseline points using Whittaker smoothing. Set to 0 or None to use linear interpolation instead. Default is None, which does not smooth.
 diff_order : int, optional
The order of the differential matrix if using Whittaker smoothing. Must be greater than 0. Default is 2 (second order differential matrix). Typical values are 2 or 1.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered potential baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 smooth_half_window : int or None, optional
The half window to use for smoothing the input data with a moving average before calculating the convex hull, which gives much better results for noisy data. Set to None (default) or 0 to not smooth the data.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
 Raises:
 ValueError
Raised if the number of segments per window for the fitting is less than poly_order + 1 or greater than the total number of points, or if the values in self.x are not strictly increasing.
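The "rubberband" idea, finding the points where an elastic band stretched under the data would touch, is the lower convex hull of the (x, y) points. A minimal sketch using Andrew's monotone chain (lower half only, in plain Python/numpy; lower_hull_mask is an illustrative helper name):

```python
import numpy as np

def lower_hull_mask(x, y):
    """Mark the points lying on the lower convex hull of (x, y) -- the
    'rubberband' stretched under the data (Andrew's monotone chain)."""
    hull = []
    for i in range(len(x)):
        # pop the last hull point while it makes the chain non-convex
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            cross = (x[k] - x[j]) * (y[i] - y[j]) - (y[k] - y[j]) * (x[i] - x[j])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    mask = np.zeros(len(x), dtype=bool)
    mask[hull] = True
    return mask

x = np.linspace(0, 10, 201)
y_convex = 0.1 * (x - 5) ** 2                     # convex data: every point on the hull
y_peak = y_convex + np.exp(-(x - 5) ** 2 / 0.1)   # the same data with a peak added
mask_convex = lower_hull_mask(x, y_convex)
mask_peak = lower_hull_mask(x, y_peak)
```

For concave-up data every point lies on the lower hull, which is why a single hull handles convex curvature well; the segments parameter exists for the opposite (concave-down) case, where one hull would cut under the data.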
 pybaselines.classification.std_distribution(data, x_data=None, half_window=None, interp_half_window=5, fill_half_window=3, num_std=1.1, smooth_half_window=None, weights=None, **pad_kwargs)[source]
Identifies baseline segments by analyzing the rolling standard deviation distribution.
The rolling standard deviations are split into two distributions, with the smaller distribution assigned to noise. Baseline points are then identified as any point where the rolling standard deviation is less than a multiple of the median of the noise's standard deviation distribution.
 Parameters:
 data : array-like, shape (N,)
The y-values of the measured data, with N data points.
 x_data : array-like, shape (N,), optional
The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.
 half_window : int, optional
The half-window to use for the rolling standard deviation calculation. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use half of the value from optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.
 interp_half_window : int, optional
When interpolating between baseline segments, will use the average of data[i - interp_half_window:i + interp_half_window + 1], where i is the index of the peak start or end, to fit the linear segment. Default is 5.
 fill_half_window : int, optional
When a point is identified as a peak point, all points within ± fill_half_window are likewise set as peak points. Default is 3.
 num_std : float, optional
The number of standard deviations to include when thresholding. Higher values will assign more points as baseline. Default is 1.1.
 smooth_half_window : int, optional
The half window to use for smoothing the interpolated baseline with a moving average. Default is None, which will use half_window. Set to 0 to not smooth the baseline.
 weights : array-like, shape (N,), optional
The weighting array, used to override the function's baseline identification to designate peak points. Only elements with 0 or False values will have an effect; all nonzero values are considered baseline points. If None (default), then will be an array with size equal to N and all values set to 1.
 **pad_kwargs
Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from the moving average smoothing.
 Returns:
 baseline : numpy.ndarray, shape (N,)
The calculated baseline.
 params : dict
A dictionary with the following items:
 'mask': numpy.ndarray, shape (N,)
The boolean array designating baseline points as True and peak points as False.
References
Wang, K.C., et al. Distribution-Based Classification Method for Baseline Correction of Metabolomic 1D Proton Nuclear Magnetic Resonance Spectra. Analytical Chemistry. 2013, 85, 1231-1239.
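The distribution split can be sketched with numpy: compute the rolling standard deviation, treat the lower half of its distribution as the noise distribution, and classify as baseline any point whose rolling std is below num_std times the noise median. The library separates the two distributions more carefully; the simple median split and the std_distribution_mask name here are illustrative.

```python
import numpy as np

def std_distribution_mask(y, half_window=10, num_std=1.1):
    """Baseline wherever the rolling standard deviation is below
    num_std times the median of the noise (lower) half of the
    rolling-std distribution."""
    n = len(y)
    std = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        std[i] = y[lo:hi].std()
    noise = std[std <= np.median(std)]  # crude split: lower half = noise
    return std < num_std * np.median(noise)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = np.exp(-(x - 5) ** 2 / 0.05) + rng.normal(0, 0.01, 500)
mask = std_distribution_mask(y)
```

Because the threshold is derived from the noise distribution itself, the default num_std of 1.1 is much smaller than the values used by the other thresholding methods in this module.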