pybaselines.smooth

Module Contents

Functions

ipsa

Iterative Polynomial Smoothing Algorithm (IPSA).

noise_median

The noise-median method for baseline identification.

ria

Range Independent Algorithm (RIA).

snip

Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP).

swima

Small-window moving average (SWiMA) baseline.

pybaselines.smooth.ipsa(data, half_window=None, max_iter=500, tol=None, roi=None, original_criteria=False, x_data=None, **pad_kwargs)

Iterative Polynomial Smoothing Algorithm (IPSA).

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The half-window to use for the smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use 4 times the output of optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.

max_iterint, optional

The maximum number of iterations. Default is 500.

tolfloat, optional

The exit criteria. Default is None, which uses 1e-3 if original_criteria is False, and 1 / (max(data) - min(data)) if original_criteria is True.

roislice or array-like, shape (N,), optional

The region of interest, such that np.asarray(data)[roi] gives the values for calculating the tolerance if original_criteria is True. Not used if original_criteria is False. Default is None, which uses all values in data.

original_criteriabool, optional

Whether to use the original exit criteria from the reference, which is difficult to use since it requires knowledge of how high the peaks should be after baseline correction. If False (default), then compares norm(old - new) / norm(old), where old is the previous iteration's baseline, and new is the current iteration's baseline.

x_dataarray-like, optional

The x-values. Not used by this function, but input is allowed for consistency with other functions.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge.
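As a rough illustration of how 'tol_history' can be read, the sketch below computes a relative change between successive baselines and compares the last entry against tol. The helper names relative_change and did_converge, the sample baselines, and the use of the Euclidean norm of the difference are assumptions made for illustration, not the library's confirmed implementation.

```python
import math


def relative_change(old, new):
    """Return norm(old - new) / norm(old) for two equal-length sequences."""
    diff_norm = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
    old_norm = math.sqrt(sum(o ** 2 for o in old))
    # Guard against division by zero when the previous baseline is all zeros.
    return diff_norm / max(old_norm, 1e-12)


def did_converge(tol_history, tol):
    """Interpret 'tol_history': converged if the last value does not exceed tol."""
    return len(tol_history) > 0 and tol_history[-1] <= tol


# Three hypothetical successive baselines that change less each iteration.
baselines = [[1.0, 2.0, 3.0], [1.0, 2.1, 3.0], [1.0, 2.1001, 3.0]]
tol_history = [
    relative_change(prev, cur) for prev, cur in zip(baselines, baselines[1:])
]
print(tol_history, did_converge(tol_history, tol=1e-3))
```

If the check fails after max_iter iterations, raising max_iter or loosening tol are the obvious parameters to revisit.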

References

Wang, T., et al. Background Subtraction of Raman Spectra Based on Iterative Polynomial Smoothing. Applied Spectroscopy. 71(6) (2017) 1169-1179.

pybaselines.smooth.noise_median(data, half_window=None, smooth_half_window=None, sigma=None, x_data=None, **pad_kwargs)

The noise-median method for baseline identification.

Assumes the baseline can be considered as the median value within a moving window, and the resulting baseline is then smoothed with a Gaussian kernel.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

half_windowint, optional

The index-based size to use for the median window. The window spans [-half_window, ..., half_window] around each point, for a total size of 2 * half_window + 1. Default is None, which will use twice the output from optimize_window(), which is a reasonable starting value.

smooth_half_windowint, optional

The half window to use for smoothing. Default is None, which will use the same value as half_window.

sigmafloat, optional

The standard deviation of the smoothing Gaussian kernel. Default is None, which will use (2 * smooth_half_window + 1) / 6.

x_dataarray-like, optional

The x-values. Not used by this function, but input is allowed for consistency with other functions.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated and smoothed baseline.

dict

An empty dictionary, just to match the output of all other algorithms.
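The two steps described above (a moving median followed by Gaussian smoothing) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, and the edge handling (repeating the first and last values) is an assumed stand-in for whatever pad_edges() does.

```python
import math
import statistics


def pad_replicate(values, pad):
    """Repeat the first and last values `pad` times (an assumed padding mode)."""
    return [values[0]] * pad + list(values) + [values[-1]] * pad


def moving_median(values, half_window):
    """Median over the window [-half_window, ..., half_window] at each point."""
    padded = pad_replicate(values, half_window)
    size = 2 * half_window + 1
    return [statistics.median(padded[i:i + size]) for i in range(len(values))]


def gaussian_smooth(values, half_window, sigma=None):
    """Convolve with a normalized Gaussian kernel of size 2 * half_window + 1."""
    if sigma is None:
        sigma = (2 * half_window + 1) / 6  # default described for `sigma`
    kernel = [math.exp(-0.5 * (k / sigma) ** 2)
              for k in range(-half_window, half_window + 1)]
    total = sum(kernel)
    padded = pad_replicate(values, half_window)
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel)) / total
        for i in range(len(values))
    ]


# Flat baseline of 5.0 with one narrow two-point peak; the median ignores it.
data = [5.0] * 41
data[20] = 25.0
data[21] = 15.0
baseline = gaussian_smooth(moving_median(data, half_window=4), half_window=4)
print(max(baseline))
```

The peak is narrower than half of the 9-point window, so it never becomes the median, which is why half_window should be chosen larger than the features to be removed.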

References

Friedrichs, M., A model-free algorithm for the removal of baseline artifacts. J. Biomolecular NMR, 1995, 5, 147-153.

pybaselines.smooth.ria(data, x_data=None, half_window=None, max_iter=500, tol=0.01, side='both', width_scale=0.1, height_scale=1.0, sigma_scale=1.0 / 12.0, **pad_kwargs)

Range Independent Algorithm (RIA).

Adds additional data to the left and/or right of the input data, and then iteratively smooths until the area of the additional data is removed.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

x_dataarray-like, shape (N,), optional

The x-values of the measured data. Default is None, which will create an array from -1 to 1 with N points.

half_windowint, optional

The half-window to use for the smoothing each iteration. Should be approximately equal to the full-width-at-half-maximum of the peaks or features in the data. Default is None, which will use the output of optimize_window(), which is not always a good value, but at least scales with the number of data points and gives a starting point for tuning the parameter.

max_iterint, optional

The maximum number of iterations. Default is 500.

tolfloat, optional

The exit criteria. Default is 1e-2.

side{'both', 'left', 'right'}, optional

The side of the measured data to extend. Default is 'both'.

width_scalefloat, optional

The number of data points added to each side is width_scale * N. Default is 0.1.

height_scalefloat, optional

The height of the added Gaussian peak(s) is calculated as height_scale * max(data). Default is 1.

sigma_scalefloat, optional

The sigma value for the added Gaussian peak(s) is calculated as sigma_scale * width_scale * N. Default is 1/12, which will make the Gaussian span ± 6 sigma, making its total width about half of the added length.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data when adding the extended left and/or right sections.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

paramsdict

A dictionary with the following items:

  • 'tol_history': numpy.ndarray

    An array containing the calculated tolerance values for each iteration. The length of the array is the number of iterations completed. If the last value in the array is greater than the input tol value, then the function did not converge (if the array length is equal to max_iter) or the areas of the smoothed extended regions exceeded their initial areas (if the array length is < max_iter).

Raises:
ValueError

Raised if side is not 'left', 'right', or 'both'.
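To make the three scale parameters concrete, the sketch below works through the arithmetic for one extended side. It is an illustration under stated assumptions: the function name ria_extension is hypothetical, and both the rounding of the added length and the centering of the Gaussian within the extension are guesses rather than confirmed library behavior.

```python
import math


def ria_extension(data, width_scale=0.1, height_scale=1.0,
                  sigma_scale=1.0 / 12.0):
    """Return one Gaussian-shaped extension and its sigma (index units)."""
    n_points = len(data)
    added_len = round(width_scale * n_points)  # assumed rounding to an integer
    height = height_scale * max(data)
    sigma = sigma_scale * width_scale * n_points
    center = (added_len - 1) / 2  # assumed: peak centered within the extension
    extension = [
        height * math.exp(-0.5 * ((i - center) / sigma) ** 2)
        for i in range(added_len)
    ]
    return extension, sigma


data = [float(i % 7) for i in range(1000)]  # hypothetical signal with max of 6
extension, sigma = ria_extension(data)
print(len(extension), sigma, max(extension))
```

With N = 1000 and the defaults, 100 points are added per side and sigma is 100 / 12, so the added Gaussian has decayed to nearly zero at both ends of the extension.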

References

Krishna, H., et al. Range-independent background subtraction algorithm for recovery of Raman spectra of biological tissue. J Raman Spectroscopy. 43(12) (2012) 1884-1894.

pybaselines.smooth.snip(data, max_half_window=None, decreasing=False, smooth_half_window=None, filter_order=2, x_data=None, **pad_kwargs)

Statistics-sensitive Non-linear Iterative Peak-clipping (SNIP).

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

max_half_windowint or Sequence(int, int), optional

The maximum number of iterations. Should be set such that max_half_window is approximately (w-1)/2, where w is the index-based width of a feature or peak. max_half_window can also be a sequence of two integers for asymmetric peaks, with the first item corresponding to the max_half_window of the peak's left edge, and the second item for the peak's right edge [3]. Default is None, which will use the output from optimize_window(), which is an okay starting value.

decreasingbool, optional

If False (default), will iterate through window sizes from 1 to max_half_window. If True, will reverse the order and iterate from max_half_window to 1, which gives a smoother baseline according to [3] and [4].

smooth_half_windowint, optional

The half window to use for smoothing the data. If smooth_half_window is greater than 0, will perform a moving average smooth on the data for each window, which gives better results for noisy data [3]. Default is None, which will not perform any smoothing.

filter_order{2, 4, 6, 8}, optional

If the measured data has a more complicated baseline consisting of other elements such as Compton edges, then a higher filter_order should be selected [3]. Default is 2, which works well for approximating a linear baseline.

x_dataarray-like, optional

The x-values. Not used by this function, but input is allowed for consistency with other functions.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

An empty dictionary, just to match the output of all other algorithms.

Raises:
ValueError

Raised if filter_order is not 2, 4, 6, or 8.

Warns:
UserWarning

Raised if max_half_window is greater than (len(data) - 1) // 2.
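The core clipping step behind max_half_window and decreasing can be sketched in a few lines. This is a simplified second-order version written for illustration only: the name snip_sketch is hypothetical, and it omits the padding, optional smoothing, higher filter orders, and asymmetric windows that the documented function supports.

```python
import math


def snip_sketch(data, max_half_window, decreasing=False):
    """Second-order peak clipping over half-windows 1..max_half_window."""
    baseline = list(data)
    n_points = len(baseline)
    windows = range(1, max_half_window + 1)
    if decreasing:
        windows = reversed(windows)
    for w in windows:
        previous = baseline[:]  # clip against the previous pass, not partial updates
        for i in range(w, n_points - w):
            neighbor_mean = 0.5 * (previous[i - w] + previous[i + w])
            baseline[i] = min(previous[i], neighbor_mean)
    return baseline


# Linear baseline plus one Gaussian peak (sigma of 3 points) centered at index 50.
data = [0.5 * i + 10.0 * math.exp(-0.5 * ((i - 50) / 3.0) ** 2) for i in range(101)]
baseline = snip_sketch(data, max_half_window=15)
print(round(baseline[50], 3), round(data[50], 3))
```

Each pass replaces a point with the mean of its two neighbors at distance w whenever that mean is lower, so a peak is removed once w reaches roughly its half-width while a linear background is left unchanged.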

Notes

Algorithm initially developed by [1], and this specific version of the algorithm is adapted from [2], [3], and [4].

If data covers several orders of magnitude, better results can be obtained by first transforming the data using a log-log-square-root transform before using SNIP [2]:

transformed_data = np.log(np.log(np.sqrt(data + 1) + 1) + 1)

The baseline can then be reverted back to the original scale using the inverse (note that snip returns both the baseline and a dictionary, so only the first item is transformed):

baseline = -1 + (np.exp(np.exp(snip(transformed_data)[0]) - 1) - 1)**2
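As a quick sanity check on the two expressions above, the following sketch applies the transform and its inverse to a few scalar values using only the math module. The function names lls and inverse_lls are hypothetical, and no baseline fitting is performed in between; it only shows that the inverse undoes the transform.

```python
import math


def lls(value):
    """Apply log(log(sqrt(value + 1) + 1) + 1) to one non-negative value."""
    return math.log(math.log(math.sqrt(value + 1) + 1) + 1)


def inverse_lls(value):
    """Inverse of lls(), mapping a transformed value back to the original scale."""
    return -1 + (math.exp(math.exp(value) - 1) - 1) ** 2


data = [0.0, 3.0, 1.0e3, 1.0e6]
transformed = [lls(v) for v in data]
recovered = [inverse_lls(t) for t in transformed]
print(transformed)
print(recovered)
```

The transformed values span a much narrower range than the inputs, which is the reason the transform helps when the data covers several orders of magnitude.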

References

[1]

Ryan, C.G., et al. SNIP, A Statistics-Sensitive Background Treatment For The Quantitative Analysis Of Pixe Spectra In Geoscience Applications. Nuclear Instruments and Methods in Physics Research B, 1988, 34, 396-402.

[2]

Morháč, M., et al. Background elimination methods for multidimensional coincidence γ-ray spectra. Nuclear Instruments and Methods in Physics Research A, 1997, 401, 113-132.

[3]

Morháč, M., et al. Peak Clipping Algorithms for Background Estimation in Spectroscopic Data. Applied Spectroscopy, 2008, 62(1), 91-106.

[4]

Morháč, M. An algorithm for determination of peak regions and baseline elimination in spectroscopic data. Nuclear Instruments and Methods in Physics Research A, 2009, 600, 478-487.

pybaselines.smooth.swima(data, min_half_window=3, max_half_window=None, smooth_half_window=None, x_data=None, **pad_kwargs)

Small-window moving average (SWiMA) baseline.

Computes an iterative moving average to smooth peaks and obtain the baseline.

Parameters:
dataarray-like, shape (N,)

The y-values of the measured data, with N data points.

min_half_windowint, optional

The minimum half window value that must be reached before the exit criteria is considered. Can be increased to reduce the calculation time. Default is 3.

max_half_windowint, optional

The maximum number of iterations. Default is None, which will use (N - 1) / 2. Typically does not need to be specified.

smooth_half_windowint, optional

The half window to use for smoothing the input data with a moving average. Default is None, which will use N / 50. Use a value of 0 or less to not smooth the data. See Notes below for more details.

x_dataarray-like, optional

The x-values. Not used by this function, but input is allowed for consistency with other functions.

**pad_kwargs

Additional keyword arguments to pass to pad_edges() for padding the edges of the data to prevent edge effects from convolution.

Returns:
baselinenumpy.ndarray, shape (N,)

The calculated baseline.

dict

A dictionary with the following items:

  • 'half_window': list(int)

    A list of the half windows at which the exit criteria was reached. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2.

  • 'converged': list(bool or None)

    A list of the convergence status. Has a length of 1 if the main exit criteria was initially reached, otherwise has a length of 2. Each convergence status is True if the main exit criteria was reached, False if the second exit criteria was reached, and None if max_half_window is reached before either exit criteria.
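One way to read the returned dictionary is to pair each half window with its convergence status. The helper below is a hypothetical convenience function, and the example dictionary is made up to show the two-fit case; neither is part of the library.

```python
def describe_swima_params(params):
    """Summarize each fit recorded in a swima-style params dictionary."""
    labels = {
        True: "main exit criteria reached",
        False: "second exit criteria reached",
        None: "max_half_window reached before either exit criteria",
    }
    return [
        f"fit {index + 1}: half_window={half_window}, {labels[status]}"
        for index, (half_window, status) in enumerate(
            zip(params["half_window"], params["converged"])
        )
    ]


# Made-up output where the first fit stopped on the second criteria and the
# re-fit then reached the main criteria.
example_params = {"half_window": [12, 9], "converged": [False, True]}
for line in describe_swima_params(example_params):
    print(line)
```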

Notes

This algorithm requires the input data to be fairly smooth (noise-free), so it is recommended to either smooth the data beforehand, or specify a smooth_half_window value. Non-smooth data can cause the exit criteria to be reached prematurely (can be avoided by setting a larger min_half_window), while over-smoothed data can cause the exit criteria to be reached later than optimal.

The half-window at which convergence occurs is roughly close to the index-based full-width-at-half-maximum of a peak or feature, but can vary. Therefore, it is better to set a min_half_window that is smaller than expected to not miss the exit criteria.

If the main exit criteria is not reached on the initial fit, a Gaussian baseline (which is well handled by this algorithm) is added to the data, and it is re-fit.
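The pre-smoothing recommended in these notes can be pictured as a plain moving average. The sketch below is a standalone illustration with a hypothetical moving_average helper; it shrinks the window at the edges rather than padding, and it assumes the N / 50 default is taken as an integer via floor division.

```python
def moving_average(values, half_window):
    """Average each point with up to `half_window` neighbors on either side."""
    if half_window <= 0:
        return list(values)  # a value of 0 or less means no smoothing
    n_points = len(values)
    smoothed = []
    for i in range(n_points):
        lo = max(0, i - half_window)
        hi = min(n_points, i + half_window + 1)
        # The window shrinks at the edges here instead of padding the data.
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Alternating +/-1 noise on a flat signal of 10, with 200 points.
data = [10.0 + (1.0 if i % 2 else -1.0) for i in range(200)]
smooth_half_window = len(data) // 50  # assumed integer form of the N / 50 default
smoothed = moving_average(data, smooth_half_window)
print(max(abs(v - 10.0) for v in smoothed))
```

A larger half window suppresses more noise but also broadens real peaks, which is the over-smoothing trade-off described above.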

References

Schulze, H., et al. A Small-Window Moving Average-Based Fully Automated Baseline Estimation Method for Raman Spectra. Applied Spectroscopy, 2012, 66(7), 757-764.