Spline Baselines

pybaselines.spline contains algorithms that fit splines to the baseline.

Introduction

A spline is a piecewise joining of individual curves. There are different types of splines, but only basis splines (B-splines) will be discussed since they are predominantly used in pybaselines. B-splines can be expressed as:

\[z(x_i) = \sum\limits_{j}^M {B_j(x_i) c_j}\]

where \(N\) is the number of points in \(x\), \(M\) is the number of spline basis functions, \(B_j(x_i)\) is the j-th basis function evaluated at \(x_i\), and \(c_j\) is the coefficient for the j-th basis (which is analogous to the height of the j-th basis). In pybaselines, the number of spline basis functions, \(M\), is calculated as the number of knots, num_knots, plus the spline degree minus 1.
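
The relationship between the number of knots and the number of basis functions can be illustrated with SciPy (version 1.8 or newer for BSpline.design_matrix). The following is a minimal sketch, not pybaselines' internal code, in which a uniform knot sequence is padded by repeating each end knot spline_degree extra times:

    import numpy as np
    from scipy.interpolate import BSpline

    x = np.linspace(0, 1000, 500)
    num_knots = 25
    spline_degree = 3

    # num_knots uniformly spaced knots spanning the data, with each end knot
    # repeated spline_degree extra times so the basis covers the full domain
    inner_knots = np.linspace(x.min(), x.max(), num_knots)
    knots = np.concatenate((
        np.repeat(inner_knots[0], spline_degree),
        inner_knots,
        np.repeat(inner_knots[-1], spline_degree),
    ))

    # design matrix of shape (N, M), where M = num_knots + spline_degree - 1
    basis = BSpline.design_matrix(x, knots, spline_degree)
    assert basis.shape[1] == num_knots + spline_degree - 1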

For regular B-spline fitting, the spline coefficients that best fit the data are obtained by minimizing the least-squares expression:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2\]

where \(y_i\) and \(x_i\) are the measured data, and \(w_i\) is the weighting. In order to control the smoothness of the fitted spline, a penalty on the finite difference between spline coefficients is added, resulting in penalized B-splines, called P-splines (several good papers exist that introduce P-splines). The function minimized for P-splines is thus:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

where \(\lambda\) is the penalty scale factor, and \(\Delta^d\) is the finite-difference operator of order d. Note that P-splines use uniformly spaced knots so that the finite-difference is easy to calculate.

The resulting linear equation for solving the above minimization is:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

where \(W\) is the diagonal matrix of the weights, \(B\) is the matrix containing all of the spline basis functions, and \(D_d\) is the matrix version of \(\Delta^d\) (the same as explained for Whittaker-smoothing-based algorithms). P-splines have similarities with Whittaker smoothing; in fact, if the number of basis functions, \(M\), is set equal to the number of data points, \(N\), and the spline degree is set to 0, then \(B\) becomes the identity matrix and the above equation becomes identical to the equation used for Whittaker smoothing.
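
As a minimal sketch of how the above linear system could be assembled and solved with SciPy's sparse tools (pybaselines itself uses more efficient banded solvers, so this is illustrative only; it reuses the basis and x arrays from the previous snippet):

    import numpy as np
    from scipy.sparse import csc_matrix, diags
    from scipy.sparse.linalg import spsolve

    # example data: a smooth baseline plus noise
    y = 5 * np.sin(x / 100) + np.random.default_rng(0).normal(0, 0.1, x.size)

    num_bases = basis.shape[1]
    diff_order = 2
    lam = 1e3

    # finite-difference matrix of order diff_order acting on the coefficients
    D = csc_matrix(np.diff(np.eye(num_bases), diff_order, axis=0))

    weights = np.ones(x.size)
    W = diags(weights)

    lhs = (basis.T @ W @ basis) + lam * (D.T @ D)
    rhs = basis.T @ (weights * y)
    coefficients = spsolve(lhs.tocsc(), rhs)
    baseline = basis @ coefficients

With all weights equal to one this gives an unweighted P-spline fit; most of the algorithms below differ mainly in how the weights are updated between iterations.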

Algorithms

mixture_model (Mixture Model)

mixture_model() considers the data as a mixture model composed of a baseline with noise and peaks. The weighting for the penalized spline fitting the baseline is iteratively determined by fitting the residual with a normal distribution centered at 0 (representing the noise) and a uniform distribution for residuals >= 0 (plus a third uniform distribution for residuals <= 0 if symmetric is set to True) representing the peaks. After fitting the total model to the residuals, the weighting is calculated from the posterior probability that each residual value belongs to the noise's normal distribution.
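
As a rough illustration of the posterior-probability weighting (not the actual pybaselines implementation, and with the mixture parameters simply passed in rather than estimated from the fit):

    import numpy as np
    from scipy.stats import norm

    def noise_posterior(residual, noise_fraction, noise_sigma):
        # uniform density over the span of the positive residuals (the peaks)
        positive = residual > 0
        span = residual[positive].max() if positive.any() else 1.0
        uniform_density = np.where(positive, 1.0 / span, 0.0)
        # normal density centered at 0 for the noise component
        normal_density = norm.pdf(residual, loc=0, scale=noise_sigma)
        numerator = noise_fraction * normal_density
        denominator = numerator + (1 - noise_fraction) * uniform_density
        # posterior probability that each residual belongs to the noise
        return numerator / np.maximum(denominator, np.finfo(float).tiny)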

irsqr (Iterative Reweighted Spline Quantile Regression)

irsqr() uses penalized splines and iterative reweighted least squares to perform quantile regression on the data.

corner_cutting (Corner-Cutting Method)

corner_cutting() iteratively removes corner points and then creates a quadratic Bezier spline from the remaining points. Continuity between the individual Bezier curves is maintained by adding control points halfway between all but the first and last non-corner points.

pspline_asls (Penalized Spline Version of asls)

pspline_asls() is a penalized spline version of asls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[\begin{split}w_i = \left\{\begin{array}{cr} p & y_i > z_i \\ 1 - p & y_i \le z_i \end{array}\right.\end{split}\]
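
A minimal sketch of the reweighting loop follows; the solve_pspline callable stands in for solving the linear system above (for example, with the approach shown in the introduction) and is not part of the pybaselines API:

    import numpy as np

    def pspline_asls_sketch(y, solve_pspline, p=0.01, max_iter=50, tol=1e-3):
        weights = np.ones_like(y, dtype=float)
        for _ in range(max_iter):
            # fit the penalized spline with the current weights
            baseline = solve_pspline(y, weights)
            # asymmetric weighting: small weight above the baseline (peaks),
            # large weight below it
            new_weights = np.where(y > baseline, p, 1 - p)
            if np.linalg.norm(new_weights - weights) / np.linalg.norm(weights) < tol:
                break
            weights = new_weights
        return baseline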

pspline_iasls (Penalized Spline Version of iasls)

pspline_iasls() is a penalized spline version of iasls().

Minimized function:

\[\sum\limits_{i}^N (w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j}))^2 + \lambda \sum\limits_{i}^{M - 2} (\Delta^2 c_i)^2 + \lambda_1 \sum\limits_{i}^{N - 1} (\Delta^1 (y_i - \sum\limits_{j}^M {B_j(x_i) c_j}))^2\]

Linear system:

\[(B^{\top} W^{\top} W B + \lambda_1 B^{\top} D_1^{\top} D_1 B + \lambda D_2^{\top} D_2) c = (B^{\top} W^{\top} W + \lambda_1 B^{\top} D_1^{\top} D_1) y\]

Weighting:

\[\begin{split}w_i = \left\{\begin{array}{cr} p & y_i > z_i \\ 1 - p & y_i \le z_i \end{array}\right.\end{split}\]

pspline_airpls (Penalized Spline Version of airpls)

pspline_airpls() is a penalized spline version of airpls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[\begin{split}w_i = \left\{\begin{array}{cr} 0 & y_i \ge z_i \\ exp{\left(\frac{t (y_i - z_i)}{|\mathbf{r}^-|}\right)} & y_i < z_i \end{array}\right.\end{split}\]

where \(t\) is the iteration number and \(|\mathbf{r}^-|\) is the l1-norm of the negative values in the residual vector \(\mathbf r\), i.e., \(\sum\limits_{y_i - z_i < 0} |y_i - z_i|\).
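
A sketch of this weight update in NumPy (illustrative only; iteration is the iteration number \(t\)):

    import numpy as np

    def airpls_weights(y, baseline, iteration):
        residual = y - baseline
        negative = residual < 0
        if not negative.any():
            # no points below the baseline; all weights are zero
            return np.zeros_like(y, dtype=float)
        neg_norm = np.abs(residual[negative]).sum()  # l1-norm of negative residuals
        weights = np.zeros_like(y, dtype=float)
        weights[negative] = np.exp(iteration * residual[negative] / neg_norm)
        return weights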

pspline_arpls (Penalized Spline Version of arpls)

pspline_arpls() is a penalized spline version of arpls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[w_i = \frac {1} {1 + exp{\left(\frac {2(r_i - (-\mu^- + 2 \sigma^-))} {\sigma^-} \right)}}\]

where \(r_i = y_i - z_i\) and \(\mu^-\) and \(\sigma^-\) are the mean and standard deviation, respectively, of the negative values in the residual vector \(\mathbf r\).
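
A sketch of this weight update (illustrative only; it assumes the residual contains negative values):

    import numpy as np

    def arpls_weights(y, baseline):
        residual = y - baseline
        negative = residual[residual < 0]
        mean_neg = negative.mean()
        std_neg = negative.std()
        return 1 / (1 + np.exp(2 * (residual - (-mean_neg + 2 * std_neg)) / std_neg))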

pspline_drpls (Penalized Spline Version of drpls)

pspline_drpls() is a penalized spline version of drpls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - 2}(1 - \eta w_{i,intp}) (\Delta^2 c_i)^2 + \sum\limits_{i}^{M - 1} (\Delta^1 (c_i))^2\]

where \(\eta\) is a value between 0 and 1 that controls the effective value of \(\lambda\). \(w_{intp}\) are the weights, \(w\), after interpolating using \(x\) and the basis midpoints in order to map the weights from length \(N\) to length \(M\).
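
One way to perform that mapping is to interpolate the weights at the midpoint of each basis function's support; the midpoint definition below is an assumption for illustration, not necessarily the one used internally by pybaselines:

    import numpy as np

    def interpolate_to_bases(x, weights, knots, spline_degree):
        # basis function j is nonzero on [knots[j], knots[j + spline_degree + 1]]
        num_bases = len(knots) - spline_degree - 1
        midpoints = np.array([
            0.5 * (knots[j] + knots[j + spline_degree + 1]) for j in range(num_bases)
        ])
        # map the length-N weights to length-M values at the basis midpoints
        return np.interp(midpoints, x, weights)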

Linear system:

\[(B^{\top}W B + D_1^{\top} D_1 + \lambda (I - \eta W_{intp}) D_2^{\top} D_2) c = B^{\top} W y\]

where \(I\) is the identity matrix.

Weighting:

\[w_i = \frac{1}{2}\left( 1 - \frac {exp(t)(r_i - (-\mu^- + 2 \sigma^-))/\sigma^-} {1 + abs[exp(t)(r_i - (-\mu^- + 2 \sigma^-))/\sigma^-]} \right)\]

where \(r_i = y_i - z_i\), \(t\) is the iteration number, and \(\mu^-\) and \(\sigma^-\) are the mean and standard deviation, respectively, of the negative values in the residual vector \(\mathbf r\).

pspline_iarpls (Penalized Spline Version of iarpls)

pspline_iarpls() is a penalized spline version of iarpls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[w_i = \frac{1}{2}\left( 1 - \frac {exp(t)(r_i - 2 \sigma^-)/\sigma^-} {\sqrt{1 + [exp(t)(r_i - 2 \sigma^-)/\sigma^-]^2}} \right)\]

where \(r_i = y_i - z_i\), \(t\) is the iteration number, and \(\sigma^-\) is the standard deviation of the negative values in the residual vector \(\mathbf r\).
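
A sketch of this weight update (illustrative only; iteration is the iteration number \(t\), and the residual is assumed to contain negative values):

    import numpy as np

    def iarpls_weights(y, baseline, iteration):
        residual = y - baseline
        std_neg = residual[residual < 0].std()
        scaled = np.exp(iteration) * (residual - 2 * std_neg) / std_neg
        return 0.5 * (1 - scaled / np.sqrt(1 + scaled**2))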

pspline_aspls (Penalized Spline Version of aspls)

pspline_aspls() is a penalized spline version of aspls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} \alpha_{i,intp} (\Delta^d c_i)^2\]

where

\[\alpha_i = \frac {abs(r_i)} {max(abs(\mathbf r))}\]

and \(\alpha_{intp}\) is the \(\alpha\) array after interpolating using \(x\) and the basis midpoints in order to map \(\alpha\) from length \(N\) to length \(M\).

Linear system:

\[(B^{\top} W B + \lambda \alpha_{intp} D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[w_i = \frac {1} {1 + exp{\left(\frac {0.5 (r_i - \sigma^-)} {\sigma^-} \right)}}\]

where \(r_i = y_i - z_i\) and \(\sigma^-\) is the standard deviation of the negative values in the residual vector \(\mathbf r\). (Note that the \(0.5 (r_i - \sigma^-) / \sigma^-\) term differs from the published version of asPLS, which used \(2 (r_i - \sigma^-) / \sigma^-\). pybaselines uses the factor of 0.5 since it matches the results in Table 2 and Figure 5 of the asPLS paper more closely than the factor of 2 and fits noisy data much better.)
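
A sketch of this weight update with the 0.5 factor (illustrative only; the residual is assumed to contain negative values):

    import numpy as np

    def aspls_weights(y, baseline):
        residual = y - baseline
        std_neg = residual[residual < 0].std()
        return 1 / (1 + np.exp(0.5 * (residual - std_neg) / std_neg))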

pspline_psalsa (Penalized Spline Version of psalsa)

pspline_psalsa() is a penalized spline version of psalsa().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[\begin{split}w_i = \left\{\begin{array}{cr} p \cdot exp{\left(\frac{-(y_i - z_i)}{k}\right)} & y_i > z_i \\ 1 - p & y_i \le z_i \end{array}\right.\end{split}\]

where \(k\) is a factor that controls the exponential decay of the weights for baseline values greater than the data and should be approximately the height at which a value could be considered a peak.
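
A sketch of this weight update (illustrative only; the values of p and k are placeholders, not the function's defaults):

    import numpy as np

    def psalsa_weights(y, baseline, p=0.01, k=1.0):
        residual = y - baseline
        # exponential decay of the weight for points above the baseline
        return np.where(residual > 0, p * np.exp(-residual / k), 1 - p)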

pspline_derpsalsa (Penalized Spline Version of derpsalsa)

pspline_derpsalsa() is a penalized spline version of derpsalsa().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[w_i = w_{0i} \cdot w_{1i} \cdot w_{2i}\]

where:

\[\begin{split}w_{0i} = \left\{\begin{array}{cr} p \cdot exp{\left(\frac{-[(y_i - z_i)/k]^2}{2}\right)} & y_i > z_i \\ 1 - p & y_i \le z_i \end{array}\right.\end{split}\]
\[w_{1i} = exp{\left(\frac{-[y_{sm_i}' / rms(y_{sm}')]^2}{2}\right)}\]
\[w_{2i} = exp{\left(\frac{-[y_{sm_i}'' / rms(y_{sm}'')]^2}{2}\right)}\]

\(k\) is a factor that controls the exponential decay of the weights for baseline values greater than the data and should be approximately the height at which a value could be considered a peak, \(y_{sm}'\) and \(y_{sm}''\) are the first and second derivatives, respectively, of the smoothed data, \(y_{sm}\), and \(rms()\) is the root-mean-square operator. \(w_1\) and \(w_2\) are precomputed, while \(w_0\) is updated each iteration.
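
A sketch of these weight components (illustrative only; y_smooth is a pre-smoothed copy of the data, and the values of p and k are placeholders, not the function's defaults):

    import numpy as np

    def derpsalsa_weights(y, baseline, y_smooth, p=0.01, k=1.0):
        residual = y - baseline
        # peak-dependent component, updated each iteration
        w0 = np.where(residual > 0, p * np.exp(-0.5 * (residual / k)**2), 1 - p)
        # derivative-based components, computed once from the smoothed data
        d1 = np.gradient(y_smooth)
        d2 = np.gradient(d1)
        w1 = np.exp(-0.5 * (d1 / np.sqrt(np.mean(d1**2)))**2)
        w2 = np.exp(-0.5 * (d2 / np.sqrt(np.mean(d2**2)))**2)
        return w0 * w1 * w2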

pspline_mpls (Penalized Spline Version of mpls)

pspline_mpls() is a penalized spline version of mpls().

Minimized function:

\[\sum\limits_{i}^N w_i (y_i - \sum\limits_{j}^M {B_j(x_i) c_j})^2 + \lambda \sum\limits_{i}^{M - d} (\Delta^d c_i)^2\]

Linear system:

\[(B^{\top} W B + \lambda D_d^{\top} D_d) c = B^{\top} W y\]

Weighting:

\[\begin{split}w_i = \left\{\begin{array}{cr} p & y_i > z_i \\ 1 - p & y_i \le z_i \end{array}\right.\end{split}\]
