lam vs data size
When publishing new Whittaker-smoothing-based algorithms in the literature, researchers
usually report the lam value they used as a single value or a range of values. However,
these values can be misleading, since the lam value required for a particular
Whittaker-smoothing-based algorithm depends on the number of data points. This can
cause issues when adapting an algorithm to a new set of data, since the published
optimal lam value is not universal. This example analyzes this dependence for all
available functions in the pybaselines.whittaker module.
Note that the exact optimal lam values reported in this example are not themselves of
much use, since they depend on many other factors such as the baseline curvature, noise,
and peaks; however, the examined trends can be used to simplify the process of selecting
lam values when fitting new datasets.
from itertools import cycle
import matplotlib.pyplot as plt
import numpy as np
from pybaselines import Baseline
# local import with setup code
from example_helpers import make_data, optimize_lam
def iasls(*args, lam=1, p=0.05, **kwargs):
"""Ensures the `lam_1` value for whittaker.iasls scales with `lam`."""
# not sure if lam_1 should be fixed or proportional to lam;
# both give similar results
return Baseline().iasls(*args, lam=lam, lam_1=1e-8 * lam, p=p, **kwargs)
Three baselines will be tested: an exponentially decaying baseline, a Gaussian baseline,
and a sinusoidal baseline. The exponential baseline is the smoothest and the sine
baseline is the least smooth, so a lower lam value should be required to fit the sine
baseline and a higher value to fit the exponential baseline. The three baselines are
plotted below.
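The make_data helper used below is a local setup function that is not shown here, so as
a rough sketch only, the three baseline shapes it produces might look something like the
following (the function name, amplitudes, and widths here are assumptions for
illustration, not the actual helper's values):

```python
import numpy as np


def example_baselines(num_points):
    """Hypothetical stand-in for the baseline shapes from the local make_data helper."""
    x = np.linspace(0, 1000, num_points)
    return {
        # smooth, slowly decaying baseline
        'exponential': 5 * np.exp(-x / 300),
        # moderate curvature, centered bump
        'gaussian': 5 * np.exp(-0.5 * ((x - 500) / 150)**2),
        # oscillating baseline; the least smooth of the three
        'sine': 2 * np.sin(x * np.pi / 250),
    }


baselines = example_baselines(1000)
```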
bkg_dict = {}
bkg_types = ('exponential', 'gaussian', 'sine')
for bkg_type in bkg_types:
    bkg_dict[bkg_type] = {}
    plt.plot(make_data(1000, bkg_type)[0], label=bkg_type)
plt.legend()
For each function, the optimal lam value will be calculated for data sizes ranging from
500 to 20000 points. Further, the intercept and slope of the linear fit of the log of the
data size, N, versus the log of the lam value will be reported.
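As a minimal, self-contained sketch of that fit (the log10(lam) values below are made up
to follow an exact line, not taken from the actual optimization), the intercept and slope
can be recovered with numpy's Polynomial class:

```python
import numpy as np

# Hypothetical example: suppose the optimal log10(lam) values follow
# log10(lam) = 4 * log10(N) - 6 exactly (slope and intercept are made up here).
num_points = np.array([500, 1045, 2186, 4573, 9563, 20000])
best_lams = 4 * np.log10(num_points) - 6

# Fit log10(lam) as a linear function of log10(N); after convert(), the
# coefficients are ordered [intercept, slope].
fit = np.polynomial.polynomial.Polynomial.fit(np.log10(num_points), best_lams, 1)
intercept, slope = fit.convert().coef
```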
print('Function, baseline type, intercept & slope of log(N) vs log(lam) fit')
print('-' * 60)
show_plots = False # for debugging
num_points = np.logspace(np.log10(500), np.log10(20000), 6, dtype=int)
symbols = cycle(['o', 's', 'd'])
for i, func_name in enumerate((
    'asls', 'iasls', 'airpls', 'arpls', 'iarpls', 'drpls', 'aspls', 'psalsa', 'derpsalsa'
)):
    legend = [[], []]
    _, ax = plt.subplots(num=func_name)
    for bkg_type in bkg_types:
        best_lams = np.empty_like(num_points, float)
        min_lam = None
        for j, num_x in enumerate(num_points):
            if func_name == 'iasls':
                func = iasls
            else:
                func = getattr(Baseline(), func_name)
            y, baseline = make_data(num_x, bkg_type)
            # use a slightly lower tolerance to speed up the calculation
            min_lam = optimize_lam(y, baseline, func, min_lam, tol=1e-2, max_iter=50)
            best_lams[j] = min_lam

            if show_plots:
                plt.figure(num=num_x)
                if i == 0:
                    plt.plot(y)
                    plt.plot(baseline)
                plt.plot(func(y, lam=10**min_lam)[0], '--')

        fit = np.polynomial.polynomial.Polynomial.fit(np.log10(num_points), best_lams, 1)
        coeffs = fit.convert().coef
        print(f'{func_name:<11} {bkg_type:<13} {coeffs}')

        line = 10**fit(np.log10(num_points))
        bkg_dict[bkg_type][func_name] = line
        handle_1 = ax.plot(num_points, line, label=bkg_type)[0]
        handle_2 = ax.plot(num_points, 10**best_lams, next(symbols))[0]
        legend[0].append((handle_1, handle_2))
        legend[1].append(bkg_type)

    ax.loglog()
    ax.legend(*legend)
    ax.set_xlabel('Input Array Size, N')
    ax.set_ylabel('Optimal lam Value')
    ax.set_title(func_name)
Function, baseline type, intercept & slope of log(N) vs log(lam) fit
------------------------------------------------------------
asls exponential [-5.86444476 3.96627881]
asls gaussian [-6.26444476 3.96627881]
asls sine [-7.20791573 3.99298647]
iasls exponential [-6.95319771 4.0821202 ]
iasls gaussian [-6.86444476 3.96627881]
iasls sine [-8.15319771 4.0821202 ]
airpls exponential [-5.39769977 3.96625643]
airpls gaussian [-5.17038765 4.0108367 ]
airpls sine [-6.03111749 3.96628054]
arpls exponential [-7.39857346 4.17128073]
arpls gaussian [-7.46113917 4.18915776]
arpls sine [-8.4685327 3.94839806]
iarpls exponential [-7.03210709 3.82369727]
iarpls gaussian [-7.91587793 4.10002995]
iarpls sine [-7.38404418 3.46708566]
drpls exponential [-7.1986639 4.17130657]
drpls gaussian [-6.53722308 4.01088492]
drpls sine [-8.7677514 4.09104086]
aspls exponential [-5.44074137 4.13570744]
aspls gaussian [-5.63793407 4.21586273]
aspls sine [-6.74718255 4.03754167]
psalsa exponential [-6.48060361 4.03756673]
psalsa gaussian [-6.6635041 4.10887608]
psalsa sine [-7.82258729 4.0019408 ]
derpsalsa exponential [-9.26767336 4.09101856]
derpsalsa gaussian [-8.7097234 4.05541159]
derpsalsa sine [-9.65235478 3.87710469]
To further analyze the relationship between lam and data size, the best-fit lines for all
algorithms for each baseline type are shown below. Interestingly, for each baseline type,
the slopes of the lam versus data size lines are approximately the same for all
algorithms; only the intercepts differ. This makes sense, since all of the functions
solve very similar linear systems for the baseline; thus, while the optimal lam values
may differ between algorithms, the relationship between lam and the data size should be
similar for all of them.
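Since the fitted slope is close to 4 for every algorithm and baseline type, one practical
takeaway is that an optimal lam found at one data size can be roughly rescaled to another
size by treating lam as proportional to N**4. A minimal sketch (the helper name and the
numbers are illustrative only, not from this example's output):

```python
def rescale_lam(lam_old, n_old, n_new, slope=4.0):
    """Estimate the optimal lam for n_new points given the optimum at n_old.

    Assumes log10(lam) is linear in log10(N) with the given slope, which is
    approximately what the fits above show for Whittaker-based methods.
    """
    return lam_old * (n_new / n_old)**slope


# e.g. if lam=1e5 was optimal for 1000 points, a similar dataset with 4000
# points would need roughly 1e5 * 4**4 = 2.56e7
estimate = rescale_lam(1e5, 1000, 4000)
```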
for bkg_type in bkg_types:
    plt.figure()
    for func_name, line in bkg_dict[bkg_type].items():
        plt.plot(num_points, line, label=func_name)
    plt.legend()
    plt.loglog()
    plt.xlabel('Input Array Size, N')
    plt.ylabel('Optimal lam Value')
    plt.title(f'{bkg_type} baseline')

plt.show()