lam vs data size

When new Whittaker-smoothing-based algorithms are published in the literature, the lam value used by the researchers is usually reported as a single value or a range of values. However, these values can be misleading, since the lam value required by a particular Whittaker-smoothing-based algorithm depends on the number of data points. This causes issues when applying an algorithm to a new dataset, since the published optimal lam value is not universal. This example analyzes this dependence for all available functions in the pybaselines.whittaker module.

Note that the exact optimal lam values reported in this example are not of significant use since they depend on many other factors such as the baseline curvature, noise, peaks, etc.; however, the examined trends can be used to simplify the process of selecting lam values for fitting new datasets.
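As a quick illustration of the problem, the short sketch below (separate from the analysis in this example; the signal, baseline, and lam value are arbitrary assumptions chosen for demonstration) applies the same fixed lam to one signal sampled at two different sizes. A lam suited to the smaller size is too flexible for the larger one, so the fitted baseline begins following the peak and the noise.

import numpy as np
from pybaselines import Baseline

rng = np.random.default_rng(0)
for num_x in (1000, 10000):
    x = np.linspace(0, 1000, num_x)
    # a single gaussian peak on an exponentially decaying baseline, plus noise
    signal = 100 * np.exp(-0.5 * ((x - 500) / 20)**2)
    true_baseline = 50 * np.exp(-x / 300)
    y = signal + true_baseline + rng.normal(0, 1, num_x)
    # a fixed lam behaves very differently at the two sizes; the larger
    # size needs a much larger lam to produce a comparably smooth baseline
    fit_baseline, params = Baseline(x).arpls(y, lam=1e5)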

from itertools import cycle

import matplotlib.pyplot as plt
import numpy as np

from pybaselines import Baseline

# local import with setup code
from example_helpers import make_data, optimize_lam


def iasls(*args, lam=1, p=0.05, **kwargs):
    """Ensures the `lam_1` value for whittaker.iasls scales with `lam`."""
    # not sure if lam_1 should be fixed or proportional to lam;
    # both give similar results
    return Baseline().iasls(*args, lam=lam, lam_1=1e-8 * lam, p=p, **kwargs)
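The make_data and optimize_lam helpers come from the local example_helpers module and are not shown here. As a rough sketch of what optimize_lam could look like (an assumption for illustration; the real helper may differ), note that it returns log10 of the optimal lam, which matches how the rest of this example uses its result, and that the optimum from the previous, smaller data size is reused as the lower search bound.

def optimize_lam_sketch(y, true_baseline, func, min_lam=None, tol=1e-2, max_iter=50):
    """Hypothetical stand-in for example_helpers.optimize_lam.

    Returns log10 of the lam that minimizes the root-mean-square error
    between the fitted baseline and the known true baseline.
    """
    def error(log_lam):
        return np.sqrt(np.mean((func(y, lam=10**log_lam)[0] - true_baseline)**2))

    # golden-section search over log10(lam); min_lam is a safe lower bound
    # since the optimal lam only grows with the number of points
    a = 1.0 if min_lam is None else min_lam
    b = 14.0  # assumed upper bound on log10(lam)
    golden = (np.sqrt(5) - 1) / 2
    for _ in range(max_iter):
        if b - a < tol:
            break
        c = b - golden * (b - a)
        d = a + golden * (b - a)
        if error(c) < error(d):
            b = d
        else:
            a = c
    return (a + b) / 2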

Three baselines will be tested: an exponentially decaying baseline, a gaussian baseline, and a sinusoidal baseline. The exponential baseline is the smoothest and the sine baseline is the least smooth, so the sine baseline should require a lower lam value and the exponential baseline a higher one. The data for the three baseline types are plotted below.

bkg_dict = {}
bkg_types = ('exponential', 'gaussian', 'sine')
for bkg_type in bkg_types:
    bkg_dict[bkg_type] = {}
    plt.plot(make_data(1000, bkg_type)[0], label=bkg_type)
plt.legend()
(figure: example data for the exponential, gaussian, and sine baseline types)

For each function, the optimal lam value will be calculated for data sizes ranging from 500 to 20000 points. Then, the intercept and slope of a linear fit of log10(lam) versus log10(N), where N is the number of data points, will be reported.

print('Function, baseline type, intercept & slope of log(N) vs log(lam) fit')
print('-' * 60)

show_plots = False  # for debugging
num_points = np.logspace(np.log10(500), np.log10(20000), 6, dtype=int)
symbols = cycle(['o', 's', 'd'])
for i, func_name in enumerate((
    'asls', 'iasls', 'airpls', 'arpls', 'iarpls', 'drpls', 'aspls', 'psalsa', 'derpsalsa'
)):
    legend = [[], []]
    _, ax = plt.subplots(num=func_name)
    for bkg_type in bkg_types:
        # best_lams stores log10 of the optimal lam found for each data size
        best_lams = np.empty_like(num_points, dtype=float)
        # the previous optimum seeds the search for the next, larger size
        min_lam = None
        for j, num_x in enumerate(num_points):
            if func_name == 'iasls':
                func = iasls
            else:
                func = getattr(Baseline(), func_name)
            y, baseline = make_data(num_x, bkg_type)
            # use a slightly lower tolerance to speed up the calculation
            min_lam = optimize_lam(y, baseline, func, min_lam, tol=1e-2, max_iter=50)
            best_lams[j] = min_lam

            if show_plots:
                plt.figure(num=num_x)
                if i == 0:
                    plt.plot(y)
                plt.plot(baseline)
                plt.plot(func(y, lam=10**min_lam)[0], '--')

        # linear fit of log10(lam) versus log10(N); coeffs is [intercept, slope]
        fit = np.polynomial.polynomial.Polynomial.fit(np.log10(num_points), best_lams, 1)
        coeffs = fit.convert().coef
        print(f'{func_name:<11} {bkg_type:<13} {coeffs}')
        line = 10**fit(np.log10(num_points))
        bkg_dict[bkg_type][func_name] = line

        handle_1 = ax.plot(num_points, line, label=bkg_type)[0]
        handle_2 = ax.plot(num_points, 10**best_lams, next(symbols))[0]
        legend[0].append((handle_1, handle_2))
        legend[1].append(bkg_type)

    ax.loglog()
    ax.legend(*legend)
    ax.set_xlabel('Input Array Size, N')
    ax.set_ylabel('Optimal lam Value')
    ax.set_title(func_name)
(figures: optimal lam versus data size, one plot per function: asls, iasls, airpls, arpls, iarpls, drpls, aspls, psalsa, derpsalsa)
Function, baseline type, intercept & slope of log(N) vs log(lam) fit
------------------------------------------------------------
asls        exponential   [-5.86444476  3.96627881]
asls        gaussian      [-6.26444476  3.96627881]
asls        sine          [-7.20791573  3.99298647]
iasls       exponential   [-6.95319771  4.0821202 ]
iasls       gaussian      [-6.86444476  3.96627881]
iasls       sine          [-8.15319771  4.0821202 ]
airpls      exponential   [-5.39769977  3.96625643]
airpls      gaussian      [-5.17038765  4.0108367 ]
airpls      sine          [-6.03111749  3.96628054]
arpls       exponential   [-7.39857346  4.17128073]
arpls       gaussian      [-7.46113917  4.18915776]
arpls       sine          [-8.4685327   3.94839806]
iarpls      exponential   [-7.03210709  3.82369727]
iarpls      gaussian      [-7.91587793  4.10002995]
iarpls      sine          [-7.38404418  3.46708566]
drpls       exponential   [-7.1986639   4.17130657]
drpls       gaussian      [-6.53722308  4.01088492]
drpls       sine          [-8.7677514   4.09104086]
aspls       exponential   [-5.44074137  4.13570744]
aspls       gaussian      [-5.63793407  4.21586273]
aspls       sine          [-6.74718255  4.03754167]
psalsa      exponential   [-6.48060361  4.03756673]
psalsa      gaussian      [-6.6635041   4.10887608]
psalsa      sine          [-7.82258729  4.0019408 ]
derpsalsa   exponential   [-9.26767336  4.09101856]
derpsalsa   gaussian      [-8.7097234   4.05541159]
derpsalsa   sine          [-9.65235478  3.87710469]

To further analyze the relationship between lam and data size, the best-fit lines for all algorithms for each baseline type are shown below. Interestingly, for each baseline type, the slopes of the lam versus data size lines are approximately the same for all algorithms (roughly 4, as seen in the fit coefficients above); only the intercept differs. This makes sense, since all of the functions solve very similar linear systems for the baseline; thus, while the optimal lam values may differ between algorithms, the relationship between lam and data size should be similar for all of them.

for bkg_type in bkg_types:
    plt.figure()
    for func_name, line in bkg_dict[bkg_type].items():
        plt.plot(num_points, line, label=func_name)
    plt.legend()
    plt.loglog()
    plt.xlabel('Input Array Size, N')
    plt.ylabel('Optimal lam Value')
    plt.title(f'{bkg_type} baseline')

plt.show()
(figures: best-fit lines for all algorithms for the exponential, gaussian, and sine baselines)
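Since the slope of log10(lam) versus log10(N) is roughly 4 for every algorithm examined, a published lam value can be roughly rescaled from the data size it was reported for to a new data size. The helper below is a hedged sketch of that idea, not part of the original example; the numbers in the usage line are made up for illustration.

def rescale_lam(lam_published, n_published, n_new, slope=4.0):
    """Estimate lam for a new data size, assuming lam scales as N**slope."""
    return lam_published * (n_new / n_published)**slope

# e.g. a lam of 1e5 reported for 1000 points suggests roughly 1e9 for 10000 points
print(rescale_lam(1e5, 1000, 10000))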

Total running time of the script: (0 minutes 32.917 seconds)
