Note
Go to the end to download the full example code.
Fitting Multiple Datasets
When fitting multiple datasets that all share the same independant variable, pybaselines
allows saving time by reusing the same Baseline object to allow only
performing some of the computationally heavy setup only once. For example,
polynomial methods will only compute the Vandermonde
matrix, and potentially its pseudoinverse, once. Likewise,
spline methods will only have to compute the spline
basis matrix once. Note that this only applies if the same non-data parameters
(eg. poly_order, lam, etc.) are used for each fit.
This example will explore the efficiency of reusing the same Baseline object when fitting
multiple datasets for different types of algorithms.
from collections import defaultdict
from time import perf_counter
import matplotlib.pyplot as plt
import numpy as np
from pybaselines import Baseline
from pybaselines.utils import gaussian
num_points = 1000 # number of data points in one set of data
num_fits = 1000 # equivalent to the number of data in a dataset
x = np.linspace(0, 1000, num_points)
signal = (
+ gaussian(x, 6, 150, 5)
+ gaussian(x, 8, 350, 11)
+ gaussian(x, 6, 550, 6)
+ gaussian(x, 13, 700, 8)
+ gaussian(x, 9, 880, 7)
)
baseline = 5 + 10 * np.exp(-x / 600) + gaussian(x, 15, 1000, 400)
noise = np.random.default_rng(0).normal(0, 0.1, len(x))
y = signal + baseline + noise
Six different methods will be timed. The polynomial method penalized_poly(),
the spline method mixture_model(), the Whittaker smoothing method
iarpls(), the morphological method mor(), the smoothing
method ria(), and the classification method std_distribution().
methods = (
('penalized_poly', {'poly_order': 4}),
('mixture_model', {'lam': 1e5}),
('iarpls', {'lam': 1e5}),
('mor', {'half_window': 30}),
('ria', {'half_window': 20}),
('std_distribution', {'half_window': 25})
)
plt.plot(x, y)
timings = defaultdict(list)
for method, kwargs in methods:
baseline_fitter = Baseline(x_data=x, check_finite=False, assume_sorted=True)
for reuse_object in (True, False):
for i in range(num_fits):
if reuse_object:
func = getattr(baseline_fitter, method)
else:
func = getattr(Baseline(x, check_finite=False, assume_sorted=True), method)
t0 = perf_counter()
calc_baseline = func(y, **kwargs)[0]
t1 = perf_counter()
if i == 0 and reuse_object: # only plot once per algorithm
plt.plot(x, calc_baseline, label=method)
elif i > 0:
# only add timings after first call to allow for any necessary compilation
timings[f'{method}_{reuse_object}'].append(t1 - t0)
plt.legend()
plt.tight_layout()

The total times for each method when using a new Baseline object each call
and when reusing the same Baseline object are plotted below, as well as
the relative time reduction by reusing the same Baseline object. Note
that time reductions less than +/-5% can be considered as irrelevant.
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
for i, (method, _) in enumerate(methods):
reuse = sum(timings[f'{method}_True'])
new = sum(timings[f'{method}_False'])
if i == 0:
new_label = 'New'
reuse_label = 'Reuse'
else:
new_label = ''
reuse_label = ''
plt.plot()
ax1.bar(i - 0.2, new, width=0.4, label=new_label, color='c')
ax1.bar(i + 0.2, reuse, width=0.4, label=reuse_label, color='m')
speedup = 100 * (new - reuse) / new
bar = ax2.bar(i, speedup)
ax2.bar_label(bar, fmt='{:.1f}%')
ax1.legend()
ax2.set_xticks(np.arange(len(methods)), [method for method, _ in methods], rotation=15)
ax2.set_ylim(ax2.get_ylim() + np.array([-5, 5])) # add space for the bar labels
ax1.set_ylabel('Total time (s)')
ax2.set_ylabel('Relative Time Reduction (%)')
fig.tight_layout()
plt.show()

As expected, the polynomial and spline methods see a significant time reduction by reusing
the same Baseline object to fit the entire dataset while the other methods see no difference.
Note that these results are a generalization; algorithms that are more computationally intensive
will see less of a benefit from reuse since less time is from the setup and more from the actual
algorithm.
Total running time of the script: (0 minutes 15.032 seconds)