Improving Performance

pybaselines was designed for performant single-threaded, single-process usage. This page gives tips for improving performance when fitting multiple datasets.

When fitting multiple datasets that share the same independent variables, it is more efficient to reuse the same Baseline object rather than creating a new one for each method call, since much of the setup only needs to be done once and can otherwise be reused, as sketched below.
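A minimal sketch of this reuse, using the arpls() method (any other method works the same way); x and dataset are placeholders to fill in:

import numpy as np

from pybaselines import Baseline


x = ...  # the x-values shared by all of the data
dataset = ...  # the total data, with shape (number of datasets, number of data points)

# create the Baseline object once; its internal setup is then reused
# by every subsequent method call
baseline_fitter = Baseline(x)
baselines = np.empty_like(dataset)
for i, data in enumerate(dataset):
    # each method returns the fit baseline and a dictionary of parameters
    baselines[i] = baseline_fitter.arpls(data)[0]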

For methods that require a half_window parameter, such as the morphological and smoothing algorithms, the half_window is estimated using the optimize_window() function if no value is given, which can significantly increase computation time when fitting multiple datasets. If all data have similar peak widths, it is much faster to either specify the half_window value directly or to call optimize_window() on a single set of data and reuse the output half_window for all subsequent baseline fits, as sketched below.
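For example, a sketch using the morphological mor() method, reusing the baseline_fitter, dataset, and baselines placeholders from the sketch above and assuming the first set of data is representative of the rest:

from pybaselines.utils import optimize_window

# estimate the half-window once from a representative set of data
half_window = optimize_window(dataset[0])
for i, data in enumerate(dataset):
    baselines[i] = baseline_fitter.mor(data, half_window=half_window)[0]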

For fitting datasets that are quite large (more than ~5,000 individual spectra/diffractograms), users can opt to use multiprocessing or, potentially, threading to reduce the computation time. Both approaches are addressed below.

Parallel Processing

Multiprocessing through the standard library multiprocessing module or third-party libraries works well with pybaselines. A simple example is shown below:

from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np

from pybaselines import Baseline


x = ...  # the x-values for the data
dataset = ...  # the total data, with shape (number of datasets, number of data points)
kwargs = {...}  # any keyword arguments to pass to the method

baseline_fitter = Baseline(x)
# bind any needed keyword arguments to the method
partial_func = partial(baseline_fitter.arpls, **kwargs)
baselines = np.empty_like(dataset)
with ProcessPoolExecutor() as pool:
    for i, (baseline, params) in enumerate(pool.map(partial_func, dataset)):
        baselines[i] = baseline

In pybaselines versions earlier than 1.2.0, the loess() method could cause issues when used with multiprocessing on POSIX systems, since loess spawned its own internal threads, which conflicted with the fork method of starting processes (the default start method on POSIX systems prior to Python 3.14). To work around this, explicitly set the process start method to spawn when using loess. The above example would be modified like so:

from multiprocessing import get_context

with ProcessPoolExecutor(mp_context=get_context('spawn')) as pool:
    ...  # do the processing

Threading

Starting with version 1.2.0, pybaselines has experimental support for the free-threaded build of CPython (see https://py-free-threading.github.io/ for more information), which allows using multithreading through the standard library threading module to decrease computation time. On CPython versions earlier than 3.13, or on non-free-threaded CPython builds, multithreading is not recommended with pybaselines since most operations within pybaselines do not release the GIL.
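To check whether a given interpreter could benefit, one option is to query the build configuration; a small sketch (note that sys._is_gil_enabled() only exists on Python 3.13 and newer):

import sys
import sysconfig

# True if this CPython build was compiled with free-threading support
free_threaded_build = bool(sysconfig.get_config_var('Py_GIL_DISABLED'))
# True if the GIL is currently active in this process; always True on
# builds without free-threading support
gil_enabled = getattr(sys, '_is_gil_enabled', lambda: True)()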

If using pybaselines version 1.2.0 or later, Baseline and Baseline2D objects are thread-safe, so the same object can be used for all threads. An example use case is shown below.

from concurrent.futures import ThreadPoolExecutor
from functools import partial

import numpy as np

from pybaselines import Baseline


x = ...  # the x-values for the data
dataset = ...  # the total data, with shape (number of datasets, number of data points)
kwargs = {...}  # any keyword arguments to pass to the method

baseline_fitter = Baseline(x)
# bind any needed keyword arguments to the method
partial_func = partial(baseline_fitter.arpls, **kwargs)
baselines = np.empty_like(dataset)
with ThreadPoolExecutor() as pool:
    for i, (baseline, params) in enumerate(pool.map(partial_func, dataset)):
        baselines[i] = baseline

Note that thread-safety is only guaranteed if non-data inputs (e.g. lam, poly_order, or half_window) are the same for all method calls. Otherwise, race conditions are likely (and threading is likely not a good choice in the first place).

In pybaselines versions earlier than 1.2.0, several methods of Baseline and Baseline2D were not thread-safe, so the proper way to use multithreading is to create a new Baseline or Baseline2D object for each method call, as shown below.

def func(x, baseline_method, data, **kwargs):
    """Helper that creates a new Baseline object for each method call."""
    return getattr(Baseline(x), baseline_method)(data, **kwargs)

method = 'arpls'  # a string designating the method to use
partial_func = partial(func, x, method, **kwargs)
baselines = np.empty_like(dataset)
with ThreadPoolExecutor() as pool:
    for i, (baseline, params) in enumerate(pool.map(partial_func, dataset)):
        baselines[i] = baseline