lam vs number of knots
This example examines the effect of lam when fitting a penalized spline baseline while varying both the number of knots for the spline, num_knots, and the number of data points. The function mixture_model() is used for all calculations.
Note that the exact optimal lam values reported in this example are not of significant use on their own since they depend on many other factors, such as the baseline curvature, noise, and peaks; however, the examined trends can simplify the process of selecting lam values when fitting new datasets.
The baseline for this example is an exponentially decaying baseline, shown below. Other baseline types could be examined, similar to the Whittaker lam vs data size example, which should give similar results.
plt.plot(make_data(1000, bkg_type='exponential')[0])
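The make_data function is a helper defined in the documentation's shared example code and is not shown here. A minimal stand-in is sketched below, assuming Gaussian peaks on top of the background plus normally distributed noise; the peak positions, widths, and noise level are illustrative guesses, not the helper's actual values.

```python
import numpy as np

def make_data(num_points, bkg_type='exponential'):
    """Minimal stand-in for the example helper: peaks + background + noise."""
    x = np.linspace(0, 1000, num_points)
    if bkg_type == 'exponential':
        background = 5 + 15 * np.exp(-x / 400)
    else:
        raise NotImplementedError(f'unhandled bkg_type: {bkg_type}')
    # two Gaussian peaks; positions and widths are arbitrary choices
    peaks = (
        100 * np.exp(-0.5 * ((x - 300) / 15)**2)
        + 60 * np.exp(-0.5 * ((x - 600) / 30)**2)
    )
    noise = np.random.default_rng(0).normal(0, 0.5, num_points)
    return peaks + background + noise, background
```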
The number of knots will vary from 20 to 1000 on a logarithmic scale. For each number of knots, the optimal lam value will be calculated for data sizes ranging from 500 to 20000 points, also on a logarithmic scale.
show_plots = False # for debugging
num_knots = np.logspace(np.log10(20), np.log10(1000.1), 5, dtype=int)
num_points = np.logspace(np.log10(500.1), np.log10(20000), 6, dtype=int)
symbols = cycle(['o', 's', 'd', 'h', '^', 'x'])
best_lams = np.empty((len(num_knots), len(num_points)))
for i, num_knot in enumerate(num_knots):
    min_lam = 0
    for j, num_x in enumerate(num_points):
        func = partial(Baseline().mixture_model, num_knots=num_knot, diff_order=2)
        y, baseline = make_data(num_x, bkg_type='exponential')
        # use a slightly lower tolerance to speed up the calculation
        min_lam = optimize_lam(y, baseline, func, min_lam, tol=1e-2, max_iter=50)
        best_lams[i, j] = min_lam
        if show_plots:
            plt.figure(num=num_x)
            if i == 0:
                plt.plot(y)
                plt.plot(baseline)
            plt.plot(func(y, lam=10**min_lam)[0], '--')
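Like make_data, optimize_lam comes from the shared example helpers. A minimal sketch of what it may do is shown below, assuming it searches over log10(lam) for the value minimizing the root-mean-square error between the fitted and known baselines; the use of scipy.optimize.fmin and the warm-start behavior are assumptions, not the helper's confirmed implementation.

```python
import numpy as np
from scipy.optimize import fmin

def optimize_lam(y, known_baseline, func, previous_min=0, tol=1e-3, max_iter=100):
    """Return the log10(lam) that best recovers the known baseline."""
    def rmse(log_lam):
        fit_baseline = func(y, lam=10**log_lam[0])[0]
        return np.sqrt(np.mean((fit_baseline - known_baseline)**2))
    # warm-start from the previous optimum, since the optimal lam tends to
    # grow as the problem size increases
    best_log_lam = fmin(rmse, x0=[max(previous_min, 1)], xtol=tol, ftol=tol,
                        maxiter=max_iter, disp=False)
    return best_log_lam[0]

# toy check: the "fitter" returns a flat baseline equal to lam, and the true
# baseline is a constant 5, so the optimum should be near log10(5)
toy_fitter = lambda data, lam: (np.full_like(data, lam), {})
best = optimize_lam(np.ones(100), np.full(100, 5.0), toy_fitter)
```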
First, examine the relationship between the number of data points, N, and the optimal lam for fixed numbers of knots. The plots below show that the slopes of the best-fit lines are relatively small, indicating that the optimal lam increases only slightly as the number of points increases. The intercepts of the best-fit lines increase significantly as the number of knots increases, which will be investigated below.
print('Number of knots, intercept & slope of log(N) vs log(lam) fit')
print('-' * 60)
_, ax = plt.subplots()
legend = [[], []]
for i, num_knot in enumerate(num_knots):
    fit = np.polynomial.polynomial.Polynomial.fit(np.log10(num_points), best_lams[i], 1)
    coeffs = fit.convert().coef
    print(f'{num_knot:<6} {coeffs}')
    line = 10**fit(np.log10(num_points))
    handle_1 = ax.plot(num_points, line)[0]
    handle_2 = ax.plot(num_points, 10**best_lams[i], next(symbols))[0]
    legend[0].append((handle_1, handle_2))
    legend[1].append(f'num_knots={num_knot}')
ax.loglog()
ax.legend(*legend)
ax.set_xlabel('Input Array Size, N')
ax.set_ylabel('Optimal lam Value')
Number of knots, intercept & slope of log(N) vs log(lam) fit
------------------------------------------------------------
20 [-2.91574003 1.05213815]
53 [-1.57809698 1.0699548 ]
141 [-0.21574003 1.05213815]
376 [1.08425997 1.05213815]
1000 [2.32190302 1.0699548 ]
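The near-unity slopes indicate that the optimal lam scales roughly linearly with the number of points for a fixed number of knots. The fitted coefficients can therefore provide a starting lam when moving to a new data size; for example, using the num_knots=53 row from the output above:

```python
import numpy as np

# intercept and slope of the log10(N) vs log10(lam) fit for num_knots=53
intercept, slope = -1.57809698, 1.0699548
predictions = {n: 10**(intercept + slope * np.log10(n)) for n in (1000, 10000)}
for n, lam_estimate in predictions.items():
    print(f'N={n}: lam ~ {lam_estimate:.3g}')
```

Such an estimate is only a starting point for further refinement, since the actual optimum also depends on the baseline shape and noise.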
Now, examine the relationship between the number of knots and the optimal lam for fixed numbers of points. The plots below show that the slopes of the best-fit lines are much greater than the slopes from the lam versus N plots. The lam versus number of knots plots closely resemble the lam versus N plots from the Whittaker lam vs data size example, which makes sense since the number of data points for Whittaker smoothing is more analogous to the number of knots for penalized splines when considering their minimized linear equations.
print('Number of points, intercept & slope of log(number of knots) vs log(lam) fit')
print('-' * 80)
_, ax = plt.subplots()
legend = [[], []]
for i, num_x in enumerate(num_points):
    fit = np.polynomial.polynomial.Polynomial.fit(np.log10(num_knots), best_lams[:, i], 1)
    coeffs = fit.convert().coef
    print(f'{num_x:<6} {coeffs}')
    line = 10**fit(np.log10(num_knots))
    handle_1 = ax.plot(num_knots, line)[0]
    handle_2 = ax.plot(num_knots, 10**best_lams[:, i], next(symbols))[0]
    legend[0].append((handle_1, handle_2))
    legend[1].append(f'data size={num_x}')
ax.loglog()
ax.legend(*legend)
ax.set_xlabel('Number of Knots')
ax.set_ylabel('Optimal lam Value')
plt.show()
Number of points, intercept & slope of log(number of knots) vs log(lam) fit
--------------------------------------------------------------------------------
500 [-4.08862141 3.08316209]
1045 [-3.70970463 3.1301788 ]
2186 [-3.49916302 3.10667044]
4573 [-2.89916302 3.10667044]
9563 [-2.79916302 3.10667044]
20000 [-2.39916302 3.10667044]
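Since the slopes are all near 3.1, the optimal lam grows approximately with the cube of the number of knots, so doubling num_knots raises lam by nearly an order of magnitude (2**3.1 is about 8.6). A quick estimate using the data size 1045 row from the output above:

```python
import numpy as np

# intercept and slope of the log10(num_knots) vs log10(lam) fit for 1045 points
intercept, slope = -3.70970463, 3.1301788
estimates = {knots: 10**(intercept + slope * np.log10(knots)) for knots in (50, 100)}
for knots, lam_estimate in estimates.items():
    print(f'num_knots={knots}: lam ~ {lam_estimate:.3g}')
# the ratio between successive doublings of the knot count is 2**slope
```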
Total running time of the script: (0 minutes 14.285 seconds)