Kennard–Stone
The Kennard–Stone algorithm (Kennard & Stone, 1969) — also known as CADEX — selects a training set that covers the feature space as uniformly as possible. It is the most widely used rational splitting method for spectroscopic and tabular calibration data.
How it works
- Compute all pairwise distances between samples.
- Select the two samples with the largest mutual distance as the first two training points.
- Iteratively select the sample that has the largest minimum distance to any already-selected training point (the "maximin" criterion).
- Continue until the desired training-set size is reached.
The test set is whatever was not selected.
The result is a training set that spans the convex hull of the data — training points are spread throughout the feature space rather than clustered in any region.
When to use it
- Small to medium datasets (up to a few thousand samples) where random splits could, by chance, concentrate training and test points in the same region.
- Calibration and chemometrics — Kennard–Stone is the de facto standard for NIR, Raman, and other spectroscopic calibration splits.
- Regression when you want to avoid interpolation bias: test points near training points look easy even for a bad model.
- No target variable is needed — the split is based on X only. If you have y and want to cover both, use
SPXYSplitinstead.
Usage
using DataSplits
# Default Euclidean metric.
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)
# Custom metric.
using Distances
res = partition(X, KennardStoneSplit(Cityblock()); train = 70, test = 30)
# Large dataset — no full matrix in memory, but ~6× slower.
res = partition(X, LazyKennardStoneSplit(); train = 0.8, test = 0.2)
# Train / validation / test.
res = partition(X, KennardStoneSplit(), KennardStoneSplit();
train = 70, validation = 10, test = 20)
X_tr, X_val, X_te = splitdata(res, X)Eager vs. lazy
KennardStoneSplit | LazyKennardStoneSplit | |
|---|---|---|
| Distance matrix | Precomputed once | Computed on-the-fly each step |
| Peak memory | O(N²) — 8 MiB at N=1000 | O(N) — no full matrix |
| Speed (N=1000) | 7.3 ms | 43 ms (~6× slower) |
| Use when | Dataset fits in memory | N×N matrix does not fit in RAM |
Limitations
- Quadratic memory for the eager variant — the N×N distance matrix is ~8 MiB at N=1000 and ~200 MiB at N=5000. Use
LazyKennardStoneSplitwhen this does not fit in RAM, accepting ~6× slower runtime. - Deterministic — there is no randomness, so you cannot average over multiple Kennard–Stone splits. Use
MoraisLimaMartinSplitif you need stochastic variants. - Feature scale matters — if features have very different ranges the Euclidean metric will be dominated by high-variance features. Standardise X before splitting, or use a scaled metric (e.g.
WeightedEuclidean).
API reference
References
Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11(1), 137–148. https://doi.org/10.1080/00401706.1969.10490666.