Kennard–Stone

The Kennard–Stone algorithm (Kennard & Stone, 1969) — also known as CADEX — selects a training set that covers the feature space as uniformly as possible. It is the most widely used rational splitting method for spectroscopic and tabular calibration data.

How it works

Compute all pairwise distances between samples.
Select the two samples with the largest mutual distance as the first two training points.
Iteratively select the sample that has the largest minimum distance to any already-selected training point (the "maximin" criterion).
Continue until the desired training-set size is reached.

The test set is whatever was not selected.

The result is a training set spread throughout the feature space — training points cover the full extent of the data distribution rather than clustering in any region.

When to use it

Small to medium datasets (up to a few thousand samples) where random splits could, by chance, concentrate training and test points in the same region.
Calibration and chemometrics — Kennard–Stone is the de facto standard for NIR, Raman, and other spectroscopic calibration splits.
Regression when you want to avoid interpolation bias: test points near training points look easy even for a bad model.
No target variable is needed — the split is based on X only. If you have y and want to cover both, use SPXYSplit instead.

Usage

using DataSplits

# Default Euclidean metric.
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)

# Custom metric.
using Distances
res = partition(X, KennardStoneSplit(Cityblock()); train = 70, test = 30)

# Large dataset — no full matrix in memory, but ~6× slower.
res = partition(X, LazyKennardStoneSplit(); train = 0.8, test = 0.2)

# Train / validation / test.
res = partition(X, KennardStoneSplit(), KennardStoneSplit();
                train = 70, validation = 10, test = 20)
X_tr, X_val, X_te = splitdata(res, X)

Eager vs. lazy

	`KennardStoneSplit`	`LazyKennardStoneSplit`
Distance matrix	Precomputed once	Computed on-the-fly each step
Peak memory	O(N²) — 8 MiB at N=1000	O(N) — no full matrix
Speed (N=1000)	7.3 ms	43 ms (~6× slower)
Use when	Dataset fits in memory	N×N matrix does not fit in RAM

Limitations

Quadratic memory for the eager variant — the N×N distance matrix is ~8 MiB at N=1000 and ~200 MiB at N=5000. Use LazyKennardStoneSplit when this does not fit in RAM, accepting ~6× slower runtime.
Deterministic — there is no randomness, so you cannot average over multiple Kennard–Stone splits. Use MoraisLimaMartinSplit if you need stochastic variants.
Feature scale matters — if features have very different ranges the Euclidean metric will be dominated by high-variance features. Standardise X before splitting, or use a scaled metric (e.g. WeightedEuclidean).

API reference

References

Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11(1), 137–148. https://doi.org/10.1080/00401706.1969.10490666.