Kennard–Stone Split
The Kennard–Stone algorithm (also known as CADEX in some literature) is a deterministic method for selecting a representative training set from a dataset. Starting from the two most distant samples, it iteratively adds the sample whose distance (in feature space) to its nearest already-selected sample is largest. This spreads the training set across the full range of the feature space, which makes it especially useful for rational dataset splitting.
CADEX stands for Computer Aided Design of Experiments and is an alias for the Kennard–Stone algorithm in DataSplits.
How it works
- Compute the pairwise distance matrix for all samples.
- Select the two samples that are farthest apart as the initial training set.
- Iteratively add the sample that is farthest from the current training set (i.e., has the largest minimum distance to any selected sample).
- Continue until the desired number of training samples is reached.
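The steps above can be sketched in plain Julia as follows. This is a minimal illustration, not the DataSplits implementation: the function name `kennard_stone_indices` is hypothetical, the distance is hard-coded to Euclidean, the data is assumed to hold one sample per row, and the training size is given as a count rather than a fraction.

```julia
# Illustrative Kennard–Stone selection (hypothetical helper, not the DataSplits code).
# Assumes X holds one sample per row and contains at least two distinct samples.
function kennard_stone_indices(X::AbstractMatrix{<:Real}, n_train::Integer)
    n = size(X, 1)
    2 <= n_train <= n ||
        throw(ArgumentError("n_train must be between 2 and the number of samples"))

    # Step 1: full pairwise Euclidean distance matrix (O(n^2) memory).
    D = [sqrt(sum(abs2, X[i, :] .- X[j, :])) for i in 1:n, j in 1:n]

    # Step 2: seed the training set with the two samples that are farthest apart.
    seed = argmax(D)                      # CartesianIndex of the largest entry
    selected = [seed[1], seed[2]]

    # Distance from every sample to its nearest selected sample.
    mindist = min.(D[:, seed[1]], D[:, seed[2]])
    mindist[selected] .= -Inf             # selected samples can never be picked again

    # Steps 3-4: keep adding the sample with the largest minimum distance.
    while length(selected) < n_train
        next = argmax(mindist)
        push!(selected, next)
        mindist .= min.(mindist, D[:, next])
        mindist[next] = -Inf
    end

    return selected, setdiff(1:n, selected)   # (train indices, test indices)
end
```

For five points on a line at 0, 1, 10, 4 and 5, calling `kennard_stone_indices([0.0 0.0; 1.0 0.0; 10.0 0.0; 4.0 0.0; 5.0 0.0], 3)` picks the two extremes (rows 1 and 3) and then the point at 5 (row 5), leaving rows 2 and 4 for testing.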
Usage
using DataSplits, Distances
splitter = KennardStoneSplit(0.8)
train, test = split(X, splitter)
- X: Data matrix (samples × features)
- 0.8: Fraction of samples to use for training
Options
KennardStoneSplit(frac; metric=Euclidean()):
- frac is the fraction of samples to use for training (between 0 and 1).
- metric is the distance metric (default: Euclidean).
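As an illustration of the metric keyword, the call below swaps in `Cityblock()` (Manhattan distance) from Distances.jl in place of the default. It assumes the constructor signature and the `split` call shown on this page, plus the samples × features layout from the Usage section; the random toy matrix and the 0.7 fraction are arbitrary choices.

```julia
using DataSplits, Distances

# Toy data, assumed layout: one sample per row (samples × features).
X = rand(50, 3)

# Same constructor as above, with Manhattan distance instead of Euclidean.
splitter = KennardStoneSplit(0.7; metric=Cityblock())
train, test = split(X, splitter)
```

A different metric changes which samples count as "far apart", so the selected training set can differ from the Euclidean one on the same data.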
Implementation Note
- For very large datasets, use LazyKennardStone, which is a memory-efficient variant of this algorithm.
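How LazyKennardStone saves memory is not described here. One common approach, sketched below under that caveat, is to compute distances on demand and keep only a length-n vector of minimum distances instead of the full n × n matrix. The names `rowdist` and `lazy_kennard_stone_indices` are hypothetical, the metric is fixed to Euclidean, and samples are assumed to be rows.

```julia
# Hypothetical helper: Euclidean distance between rows i and j, computed on demand.
rowdist(X, i, j) = sqrt(sum(abs2(X[i, k] - X[j, k]) for k in axes(X, 2)))

# Illustrative low-memory Kennard–Stone selection (not the LazyKennardStone code).
# Assumes at least two distinct samples.
function lazy_kennard_stone_indices(X::AbstractMatrix{<:Real}, n_train::Integer)
    n = size(X, 1)
    2 <= n_train <= n ||
        throw(ArgumentError("n_train must be between 2 and the number of samples"))

    # Scan all pairs for the farthest one without storing any distances.
    best, a, b = -Inf, 1, 2
    for i in 1:n-1, j in i+1:n
        d = rowdist(X, i, j)
        if d > best
            best, a, b = d, i, j
        end
    end
    selected = [a, b]

    # Only this O(n) vector of nearest-selected distances is kept in memory.
    mindist = [min(rowdist(X, i, a), rowdist(X, i, b)) for i in 1:n]
    mindist[a] = mindist[b] = -Inf

    while length(selected) < n_train
        next = argmax(mindist)
        push!(selected, next)
        for i in 1:n
            mindist[i] = min(mindist[i], rowdist(X, i, next))
        end
        mindist[next] = -Inf              # never pick the same sample twice
    end

    return selected, setdiff(1:n, selected)   # (train indices, test indices)
end
```

The trade-off in this sketch is memory for recomputation: storage drops from O(n²) to O(n), while the number of distance evaluations stays quadratic because of the initial farthest-pair scan.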
See also
- SPXY Split: Jointly maximizes diversity in both features and target property.
Reference
Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148. https://doi.org/10.1080/00401706.1969.10490666.