Distance-Based Splitting

Distance-based strategies select the training set so that it covers the feature space as uniformly as possible. The key insight is that predicting a test point near a training point is an interpolation problem, which is easy for the model, whereas predicting a test point far from all training points is an extrapolation problem, which is what you actually want to measure.
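To make the interpolation/extrapolation distinction concrete, the distance from each test point to its nearest training point can be computed directly. The sketch below is illustrative only: `nearest_train_distance` is a hypothetical helper (not a DataSplits function), it uses Euclidean distance, and it assumes samples are stored as rows.

```julia
using LinearAlgebra  # stdlib, provides norm

# Hypothetical helper: Euclidean distance from each test sample to its
# nearest training sample. Assumes rows are samples (sketch assumption).
function nearest_train_distance(X_train::AbstractMatrix, X_test::AbstractMatrix)
    return [minimum(norm(t .- r) for r in eachrow(X_train)) for t in eachrow(X_test)]
end

X_train = [0.0 0.0; 1.0 0.0; 0.0 1.0]
X_test  = [0.1 0.1; 5.0 5.0]

d = nearest_train_distance(X_train, X_test)
# d[1] is small (the point sits next to a training sample);
# d[2] is large (the point lies far outside the training region).
```

A small value means the model is essentially interpolating at that test point; a large value means it is being asked to extrapolate.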

These strategies are most valuable for small-to-medium datasets (up to a few thousand samples) where a random split could, by chance, cluster training and test points together and produce an overly optimistic error estimate.

The family at a glance

| Strategy | Covers | Peak memory | Speed (N=1000) |
|---|---|---|---|
| KennardStoneSplit | X only | O(N²) (full matrix) | 7.3 ms (deterministic) |
| LazyKennardStoneSplit | X only | O(N) (no full matrix) | 43 ms (~6× slower) |
| SPXYSplit | X + y | O(N²) | 14 ms (deterministic) |
| LazySPXYSplit | X + y | O(N) | 445 ms (~32× slower) |
| MDKSSplit | X (Mahalanobis) + y | O(N²) | 12 ms (deterministic) |
| LazyMDKSSplit | X (Mahalanobis) + y | O(N) | 708 ms (~57× slower) |
| OptiSimSplit | X | O(N²) | 59 ms (tunable via max_subsample_size) |
| LazyOptiSimSplit | X | O(N) | 304 ms (~5× slower) |
| MinimumDissimilaritySplit | X | O(N²) | 11 ms (fastest greedy) |
| LazyMinimumDissimilaritySplit | X | O(N) | 34 ms (~3× slower) |
| MaximumDissimilaritySplit | X | O(N²) | 1.3 s (all candidates) |
| LazyMaximumDissimilaritySplit | X | O(N) | 14.8 s (~11× slower) |
| MoraisLimaMartinSplit | X | O(N²) | 6.7 ms (KS + random swap) |

For deep dives see: Kennard–Stone, SPXY, OptiSim, Morais–Lima–Martin.

When to use which

Use Kennard–Stone when you have only features (no target) and you want the simplest, most established diversity-based split. It is widely used for building calibration sets from spectroscopic and tabular data.
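The selection rule behind Kennard–Stone can be sketched in a few lines. This is a minimal illustration of the textbook max-min procedure (seed with the two most distant samples, then repeatedly add the sample farthest from everything already selected), not the DataSplits implementation. The function name `kennard_stone_sketch`, the Euclidean metric, and the rows-as-samples layout are assumptions of the sketch.

```julia
using LinearAlgebra  # stdlib, provides norm

# Illustrative Kennard–Stone selection on a precomputed distance matrix.
# Hypothetical helper; rows of X are samples (sketch assumption).
function kennard_stone_sketch(X::AbstractMatrix, n_train::Integer)
    n = size(X, 1)
    2 <= n_train <= n || throw(ArgumentError("need 2 <= n_train <= number of samples"))

    # Full N×N Euclidean distance matrix: this is the O(N²) memory cost.
    D = [norm(X[i, :] .- X[j, :]) for i in 1:n, j in 1:n]

    # Seed with the two mutually most distant samples.
    pair = argmax(D)                      # CartesianIndex of the largest entry
    selected = [pair[1], pair[2]]

    # min_dist[k] = distance from sample k to its nearest selected sample.
    min_dist = min.(D[:, selected[1]], D[:, selected[2]])

    while length(selected) < n_train
        min_dist[selected] .= -Inf        # never pick a sample twice
        next = argmax(min_dist)           # farthest from the selected set
        push!(selected, next)
        min_dist .= min.(min_dist, D[:, next])
    end
    return selected                       # indices of the training samples
end

# Five samples on a line at 0, 1, 2, 5 and 10.
X = reshape([0.0, 1.0, 2.0, 5.0, 10.0], :, 1)
train_idx = kennard_stone_sketch(X, 3)
```

On this toy data the two extremes (0 and 10) are picked first, followed by the sample at 5, which is the remaining point farthest from both.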

Use SPXY or MDKS when you have both features and a regression target and you care about covering the response range, not just the feature space. SPXY uses Euclidean distance for both; MDKS uses Mahalanobis for features (accounting for correlations) and Euclidean for the target.
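For reference, the distances involved can be written out explicitly. The joint distance below is SPXY in its commonly published form, where the feature and target distances are each divided by their maximum over all pairs so that both contribute on the same scale. Whether DataSplits normalises in exactly this way, and whether MDKS combines its two terms in the same manner, is an assumption here; the text above only states which metric is used for which part. Σ denotes the covariance matrix of the features.

```latex
\begin{align*}
d_x(i,j) &= \lVert x_i - x_j \rVert_2, \qquad d_y(i,j) = \lvert y_i - y_j \rvert \\
d_{xy}(i,j) &= \frac{d_x(i,j)}{\max_{p,q} d_x(p,q)} + \frac{d_y(i,j)}{\max_{p,q} d_y(p,q)} \\
d_M(i,j) &= \sqrt{(x_i - x_j)^{\top} \, \Sigma^{-1} \, (x_i - x_j)}
\end{align*}
```

Replacing d_x with d_M in the joint distance gives a Mahalanobis-based variant: strongly correlated features then no longer count twice towards the distance.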

Use OptiSim when Kennard–Stone is too slow or you want a tunable trade-off between speed and diversity. max_subsample_size controls how many candidates are evaluated at each step; smaller values are faster but greedier.

Use MinimumDissimilaritySplit as the fastest greedy option; it is OptiSim with max_subsample_size = 1.

Use MaximumDissimilaritySplit when you want the global maximum spread and can afford the O(N²) cost per step. Note that it greedily selects outliers; remove them first if that is undesirable.
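The difference between these three strategies comes down to how many candidates are examined at each step. The sketch below illustrates that trade-off with a simplified "draw k candidates, keep the farthest" loop. It is not the DataSplits implementation and it omits the dissimilarity threshold that OptiSim-style methods normally apply when drawing candidates. The function name `subsample_maxmin_sketch`, the parameter `k` (standing in for max_subsample_size), and the rows-as-samples layout are all assumptions.

```julia
using LinearAlgebra  # stdlib, provides norm
using Random         # stdlib, provides shuffle and seedable RNGs

# Hypothetical helper: at each step, draw up to `k` random candidates and keep
# the one farthest from the already selected set. Rows of X are samples.
function subsample_maxmin_sketch(X::AbstractMatrix, n_train::Integer, k::Integer;
                                 rng::AbstractRNG = Random.default_rng())
    n = size(X, 1)
    1 <= n_train <= n || throw(ArgumentError("need 1 <= n_train <= number of samples"))
    k >= 1 || throw(ArgumentError("k must be at least 1"))

    selected = [rand(rng, 1:n)]               # random starting sample
    remaining = setdiff(1:n, selected)

    # Distance from sample i to its nearest selected sample.
    nearest(i) = minimum(norm(X[i, :] .- X[s, :]) for s in selected)

    while length(selected) < n_train
        # Small k: few distance evaluations per step, less spread.
        # k >= length(remaining): every candidate is scored (maximum spread).
        candidates = shuffle(rng, remaining)[1:min(k, length(remaining))]
        best = candidates[argmax([nearest(c) for c in candidates])]
        push!(selected, best)
        filter!(!=(best), remaining)
    end
    return selected
end

X = reshape([0.0, 1.0, 2.0, 5.0, 10.0], :, 1)
quick    = subsample_maxmin_sketch(X, 3, 1; rng = MersenneTwister(1))  # one candidate per step
thorough = subsample_maxmin_sketch(X, 3, 5; rng = MersenneTwister(1))  # all candidates per step
```

With `k = 1` each step costs a single candidate evaluation; with `k` equal to the number of remaining samples each step scores every candidate, which is where the much higher running time of the exhaustive variant comes from.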

Use MoraisLimaMartinSplit when you want the coverage of Kennard–Stone but with a random perturbation for ensemble diversity.

Use the Lazy variants only when the full N×N distance matrix does not fit in RAM. Assuming dense Float64 storage, the matrix takes 8N² bytes: about 0.2 GB at N = 5 000, about 20 GB at N = 50 000, and all of 32 GiB at N ≈ 65 000. The Lazy variants never materialise the matrix; they keep only O(N) values in memory at peak and recompute distances on the fly, which makes them 3–57× slower. Note that profiling tools report higher total allocated bytes for the lazy variants because each on-the-fly distance computation allocates a small transient object; this does not reflect peak resident memory.
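The idea behind a lazy variant can be sketched as follows: apply the max-min selection rule, but keep only one length-N vector of nearest-selected distances and recompute every pairwise distance when it is needed. This is an illustration of the memory/time trade-off, not the DataSplits implementation. `lazy_kennard_stone_sketch` is a hypothetical name, and the Euclidean metric and rows-as-samples layout are assumptions.

```julia
using LinearAlgebra  # stdlib, provides norm

# Hypothetical helper: Kennard–Stone-style selection without an N×N matrix.
function lazy_kennard_stone_sketch(X::AbstractMatrix, n_train::Integer)
    n = size(X, 1)
    2 <= n_train <= n || throw(ArgumentError("need 2 <= n_train <= number of samples"))

    dist(i, j) = norm(X[i, :] .- X[j, :])   # recomputed on every call

    # Find the most distant pair while storing only the current best.
    best, a, b = -Inf, 1, 2
    for i in 1:n-1, j in i+1:n
        d = dist(i, j)
        if d > best
            best, a, b = d, i, j
        end
    end
    selected = [a, b]

    # The only O(N) state: distance of each sample to its nearest selected sample.
    min_dist = [min(dist(k, a), dist(k, b)) for k in 1:n]

    while length(selected) < n_train
        min_dist[selected] .= -Inf           # never pick a sample twice
        next = argmax(min_dist)
        push!(selected, next)
        for k in 1:n                         # refresh against the new sample only
            min_dist[k] = min(min_dist[k], dist(k, next))
        end
    end
    return selected
end

X = reshape([0.0, 1.0, 2.0, 5.0, 10.0], :, 1)
train_idx_lazy = lazy_kennard_stone_sketch(X, 3)
```

In this sketch the row slices `X[i, :]` allocate a small temporary on every call to `dist`, so total allocated bytes grow with the number of distance evaluations even though peak memory stays at O(N).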

Minimal example

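```julia
using DataSplits

# X is the feature matrix to split: 80 % train, 20 % test.
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
# Extract the train and test portions of X from the partition result.
X_train, X_test = splitdata(res, X)
```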

All strategies accept a custom distance metric as the first constructor argument:

```julia
using Distances

# Same split as above, but with Manhattan (city-block) distance.
res = partition(X, KennardStoneSplit(Cityblock()); train = 0.8, test = 0.2)
```

For strategies that also use the target variable, pass target = y:

```julia
res = partition(X, SPXYSplit(); target = y, train = 0.8, test = 0.2)
```

See Getting Started for cohort sizes, three-cohort splits, DataFrames, and materialising results.