SPXY Split

The SPXY algorithm (sample set partitioning based on joint X-y distances) extends Kennard–Stone by using both the feature matrix (X) and the target vector (y) when splitting data. It builds a joint distance matrix as the sum of the normalized pairwise distance matrices of X and y, then applies Kennard–Stone maximin selection to it. The resulting training set covers both the predictor space and the range of the response, which matters most for regression tasks where the target distribution affects model quality.

MDKSSplit(frac): Alias for SPXYSplit(frac; metric=Mahalanobis()) (SPXY algorithm using Mahalanobis distance).

How it works

Compute the normalized pairwise distance matrix for X (features).
Compute the normalized pairwise distance matrix for y (target).
Add the two matrices to form a joint distance matrix.
Apply the Kennard–Stone maximin selection on the joint distance matrix to select a representative training set.
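The four steps above can be sketched in plain Julia. This is an illustrative sketch only, not the DataSplits implementation: the names pairwise_rows, normalize_max and spxy_sketch are hypothetical, the metric is fixed to Euclidean, and "normalized" is assumed to mean division by the largest pairwise distance, which the package may handle differently.

```julia
# Pairwise Euclidean distances between the rows of a matrix (samples × features).
function pairwise_rows(A::AbstractMatrix)
    n = size(A, 1)
    D = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        d = sqrt(sum(abs2, A[i, :] .- A[j, :]))
        D[i, j] = d
        D[j, i] = d
    end
    return D
end

# Assumed normalization: divide by the largest entry so that X and y
# contribute on the same scale. An all-zero matrix is returned unchanged.
normalize_max(D) = (m = maximum(D); m > 0 ? D ./ m : D)

# Hypothetical helper: returns (train_indices, test_indices).
function spxy_sketch(X::AbstractMatrix, y::AbstractVector, frac::Real)
    n = size(X, 1)
    n >= 2 || throw(ArgumentError("need at least two samples"))
    length(y) == n || throw(ArgumentError("X and y must have the same number of samples"))
    n_train = clamp(round(Int, frac * n), 2, n)

    # Steps 1-3: joint distance = normalized X distances + normalized y distances.
    Dx = normalize_max(pairwise_rows(X))
    Dy = normalize_max(pairwise_rows(reshape(float.(y), :, 1)))
    D = Dx .+ Dy

    # Step 4: Kennard–Stone maximin, seeded with the two most distant samples.
    i, j = Tuple(argmax(D))
    selected = [i, j]
    remaining = setdiff(1:n, selected)

    while length(selected) < n_train
        # Distance from each candidate to its nearest already-selected sample.
        nearest = [minimum(D[r, selected]) for r in remaining]
        # Pick the candidate whose nearest selected sample is farthest away.
        k = argmax(nearest)
        push!(selected, remaining[k])
        deleteat!(remaining, k)
    end
    return selected, remaining
end

# Toy data: four samples on a line, with the target equal to the first feature.
X = [0.0 0.0; 1.0 0.0; 2.0 0.0; 10.0 0.0]
y = [0.0, 1.0, 2.0, 10.0]
train_idx, test_idx = spxy_sketch(X, y, 0.75)
```

On this toy data the two extreme samples (1 and 4) seed the training set, and sample 3 is added next because it lies farther from its nearest selected neighbour than sample 2 does, so sample 2 ends up in the test set.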

Usage

using DataSplits, Distances
train, test = split(X, y, SPXYSplit(0.7; metric=Cityblock()))

X: Feature matrix (samples × features)
y: Target vector (length matches number of samples)
0.7: Fraction of samples to use for training
metric: Distance metric applied to both X and y (default: Euclidean())

Options

SPXYSplit(frac; metric=Euclidean()): frac is the fraction of samples assigned to the training set (between 0 and 1); metric is the distance metric applied to both X and y (default: Euclidean()).
MDKSSplit(frac): Alias for SPXYSplit(frac; metric=Mahalanobis()).

Notes

This algorithm is best suited to regression, where the target is continuous.
For classification, you may need to encode the target numerically so that distances in y are meaningful.
Call split((X, y), strategy) or split(X, y, strategy); split(X, strategy) will error because SPXY requires the target y.

Reference