Cluster Stratified Split
Cluster Stratified Split is a train/test splitting strategy that ensures each cluster (as determined by a clustering algorithm) is split into train and test sets according to a specified allocation method and user-defined train fraction.
Allocation Methods
- Equal allocation: Randomly selects a fixed number
n
of samples from each cluster, then splits those into train/test according to the user-specifiedfrac
(fraction for train set). The rest of the cluster is unused. - Proportional allocation: Uses all samples in each cluster, splits them into train/test according to the user-specified
frac
(fraction for train set). - Neyman allocation: Randomly selects a quota from each cluster (proportional to cluster size and mean standard deviation of features), then splits those into train/test according to the user-specified
frac
(fraction for train set). The rest of the cluster is unused.
Usage
using DataSplits, Clustering
# Assume you have a ClusteringResult `clusters` and a data matrix X
splitter = ClusterStratifiedSplit(clusters, :equal; n=4, frac=0.7)
train_idx, test_idx = split(X, splitter)
splitter = ClusterStratifiedSplit(clusters, :proportional; frac=0.7)
train_idx, test_idx = split(X, splitter)
splitter = ClusterStratifiedSplit(clusters, :neyman; n=4, frac=0.7)
train_idx, test_idx = split(X, splitter)
Arguments
clusters
: ClusteringResult (from Clustering.jl)allocation
::equal
,:proportional
, or:neyman
n
: Number of samples per cluster (for:equal
and:neyman
)frac
: Fraction of selected samples to use for train (rest go to test)
Notes
- If
n
is greater than the cluster size, all samples in the cluster are used. - For
:proportional
, all samples are always used. - For
frac=1.0
, all selected samples go to train; forfrac=0.0
, all go to test. - The split is performed per cluster; rounding is handled so that the train set always gets the larger share when the split is not exact.
See also: Clustering.jl