DataSplits

DataSplits is a Julia library of rational train/test splitting algorithms, offering several strategies for splitting datasets. In many applications, random selection is not appropriate and can lead to overestimated model performance.

| Strategy | Purpose | Complexity |
|---|---|---|
| KennardStoneSplit | Maximin split on X | O(N²) time, O(N²) memory |
| LazyKennardStoneSplit | Same, streamed | O(N²) time, O(N) memory |
| SPXYSplit | Joint X–y maximin (SPXY) | O(N²) time, O(N²) memory |
| OptiSimSplit | Optimisable dissimilarity-based splitting | O(N²) time, O(N²) memory |
| MinimumDissimilaritySplit | Greedy dissimilarity with one candidate | O(N²) time, O(N²) memory |
| MaximumDissimilaritySplit | Greedy dissimilarity with full pool | O(N²) time, O(N²) memory |
| ClusterShuffleSplit | Cluster-based shuffle split | O(N²) time, O(N²) memory |
| ClusterStratifiedSplit | Cluster-based stratified split (equal, proportional, Neyman). Selects a quota per cluster, then splits into train/test according to the user-supplied fraction. | O(N²) time, O(N²) memory |
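
To make the "maximin" idea behind the Kennard-Stone family concrete, here is a minimal sketch of the textbook algorithm in plain Julia. It is not the DataSplits implementation: the function name `kennard_stone_sketch` is hypothetical, and the sketch assumes a standard 1-based `Matrix` with one sample per column, Euclidean distance, and at least two distinct samples. It precomputes the full distance matrix, which is where the O(N²) memory cost comes from.

```julia
using LinearAlgebra  # standard library, provides `norm`

# Hypothetical helper, NOT part of the DataSplits API.
# X holds one sample per column; returns (train, test) column positions.
function kennard_stone_sketch(X::Matrix{<:Real}, frac::Real)
    n = size(X, 2)
    n_train = clamp(round(Int, frac * n), 2, n)  # assumes n >= 2

    # Full pairwise Euclidean distance matrix: O(N²) time and memory.
    D = [norm(X[:, a] - X[:, b]) for a in 1:n, b in 1:n]

    # Seed the training set with the two most distant samples.
    i, j = Tuple(argmax(D))
    selected = [i, j]

    # mindist[k] = distance from sample k to its nearest selected sample.
    mindist = min.(D[:, i], D[:, j])
    mindist[[i, j]] .= -Inf  # already selected, never pick again

    while length(selected) < n_train
        k = argmax(mindist)                # maximin step: farthest from the selected set
        push!(selected, k)
        mindist .= min.(mindist, D[:, k])  # update nearest-selected distances
        mindist[k] = -Inf
    end

    return selected, setdiff(1:n, selected)
end

# Five points on a line; the isolated point at 10.0 is pulled into the training set.
X = [0.0 1.0 2.0 3.0 10.0]
train, test = kennard_stone_sketch(X, 0.6)
```

Each new training sample is the one whose nearest already-selected neighbour is farthest away, so the training set covers the extremes of the data and the test set falls inside the covered region.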

All splitting strategies in DataSplits are designed to work with any AbstractArray, including arrays with non-standard axes. They do so by mapping user indices to internal positions, which keeps the splits correct for arrays with custom indexing.
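
The index mapping described above can be sketched with Base functions only. The helper names `to_user_index` and `to_position` are hypothetical and are not taken from DataSplits; the sketch assumes an algorithm that works on positions `1:N` along one dimension and converts them back to the array's own indices at the end.

```julia
# Hypothetical helpers, NOT part of the DataSplits API.
# Array axes are unit ranges, so position <-> index is a constant shift.

# Internal 1-based position along `dim` -> the array's own index.
to_user_index(A::AbstractArray, dim::Integer, pos::Integer) =
    firstindex(A, dim) + pos - 1

# The array's own index along `dim` -> internal 1-based position.
to_position(A::AbstractArray, dim::Integer, idx::Integer) =
    idx - firstindex(A, dim) + 1

X = rand(3, 5)                  # ordinary Matrix, columns indexed 1:5
positions = [1, 4, 5]           # e.g. columns chosen by a splitting algorithm
user_cols = [to_user_index(X, 2, p) for p in positions]

# For a standard array the shift is zero. For an array whose columns are
# indexed 0:4 (for example one built with OffsetArrays.jl, not loaded here),
# position 1 would map to index 0.
```

Keeping the algorithm on plain positions and translating only at the boundary means the selection logic never needs to know how the input is indexed.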

julia> using DataSplits, Distances

julia> train, test = split(X, KennardStoneSplit(0.8))
julia> train, test = split((X, y), SPXYSplit(0.7; metric = Cityblock()))
julia> train, test = split(X, ClusterStratifiedSplit(clusters, :equal; n=4, frac=0.7))