DataSplits.jl

DataSplits is a Julia library of train/test and cross-validation splitting strategies for cases where random selection misleads: small datasets, regression over continuous targets, grouped observations, time series, and molecular or geospatial data.

One entry point covers everything: partition.

Installation

using Pkg
Pkg.add("DataSplits")

Quick start

using DataSplits

# X is a features × samples matrix (see Conventions); y holds the targets.

# Diversity-based split: the training set covers the full feature space.
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)

# Cover features and target jointly (SPXY).
res = partition(X, SPXYSplit(); target = y, train = 80, test = 20)

# Train / validation / test in one call.
res = partition(X, RandomSplit(), KennardStoneSplit();
                train = 70, validation = 10, test = 20)
X_tr, X_val, X_te = splitdata(res, X)

# Group-aware k-fold: no patient, scaffold, or batch spans two folds.
# `model`, `fit!`, and `evaluate` are placeholders for your own training code.
cvs = partition(X, GroupKFold(5); groups = patient_ids)
for (X_tr, X_te) in splitview(cvs, X)
    fit!(model, X_tr)
    evaluate(model, X_te)
end
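
The guarantee in the group-aware example (no group spans two folds) is easy to check independently. The sketch below uses plain Julia only; `no_group_leakage` is a hypothetical helper, not part of DataSplits, and it assumes you have the train and test index vectors for a fold at hand, however your workflow exposes them.

```julia
# Hypothetical helper (not a DataSplits function): returns true when no group
# label appears on both the training and the test side of a split.
no_group_leakage(groups, train_idx, test_idx) =
    isdisjoint(groups[train_idx], groups[test_idx])

groups = ["a", "a", "b", "c", "c"]

# Groups "a" and "b" train, group "c" tests: the sides share no label.
clean = no_group_leakage(groups, [1, 2, 3], [4, 5])

# Group "a" lands on both sides, which is the leakage a group-aware split avoids.
leaky = no_group_leakage(groups, [1, 3], [2, 4])
```

A check like this is cheap enough to run as an assertion inside the cross-validation loop.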

Strategy catalogue

| Task | Strategy |
| --- | --- |
| Cover feature space (maximin) | KennardStoneSplit / LazyKennardStoneSplit |
| Cover features + target jointly | SPXYSplit, MDKSSplit |
| Diversity selection (subsample) | OptiSimSplit, MinimumDissimilaritySplit, MaximumDissimilaritySplit |
| Kennard–Stone + random swap | MoraisLimaMartinSplit |
| Group-aware train/test | GroupShuffleSplit, GroupStratifiedSplit |
| Time-ordered train/test | TimeSplit (TimeSplitOldest, TimeSplitNewest) |
| Train on extreme target values | TargetPropertySplit (TargetPropertyHigh, TargetPropertyLow) |
| Random baseline | RandomSplit |
| Plain k-fold | KFold |
| Stratified k-fold | StratifiedKFold |
| Group k-fold | GroupKFold, StratifiedGroupKFold |
| Leave-group-out | LeaveOneGroupOut, LeavePGroupsOut |
| Time-series CV | TimeSeriesSplit, BlockedCV, PurgedKFold |
| Resampling CV | ShuffleSplit, StratifiedShuffleSplit, GroupShuffleSplitCV, BootstrapSplit |
| Repeated CV | RepeatedKFold, RepeatedStratifiedKFold |
| Nested CV | NestedCV |
| Predefined fold assignments | PredefinedSplit |
| Leave-p-out | LeavePOut, LeaveOneOut |
| Cluster assignment | sphere_exclusion |
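
To make "cover feature space (maximin)" concrete, here is a minimal sketch of the classic Kennard-Stone selection rule in plain Julia. It is an illustration of the textbook algorithm, not the package's implementation: `kennard_stone_indices` is a hypothetical name, Euclidean distance is an assumption, and the dense distance matrix is only practical for small datasets.

```julia
using LinearAlgebra

# Hypothetical helper (not the DataSplits implementation).
# X is features × samples; returns the indices of `n_train` selected columns.
function kennard_stone_indices(X::AbstractMatrix, n_train::Integer)
    n = size(X, 2)
    2 <= n_train <= n || throw(ArgumentError("need 2 <= n_train <= number of samples"))

    # Pairwise Euclidean distances between samples (columns).
    D = [norm(view(X, :, i) - view(X, :, j)) for i in 1:n, j in 1:n]

    # Seed with the two most distant samples.
    candidates = [(i, j) for i in 1:n for j in (i + 1):n]
    i, j = candidates[argmax([D[a, b] for (a, b) in candidates])]
    selected = [i, j]

    # Distance from every sample to its nearest already-selected sample.
    mindist = min.(D[:, i], D[:, j])

    while length(selected) < n_train
        mindist[selected] .= -Inf          # never pick a sample twice
        next = argmax(mindist)             # farthest from the selected set
        push!(selected, next)
        mindist .= min.(mindist, D[:, next])
    end
    return selected
end

# Four one-feature samples at 0, 1, 2, and 10.
X_demo = [0.0 1.0 2.0 10.0]
picked = kennard_stone_indices(X_demo, 3)
```

On this toy input the two extremes (0 and 10) are chosen first, followed by the sample at 2, which lies farther from both than the sample at 1 does.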

Conventions

  • Matrices follow the Julia ML convention: columns are samples, rows are features. Tables.jl inputs (e.g. DataFrame) use rows as samples and are converted internally.
  • Custom containers must implement MLUtils.numobs and MLUtils.getobs.
  • Cohort sizes (train, validation, test) are set on partition, not on the strategy. They accept integer counts, integer percentages summing to 100, or fractions in (0, 1) summing to 1.
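
The first two conventions can be sketched with MLUtils alone. The example below assumes MLUtils.jl is installed; `SpectraSet` is a hypothetical container invented for illustration, and whether DataSplits requires any methods beyond `numobs` and `getobs` is not covered here.

```julia
using MLUtils

# Matrix input: columns are samples, so 150 samples with 4 features is 4×150.
X_rows = rand(150, 4)          # samples as rows, e.g. as read from a file
X = permutedims(X_rows)        # 4×150, features × samples
n_samples = numobs(X)          # MLUtils counts along the last dimension
first_three = getobs(X, 1:3)   # a 4×3 matrix holding samples 1 to 3

# Hypothetical custom container: spectra of varying length, one per sample.
struct SpectraSet
    spectra::Vector{Vector{Float64}}
end

# The two methods named in the conventions above.
MLUtils.numobs(d::SpectraSet) = length(d.spectra)
MLUtils.getobs(d::SpectraSet, i::Integer) = d.spectra[i]
MLUtils.getobs(d::SpectraSet, idx::AbstractVector{<:Integer}) =
    SpectraSet(d.spectra[idx])

ds = SpectraSet([rand(10), rand(12), rand(8)])
subset = getobs(ds, [1, 3])    # a SpectraSet with two spectra
```

Returning the same container type from the vector-index method keeps subsets usable anywhere the full dataset is.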