Getting Started
Basic API
DataSplits exposes a single entry point, partition, in three forms:
# Two cohorts (train / test).
res = partition(data, alg; train, test, kwargs...)
# Three cohorts (train / validation / test) — two strategies.
res = partition(data, test_alg, val_alg; train, validation, test, kwargs...)
# Cross-validation — returns a CrossValidationSplit of folds.
cvs = partition(data, cv_alg; kwargs...)alg is any subtype of AbstractSplitStrategy; CV strategies subtype AbstractCVStrategy. Cohort sizes (train, validation, test) live on partition, not on the strategy struct.
Cohort sizes
train, validation, and test accept either absolute integer counts (summing to numobs(data)), integer percentages (summing to 100), or (0, 1) fractions (summing to 1.0).
partition(X, RandomSplit(); train = 80, test = 20) # percentages
partition(X, RandomSplit(); train = 150, test = 40) # absolute
partition(X, RandomSplit(); train = 0.8, test = 0.2) # fractionsCV strategies do not take train/test keywords — fold sizes are determined by the strategy itself (typically through k or n_splits). Resampling CV strategies (subtypes of AbstractResamplingCVStrategy such as ShuffleSplit) are the exception: they do accept train/test, applied per fold.
Data formats
data can be a matrix, a vector, a Tables.jl-compatible container (e.g. DataFrame), or any custom type implementing the MLUtils interface (MLUtils.numobs, MLUtils.getobs).
For matrices DataSplits follows the Julia ML convention: columns are samples, rows are features. Transpose row-major data before passing it in (X' or permutedims(X)). For Tables.jl inputs rows are samples — DataSplits converts internally when a strategy needs an F×N matrix.
All returned indices are positive integers in 1:N. For non-standard arrays or custom containers, materialise subsets with splitdata / splitview rather than indexing directly.
Auxiliary slots
Strategies declare which auxiliary inputs they consume via the consumes trait. The three slots are:
target— response / property vector (e.g. forSPXYSplit,StratifiedKFold).time— temporal ordering vector (e.g. forTimeSplit,TimeSeriesSplit,BlockedCV).groups— group-membership vector (e.g. forGroupShuffleSplit,GroupKFold).
partition(X, SPXYSplit(); target = y, train = 80, test = 20)
partition(X, GroupKFold(5); groups = patient_ids)
partition(X, TimeSeriesSplit(5); time = timestamps)Single-input shorthand
When data is itself the vector a strategy needs (e.g. a timestamps vector for TimeSplit, or group IDs for GroupShuffleSplit), the corresponding keyword may be omitted — data plays both roles. This is governed by the fallback_from_data trait.
partition(timestamps, TimeSplit(); train = 0.7, test = 0.3)
partition(patient_ids, GroupShuffleSplit(); train = 0.8, test = 0.2)Randomness
Strategies that randomise (e.g. RandomSplit, ShuffleSplit, GroupShuffleSplit) accept an rng keyword for reproducibility:
using Random
partition(X, RandomSplit(); train = 0.8, test = 0.2, rng = MersenneTwister(42))Materialising splits
partition returns indices, not data subsets. Use one of:
splitdata(res, data)— returns a tuple of independent copies viaMLUtils.getobs.splitview(res, data)— returns lazy slices viaMLUtils.obsview(no copy).trainview/testview/valview— lazy view of a single cohort across one or more data sources.traindata/testdata/valdata— materialised counterparts.
res = partition(X, KennardStoneSplit(); train = 0.8, test = 0.2)
X_train, X_test = splitdata(res, X)
X_train_view, X_test_view = splitview(res, X)
X_train, y_train = trainview(res, X, y)For a CrossValidationSplit, splitdata / splitview return one element per fold:
cvs = partition(X, KFold(5))
for (X_tr, X_te) in splitview(cvs, X)
# train and evaluate
endResult accessors
Always read indices through the stable accessors rather than the result's fields directly:
trainindices/testindices/valindicesfoldsforCrossValidationSplitrowpairsfor MLJ-style(train, test)tuples