Cross-Validation

Cross-validation (CV) estimates model performance more reliably than a single train/test split by rotating the test window across the data. DataSplits provides the full standard catalogue of CV strategies, all accessible through partition with a uniform interface.

All CV strategies return a CrossValidationSplit — a collection of folds you can iterate, index, or feed directly to MLJ.

Quick reference

| Strategy | Key property |
|---|---|
| KFold | Plain k-fold; deterministic or shuffled |
| StratifiedKFold | Preserves class / quantile-bin proportions per fold |
| GroupKFold | No group spans two folds |
| StratifiedGroupKFold | Group integrity + class balance |
| ShuffleSplit | Independent random resamples; caller sets cohort sizes |
| StratifiedShuffleSplit | Stratified resampling |
| GroupShuffleSplitCV | Group-aware resampling |
| RepeatedKFold | KFold run multiple times with different shuffles |
| RepeatedStratifiedKFold | Same, stratified |
| BootstrapSplit | Bootstrap resampling; OOB as test |
| NestedCV | Outer CV for evaluation, inner CV for hyperparameter tuning |
| LeavePOut / LeaveOneOut | Every combination of p observations as test |
| LeavePGroupsOut / LeaveOneGroupOut | Every combination of p groups as test |
| PredefinedSplit | Caller provides fold assignments |
| TimeSeriesSplit | Time-aware; see also the Time Series page |

The iteration pattern

cvs = partition(X, KFold(5))

for (X_tr, X_te) in splitview(cvs, X)
    fit!(model, X_tr)
    score = evaluate(model, X_te)
end

# MLJ integration.
using MLJ
mach = machine(model, X, y)
evaluate!(mach; resampling = rowpairs(cvs), measure = accuracy)

Plain KFold

KFold divides the data into k roughly equal folds. Each fold takes a turn as the test set; the remaining k-1 folds form the training set.

using Random  # provides MersenneTwister

# Deterministic split (default).
cvs = partition(X, KFold(5))

# Shuffle observations before folding; a seeded rng makes the assignment reproducible.
cvs = partition(X, KFold(5; shuffle = true); rng = MersenneTwister(42))

Fold sizes differ by at most one observation: the first N mod k folds are one sample larger.
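
The fold-size rule above can be reproduced in a few lines of plain Julia. This is a minimal sketch, not DataSplits' implementation: `kfold_ranges` is a hypothetical helper that assumes contiguous, unshuffled folds.

```julia
# Hypothetical helper (not part of DataSplits): contiguous test ranges for k-fold.
function kfold_ranges(n::Integer, k::Integer)
    1 <= k <= n || throw(ArgumentError("need 1 <= k <= n"))
    base, extra = divrem(n, k)              # extra == n mod k
    ranges = Vector{UnitRange{Int}}(undef, k)
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)   # the first `extra` folds get one more
        ranges[i] = start:(start + len - 1)
        start += len
    end
    return ranges
end

ranges = kfold_ranges(23, 5)
println(length.(ranges))   # prints [5, 5, 5, 4, 4]
```

With N = 23 and k = 5, N mod k is 3, so the first three folds hold five observations and the last two hold four.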

Stratified KFold

StratifiedKFold distributes each class (or quantile bin for continuous targets) round-robin across the k folds so every fold has nearly the same class proportions as the full dataset.

# Classification: class labels as target.
cvs = partition(X, StratifiedKFold(5); target = labels)

# Regression: continuous target binned into 10 quantile groups by default.
cvs = partition(X, StratifiedKFold(5); target = y)

# Fewer bins for sparse or discrete-heavy targets.
cvs = partition(X, StratifiedKFold(5; bins = 4); target = y)

Use StratifiedKFold instead of KFold whenever the class distribution is imbalanced or the dataset is small.
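
To make the round-robin idea concrete, here is a small sketch in plain Julia. It illustrates the technique under simple assumptions (categorical labels, no shuffling) and is not DataSplits' code; `stratified_fold_ids` is a hypothetical name.

```julia
# Hypothetical helper: deal each class's observations round-robin over k folds.
function stratified_fold_ids(labels::AbstractVector, k::Integer)
    fold = zeros(Int, length(labels))
    next = 0                               # fold pointer, carried across classes
    for class in unique(labels)
        for i in findall(==(class), labels)
            fold[i] = next % k + 1
            next += 1
        end
    end
    return fold
end

labels = [fill("a", 10); fill("b", 5)]
fold = stratified_fold_ids(labels, 5)
for f in 1:5
    in_fold = labels[fold .== f]
    println(f, ": a=", count(==("a"), in_fold), " b=", count(==("b"), in_fold))
end
```

Each of the five folds ends up with two "a" and one "b" observation, matching the 2:1 ratio of the full label vector.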

Group-aware KFold

GroupKFold assigns entire groups to single folds — no group ever appears in both the train and test cohort of the same fold. This is the standard choice for datasets with natural grouping (patients, molecular scaffolds, experimental batches).

cvs = partition(X, GroupKFold(5); groups = patient_ids)

# Shuffle the order in which groups are assigned; vary the seed to get different fold compositions.
cvs = partition(X, GroupKFold(5; shuffle = true);
                groups = patient_ids, rng = MersenneTwister(42))

For the most demanding case — group integrity and class balance — use StratifiedGroupKFold:

cvs = partition(X, StratifiedGroupKFold(5);
                target = labels, groups = patient_ids)
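
The group-integrity guarantee can be illustrated with a greedy assignment in plain Julia. This sketch assumes a "largest group into the currently smallest fold" heuristic, which may differ from what DataSplits does internally; `group_fold_ids` is a hypothetical helper.

```julia
# Hypothetical helper: assign whole groups to folds, largest group first,
# always into the fold that currently holds the fewest observations.
function group_fold_ids(groups::AbstractVector, k::Integer)
    ids = unique(groups)
    sizes = Dict(g => count(==(g), groups) for g in ids)
    order = sort(ids; by = g -> sizes[g], rev = true)
    load = zeros(Int, k)                   # observations per fold so far
    fold_of = Dict{eltype(ids), Int}()
    for g in order
        f = argmin(load)
        fold_of[g] = f
        load[f] += sizes[g]
    end
    return [fold_of[g] for g in groups]
end

patient_ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]
fold = group_fold_ids(patient_ids, 3)

# Every patient maps to exactly one fold, so no patient spans train and test.
for p in unique(patient_ids)
    @assert length(unique(fold[patient_ids .== p])) == 1
end
```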

Leave-p-out and leave-group-out

LeaveOneOut produces N folds, each with a single test observation. It is exhaustive, but expensive for large datasets because the model is fitted N times.

cvs = partition(X, LeaveOneOut())  # N folds
cvs = partition(X, LeavePOut(3))   # binomial(N, 3) folds — use only for small N

LeaveOneGroupOut / LeavePGroupsOut are the group-aware analogues — every combination of one (or p) groups takes a turn as the test cohort.

cvs = partition(X, LeaveOneGroupOut(); groups = batch_ids)  # one batch held out per fold
cvs = partition(X, LeavePGroupsOut(2); groups = site_ids)   # binomial(n_groups, 2) folds
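
Because the fold count is a binomial coefficient, it grows quickly with p. The following sketch uses only Base's `binomial` to show how many model fits each strategy implies; `n_folds` is a hypothetical name introduced for illustration.

```julia
# Number of folds for leave-p-out over n observations (or n groups).
n_folds(n::Integer, p::Integer) = binomial(n, p)

for n in (10, 50, 100, 1000)
    println("n = ", n,
            ": leave-one-out = ", n_folds(n, 1),
            ", leave-3-out = ", n_folds(n, 3))
end
```

Checking this number before calling partition is a cheap way to decide whether an exhaustive strategy is feasible.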

Resampling strategies

ShuffleSplit produces n_splits independent random resamples, each sized by the caller. Unlike KFold, a single observation can appear in the test set of several resamples, or of none.

cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2)
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2,
                rng = MersenneTwister(42))
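
The independence of the resamples can be seen in a small sketch built on `randperm` from the Random standard library. `shuffle_split_indices` is a hypothetical helper, and rounding the fractions to whole counts is an assumption rather than documented DataSplits behaviour.

```julia
using Random

# Hypothetical helper: n_splits independent random (train, test) index pairs.
function shuffle_split_indices(rng::AbstractRNG, n::Integer, n_splits::Integer;
                               train::Real = 0.8, test::Real = 0.2)
    n_train = round(Int, train * n)
    n_test = round(Int, test * n)
    n_train + n_test <= n || throw(ArgumentError("train + test exceeds the data"))
    return map(1:n_splits) do _
        perm = randperm(rng, n)            # fresh permutation per resample
        (perm[1:n_train], perm[n_train+1:n_train+n_test])
    end
end

splits = shuffle_split_indices(MersenneTwister(42), 20, 10)

# Count how often each observation lands in a test set across the 10 resamples.
test_counts = zeros(Int, 20)
for (_, te) in splits
    test_counts[te] .+= 1
end
println(test_counts)
```

The counts are generally uneven: some observations are tested several times and others possibly never, which is the key difference from KFold.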

StratifiedShuffleSplit adds class balancing per resample:

cvs = partition(X, StratifiedShuffleSplit(10); target = labels,
                train = 0.8, test = 0.2)

GroupShuffleSplitCV is the group-aware resampling variant — groups are added whole, so the actual train size may overshoot slightly:

cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 0.8, test = 0.2)

Bootstrap

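BootstrapSplit draws N observations with replacement as the training set; the observations never drawn form the out-of-bag (OOB) test set. For large N, each observation is drawn at least once with probability 1 - (1 - 1/N)^N ≈ 63.2%, so on average about 63.2% of the distinct observations appear in train and the remaining 36.8% form the OOB test set.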

cvs = partition(X, BootstrapSplit(50); rng = MersenneTwister(42))

for (X_tr, X_te) in splitview(cvs, X)
    # X_tr has N observations, with duplicates — this is by design
    # X_te is the OOB set (unique observations not drawn in this bootstrap)
end

If the training set must not contain duplicate observations, use ShuffleSplit instead.
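
The out-of-bag fraction can be checked empirically with a few lines of plain Julia. This is a sketch of the general bootstrap technique, not of DataSplits internals; `bootstrap_indices` is a hypothetical helper.

```julia
using Random

# Hypothetical helper: one bootstrap resample as (train indices, OOB indices).
function bootstrap_indices(rng::AbstractRNG, n::Integer)
    train = rand(rng, 1:n, n)              # n draws with replacement
    oob = setdiff(1:n, train)              # observations that were never drawn
    return train, oob
end

rng = MersenneTwister(42)
n = 1_000
fractions = [length(bootstrap_indices(rng, n)[2]) / n for _ in 1:200]

println("mean OOB fraction: ", round(sum(fractions) / length(fractions); digits = 3))
println("(1 - 1/n)^n:       ", round((1 - 1 / n)^n; digits = 3))
```

Both numbers should sit close to 1/e, that is, roughly 37% of the observations end up out of bag in each resample.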

Repeated KFold

RepeatedKFold runs KFold n_repeats times with a fresh random shuffle each time, producing k × n_repeats folds. This reduces the variance of the performance estimate compared to a single k-fold run.

cvs = partition(X, RepeatedKFold(5; n_repeats = 10);
                rng = MersenneTwister(42))  # 50 folds total

RepeatedStratifiedKFold does the same with stratification:

cvs = partition(X, RepeatedStratifiedKFold(5; n_repeats = 10);
                target = labels, rng = MersenneTwister(42))

Nested cross-validation

NestedCV combines an outer CV (for unbiased performance estimation) with an inner CV (for hyperparameter tuning). For each outer fold the inner CV is applied to the outer training cohort; inner indices are remapped to the global 1:N space.

cvs = partition(X, NestedCV(KFold(5), KFold(3)))

for outerfold in folds(cvs)
    X_tr_outer, X_te_outer = splitdata(outerfold, X)

    for (X_tr, X_val) in splitview(innerfolds(outerfold), X)
        # Tune hyperparameters on (X_tr, X_val)
    end
    # Refit best model on full X_tr_outer, score on X_te_outer
end

Stratified and group-aware strategies work as both outer and inner:

cvs = partition(X, NestedCV(StratifiedKFold(5), StratifiedKFold(3));
                target = labels)
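
The index remapping mentioned above is easiest to see in a sketch. Here the inner CV is computed over positions 1:length(outer_train), and those positions are then used to index into the outer training indices, which maps them back to the global 1:N space. `kfold_pairs` is a hypothetical helper written for this illustration, not a DataSplits function.

```julia
# Hypothetical helper: contiguous k-fold (train, test) index pairs over 1:n.
function kfold_pairs(n::Integer, k::Integer)
    sizes = [div(n, k) + (i <= n % k ? 1 : 0) for i in 1:k]
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    return [(vcat(1:starts[i]-1, stops[i]+1:n), collect(starts[i]:stops[i]))
            for i in 1:k]
end

n = 30
for (outer_train, outer_test) in kfold_pairs(n, 5)
    for (tr_local, val_local) in kfold_pairs(length(outer_train), 3)
        # Local positions index into outer_train; this remaps them to 1:n.
        inner_train = outer_train[tr_local]
        inner_val = outer_train[val_local]
        # The outer test cohort never leaks into the inner loop.
        @assert isempty(intersect(inner_train, outer_test))
        @assert isempty(intersect(inner_val, outer_test))
    end
end
```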

Predefined fold assignments

PredefinedSplit lets you supply the fold assignment vector directly. Observations with a negative assignment are always placed in train.

# 3 folds: obs 1-20 test in fold 0, obs 21-40 in fold 1, obs 41-60 in fold 2.
test_fold = [fill(0, 20); fill(1, 20); fill(2, 20)]
cvs = partition(X, PredefinedSplit(test_fold))

# Single fold: obs 1-40 are tested in fold 0; the last 10 (marked -1) are always in train, never tested.
test_fold = [fill(0, 40); fill(-1, 10)]
cvs = partition(X, PredefinedSplit(test_fold))
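
As a rough illustration of how an assignment vector translates into folds, the sketch below builds (train, test) index pairs by hand. It assumes only the semantics stated above (one fold per distinct non-negative value, negative entries never tested); `predefined_pairs` is a hypothetical helper, not part of DataSplits.

```julia
# Hypothetical helper: turn a fold-assignment vector into (train, test) index pairs.
function predefined_pairs(test_fold::AbstractVector{<:Integer})
    fold_ids = sort(unique(filter(>=(0), test_fold)))   # negatives define no fold
    return [(findall(!=(f), test_fold), findall(==(f), test_fold))
            for f in fold_ids]
end

test_fold = [fill(0, 40); fill(-1, 10)]
splits = predefined_pairs(test_fold)
println(length(splits))                        # prints 1
train, test = splits[1]
println(length(train), " ", length(test))      # prints 10 40
```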