Reference

DataSplits.AbstractCVStrategy - Type
AbstractCVStrategy <: AbstractSplitStrategy

Abstract supertype for cross-validation strategies — strategies that produce a CrossValidationSplit (a vector of folds) rather than a single train/test or train/val/test partition.

Dispatching on this supertype selects the partition(data, alg::AbstractCVStrategy; …) method, which does not accept the train / test / validation keywords: fold sizes are determined by the strategy itself (typically through k).

To implement a custom CV strategy:

  • Subtype this and define _partition(data, alg; target, time, groups, rng, kwargs...) returning a CrossValidationSplit.
  • Declare consumes and (optionally) fallback_from_data exactly as for any AbstractSplitStrategy.

DataSplits.AbstractResamplingCVStrategy - Type
AbstractResamplingCVStrategy <: AbstractCVStrategy

Abstract supertype for resampling cross-validation strategies — strategies whose folds are independent random train/test splits sized by the caller, rather than fixed slices of a deterministic partition.

Subtyping this routes calls to the partition(data, alg; train, test, …) method, which forwards the resolved n_train / n_test to _partition. Used by ShuffleSplit, StratifiedShuffleSplit, and GroupShuffleSplitCV.

DataSplits.AbstractSplitStrategy - Type
AbstractSplitStrategy

Abstract supertype for all splitting strategies.

To implement a custom train/test (or train/val/test) strategy, subtype this and define:

  • _partition(data, alg; n_train, n_test, target, time, groups, rng) returning an AbstractSplitResult.
  • consumes(::MyStrategy) returning a tuple of symbols from (:data, :target, :time, :groups).
  • fallback_from_data(::MyStrategy) returning the subset of consumes that can fall back to data.

For cross-validation strategies (returning CrossValidationSplit) see AbstractCVStrategy — the contract there omits n_train / n_test.

DataSplits.BlockedCV - Type
BlockedCV(k::Integer; gap::Integer=0) <: AbstractCVStrategy

Blocked k-fold cross-validation for dependent (time- or space-ordered) data. Observations are sorted by time= and partitioned into k contiguous chronological blocks; each block takes a turn as the test cohort while the train cohort is everything else — both blocks preceding it and blocks following it.

This differs from TimeSeriesSplit (forward-only, train always precedes test) and from KFold (no temporal ordering). It matches the "blocked CV" used in time-series / spatial-statistics literature (Bergmeir & Benítez 2012, Roberts et al. 2017) when train samples should not be drawn from arbitrary positions but the test block must still be embedded in a longer train history.

A gap window (in observations) is removed from the train cohort on both sides of the test block to mitigate leakage from autocorrelation.
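
The block-plus-gap rule can be sketched in plain Julia. This is an illustration only, not the package's implementation: the helper name blocked_folds is hypothetical, observations are assumed to be already sorted by time with all timestamps distinct (so the atomicity rule never applies), and giving the first blocks the extra observation when n is not divisible by k is a choice made for this sketch.

```julia
# Sketch only: assumes time-sorted observations with distinct timestamps.
function blocked_folds(n::Integer, k::Integer; gap::Integer = 0)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    gap >= 0 || throw(ArgumentError("gap must be >= 0"))
    base, extra = divrem(n, k)
    folds = Vector{Tuple{Vector{Int},Vector{Int}}}()
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)   # first `extra` blocks get one more
        stop = start + len - 1
        test = collect(start:stop)
        # Train is everything outside the test block widened by `gap` on both sides.
        train = [j for j in 1:n if j < start - gap || j > stop + gap]
        push!(folds, (train, test))
        start = stop + 1
    end
    return folds
end

folds = blocked_folds(10, 5; gap = 1)
train, test = folds[2]   # test block is 3:4; observations 2 and 5 are dropped from train
```

Note that train contains indices on both sides of the test block, which is what separates this scheme from a forward-only split.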

Atomicity rule

Observations sharing the same timestamp are never split between train and test of the same fold — block boundaries fall between distinct time values, mirroring TimeSeriesSplit and TimeSplit. gap is measured in observations and may trim partial blocks of equal timestamps from the train side; no row ever leaks into test.

Fields

  • k::Int: Number of folds (must be ≥ 2 and ≤ number of distinct time values).
  • gap::Int: Number of observations excluded from the train cohort on each side of the test block (must be ≥ 0; default 0).

Examples

# 5-fold blocked CV with a one-observation embargo on both sides.
cvs = partition(X, BlockedCV(5; gap = 1); time = timestamps)

for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end

# Single-input shorthand when the timestamps are also the data.
cvs = partition(timestamps, BlockedCV(4))

DataSplits.BootstrapSplit - Type
BootstrapSplit(n_splits::Integer) <: AbstractCVStrategy

Bootstrap cross-validation. For each of n_splits iterations the train cohort is drawn from 1:N with replacement (so it has exactly N indices, with duplicates), and the test cohort is the set of out-of-bag (OOB) observations — the unique indices that were not sampled. On average about (1 - 1/e) ≈ 63.2% of unique observations land in train; the remaining ~36.8% form the OOB test.

Fields

  • n_splits::Int: Number of bootstrap resamples (must be ≥ 1).

Important notes

  • Train contains duplicates by design. This is the defining property of bootstrap sampling — it lets the model see the variance introduced by resampling the empirical distribution. If you need unique-only train indices use ShuffleSplit instead.
  • Test (OOB) size varies fold-to-fold. It is whatever the bootstrap left out; no caller-set size is honoured.
  • Cohort sizes (train, test) are not accepted: train is always N (with replacement), test is the OOB.
  • The test cohort is non-empty with high probability once N is moderately large: the probability that all N indices are drawn (leaving an empty OOB set) is N! / N^N, which is 0.5 for N = 2 but about 3.6e-4 for N = 10.
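
A single bootstrap resample and the empty-OOB probability can be sketched with the standard library alone. The names bootstrap_fold and p_empty_oob are hypothetical helpers introduced for illustration; they are not part of DataSplits.

```julia
using Random

# One bootstrap resample: N draws with replacement form the train cohort,
# the indices that were never drawn form the out-of-bag test cohort.
function bootstrap_fold(rng::AbstractRNG, N::Integer)
    train = rand(rng, 1:N, N)
    test = setdiff(1:N, train)
    return train, test
end

# Probability that every index is drawn at least once (empty OOB cohort).
p_empty_oob(N::Integer) = Float64(factorial(big(N)) / big(N)^N)

train, test = bootstrap_fold(MersenneTwister(42), 100)
p_empty_oob(2)    # 0.5
```

Train always has exactly N entries, while the length of test changes from one resample to the next.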

Examples

# 50 bootstrap resamples.
cvs = partition(X, BootstrapSplit(50); rng = MersenneTwister(42))

for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train)        # X_train has N obs, with duplicates
    evaluate(model, X_test)     # X_test is the OOB
end

DataSplits.CrossValidationSplit - Type
CrossValidationSplit

A result type representing a k-fold cross-validation split.

Fields

  • folds::Vector{<:AbstractSplitResult}: One result per fold.

DataSplits.GroupKFold - Type
GroupKFold(k::Integer; shuffle::Bool=false) <: AbstractCVStrategy

Group-aware k-fold cross-validation. Whole groups are assigned to a single fold; no group ever appears in both the train and test cohort of the same fold. Equivalent in spirit to scikit-learn's GroupKFold.

Groups are passed via the groups= keyword (or, by fallback, data itself plays that role).

Fields

  • k::Int: Number of folds (must be ≥ 2 and ≤ number of unique groups).
  • shuffle::Bool: When true, the order in which groups are considered for fold assignment is shuffled using the rng passed to partition, so different RNG seeds yield different fold compositions. When false (default), assignment is deterministic and reproducible without an rng.

Notes

  • Each group, in processing order, is placed in the fold that currently holds the fewest observations, so observation counts across folds stay roughly balanced whether or not shuffling is enabled. Mirrors sklearn's GroupKFold.
  • When shuffle=false, groups are processed in descending order of size (the classic deterministic balancing). When shuffle=true, they are processed in a randomly permuted order — folds remain balanced but no longer follow size order.
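
The deterministic balancing described in these notes can be sketched as follows. This is a minimal illustration, not the package's code: greedy_group_folds is a hypothetical name, and breaking ties between equally sized groups by label is a choice made for the sketch.

```julia
# Sketch of the shuffle = false case: largest groups first, each placed in
# the fold that currently holds the fewest observations.
function greedy_group_folds(groups::AbstractVector, k::Integer)
    counts = Dict{eltype(groups),Int}()
    for g in groups
        counts[g] = get(counts, g, 0) + 1
    end
    # Descending size; equal sizes are ordered by label (sketch-only tie-break).
    order = sort(collect(keys(counts)); by = g -> (-counts[g], g))
    fold_sizes = zeros(Int, k)
    fold_of = Dict{eltype(groups),Int}()
    for g in order
        f = argmin(fold_sizes)          # first fold with the fewest observations
        fold_of[g] = f
        fold_sizes[f] += counts[g]
    end
    return [fold_of[g] for g in groups], fold_sizes
end

groups = ["a", "a", "a", "b", "b", "c", "c", "d"]
assignment, sizes = greedy_group_folds(groups, 2)
# sizes == [4, 4]; group "a" and "d" share fold 1, "b" and "c" share fold 2
```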

Examples

# Deterministic (default).
cvs = partition(X, GroupKFold(5); groups = patient_ids)

# Shuffled — different seeds give different fold compositions.
cvs = partition(X, GroupKFold(5; shuffle = true);
                groups = patient_ids, rng = MersenneTwister(42))

for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end

# Fallback: ids are simultaneously the data and the groups.
cvs = partition(patient_ids, GroupKFold(5))

DataSplits.GroupShuffleSplit - Type
GroupShuffleSplit() <: AbstractSplitStrategy

Group-aware train/test splitting. Accumulates whole groups into the training set (in random order) until the requested training size is reached.

Groups are passed as a vector of membership IDs via the groups= keyword. Any grouping is valid: cluster assignments, patient IDs, scaffold labels, batch numbers, site identifiers, graph communities, etc.

Notes

Because groups are added whole, the actual train cohort size may overshoot the requested n_train. No attempt is made to minimise this overshoot.
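
The overshoot is easy to see in a small sketch of the accumulation rule. The helper group_shuffle_indices is hypothetical and only illustrates the idea of adding whole groups in random order until the requested size is reached.

```julia
using Random

# Sketch: shuffle the distinct groups, then move whole groups into train
# until at least `n_train` observations have been collected.
function group_shuffle_indices(rng::AbstractRNG, groups::AbstractVector, n_train::Integer)
    order = shuffle(rng, unique(groups))
    train_groups = Set{eltype(groups)}()
    n = 0
    for g in order
        n >= n_train && break
        push!(train_groups, g)
        n += count(==(g), groups)
    end
    train = findall(g -> g in train_groups, groups)
    test = findall(g -> !(g in train_groups), groups)
    return train, test
end

groups = repeat(1:4; inner = 3)     # 4 groups of 3 observations each
train, test = group_shuffle_indices(MersenneTwister(1), groups, 7)
# 7 is not a multiple of 3, so train overshoots to 9 observations.
```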

Examples

# ids is both data and groups
res = partition(ids, GroupShuffleSplit(); train=80, test=20)

# X is split; group membership provided separately
res = partition(X, GroupShuffleSplit(); groups=patient_ids, train=80, test=20)
X_train, X_test = splitdata(res, X)

# With Clustering.jl
using Clustering
res = partition(X, GroupShuffleSplit();
                groups=assignments(kmeans(X, 5)), train=80, test=20)

DataSplits.GroupShuffleSplitCV - Type
GroupShuffleSplitCV(n_splits::Integer) <: AbstractResamplingCVStrategy

Group-aware random permutation cross-validation. For each of n_splits iterations the groups are shuffled and assigned whole into the train cohort until the requested training size is reached; the remaining groups form the test cohort. Mirrors scikit-learn's GroupShuffleSplit.

Groups are passed as a vector of membership IDs via the groups= keyword (or, by fallback, data itself plays that role).

Fields

  • n_splits::Int: Number of resamples (must be ≥ 1).

Notes

  • Resamples are independent — a group can land in train in one fold and test in another. This is the defining property versus GroupKFold, where each group appears in exactly one test cohort across folds.
  • Because groups are added whole, the actual train cohort size may overshoot the requested n_train (same behaviour as the 2-cohort GroupShuffleSplit).
  • train and test must sum to N; every observation is placed in exactly one cohort per resample.

Examples

# Fractions.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 0.8, test = 0.2)

# Absolute counts.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 80, test = 20)

# Reproducible.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 0.8, test = 0.2,
                rng = MersenneTwister(42))

# Fallback: ids are simultaneously the data and the groups.
cvs = partition(patient_ids, GroupShuffleSplitCV(10);
                train = 0.8, test = 0.2)

DataSplits.GroupStratifiedSplit - Type
GroupStratifiedSplit(allocation::Symbol; n=nothing) <: AbstractSplitStrategy

Group-stratified train/test splitting with flexible allocation methods.

Groups are passed as a vector of membership IDs via the groups= keyword. Any grouping is valid: cluster assignments, patient IDs, scaffold labels, batch numbers, site identifiers, graph communities, etc.

Fields

  • allocation::Symbol: Allocation method — :equal, :proportional, or :neyman.
  • n::Union{Nothing,Int}: Samples per group for :equal and :neyman.

Allocation methods

  • :proportional — use all samples from each group (shuffled).
  • :equal — select n samples from each group (requires n).
  • :neyman — select proportional to group size × within-group std (requires n).

The training fraction within each group is derived from the global cohort sizes (n_train / N).

Examples

res = partition(X, GroupStratifiedSplit(:proportional);
                groups=patient_ids, train=80, test=20)
X_train, X_test = splitdata(res, X)

# With Clustering.jl
using Clustering
res = partition(X, GroupStratifiedSplit(:equal; n=5);
                groups=assignments(kmeans(X, 4)), train=80, test=20)

References

May, R. J.; Maier, H. R.; Dandy, G. C. Data Splitting for Artificial Neural Networks Using SOM-Based Stratified Sampling. Neural Networks 2010, 23(2), 283–294.

DataSplits.KFold - Type
KFold(k::Integer; shuffle::Bool=false) <: AbstractCVStrategy

Standard k-fold cross-validation. Splits the dataset into k roughly equal folds; each fold takes a turn as the test set while the remaining folds form the training set. Equivalent in spirit to scikit-learn's KFold.

Fields

  • k::Int: Number of folds (must be ≥ 2 and ≤ number of observations).
  • shuffle::Bool: When true, observations are randomly permuted before folding using the rng passed to partition, so different seeds yield different fold assignments. When false (default), observations are assigned in order and the split is fully deterministic.

Notes

  • Fold sizes differ by at most 1 observation (first n mod k folds are one sample larger), mirroring scikit-learn's behaviour.
  • Unlike GroupKFold, no extra keyword arguments are required.
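
The fold-size rule from the first note can be written out directly. The helper names below are hypothetical, and the contiguous test ranges assume the unshuffled case in which observations are assigned in order.

```julia
# Fold sizes for n observations and k folds: the first n mod k folds
# receive one extra observation.
kfold_sizes(n::Integer, k::Integer) =
    [div(n, k) + (i <= rem(n, k) ? 1 : 0) for i in 1:k]

# Contiguous test index ranges for the unshuffled case.
function kfold_test_ranges(n::Integer, k::Integer)
    stops = cumsum(kfold_sizes(n, k))
    starts = [1; stops[1:end-1] .+ 1]
    return [s:e for (s, e) in zip(starts, stops)]
end

kfold_sizes(10, 3)         # [4, 3, 3]
kfold_test_ranges(10, 3)   # [1:4, 5:7, 8:10]
```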

Examples

# Deterministic (default).
cvs = partition(X, KFold(5))

# Shuffled — different seeds give different fold assignments.
cvs = partition(X, KFold(5; shuffle = true); rng = MersenneTwister(42))

for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end

DataSplits.KennardStoneSplit - Type
KennardStoneSplit <: AbstractSplitStrategy

In-memory Kennard-Stone (CADEX) algorithm for train/test splitting.

Precomputes the full N×N distance matrix; prefer LazyKennardStoneSplit for large datasets where that is prohibitive.
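
For orientation, the classical Kennard-Stone selection with a precomputed distance matrix can be sketched as below. This is not the package's implementation: kennard_stone_indices is a hypothetical helper, the metric is fixed to Euclidean, and storing observations as columns of X is an assumption of the sketch.

```julia
using LinearAlgebra

# Sketch of classical Kennard-Stone selection with Euclidean distances.
# Observations are assumed to be the columns of X.
function kennard_stone_indices(X::AbstractMatrix, n_train::Integer)
    N = size(X, 2)
    2 <= n_train <= N || throw(ArgumentError("need 2 <= n_train <= N"))
    D = [norm(X[:, i] - X[:, j]) for i in 1:N, j in 1:N]   # full N-by-N matrix
    # Seed with the two mutually farthest observations.
    i, j = Tuple(argmax(D))
    selected = [i, j]
    # Distance from every observation to its nearest selected observation.
    mindist = min.(D[:, i], D[:, j])
    while length(selected) < n_train
        mindist[selected] .= -Inf       # never re-select
        next = argmax(mindist)          # farthest from the selected set
        push!(selected, next)
        mindist = min.(mindist, D[:, next])
    end
    return selected, setdiff(1:N, selected)
end

X = [0.0 1.0 2.0 3.0 10.0]              # five 1-D observations as columns
train, test = kennard_stone_indices(X, 3)
```

The matrix D is what makes the in-memory variant O(N^2) in storage; a lazy variant would recompute the needed column of distances at each step instead.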

Fields

  • metric::Distances.SemiMetric: Distance metric (default: Euclidean())

Examples

res = partition(X, KennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)

res = partition(X, KennardStoneSplit(Cityblock()); train = 70, test = 30)

DataSplits.LazyKennardStoneSplit - Type
LazyKennardStoneSplit <: AbstractSplitStrategy

Memory-efficient Kennard-Stone (CADEX) algorithm. Computes distances on-the-fly (O(N) storage) rather than precomputing the full N×N matrix.

Fields

  • metric::Distances.SemiMetric: Distance metric (default: Euclidean())

Examples

res = partition(X, LazyKennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)

DataSplits.LazyMDKSSplit - Type
LazyMDKSSplit <: AbstractSplitStrategy

Memory-efficient Minimum Dissimilarity Kennard–Stone (MDKS) splitting strategy. Uses Mahalanobis distance for X and Euclidean for y, normalised and summed as in SPXY. Computes distances on-the-fly (O(N) storage).

Fields

  • metric_X::Union{Nothing,Distances.SemiMetric}: Distance metric for X; if nothing, Mahalanobis is computed from the data at split time.
  • metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())

Examples

res = partition(X, LazyMDKSSplit(); target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)

DataSplits.LazyMaximumDissimilaritySplit - Type
LazyMaximumDissimilaritySplit(; distance_cutoff=0.35, metric=Euclidean()) <: AbstractSplitStrategy

Lazy variant of MaximumDissimilaritySplit: distances are computed on the fly (O(N) storage) instead of precomputing the full distance matrix. Prefer this over MaximumDissimilaritySplit for large datasets.

Fields

  • distance_cutoff::Float64: Similarity threshold (default: 0.35).
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean()).

Examples

res = partition(X, LazyMaximumDissimilaritySplit(); train = 70, test = 30)
X_train, X_test = splitdata(res, X)

DataSplits.LazyMinimumDissimilaritySplit - Type
LazyMinimumDissimilaritySplit <: AbstractSplitStrategy

Lazy (on-the-fly distances) variant of MinimumDissimilaritySplit.

Examples

res = partition(X, LazyMinimumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)

DataSplits.LazyOptiSimSplit - Type
LazyOptiSimSplit <: AbstractSplitStrategy

Memory-efficient, lazy implementation of the OptiSim (Clark 1997) dissimilarity selection strategy. Computes distances on-the-fly; avoids the full N×N matrix.

Fields

  • max_subsample_size::Int: Size of the temporary candidate subsample (default: 10)
  • distance_cutoff::Float64: Similarity threshold (default: 0.35)
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean())

Notes

Emits the same undershoot warning as OptiSimSplit under _id = :datasplits_optisim_undershoot when distance_cutoff exhausts the candidate pool before reaching n_train. See ?OptiSimSplit for the silencing recipe.

References

  • Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.

Examples

res = partition(X, LazyOptiSimSplit(); train = 70, test = 30)
X_train, X_test = splitdata(res, X)

DataSplits.LazySPXYSplit - Type
LazySPXYSplit <: AbstractSplitStrategy

Memory-efficient SPXY splitting strategy. Computes distances on-the-fly (O(N) storage) rather than precomputing the full N×N matrix.

Fields

  • metric_X::Distances.SemiMetric: Distance metric for X (default: Euclidean())
  • metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())

Examples

res = partition(X, LazySPXYSplit(); target=y, train=80, test=20)
X_train, X_test = splitdata(res, X)

DataSplits.LeaveOneOut - Type
LeaveOneOut() <: AbstractCVStrategy

Cross-validation where each single observation takes a turn as the test set. Produces n folds (one per observation). Equivalent to LeavePOut(1).

Examples

cvs = partition(X, LeaveOneOut())

for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end

DataSplits.LeavePGroupsOut - Type
LeavePGroupsOut(p::Integer) <: AbstractCVStrategy

Leave-p-groups-out cross-validation. Produces one fold per combination of p distinct groups: in each fold, the test cohort is exactly the observations belonging to those p groups, and the train cohort is everything else.

Equivalent to scikit-learn's LeavePGroupsOut. The number of folds is binomial(n_groups, p), which grows quickly — pick p accordingly.

Groups are passed via the groups= keyword (or, by fallback, data itself plays that role).

Fields

  • p::Int: Number of groups held out as test in each fold (must satisfy 1 ≤ p < n_groups).

Constructors

  • LeavePGroupsOut(p) — generic constructor.
  • LeaveOneGroupOut() — convenience alias for LeavePGroupsOut(1).

Examples

# One group out per fold (n_folds == n_groups).
cvs = partition(X, LeaveOneGroupOut(); groups = patient_ids)

# Two groups out per fold; n_folds == binomial(n_groups, 2).
cvs = partition(X, LeavePGroupsOut(2); groups = site_ids)

for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end

DataSplits.LeavePOut - Type
LeavePOut(p::Integer) <: AbstractCVStrategy

Exhaustive cross-validation that uses every possible combination of p observations as the test set. Produces binomial(n, p) folds, where n is the number of observations.

Fields

  • p::Int: Number of observations in each test fold (must be ≥ 1 and < n).

Notes

  • The number of folds grows as binomial(n, p), which becomes very large quickly. Use only for small datasets or small values of p.
  • For p = 1, prefer the convenience alias LeaveOneOut.
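
To get a feel for how fast the fold count grows, the test sets can be enumerated with the standard library alone. The helper leave_p_out_tests is hypothetical and only illustrates the combinatorics; it is not how the package builds its folds.

```julia
# Enumerate every p-element test set of 1:n (stdlib only).
function leave_p_out_tests(n::Integer, p::Integer)
    1 <= p < n || throw(ArgumentError("need 1 <= p < n"))
    out = Vector{Vector{Int}}()
    function extend(current, next)
        if length(current) == p
            push!(out, copy(current))
            return
        end
        for i in next:n
            push!(current, i)
            extend(current, i + 1)
            pop!(current)
        end
    end
    extend(Int[], 1)
    return out
end

tests = leave_p_out_tests(5, 2)
length(tests) == binomial(5, 2)   # true: 10 folds
binomial(50, 5)                   # 2118760 folds for n = 50, p = 5
```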

Examples

cvs = partition(X, LeavePOut(2))

for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end

DataSplits.MDKSSplit - Type
MDKSSplit(; metric=nothing)

Minimum Dissimilarity Kennard–Stone (MDKS) split using Mahalanobis distance for X and Euclidean distance for y.

Fields

  • metric::Union{Nothing,Distances.PreMetric}: Distance metric for X. When nothing (default), Mahalanobis distance is computed from the covariance of X at split time.

Examples

res = partition(X, MDKSSplit(); target=y, train=70, test=30)
res = partition(X, MDKSSplit(; metric=Mahalanobis(cov(X; dims=2)));
                target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)

See also

SPXYSplit

DataSplits.MaximumDissimilaritySplit - Type
MaximumDissimilaritySplit(; distance_cutoff=0.35, metric=Euclidean())

Full OptiSim strategy (Clark 1997). Alias for OptiSimSplit with max_subsample_size = N — considers all remaining candidates each iteration.

Fields

  • distance_cutoff::Float64: Similarity threshold (default: 0.35).
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean()).

Notes

  • Greedily includes outliers; remove them before splitting if not desired.

References

  • Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.

Examples

res = partition(X, MaximumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)

DataSplits.MinimumDissimilaritySplit - Type
MinimumDissimilaritySplit <: AbstractSplitStrategy

Greedy dissimilarity selection (Clark 1997). Equivalent to OptiSimSplit with max_subsample_size = 1 (considers only one candidate per iteration).

Fields

  • distance_cutoff::Float64: Similarity threshold (default: 0.35).
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean()).

References

  • Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.

Examples

res = partition(X, MinimumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)

DataSplits.MoraisLimaMartinSplit - Type
MoraisLimaMartinSplit(; swap_frac=0.1, metric=Euclidean())

Kennard–Stone initialisation followed by random swapping of a fraction of samples between train and test sets.

Fields

  • swap_frac::ValidFraction: Fraction of samples to swap (0 < swap_frac < 1)
  • metric::Distances.SemiMetric: Distance metric for Kennard–Stone (default: Euclidean())

Examples

res = partition(X, MoraisLimaMartinSplit(); train=80, test=20)
res = partition(X, MoraisLimaMartinSplit(; swap_frac=0.05); train=80, test=20)

DataSplits.NestedCV - Type
NestedCV(outer::AbstractCVStrategy, inner::AbstractCVStrategy) <: AbstractCVStrategy

Nested cross-validation — combine an outer CV (for unbiased performance estimation) with an inner CV (for hyperparameter selection within each outer training cohort).

For each fold of outer, the outer test cohort is held out and the outer training cohort is itself partitioned by inner. The result is a CrossValidationSplit{NestedFold} where each fold exposes the usual outer (train, test) pair plus an innerfolds(...) accessor giving the inner CV split. Inner fold indices are already remapped to the absolute 1:N index space.
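
The remapping mentioned above amounts to indexing the outer training indices with the inner, local ones. The variable names and index values below are made up for illustration.

```julia
# Inner folds are computed in the local 1:length(outer_train) space,
# then mapped back into the absolute 1:N space by indexing.
outer_train = [1, 2, 3, 7, 8, 9]        # absolute indices of one outer training cohort
inner_local_train = [1, 2, 5, 6]        # positions within outer_train
inner_local_val = [3, 4]

inner_train = outer_train[inner_local_train]   # [1, 2, 8, 9]
inner_val = outer_train[inner_local_val]       # [3, 7]
```

Because the result is already absolute, it can index the original data directly without first slicing out the outer training cohort.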

Fields

  • outer::AbstractCVStrategy — produces the outer train/test split per fold.
  • inner::AbstractCVStrategy — partitions each outer training cohort.

Restrictions

  • outer must produce CrossValidationSplit{TrainTestSplit} (i.e. not nested itself).
  • inner must be a non-resampling AbstractCVStrategy — strategies subtyping AbstractResamplingCVStrategy (e.g. ShuffleSplit, StratifiedShuffleSplit, GroupShuffleSplitCV) require caller-set cohort sizes which NestedCV does not currently propagate.

Slot resolution

consumes(::NestedCV) is the union of the outer and inner strategies' declared slots. partition resolves each slot once against the full dataset, then the inner CV sees a view sliced to the outer training cohort.

Examples

# 5 × 3 nested k-fold on a classification target.
cvs = partition(X, NestedCV(StratifiedKFold(5), StratifiedKFold(3)); target = y)
for outerfold in folds(cvs)
    X_tr_outer, X_te_outer = splitdata(outerfold, X)
    y_tr_outer, _          = splitdata(outerfold, y)
    for (X_tr, X_val) in splitview(innerfolds(outerfold), X)
        # tune hyperparameters on (X_tr, X_val)
    end
    # then refit on the full outer training cohort and score on X_te_outer
end

# Group-aware nesting.
cvs = partition(X, NestedCV(GroupKFold(5), GroupKFold(3)); groups = patient_ids)

DataSplits.NestedFold - Type
NestedFold{I} <: AbstractSplitResult

A single outer fold of a nested cross-validation split.

Fields

  • train::I — outer training indices (absolute, into 1:N).
  • test::I — outer test indices (absolute, into 1:N).
  • inner::CrossValidationSplit{TrainTestSplit{I}} — the inner cross-validation produced by applying the inner strategy to the outer training cohort. Inner fold indices are absolute (already remapped from the local 1:length(train) index space back into 1:N), so they can be used directly against the original data.

Iteration

NestedFold iterates as (train, test) — identical to TrainTestSplit for outer-loop compatibility. To access the inner CV, use the innerfolds accessor or the inner field.

DataSplits.OptiSimSplit - Type
OptiSimSplit(; max_subsample_size=10, distance_cutoff=0.35, metric=Euclidean())

OptiSim (Clark 1997) K-dissimilarity selection strategy for train/test splitting.

Fields

  • max_subsample_size::Int: Size of the temporary candidate subsample (default: 10)
  • distance_cutoff::Float64: Two points are "similar" if their distance is below distance_cutoff (default: 0.35)
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean())

Notes

When distance_cutoff is restrictive relative to the data, the candidate pool may exhaust before n_train samples have been selected. The train cohort is then returned smaller than requested and a @warn is emitted with _id = :datasplits_optisim_undershoot and _group = :datasplits.

To silence this warning for a batch of splits, filter by id (e.g. with LoggingExtras.EarlyFilteredLogger):

using Logging, LoggingExtras
silent = EarlyFilteredLogger(log -> log.id !== :datasplits_optisim_undershoot,
                             current_logger())
with_logger(silent) do
    # repeated partition(...) calls here emit no undershoot warnings
end

References

  • Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.

Examples

res = partition(X, OptiSimSplit(; max_subsample_size=10); train=70, test=30)
X_train, X_test = splitdata(res, X)

DataSplits.PredefinedSplit - Type
PredefinedSplit(test_fold::AbstractVector{<:Integer}) <: AbstractCVStrategy

Cross-validation with caller-provided fold assignments. Each entry of test_fold gives the fold ID in which the corresponding observation serves as test. Negative values mean the observation is never placed in the test cohort — it is part of every fold's training set.

Folds are produced in ascending order of fold ID. Mirrors scikit-learn's PredefinedSplit, but driven by partition and DataSplits' CrossValidationSplit result type.

Fields

  • test_fold::Vector{Int}: Length-N vector mapping each observation to the fold ID where it tests, or to a negative value for "always train".

Notes

  • length(test_fold) must equal numobs(data).
  • Fold IDs need not be contiguous; what matters is the set of distinct non-negative values. At least one non-negative ID must exist.
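
The mapping from a fold-ID vector to (train, test) index pairs can be sketched in a few lines. The helper predefined_folds is hypothetical and follows only the rules stated above: ascending fold IDs, negative entries always in train.

```julia
# Sketch: build (train, test) index pairs from a fold-ID vector.
function predefined_folds(test_fold::AbstractVector{<:Integer})
    ids = sort(unique(filter(>=(0), test_fold)))
    isempty(ids) && throw(ArgumentError("need at least one non-negative fold ID"))
    return [(findall(!=(id), test_fold), findall(==(id), test_fold)) for id in ids]
end

folds = predefined_folds([0, 0, 0, 1, 1, 1, -1, -1, -1, -1])
train, test = folds[1]    # test == [1, 2, 3]; train holds observations 4 through 10
```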

Examples

# Three folds: obs 1-2 test in fold 0, obs 3-4 test in fold 1, obs 5-6
# test in fold 2.
test_fold = [0, 0, 1, 1, 2, 2]
cvs = partition(X, PredefinedSplit(test_fold))

# Hold-out style: obs 7-10 are reserved for training only across all folds.
test_fold = [0, 0, 0, 1, 1, 1, -1, -1, -1, -1]
cvs = partition(X, PredefinedSplit(test_fold))

DataSplits.PurgedKFold - Type
PurgedKFold(k::Integer; purge::Integer=0, embargo::Integer=0) <: AbstractCVStrategy

Purged k-fold cross-validation for time-dependent data, following the recipe in López de Prado, Advances in Financial Machine Learning (2018). Observations are sorted by time= and partitioned into k contiguous chronological blocks; each block takes a turn as the test cohort while the train cohort is everything else minus an asymmetric exclusion window:

  • purge observations are removed from the train cohort immediately before the test block. This mitigates leakage from samples whose labels overlap the test period (e.g. labels built from forward-looking returns whose horizon reaches into the test window).
  • embargo observations are removed from the train cohort immediately after the test block. This mitigates leakage from serial correlation between test-period features and the immediately subsequent train samples.

This is the asymmetric counterpart of BlockedCV (which uses a single symmetric gap on both sides) and the contiguous-block counterpart of sklearn's KFold adapted for time series.
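
The asymmetric window can be sketched for a single test block. The helper purged_train is hypothetical, and the sketch assumes time-sorted observations with distinct timestamps, so the atomicity rule does not come into play.

```julia
# Sketch only: train indices for one test block, with `purge` observations
# dropped before the block and `embargo` observations dropped after it.
function purged_train(n::Integer, test::UnitRange{Int};
                      purge::Integer = 0, embargo::Integer = 0)
    lo = first(test) - purge      # first index removed before the test block
    hi = last(test) + embargo     # last index removed after the test block
    return [j for j in 1:n if j < lo || j > hi]
end

purged_train(12, 5:8; purge = 2, embargo = 1)   # [1, 2, 10, 11, 12]
```

Setting purge and embargo to the same value gives a symmetric window, which is the behaviour of a single gap.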

Atomicity rule

Observations sharing the same timestamp are never split between train and test of the same fold — block boundaries fall between distinct time values, mirroring TimeSeriesSplit and BlockedCV. purge and embargo are measured in observations and may trim partial blocks of equal timestamps from the train side; no row ever leaks into the test cohort.

Fields

  • k::Int: Number of folds (must be ≥ 2 and ≤ number of distinct time values).
  • purge::Int: Number of observations excluded from the train cohort immediately before the test block (must be ≥ 0; default 0).
  • embargo::Int: Number of observations excluded from the train cohort immediately after the test block (must be ≥ 0; default 0).

Examples

# 5-fold purged CV with a 2-observation purge and a 1-observation embargo.
cvs = partition(X, PurgedKFold(5; purge = 2, embargo = 1); time = timestamps)

for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end

# Single-input shorthand when the timestamps are also the data.
cvs = partition(timestamps, PurgedKFold(4; purge = 1))

References

López de Prado, M. Advances in Financial Machine Learning. Wiley, 2018, §7.4 ("Purging and Embargoing").

DataSplits.RandomSplit - Type
RandomSplit <: AbstractSplitStrategy

Randomly splits data into the requested cohort sizes.

Examples

res = partition(X, RandomSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)

DataSplits.RepeatedKFold - Type
RepeatedKFold(k::Integer; n_repeats::Integer=10) <: AbstractCVStrategy

Repeated k-fold cross-validation. Runs KFold n_repeats times with a fresh random permutation each repeat, producing k * n_repeats folds in total. Mirrors scikit-learn's RepeatedKFold.

Each repeat is a full k-fold partition of the data; across repeats the fold assignments are independent random permutations. Use the same rng (with a fixed seed) to reproduce the full set of folds.

Fields

  • k::Int: Number of folds per repeat (must be ≥ 2 and ≤ N).
  • n_repeats::Int: Number of independent k-fold partitions (must be ≥ 1; default 10).

Examples

# 50 folds total (5 folds × 10 repeats).
cvs = partition(X, RepeatedKFold(5; n_repeats = 10);
                rng = MersenneTwister(42))

DataSplits.RepeatedStratifiedKFold - Type
RepeatedStratifiedKFold(k::Integer; n_repeats::Integer=10, bins::Integer=10) <: AbstractCVStrategy

Repeated stratified k-fold cross-validation. Runs StratifiedKFold n_repeats times with a fresh random permutation each repeat, producing k * n_repeats folds in total. Mirrors scikit-learn's RepeatedStratifiedKFold.

The target= keyword is required (or, by fallback, data itself plays that role); see StratifiedKFold for the stratification rule (unique values for discrete targets, quantile bins for floats).

Fields

  • k::Int: Number of folds per repeat (must be ≥ 2 and ≤ N).
  • n_repeats::Int: Number of independent stratified k-fold partitions (must be ≥ 1; default 10).
  • bins::Int: Number of quantile bins for floating-point targets (must be ≥ 2; default 10).

Examples

# Classification.
cvs = partition(X, RepeatedStratifiedKFold(5; n_repeats = 10);
                target = labels, rng = MersenneTwister(42))

# Regression with 4 quantile bins.
cvs = partition(X, RepeatedStratifiedKFold(5; n_repeats = 10, bins = 4);
                target = y_continuous)

DataSplits.SPXYSplit - Type
SPXYSplit(; metric_X=Euclidean(), metric_y=Euclidean())

Sample set Partitioning based on joint X–Y distance (SPXY).

A variant of Kennard–Stone where the joint distance matrix is the element-wise sum of the (normalised) pairwise distance matrices of X and y.
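
The joint distance can be sketched as follows, assuming that "normalised" means dividing each matrix by its own maximum entry, as in the original SPXY formulation. The helper spxy_distance is hypothetical, both metrics are fixed to Euclidean, and observations are taken to be the columns of X.

```julia
# Sketch of the joint SPXY distance: each pairwise distance matrix is divided
# by its own maximum before the two are summed element-wise.
# Assumes at least two distinct observations in X and in y (non-zero maxima).
function spxy_distance(X::AbstractMatrix, y::AbstractVector)
    N = size(X, 2)
    length(y) == N || throw(DimensionMismatch("X and y disagree on N"))
    DX = [sqrt(sum(abs2, X[:, i] - X[:, j])) for i in 1:N, j in 1:N]
    Dy = [abs(y[i] - y[j]) for i in 1:N, j in 1:N]
    return DX ./ maximum(DX) .+ Dy ./ maximum(Dy)
end

X = [0.0 1.0 4.0]          # three 1-D observations as columns
y = [0.0, 2.0, 1.0]
D = spxy_distance(X, y)    # D[1, 2] == 1.25, D[1, 3] == 1.5
```

Kennard-Stone selection then runs on D exactly as it would on a plain X distance matrix.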

Fields

  • metric_X::Distances.SemiMetric: Distance metric for X (default: Euclidean())
  • metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())

Examples

res = partition(X, SPXYSplit(); target=y, train=70, test=30)
res = partition(X, SPXYSplit(; metric_X=Mahalanobis(cov(X; dims=2)));
                target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)

See also

KennardStoneSplit — the classical variant that uses only X.

DataSplits.ShuffleSplit - Type
ShuffleSplit(n_splits::Integer) <: AbstractResamplingCVStrategy

Random permutation cross-validation. For each of the n_splits iterations, observations are randomly shuffled and the requested train and test cohort sizes are drawn from the head and the next slice of the permutation. Mirrors scikit-learn's ShuffleSplit, with the train/test sizes supplied at the partition call (in line with the rest of DataSplits' API).

Fields

  • n_splits::Int: Number of resamples (must be ≥ 1).

Notes

  • train and test sum to N, like the rest of the partition API — every observation is placed in exactly one cohort per resample. (sklearn lets you drop observations by setting train_size + test_size < 1; this package keeps the "all observations accounted for" invariant.)
  • Resamples are independent: the test cohorts of different resamples may overlap, so an observation can appear in test several times or not at all. KFold, by contrast, places every observation in test exactly once.

Examples

# Fractions.
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2)

# Absolute counts.
cvs = partition(X, ShuffleSplit(10); train = 80, test = 20)

# Reproducible.
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2,
                rng = MersenneTwister(42))
source
DataSplits.SphereExclusionResultType
SphereExclusionResult

Result of sphere exclusion clustering.

Fields

  • assignments::Vector{Int}: Cluster index per point (1-based).
  • radius::Float64: Exclusion radius used for clustering.
  • metric::Distances.SemiMetric: Distance metric used.
source
DataSplits.StratifiedGroupKFoldType
StratifiedGroupKFold(k::Integer; bins::Integer=10, shuffle::Bool=false) <: AbstractCVStrategy

Group-aware and stratified k-fold cross-validation. No group spans two folds (like GroupKFold), and per-fold class proportions are kept close to the global distribution (like StratifiedKFold). Mirrors scikit-learn's StratifiedGroupKFold.

Both target= and groups= keywords are required; neither falls back to data (no sensible default — they describe orthogonal properties of the same observations).

Fields

  • k::Int: Number of folds (must be ≥ 2 and ≤ number of unique groups).
  • bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.
  • shuffle::Bool: When true, the order in which groups are considered for fold assignment is shuffled using the rng passed to partition. When false (default), groups are processed in descending order of class-count vector norm — sklearn's deterministic balancing.

Algorithm

For each group, compute its per-class count vector. Process groups one by one (largest first by default). For each group, assign it to the fold that minimises the variance of per-class proportions across folds after the assignment — each class's fold counts are normalised by its global total so rare and abundant classes contribute on the same scale. This is the same greedy heuristic used by sklearn (y_counts_per_fold / y_distr).
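
The greedy step above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: greedy_group_folds is a hypothetical helper, "largest first" is taken to mean descending Euclidean norm of the count vector, the score is the mean over classes of the across-fold variance, and ties go to the lowest fold index.

```julia
using Statistics

# Hypothetical sketch. group_counts[g] is the per-class count vector of group g.
function greedy_group_folds(group_counts::Vector{Vector{Int}}, k::Int)
    n_classes = length(first(group_counts))
    totals = reduce(+, group_counts)            # global count per class
    fold_counts = zeros(Int, k, n_classes)      # class counts accumulated per fold
    assignment = zeros(Int, length(group_counts))
    # Assumption: process groups by descending norm of their count vector.
    order = sortperm(group_counts; by = c -> sum(abs2, c), rev = true)
    for g in order
        best_fold, best_score = 0, Inf
        for f in 1:k
            trial = copy(fold_counts)
            trial[f, :] .+= group_counts[g]
            # Normalise each class by its global total, then average the
            # across-fold variance over classes.
            score = mean(var(trial[:, c] ./ totals[c]; corrected = false) for c in 1:n_classes)
            if score < best_score               # strict <: ties keep the lower fold
                best_fold, best_score = f, score
            end
        end
        fold_counts[best_fold, :] .+= group_counts[g]
        assignment[g] = best_fold
    end
    return assignment
end

assignment = greedy_group_folds([[3, 1], [1, 3], [2, 2], [2, 2]], 2)
```

In this toy input the two class-skewed groups end up in different folds, which is the balancing behaviour the heuristic aims for.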

Notes

  • Each class should have at least k members and appear in at least k distinct groups; otherwise even the best assignment cannot represent that class in every fold. The algorithm does not raise in this case (neither does sklearn), but fold class coverage may be uneven for very rare classes.

Examples

# Classification with patient-level groups.
cvs = partition(X, StratifiedGroupKFold(5);
                target = labels, groups = patient_ids)

# Regression with quantile bins.
cvs = partition(X, StratifiedGroupKFold(5; bins = 4);
                target = y_continuous, groups = batch_ids)

# Shuffled ordering — different seeds give different fold compositions.
cvs = partition(X, StratifiedGroupKFold(5; shuffle = true);
                target = labels, groups = patient_ids,
                rng = MersenneTwister(42))
source
DataSplits.StratifiedKFoldType
StratifiedKFold(k::Integer; bins::Integer=10, shuffle::Bool=false) <: AbstractCVStrategy

Stratified k-fold cross-validation. Within each fold, every class (or quantile bin) is represented in roughly the same proportion as in the full dataset.

Targets are passed via the target= keyword (or, by fallback, data itself plays that role).

Fields

  • k::Int: Number of folds (must be ≥ 2).
  • bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.
  • shuffle::Bool: When true, member indices within each class are randomly permuted before round-robin assignment using the rng passed to partition, so different seeds yield different fold assignments. When false (default), assignment is fully deterministic.

Stratification rule

  • Discrete target (e.g. Int, Bool, Symbol, String): each unique value defines a class.
  • Float target: targets are binned into bins quantile-based bins; each bin defines a class.

Within each class, indices are distributed round-robin across the k folds, so every fold gets a near-equal share of every class.
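
A minimal sketch of the round-robin rule for a discrete target with shuffle = false. roundrobin_folds is a hypothetical helper, and restarting the deal at fold 1 for every class is an assumption; the package may continue from where the previous class stopped.

```julia
# Hypothetical helper: deal each class's indices across k test folds in turn.
function roundrobin_folds(labels::AbstractVector, k::Int)
    test_folds = [Int[] for _ in 1:k]
    for class in unique(labels)
        members = findall(==(class), labels)
        for (j, idx) in enumerate(members)
            # The j-th member of the class goes to fold mod1(j, k).
            push!(test_folds[mod1(j, k)], idx)
        end
    end
    return test_folds
end

labels = [:a, :a, :a, :a, :b, :b, :b, :b, :b, :b]
test_folds = roundrobin_folds(labels, 2)
```

Each fold receives two :a and three :b observations, matching the 4:6 class ratio of the full vector.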

Notes

  • Each class needs at least k members; otherwise a SplitParameterError is raised. For discrete targets this is a hard constraint; for binned continuous targets, reduce bins if you hit it.
  • When a continuous target has many repeated values (e.g. lots of zeros), quantile edges may collapse and effectively yield fewer than bins populated bins. Stratification still works, but bin coverage is uneven.

Examples

# Classification.
cvs = partition(X, StratifiedKFold(5); target = labels)

# Regression: 10 quantile bins (default).
cvs = partition(X, StratifiedKFold(5); target = y_continuous)

# Regression with a custom number of bins.
cvs = partition(X, StratifiedKFold(5; bins = 4); target = y_continuous)

# Shuffled — different seeds give different fold assignments.
cvs = partition(X, StratifiedKFold(5; shuffle = true); target = labels,
                rng = MersenneTwister(42))
source
DataSplits.StratifiedShuffleSplitType
StratifiedShuffleSplit(n_splits::Integer; bins::Integer=10) <: AbstractResamplingCVStrategy

Stratified resampling cross-validation. Combines the per-resample random-draw structure of ShuffleSplit with the class/quantile-bin balancing of StratifiedKFold: each resample preserves the global class proportions in both train and test cohorts.

Targets are passed via the target= keyword (or, by fallback, data itself plays that role).

Fields

  • n_splits::Int: Number of resamples (must be ≥ 1).
  • bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.

Stratification rule

Same as StratifiedKFold:

  • Discrete target (e.g. Int, Bool, Symbol, String): each unique value defines a class.
  • Float target: targets are binned into bins quantile-based bins.

Within each class/bin, members are randomly shuffled and the global train fraction is applied locally — round(Int, n_train * |class| / N) go to train, the rest to test. Rounding remainder is absorbed in the last class processed so totals match n_train / n_test exactly.
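
The per-class allocation can be sketched directly from the formula above. train_quotas is a hypothetical helper for illustration; it covers only the quota arithmetic, not the shuffling within each class.

```julia
# Hypothetical helper: number of train observations taken from each class.
function train_quotas(class_sizes::Vector{Int}, n_train::Int)
    N = sum(class_sizes)
    quotas = [round(Int, n_train * s / N) for s in class_sizes]
    # The last class processed absorbs the rounding remainder so that the
    # quotas sum to n_train exactly.
    quotas[end] += n_train - sum(quotas)
    return quotas
end

quotas = train_quotas([5, 5, 5], 10)
```

Here each class would round to 3 train members (9 in total), so the last class takes one extra to reach n_train = 10.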

Notes

  • n_train + n_test == N (project-wide invariant — sklearn allows dropping observations, this package does not).
  • Each class needs at least 2 members so that both cohorts can receive a representative; otherwise a SplitParameterError is raised. Reduce bins for continuous targets if you hit it.

Examples

# Classification.
cvs = partition(X, StratifiedShuffleSplit(10);
                target = labels, train = 0.8, test = 0.2)

# Regression: 10 quantile bins (default).
cvs = partition(X, StratifiedShuffleSplit(10);
                target = y_continuous, train = 0.8, test = 0.2)
source
DataSplits.TargetPropertySplitType
TargetPropertySplit(order::Symbol) <: AbstractSplitStrategy

Splits observations by sorting a 1D property vector and selecting the top or bottom slice as the training set.

Fields

  • order::Symbol: :high selects the largest values for training; :low selects the smallest.

Examples

# Highest values go to train; y provides the ordering.
res = partition(X, TargetPropertySplit(:high); target = y, train = 80, test = 20)
X_train, X_test = splitdata(res, X)

# Convenience aliases.
res = partition(X, TargetPropertyHigh(); target = y, train = 80, test = 20)
res = partition(X, TargetPropertyLow();  target = y, train = 80, test = 20)

# When y is both data and property.
res = partition(y, TargetPropertyHigh(); train = 80, test = 20)
source
DataSplits.TimeSeriesSplitType
TimeSeriesSplit(k::Integer; gap::Integer=0, max_train_size::Union{Nothing,Integer}=nothing) <: AbstractCVStrategy

Time-aware cross-validation. The temporal sequence is partitioned into k + 1 chronological chunks; fold i (1 ≤ i ≤ k) tests on chunk i + 1 and trains on the observations chronologically preceding it.

By default the train cohort expands across all earlier chunks. Pass max_train_size (in observations) to cap the train cohort, mirroring scikit-learn's TimeSeriesSplit: when set, each fold trains on at most the most recent max_train_size observations before the test chunk.

Train and test cohorts in the same fold are separated by gap observations (useful to avoid leakage between adjacent samples in autocorrelated series).
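
A minimal sketch of the resulting fold ranges, assuming N chronologically sorted observations with distinct timestamps (so the atomicity rule below never applies). timeseries_folds is a hypothetical helper, and the even chunk sizing is an assumption borrowed from the distribute_blocks description.

```julia
# Hypothetical sketch: index ranges of each fold over positions 1:N.
function timeseries_folds(N::Int, k::Int; gap::Int = 0, max_train_size = nothing)
    n_chunks = k + 1
    base, extra = divrem(N, n_chunks)
    # Assumption: the first `extra` chunks get one additional observation.
    chunk_end = cumsum([base + (c <= extra ? 1 : 0) for c in 1:n_chunks])
    return map(1:k) do i
        test = (chunk_end[i] + 1):chunk_end[i + 1]
        train_hi = chunk_end[i] - gap               # skip `gap` rows before test
        train_lo = max_train_size === nothing ? 1 : max(1, train_hi - max_train_size + 1)
        (train = train_lo:train_hi, test = test)
    end
end

expanding = timeseries_folds(12, 3)
rolling = timeseries_folds(12, 3; gap = 1, max_train_size = 4)
```

With 12 observations and k = 3, the expanding folds train on 1:3, 1:6, and 1:9; the rolling variant trains on 1:2, 2:5, and 5:8.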

Atomicity rule

Observations sharing the same timestamp are never split between train and test of the same fold — chunk boundaries always fall between distinct time values, mirroring TimeSplit. Chunk sizes are therefore measured in distinct time values, not in observations.

gap and max_train_size are measured in observations, matching sklearn's contract. When either falls inside a block of equal timestamps, that block is split on the train side — some rows are kept, the rest dropped. No row leaks into test (the test cohort still starts at the next chunk), but the train side is no longer block-aligned.

Fields

  • k::Int: Number of folds (must be ≥ 2).
  • gap::Int: Number of observations skipped from the end of the train cohort in each fold (must be ≥ 0; default 0).
  • max_train_size::Union{Nothing,Int}: When nothing (default), the train cohort expands across all earlier chunks. When an Int ≥ 1, the train cohort is capped to that many observations, taken from the most recent end (rolling window).

Notes

  • Requires at least k + 1 distinct time values.
  • A fold whose train cohort would be empty (because gap consumes it) raises SplitParameterError.

Examples

# Expanding window (default).
cvs = partition(X, TimeSeriesSplit(5); time = timestamps)

for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end

# Rolling window: train uses at most the last 100 observations.
cvs = partition(X, TimeSeriesSplit(5; max_train_size = 100); time = timestamps)

# Rolling window with a one-observation gap between train and test.
cvs = partition(X, TimeSeriesSplit(5; gap = 1, max_train_size = 100); time = timestamps)
source
DataSplits.TimeSplitType
TimeSplit(order::Symbol=:asc) <: AbstractSplitStrategy

Splits a 1D array of dates/times into train/test sets, grouping by unique values so that no group is split across train and test.

The actual training cohort size may slightly overshoot n_train but never fall below it.
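
One way the overshoot can arise is sketched below for the :asc order. atomic_train_indices is a hypothetical helper, not the package's code, and it does not guard against the cut consuming every observation.

```julia
# Hypothetical helper: train indices when equal timestamps must stay together.
function atomic_train_indices(times::AbstractVector, n_train::Int)
    perm = sortperm(times)                 # oldest first
    cut = n_train
    # Push the cut forward while it would separate two equal timestamps.
    while cut < length(perm) && times[perm[cut + 1]] == times[perm[cut]]
        cut += 1
    end
    return perm[1:cut]
end

times = [1, 1, 2, 2, 2, 3, 4]
train = atomic_train_indices(times, 3)
```

Requesting 3 train observations lands inside the block of timestamp 2, so the whole block is kept and the train cohort grows to 5.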

Fields

  • order::Symbol: :asc puts the oldest observations in train (default); :desc puts the newest in train.

Examples

# Oldest observations go to train (asc order).
res = partition(X, TimeSplit(:asc); time = dates, train = 70, test = 30)
X_train, X_test = splitdata(res, X)

# Convenience aliases.
res = partition(X, TimeSplitOldest(); time = dates, train = 70, test = 30)  # same as :asc
res = partition(X, TimeSplitNewest(); time = dates, train = 70, test = 30) # same as :desc

# When dates are the data themselves.
res = partition(dates, TimeSplitOldest(); train = 70, test = 30)
source
DataSplits.TrainTestSplitType
TrainTestSplit

A result type representing a train/test split.

Fields

  • train: Indices of training samples.
  • test: Indices of test samples.

Examples

res = partition(X, KennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)
source
DataSplits.TrainValTestSplitType
TrainValTestSplit

A result type representing a train/validation/test split.

Fields

  • train: Indices of training samples.
  • val: Indices of validation samples.
  • test: Indices of test samples.

Examples

res = partition(X, RandomSplit(), KennardStoneSplit();
                train = 70, validation = 10, test = 20)
X_train, X_val, X_test = splitdata(res, X)
source
DataSplits.ValidFractionType
ValidFraction{T<:Real}

A wrapper type guaranteeing a real value strictly in (0, 1).

Arithmetic with plain numbers delegates to the underlying value so it can be used transparently in formulas.
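
The pattern can be illustrated with a stand-in type. UnitFraction, its value field, and the ArgumentError are hypothetical choices made for this sketch; the fields, operator coverage, and error type of the actual ValidFraction are not documented here.

```julia
# Hypothetical stand-in for a validated-fraction wrapper.
struct UnitFraction{T<:Real}
    value::T
    function UnitFraction(x::T) where {T<:Real}
        # Reject anything outside the open interval (0, 1) at construction.
        0 < x < 1 || throw(ArgumentError("fraction must lie strictly in (0, 1), got $x"))
        return new{T}(x)
    end
end

# Arithmetic with plain numbers delegates to the wrapped value.
Base.:*(f::UnitFraction, x::Number) = f.value * x
Base.:*(x::Number, f::UnitFraction) = x * f.value

f = UnitFraction(0.8)
n_train = f * 10
```

Validating once in the constructor means downstream formulas can use the value without re-checking its range.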

source
DataSplits._assert_partitionableMethod
_assert_partitionable(data) -> N

Validate that data is non-empty and has at least 2 observations. Returns numobs(data) for downstream use. Used by every partition method to guarantee a meaningful split is possible.

source
DataSplits._assert_unit_fraction_sumMethod
_assert_unit_fraction_sum(fractions::ValidFraction...)

Assert that the supplied validated fractions form a complete partition.

Throws SplitParameterError if the sum of the wrapped fraction values is not approximately equal to 1.

source
DataSplits._blocked_cv_partitionMethod
_blocked_cv_partition(data, k, pre_gap, post_gap; time, name) -> CrossValidationSplit

Shared implementation for contiguous-block k-fold CV strategies. Sorts observations by time, distributes the k blocks, and for each fold uses everything outside [test_lo - pre_gap, test_hi + post_gap] as the train cohort.

name is used only in error messages to identify the calling strategy.
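
The train-cohort rule above can be sketched on sorted positions 1:N. blocked_train is a hypothetical helper, and clamping the excluded window to 1:N is an assumption.

```julia
# Hypothetical helper: positions left for training once the test block and
# its surrounding gaps are excluded.
function blocked_train(N::Int, test_lo::Int, test_hi::Int, pre_gap::Int, post_gap::Int)
    excluded = max(1, test_lo - pre_gap):min(N, test_hi + post_gap)
    return setdiff(1:N, excluded)
end

train = blocked_train(10, 4, 6, 1, 2)
```

With a test block of 4:6, pre_gap = 1 and post_gap = 2, positions 3:8 are excluded and training uses 1, 2, 9, and 10.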

source
DataSplits._resolve_sizesMethod
_resolve_sizes(N, train, validation, test) -> (n_train, n_val, n_test)

Validate and resolve cohort sizes.

Integer form — two interpretations, distinguished by the sum:

  • sum == 100: values are percentages of N.
  • sum == N: values are absolute counts.

Any other sum is rejected.

Float form — values must each be in (0, 1) and sum to approximately 1.0.

When validation === nothing, n_val == 0 and only train and test cohorts are produced. Rounding remainder is absorbed by n_test.
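
The rules above can be sketched for the two-cohort case. resolve_two_sizes is a hypothetical re-implementation; rounding n_train to the nearest integer is an assumption, and it throws ArgumentError where the package presumably raises its own error type.

```julia
# Hypothetical sketch of the integer form: counts if the sum is N,
# percentages if the sum is 100, an error otherwise.
function resolve_two_sizes(N::Int, train::Integer, test::Integer)
    total = train + test
    if total == N
        return (Int(train), Int(test))
    elseif total == 100
        n_train = round(Int, N * train / 100)   # assumption: round to nearest
        return (n_train, N - n_train)           # n_test absorbs the remainder
    end
    throw(ArgumentError("integer sizes must sum to 100 or to N = $N"))
end

# Float form: each value in (0, 1), summing to approximately 1.
function resolve_two_sizes(N::Int, train::AbstractFloat, test::AbstractFloat)
    (0 < train < 1 && 0 < test < 1) || throw(ArgumentError("fractions must lie in (0, 1)"))
    isapprox(train + test, 1.0) || throw(ArgumentError("fractions must sum to 1"))
    n_train = round(Int, N * train)
    return (n_train, N - n_train)
end

from_percent = resolve_two_sizes(250, 80, 20)
from_counts = resolve_two_sizes(250, 200, 50)
from_floats = resolve_two_sizes(7, 0.8, 0.2)
```

When N == 100 the two integer interpretations coincide, so the sum-based rule is unambiguous.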

source
DataSplits._to_feature_matrixMethod

Convert a Tables.jl-compatible input (e.g. DataFrame) to a features×samples matrix (F×N), which is the internal convention for distance-based strategies. Non-table inputs are returned unchanged.

source
DataSplits._warn_undershootMethod
_warn_undershoot(n_selected, n_requested, msg; id)

Emit a @warn when fewer samples were selected than requested. id is used as the log record _id for selective filtering with LoggingExtras.EarlyFilteredLogger.

source
DataSplits._within_group_stdMethod
_within_group_std(data, idxs) -> Float64

Compute the average per-feature standard deviation of observations at idxs within data, container-agnostically. Used by :neyman allocation to weight groups by within-group dispersion.

Tables.jl inputs are converted to an F×N matrix; Vectors are treated as 1D feature streams. Singleton groups return 0.0 to avoid NaN.

source
DataSplits.consumesMethod
consumes(alg::AbstractSplitStrategy) -> NTuple{N, Symbol}

Return the named slots this strategy reads, as a tuple of symbols from (:data, :target, :time, :groups).

source
DataSplits.distance_matrixMethod
distance_matrix(X, metric::Distances.SemiMetric)

Computes the full symmetric pairwise distance matrix for the dataset X using the given metric.

Arguments

  • X: Data matrix or container. Columns are samples (features × samples).
  • metric::Distances.SemiMetric: Distance metric from Distances.jl.

Returns

  • D::Matrix{Float64}: Matrix where D[i, j] = metric(xᵢ, xⱼ) and D[i, j] == D[j, i].

Notes

  • For custom containers, getobs(X, i) is used to access samples.
  • The matrix is symmetric and not normalized.
source
DataSplits.distribute_blocksMethod
distribute_blocks(B::Int, n_chunks::Int) -> chunk_block_end

Distribute B contiguous blocks across n_chunks as evenly as possible (matching numpy.array_split semantics: the remainder is spread over the first B mod n_chunks chunks). Returns a vector of length n_chunks where chunk_block_end[c] is the index of the last block in chunk c. Chunk sizes differ by at most 1.
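
The array_split-style distribution can be sketched as follows. chunk_ends is a hypothetical helper written for illustration, not the package's function.

```julia
# Hypothetical helper: index of the last block in each chunk.
function chunk_ends(B::Int, n_chunks::Int)
    base, extra = divrem(B, n_chunks)
    # The first `extra` chunks receive one additional block.
    sizes = [base + (c <= extra ? 1 : 0) for c in 1:n_chunks]
    return cumsum(sizes)
end

ends = chunk_ends(10, 3)
```

Ten blocks over three chunks give sizes 4, 3, 3, so the chunk ends are 4, 7, and 10.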

source
DataSplits.fallback_from_dataMethod
fallback_from_data(alg::AbstractSplitStrategy) -> NTuple{N, Symbol}

Return the subset of consumes(alg) whose keyword may be omitted in partition, in which case data itself fills that slot.

Must satisfy: fallback_from_data(alg) ⊆ consumes(alg).

source
DataSplits.find_maximin_elementMethod
find_maximin_element(distances::AbstractMatrix{T},
                    source_set::Union{AbstractVector{Int},AbstractSet{Int}},
                    reference_set::Union{AbstractVector{Int},AbstractSet{Int}}) -> Int

Finds the element in source_set that maximizes the minimum distance to all elements in reference_set.

Arguments

  • distances::AbstractMatrix{T}: Precomputed, symmetric pairwise distance matrix (N×N).
  • source_set::Union{AbstractVector{Int},AbstractSet{Int}}: Indices to evaluate.
  • reference_set::Union{AbstractVector{Int},AbstractSet{Int}}: Indices to compare against.

Returns

  • Int: Index in source_set that is farthest from its nearest neighbor in reference_set.

Notes

  • Throws ArgumentError if reference_set is empty.
  • Breaks ties by returning the first maximum.
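
A minimal sketch of the maximin selection, including the two behaviours listed in the notes. maximin_index is a hypothetical helper, not the package's function.

```julia
# Hypothetical helper: index in source_set whose nearest reference is farthest.
function maximin_index(distances::AbstractMatrix, source_set, reference_set)
    isempty(reference_set) && throw(ArgumentError("reference_set must not be empty"))
    best_idx, best_val = 0, -Inf
    for i in source_set
        # Distance from candidate i to its nearest reference element.
        nearest = minimum(distances[i, j] for j in reference_set)
        if nearest > best_val        # strict > keeps the first maximum on ties
            best_idx, best_val = i, nearest
        end
    end
    return best_idx
end

D = [0.0 1.0 4.0 3.0;
     1.0 0.0 2.0 5.0;
     4.0 2.0 0.0 6.0;
     3.0 5.0 6.0 0.0]
chosen = maximin_index(D, [3, 4], [1, 2])
```

Candidate 3 is 2.0 from its nearest reference and candidate 4 is 3.0, so candidate 4 is selected.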
source
DataSplits.foldsMethod
folds(res::CrossValidationSplit) -> Vector{<:AbstractSplitResult}

Return the individual fold results from a cross-validation split.

source
DataSplits.group_offsetsMethod
group_offsets(sorted_keys, perm, v) -> block_offset

Compute block-boundary offsets for a grouped, sorted permutation.

block_offset[b]+1 : block_offset[b+1] are the positions in perm (equivalently the slice of v[perm]) whose value equals sorted_keys[b]. block_offset[1] == 0 and block_offset[end] == length(perm).

source
DataSplits.groupsortpermMethod
groupsortperm(v) -> (sorted_keys, perm)

Return the sorted unique values of v and a stable sort permutation of v.

perm is a permutation of 1:length(v) such that v[perm] is non-decreasing. sorted_keys == unique(v[perm]). Every index in 1:length(v) appears in perm exactly once, so the runs of equal values in v[perm] partition the indices, one run per entry of sorted_keys.
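
The relationship between these values and the offsets described under group_offsets can be reproduced with Base functions alone. This is a sketch of the contract, not the package's implementation.

```julia
v = [3, 1, 3, 2, 1]

perm = sortperm(v)                # stable: equal values keep their index order
sorted_keys = unique(v[perm])     # one entry per distinct value, ascending

# block_offset[b] is the number of entries that sort before key b, so
# positions block_offset[b]+1 : block_offset[b+1] of perm hold key b.
block_offset = [0; cumsum([count(==(key), v) for key in sorted_keys])]

# Original indices of the observations whose value is sorted_keys[3].
block3 = perm[block_offset[3]+1:block_offset[4]]
```

For this input the third key is 3, and block3 recovers indices 1 and 3.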

source
DataSplits.innerfoldsMethod
innerfolds(f::NestedFold) -> CrossValidationSplit

Return the inner cross-validation split associated with the outer fold f. Inner fold indices are absolute (into 1:N), so they can be used directly against the original data without further remapping.

Example

cvs = partition(X, NestedCV(KFold(5), KFold(3)))
for outerfold in folds(cvs)
    X_outer_train, X_outer_test = splitdata(outerfold, X)
    for (X_tr, X_val) in splitview(innerfolds(outerfold), X)
        # hyperparameter tuning on the outer training cohort
    end
end
source
DataSplits.partitionMethod
partition(data, alg::AbstractCVStrategy;
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> CrossValidationSplit

Produce a cross-validation split: a CrossValidationSplit wrapping one result per fold.

Unlike the train/test and train/val/test forms, this method does not accept train / test / validation keywords — fold sizes are fixed by the strategy (typically via k). Resampling strategies that do take caller-set cohort sizes subtype AbstractResamplingCVStrategy and dispatch to a separate partition method.

Auxiliary slots

  • target: response/property vector (e.g. for StratifiedKFold).
  • time: temporal ordering vector (e.g. for TimeSeriesSplit).
  • groups: group-membership vector (e.g. for GroupKFold).

Examples

cvs = partition(X, GroupKFold(5); groups = patient_ids)
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end
source
DataSplits.partitionMethod
partition(data, alg::AbstractResamplingCVStrategy;
          train, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> CrossValidationSplit

Resampling cross-validation: each fold is an independent random train/test split sized by the caller, and n_splits independent resamples are produced. Used by ShuffleSplit, StratifiedShuffleSplit, and GroupShuffleSplitCV.

train and test follow the same conventions as the train/test partition form (percentages, absolute counts, or (0,1) fractions summing to 1).

Examples

cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2)
source
DataSplits.partitionMethod
partition(data, alg, val_alg;
          train, validation, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> TrainValTestSplit

Split data into train, validation, and test cohorts using two strategies.

alg separates the test cohort from the rest; val_alg then separates the validation cohort from the remaining train pool.

Cohort sizes (train, validation, test)

Integers are accepted in two ways:

  • Percentages — values sum to 100.
  • Absolute counts — values sum to N = numobs(data).

Floats in (0, 1) summing to 1.0 are also accepted.

Examples

partition(X, RandomSplit(), KennardStoneSplit();
          train = 70, validation = 10, test = 20)
partition(X, RandomSplit(), KennardStoneSplit();
          train = 0.7, validation = 0.1, test = 0.2)
source
DataSplits.partitionMethod
partition(data, alg;
          train, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> TrainTestSplit

Split data into train and test cohorts according to alg.

Cohort sizes (train, test)

Integers are accepted in two ways:

  • Percentages — values sum to 100.
  • Absolute counts — values sum to N = numobs(data).

Floats in (0, 1) summing to 1.0 are also accepted and converted to counts.

Auxiliary slots

  • target: response/property vector (e.g. for SPXYSplit).
  • time: temporal ordering vector (e.g. for TimeSplit).
  • groups: group-membership vector (e.g. for GroupShuffleSplit).

Examples

partition(X, KennardStoneSplit(); train = 80, test = 20)
partition(X, RandomSplit(); train = 0.8, test = 0.2)
partition(X, SPXYSplit(); target = y, train = 80, test = 20)
source
DataSplits.rowpairsMethod
rowpairs(data, alg::AbstractSplitStrategy; kwargs...)

Convenience wrapper equivalent to:

rowpairs(partition(data, alg; kwargs...))
source
DataSplits.rowpairsMethod
rowpairs(res) -> Vector{Tuple{Vector{Int}, Vector{Int}}}

Convert a split result into the index-pair format accepted by MLJ's evaluate! resampling= keyword.

  • CrossValidationSplit → one (train, test) pair per fold.
  • TrainTestSplit → a single-element vector [(train, test)].
  • TrainValTestSplit → a single-element vector [(train, val)] (validation cohort, not test).

Example

cvs = partition(X, StratifiedKFold(5); target = y)
mach = machine(model, X, y)
evaluate!(mach; resampling = rowpairs(cvs), measure = accuracy)
source
DataSplits.sphere_exclusionMethod
sphere_exclusion(data; radius::Real, metric::Distances.SemiMetric=Euclidean()) -> SphereExclusionResult

Clusters samples in data using the sphere exclusion algorithm.

Arguments

  • data: Data matrix or container. Columns are samples.
  • radius::Real: Exclusion radius (normalized to [0, 1]).
  • metric::Distances.SemiMetric: Distance metric (default: Euclidean()).

Returns

  • SphereExclusionResult: Clustering result with assignments, radius, and metric.

Notes

  • The distance matrix is normalized to [0, 1] before clustering.
  • Each cluster contains all points within radius of the cluster center.

Examples

result = sphere_exclusion(X; radius=0.2)
assignments = result.assignments
source
DataSplits.splitdataMethod
splitdata(res, data)

Materialise the split: return a tuple of data subsets corresponding to the train/test (and optionally validation) indices in res.

When data is a DataFrame or other Tables.jl-compatible container, splitdata returns subsets of the same type.

source
DataSplits.splitviewMethod
splitview(res, data)

Like splitdata but returns lazy views via MLUtils.obsview — no data is copied. Prefer splitdata when you need independent copies.

source
DataSplits.testviewFunction
trainview(res, data...)
testview(res, data...)
valview(res, data...)

Return lazy views of the requested cohort for one or more data sources.

When called with a single data source, returns the view directly. When called with two or more, returns a Tuple of views — suitable for passing directly to Flux.DataLoader or similar.

valview is only defined for TrainValTestSplit.

For CrossValidationSplit, each function returns a Vector (one element per fold).

Examples

# Single source
X_train = trainview(split, X)
X_test  = testview(split, X)

# Multiple sources — tuple destructures naturally
X_train, y_train = trainview(split, X, y)
X_test,  y_test  = testview(split, X, y)

# Flux DataLoader — tuple passed directly
loader = Flux.DataLoader(trainview(split, X, y); batchsize = 64, shuffle = true)

# Train/val/test
X_train, y_train = trainview(split3, X, y)
X_val,   y_val   = valview(split3,   X, y)
X_test,  y_test  = testview(split3,  X, y)

# Cross-validation
for (X_tr, y_tr) in trainview(cvs, X, y)
    loader = Flux.DataLoader((X_tr, y_tr); batchsize = 64)
    # ...
end
source
DataSplits.trainviewMethod
trainview(res, data...)
testview(res, data...)
valview(res, data...)

Shares its docstring with testview; see the DataSplits.testview entry above for the description and examples.
source
DataSplits.valviewFunction
trainview(res, data...)
testview(res, data...)
valview(res, data...)

Shares its docstring with testview; see the DataSplits.testview entry above for the description and examples.
source