Reference
DataSplits.AbstractCVStrategy — Type
AbstractCVStrategy <: AbstractSplitStrategy
Abstract supertype for cross-validation strategies: strategies that produce a CrossValidationSplit (a vector of folds) rather than a single train/test or train/val/test partition.
Dispatching on this subtype selects the partition(data, alg::AbstractCVStrategy; …) method, which does not accept train / test / validation keywords: fold sizes are determined by the strategy itself (typically through k).
To implement a custom CV strategy:
- Subtype this and define _partition(data, alg; target, time, groups, rng, kwargs...) returning a CrossValidationSplit.
- Declare consumes and (optionally) fallback_from_data exactly as for any AbstractSplitStrategy.
DataSplits.AbstractResamplingCVStrategy — Type
AbstractResamplingCVStrategy <: AbstractCVStrategy
Abstract supertype for resampling cross-validation strategies: strategies whose folds are independent random train/test splits sized by the caller, rather than fixed slices of a deterministic partition.
Subtyping this routes calls to the partition(data, alg; train, test, …) method, which forwards the resolved n_train / n_test to _partition. Used by ShuffleSplit, StratifiedShuffleSplit, and GroupShuffleSplitCV.
DataSplits.AbstractSplitResult — Type
AbstractSplitResult
Abstract supertype for all split result types.
DataSplits.AbstractSplitStrategy — Type
AbstractSplitStrategy
Abstract supertype for all splitting strategies.
To implement a custom train/test (or train/val/test) strategy, subtype this and define:
- _partition(data, alg; n_train, n_test, target, time, groups, rng) returning an AbstractSplitResult.
- consumes(::MyStrategy) returning a tuple of symbols from (:data, :target, :time, :groups).
- fallback_from_data(::MyStrategy) returning the subset of consumes that can fall back to data.
For cross-validation strategies (returning CrossValidationSplit) see AbstractCVStrategy — the contract there omits n_train / n_test.
DataSplits.BlockedCV — Type
BlockedCV(k::Integer; gap::Integer=0) <: AbstractCVStrategy
Blocked k-fold cross-validation for dependent (time- or space-ordered) data. Observations are sorted by time= and partitioned into k contiguous chronological blocks; each block takes a turn as the test cohort while the train cohort is everything else, including both the blocks preceding it and the blocks following it.
This differs from TimeSeriesSplit (forward-only, train always precedes test) and from KFold (no temporal ordering). It matches the "blocked CV" used in time-series / spatial-statistics literature (Bergmeir & Benítez 2012, Roberts et al. 2017) when train samples should not be drawn from arbitrary positions but the test block must still be embedded in a longer train history.
A gap window (in observations) is removed from the train cohort on both sides of the test block to mitigate leakage from autocorrelation.
Atomicity rule
Observations sharing the same timestamp are never split between train and test of the same fold — block boundaries fall between distinct time values, mirroring TimeSeriesSplit and TimeSplit. gap is measured in observations and may trim partial blocks of equal timestamps from the train side; no row ever leaks into test.
Fields
k::Int: Number of folds (must be ≥ 2 and ≤ number of distinct time values).
gap::Int: Number of observations excluded from the train cohort on each side of the test block (must be ≥ 0; default 0).
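The block-plus-gap rule can be sketched in plain Julia. This is only an illustration under stated assumptions, not the package's implementation: blocked_folds is a hypothetical helper, the rows are assumed to be already in chronological order with distinct timestamps, and the way leftover rows are spread over the leading blocks is an assumption.

```julia
# Hypothetical helper, not part of DataSplits: blocked k-fold indices for
# n rows that are already sorted by time and have distinct timestamps.
function blocked_folds(n::Integer, k::Integer; gap::Integer = 0)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    gap >= 0 || throw(ArgumentError("gap must be non-negative"))
    base, extra = divrem(n, k)        # assumed: the first `extra` blocks get one more row
    folds = Tuple{Vector{Int},Vector{Int}}[]
    start = 1
    for i in 1:k
        stop = start + base + (i <= extra ? 1 : 0) - 1
        test = collect(start:stop)
        # Train is everything outside the test block, minus `gap` rows on each side.
        train = [j for j in 1:n if j < start - gap || j > stop + gap]
        push!(folds, (train, test))
        start = stop + 1
    end
    return folds
end

folds = blocked_folds(10, 5; gap = 1)
train, test = folds[2]    # test is rows 3:4; rows 2 and 5 are dropped from train
```

With gap = 1 the second fold trains on rows 1 and 6:10, so the rows adjacent to the test block on either side are excluded.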
Examples
# 5-fold blocked CV with a one-observation embargo on both sides.
cvs = partition(X, BlockedCV(5; gap = 1); time = timestamps)
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end
# Single-input shorthand when the timestamps are also the data.
cvs = partition(timestamps, BlockedCV(4))
DataSplits.BootstrapSplit — Type
BootstrapSplit(n_splits::Integer) <: AbstractCVStrategy
Bootstrap cross-validation. For each of n_splits iterations the train cohort is drawn from 1:N with replacement (so it has exactly N indices, with duplicates), and the test cohort is the set of out-of-bag (OOB) observations: the unique indices that were not sampled. On average about (1 - 1/e) ≈ 63.2% of unique observations land in train; the remaining ~36.8% form the OOB test.
Fields
n_splits::Int: Number of bootstrap resamples (must be ≥ 1).
Important notes
- Train contains duplicates by design. This is the defining property of bootstrap sampling: it lets the model see the variance introduced by resampling the empirical distribution. If you need unique-only train indices, use ShuffleSplit instead.
- Test (OOB) size varies fold-to-fold. It is whatever the bootstrap left out; no caller-set size is honoured.
- Cohort sizes (train, test) are not accepted: train is always N (with replacement), test is the OOB.
- The test cohort is non-empty with probability 1 - N! / N^N. The OOB is empty only when all N indices are drawn at least once, which has probability N! / N^N and shrinks rapidly as N grows.
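As a minimal sketch of one resample, assuming nothing about the package internals (bootstrap_resample is a hypothetical helper written only for this illustration):

```julia
using Random

# Hypothetical helper, not part of DataSplits: one bootstrap resample of 1:n.
function bootstrap_resample(rng::AbstractRNG, n::Integer)
    train = rand(rng, 1:n, n)      # n draws with replacement, duplicates expected
    test = setdiff(1:n, train)     # out-of-bag: indices that were never drawn
    return train, test
end

rng = MersenneTwister(42)
train, test = bootstrap_resample(rng, 1_000)
oob_fraction = length(test) / 1_000   # should land near 1/e ≈ 0.368
```

The train vector always has length N, while the OOB size changes from resample to resample.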
Examples
# 50 bootstrap resamples.
cvs = partition(X, BootstrapSplit(50); rng = MersenneTwister(42))
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train)     # X_train has N obs, with duplicates
    evaluate(model, X_test)  # X_test is the OOB
end
DataSplits.CADEXSplit — Type
CADEXSplit
Alias for KennardStoneSplit.
DataSplits.CrossValidationSplit — Type
CrossValidationSplit
A result type representing a k-fold cross-validation split.
Fields
folds::Vector{<:AbstractSplitResult}: One result per fold.
DataSplits.GroupKFold — Type
GroupKFold(k::Integer; shuffle::Bool=false) <: AbstractCVStrategy
Group-aware k-fold cross-validation. Whole groups are assigned to a single fold; no group ever appears in both the train and test cohort of the same fold. Equivalent in spirit to scikit-learn's GroupKFold.
Groups are passed via the groups= keyword (or, by fallback, data itself plays that role).
Fields
k::Int: Number of folds (must be ≥ 2 and ≤ number of unique groups).
shuffle::Bool: When true, the order in which groups are considered for fold assignment is shuffled using the rng passed to partition, so different RNG seeds yield different fold compositions. When false (default), assignment is deterministic and reproducible without an rng.
Notes
- Each candidate group is placed in the currently smallest fold, so observation counts across folds stay roughly balanced whether or not shuffling is enabled. This mirrors scikit-learn's GroupKFold.
- When shuffle=false, groups are processed in descending order of size (the classic deterministic balancing). When shuffle=true, they are processed in a randomly permuted order; folds remain balanced but no longer follow size order.
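The deterministic balancing rule can be illustrated with a short sketch. The helper assign_groups_to_folds is hypothetical, and the tie-breaking (first appearance among equally sized groups, lowest index among equally sized folds) is an assumption made for the example rather than documented behaviour.

```julia
# Hypothetical helper, not part of DataSplits: assign each group to a fold,
# largest group first, always filling the currently smallest fold.
function assign_groups_to_folds(groups::AbstractVector, k::Integer)
    counts = Dict{eltype(groups),Int}()
    for g in groups
        counts[g] = get(counts, g, 0) + 1
    end
    k <= length(counts) || throw(ArgumentError("k exceeds the number of groups"))
    order = unique(groups)                               # first-appearance order
    sort!(order; by = g -> -counts[g], alg = MergeSort)  # stable sort, largest first
    fold_of = Dict{eltype(groups),Int}()
    fold_sizes = zeros(Int, k)
    for g in order
        f = argmin(fold_sizes)       # first of the smallest folds
        fold_of[g] = f
        fold_sizes[f] += counts[g]
    end
    return fold_of, fold_sizes
end

groups = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
fold_of, fold_sizes = assign_groups_to_folds(groups, 2)
```

Here the groups of sizes 4, 3, 2 and 1 end up as two folds of five observations each, with groups 1 and 4 in the first fold and groups 2 and 3 in the second.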
Examples
# Deterministic (default).
cvs = partition(X, GroupKFold(5); groups = patient_ids)
# Shuffled — different seeds give different fold compositions.
cvs = partition(X, GroupKFold(5; shuffle = true);
                groups = patient_ids, rng = MersenneTwister(42))
for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end
# Fallback: ids are simultaneously the data and the groups.
cvs = partition(patient_ids, GroupKFold(5))
DataSplits.GroupShuffleSplit — Type
GroupShuffleSplit() <: AbstractSplitStrategy
Group-aware train/test splitting. Accumulates whole groups into the training set (in random order) until the requested training size is reached.
Groups are passed as a vector of membership IDs via the groups= keyword. Any grouping is valid: cluster assignments, patient IDs, scaffold labels, batch numbers, site identifiers, graph communities, etc.
Notes
Because groups are added whole, the actual train cohort size may overshoot the requested n_train. No attempt is made to minimise this overshoot.
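The overshoot is easy to see in a small sketch. The helper group_shuffle_indices is hypothetical and only mimics the accumulation rule described above; it is not the package's code.

```julia
using Random

# Hypothetical helper, not part of DataSplits: add whole groups to train,
# in random order, until at least n_train observations have been collected.
function group_shuffle_indices(rng::AbstractRNG, groups::AbstractVector, n_train::Integer)
    order = shuffle(rng, unique(groups))
    train = Int[]
    for g in order
        length(train) >= n_train && break
        append!(train, findall(==(g), groups))
    end
    test = setdiff(eachindex(groups), train)
    return sort!(train), test
end

groups = repeat(1:5; inner = 4)     # 5 groups of 4 observations each
train, test = group_shuffle_indices(MersenneTwister(1), groups, 10)
```

With groups of four and a requested train size of 10, three whole groups are needed, so the train cohort holds 12 observations whichever groups the shuffle picks.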
Examples
# ids is both data and groups
res = partition(ids, GroupShuffleSplit(); train=80, test=20)
# X is split; group membership provided separately
res = partition(X, GroupShuffleSplit(); groups=patient_ids, train=80, test=20)
X_train, X_test = splitdata(res, X)
# With Clustering.jl
using Clustering
res = partition(X, GroupShuffleSplit();
                groups=assignments(kmeans(X, 5)), train=80, test=20)
DataSplits.GroupShuffleSplitCV — Type
GroupShuffleSplitCV(n_splits::Integer) <: AbstractResamplingCVStrategy
Group-aware random permutation cross-validation. For each of n_splits iterations the groups are shuffled and assigned whole into the train cohort until the requested training size is reached; the remaining groups form the test cohort. Mirrors scikit-learn's GroupShuffleSplit.
Groups are passed as a vector of membership IDs via the groups= keyword (or, by fallback, data itself plays that role).
Fields
n_splits::Int: Number of resamples (must be ≥ 1).
Notes
- Resamples are independent: a group can land in train in one fold and test in another. This is the defining property versus GroupKFold, where each group appears in exactly one test cohort across folds.
- Because groups are added whole, the actual train cohort size may overshoot the requested n_train (same behaviour as the 2-cohort GroupShuffleSplit).
- train and test must sum to N; every observation is placed in exactly one cohort per resample.
Examples
# Fractions.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 0.8, test = 0.2)
# Absolute counts.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 80, test = 20)
# Reproducible.
cvs = partition(X, GroupShuffleSplitCV(10);
                groups = patient_ids, train = 0.8, test = 0.2,
                rng = MersenneTwister(42))
# Fallback: ids are simultaneously the data and the groups.
cvs = partition(patient_ids, GroupShuffleSplitCV(10);
                train = 0.8, test = 0.2)
DataSplits.GroupStratifiedSplit — Type
GroupStratifiedSplit(allocation::Symbol; n=nothing) <: AbstractSplitStrategy
Group-stratified train/test splitting with flexible allocation methods.
Groups are passed as a vector of membership IDs via the groups= keyword. Any grouping is valid: cluster assignments, patient IDs, scaffold labels, batch numbers, site identifiers, graph communities, etc.
Fields
allocation::Symbol: Allocation method, one of :equal, :proportional, or :neyman.
n::Union{Nothing,Int}: Samples per group for :equal and :neyman.
Allocation methods
- :proportional uses all samples from each group (shuffled).
- :equal selects n samples from each group (requires n).
- :neyman selects samples in proportion to group size × within-group std (requires n).
The training fraction within each group is derived from the global cohort sizes (n_train / N).
Examples
res = partition(X, GroupStratifiedSplit(:proportional);
                groups=patient_ids, train=80, test=20)
X_train, X_test = splitdata(res, X)
# With Clustering.jl
using Clustering
res = partition(X, GroupStratifiedSplit(:equal; n=5);
                groups=assignments(kmeans(X, 4)), train=80, test=20)
References
May, R. J.; Maier, H. R.; Dandy, G. C. Data Splitting for Artificial Neural Networks Using SOM-Based Stratified Sampling. Neural Networks 2010, 23(2), 283–294.
DataSplits.KFold — Type
KFold(k::Integer; shuffle::Bool=false) <: AbstractCVStrategy
Standard k-fold cross-validation. Splits the dataset into k roughly equal folds; each fold takes a turn as the test set while the remaining folds form the training set. Equivalent in spirit to scikit-learn's KFold.
Fields
k::Int: Number of folds (must be ≥ 2 and ≤ number of observations).
shuffle::Bool: When true, observations are randomly permuted before folding using the rng passed to partition, so different seeds yield different fold assignments. When false (default), observations are assigned in order and the split is fully deterministic.
Notes
- Fold sizes differ by at most 1 observation (the first n mod k folds are one sample larger), mirroring scikit-learn's behaviour.
- Unlike GroupKFold, no extra keyword arguments are required.
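The fold-size rule can be checked with a few lines of plain Julia. The helper kfold_test_ranges is hypothetical and covers only the unshuffled case.

```julia
# Hypothetical helper, not part of DataSplits: contiguous test ranges for
# unshuffled k-fold, where the first n mod k folds are one observation larger.
function kfold_test_ranges(n::Integer, k::Integer)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    base, extra = divrem(n, k)
    ranges = UnitRange{Int}[]
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

ranges = kfold_test_ranges(10, 3)         # 1:4, 5:7, 8:10
train_first = setdiff(1:10, ranges[1])    # train cohort of the first fold
```

For 10 observations and 3 folds, 10 mod 3 = 1, so only the first fold gets the extra observation.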
Examples
# Deterministic (default).
cvs = partition(X, KFold(5))
# Shuffled — different seeds give different fold assignments.
cvs = partition(X, KFold(5; shuffle = true); rng = MersenneTwister(42))
for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end
DataSplits.KennardStoneSplit — Type
KennardStoneSplit <: AbstractSplitStrategy
In-memory Kennard-Stone (CADEX) algorithm for train/test splitting.
Precomputes the full N×N distance matrix; prefer LazyKennardStoneSplit for large datasets where that is prohibitive.
Fields
metric::Distances.SemiMetric: Distance metric (default: Euclidean())
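For orientation, the classical Kennard-Stone selection can be sketched as follows. This is a textbook-style illustration, not the package's implementation: kennard_stone_indices is a hypothetical helper, the distance is hard-coded to Euclidean, and the features × observations layout (columns are observations) is an assumption.

```julia
# Hypothetical helper, not part of DataSplits: classical Kennard-Stone selection
# on a features × observations matrix, using a precomputed distance matrix.
function kennard_stone_indices(X::AbstractMatrix, n_train::Integer)
    n = size(X, 2)
    2 <= n_train <= n || throw(ArgumentError("need 2 <= n_train <= n"))
    # Full N×N Euclidean distance matrix: the memory cost of the eager variant.
    D = [sqrt(sum(abs2, X[:, i] .- X[:, j])) for i in 1:n, j in 1:n]
    i, j = Tuple(argmax(D))               # seed with the most distant pair
    selected = [i, j]
    remaining = setdiff(1:n, selected)
    while length(selected) < n_train
        # Distance from each candidate to its nearest already-selected point ...
        mindist = [minimum(D[c, s] for s in selected) for c in remaining]
        # ... then take the candidate for which that distance is largest (maximin).
        pick = remaining[argmax(mindist)]
        push!(selected, pick)
        filter!(!=(pick), remaining)
    end
    return selected, remaining
end

X = [0.0 1.0 2.0 10.0 5.0]                # one feature, five observations
train, test = kennard_stone_indices(X, 3)
```

In this toy case the extremes (observations 1 and 4) are chosen first, followed by observation 5, which lies farthest from both.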
Examples
res = partition(X, KennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)
res = partition(X, KennardStoneSplit(Cityblock()); train = 70, test = 30)
DataSplits.LazyCADEXSplit — Type
LazyCADEXSplit
Alias for LazyKennardStoneSplit.
DataSplits.LazyKennardStoneSplit — Type
LazyKennardStoneSplit <: AbstractSplitStrategy
Memory-efficient Kennard-Stone (CADEX) algorithm. Computes distances on-the-fly (O(N) storage) rather than precomputing the full N×N matrix.
Fields
metric::Distances.SemiMetric: Distance metric (default: Euclidean())
Examples
res = partition(X, LazyKennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)
DataSplits.LazyMDKSSplit — Type
LazyMDKSSplit <: AbstractSplitStrategy
Memory-efficient Minimum Dissimilarity Kennard–Stone (MDKS) splitting strategy. Uses Mahalanobis distance for X and Euclidean for y, normalised and summed as in SPXY. Computes distances on-the-fly (O(N) storage).
Fields
metric_X::Union{Nothing,Distances.SemiMetric}: Distance metric for X; if nothing, Mahalanobis is computed from the data at split time.
metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())
Examples
res = partition(X, LazyMDKSSplit(); target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)
DataSplits.LazyMaximumDissimilaritySplit — Type
LazyMaximumDissimilaritySplit(; distance_cutoff=0.35, metric=Euclidean()) <: AbstractSplitStrategy
Lazy variant of MaximumDissimilaritySplit. Computes distances on-the-fly (O(N) storage) rather than precomputing the full distance matrix. Prefer this over MaximumDissimilaritySplit for large datasets.
Fields
distance_cutoff::Float64: Similarity threshold (default: 0.35).
metric::Distances.SemiMetric: Distance metric (default: Euclidean()).
Examples
res = partition(X, LazyMaximumDissimilaritySplit(); train = 70, test = 30)
X_train, X_test = splitdata(res, X)
DataSplits.LazyMinimumDissimilaritySplit — Type
LazyMinimumDissimilaritySplit <: AbstractSplitStrategy
Lazy (on-the-fly distances) variant of MinimumDissimilaritySplit.
Examples
res = partition(X, LazyMinimumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)
DataSplits.LazyOptiSimSplit — Type
LazyOptiSimSplit <: AbstractSplitStrategy
Memory-efficient, lazy implementation of the OptiSim (Clark 1997) dissimilarity selection strategy. Computes distances on-the-fly; avoids the full N×N matrix.
Fields
max_subsample_size::Int: Size of the temporary candidate subsample (default: 10)
distance_cutoff::Float64: Similarity threshold (default: 0.35)
metric::Distances.SemiMetric: Distance metric (default: Euclidean())
Notes
Emits the same undershoot warning as OptiSimSplit under _id = :datasplits_optisim_undershoot when distance_cutoff exhausts the candidate pool before reaching n_train. See ?OptiSimSplit for the silencing recipe.
References
- Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.
Examples
res = partition(X, LazyOptiSimSplit(); train = 70, test = 30)
X_train, X_test = splitdata(res, X)
DataSplits.LazySPXYSplit — Type
LazySPXYSplit <: AbstractSplitStrategy
Memory-efficient SPXY splitting strategy. Computes distances on-the-fly (O(N) storage) rather than precomputing the full N×N matrix.
Fields
metric_X::Distances.SemiMetric: Distance metric for X (default: Euclidean())
metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())
Examples
res = partition(X, LazySPXYSplit(); target=y, train=80, test=20)
X_train, X_test = splitdata(res, X)
DataSplits.LeaveOneOut — Type
LeaveOneOut() <: AbstractCVStrategy
Cross-validation where each single observation takes a turn as the test set. Produces n folds (one per observation). Equivalent to LeavePOut(1).
Examples
cvs = partition(X, LeaveOneOut())
for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end
DataSplits.LeavePGroupsOut — Type
LeavePGroupsOut(p::Integer) <: AbstractCVStrategy
Leave-p-groups-out cross-validation. Produces one fold per combination of p distinct groups: in each fold, the test cohort is exactly the observations belonging to those p groups, and the train cohort is everything else.
Equivalent to scikit-learn's LeavePGroupsOut. The number of folds is binomial(n_groups, p), which grows quickly — pick p accordingly.
Groups are passed via the groups= keyword (or, by fallback, data itself plays that role).
Fields
p::Int: Number of groups held out as test in each fold (must satisfy 1 ≤ p < n_groups).
Constructors
- LeavePGroupsOut(p): generic constructor.
- LeaveOneGroupOut(): convenience alias for LeavePGroupsOut(1).
Examples
# One group out per fold (n_folds == n_groups).
cvs = partition(X, LeaveOneGroupOut(); groups = patient_ids)
# Two groups out per fold; n_folds == binomial(n_groups, 2).
cvs = partition(X, LeavePGroupsOut(2); groups = site_ids)
for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end
DataSplits.LeavePOut — Type
LeavePOut(p::Integer) <: AbstractCVStrategy
Exhaustive cross-validation that uses every possible combination of p observations as the test set. Produces binomial(n, p) folds, where n is the number of observations.
Fields
p::Int: Number of observations in each test fold (must be ≥ 1 and < n).
Notes
- The number of folds grows as binomial(n, p), which becomes very large quickly. Use only for small datasets or small values of p.
- For p = 1, prefer LeaveOneOut as a convenience alias.
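To make the fold count concrete, here is a small sketch that enumerates every test set for a toy problem. The helper index_combinations is hypothetical, and the lexicographic fold order it produces is an assumption, not documented behaviour of the package.

```julia
# Hypothetical helper, not part of DataSplits: all p-element subsets of 1:n,
# generated in lexicographic order.
function index_combinations(n::Integer, p::Integer)
    out = Vector{Vector{Int}}()
    combo = Int[]
    function extend(start)
        if length(combo) == p
            push!(out, copy(combo))
            return
        end
        for i in start:n
            push!(combo, i)
            extend(i + 1)
            pop!(combo)
        end
    end
    extend(1)
    return out
end

n, p = 5, 2
tests = index_combinations(n, p)
folds = [(setdiff(1:n, t), t) for t in tests]   # (train, test) index pairs
```

Even for this toy case there are binomial(5, 2) = 10 folds; with n = 50 and p = 3 the count is already 19,600.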
Examples
cvs = partition(X, LeavePOut(2))
for (X_train, X_test) in splitview(cvs, X)
    # train and evaluate
end
DataSplits.MDKSSplit — Type
MDKSSplit(; metric=nothing)
Minimum Dissimilarity Kennard–Stone (MDKS) split using Mahalanobis distance for X and Euclidean distance for y.
Fields
metric::Union{Nothing,Distances.PreMetric}: Distance metric for X. When nothing (default), Mahalanobis distance is computed from the covariance of X at split time.
Examples
res = partition(X, MDKSSplit(); target=y, train=70, test=30)
res = partition(X, MDKSSplit(; metric=Mahalanobis(cov(X; dims=2)));
                target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)
See also
DataSplits.MaximumDissimilaritySplit — Type
MaximumDissimilaritySplit(; distance_cutoff=0.35, metric=Euclidean())
Full OptiSim strategy (Clark 1997). Alias for OptiSimSplit with max_subsample_size = N, which considers all remaining candidates each iteration.
Fields
distance_cutoff::Float64: Similarity threshold (default: 0.35).
metric::Distances.SemiMetric: Distance metric (default: Euclidean()).
Notes
- Greedily includes outliers; remove them before splitting if not desired.
References
- Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.
Examples
res = partition(X, MaximumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)
DataSplits.MinimumDissimilaritySplit — Type
MinimumDissimilaritySplit <: AbstractSplitStrategy
Greedy dissimilarity selection (Clark 1997). Equivalent to OptiSimSplit with max_subsample_size = 1 (considers only one candidate per iteration).
Fields
distance_cutoff::Float64: Similarity threshold (default: 0.35).
metric::Distances.SemiMetric: Distance metric (default: Euclidean()).
References
- Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.
Examples
res = partition(X, MinimumDissimilaritySplit(); train=70, test=30)
X_train, X_test = splitdata(res, X)
DataSplits.MoraisLimaMartinSplit — Type
MoraisLimaMartinSplit(; swap_frac=0.1, metric=Euclidean())
Kennard–Stone initialisation followed by random swapping of a fraction of samples between train and test sets.
Fields
swap_frac::ValidFraction: Fraction of samples to swap (0 < swap_frac < 1)
metric::Distances.SemiMetric: Distance metric for Kennard–Stone (default: Euclidean())
Examples
res = partition(X, MoraisLimaMartinSplit(); train=80, test=20)
res = partition(X, MoraisLimaMartinSplit(; swap_frac=0.05); train=80, test=20)
DataSplits.NestedCV — Type
NestedCV(outer::AbstractCVStrategy, inner::AbstractCVStrategy) <: AbstractCVStrategy
Nested cross-validation: combine an outer CV (for unbiased performance estimation) with an inner CV (for hyperparameter selection within each outer training cohort).
For each fold of outer, the outer test cohort is held out and the outer training cohort is itself partitioned by inner. The result is a CrossValidationSplit{NestedFold} where each fold exposes the usual outer (train, test) pair plus an innerfolds(...) accessor giving the inner CV split. Inner fold indices are already remapped to the absolute 1:N index space.
Fields
outer::AbstractCVStrategy: produces the outer train/test split per fold.
inner::AbstractCVStrategy: partitions each outer training cohort.
Restrictions
- outer must produce CrossValidationSplit{TrainTestSplit} (i.e. not nested itself).
- inner must be a non-resampling AbstractCVStrategy: strategies subtyping AbstractResamplingCVStrategy (e.g. ShuffleSplit, StratifiedShuffleSplit, GroupShuffleSplitCV) require caller-set cohort sizes, which NestedCV does not currently propagate.
Slot resolution
consumes(::NestedCV) is the union of the outer and inner strategies' declared slots. partition resolves each slot once against the full dataset, then the inner CV sees a view sliced to the outer training cohort.
Examples
# 5 × 3 nested k-fold on a classification target.
cvs = partition(X, NestedCV(StratifiedKFold(5), StratifiedKFold(3)); target = y)
for outerfold in folds(cvs)
    X_tr_outer, X_te_outer = splitdata(outerfold, X)
    y_tr_outer, _ = splitdata(outerfold, y)
    for (X_tr, X_val) in splitview(innerfolds(outerfold), X)
        # tune hyperparameters on (X_tr, X_val)
    end
    # then refit on the full outer training cohort and score on X_te_outer
end
# Group-aware nesting.
cvs = partition(X, NestedCV(GroupKFold(5), GroupKFold(3)); groups = patient_ids)
DataSplits.NestedFold — Type
NestedFold{I} <: AbstractSplitResult
A single outer fold of a nested cross-validation split.
Fields
train::I: outer training indices (absolute, into 1:N).
test::I: outer test indices (absolute, into 1:N).
inner::CrossValidationSplit{TrainTestSplit{I}}: the inner cross-validation produced by applying the inner strategy to the outer training cohort. Inner fold indices are absolute (already remapped from the local 1:length(train) index space back into 1:N), so they can be used directly against the original data.
Iteration
NestedFold iterates as (train, test) — identical to TrainTestSplit for outer-loop compatibility. To access the inner CV, use the innerfolds accessor or the inner field.
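The remapping from local to absolute indices amounts to indexing the outer training cohort with the inner fold's local positions. A minimal sketch, with a hypothetical helper named remap and made-up index vectors:

```julia
# Hypothetical helper, not part of DataSplits: map inner-fold indices, which are
# local to the outer training cohort, back to positions in the full dataset.
remap(outer_train::AbstractVector{<:Integer}, local_idx::AbstractVector{<:Integer}) =
    outer_train[local_idx]

outer_train = [2, 3, 5, 7, 8, 10]     # absolute indices into 1:N
inner_train_local = [1, 2, 3, 4]      # positions within outer_train
inner_val_local = [5, 6]

inner_train = remap(outer_train, inner_train_local)   # [2, 3, 5, 7]
inner_val = remap(outer_train, inner_val_local)       # [8, 10]
```

Because the documented inner indices are already absolute, callers do not need to perform this step themselves; the sketch only shows what "remapped" means.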
DataSplits.OptiSimSplit — Type
OptiSimSplit(; max_subsample_size=10, distance_cutoff=0.35, metric=Euclidean())
OptiSim (Clark 1997) K-dissimilarity selection strategy for train/test splitting.
Fields
max_subsample_size::Int: Size of the temporary candidate subsample
distance_cutoff::Float64: Two points are "similar" if their distance < distance_cutoff
metric::Distances.SemiMetric: Distance metric (default: Euclidean())
Notes
When distance_cutoff is restrictive relative to the data, the candidate pool may exhaust before n_train samples have been selected. The train cohort is then returned smaller than requested and a @warn is emitted with _id = :datasplits_optisim_undershoot and _group = :datasplits.
To silence this warning for a batch of splits, filter by id (e.g. with LoggingExtras.EarlyFilteredLogger):
using Logging, LoggingExtras
silent = EarlyFilteredLogger(log -> log.id !== :datasplits_optisim_undershoot,
                             current_logger())
with_logger(silent) do
    # repeated partition(...) calls here emit no undershoot warnings
end
References
- Clark, R. D. (1997). OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets. J. Chem. Inf. Comput. Sci., 37(6), 1181–1188.
Examples
res = partition(X, OptiSimSplit(; max_subsample_size=10); train=70, test=30)
X_train, X_test = splitdata(res, X)
DataSplits.PredefinedSplit — Type
PredefinedSplit(test_fold::AbstractVector{<:Integer}) <: AbstractCVStrategy
Cross-validation with caller-provided fold assignments. Each entry of test_fold gives the fold ID in which the corresponding observation serves as test. Negative values mean the observation is never placed in the test cohort; it is part of every fold's training set.
Folds are produced in ascending order of fold ID. Mirrors scikit-learn's PredefinedSplit, but driven by partition and DataSplits' CrossValidationSplit result type.
Fields
test_fold::Vector{Int}: Length-N vector mapping each observation to the fold ID where it tests, or to a negative value for "always train".
Notes
- length(test_fold) must equal numobs(data).
- Fold IDs need not be contiguous; what matters is the set of distinct non-negative values. At least one non-negative ID must exist.
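The rules above translate into a short sketch. The helper predefined_folds is hypothetical and returns plain (train, test) index pairs rather than the package's result type.

```julia
# Hypothetical helper, not part of DataSplits: build (train, test) index pairs
# from a test_fold vector, one pair per distinct non-negative fold ID.
function predefined_folds(test_fold::AbstractVector{<:Integer})
    ids = sort(unique(filter(>=(0), test_fold)))     # ascending fold IDs
    isempty(ids) && throw(ArgumentError("at least one non-negative fold ID is required"))
    # Everything not tested in this fold trains, including negative entries.
    return [(findall(!=(id), test_fold), findall(==(id), test_fold)) for id in ids]
end

test_fold = [0, 0, 0, 1, 1, 1, -1, -1, -1, -1]
folds = predefined_folds(test_fold)
```

Observations 7 to 10 carry a negative ID, so they appear in the train cohort of both folds and never in a test cohort.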
Examples
# Three folds: obs 1-2 test in fold 0, obs 3-4 test in fold 1, obs 5-6
# test in fold 2.
test_fold = [0, 0, 1, 1, 2, 2]
cvs = partition(X, PredefinedSplit(test_fold))
# Hold-out style: obs 7-10 are reserved for training only across all folds.
test_fold = [0, 0, 0, 1, 1, 1, -1, -1, -1, -1]
cvs = partition(X, PredefinedSplit(test_fold))
DataSplits.PurgedKFold — Type
PurgedKFold(k::Integer; purge::Integer=0, embargo::Integer=0) <: AbstractCVStrategy
Purged k-fold cross-validation for time-dependent data, following the recipe in López de Prado, Advances in Financial Machine Learning (2018). Observations are sorted by time= and partitioned into k contiguous chronological blocks; each block takes a turn as the test cohort while the train cohort is everything else minus an asymmetric exclusion window:
- purge observations are removed from the train cohort immediately before the test block. This mitigates leakage from samples whose labels overlap the test period (e.g. labels built from forward-looking returns whose horizon reaches into the test window).
- embargo observations are removed from the train cohort immediately after the test block. This mitigates leakage from serial correlation between test-period features and the immediately subsequent train samples.
This is the asymmetric counterpart of BlockedCV (which uses a single symmetric gap on both sides) and the contiguous-block counterpart of sklearn's KFold adapted for time series.
Atomicity rule
Observations sharing the same timestamp are never split between train and test of the same fold — block boundaries fall between distinct time values, mirroring TimeSeriesSplit and BlockedCV. purge and embargo are measured in observations and may trim partial blocks of equal timestamps from the train side; no row ever leaks into the test cohort.
Fields
k::Int: Number of folds (must be ≥ 2 and ≤ number of distinct time values).
purge::Int: Number of observations excluded from the train cohort immediately before the test block (must be ≥ 0; default 0).
embargo::Int: Number of observations excluded from the train cohort immediately after the test block (must be ≥ 0; default 0).
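The asymmetric window can be illustrated for a single fold. The helper purged_train is hypothetical and assumes rows 1:n are already in chronological order with distinct timestamps, so the atomicity rule does not come into play.

```julia
# Hypothetical helper, not part of DataSplits: train indices for one purged fold,
# given a contiguous test block on chronologically ordered rows 1:n.
function purged_train(n::Integer, test::UnitRange{Int};
                      purge::Integer = 0, embargo::Integer = 0)
    before = 1:(first(test) - purge - 1)    # drop `purge` rows just before the block
    after = (last(test) + embargo + 1):n    # drop `embargo` rows just after it
    return vcat(before, after)
end

train = purged_train(20, 9:12; purge = 2, embargo = 1)
```

With the test block at rows 9:12, rows 7 and 8 are purged and row 13 is embargoed, leaving rows 1:6 and 14:20 for training.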
Examples
# 5-fold purged CV with a 2-observation purge and a 1-observation embargo.
cvs = partition(X, PurgedKFold(5; purge = 2, embargo = 1); time = timestamps)
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end
# Single-input shorthand when the timestamps are also the data.
cvs = partition(timestamps, PurgedKFold(4; purge = 1))
References
López de Prado, M. Advances in Financial Machine Learning. Wiley, 2018, §7.4 ("Purging and Embargoing").
DataSplits.RandomSplit — Type
RandomSplit <: AbstractSplitStrategyRandomly splits data into the requested cohort sizes.
Examples
res = partition(X, RandomSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)
DataSplits.RepeatedKFold — Type
RepeatedKFold(k::Integer; n_repeats::Integer=10) <: AbstractCVStrategy
Repeated k-fold cross-validation. Runs KFold n_repeats times with a fresh random permutation each repeat, producing k * n_repeats folds in total. Mirrors scikit-learn's RepeatedKFold.
Each repeat is a full k-fold partition of the data; across repeats the fold assignments are independent random permutations. Use the same rng (with a fixed seed) to reproduce the full set of folds.
Fields
k::Int: Number of folds per repeat (must be ≥ 2 and ≤ N).
n_repeats::Int: Number of independent K-fold partitions (must be ≥ 1).
Examples
# 50 folds total (5 folds × 10 repeats).
cvs = partition(X, RepeatedKFold(5; n_repeats = 10);
                rng = MersenneTwister(42))
DataSplits.RepeatedStratifiedKFold — Type
RepeatedStratifiedKFold(k::Integer; n_repeats::Integer=10, bins::Integer=10) <: AbstractCVStrategy
Repeated stratified k-fold cross-validation. Runs StratifiedKFold n_repeats times with a fresh random permutation each repeat, producing k * n_repeats folds in total. Mirrors scikit-learn's RepeatedStratifiedKFold.
The target= keyword is required (or, by fallback, data itself plays that role); see StratifiedKFold for the stratification rule (unique values for discrete targets, quantile bins for floats).
Fields
k::Int: Number of folds per repeat (must be ≥ 2 and ≤ N).
n_repeats::Int: Number of independent stratified K-fold partitions (must be ≥ 1).
bins::Int: Number of quantile bins for floating-point targets (must be ≥ 2; default 10).
Examples
# Classification.
cvs = partition(X, RepeatedStratifiedKFold(5; n_repeats = 10);
target = labels, rng = MersenneTwister(42))
# Regression with 4 quantile bins.
cvs = partition(X, RepeatedStratifiedKFold(5; n_repeats = 10, bins = 4);
                target = y_continuous)
DataSplits.SPXYSplit — Type
SPXYSplit(; metric_X=Euclidean(), metric_y=Euclidean())
Sample set Partitioning based on joint X–Y distance (SPXY).
A variant of Kennard–Stone where the joint distance matrix is the element-wise sum of the (normalised) pairwise distance matrices of X and y.
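As a rough illustration of the joint distance, the sketch below builds both pairwise matrices with Euclidean distance and adds them element-wise. Scaling each matrix by its own maximum is an assumption about what "normalised" means here, and `pairwise_euclidean` and `spxy_joint_distance` are hypothetical names rather than package functions.

```julia
using LinearAlgebra

# Pairwise Euclidean distances between the columns of a features × samples matrix.
function pairwise_euclidean(A::AbstractMatrix)
    n = size(A, 2)
    D = zeros(Float64, n, n)
    for j in 1:n, i in (j + 1):n
        D[i, j] = D[j, i] = norm(view(A, :, i) .- view(A, :, j))
    end
    return D
end

# Joint distance: scale each matrix by its maximum (assumed normalisation),
# then add the two element-wise. Assumes neither matrix is all zeros.
function spxy_joint_distance(X::AbstractMatrix, y::AbstractVector)
    DX = pairwise_euclidean(X)
    Dy = pairwise_euclidean(reshape(y, 1, :))
    return DX ./ maximum(DX) .+ Dy ./ maximum(Dy)
end

X = [0.0 1.0 4.0;
     0.0 0.0 0.0]          # 2 features × 3 samples
y = [0.0, 2.0, 2.0]
D = spxy_joint_distance(X, y)
```

With this scaling both terms lie in [0, 1], so neither X nor y dominates the sum merely because of its units.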
Fields
- metric_X::Distances.SemiMetric: Distance metric for X (default: Euclidean())
- metric_y::Distances.SemiMetric: Distance metric for y (default: Euclidean())
Examples
res = partition(X, SPXYSplit(); target=y, train=70, test=30)
res = partition(X, SPXYSplit(; metric_X=Mahalanobis(cov(X; dims=2)));
target=y, train=70, test=30)
X_train, X_test = splitdata(res, X)
See also
KennardStoneSplit: the classical variant that uses only X.
DataSplits.ShuffleSplit — Type
ShuffleSplit(n_splits::Integer) <: AbstractResamplingCVStrategy
Random permutation cross-validation. For each of the n_splits iterations, observations are randomly shuffled and the requested train and test cohort sizes are drawn from the head and the next slice of the permutation. Mirrors scikit-learn's ShuffleSplit, with the train/test sizes supplied at the partition call (in line with the rest of DataSplits' API).
Fields
n_splits::Int: Number of resamples (must be ≥ 1).
Notes
- train and test sum to N, like the rest of the partition API: every observation is placed in exactly one cohort per resample. (sklearn lets you drop observations by setting train_size + test_size < 1; this package keeps the "all observations accounted for" invariant.)
- Resamples are independent: an observation can land in train in one fold and test in another. This is the defining property of ShuffleSplit versus KFold.
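A minimal sketch of the resampling idea, assuming absolute cohort sizes have already been resolved: each resample draws a fresh permutation, takes its head as train and the remainder as test. `shuffle_split_indices` is a hypothetical name used only for this illustration.

```julia
using Random

# Hypothetical helper: n_splits independent (train, test) index pairs.
# The head of each permutation is train, the remainder is test, so
# n_train + n_test == N holds for every resample.
function shuffle_split_indices(N::Int, n_splits::Int, n_train::Int;
                               rng::AbstractRNG = Random.default_rng())
    n_splits >= 1 || throw(ArgumentError("n_splits must be ≥ 1"))
    0 < n_train < N || throw(ArgumentError("n_train must satisfy 0 < n_train < N"))
    return map(1:n_splits) do _
        perm = randperm(rng, N)
        (perm[1:n_train], perm[(n_train + 1):end])
    end
end

resamples = shuffle_split_indices(10, 4, 8; rng = MersenneTwister(42))
```

Unlike k-fold, nothing ties the test sets of different resamples together, so an observation may appear in several test cohorts or in none.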
Examples
# Fractions.
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2)
# Absolute counts.
cvs = partition(X, ShuffleSplit(10); train = 80, test = 20)
# Reproducible.
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2,
                rng = MersenneTwister(42))
DataSplits.SphereExclusionResult — Type
SphereExclusionResult
Result of sphere exclusion clustering.
Fields
- assignments::Vector{Int}: Cluster index per point (1-based).
- radius::Float64: Exclusion radius used for clustering.
- metric::Distances.SemiMetric: Distance metric used.
DataSplits.SplitInputError — Type
SplitInputError(msg)
Error thrown when input data to a split is invalid (e.g., empty, wrong shape, mismatched X/y).
DataSplits.SplitNotImplementedError — Type
SplitNotImplementedError(msg)
Error thrown when a required split method or feature is not implemented.
DataSplits.SplitParameterError — Type
SplitParameterError(msg)
Error thrown when split parameters are invalid (e.g., unknown allocation, out-of-bounds fraction).
DataSplits.StratifiedGroupKFold — Type
StratifiedGroupKFold(k::Integer; bins::Integer=10, shuffle::Bool=false) <: AbstractCVStrategy
Group-aware and stratified k-fold cross-validation. No group spans two folds (like GroupKFold), and per-fold class proportions are kept close to the global distribution (like StratifiedKFold). Mirrors scikit-learn's StratifiedGroupKFold.
Both target= and groups= keywords are required; neither falls back to data (no sensible default — they describe orthogonal properties of the same observations).
Fields
- k::Int: Number of folds (must be ≥ 2 and ≤ number of unique groups).
- bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.
- shuffle::Bool: When true, the order in which groups are considered for fold assignment is shuffled using the rng passed to partition. When false (default), groups are processed in descending order of class-count vector norm (sklearn's deterministic balancing).
Algorithm
For each group, compute its per-class count vector. Process groups one by one (largest first by default). For each group, assign it to the fold that minimises the variance of per-class proportions across folds after the assignment — each class's fold counts are normalised by its global total so rare and abundant classes contribute on the same scale. This is the same greedy heuristic used by sklearn (y_counts_per_fold / y_distr).
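The greedy assignment described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the package's code: classes are assumed to be coded 1:n_classes with every class present, ties go to the lowest-numbered fold, and `stratified_group_folds` is a hypothetical name.

```julia
using Statistics, LinearAlgebra

# Hypothetical sketch: returns a Dict mapping each group id to a fold in 1:k.
function stratified_group_folds(classes::Vector{Int}, groups::Vector{Int}, k::Int)
    n_classes = maximum(classes)
    group_ids = unique(groups)
    # Per-group class-count vectors.
    counts = Dict(g => zeros(Int, n_classes) for g in group_ids)
    for (c, g) in zip(classes, groups)
        counts[g][c] += 1
    end
    totals = sum(values(counts))          # global per-class totals
    # Largest groups first, measured by the norm of the class-count vector.
    order = sort(group_ids; by = g -> norm(counts[g]), rev = true)
    fold_counts = zeros(Float64, k, n_classes)
    fold_of = Dict{Int,Int}()
    for g in order
        best_fold, best_score = 0, Inf
        for f in 1:k
            fold_counts[f, :] .+= counts[g]      # tentative assignment
            # Variance across folds of each class's normalised share, averaged over classes.
            score = mean(var(fold_counts[:, c] ./ totals[c]) for c in 1:n_classes)
            fold_counts[f, :] .-= counts[g]      # undo
            if score < best_score                # strict: first minimum wins ties
                best_fold, best_score = f, score
            end
        end
        fold_counts[best_fold, :] .+= counts[g]
        fold_of[g] = best_fold
    end
    return fold_of
end

classes = [1, 1, 2, 2, 1, 2, 1, 2]
groups  = [1, 1, 2, 2, 3, 3, 4, 4]
fold_of = stratified_group_folds(classes, groups, 2)
```

In this toy example the two single-class groups end up together in one fold and the two mixed groups in the other, so both folds hold two observations of each class.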
Notes
- Each class needs at least k members and must appear in at least k distinct groups; otherwise even the best assignment cannot place that class in every fold's training cohort. The algorithm does not raise on this (sklearn does not either), but fold class coverage may be uneven for very rare classes.
Examples
# Classification with patient-level groups.
cvs = partition(X, StratifiedGroupKFold(5);
target = labels, groups = patient_ids)
# Regression with quantile bins.
cvs = partition(X, StratifiedGroupKFold(5; bins = 4);
target = y_continuous, groups = batch_ids)
# Shuffled ordering — different seeds give different fold compositions.
cvs = partition(X, StratifiedGroupKFold(5; shuffle = true);
target = labels, groups = patient_ids,
                rng = MersenneTwister(42))
DataSplits.StratifiedKFold — Type
StratifiedKFold(k::Integer; bins::Integer=10, shuffle::Bool=false) <: AbstractCVStrategy
Stratified k-fold cross-validation. Within each fold, every class (or quantile bin) is represented in roughly the same proportion as in the full dataset.
Targets are passed via the target= keyword (or, by fallback, data itself plays that role).
Fields
- k::Int: Number of folds (must be ≥ 2).
- bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.
- shuffle::Bool: When true, member indices within each class are randomly permuted before round-robin assignment using the rng passed to partition, so different seeds yield different fold assignments. When false (default), assignment is fully deterministic.
Stratification rule
- Discrete target (e.g. Int, Bool, Symbol, String): each unique value defines a class.
- Float target: targets are binned into bins quantile-based bins; each bin defines a class.
Within each class, indices are distributed round-robin across the k folds, so every fold gets a near-equal share of every class.
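The round-robin rule for discrete targets might look like the sketch below. `stratified_fold_ids` is a hypothetical helper, and restarting the round-robin at fold 1 for every class is an assumption; the package may continue the rotation across classes.

```julia
# Hypothetical helper: assign each observation a fold id in 1:k by dealing
# the members of every class round-robin across the folds.
function stratified_fold_ids(target::AbstractVector, k::Int)
    fold = zeros(Int, length(target))
    for class in unique(target)
        members = findall(==(class), target)
        length(members) >= k ||
            throw(ArgumentError("class $class has fewer than k = $k members"))
        for (pos, idx) in enumerate(members)
            fold[idx] = mod1(pos, k)   # 1, 2, ..., k, 1, 2, ...
        end
    end
    return fold
end

labels = [:a, :a, :a, :a, :b, :b, :b, :b, :b, :b]
fold = stratified_fold_ids(labels, 2)
```

Fold f's test cohort is then `findall(==(f), fold)`, and every fold receives a near-equal share of each class.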
Notes
- Each class needs at least k members; otherwise a SplitParameterError is raised. For discrete targets this is a hard constraint; for binned continuous targets, lower bins if you hit it.
- When a continuous target has many repeated values (e.g. lots of zeros), quantile edges may collapse and effectively yield fewer than bins populated bins. Stratification still works, but bin coverage is uneven.
Examples
# Classification.
cvs = partition(X, StratifiedKFold(5); target = labels)
# Regression: 10 quantile bins (default).
cvs = partition(X, StratifiedKFold(5); target = y_continuous)
# Regression with a custom number of bins.
cvs = partition(X, StratifiedKFold(5; bins = 4); target = y_continuous)
# Shuffled — different seeds give different fold assignments.
cvs = partition(X, StratifiedKFold(5; shuffle = true); target = labels,
                rng = MersenneTwister(42))
DataSplits.StratifiedShuffleSplit — Type
StratifiedShuffleSplit(n_splits::Integer; bins::Integer=10) <: AbstractResamplingCVStrategy
Stratified resampling cross-validation. Combines the per-resample random-draw structure of ShuffleSplit with the class/quantile-bin balancing of StratifiedKFold: each resample preserves the global class proportions in both train and test cohorts.
Targets are passed via the target= keyword (or, by fallback, data itself plays that role).
Fields
- n_splits::Int: Number of resamples (must be ≥ 1).
- bins::Int: Number of quantile bins used when target is floating-point (must be ≥ 2; default 10). Ignored for non-float targets, which are treated as discrete classes.
Stratification rule
Same as StratifiedKFold:
- Discrete target (e.g. Int, Bool, Symbol, String): each unique value defines a class.
- Float target: targets are binned into bins quantile-based bins.
Within each class/bin, members are randomly shuffled and the global train fraction is applied locally — round(Int, n_train * |class| / N) go to train, the rest to test. Rounding remainder is absorbed in the last class processed so totals match n_train / n_test exactly.
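The allocation arithmetic can be checked with a short sketch. `stratified_train_counts` is a hypothetical helper that only computes per-class train counts; the shuffling within each class is omitted.

```julia
# Hypothetical helper: per-class train counts from the rounding rule above,
# with the last class absorbing whatever rounding left over.
function stratified_train_counts(class_sizes::Vector{Int}, n_train::Int)
    N = sum(class_sizes)
    counts = [round(Int, n_train * s / N) for s in class_sizes]
    counts[end] += n_train - sum(counts)
    return counts
end

class_sizes = [5, 5, 5]                                # N = 15
train_counts = stratified_train_counts(class_sizes, 10)
test_counts = class_sizes .- train_counts
```

Here each class first rounds 10 * 5 / 15 ≈ 3.33 down to 3, which leaves one observation unallocated; the last class takes it, so the train counts are 3, 3, 4 and the test counts are 2, 2, 1.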
Notes
- n_train + n_test == N (project-wide invariant: sklearn allows dropping observations, this package does not).
- Each class needs at least 2 members so that both cohorts can receive a representative; otherwise a SplitParameterError is raised. Reduce bins for continuous targets if you hit it.
Examples
# Classification.
cvs = partition(X, StratifiedShuffleSplit(10);
target = labels, train = 0.8, test = 0.2)
# Regression: 10 quantile bins (default).
cvs = partition(X, StratifiedShuffleSplit(10);
                target = y_continuous, train = 0.8, test = 0.2)
DataSplits.TargetPropertySplit — Type
TargetPropertySplit(order::Symbol) <: AbstractSplitStrategy
Splits observations by sorting a 1D property vector and selecting the top or bottom slice as the training set.
Fields
- order::Symbol: :high selects the largest values for training; :low selects the smallest.
Examples
# Highest values go to train; y provides the ordering.
res = partition(X, TargetPropertySplit(:high); target = y, train = 80, test = 20)
X_train, X_test = splitdata(res, X)
# Convenience aliases.
res = partition(X, TargetPropertyHigh(); target = y, train = 80, test = 20)
res = partition(X, TargetPropertyLow(); target = y, train = 80, test = 20)
# When y is both data and property.
res = partition(y, TargetPropertyHigh(); train = 80, test = 20)
DataSplits.TimeSeriesSplit — Type
TimeSeriesSplit(k::Integer; gap::Integer=0, max_train_size::Union{Nothing,Integer}=nothing) <: AbstractCVStrategy
Time-aware cross-validation. The temporal sequence is partitioned into k + 1 chronological chunks; fold i (1 ≤ i ≤ k) tests on chunk i + 1 and trains on the observations chronologically preceding it.
By default the train cohort expands across all earlier chunks. Pass max_train_size (in observations) to cap the train cohort, mirroring scikit-learn's TimeSeriesSplit: when set, each fold trains on at most the most recent max_train_size observations before the test chunk.
Train and test cohorts in the same fold are separated by gap observations (useful to avoid leakage between adjacent samples in autocorrelated series).
Atomicity rule
Observations sharing the same timestamp are never split between train and test of the same fold — chunk boundaries always fall between distinct time values, mirroring TimeSplit. Chunk sizes are therefore measured in distinct time values, not in observations.
gap and max_train_size are measured in observations, matching sklearn's contract. When either falls inside a block of equal timestamps, that block is split on the train side — some rows are kept, the rest dropped. No row leaks into test (the test cohort still starts at the next chunk), but the train side is no longer block-aligned.
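To make the interaction of gap and max_train_size concrete, here is a simplified sketch that works on positions in an already sorted series and assumes all timestamps are distinct, so the atomicity rule never triggers. `time_series_folds` is a hypothetical name, and the equal-width chunking is an assumption.

```julia
# Hypothetical sketch: (train, test) position ranges for k folds over N
# chronologically sorted observations with distinct timestamps.
function time_series_folds(N::Int, k::Int; gap::Int = 0,
                           max_train_size::Union{Nothing,Int} = nothing)
    N >= k + 1 || throw(ArgumentError("need at least k + 1 observations"))
    # Chunk c ends at position ends[c]; there are k + 1 chunks.
    ends = [div(c * N, k + 1) for c in 1:(k + 1)]
    folds = Tuple{UnitRange{Int},UnitRange{Int}}[]
    for i in 1:k
        test = (ends[i] + 1):ends[i + 1]
        train_hi = ends[i] - gap                       # drop `gap` rows before test
        train_lo = max_train_size === nothing ? 1 :
                   max(1, train_hi - max_train_size + 1)   # rolling window
        train_hi >= train_lo ||
            throw(ArgumentError("fold $i has an empty train cohort"))
        push!(folds, (train_lo:train_hi, test))
    end
    return folds
end

folds = time_series_folds(12, 3; gap = 1, max_train_size = 4)
```

With 12 observations and k = 3 the chunks end at positions 3, 6, 9, and 12, so the three folds train on 1:2, 2:5, and 5:8 and test on 4:6, 7:9, and 10:12.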
Fields
- k::Int: Number of folds (must be ≥ 2).
- gap::Int: Number of observations skipped from the end of the train cohort in each fold (must be ≥ 0; default 0).
- max_train_size::Union{Nothing,Int}: When nothing (default), the train cohort expands across all earlier chunks. When an Int ≥ 1, the train cohort is capped to that many observations, taken from the most recent end (rolling window).
Notes
- Requires at least k + 1 distinct time values.
- A fold whose train cohort would be empty (because gap consumes it) raises SplitParameterError.
Examples
# Expanding window (default).
cvs = partition(X, TimeSeriesSplit(5); time = timestamps)
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end
# Rolling window: train uses at most the last 100 observations.
cvs = partition(X, TimeSeriesSplit(5; max_train_size = 100); time = timestamps)
# Rolling window with a one-observation gap between train and test.
cvs = partition(X, TimeSeriesSplit(5; gap = 1, max_train_size = 100); time = timestamps)
DataSplits.TimeSplit — Type
TimeSplit(order::Symbol=:asc) <: AbstractSplitStrategy
Splits a 1D array of dates/times into train/test sets, grouping by unique values so that no group is split across train and test.
The actual training cohort size may slightly overshoot n_train but never fall below it.
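The overshoot behaviour follows from moving the cut point forward until it no longer lands inside a run of equal timestamps. A minimal sketch, with `time_split_indices` as a hypothetical name and the cut-extension loop as an assumed mechanism:

```julia
# Hypothetical sketch: split by time without separating equal timestamps.
function time_split_indices(times::AbstractVector, n_train::Int; order::Symbol = :asc)
    1 <= n_train < length(times) ||
        throw(ArgumentError("n_train must satisfy 1 ≤ n_train < N"))
    perm = sortperm(times; rev = (order === :desc))
    cut = n_train
    # Extend the cut while the next observation shares the boundary timestamp.
    while cut < length(perm) && times[perm[cut + 1]] == times[perm[cut]]
        cut += 1
    end
    return perm[1:cut], perm[(cut + 1):end]
end

times = [1, 2, 2, 2, 3, 4]
train, test = time_split_indices(times, 3)
```

Requesting 3 training observations would cut through the block of 2s, so the cut moves to position 4: train holds indices 1 to 4 and test holds 5 and 6.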
Fields
- order::Symbol: :asc puts the oldest observations in train (default); :desc puts the newest in train.
Examples
# Oldest observations go to train (asc order).
res = partition(X, TimeSplit(:asc); time = dates, train = 70, test = 30)
X_train, X_test = splitdata(res, X)
# Convenience aliases.
res = partition(X, TimeSplitOldest(); time = dates, train = 70, test = 30) # same as :asc
res = partition(X, TimeSplitNewest(); time = dates, train = 70, test = 30) # same as :desc
# When dates are the data themselves.
res = partition(dates, TimeSplitOldest(); train = 70, test = 30)
DataSplits.TrainTestSplit — Type
TrainTestSplit
A result type representing a train/test split.
Fields
- train: Indices of training samples.
- test: Indices of test samples.
Examples
res = partition(X, KennardStoneSplit(); train = 80, test = 20)
X_train, X_test = splitdata(res, X)
DataSplits.TrainValTestSplit — Type
TrainValTestSplit
A result type representing a train/validation/test split.
Fields
- train: Indices of training samples.
- val: Indices of validation samples.
- test: Indices of test samples.
Examples
res = partition(X, RandomSplit(), KennardStoneSplit();
train = 70, validation = 10, test = 20)
X_train, X_val, X_test = splitdata(res, X)
DataSplits.ValidFraction — Type
ValidFraction{T<:Real}
A wrapper type guaranteeing a real value strictly in (0, 1).
Arithmetic with plain numbers delegates to the underlying value so it can be used transparently in formulas.
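A validated-wrapper pattern of this kind could be sketched as below. `UnitFraction` is a hypothetical stand-in, not the package's ValidFraction, and only the operators needed for the example are forwarded.

```julia
# Hypothetical stand-in: construction fails unless the value lies strictly in (0, 1).
struct UnitFraction{T<:Real}
    value::T
    function UnitFraction(x::T) where {T<:Real}
        0 < x < 1 || throw(ArgumentError("fraction must lie strictly in (0, 1), got $x"))
        return new{T}(x)
    end
end

# Arithmetic with plain numbers delegates to the wrapped value.
Base.:*(f::UnitFraction, x::Real) = f.value * x
Base.:*(x::Real, f::UnitFraction) = x * f.value
Base.:+(f::UnitFraction, g::UnitFraction) = f.value + g.value

n_train = round(Int, UnitFraction(0.8) * 50)
total = UnitFraction(0.8) + UnitFraction(0.2)
```

Validating once in the constructor means downstream code can use the fraction in formulas without re-checking its range.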
DataSplits.LeaveOneGroupOut — Method
LeaveOneGroupOut()
Alias for LeavePGroupsOut(1): produces one fold per unique group.
DataSplits.TargetPropertyHigh — Method
TargetPropertyHigh()
Alias for TargetPropertySplit(:high): selects the highest-valued observations for the training set.
DataSplits.TargetPropertyLow — Method
TargetPropertyLow()
Alias for TargetPropertySplit(:low): selects the lowest-valued observations for the training set.
DataSplits.TimeSplitNewest — Method
TimeSplitNewest()
Alias for TimeSplit(:desc): newest observations go to the training set.
DataSplits.TimeSplitOldest — Method
TimeSplitOldest()
Alias for TimeSplit(:asc): oldest observations go to the training set.
DataSplits._assert_partitionable — Method
_assert_partitionable(data) -> N
Validate that data is non-empty and has at least 2 observations. Returns numobs(data) for downstream use. Used by every partition method to guarantee a meaningful split is possible.
DataSplits._assert_unit_fraction_sum — Method
_assert_unit_fraction_sum(fractions::ValidFraction...)
Assert that the supplied validated fractions form a complete partition.
Throws SplitParameterError if the sum of the wrapped fraction values is not approximately equal to 1.
DataSplits._blocked_cv_partition — Method
_blocked_cv_partition(data, k, pre_gap, post_gap; time, name) -> CrossValidationSplit
Shared implementation for contiguous-block k-fold CV strategies. Sorts observations by time, distributes the k blocks, and for each fold uses everything outside [test_lo - pre_gap, test_hi + post_gap] as the train cohort.
name is used only in error messages to identify the calling strategy.
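The exclusion window can be illustrated on sorted positions. `blocked_train_positions` is a hypothetical helper, and clamping the window to 1:N is an assumption.

```julia
# Hypothetical helper: train positions for one fold, i.e. everything outside
# [test_lo - pre_gap, test_hi + post_gap], clamped to 1:N.
function blocked_train_positions(N::Int, test_lo::Int, test_hi::Int;
                                 pre_gap::Int = 0, post_gap::Int = 0)
    excl_lo = max(1, test_lo - pre_gap)
    excl_hi = min(N, test_hi + post_gap)
    return vcat(1:(excl_lo - 1), (excl_hi + 1):N)
end

train = blocked_train_positions(10, 4, 6; pre_gap = 1, post_gap = 2)
```

With a test block at positions 4:6, the window widens to 3:8, leaving positions 1, 2, 9, and 10 for training.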
DataSplits._is_fraction — Method
Check whether a number is strictly in (0, 1).
DataSplits._resolve_sizes — Method
_resolve_sizes(N, train, validation, test) -> (n_train, n_val, n_test)
Validate and resolve cohort sizes.
Integer form: two interpretations, distinguished by the sum:
- sum == 100: values are percentages of N.
- sum == N: values are absolute counts.
Any other sum is rejected.
Float form — values must each be in (0, 1) and sum to approximately 1.0.
When validation === nothing, n_val == 0 and only train and test cohorts are produced. Rounding remainder is absorbed by n_test.
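The rules above can be sketched for the two-cohort case using multiple dispatch on the argument types. `resolve_sizes` is a hypothetical name, and rounding the train count to the nearest integer is an assumption; the source only states that the remainder goes to n_test.

```julia
# Integer form: percentages when the sum is 100, absolute counts when it is N.
function resolve_sizes(N::Int, train::Integer, test::Integer)
    total = train + test
    if total == 100
        n_train = round(Int, N * train / 100)
        return (n_train, N - n_train)       # n_test absorbs the rounding remainder
    elseif total == N
        return (Int(train), Int(test))
    end
    throw(ArgumentError("integer sizes must sum to 100 or to N = $N, got $total"))
end

# Float form: each value strictly in (0, 1), summing to approximately 1.
function resolve_sizes(N::Int, train::AbstractFloat, test::AbstractFloat)
    (0 < train < 1 && 0 < test < 1) ||
        throw(ArgumentError("fractions must lie strictly in (0, 1)"))
    isapprox(train + test, 1.0) || throw(ArgumentError("fractions must sum to 1"))
    n_train = round(Int, N * train)
    return (n_train, N - n_train)
end

by_percent  = resolve_sizes(50, 80, 20)
by_count    = resolve_sizes(50, 35, 15)
by_fraction = resolve_sizes(50, 0.7, 0.3)
```

When N is itself 100 the two integer interpretations coincide, so the ambiguity is harmless.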
DataSplits._to_feature_matrix — Method
Convert a Tables.jl-compatible input (e.g. DataFrame) to a features×samples matrix (F×N), which is the internal convention for distance-based strategies. Non-table inputs are returned unchanged.
DataSplits._warn_undershoot — Method
_warn_undershoot(n_selected, n_requested, msg; id)
Emit a @warn when fewer samples were selected than requested. id is used as the log record _id for selective filtering with LoggingExtras.EarlyFilteredLogger.
DataSplits._within_group_std — Method
_within_group_std(data, idxs) -> Float64
Compute the average per-feature standard deviation of observations at idxs within data, container-agnostically. Used by :neyman allocation to weight groups by within-group dispersion.
Tables.jl inputs are converted to an F×N matrix; Vectors are treated as 1D feature streams. Singleton groups return 0.0 to avoid NaN.
DataSplits.consumes — Method
consumes(alg::AbstractSplitStrategy) -> NTuple{N, Symbol}
Return the named slots this strategy reads, as a tuple of symbols from (:data, :target, :time, :groups).
DataSplits.distance_matrix — Method
distance_matrix(X, metric::Distances.SemiMetric)
Computes the full symmetric pairwise distance matrix for the dataset X using the given metric.
Arguments
- X: Data matrix or container. Columns are samples (features × samples).
- metric::Distances.SemiMetric: Distance metric from Distances.jl.
Returns
- D::Matrix{Float64}: Matrix where D[i, j] = metric(xᵢ, xⱼ) and D[i, j] == D[j, i].
Notes
- For custom containers, getobs(X, i) is used to access samples.
- The matrix is symmetric and not normalized.
DataSplits.distribute_blocks — Method
distribute_blocks(B::Int, n_chunks::Int) -> chunk_block_end
Distribute B contiguous blocks across n_chunks as evenly as possible (matching numpy.array_split semantics: the remainder is spread over the first B mod n_chunks chunks). Returns a vector of length n_chunks where chunk_block_end[c] is the index of the last block in chunk c. Chunk sizes differ by at most 1.
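A minimal sketch of that distribution rule, written independently of the package (`distribute_blocks_sketch` is a hypothetical name, and requiring n_chunks ≤ B is an assumption):

```julia
# Hypothetical sketch: the first (B mod n_chunks) chunks get one extra block;
# the cumulative sum of chunk sizes gives the last block index of each chunk.
function distribute_blocks_sketch(B::Int, n_chunks::Int)
    1 <= n_chunks <= B || throw(ArgumentError("need 1 ≤ n_chunks ≤ B"))
    base, extra = divrem(B, n_chunks)
    sizes = [base + (c <= extra ? 1 : 0) for c in 1:n_chunks]
    return cumsum(sizes)
end

chunk_block_end = distribute_blocks_sketch(10, 3)
```

Ten blocks over three chunks gives sizes 4, 3, 3, so the chunks end at blocks 4, 7, and 10.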
DataSplits.fallback_from_data — Method
fallback_from_data(alg::AbstractSplitStrategy) -> NTuple{N, Symbol}
Return the subset of consumes(alg) whose keyword may be omitted in partition, in which case data itself fills that slot.
Must satisfy: fallback_from_data(alg) ⊆ consumes(alg).
DataSplits.find_maximin_element — Method
find_maximin_element(distances::AbstractMatrix{T},
                     source_set::Union{AbstractVector{Int},AbstractSet{Int}},
                     reference_set::Union{AbstractVector{Int},AbstractSet{Int}}) -> Int
Finds the element in source_set that maximizes the minimum distance to all elements in reference_set.
Arguments
- distances::AbstractMatrix{T}: Precomputed, symmetric pairwise distance matrix (N×N).
- source_set::Union{AbstractVector{Int},AbstractSet{Int}}: Indices to evaluate.
- reference_set::Union{AbstractVector{Int},AbstractSet{Int}}: Indices to compare against.
Returns
- Int: Index in source_set that is farthest from its nearest neighbor in reference_set.
Notes
- Throws ArgumentError if reference_set is empty.
- Breaks ties by returning the first maximum.
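The maximin selection can be sketched in a few lines. `maximin_element` is a hypothetical re-implementation for illustration; it returns 0 when source_set is empty, which is a choice made here and not documented behaviour.

```julia
# Hypothetical sketch: pick the source index whose nearest reference point
# is farthest away.
function maximin_element(distances::AbstractMatrix, source_set, reference_set)
    isempty(reference_set) && throw(ArgumentError("reference_set must not be empty"))
    best_idx, best_score = 0, -Inf
    for i in source_set
        score = minimum(distances[i, j] for j in reference_set)
        if score > best_score          # strict comparison: first maximum wins ties
            best_idx, best_score = i, score
        end
    end
    return best_idx
end

D = [0.0 1.0 5.0 2.0;
     1.0 0.0 4.0 3.0;
     5.0 4.0 0.0 6.0;
     2.0 3.0 6.0 0.0]
picked = maximin_element(D, [3, 4], [1, 2])
```

Point 3 is at distance 4 from its nearest reference point and point 4 is at distance 2, so point 3 is selected. This is the step that Kennard–Stone-style selection repeats as the reference set grows.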
DataSplits.find_most_distant_pair — Method
find_most_distant_pair(D::AbstractMatrix) -> (i, j)
Find the indices of the most distant pair in a precomputed distance matrix.
DataSplits.folds — Method
folds(res::CrossValidationSplit) -> Vector{<:AbstractSplitResult}
Return the individual fold results from a cross-validation split.
DataSplits.group_offsets — Method
group_offsets(sorted_keys, perm, v) -> block_offset
Compute block-boundary offsets for a grouped, sorted permutation.
block_offset[b]+1 : block_offset[b+1] are the positions in perm (equivalently the slice of v[perm]) whose value equals sorted_keys[b]. block_offset[1] == 0 and block_offset[end] == length(perm).
DataSplits.groupsortperm — Method
groupsortperm(v) -> (sorted_keys, perm)
Return the sorted unique values of v and a stable sort permutation of v.
perm is a permutation of 1:length(v) such that v[perm] is non-decreasing. sorted_keys == unique(v[perm]). Together, sorted_keys and perm partition every index in 1:length(v) with no duplicates.
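To show how groupsortperm and group_offsets fit together, the sketch below re-implements both contracts with Base functions only. The `_sketch` names are hypothetical and the bodies are one possible reading of the descriptions, not the package's code.

```julia
# Sorted unique keys plus a stable sort permutation (sortperm is stable by default).
function groupsortperm_sketch(v::AbstractVector)
    perm = sortperm(v)
    sorted_keys = unique(v[perm])
    return sorted_keys, perm
end

# offsets[b] + 1 : offsets[b + 1] are the positions in perm whose value is sorted_keys[b].
function group_offsets_sketch(sorted_keys, perm, v)
    offsets = zeros(Int, length(sorted_keys) + 1)
    b = 1
    for (pos, idx) in enumerate(perm)
        while v[idx] != sorted_keys[b]   # crossed into the next block
            offsets[b + 1] = pos - 1
            b += 1
        end
    end
    offsets[end] = length(perm)
    return offsets
end

v = [3, 1, 3, 2, 1]
ks, perm = groupsortperm_sketch(v)
offsets = group_offsets_sketch(ks, perm, v)
block2 = perm[(offsets[2] + 1):offsets[3]]   # original indices whose value is ks[2]
```

For this input the keys are 1, 2, 3, the permutation is 2, 5, 4, 1, 3, and the offsets are 0, 2, 3, 5, so the second block contains only index 4.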
DataSplits.innerfolds — Method
innerfolds(f::NestedFold) -> CrossValidationSplit
Return the inner cross-validation split associated with the outer fold f. Inner fold indices are absolute (into 1:N), so they can be used directly against the original data without further remapping.
Example
cvs = partition(X, NestedCV(KFold(5), KFold(3)))
for outerfold in folds(cvs)
    X_outer_train, X_outer_test = splitdata(outerfold, X)
    for (X_tr, X_val) in splitview(innerfolds(outerfold), X)
        # hyperparameter tuning on the outer training cohort
    end
end
DataSplits.partition — Method
partition(data, alg::AbstractCVStrategy;
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> CrossValidationSplit
Produce a cross-validation split: a CrossValidationSplit wrapping one fold result per element of the partition.
Unlike the train/test and train/val/test forms, this method does not accept train / test / validation keywords — fold sizes are fixed by the strategy (typically via k). Resampling strategies that do take caller-set cohort sizes subtype AbstractResamplingCVStrategy and dispatch to a separate partition method.
Auxiliary slots
- target: response/property vector (e.g. for StratifiedKFold).
- time: temporal ordering vector (e.g. for TimeSeriesSplit).
- groups: group-membership vector (e.g. for GroupKFold).
Examples
cvs = partition(X, GroupKFold(5); groups = patient_ids)
for (X_train, X_test) in splitview(cvs, X)
    fit!(model, X_train); evaluate(model, X_test)
end
DataSplits.partition — Method
partition(data, alg::AbstractResamplingCVStrategy;
          train, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> CrossValidationSplit
Resampling cross-validation: each fold is an independent random train/test split sized by the caller, and n_splits independent resamples are produced. Used by ShuffleSplit, StratifiedShuffleSplit, and GroupShuffleSplitCV.
train and test follow the same conventions as the train/test partition form (percentages, absolute counts, or (0,1) fractions summing to 1).
Examples
cvs = partition(X, ShuffleSplit(10); train = 0.8, test = 0.2)
DataSplits.partition — Method
partition(data, alg, val_alg;
          train, validation, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> TrainValTestSplit
Split data into train, validation, and test cohorts using two strategies.
alg separates the test cohort from the rest; val_alg then separates the validation cohort from the remaining train pool.
Cohort sizes (train, validation, test)
Integers are accepted in two ways:
- Percentages: values sum to 100.
- Absolute counts: values sum to N = numobs(data).
Floats in (0, 1) summing to 1.0 are also accepted.
Examples
partition(X, RandomSplit(), KennardStoneSplit();
train = 70, validation = 10, test = 20)
partition(X, RandomSplit(), KennardStoneSplit();
          train = 0.7, validation = 0.1, test = 0.2)
DataSplits.partition — Method
partition(data, alg;
          train, test,
          target=nothing, time=nothing, groups=nothing,
          rng=Random.default_rng()) -> TrainTestSplit
Split data into train and test cohorts according to alg.
Cohort sizes (train, test)
Integers are accepted in two ways:
- Percentages: values sum to 100.
- Absolute counts: values sum to N = numobs(data).
Floats in (0, 1) summing to 1.0 are also accepted and converted to counts.
Auxiliary slots
- target: response/property vector (e.g. for SPXYSplit).
- time: temporal ordering vector (e.g. for TimeSplit).
- groups: group-membership vector (e.g. for GroupShuffleSplit).
Examples
partition(X, KennardStoneSplit(); train = 80, test = 20)
partition(X, RandomSplit(); train = 0.8, test = 0.2)
partition(X, SPXYSplit(); target = y, train = 80, test = 20)
DataSplits.rowpairs — Method
rowpairs(data, alg::AbstractSplitStrategy; kwargs...)
Convenience wrapper equivalent to:
rowpairs(partition(data, alg; kwargs...))
DataSplits.rowpairs — Method
rowpairs(res) -> Vector{Tuple{Vector{Int}, Vector{Int}}}
Convert a split result into the index-pair format accepted by MLJ's evaluate! resampling= keyword.
- CrossValidationSplit → one (train, test) pair per fold.
- TrainTestSplit → a single-element vector [(train, test)].
- TrainValTestSplit → [(train, val)] (validation cohort, not test).
Example
cvs = partition(X, StratifiedKFold(5); target = y)
mach = machine(model, X, y)
evaluate!(mach; resampling = rowpairs(cvs), measure = accuracy)
DataSplits.sphere_exclusion — Method
sphere_exclusion(data; radius::Real, metric::Distances.SemiMetric=Euclidean()) -> SphereExclusionResult
Clusters samples in data using the sphere exclusion algorithm.
Arguments
- data: Data matrix or container. Columns are samples.
- radius::Real: Exclusion radius (normalized to [0, 1]).
- metric::Distances.SemiMetric: Distance metric (default: Euclidean()).
Returns
SphereExclusionResult: Clustering result with assignments, radius, and metric.
Notes
- The distance matrix is normalized to [0, 1] before clustering.
- Each cluster contains all points within radius of the cluster center.
Examples
result = sphere_exclusion(X; radius=0.2)
assignments = result.assignments
DataSplits.splitdata — Method
splitdata(res, data)
Materialise the split: return a tuple of data subsets corresponding to the train/test (and optionally validation) indices in res.
When data is a DataFrame or other Tables.jl-compatible container, splitdata returns subsets of the same type.
DataSplits.splitview — Method
splitview(res, data)
Like splitdata but returns lazy views via MLUtils.obsview; no data is copied. Prefer splitdata when you need independent copies.
DataSplits.testdata — Function
traindata(res, data...)
testdata(res, data...)
valdata(res, data...)
Like trainview / testview / valview but materialises copies via getobs rather than returning lazy views.
DataSplits.testindices — Method
testindices(res::AbstractSplitResult) -> indices
Return the test indices from a split result.
DataSplits.testview — Function
trainview(res, data...)
testview(res, data...)
valview(res, data...)
Return lazy views of the requested cohort for one or more data sources.
When called with a single data source, returns the view directly. When called with two or more, returns a Tuple of views — suitable for passing directly to Flux.DataLoader or similar.
valview is only defined for TrainValTestSplit.
For CrossValidationSplit, each function returns a Vector (one element per fold).
Examples
# Single source
X_train = trainview(split, X)
X_test = testview(split, X)
# Multiple sources — tuple destructures naturally
X_train, y_train = trainview(split, X, y)
X_test, y_test = testview(split, X, y)
# Flux DataLoader — tuple passed directly
loader = Flux.DataLoader(trainview(split, X, y); batchsize = 64, shuffle = true)
# Train/val/test
X_train, y_train = trainview(split3, X, y)
X_val, y_val = valview(split3, X, y)
X_test, y_test = testview(split3, X, y)
# Cross-validation
for (X_tr, y_tr) in trainview(cvs, X, y)
    loader = Flux.DataLoader((X_tr, y_tr); batchsize = 64)
    # ...
end
DataSplits.traindata — Method
Shares the traindata / testdata / valdata docstring listed under DataSplits.testdata above.
DataSplits.trainindices — Method
trainindices(res::AbstractSplitResult) -> indices
Return the training indices from a split result.
DataSplits.trainview — Method
Shares the trainview / testview / valview docstring listed under DataSplits.testview above.
DataSplits.valdata — Function
Shares the traindata / testdata / valdata docstring listed under DataSplits.testdata above.
DataSplits.valindices — Method
valindices(res::TrainValTestSplit) -> indices
Return the validation indices from a split result.
DataSplits.valview — Function
Shares the trainview / testview / valview docstring listed under DataSplits.testview above.