# Reference

## Index
- `DataSplits.ClusterShuffleSplit`
- `DataSplits.ClusterStratifiedSplit`
- `DataSplits.KennardStoneSplit`
- `DataSplits.OptiSimSplit`
- `DataSplits.SPXYSplit`
- `DataSplits.SphereExclusionResult`
- `DataSplits.TargetPropertySplit`
- `DataSplits.TimeSplit`
- `DataSplits._split`
- `DataSplits.cluster_stratified`
- `DataSplits.distance_matrix`
- `DataSplits.equal_allocation`
- `DataSplits.find_maximin_element`
- `DataSplits.find_most_distant_pair`
- `DataSplits.get_sample`
- `DataSplits.is_sample_array`
- `DataSplits.kennardstone`
- `DataSplits.lazy_kennard_stone`
- `DataSplits.neyman_allocation`
- `DataSplits.proportional_allocation`
- `DataSplits.sample_indices`
- `DataSplits.sphere_exclusion`
- `DataSplits.split`
- `DataSplits.split_with_positions`
- `DataSplits.targetpropertysplit`
## DataSplits.ClusterShuffleSplit — Type

```julia
ClusterShuffleSplit(res::ClusteringResult, frac::Real)
ClusterShuffleSplit(f::Function, frac::Real, data; rng)
```

Group-aware train/test splitter. Accepts either:

- a precomputed `ClusteringResult`, or
- a clustering function `f(data)` that returns one.

Clustering is executed at construction, so the strategy always holds a `ClusteringResult`.

**Arguments**

- `res` or `f(...)`: `ClusteringResult` or clustering function.
- `frac`: fraction of samples in the training set (0 < frac < 1).

This splitter shuffles cluster IDs and accumulates whole clusters until `frac * N` samples are in the train set, then returns `(train_idx, test_idx)`.
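To illustrate the accumulation rule, here is a minimal sketch in plain Julia (not the package implementation): shuffle the cluster IDs, then move whole clusters into the train set until the quota is reached.

```julia
using Random

# Sketch of the ClusterShuffleSplit idea: given per-sample cluster
# assignments, shuffle the cluster IDs and add whole clusters to the
# train set until at least frac * N samples are covered.
function cluster_shuffle_split(assignments::Vector{Int}, frac::Real;
                               rng = Random.default_rng())
    N = length(assignments)
    ids = shuffle(rng, unique(assignments))
    train_idx = Int[]
    for id in ids
        append!(train_idx, findall(==(id), assignments))
        length(train_idx) >= frac * N && break
    end
    test_idx = setdiff(1:N, train_idx)
    return sort(train_idx), sort(test_idx)
end
```

Because whole clusters are moved together, the realised training fraction can exceed `frac`.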
## DataSplits.ClusterStratifiedSplit — Type

```julia
ClusterStratifiedSplit(res::ClusteringResult, allocation::Symbol; n=nothing, frac)
ClusterStratifiedSplit(f::Function, allocation::Symbol; n=nothing, frac, data)
```

Cluster-stratified train/test splitter.

Splits each cluster into train/test using one of three allocation methods:

- `:equal`: randomly selects `n` samples from each cluster, then splits them into train/test according to `frac`.
- `:proportional`: uses all samples in each cluster, splits them into train/test according to `frac`.
- `:neyman`: randomly selects a Neyman quota from each cluster (based on pooled std), then splits into train/test according to `frac`.

**Arguments**

- `res` or `f(...)`: `ClusteringResult` or clustering function.
- `allocation`: `:equal`, `:proportional`, or `:neyman`.
- `n`: number of samples per cluster (equal/neyman allocation).
- `frac`: fraction of selected samples to use for train (the rest go to test).

**Notes**

- If `n` is greater than the cluster size, all samples in the cluster are used.
- For `:proportional`, all samples are always used.
- For `frac=1.0`, all selected samples go to train; for `frac=0.0`, all go to test.
## DataSplits.KennardStoneSplit — Type

```julia
KennardStoneSplit{T} <: SplitStrategy
```

A splitting strategy implementing the Kennard-Stone algorithm for train/test splitting.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `metric::Distances.SemiMetric`: distance metric to use (default: `Euclidean()`).

**Examples**

```julia
# Create a splitter with 80% training data using Euclidean distance
splitter = KennardStoneSplit(0.8)

# Create a splitter with a custom metric
using Distances
splitter = KennardStoneSplit(0.7, Cityblock())
```
## DataSplits.OptiSimSplit — Type

```julia
OptiSimSplit(frac; n_clusters = 10, max_subsample_size, distance_cutoff = 0.10,
             metric = Euclidean(), random_state = 42)
```

Implementation of OptiSim (Clark 1998, J. Chem. Inf. Comput. Sci.), an optimisable K-dissimilarity selection.

**Arguments**

- `frac`: fraction of samples to return in the training subset.
- `n_clusters = M`: requested cluster/selection-set size.
- `max_subsample_size = K`: size of the temporary subsample (default: `max(1, ceil(Int, 0.05N))`).
- `distance_cutoff = c`: two points are "similar" if their distance < `c`.
- `metric`: any `Distances.jl` metric.
- `random_state`: seed for the RNG.

The splitter requires both an `X` matrix and a target vector `y` when calling `split`.
## DataSplits.SPXYSplit — Method

```julia
SPXYSplit(frac; metric = Euclidean())
```

Create an SPXY splitter, the variant of Kennard-Stone in which the distance matrix is the element-wise sum of

- the (normalised) pairwise distance matrix of the feature matrix `X`, plus
- the (normalised) pairwise distance matrix of the response vector `y`.

`frac` is the fraction of samples that will end up in the training subset.

`split` must be called with a 2-tuple `(X, y)` or with positional arguments `split(X, y, strategy)`; calling `split(X, strategy)` will raise a `MethodError`, because `y` is mandatory for SPXY.

**Arguments**

| name | type | meaning |
|---|---|---|
| `frac` | `Real` (0 < frac < 1) | training-set fraction |
| `metric` | `Distances.SemiMetric` | distance metric used for both `X` and `y` |

**See also**

`KennardStoneSplit`: the classical variant that uses only `X`.
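A sketch of how the joint SPXY distance matrix can be formed, assuming Euclidean distances and max-normalisation (the package may normalise differently):

```julia
# Sketch: joint SPXY distance matrix D = D_X / max(D_X) + D_y / max(D_y),
# where rows of X are samples and y is the response vector.
function spxy_distance(X::AbstractMatrix, y::AbstractVector)
    n = size(X, 1)
    DX = [sqrt(sum(abs2, X[i, :] .- X[j, :])) for i in 1:n, j in 1:n]
    DY = [abs(y[i] - y[j]) for i in 1:n, j in 1:n]
    return DX ./ maximum(DX) .+ DY ./ maximum(DY)
end
```

Both terms are scaled to [0, 1] before summing, so neither the features nor the response dominates the joint distance.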
## DataSplits.SphereExclusionResult — Type

```julia
SphereExclusionResult
```

Result of sphere exclusion clustering.

**Fields**

- `assignments::Vector{Int}`: cluster index per point.
- `radius::Float64`: exclusion radius.
- `metric::Distances.SemiMetric`: distance metric.
## DataSplits.TargetPropertySplit — Type

```julia
TargetPropertySplit{T} <: SplitStrategy
```

A splitting strategy that partitions a 1D property array into train/test sets by sorting the property values.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `order::Symbol`: sorting order; use `:asc`, `:desc`, `:high`, `:low`, `:largest`, `:smallest`, etc.

**Examples**

```julia
split(y, TargetPropertyHigh(0.8))
split(X[:, 3], TargetPropertyLow(0.5))
```
## DataSplits.TimeSplit — Type

```julia
TimeSplit{T} <: SplitStrategy
```

Splits a 1D array of dates/times into train/test sets, grouping by unique date/time values. No group (samples with the same date) is split between train and test. The actual fraction may be slightly above the requested one, but never below.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `order::Symbol`: sorting order; use `:asc` (oldest in train, default) or `:desc` (newest in train).

**Examples**

```julia
split(dates, TimeSplitOldest(0.7))
split(dates, TimeSplitNewest(0.3))
```
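The grouping rule can be sketched as follows (a simplified illustration in plain Julia, not the package code): sort the unique dates, then move whole date groups into train until at least the requested fraction is covered.

```julia
using Dates

# Sketch of TimeSplit's grouping rule: whole date groups go to train,
# oldest first, until at least frac * N samples are covered; a group
# never straddles the train/test boundary.
function time_split_oldest(dates::Vector{Date}, frac::Real)
    N = length(dates)
    train_idx = Int[]
    for d in sort(unique(dates))
        append!(train_idx, findall(==(d), dates))
        length(train_idx) >= frac * N && break
    end
    return sort(train_idx), sort(setdiff(1:N, train_idx))
end
```

Since groups are indivisible, the realised training fraction can only land on, or just above, `frac`, matching the guarantee stated above.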
## DataSplits._split — Method

```julia
_split(X, y, strategy::SPXYSplit; rng = Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Split a feature matrix `X` and response vector `y` into train/test subsets using the SPXY algorithm:

1. Build the joint distance matrix `D = D_X + D_Y` (see `SPXYSplit` for details).
2. Run the Kennard-Stone maximin procedure on `D`.
3. Return two sorted index vectors (`train_idx`, `test_idx`).

**Arguments**

| name | type | requirement |
|---|---|---|
| `X` | `AbstractMatrix` | `size(X, 1) == length(y)` |
| `y` | `AbstractVector` | |
| `strategy` | `SPXYSplit` | created with `SPXYSplit(frac; metric)` |
| `rng` | random-number source | unused here but kept for API symmetry |

**Returns**

Two `Vector{Int}` with the row indices of `X` (and the corresponding entries of `y`) that belong to the training and test subsets.

The indices are axis-correct: if `X` is an `OffsetMatrix` whose first row is index `0`, the returned indices will also start at `0`.
## DataSplits.cluster_stratified — Method

```julia
cluster_stratified(N, s, rng, data)
```

Main splitting function. For each cluster, selects indices according to the allocation method, then splits those indices into train/test according to `frac`.
## DataSplits.distance_matrix — Method

```julia
distance_matrix(X, metric::PreMetric)
```

Compute the full symmetric pairwise distance matrix `D` for the dataset `X` using the given `metric`. This function uses `get_sample` to access samples, ensuring compatibility with any custom array type that implements `get_sample` and `sample_indices`.

Returns a matrix `D` such that `D[i, j] = metric(xᵢ, xⱼ)` and `D[i, j] == D[j, i]`.
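A minimal sketch of the computation for a plain matrix whose rows are samples (the package version additionally routes access through `get_sample`):

```julia
# Sketch: symmetric pairwise distance matrix. Only the upper triangle is
# computed; the lower triangle is filled by mirroring, which is where the
# D[i, j] == D[j, i] guarantee comes from.
function pairwise_distances(X::AbstractMatrix, metric::Function)
    n = size(X, 1)
    D = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        D[i, j] = metric(X[i, :], X[j, :])
        D[j, i] = D[i, j]
    end
    return D
end
```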
## DataSplits.equal_allocation — Method

```julia
equal_allocation(cl_ids, idxs_by_cluster, n, rng)
```

Randomly select `n` samples from each cluster (or all if the cluster is smaller). Returns a `Dict` mapping cluster id to selected indices.
## DataSplits.find_maximin_element — Method

```julia
find_maximin_element(distances::AbstractMatrix{T},
                     source_set::AbstractVector{Int},
                     reference_set::AbstractVector{Int}) -> Int
```

Find the element in `source_set` that maximizes the minimum distance to all elements in `reference_set`.

**Arguments**

- `distances::AbstractMatrix{T}`: precomputed, symmetric pairwise distance matrix.
- `source_set::AbstractVector{Int}`: set of items to evaluate.
- `reference_set::AbstractVector{Int}`: set of items to compare against.

**Returns**

- `Int`: the index in `source_set` that is farthest from its nearest neighbour in `reference_set`.

**Notes**

- If `reference_set` is empty, throws `ArgumentError`.
- Breaks ties by returning the first maximum.
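The maximin rule can be sketched in a few lines of plain Julia (a simplified illustration over a precomputed matrix, not the package implementation):

```julia
# Sketch of the maximin rule: for each candidate in source_set, find its
# distance to the nearest member of reference_set, then return the
# candidate whose nearest neighbour is farthest away.
function maximin_element(D::AbstractMatrix, source_set::Vector{Int},
                         reference_set::Vector{Int})
    isempty(reference_set) && throw(ArgumentError("reference_set is empty"))
    best, best_d = first(source_set), -Inf
    for i in source_set
        d = minimum(D[i, j] for j in reference_set)
        if d > best_d            # strict '>' keeps the first maximum on ties
            best, best_d = i, d
        end
    end
    return best
end
```

This rule is the core step of Kennard-Stone selection: the next training sample is always the one least represented by the samples chosen so far.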
## DataSplits.find_most_distant_pair — Method

```julia
find_most_distant_pair(D::AbstractMatrix) → (i, j)
```

Finds the indices of the most distant pair in a precomputed distance matrix.
## DataSplits.get_sample — Method

```julia
get_sample(A::AbstractArray, idx)
```

Public API for getting samples that handles any valid index type. Dispatches to the internal `_get_sample` after index conversion.
## DataSplits.is_sample_array — Method

```julia
is_sample_array(::Type{T}) -> Bool
```

Trait method that tells whether type `T` behaves like a sample-indexed array that supports `sample_indices`, `_get_sample`, etc.

You may extend this for custom containers to ensure compatibility with OptiSim or other sampling methods.

Defaults to `false` unless specialized.
## DataSplits.kennardstone — Method

```julia
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Optimized in-memory Kennard-Stone algorithm using a precomputed distance matrix. Best for small-to-medium datasets where O(N²) memory is acceptable.
## DataSplits.lazy_kennard_stone — Method

```julia
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Kennard-Stone (CADEX) algorithm for optimal train/test splitting using a maximin strategy. Memory-optimized implementation with O(N) storage. Useful for large datasets where the N×N distance matrix does not fit in memory. For small datasets, use the traditional implementation.
## DataSplits.neyman_allocation — Method

```julia
neyman_allocation(cl_ids, idxs_by_cluster, n, data, rng)
```

Randomly select a Neyman quota from each cluster (proportional to cluster size and the mean standard deviation of the features). Returns a `Dict` mapping cluster id to selected indices.
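Neyman allocation gives each cluster a quota proportional to its size times its within-cluster spread. A sketch of the quota computation (the rounding and normalisation details here are an assumption, not necessarily what the package does):

```julia
# Sketch: Neyman quotas n_h = n * (N_h * s_h) / Σ_h (N_h * s_h), where
# N_h is the size of cluster h and s_h its standard deviation. Larger,
# more heterogeneous clusters receive proportionally more samples.
function neyman_quotas(sizes::Vector{Int}, stds::Vector{Float64}, n::Int)
    w = sizes .* stds
    return round.(Int, n .* w ./ sum(w))
end
```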
## DataSplits.proportional_allocation — Method

```julia
proportional_allocation(cl_ids, idxs_by_cluster, rng)
```

Use all samples in each cluster, shuffled. Returns a `Dict` mapping cluster id to selected indices.
## DataSplits.sample_indices — Method

```julia
sample_indices(A::AbstractArray) -> AbstractVector
```

Return a vector of valid sample indices for `A`, supporting all `AbstractArray` types.

This method defines the "sample axis" (typically axis 1) and determines how your splitting/sampling algorithms enumerate data points.

**Default**

For standard arrays, returns `axes(A, 1)`.

**Extension**

To support non-standard arrays (e.g., views, custom wrappers), you may extend this method to expose logical sample indices:

```julia
DataSplits.sample_indices(a::MyFancyArray) = 1:length(a.ids)
```
## DataSplits.sphere_exclusion — Method

```julia
sphere_exclusion(data; radius::Real, metric::Distances.SemiMetric=Euclidean()) -> SphereExclusionResult
```

Cluster samples in `data` by sphere exclusion:

1. Compute the full pairwise distance matrix and normalize values to [0, 1].
2. While unassigned samples remain:
   1. Pick the first unassigned sample `i`.
   2. All unassigned samples `j` with normalized distance `D[i, j] <= radius` form a cluster.
   3. Mark them assigned and increment the cluster ID.
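The loop above can be sketched as follows, assuming the distance matrix is already computed and normalised (plain Julia, not the package implementation):

```julia
# Sketch of sphere exclusion on a precomputed, [0, 1]-normalised distance
# matrix: repeatedly take the first unassigned point as a new cluster seed
# and absorb every unassigned point within `radius` of it.
function sphere_exclusion_assign(D::AbstractMatrix, radius::Real)
    n = size(D, 1)
    assignments = zeros(Int, n)   # 0 means "not yet assigned"
    cluster = 0
    while any(iszero, assignments)
        i = findfirst(iszero, assignments)
        cluster += 1
        for j in 1:n
            if assignments[j] == 0 && D[i, j] <= radius
                assignments[j] = cluster
            end
        end
    end
    return assignments
end
```

Note that cluster shapes depend on the order in which seeds are picked: an earlier cluster can absorb points that a later seed would otherwise claim.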
## DataSplits.split — Method

```julia
split(data, strategy; rng=Random.default_rng()) → (train, test)
```

Split `data` into train/test sets according to `strategy`.
## DataSplits.split_with_positions — Method

```julia
split_with_positions(data, s, core_algorithm; rng=Random.default_rng(), args...)
```

Generic wrapper for split strategies. Handles the mapping between user indices and 1:N positions.

**Arguments**

- `data`: the user's data array.
- `s`: the split strategy object.
- `core_algorithm`: function `(N, s, rng, data, args...) -> (train_pos, test_pos)`.

**Returns**

`(train_idx, test_idx)` as indices valid for `data`.
## DataSplits.targetpropertysplit — Method

```julia
targetpropertysplit(N, s, rng, data)
```

Sorts the property array and splits it into train/test according to `s.frac` and `s.order`.