Reference

DataSplits.ClusterShuffleSplit – Type
ClusterShuffleSplit(res::ClusteringResult, frac::Real)
ClusterShuffleSplit(f::Function, frac::Real, data; rng)

Group-aware train/test splitter: accepts either:

  1. A precomputed ClusteringResult.
  2. A clustering function f(data) that returns one.

At construction, clustering is executed so the strategy always holds a ClusteringResult.

Arguments:

  • res or f(...): a precomputed ClusteringResult, or a clustering function that returns one.
  • frac: fraction of samples in the training set (0 < frac < 1).

This splitter shuffles cluster IDs and accumulates whole clusters until frac * N samples are in the train set, then returns (train_idx, test_idx).
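The accumulation loop can be sketched in plain Julia (a minimal standalone sketch, not the package implementation; `cluster_shuffle_split` is a hypothetical helper name):

```julia
using Random

# Hypothetical standalone sketch of the cluster-shuffle idea:
# shuffle the cluster IDs, then add whole clusters to the train set
# until it holds at least frac * N samples.
function cluster_shuffle_split(assignments::Vector{Int}, frac::Real;
                               rng = Random.default_rng())
    N = length(assignments)
    train = Int[]
    for id in shuffle(rng, unique(assignments))
        append!(train, findall(==(id), assignments))
        length(train) >= frac * N && break
    end
    test = setdiff(1:N, train)
    return sort(train), sort(test)
end

train, test = cluster_shuffle_split([1, 1, 2, 2, 3, 3], 0.5)
# every cluster ends up wholly in train or wholly in test
```

Because whole clusters are added, the train fraction can overshoot frac slightly; here two of the three two-sample clusters land in train.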

DataSplits.ClusterStratifiedSplit – Type
ClusterStratifiedSplit(res::ClusteringResult, allocation::Symbol; n=nothing, frac)
ClusterStratifiedSplit(f::Function, allocation::Symbol; n=nothing, frac, data)

Cluster-stratified train/test splitter.

Splits each cluster into train/test using one of three allocation methods:

  • :equal: Randomly selects n samples from each cluster, then splits them into train/test according to frac.
  • :proportional: Uses all samples in each cluster, splits them into train/test according to frac.
  • :neyman: Randomly selects a Neyman quota from each cluster (based on pooled std), then splits into train/test according to frac.

Arguments

  • res or f(...): ClusteringResult or clustering function.
  • allocation: :equal, :proportional, or :neyman.
  • n: Number of samples per cluster (equal/neyman allocation).
  • frac: Fraction of selected samples to use for train (rest go to test).

Notes

  • If n is greater than the cluster size, all samples in the cluster are used.
  • For :proportional, all samples are always used.
  • For frac=1.0, all selected samples go to train; for frac=0.0, all go to test.
DataSplits.KennardStoneSplit – Type
KennardStoneSplit{T} <: SplitStrategy

A splitting strategy implementing the Kennard-Stone algorithm for train/test splitting.

Fields

  • frac::ValidFraction{T}: Fraction of data to use for training (0 < frac < 1)
  • metric::Distances.SemiMetric: Distance metric to use (default: Euclidean())

Examples

```julia
# Create a splitter with 80% training data using Euclidean distance
splitter = KennardStoneSplit(0.8)

# Create a splitter with a custom metric
using Distances
splitter = KennardStoneSplit(0.7, Cityblock())
```
DataSplits.OptiSimSplit – Type
OptiSimSplit(frac; n_clusters = 10, max_subsample_size, distance_cutoff = 0.10,
             metric = Euclidean(), random_state = 42)

Implementation of OptiSim (Clark 1998, J. Chem. Inf. Comput. Sci.), an optimisable K‑dissimilarity selection.

  • frac – fraction of samples to return in the training subset
  • n_clusters = M – requested cluster/selection‑set size
  • max_subsample_size = K – size of the temporary sub‑sample (default: max(1, ceil(Int, 0.05N)))
  • distance_cutoff = c – two points are “similar” if their distance < c
  • metric – any Distances.jl metric
  • random_state – seed for the RNG

The splitter requires both an X matrix and target vector y when calling split.

DataSplits.SPXYSplit – Method
SPXYSplit(frac; metric = Euclidean())

Create an SPXY splitter – the variant of Kennard–Stone in which the distance matrix is the element‑wise sum of

  • the (normalised) pairwise distance matrix of the feature matrix X, and
  • the (normalised) pairwise distance matrix of the response vector y.

frac is the fraction of samples that will end up in the training subset.

Note

split must be called with a 2‑tuple (X, y) or with positional arguments split(X, y, strategy); calling split(X, strategy) will raise a MethodError, because y is mandatory for SPXY.

Arguments

| name | type | meaning |
|------|------|---------|
| frac | Real (0 < frac < 1) | training‑set fraction |
| metric | Distances.SemiMetric | distance metric used for both X and y |

See also

KennardStoneSplit — the classical variant that uses only X.

DataSplits.SphereExclusionResult – Type
SphereExclusionResult

Result of sphere exclusion clustering.

Fields:

  • assignments::Vector{Int}: cluster index per point.
  • radius::Float64: exclusion radius.
  • metric::Distances.SemiMetric: distance metric.
DataSplits.TargetPropertySplit – Type
TargetPropertySplit{T} <: SplitStrategy

A splitting strategy that partitions a 1D property array into train/test sets by sorting the property values.

Fields

  • frac::ValidFraction{T}: Fraction of data to use for training (0 < frac < 1)
  • order::Symbol: Sorting order; use :asc, :desc, :high, :low, :largest, :smallest, etc.

Examples

```julia
split(y, TargetPropertyHigh(0.8))
split(X[:, 3], TargetPropertyLow(0.5))
```
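A sort-based split like this can be sketched as follows (`property_split` is a hypothetical standalone helper, not the package API):

```julia
# Hypothetical sketch: sort indices by the property value and put the
# first frac * N of them in the train set.
function property_split(y::AbstractVector, frac::Real; order::Symbol = :asc)
    idx = sortperm(y; rev = (order == :desc))
    n_train = round(Int, frac * length(y))
    return idx[1:n_train], idx[(n_train + 1):end]
end

train, test = property_split([0.3, 0.9, 0.1, 0.5], 0.5; order = :desc)
# train == [2, 4]: the indices of the two largest values
```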
DataSplits.TimeSplit – Type
TimeSplit{T} <: SplitStrategy

Splits a 1D array of dates/times into train/test sets, grouping by unique date/time values. No group (samples with the same date) is split between train and test. The actual fraction may be slightly above the requested one, but never below.

Fields

  • frac::ValidFraction{T}: Fraction of data to use for training (0 < frac < 1)
  • order::Symbol: Sorting order; use :asc (oldest in train, default), :desc (newest in train)

Examples

```julia
split(dates, TimeSplitOldest(0.7))
split(dates, TimeSplitNewest(0.3))
```
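The grouping behaviour can be sketched in plain Julia (`time_split` is a hypothetical standalone helper, not the package implementation):

```julia
# Hypothetical sketch: group samples by unique date and add whole groups,
# oldest first, until the train set holds at least frac * N samples.
function time_split(dates::AbstractVector, frac::Real)
    N = length(dates)
    train = Int[]
    for d in sort(unique(dates))
        append!(train, findall(==(d), dates))
        length(train) >= frac * N && break
    end
    return sort(train), sort(setdiff(1:N, train))
end

train, test = time_split([2020, 2020, 2021, 2022, 2022, 2022], 0.5)
# train == [1, 2, 3]: the 2021 group is taken whole, so the train
# fraction can land slightly above the requested 0.5, never below
```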
DataSplits._split – Method
_split(X, y, strategy::SPXYSplit; rng = Random.GLOBAL_RNG) → (train_idx, test_idx)

Split a feature matrix X and response vector y into train/test subsets using the SPXY algorithm:

  1. Build the joint distance matrix D = D_X + D_Y (see SPXYSplit for details).
  2. Run the Kennard–Stone maximin procedure on D.
  3. Return two sorted index vectors (train_idx, test_idx).
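Step 1 can be sketched as a standalone illustration, assuming rows of X are samples, Euclidean distances, and max-normalisation of each distance matrix (`spxy_distance` is a hypothetical name, not the package source):

```julia
using LinearAlgebra

# Hypothetical sketch of the joint SPXY distance:
# D = D_X / max(D_X) + D_y / max(D_y)
function spxy_distance(X::AbstractMatrix, y::AbstractVector)
    N = size(X, 1)
    DX = [norm(X[i, :] - X[j, :]) for i in 1:N, j in 1:N]
    Dy = [abs(y[i] - y[j]) for i in 1:N, j in 1:N]
    return DX ./ maximum(DX) .+ Dy ./ maximum(Dy)
end

D = spxy_distance([0.0 0.0; 1.0 0.0; 0.0 2.0], [1.0, 2.0, 4.0])
# D is symmetric with a zero diagonal; Kennard–Stone then runs on D
```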

Arguments

| name | type | requirement |
|------|------|-------------|
| X | AbstractMatrix | size(X, 1) == length(y) |
| y | AbstractVector | |
| strategy | SPXYSplit | created with SPXYSplit(frac; metric) |
| rng | random‑number source | unused here but kept for API symmetry |

Returns

Two Vector{Int} with the row indices of X (and the corresponding entries of y) that belong to the training and test subsets.

The indices are axis‑correct — if X is an OffsetMatrix whose first row is index 0, the returned indices will also start at 0.

DataSplits.cluster_stratified – Method
cluster_stratified(N, s, rng, data)

Main splitting function. For each cluster, selects indices according to the allocation method, then splits those indices into train/test according to frac.

DataSplits.distance_matrix – Method
distance_matrix(X, metric::PreMetric)

Compute the full symmetric pairwise distance matrix D for the dataset X using the given metric. This function uses get_sample to access samples, ensuring compatibility with any custom array type that implements get_sample and sample_indices.

Returns a matrix D such that D[i, j] = metric(xᵢ, xⱼ) and D[i, j] == D[j, i].
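As a plain-Julia illustration of that contract (standalone sketch assuming rows are samples and the Euclidean metric, rather than the package's get_sample machinery):

```julia
using LinearAlgebra

# Standalone sketch: full symmetric pairwise Euclidean distance matrix.
function dist_matrix(X::AbstractMatrix)
    N = size(X, 1)
    D = zeros(N, N)
    for i in 1:N, j in (i + 1):N
        D[i, j] = D[j, i] = norm(X[i, :] - X[j, :])
    end
    return D
end

D = dist_matrix([0.0 0.0; 3.0 4.0])
# D[1, 2] == D[2, 1] == 5.0
```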

DataSplits.equal_allocation – Method
equal_allocation(cl_ids, idxs_by_cluster, n, rng)

Randomly select n samples from each cluster (or all if cluster is smaller). Returns a Dict mapping cluster id to selected indices.

DataSplits.find_maximin_element – Method
find_maximin_element(distances::AbstractMatrix{T},
                    source_set::AbstractVector{Int},
                    reference_set::AbstractVector{Int}) -> Int

Find the element in source_set that maximizes the minimum distance to all elements in reference_set.

Arguments

  • distances::AbstractMatrix{T}: Precomputed, symmetric pairwise distance matrix.
  • source_set::AbstractVector{Int}: Set of items to evaluate.
  • reference_set::AbstractVector{Int}: Set of items to compare against.

Returns

  • Int: The index in source_set that is farthest from its nearest neighbour in reference_set.

Notes

  • If reference_set is empty, an ArgumentError is thrown.
  • Ties are broken by returning the first maximum.
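The maximin rule can be sketched as follows (hypothetical standalone helper, not the package source):

```julia
# Hypothetical sketch: return the element of source whose nearest
# neighbour in reference is farthest away (first maximum wins ties).
function maximin(D::AbstractMatrix, source::Vector{Int}, reference::Vector{Int})
    isempty(reference) && throw(ArgumentError("reference set is empty"))
    best, best_d = first(source), -Inf
    for i in source
        d = minimum(D[i, j] for j in reference)
        if d > best_d
            best, best_d = i, d
        end
    end
    return best
end

D = [0.0 1.0 4.0; 1.0 0.0 2.0; 4.0 2.0 0.0]
maximin(D, [2, 3], [1])  # == 3: point 3 is farthest from point 1
```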
DataSplits.get_sample – Method
get_sample(A::AbstractArray, idx)

Public API for getting samples; handles any valid index type and dispatches to the internal _get_sample after index conversion.

DataSplits.is_sample_array – Method
is_sample_array(::Type{T}) -> Bool

Trait method that tells whether type T behaves like a sample‑indexed array that supports sample_indices, _get_sample, etc.

You may extend this for custom containers to ensure compatibility with OptiSim or other sampling methods.

Defaults to false unless specialized.

DataSplits.kennardstone – Method
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)

Optimized in-memory Kennard-Stone algorithm using precomputed distance matrix. Best for small-to-medium datasets where O(N²) memory is acceptable.

DataSplits.lazy_kennard_stone – Method
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)

Kennard-Stone (CADEX) algorithm for optimal train/test splitting using the maximin strategy. Memory-optimized implementation with O(N) storage, useful for large datasets where the N×N distance matrix does not fit in memory. For small datasets, prefer the traditional implementation.

DataSplits.neyman_allocation – Method
neyman_allocation(cl_ids, idxs_by_cluster, n, data, rng)

Randomly select a Neyman quota from each cluster (proportional to cluster size and mean std of features). Returns a Dict mapping cluster id to selected indices.
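Neyman allocation gives each cluster a quota proportional to its size times its standard deviation, n_h = n·N_h·s_h / Σ_k N_k·s_k. The quota computation can be sketched as (hypothetical helper; the package additionally pools per-feature standard deviations):

```julia
# Hypothetical sketch of Neyman quotas: n_h = n * N_h * s_h / sum(N_k * s_k).
function neyman_quotas(sizes::Vector{Int}, stds::Vector{Float64}, n::Int)
    w = sizes .* stds          # weight = cluster size times cluster std
    return round.(Int, n .* w ./ sum(w))
end

neyman_quotas([100, 50], [1.0, 4.0], 30)
# == [10, 20]: the smaller but more variable cluster gets the larger quota
```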

DataSplits.proportional_allocation – Method
proportional_allocation(cl_ids, idxs_by_cluster, rng)

Use all samples in each cluster, shuffled. Returns a Dict mapping cluster id to selected indices.

DataSplits.sample_indices – Method
sample_indices(A::AbstractArray) -> AbstractVector

Return a vector of valid sample indices for A, supporting all AbstractArray types.

This method defines the "sample axis" (typically axis 1) and determines how your splitting/sampling algorithms enumerate data points.

Default

For standard arrays, returns axes(A, 1).

Extension

To support non-standard arrays (e.g., views, custom wrappers), you may extend this method to expose logical sample indices:

DataSplits.sample_indices(a::MyFancyArray) = 1:length(a.ids)
DataSplits.sphere_exclusion – Method
sphere_exclusion(data; radius::Real, metric::Distances.SemiMetric=Euclidean()) -> SphereExclusionResult

Cluster samples in data by sphere exclusion:

  1. Compute full pairwise distance matrix and normalize values to [0,1].
  2. While unassigned samples remain:
    • Pick first unassigned sample i.
    • All unassigned samples j with normalized distance D[i,j] <= radius form a cluster.
    • Mark them assigned and increment cluster ID.
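The loop above can be sketched in plain Julia (standalone sketch assuming rows of X are samples and Euclidean distance; `sphere_clusters` is a hypothetical name, not the package source):

```julia
using LinearAlgebra

# Standalone sketch of sphere exclusion on a normalised distance matrix.
function sphere_clusters(X::AbstractMatrix, radius::Real)
    N = size(X, 1)
    D = [norm(X[i, :] - X[j, :]) for i in 1:N, j in 1:N]
    D ./= maximum(D)                      # normalise distances to [0, 1]
    assignments = zeros(Int, N)
    cluster = 0
    for i in 1:N
        assignments[i] == 0 || continue   # skip already-assigned samples
        cluster += 1
        for j in 1:N
            if assignments[j] == 0 && D[i, j] <= radius
                assignments[j] = cluster
            end
        end
    end
    return assignments
end

sphere_clusters(reshape([0.0, 0.1, 1.0], 3, 1), 0.2)
# == [1, 1, 2]: the two nearby points share a sphere
```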
DataSplits.split – Method
split(data, strategy; rng=Random.default_rng()) → (train, test)

Split data into train/test sets according to strategy.

DataSplits.split_with_positions – Method
split_with_positions(data, s, core_algorithm; rng=Random.default_rng(), args...)

Generic wrapper for split strategies. Handles mapping between user indices and 1:N positions.

  • data: The user’s data array.
  • s: The split strategy object.
  • core_algorithm: Function (N, s, rng, data, args...) -> (train_pos, test_pos).

Returns: (train_idx, test_idx) as indices valid for data.
