# Reference

## Index
- `DataSplits.ClusterShuffleSplit`
- `DataSplits.ClusterStratifiedSplit`
- `DataSplits.KennardStoneSplit`
- `DataSplits.OptiSimSplit`
- `DataSplits.SPXYSplit`
- `DataSplits.SphereExclusionResult`
- `DataSplits.TargetPropertySplit`
- `DataSplits.TimeSplit`
- `DataSplits._split`
- `DataSplits.cluster_stratified`
- `DataSplits.distance_matrix`
- `DataSplits.equal_allocation`
- `DataSplits.find_maximin_element`
- `DataSplits.find_most_distant_pair`
- `DataSplits.get_sample`
- `DataSplits.is_sample_array`
- `DataSplits.kennardstone`
- `DataSplits.lazy_kennard_stone`
- `DataSplits.neyman_allocation`
- `DataSplits.proportional_allocation`
- `DataSplits.sample_indices`
- `DataSplits.sphere_exclusion`
- `DataSplits.split`
- `DataSplits.split_with_positions`
- `DataSplits.targetpropertysplit`
## DataSplits.ClusterShuffleSplit — Type

```julia
ClusterShuffleSplit(res::ClusteringResult, frac::Real)
ClusterShuffleSplit(f::Function, frac::Real, data; rng)
```

Group-aware train/test splitter. Accepts either:

- a precomputed `ClusteringResult`, or
- a clustering function `f(data)` that returns one.

Clustering is executed at construction, so the strategy always holds a `ClusteringResult`.

**Arguments**

- `res` or `f(...)`: `ClusteringResult` or clustering function.
- `frac`: fraction of samples in the training set (0 < frac < 1).

This splitter shuffles cluster IDs and accumulates whole clusters until `frac * N` samples are in the train set, then returns `(train_idx, test_idx)`.
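To illustrate the accumulation rule, here is a minimal sketch in plain Julia (not the package implementation): shuffle the cluster IDs, then move whole clusters into the train set until the quota is reached.

```julia
using Random

# Sketch of the ClusterShuffleSplit idea: given per-sample cluster
# assignments, shuffle the cluster IDs and add whole clusters to the
# train set until at least frac * N samples are covered.
function cluster_shuffle_split(assignments::Vector{Int}, frac::Real;
                               rng = Random.default_rng())
    N = length(assignments)
    ids = shuffle(rng, unique(assignments))
    train_idx = Int[]
    for id in ids
        append!(train_idx, findall(==(id), assignments))
        length(train_idx) >= frac * N && break
    end
    test_idx = setdiff(1:N, train_idx)
    return sort(train_idx), sort(test_idx)
end
```

Because whole clusters are moved together, the realised training fraction can exceed `frac`.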
## DataSplits.ClusterStratifiedSplit — Type

```julia
ClusterStratifiedSplit(res::ClusteringResult, allocation::Symbol; n=nothing, frac)
ClusterStratifiedSplit(f::Function, allocation::Symbol; n=nothing, frac, data)
```

Cluster-stratified train/test splitter.

Splits each cluster into train/test using one of three allocation methods:

- `:equal`: randomly selects `n` samples from each cluster, then splits them into train/test according to `frac`.
- `:proportional`: uses all samples in each cluster, splits them into train/test according to `frac`.
- `:neyman`: randomly selects a Neyman quota from each cluster (based on pooled std), then splits into train/test according to `frac`.

**Arguments**

- `res` or `f(...)`: `ClusteringResult` or clustering function.
- `allocation`: `:equal`, `:proportional`, or `:neyman`.
- `n`: number of samples per cluster (equal/neyman allocation).
- `frac`: fraction of selected samples to use for train (the rest go to test).

**Notes**

- If `n` is greater than the cluster size, all samples in the cluster are used.
- For `:proportional`, all samples are always used.
- For `frac=1.0`, all selected samples go to train; for `frac=0.0`, all go to test.
## DataSplits.KennardStoneSplit — Type

```julia
KennardStoneSplit{T} <: SplitStrategy
```

A splitting strategy implementing the Kennard-Stone algorithm for train/test splitting.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `metric::Distances.SemiMetric`: distance metric to use (default: `Euclidean()`).

**Examples**

```julia
# Create a splitter with 80% training data using Euclidean distance
splitter = KennardStoneSplit(0.8)

# Create a splitter with a custom metric
using Distances
splitter = KennardStoneSplit(0.7, Cityblock())
```
## DataSplits.OptiSimSplit — Type

```julia
OptiSimSplit(frac; n_clusters = 10, max_subsample_size, distance_cutoff = 0.10,
             metric = Euclidean(), random_state = 42)
```

Implementation of OptiSim (Clark 1998, J. Chem. Inf. Comput. Sci.), an optimisable K-dissimilarity selection.

**Arguments**

- `frac`: fraction of samples to return in the training subset.
- `n_clusters = M`: requested cluster/selection-set size.
- `max_subsample_size = K`: size of the temporary subsample (default: `max(1, ceil(Int, 0.05N))`).
- `distance_cutoff = c`: two points are "similar" if their distance < `c`.
- `metric`: any `Distances.jl` metric.
- `random_state`: seed for the RNG.

The splitter requires both an `X` matrix and a target vector `y` when calling `split`.
## DataSplits.SPXYSplit — Method

```julia
SPXYSplit(frac; metric = Euclidean())
```

Create an SPXY splitter, the variant of Kennard-Stone in which the distance matrix is the element-wise sum of

- the (normalised) pairwise distance matrix of the feature matrix `X`, plus
- the (normalised) pairwise distance matrix of the response vector `y`.

`frac` is the fraction of samples that will end up in the training subset.

`split` must be called with a 2-tuple `(X, y)` or with positional arguments `split(X, y, strategy)`; calling `split(X, strategy)` will raise a `MethodError`, because `y` is mandatory for SPXY.

**Arguments**

| name | type | meaning |
|---|---|---|
| `frac` | `Real` (0 < frac < 1) | training-set fraction |
| `metric` | `Distances.SemiMetric` | distance metric used for both `X` and `y` |

**See also**

`KennardStoneSplit`: the classical variant that uses only `X`.
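A sketch of how the joint SPXY distance matrix can be formed, assuming Euclidean distances and max-normalisation (the package may normalise differently):

```julia
# Sketch: joint SPXY distance matrix D = D_X / max(D_X) + D_y / max(D_y),
# where rows of X are samples and y is the response vector.
function spxy_distance(X::AbstractMatrix, y::AbstractVector)
    n = size(X, 1)
    DX = [sqrt(sum(abs2, X[i, :] .- X[j, :])) for i in 1:n, j in 1:n]
    DY = [abs(y[i] - y[j]) for i in 1:n, j in 1:n]
    return DX ./ maximum(DX) .+ DY ./ maximum(DY)
end
```

Both terms are scaled to [0, 1] before summing, so neither the features nor the response dominates the joint distance.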
## DataSplits.SphereExclusionResult — Type

```julia
SphereExclusionResult
```

Result of sphere exclusion clustering.

**Fields**

- `assignments::Vector{Int}`: cluster index per point.
- `radius::Float64`: exclusion radius.
- `metric::Distances.SemiMetric`: distance metric.
## DataSplits.TargetPropertySplit — Type

```julia
TargetPropertySplit{T} <: SplitStrategy
```

A splitting strategy that partitions a 1D property array into train/test sets by sorting the property values.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `order::Symbol`: sorting order; use `:asc`, `:desc`, `:high`, `:low`, `:largest`, `:smallest`, etc.

**Examples**

```julia
split(y, TargetPropertyHigh(0.8))
split(X[:, 3], TargetPropertyLow(0.5))
```
## DataSplits.TimeSplit — Type

```julia
TimeSplit{T} <: SplitStrategy
```

Splits a 1D array of dates/times into train/test sets, grouping by unique date/time values. No group (samples with the same date) is split between train and test. The actual fraction may be slightly above the requested one, but never below.

**Fields**

- `frac::ValidFraction{T}`: fraction of data to use for training (0 < frac < 1).
- `order::Symbol`: sorting order; use `:asc` (oldest in train, default) or `:desc` (newest in train).

**Examples**

```julia
split(dates, TimeSplitOldest(0.7))
split(dates, TimeSplitNewest(0.3))
```
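The grouping rule can be sketched as follows (a simplified illustration in plain Julia, not the package code): sort the unique dates, then move whole date groups into train until at least the requested fraction is covered.

```julia
using Dates

# Sketch of TimeSplit's grouping rule: whole date groups go to train,
# oldest first, until at least frac * N samples are covered; a group
# never straddles the train/test boundary.
function time_split_oldest(dates::Vector{Date}, frac::Real)
    N = length(dates)
    train_idx = Int[]
    for d in sort(unique(dates))
        append!(train_idx, findall(==(d), dates))
        length(train_idx) >= frac * N && break
    end
    return sort(train_idx), sort(setdiff(1:N, train_idx))
end
```

Since groups are indivisible, the realised training fraction can only land on, or just above, `frac`, matching the guarantee stated above.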
## DataSplits._split — Method

```julia
_split(X, y, strategy::SPXYSplit; rng = Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Split a feature matrix `X` and response vector `y` into train/test subsets using the SPXY algorithm:

1. Build the joint distance matrix `D = D_X + D_Y` (see `SPXYSplit` for details).
2. Run the Kennard-Stone maximin procedure on `D`.
3. Return two sorted index vectors (`train_idx`, `test_idx`).

**Arguments**

| name | type | requirement |
|---|---|---|
| `X` | `AbstractMatrix` | `size(X, 1) == length(y)` |
| `y` | `AbstractVector` | |
| `strategy` | `SPXYSplit` | created with `SPXYSplit(frac; metric)` |
| `rng` | random-number source | unused here but kept for API symmetry |

**Returns**

Two `Vector{Int}` with the row indices of `X` (and the corresponding entries of `y`) that belong to the training and test subsets.

The indices are axis-correct: if `X` is an `OffsetMatrix` whose first row is index `0`, the returned indices will also start at `0`.
## DataSplits.cluster_stratified — Method

```julia
cluster_stratified(N, s, rng, data)
```

Main splitting function. For each cluster, selects indices according to the allocation method, then splits those indices into train/test according to `frac`.
## DataSplits.distance_matrix — Method

```julia
distance_matrix(X, metric::PreMetric)
```

Compute the full symmetric pairwise distance matrix `D` for the dataset `X` using the given `metric`. This function uses `get_sample` to access samples, ensuring compatibility with any custom array type that implements `get_sample` and `sample_indices`.

Returns a matrix `D` such that `D[i, j] = metric(xᵢ, xⱼ)` and `D[i, j] == D[j, i]`.
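A minimal sketch of the computation for a plain matrix whose rows are samples (the package version additionally routes access through `get_sample`):

```julia
# Sketch: symmetric pairwise distance matrix. Only the upper triangle is
# computed; the lower triangle is filled by mirroring, which is where the
# D[i, j] == D[j, i] guarantee comes from.
function pairwise_distances(X::AbstractMatrix, metric::Function)
    n = size(X, 1)
    D = zeros(n, n)
    for i in 1:n, j in (i + 1):n
        D[i, j] = metric(X[i, :], X[j, :])
        D[j, i] = D[i, j]
    end
    return D
end
```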
## DataSplits.equal_allocation — Method

```julia
equal_allocation(cl_ids, idxs_by_cluster, n, rng)
```

Randomly select `n` samples from each cluster (or all if the cluster is smaller). Returns a `Dict` mapping cluster id to selected indices.
## DataSplits.find_maximin_element — Method

```julia
find_maximin_element(distances::AbstractMatrix{T},
                     source_set::AbstractVector{Int},
                     reference_set::AbstractVector{Int}) -> Int
```

Find the element in `source_set` that maximizes the minimum distance to all elements in `reference_set`.

**Arguments**

- `distances::AbstractMatrix{T}`: precomputed, symmetric pairwise distance matrix.
- `source_set::AbstractVector{Int}`: set of items to evaluate.
- `reference_set::AbstractVector{Int}`: set of items to compare against.

**Returns**

- `Int`: the index in `source_set` that is farthest from its nearest neighbour in `reference_set`.

**Notes**

- If `reference_set` is empty, throws `ArgumentError`.
- Breaks ties by returning the first maximum.
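The maximin rule can be sketched in a few lines of plain Julia (a simplified illustration over a precomputed matrix, not the package implementation):

```julia
# Sketch of the maximin rule: for each candidate in source_set, find its
# distance to the nearest member of reference_set, then return the
# candidate whose nearest neighbour is farthest away.
function maximin_element(D::AbstractMatrix, source_set::Vector{Int},
                         reference_set::Vector{Int})
    isempty(reference_set) && throw(ArgumentError("reference_set is empty"))
    best, best_d = first(source_set), -Inf
    for i in source_set
        d = minimum(D[i, j] for j in reference_set)
        if d > best_d            # strict '>' keeps the first maximum on ties
            best, best_d = i, d
        end
    end
    return best
end
```

This rule is the core step of Kennard-Stone selection: the next training sample is always the one least represented by the samples chosen so far.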
## DataSplits.find_most_distant_pair — Method

```julia
find_most_distant_pair(D::AbstractMatrix) → (i, j)
```

Finds the indices of the most distant pair in a precomputed distance matrix.
## DataSplits.get_sample — Method

```julia
get_sample(A::AbstractArray, idx)
```

Public API for getting samples that handles any valid index type. Dispatches to the internal `_get_sample` after index conversion.
## DataSplits.is_sample_array — Method

```julia
is_sample_array(::Type{T}) -> Bool
```

Trait method that tells whether type `T` behaves like a sample-indexed array that supports `sample_indices`, `_get_sample`, etc.

You may extend this for custom containers to ensure compatibility with OptiSim or other sampling methods.

Defaults to `false` unless specialized.
## DataSplits.kennardstone — Method

```julia
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Optimized in-memory Kennard-Stone algorithm using a precomputed distance matrix. Best for small-to-medium datasets where O(N²) memory is acceptable.
## DataSplits.lazy_kennard_stone — Method

```julia
_split(data, s::KennardStoneSplit; rng=Random.GLOBAL_RNG) → (train_idx, test_idx)
```

Kennard-Stone (CADEX) algorithm for optimal train/test splitting using a maximin strategy. Memory-optimized implementation with O(N) storage. Useful for large datasets where the N×N distance matrix does not fit in memory. For small datasets, use the traditional implementation.
## DataSplits.neyman_allocation — Method

```julia
neyman_allocation(cl_ids, idxs_by_cluster, n, data, rng)
```

Randomly select a Neyman quota from each cluster (proportional to cluster size and the mean standard deviation of the features). Returns a `Dict` mapping cluster id to selected indices.
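Neyman allocation gives each cluster a quota proportional to its size times its within-cluster spread. A sketch of the quota computation (the rounding and normalisation details here are an assumption, not necessarily what the package does):

```julia
# Sketch: Neyman quotas n_h = n * (N_h * s_h) / Σ_h (N_h * s_h), where
# N_h is the size of cluster h and s_h its standard deviation. Larger,
# more heterogeneous clusters receive proportionally more samples.
function neyman_quotas(sizes::Vector{Int}, stds::Vector{Float64}, n::Int)
    w = sizes .* stds
    return round.(Int, n .* w ./ sum(w))
end
```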
## DataSplits.proportional_allocation — Method

```julia
proportional_allocation(cl_ids, idxs_by_cluster, rng)
```

Use all samples in each cluster, shuffled. Returns a `Dict` mapping cluster id to selected indices.
## DataSplits.sample_indices — Method

```julia
sample_indices(A::AbstractArray) -> AbstractVector
```

Return a vector of valid sample indices for `A`, supporting all `AbstractArray` types.

This method defines the "sample axis" (typically axis 1) and determines how your splitting/sampling algorithms enumerate data points.

**Default**

For standard arrays, returns `axes(A, 1)`.

**Extension**

To support non-standard arrays (e.g., views, custom wrappers), you may extend this method to expose logical sample indices:

```julia
DataSplits.sample_indices(a::MyFancyArray) = 1:length(a.ids)
```
## DataSplits.sphere_exclusion — Method

```julia
sphere_exclusion(data; radius::Real, metric::Distances.SemiMetric=Euclidean()) -> SphereExclusionResult
```

Cluster samples in `data` by sphere exclusion:

1. Compute the full pairwise distance matrix and normalize values to [0, 1].
2. While unassigned samples remain:
   1. Pick the first unassigned sample `i`.
   2. All unassigned samples `j` with normalized distance `D[i, j] <= radius` form a cluster.
   3. Mark them assigned and increment the cluster ID.
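The loop above can be sketched as follows, assuming the distance matrix is already computed and normalised (plain Julia, not the package implementation):

```julia
# Sketch of sphere exclusion on a precomputed, [0, 1]-normalised distance
# matrix: repeatedly take the first unassigned point as a new cluster seed
# and absorb every unassigned point within `radius` of it.
function sphere_exclusion_assign(D::AbstractMatrix, radius::Real)
    n = size(D, 1)
    assignments = zeros(Int, n)   # 0 means "not yet assigned"
    cluster = 0
    while any(iszero, assignments)
        i = findfirst(iszero, assignments)
        cluster += 1
        for j in 1:n
            if assignments[j] == 0 && D[i, j] <= radius
                assignments[j] = cluster
            end
        end
    end
    return assignments
end
```

Note that cluster shapes depend on the order in which seeds are picked: an earlier cluster can absorb points that a later seed would otherwise claim.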
## DataSplits.split — Method

```julia
split(data, strategy; rng=Random.default_rng()) → (train, test)
```

Split `data` into train/test sets according to `strategy`.
## DataSplits.split_with_positions — Method

```julia
split_with_positions(data, s, core_algorithm; rng=Random.default_rng(), args...)
```

Generic wrapper for split strategies. Handles the mapping between user indices and 1:N positions.

**Arguments**

- `data`: the user's data array.
- `s`: the split strategy object.
- `core_algorithm`: function `(N, s, rng, data, args...) -> (train_pos, test_pos)`.

**Returns**

`(train_idx, test_idx)` as indices valid for `data`.
## DataSplits.targetpropertysplit — Method

```julia
targetpropertysplit(N, s, rng, data)
```

Sorts the property array and splits it into train/test according to `s.frac` and `s.order`.