Sphere Exclusion
sphere_exclusion clusters samples by picking a centre point and assigning to it all samples within a specified radius. The radius is applied to the normalised distance matrix (scaled to [0, 1]), so it is unitless and portable across datasets with different absolute scales.
This is a clustering utility, not a split strategy itself. Use the resulting cluster assignments as the groups= argument to any group-aware split or CV strategy.
How it works
- Normalise the pairwise distance matrix to [0, 1].
- Take the first unassigned sample as a new cluster centre.
- Assign to that cluster every unassigned sample within radius of the centre.
- Repeat until all samples are assigned.
The number of clusters is determined automatically by radius — smaller radii produce more, smaller clusters; larger radii produce fewer, larger ones.
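The steps above can be sketched in plain Julia. This is an illustration, not the DataSplits source: the name sketch_sphere_exclusion is hypothetical, normalisation is assumed to mean dividing by the largest distance, and "within radius" is assumed to include the boundary (<=).

```julia
# Illustrative sketch only; `sketch_sphere_exclusion` is a hypothetical name.
function sketch_sphere_exclusion(D::AbstractMatrix{<:Real}; radius::Real = 0.3)
    n = size(D, 1)
    # Step 1: scale distances to [0, 1] (assumption: divide by the maximum).
    dmax = maximum(D)
    Dn = dmax > 0 ? D ./ dmax : float.(D)
    assignments = zeros(Int, n)          # 0 marks "not yet assigned"
    cluster = 0
    for centre in 1:n
        # Step 2: the first unassigned sample becomes a new centre.
        assignments[centre] == 0 || continue
        cluster += 1
        for i in 1:n
            # Step 3: claim every unassigned sample within `radius` of the centre
            # (the centre claims itself, since its distance to itself is 0).
            if assignments[i] == 0 && Dn[i, centre] <= radius
                assignments[i] = cluster
            end
        end
    end                                  # Step 4: done once every sample has an ID
    return assignments
end

# Five points on a line; pairwise absolute differences as the distance matrix.
pts = [0.0, 0.1, 0.5, 0.55, 1.0]
D = [abs(a - b) for a in pts, b in pts]

small = sketch_sphere_exclusion(D; radius = 0.2)   # [1, 1, 2, 2, 3]
large = sketch_sphere_exclusion(D; radius = 0.6)   # [1, 1, 1, 1, 2]
```

Raising the radius from 0.2 to 0.6 merges the first four points into one cluster. The result also depends on sample order, because centres are taken in order of first appearance.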
Usage
using DataSplits
result = sphere_exclusion(X; radius = 0.3)
cluster_ids = result.assignments # Vector{Int}, one entry per sample
# Use assignments as groups for a group-aware split.
res = partition(X, GroupShuffleSplit();
                groups = cluster_ids, train = 0.8, test = 0.2)
# Use with GroupKFold for cluster-stratified cross-validation.
cvs = partition(X, GroupKFold(5); groups = cluster_ids)
# Custom metric.
using Distances
result = sphere_exclusion(X; radius = 0.2, metric = Cityblock())
Choosing the radius
The radius parameter is a value in [0, 1] that is compared against the normalised distances. It controls cluster granularity:
| Radius | Effect |
|---|---|
| Small (0.05–0.15) | Many small, tight clusters |
| Medium (0.2–0.4) | Moderate cluster sizes — good default |
| Large (0.5+) | Few large clusters; may produce very uneven groups |
There is no single best radius — it depends on the density of your data. A useful heuristic: choose a radius such that the number of clusters is roughly equal to your intended number of folds or groups.
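One way to apply this heuristic is to scan a few candidate radii and keep the one whose cluster count is closest to the target. The helper below is a hypothetical sketch, not part of DataSplits. It takes the counting step as a function so it runs on its own; the stand-in counts are made up for illustration.

```julia
# Hypothetical helper: return the candidate radius whose cluster count
# is closest to `target`. Ties go to the earliest candidate.
function pick_radius(count_clusters, radii, target::Integer)
    best_r, best_gap = first(radii), typemax(Int)
    for r in radii
        gap = abs(count_clusters(r) - target)
        if gap < best_gap
            best_r, best_gap = r, gap
        end
    end
    return best_r
end

# Stand-in counting function with made-up cluster counts.
fake_counts = r -> r == 0.1 ? 12 : r == 0.2 ? 6 : 2
chosen = pick_radius(fake_counts, [0.1, 0.2, 0.4], 5)   # 0.2

# With DataSplits loaded and a data matrix X, the counting function could be:
#   count_clusters = r -> length(unique(sphere_exclusion(X; radius = r).assignments))
#   chosen = pick_radius(count_clusters, 0.05:0.05:0.5, 5)
```

Each candidate triggers a full clustering run, so a coarse grid of radii is usually enough for a first pass.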
Return value
sphere_exclusion returns a SphereExclusionResult with:
- assignments::Vector{Int}: cluster ID per sample (1-based).
- radius::Float64: the radius that was used.
- metric::Distances.SemiMetric: the metric that was used.
The result implements the Clustering.ClusteringResult interface, so standard accessor functions (assignments, nclusters, counts) work.
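To check how evenly the clusters came out, you can compute the same summary from the assignment vector alone. The sketch below uses a made-up vector as a stand-in for result.assignments and assumes the cluster IDs run contiguously from 1 to k.

```julia
# Stand-in for `result.assignments`; the values are made up for illustration.
cluster_ids = [1, 1, 2, 3, 3, 3]

# Number of clusters, assuming IDs run 1..k without gaps.
k = maximum(cluster_ids)

# Samples per cluster, indexed by cluster ID.
sizes = [count(==(c), cluster_ids) for c in 1:k]   # [2, 1, 3]
```

A large spread in sizes suggests the radius is too large for a balanced group-aware split.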