bqlearn.corruptions.make_cluster_imbalance
bqlearn.corruptions.make_cluster_imbalance(X, y, *arrays, per_class_n_clusters=3, majority_ratio=1.0, imbalance_distribution='step', minority_class_fraction=0.5, random_state=None, n_jobs=None)
Create per-class cluster imbalance in a multi-class scenario according to [1].
Learns a sklearn.cluster.KMeans clustering once per class and creates class imbalance based on the cluster labels.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The samples.
- y : array-like of shape (n_samples,)
The targets.
- *arrays : sequence of indexables with length / shape[0] equal to n_samples
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
- per_class_n_clusters : dict or int, default=3
The number of clusters per class for the KMeans algorithm, given as a dict of the form {class_label: n_cluster}. If an int, the same number of clusters is used for all classes.
- majority_ratio : float, default=1.0
Ratio between number of samples in majority classes and number of samples in minority classes.
- imbalance_distribution : {'step', 'linear'}, default='step'
Imbalance distribution.
- minority_class_fraction : float, default=0.5
Fraction of classes considered as minority classes. Only used when imbalance_distribution='step'.
- random_state : int or RandomState, default=None
Controls the randomness of the KMeans clustering.
- n_jobs : int, default=None
The number of jobs to use for the computation. This parallelizes the training of sklearn.cluster.KMeans per class. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
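The interaction between majority_ratio, imbalance_distribution and minority_class_fraction can be pictured with a small pure-Python sketch. It is not bqlearn's code: `keep_fractions` is a hypothetical helper, and the formulas are an assumption based on the usual meaning of "step" and "linear" imbalance (the largest group is left intact and the smallest keeps 1 / majority_ratio of its samples). This page does not say whether the library applies the profile to classes or to the clusters within each class, so the sketch speaks of generic groups.

```python
def keep_fractions(n_groups, majority_ratio=1.0,
                   imbalance_distribution="step",
                   minority_class_fraction=0.5):
    """Hypothetical helper: fraction of samples kept in each group.

    Assumes majority_ratio > 0. Group 0 is kept whole (1.0) and the
    last group keeps 1 / majority_ratio of its samples.
    """
    smallest = 1.0 / majority_ratio
    if imbalance_distribution == "step":
        # Assumption: a fixed share of the groups is minority, and all
        # minority groups are reduced by the same factor.
        n_minority = int(n_groups * minority_class_fraction)
        n_majority = n_groups - n_minority
        return [1.0] * n_majority + [smallest] * n_minority
    if imbalance_distribution == "linear":
        # Assumption: sizes decrease evenly from the largest group to
        # the smallest; minority_class_fraction is ignored here.
        if n_groups == 1:
            return [1.0]
        step = (1.0 - smallest) / (n_groups - 1)
        return [1.0 - i * step for i in range(n_groups)]
    raise ValueError("imbalance_distribution must be 'step' or 'linear'")


step_profile = keep_fractions(4, majority_ratio=4.0,
                              imbalance_distribution="step")
linear_profile = keep_fractions(4, majority_ratio=4.0,
                                imbalance_distribution="linear")
print(step_profile)    # [1.0, 1.0, 0.25, 0.25]
print(linear_profile)  # [1.0, 0.75, 0.5, 0.25]
```

With the default majority_ratio=1.0 both profiles are all ones under this reading, meaning no samples would be dropped.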
- Returns:
- X_imbalanced : array-like of shape (n_samples_new, n_features)
The array containing the imbalanced data.
- y_imbalanced : array-like of shape (n_samples_new,)
The labels corresponding to X_imbalanced.
- *arrays_imbalanced : list, length=len(arrays)
The corresponding imbalanced arrays.
References
[1] Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, “Design of Algorithms Dealing with Closed-Set Distribution Shifts”, 2023.
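As a rough illustration of the mechanism described above (one KMeans clustering per class, then subsampling driven by the cluster labels), here is a minimal sketch built directly on scikit-learn. It is not bqlearn's implementation: `cluster_imbalance_sketch` is a hypothetical function, and the rule that cluster 0 of each class is kept whole while the others are reduced by majority_ratio is an assumption made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def cluster_imbalance_sketch(X, y, n_clusters=3, majority_ratio=4.0,
                             random_state=0):
    """Hypothetical illustration, not bqlearn's implementation."""
    rng = np.random.default_rng(random_state)
    keep = []
    for label in np.unique(y):
        class_idx = np.flatnonzero(y == label)
        # One KMeans model per class, fitted on that class's samples only.
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state)
        cluster_labels = km.fit_predict(X[class_idx])
        for c in range(n_clusters):
            members = class_idx[cluster_labels == c]
            if c == 0:
                # Assumption: the first cluster of each class stays intact.
                n_keep = len(members)
            else:
                # Assumption: the other clusters are subsampled by
                # majority_ratio, keeping at least one sample if possible.
                n_keep = min(len(members),
                             max(1, int(len(members) / majority_ratio)))
            keep.append(rng.choice(members, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# Toy data: 3 classes with 100 samples each.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_imb, y_imb = cluster_imbalance_sketch(X, y)
print(X.shape, X_imb.shape)
```

With bqlearn installed, the documented call takes the same inputs, for example `make_cluster_imbalance(X, y, per_class_n_clusters=3, majority_ratio=4.0, random_state=0)`, and according to the Returns section above it yields the imbalanced X and y followed by any extra arrays that were passed in.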