bqlearn.corruptions.make_cluster_imbalance

bqlearn.corruptions.make_cluster_imbalance(X, y, *arrays, per_class_n_clusters=3, majority_ratio=1.0, imbalance_distribution='step', minority_class_fraction=0.5, random_state=None, n_jobs=None)

Create per-class cluster imbalance in a multiclass scenario, following [1].

Learns a sklearn.cluster.KMeans clustering once per class and creates class imbalance based on the cluster labels.
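The idea described above can be illustrated with a minimal sketch. This is not the bqlearn implementation: the helper name `cluster_imbalance_sketch`, the `keep_fraction` argument, and the choice to keep cluster 0 whole are all assumptions made for illustration. It only shows the general pattern of fitting one KMeans model per class and subsampling by cluster label.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_imbalance_sketch(X, y, n_clusters=3, keep_fraction=0.2, random_state=0):
    """Hypothetical helper: cluster each class, then thin out all but one cluster."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(random_state)
    kept = []
    for label in np.unique(y):
        class_idx = np.flatnonzero(y == label)
        # One KMeans model is fitted per class, on that class's samples only.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        cluster_labels = km.fit_predict(X[class_idx])
        for c in range(n_clusters):
            members = class_idx[cluster_labels == c]
            if len(members) == 0:
                continue
            if c == 0:
                # Assumption: cluster 0 plays the "majority" role and is kept whole.
                kept.append(members)
            else:
                # The other clusters are subsampled without replacement.
                n_keep = max(1, int(round(keep_fraction * len(members))))
                kept.append(rng.choice(members, size=n_keep, replace=False))
    idx = np.sort(np.concatenate(kept))
    return X[idx], y[idx], idx


# Toy data: two classes, each made of three well-separated blobs of 20 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])
X_demo = np.vstack([blobs, blobs + 100.0])
y_demo = np.array([0] * 60 + [1] * 60)

X_imb, y_imb, idx = cluster_imbalance_sketch(X_demo, y_demo)
```

Both classes remain present in the result, but within each class some clusters are now much smaller than others, which is the kind of within-class imbalance the function is meant to produce.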

Parameters:

X : array-like of shape (n_samples, n_features)

The samples.

y : array-like of shape (n_samples,)

The targets.

*arrays : sequence of indexables with length / shape[0] equal to n_samples

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

per_class_n_clusters : dict or int, default=3

If a dict, the number of KMeans clusters to use for each class, in the form {class_label: n_clusters}.

If an int, then the same number of clusters is used for all classes.

majority_ratio : float, default=1.0

Ratio between number of samples in majority classes and number of samples in minority classes.

imbalance_distribution : {'step', 'linear'}, default='step'

Imbalance distribution.

minority_class_fraction : float, default=0.5

Fraction of classes considered as minority classes. Only used when imbalance_distribution='step'.

random_state : int or RandomState, default=None

Controls the randomness of the KMeans clustering.

n_jobs : int, default=None

The number of jobs to use for the computation. This parallelizes the training of sklearn.cluster.KMeans across classes.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
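To make the size-related parameters above more concrete, the following sketch shows one plausible reading of them. It is an assumption, not the library's code: the helper names `resolve_n_clusters` and `target_counts`, the rounding, and the ordering of groups are hypothetical. The "step" and "linear" shapes follow the definitions commonly used in the class-imbalance literature, where `majority_ratio` is the ratio between the largest and smallest group sizes.

```python
def resolve_n_clusters(per_class_n_clusters, classes):
    """Hypothetical helper: expand an int into a {class_label: n_clusters} dict."""
    if isinstance(per_class_n_clusters, int):
        return {label: per_class_n_clusters for label in classes}
    # Assumption: a dict must provide an entry for every class.
    missing = [label for label in classes if label not in per_class_n_clusters]
    if missing:
        raise ValueError(f"no cluster count given for classes {missing}")
    return dict(per_class_n_clusters)


def target_counts(n_groups, n_max, majority_ratio=1.0,
                  distribution="step", minority_fraction=0.5):
    """Hypothetical helper: target size of each group, largest first."""
    n_min = n_max / majority_ratio
    if distribution == "step":
        # A fixed fraction of the groups is shrunk to n_min; the rest keep n_max.
        n_minority = int(minority_fraction * n_groups)
        counts = [n_max] * (n_groups - n_minority) + [n_min] * n_minority
    elif distribution == "linear":
        # Sizes decrease in equal steps from n_max down to n_min.
        if n_groups == 1:
            counts = [n_max]
        else:
            step = (n_max - n_min) / (n_groups - 1)
            counts = [n_max - i * step for i in range(n_groups)]
    else:
        raise ValueError("distribution must be 'step' or 'linear'")
    return [int(round(c)) for c in counts]


clusters_from_int = resolve_n_clusters(3, classes=[0, 1])
clusters_from_dict = resolve_n_clusters({0: 2, 1: 4}, classes=[0, 1])
step_counts = target_counts(4, 100, majority_ratio=10, distribution="step")
linear_counts = target_counts(4, 100, majority_ratio=10, distribution="linear")
```

With the default `majority_ratio=1.0`, every group keeps the same target size under either distribution, so no imbalance is introduced in this sketch.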

Returns:

X_imbalanced : array-like of shape (n_samples_new, n_features)

The array containing the imbalanced data.

y_imbalanced : array-like of shape (n_samples_new,)

The corresponding labels of X_imbalanced.

*arrays_imbalanced : list, length=len(arrays)

The corresponding imbalanced arrays.
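The returned arrays stay row-aligned with each other. A minimal sketch of that alignment, assuming the selection can be expressed as a single array of kept row indices (the helper `subsample_aligned` is hypothetical and handles NumPy-convertible inputs only, not sparse matrices or dataframes):

```python
import numpy as np


def subsample_aligned(idx, X, y, *arrays):
    """Hypothetical helper: apply the same row selection to every input."""
    idx = np.asarray(idx)
    out = [np.asarray(X)[idx], np.asarray(y)[idx]]
    # Extra arrays (for example sample weights) are indexed with the same rows.
    out.extend(np.asarray(a)[idx] for a in arrays)
    return out


X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])
weights = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# Keep rows 0, 3 and 4 in all three arrays.
X_sub, y_sub, w_sub = subsample_aligned([0, 3, 4], X, y, weights)
```

Row i of each output then refers to the same original sample, which is what allows extra per-sample data to be passed through `*arrays`.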

References

[1]

Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, “Design of Algorithms Dealing with Closed-Set Distribution Shifts”, 2023.