Introduction

Biquality Learning is a machine learning framework to train classifiers on Biquality Data, where the dataset is split into a trusted and an untrusted part:

  • The trusted dataset contains trustworthy samples with clean labels and proper feature distribution.

  • The untrusted dataset contains samples potentially corrupted by label noise or covariate shift (distribution shift).

We designed the biquality-learn library following the general design principles of scikit-learn. It provides a consistent interface for training and using biquality learning algorithms, and its building blocks compose easily with blocks from other libraries that share these design principles. It includes various reweighting algorithms, plugin correctors, and functions for simulating label noise and generating sample data to benchmark biquality learning algorithms.

biquality-learn and its dependencies can be easily installed through pip:

pip install biquality-learn

Overall, the goal of biquality-learn is to make well-known and proven biquality learning algorithms accessible and easy to use for everyone and to enable researchers to experiment in a reproducible way on biquality data.

Design of the API

scikit-learn is a machine learning library for Python with a design philosophy emphasizing consistency, simplicity, and performance. The library provides a consistent interface for various algorithms, making it easy for users to switch between models. It also aims to make machine learning easy to get started with through a user-friendly API and precise documentation. Additionally, it is built on top of efficient numerical libraries (NumPy and SciPy) to ensure that models can be trained and used on large datasets in a reasonable amount of time.

In biquality-learn, we followed the same principles, implementing a similar API with fit(), transform(), and predict() methods. In addition to the input features \(X\) and the labels \(Y\) passed in scikit-learn, biquality-learn needs to know whether each sample comes from the trusted or the untrusted dataset. The additional sample_quality parameter specifies the dataset from which each sample originates: a value of 0 indicates an untrusted sample, and 1 a trusted one.
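To illustrate the calling convention, here is a minimal sketch of an estimator whose fit() accepts sample_quality alongside X and y. The TrustedOnlyClassifier class and the toy data are hypothetical and are not part of biquality-learn; the class simply discards untrusted samples, which is a naive baseline rather than one of the library's algorithms.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class TrustedOnlyClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: fit the wrapped estimator on trusted samples only."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, sample_quality=None):
        X, y = np.asarray(X), np.asarray(y)
        if sample_quality is None:
            # Without quality information, treat every sample as trusted.
            trusted = np.ones(len(y), dtype=bool)
        else:
            # 1 marks a trusted sample, 0 an untrusted one.
            trusted = np.asarray(sample_quality) == 1
        self.estimator_ = clone(self.estimator).fit(X[trusted], y[trusted])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


# Toy data (assumption): one feature, label is the sign of the feature.
X = np.linspace(-1, 1, 40).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)

# Mark every fourth sample as trusted, the rest as untrusted.
sample_quality = np.zeros(len(y), dtype=int)
sample_quality[::4] = 1

clf = TrustedOnlyClassifier(LogisticRegression())
clf.fit(X, y, sample_quality=sample_quality)
predictions = clf.predict(X)
```

The only difference from a plain scikit-learn classifier is the extra keyword argument to fit(); prediction is unchanged.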

Which algorithms are implemented in biquality-learn?

In biquality-learn, we purposely implemented only a specific class of algorithms centered on approaches for tabular data and classifiers, restricting ourselves to approaches that are genuinely classifier-agnostic or implementable within scikit-learn’s API. We did so in order not to break the design principles shared with scikit-learn and not to impose a particular deep learning library, such as PyTorch or TensorFlow, on the user.

The following table summarizes all implemented algorithms and the kinds of corruption they can handle.

Algorithms implemented in biquality-learn

| Algorithms                         | Dataset Shifts | Weaknesses of Supervision |
|------------------------------------|----------------|---------------------------|
| EasyAdapt [H2007]                  | \(\checkmark\) | \(\times\)                |
| TrAdaBoost [DYXY2008]              | \(\checkmark\) | \(\times\)                |
| Unhinged [RMW2015] (Linear/Kernel) | \(\times\)     | \(\checkmark\)            |
| Backward [NDRT2013]                | \(\times\)     | \(\checkmark\)            |
| IRLNL [LT2015]                     | \(\times\)     | \(\checkmark\)            |
| Plugin [ZLA2021]                   | \(\times\)     | \(\checkmark\)            |
| KKMM [FNS2020]                     | \(\checkmark\) | \(\checkmark\)            |
| IKMM [FNS2020]                     | \(\checkmark\) | \(\checkmark\)            |
| IRBL [NLBC2021]                    | \(\times\)     | \(\checkmark\)            |
| KPDR                               | \(\checkmark\) | \(\checkmark\)            |
| IPDR [L2018]                       | \(\checkmark\) | \(\checkmark\)            |

Refer to the documentation and examples provided in the library for more information: API Reference.

Training Biquality Learning Classifiers

Training a biquality learning algorithm using biquality-learn is the same procedure as training a supervised algorithm with scikit-learn thanks to the library’s design. The features \(X\) and the targets \(Y\) of samples belonging to the trusted dataset \(D_T\) and untrusted dataset \(D_U\) must be provided as one global dataset \(D\). Additionally, the indicator representing if a sample is trusted or not has to be provided: \(\textit{sample_quality}=\mathbb{1}_{X\in D_T}\).
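As a minimal sketch of this data layout, the snippet below stacks a trusted and an untrusted part into one global dataset and builds the matching indicator with NumPy. The array names and the toy values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 trusted and 5 untrusted samples with 2 features each.
X_trusted = np.zeros((3, 2))
y_trusted = np.array([0, 1, 0])
X_untrusted = np.ones((5, 2))
y_untrusted = np.array([1, 1, 0, 0, 1])

# Stack both parts into one global dataset D.
X = np.concatenate([X_trusted, X_untrusted])
y = np.concatenate([y_trusted, y_untrusted])

# sample_quality is 1 for samples from D_T and 0 for samples from D_U,
# in the same row order as X and y.
sample_quality = np.concatenate([
    np.ones(len(X_trusted), dtype=int),
    np.zeros(len(X_untrusted), dtype=int),
])
```

The three arrays share the same first dimension, so row i of X, y[i], and sample_quality[i] always describe the same sample.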

Here is an example of how to train a biquality classifier using the bqlearn.density_ratio.KKMM (K-Kernel Mean Matching) algorithm from biquality-learn:

from sklearn.linear_model import LogisticRegression
from bqlearn.density_ratio import KKMM

kkmm = KKMM(LogisticRegression(), kernel="rbf")
kkmm.fit(X, y, sample_quality=sample_quality)
kkmm.predict(X_new)

scikit-learn’s metadata routing

scikit-learn’s metadata routing is a scikit-learn Enhancement Proposal (SLEP006) describing a system that can be used to seamlessly incorporate various metadata, in addition to the required features and targets, in estimators, scorers, and transformers. biquality-learn uses this design to integrate the sample_quality property into the training and prediction process of biquality learning algorithms. It allows one to use biquality-learn’s algorithms in a similar way to scikit-learn’s algorithms by passing the sample_quality property as an additional argument to the fit(), predict(), and other methods.

Currently, the main components provided by scikit-learn support this design, and it is already usable for cross-validators. It will be extended to all components in the future, and biquality-learn will then benefit from many “free” features. Once https://github.com/scikit-learn/scikit-learn/pull/24250 is merged, it will be possible to build a bagging ensemble of biquality classifiers with sklearn.ensemble.BaggingClassifier without overriding its behavior on biquality data.

from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(kkmm).fit(X, y, sample_quality=sample_quality)

Cross-Validating Biquality Classifiers

Any cross-validator that works for standard supervised learning also works for biquality learning. However, when splitting the data into a train and a test set, untrusted samples need to be removed from the test set to avoid computing supervised metrics on corrupted labels. That is why biquality-learn provides bqlearn.model_selection.make_biquality_cv() to post-process any scikit-learn compatible cross-validator.
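The idea of removing untrusted samples from the test folds can be sketched as follows. The trusted_test_splits helper is hypothetical and is not the implementation of make_biquality_cv(); it only wraps scikit-learn's KFold and filters each test fold with the sample_quality indicator.

```python
import numpy as np
from sklearn.model_selection import KFold


def trusted_test_splits(X, sample_quality, n_splits=3):
    """Hypothetical helper: yield (train, test) indices with untrusted
    samples removed from every test fold."""
    sample_quality = np.asarray(sample_quality)
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        # Train folds keep all samples; test folds keep trusted ones only.
        yield train_idx, test_idx[sample_quality[test_idx] == 1]


# Toy data (assumption): 12 samples, 5 of which are trusted.
X = np.arange(12).reshape(-1, 1)
sample_quality = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0])

splits = list(trusted_test_splits(X, sample_quality))
for train_idx, test_idx in splits:
    print(len(train_idx), test_idx.tolist())
```

Untrusted samples stay in the training folds, where a biquality algorithm can still exploit them, while metrics are computed on trusted labels only.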

Here is an example of how to use scikit-learn’s sklearn.model_selection.RandomizedSearchCV class to perform a hyperparameter search for a biquality learning algorithm in biquality-learn:

from sklearn.model_selection import RandomizedSearchCV
# loguniform lives in SciPy; the sklearn.utils.fixes alias is gone in recent scikit-learn releases
from scipy.stats import loguniform
from bqlearn.model_selection import make_biquality_cv

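# Search the regularization strength of the final estimator on a log scale
param_dist = {"final_estimator__C": loguniform(1e3, 1e5)}
n_iter = 20

random_search = RandomizedSearchCV(
    kkmm,
    param_distributions=param_dist,
    n_iter=n_iter,
    # Keep only trusted samples in the test folds
    cv=make_biquality_cv(X, sample_quality, cv=3),
)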
random_search.fit(X, y, sample_quality=sample_quality)