Introduction
Biquality Learning is a machine learning framework to train classifiers on Biquality Data, where the dataset is split into a trusted and an untrusted part:
- The trusted dataset contains trustworthy samples with clean labels and proper feature distribution.
- The untrusted dataset contains potentially corrupted samples from label noise or covariate shift (distribution shift).
We designed the biquality-learn library following the general design principles of scikit-learn. It provides a consistent interface for training and using biquality learning algorithms, and its building blocks compose easily with blocks from other libraries that share these design principles. It includes various reweighting algorithms, plugin correctors, and functions for simulating label noise and generating sample data to benchmark biquality learning algorithms.
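To make the idea of simulated label noise concrete, here is a minimal NumPy sketch of uniform label flipping. It does not use the library's own functions: `flip_labels_uniformly` and its `noise_ratio` parameter are hypothetical names introduced only for this illustration.

```python
import numpy as np


def flip_labels_uniformly(y, noise_ratio, random_state=None):
    """Return a copy of y where a fraction of the labels is replaced at random.

    Hypothetical helper for illustration; not part of biquality-learn.
    """
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    y_noisy = y.copy()
    # Choose which samples get corrupted, without replacement.
    n_flip = int(round(noise_ratio * len(y)))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    # Replace each chosen label with a different class drawn uniformly.
    for i in flip_idx:
        candidates = classes[classes != y[i]]
        y_noisy[i] = rng.choice(candidates)
    return y_noisy


y_clean = np.array([0, 1, 2] * 100)
y_noisy = flip_labels_uniformly(y_clean, noise_ratio=0.3, random_state=0)
print((y_noisy != y_clean).mean())  # fraction of corrupted labels
```

Labels corrupted this way could play the role of the untrusted part of a benchmark, while a small clean subset plays the role of the trusted part.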
biquality-learn and its dependencies can be easily installed through pip:
pip install biquality-learn
Overall, the goal of biquality-learn is to make well-known and proven biquality learning algorithms accessible and easy to use for everyone and to enable researchers to experiment in a reproducible way on biquality data.
Source Code: https://github.com/biquality-learn/biquality-learn
Documentation: https://biquality-learn.readthedocs.io/
License: BSD 3-Clause
Design of the API
scikit-learn is a machine learning library for Python with a design philosophy emphasizing consistency, simplicity, and performance. The library provides a consistent interface for various algorithms, making it easy for users to switch between models. It also aims to make machine learning easy to get started with through a user-friendly API and precise documentation. Additionally, it is built on top of efficient numerical libraries (NumPy and SciPy) so that models can be trained and used on large datasets in a reasonable amount of time.
In biquality-learn, we followed the same principles and implemented a
similar API with fit(), transform(), and predict() methods.
In addition to the input features \(X\) and the labels \(Y\) passed as in
scikit-learn, biquality-learn needs to know whether each sample comes from
the trusted or the untrusted dataset. The additional sample_quality
parameter specifies this origin: a value of 0 indicates an untrusted
sample, and a value of 1 a trusted one.
Which algorithms are implemented in biquality-learn?
In biquality-learn, we purposely implemented only a specific class of algorithms centered on tabular data and classifiers, restricting ourselves to approaches that are genuinely classifier-agnostic or implementable within scikit-learn’s API. We did so in order not to break the design principles shared with scikit-learn and not to impose a particular deep learning library, such as PyTorch or TensorFlow, on the user.
The following table summarizes all implemented algorithms and the kinds of corruption they can handle.
| Algorithms | Dataset Shifts | Weaknesses of Supervision |
|---|---|---|
| EasyAdapt [H2007] | \(\checkmark\) | \(\times\) |
| TrAdaBoost [DYXY2008] | \(\checkmark\) | \(\times\) |
| Unhinged [RMW2015] (Linear/Kernel) | \(\times\) | \(\checkmark\) |
| Backward [NDRT2013] | \(\times\) | \(\checkmark\) |
| IRLNL [LT2015] | \(\times\) | \(\checkmark\) |
| Plugin [ZLA2021] | \(\times\) | \(\checkmark\) |
| KKMM [FNS2020] | \(\checkmark\) | \(\checkmark\) |
| IKMM [FNS2020] | \(\checkmark\) | \(\checkmark\) |
| IRBL [NLBC2021] | \(\times\) | \(\checkmark\) |
| KPDR | \(\checkmark\) | \(\checkmark\) |
| IPDR [L2018] | \(\checkmark\) | \(\checkmark\) |
Refer to the documentation and examples provided in the library for more information: API Reference.
Training Biquality Learning Classifiers
Thanks to the library’s design, training a biquality learning algorithm with biquality-learn follows the same procedure as training a supervised algorithm with scikit-learn. The features \(X\) and the targets \(Y\) of the samples belonging to the trusted dataset \(D_T\) and the untrusted dataset \(D_U\) must be provided as one global dataset \(D\). Additionally, an indicator of whether each sample is trusted has to be provided: \(\textit{sample\_quality}=\mathbb{1}_{X\in D_T}\).
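As a minimal sketch of how the global dataset and its indicator might be assembled, the snippet below stacks a trusted and an untrusted part with NumPy. The toy arrays and their sizes are assumptions made for illustration; only the 0/1 convention for sample_quality comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 trusted and 200 untrusted samples with 5 features.
X_trusted = rng.normal(size=(20, 5))
y_trusted = rng.integers(0, 2, size=20)
X_untrusted = rng.normal(size=(200, 5))
y_untrusted = rng.integers(0, 2, size=200)

# Stack both parts into the single global dataset D.
X = np.vstack([X_trusted, X_untrusted])
y = np.concatenate([y_trusted, y_untrusted])

# sample_quality is 1 for trusted samples and 0 for untrusted ones.
sample_quality = np.concatenate([
    np.ones(len(X_trusted), dtype=int),
    np.zeros(len(X_untrusted), dtype=int),
])

# The indicator doubles as a mask to recover either part of the data.
X_trusted_again = X[sample_quality == 1]
print(X.shape, y.shape, sample_quality.sum())  # (220, 5) (220,) 20
```

The order of the rows does not matter as long as X, y, and sample_quality stay aligned.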
Here is an example of how to train a biquality classifier using the
bqlearn.density_ratio.KKMM (K-Kernel Mean Matching) algorithm from biquality-learn:
from sklearn.linear_model import LogisticRegression
from bqlearn.density_ratio import KKMM
kkmm = KKMM(LogisticRegression(), kernel="rbf")
kkmm.fit(X, y, sample_quality=sample_quality)
kkmm.predict(X_new)
scikit-learn’s metadata routing
scikit-learn’s metadata routing is described in a scikit-learn Enhancement
Proposal (SLEP006). It is a system that can be used to seamlessly
incorporate various metadata in addition to the required features and
targets in estimators, scorers and transformers.
biquality-learn uses this design to integrate the sample_quality
property into the training and prediction process of biquality learning
algorithms. It allows one to use biquality-learn’s algorithms in a
similar way to scikit-learn’s algorithms by passing the
sample_quality property as an additional argument to the fit(),
predict(), and other methods.
Currently, the main components provided by scikit-learn support this
design, and it is already usable with cross-validators. It is expected to
be extended to all components in the future, and biquality-learn will
then benefit from many “free” features. Once
https://github.com/scikit-learn/scikit-learn/pull/24250 is merged,
it will be possible to make a bagging ensemble of biquality classifiers
with sklearn.ensemble.BaggingClassifier without
overriding its behavior on biquality data.
from sklearn.ensemble import BaggingClassifier
bag = BaggingClassifier(kkmm).fit(X, y, sample_quality=sample_quality)
Cross-Validating Biquality Classifiers
Any cross-validator that works for standard supervised learning also
works for biquality learning. However, when splitting the data into a
train and a test set, untrusted samples need to be removed from the test
set to avoid computing supervised metrics on corrupted labels. That is
why biquality-learn provides bqlearn.model_selection.make_biquality_cv()
to post-process any scikit-learn compatible cross-validator.
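The post-processing idea can be sketched with plain scikit-learn. The generator below is a hypothetical illustration of dropping untrusted samples from each test fold; it is not the implementation of make_biquality_cv, and the name `trusted_test_splits` is invented for this example.

```python
import numpy as np
from sklearn.model_selection import KFold


def trusted_test_splits(X, sample_quality, n_splits=3):
    """Yield (train, test) index pairs whose test part holds only trusted samples.

    Hypothetical illustration; not the implementation of make_biquality_cv.
    """
    sample_quality = np.asarray(sample_quality)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X):
        # Keep every training sample, but drop untrusted ones from the test fold.
        yield train_idx, test_idx[sample_quality[test_idx] == 1]


X = np.zeros((30, 2))
sample_quality = np.array([1] * 10 + [0] * 20)
splits = list(trusted_test_splits(X, sample_quality))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

scikit-learn accepts an iterable of (train, test) index arrays as a `cv` argument, so a list of splits built this way could be passed to a search or cross-validation utility.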
Here is an example of how to use scikit-learn’s
sklearn.model_selection.RandomizedSearchCV class
to perform a randomized hyperparameter search for a
biquality learning algorithm in biquality-learn:
from scipy.stats import loguniform  # sklearn.utils.fixes.loguniform was removed from recent scikit-learn releases
from sklearn.model_selection import RandomizedSearchCV
from bqlearn.model_selection import make_biquality_cv

# Sample the regularization strength C of the final estimator on a log scale.
param_dist = {"final_estimator__C": loguniform(1e3, 1e5)}
n_iter = 20

random_search = RandomizedSearchCV(
    kkmm,
    param_distributions=param_dist,
    n_iter=n_iter,
    # Test folds contain trusted samples only.
    cv=make_biquality_cv(X, sample_quality, cv=3),
)
random_search.fit(X, y, sample_quality=sample_quality)