bqlearn.density_ratio.IKMM

class bqlearn.density_ratio.IKMM(estimator, *, n_estimators=10, exploit_iterative_learning=True, window=1, kernel='rbf', kernel_params={}, B=1000, epsilon=None, max_iter=1000, tol=1e-06, batch_size=None, n_jobs=None)

An Iterative KMM Density Ratio Biquality Classifier.

An iterative density ratio (DR) biquality classifier that uses Ensemble [1] Kernel Mean Matching (KMM) [2] to reweight untrusted examples [3].

Parameters:
estimator: object

The estimator from which the IDR classifier is built. Support for sample weighting and probability prediction is required.

n_estimators: int, default=10

Number of estimators trained on reweighted samples.

exploit_iterative_learning: bool, default=True

If the estimator supports iterative learning with warm_start, exploit it by computing new weights at every epoch when fitting the estimator.

window: int, default=1

Number of previous losses used to compute sample weights.

kernel: str or callable, default="rbf"

Kernel mapping used internally. This parameter is passed directly to pairwise_kernels. If kernel is a string, it must be one of the metrics in pairwise.PAIRWISE_KERNEL_FUNCTIONS. If kernel is "precomputed", X is assumed to be a kernel matrix. Alternatively, if kernel is a callable, it is called on each pair of instances (rows) and the resulting value is recorded. The callable should take two rows from X as input and return the corresponding kernel value as a single number. This means that callables from sklearn.metrics.pairwise are not allowed, as they operate on matrices, not single samples. Use the string identifying the kernel instead.

kernel_params: dict, optional (default={})

Additional parameters passed to the kernel.

B: float, optional (default=1000)

Bounding weights parameter.

epsilon: float, optional (default=None)

Constraint parameter. If None, epsilon is set to (np.sqrt(n_samples_untrusted) - 1) / np.sqrt(n_samples_untrusted).

max_iter: int, default=1000

Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations.

tol: float, default=1e-06

Termination criteria dictating the absolute and relative error on the primal residual, dual residual and duality gap.

batch_size: int or float, default=None

Size of the minibatches for batched Kernel Mean Matching. An int value represents the absolute number of untrusted samples used per batch. A float value represents the fraction of untrusted samples used per batch. When set to None, all untrusted samples are used in one batch.

n_jobs: int, default=None

The number of jobs to use for the computation. This parallelizes the density ratio estimation procedures over all samples.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
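
How the None defaults of epsilon and batch_size resolve can be sketched as follows. This is an illustration only: default_epsilon and resolve_batch_size are hypothetical helpers, not bqlearn functions. The epsilon default is read here as (sqrt(n) - 1) / sqrt(n), the value suggested in [2], and the rounding applied to a fractional batch_size is an assumption.

```python
import math


def default_epsilon(n_samples_untrusted):
    """Value assumed for epsilon when it is left as None."""
    root = math.sqrt(n_samples_untrusted)
    return (root - 1) / root


def resolve_batch_size(batch_size, n_samples_untrusted):
    """Hypothetical helper: turn batch_size into an absolute sample count."""
    if batch_size is None:
        # None: all untrusted samples go into a single batch.
        return n_samples_untrusted
    if isinstance(batch_size, float):
        # float: fraction of the untrusted samples (rounding up is an assumption).
        return max(1, math.ceil(batch_size * n_samples_untrusted))
    # int: absolute number of untrusted samples per batch.
    return min(batch_size, n_samples_untrusted)


n_untrusted = 400
print(default_epsilon(n_untrusted))           # (20 - 1) / 20 = 0.95
print(resolve_batch_size(None, n_untrusted))  # 400
print(resolve_batch_size(0.25, n_untrusted))  # 100
print(resolve_batch_size(64, n_untrusted))    # 64
```

With this reading, epsilon approaches 1 as the number of untrusted samples grows, so the constraint on the mean weight loosens for larger untrusted sets.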

Attributes:
estimator_: classifier

The fitted estimator.

classes_: ndarray of shape (n_classes,)

The class labels.

n_classes_: int

The number of classes.
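
For orientation, the roles of B and epsilon can be read off the kernel mean matching problem as formulated in [2]. The notation below is taken from that paper and is an assumption about how the problem is posed, not a description of this class's internals: n is the number of untrusted samples x_i, m is the number of trusted samples x'_j, k is the kernel, and beta is the vector of weights assigned to the untrusted samples.

```latex
\begin{aligned}
\min_{\beta}\quad & \tfrac{1}{2}\,\beta^{\top} K \beta - \kappa^{\top} \beta \\
\text{s.t.}\quad & 0 \le \beta_i \le B, \qquad i = 1, \dots, n, \\
& \Bigl|\, \sum_{i=1}^{n} \beta_i - n \,\Bigr| \le n\,\epsilon,
\end{aligned}
\qquad
K_{ij} = k(x_i, x_j), \quad
\kappa_i = \frac{n}{m} \sum_{j=1}^{m} k(x_i, x'_j).
```

In this form, B caps each individual weight, while epsilon controls how far the average weight may deviate from 1.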

References

[1]

Miao Y., Farahat A. and Kamel M. “Ensemble Kernel Mean Matching”, 2015

[2]

Huang J., Smola A., Gretton A., Borgwardt K. M. and Schölkopf B. “Correcting Sample Selection Bias by Unlabeled Data”, 2006

[3]

Fang T., Lu N., Niu G. and Sugiyama M. “Rethinking importance weighting for deep learning under distribution shift”, NeurIPS 2020
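
The contract for a callable kernel described under the kernel parameter (two rows in, one number out) can be sketched as below. The names rbf_pair and gram_matrix are hypothetical and introduced only for illustration; the loop mimics the "called on each pair of rows" behaviour and is not the library's implementation.

```python
import math


def rbf_pair(x, y, gamma=0.5):
    """Kernel callable: takes two rows and returns a single number."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(X, Y, kernel):
    """Call `kernel` on every pair of rows (one row from X, one from Y)."""
    return [[kernel(x, y) for y in Y] for x in X]


X_untrusted = [[0.0, 0.0], [1.0, 0.0]]
X_trusted = [[0.0, 0.0]]

K = gram_matrix(X_untrusted, X_trusted, rbf_pair)
print(K)  # 2 x 1 matrix: [[1.0], [exp(-0.5)]]
```

Under these assumptions, such a callable would be passed as kernel=rbf_pair, with extra keyword arguments such as gamma presumably supplied through kernel_params.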

Methods

decision_function(X)

Call decision function of the final_estimator.

fit(X, y[, sample_quality])

Fit the reweighted model.

get_params([deep])

Get parameters for this estimator.

predict(X)

Predict the classes of X.

predict_log_proba(X)

Predict log probability for each possible outcome.

predict_proba(X)

Predict probability for each possible outcome.

score(X, y[, sample_weight])

Return the mean accuracy on the given test data and labels.

set_params(**params)

Set the parameters of this estimator.

decision_function(X)

Call decision function of the final_estimator.

Parameters:
X: array-like, shape (n_samples, n_features)

The input samples.

Returns:
y: ndarray, shape (n_samples,)

The decision function values of the input samples.

fit(X, y, sample_quality=None)

Fit the reweighted model.

Parameters:
X: {array-like, sparse matrix} of shape (n_samples, n_features)

The training input samples. Sparse matrices can be CSC, CSR, COO, DOK, or LIL; COO, DOK, and LIL are converted to CSR.

y: array-like of shape (n_samples,)

The target labels.

sample_quality: array-like of shape (n_samples,), default=None

Sample qualities.

Returns:
self: object

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

predict(X)

Predict the classes of X.

Parameters:
X: array-like, shape (n_samples, n_features)

The input samples.

Returns:
y: ndarray, shape (n_samples,)

The predicted classes.

predict_log_proba(X)

Predict log probability for each possible outcome.

Parameters:
X: array-like, shape (n_samples, n_features)

The input samples.

Returns:
log_p: array, shape (n_samples, n_classes)

Array with log prediction probabilities.

predict_proba(X)

Predict probability for each possible outcome.

Parameters:
X: array-like, shape (n_samples, n_features)

The input samples.

Returns:
p: array, shape (n_samples, n_classes)

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
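
The column ordering stated above means that column j of the returned array holds the probability of classes_[j]. A minimal sketch of recovering labels from such an array follows; labels_from_proba is a hypothetical helper, and the classes and proba values are made-up stand-ins for clf.classes_ and clf.predict_proba(X).

```python
def labels_from_proba(proba, classes):
    """For each row, pick the class whose column holds the largest probability."""
    labels = []
    for row in proba:
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_column])
    return labels


classes = ["cat", "dog", "fox"]               # stands in for clf.classes_
proba = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]    # stands in for clf.predict_proba(X)

print(labels_from_proba(proba, classes))  # ['dog', 'cat']
```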

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
X: array-like of shape (n_samples, n_features)

Test samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight: array-like of shape (n_samples,), default=None

Sample weights.

Returns:
score: float

Mean accuracy of self.predict(X) w.r.t. y.
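
The quantity returned by score can be sketched as a weighted mean of per-sample correctness. The helper weighted_accuracy below is hypothetical and written in plain Python for illustration; it assumes that the sample weights enter as a weighted average, in line with scikit-learn's accuracy metric.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Weighted fraction of samples whose prediction equals the true label."""
    if sample_weight is None:
        # Without weights every sample counts equally.
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(weighted_accuracy(y_true, y_pred))                # 3 / 4 = 0.75
print(weighted_accuracy(y_true, y_pred, [1, 1, 2, 1]))  # 3 / 5 = 0.6
```

If each entry of y_true and y_pred is itself a list of labels, the equality test only succeeds when the whole label set matches, which is the subset accuracy mentioned above.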

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.
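
The <component>__<parameter> naming can be illustrated with a small sketch of how such a key splits into its two parts. The helper split_param_key is hypothetical and only demonstrates the naming convention; it is not the code set_params runs.

```python
def split_param_key(key):
    """Split a '<component>__<parameter>' key at the first double underscore."""
    component, delimiter, parameter = key.partition("__")
    if not delimiter:
        # No double underscore: the key names a parameter of the object itself.
        return None, key
    return component, parameter


print(split_param_key("n_estimators"))  # (None, 'n_estimators')
print(split_param_key("estimator__C"))  # ('estimator', 'C')
```

Under this convention, a call such as set_params(estimator__C=0.5) would address parameter C of the wrapped estimator, assuming that estimator exposes a parameter with that name.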