Classification¶
In this module, the outcome model on each source domain can be any machine learning model. Using a cross-entropy loss in the CGDRO learner, we aggregate the sources into the target domain. For more details on the method, please refer to CGDRO-Classification.
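As a reminder of the cross-entropy loss itself (an illustrative sketch in plain Python, not the library's internal code): for one sample it is the negative log of the probability assigned to the true label.

```python
import math

def cross_entropy(probs, label):
    """Multi-class cross-entropy loss for one sample:
    the negative log-probability assigned to the true label.
    Illustrative sketch only, not the CGDRO implementation."""
    return -math.log(probs[label])

# A confident, correct prediction incurs low loss ...
low = cross_entropy([0.8, 0.1, 0.1], 0)
# ... while a confident, wrong prediction incurs high loss.
high = cross_entropy([0.05, 0.05, 0.9], 0)
print(low < high)  # True
```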
We can import Classification module by the code below:
from cgdro import Classification
Module Arguments & Outputs¶
Classification¶
- `f_learner` (str, optional): method used to fit outcome models on each source. Defaults to `'linear'`.
- `w_learner` (str, optional): method used to fit density models on each source. Defaults to `'logistic'`.
- `split` (bool, optional): whether to split the source data into two halves for fitting outcome and density models. Defaults to `True`.
- `seed` (int, optional): random seed for data splitting. Defaults to `123`.
Built-in functions in Classification:
| Built-in Functions | Description |
|---|---|
| `fit()` | Fit the robust classification model with cross-entropy loss in the target domain. |
| `predict_proba()` | Make robust classification probability predictions in the target domain. |
| `predict()` | Make robust label predictions in the target domain. |
| `infer()` | Build debiased confidence intervals of the target domain coefficients. |
| `summary()` | Summarize the results. |
fit()¶
Arguments:
- `X_list` (list): list of feature matrices on each source domain.
- `y_list` (list): list of label arrays on each source domain.
- `X0` (array, optional): feature matrix on the target domain. If `None`, the pooled source data are used as the target data. Defaults to `None`.
- `max_iter` (int, optional): maximum number of iterations. Defaults to `1000`.
- `tol` (float, optional): tolerance for convergence. Defaults to `1e-6`.
- `check_dual` (bool, optional): whether to check the duality gap. Defaults to `False`.
- `verbose` (bool, optional): whether to print fitting information. Defaults to `False`.
Outputs: enables the following attributes:
- `.parameters`:
  - `"coef_"`: CGDRO aggregated debiased coefficient estimators in the target domain;
  - `"weight_"`: CGDRO aggregated weights of the source domains.
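To illustrate the relationship between `weight_` and `coef_`: conceptually, CGDRO returns convex weights over the source groups and an aggregated coefficient vector. A toy sketch of such a convex aggregation, with made-up numbers (this is not the library's algorithm, just the weighting idea):

```python
# Toy illustration: aggregate per-source coefficient vectors with convex weights.
# All numbers below are invented for illustration only.
coef_source_1 = [0.1, -0.2, 0.3]
coef_source_2 = [0.3,  0.0, 0.1]
weights = [0.8, 0.2]  # nonnegative and summing to 1

aggregated = [
    weights[0] * c1 + weights[1] * c2
    for c1, c2 in zip(coef_source_1, coef_source_2)
]
print(aggregated)  # ≈ [0.14, -0.16, 0.26]
```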
predict_proba()¶
Arguments:
- `X` (array, optional): input features for prediction. If `None`, uses the training data. Defaults to `None`.
Outputs:
- `proba`: classification probability predictions in the target domain.
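`predict_proba()` returns one probability per class for each sample. For intuition, class probabilities of this form can be produced by a softmax over per-class scores (an illustrative sketch, not the library's code):

```python
import math

def softmax(scores):
    """Map real-valued class scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

proba = softmax([2.0, 1.0, 0.5])
print(proba)  # three probabilities summing to 1 (up to rounding)
```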
predict()¶
Arguments:
- `X` (array, optional): input features for prediction. If `None`, uses the training data. Defaults to `None`.
Outputs:
- `pred`: label predictions in the target domain.
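The hard labels correspond to taking, for each sample, the class with the highest predicted probability. A minimal sketch (assuming 0-based labels, as in the example output later on this page):

```python
def predict_labels(proba_rows):
    """Pick the index of the largest probability in each row (0-based labels)."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba_rows]

rows = [[0.46, 0.34, 0.20],
        [0.24, 0.19, 0.57]]
print(predict_labels(rows))  # [0, 2]
```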
infer()¶
Arguments:
- `M` (int, optional): number of resampling iterations. Defaults to `200`.
- `alpha` (float, optional): significance level for confidence intervals. Defaults to `0.05`.
- `diag` (bool, optional): whether to use a diagonal approximation for covariance matrices. Defaults to `True`.
- `parallel` (bool, optional): whether to use parallel computing. Defaults to `False`.
- `n_workers` (int, optional): number of workers for parallel computing. Defaults to `4`.
Outputs: enables the following attributes:
- `.CI`: CGDRO debiased aggregated confidence intervals of the target domain coefficients.
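The idea of turning `M` resampled estimates into a level-`alpha` interval can be sketched as an empirical percentile interval (a simplified illustration with synthetic draws; the actual CGDRO debiasing procedure is described in CGDRO-Classification):

```python
import random

def percentile_ci(estimates, alpha=0.05):
    """Take the empirical alpha/2 and 1 - alpha/2 quantiles of resampled estimates."""
    s = sorted(estimates)
    lo = s[int(len(s) * alpha / 2)]
    hi = s[int(len(s) * (1 - alpha / 2)) - 1]
    return lo, hi

random.seed(123)
# M = 200 resampled coefficient estimates (synthetic, centered at 0.5)
draws = [random.gauss(0.5, 0.1) for _ in range(200)]
lo, hi = percentile_ci(draws, alpha=0.05)
print(lo < 0.5 < hi)  # True: the interval covers the center of the draws
```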
summary()¶
Arguments:
- `index` (array-like or None): 1-based indices of dimensions to print (subset of 1..d). Defaults to all dimensions.
- `class_index` (array-like or None): class labels to print (subset of 1..self.num_class-1). Defaults to all (1..self.num_class-1).

Outputs:
- Summary of CGDRO aggregated weights, estimators, and confidence intervals.
Example¶
Data Generating Process¶
In this example, we generate multi-source data with $2$ source domains, each containing $100$ samples, plus $1,000$ samples on the target domain. We consider a multi-class classification problem with $C=K+1=3$ labels. The dimension of the parameters is $p=5$.
from cgdro.data import DataContainerSimu_linear_Cl
# two source groups, each with 100 samples, and 1000 target samples
n = 100; p = 5; L = 2; N = 1000; K = 2
data = DataContainerSimu_linear_Cl(n=n, N=N, p=p, L=L, K=K)
data.generate_funcs_list(seed=123)
data.generate_data(seed=123)
Xlist = data.X_sources_list
Ylist = data.Y_sources_list
X0 = data.X_target
Implementation & Results¶
## First instantiate the module,
## then call fit() and infer().
## Use linear regression for f_learner and logistic regression for w_learner.
cc = Classification(f_learner='linear', w_learner='logistic')
cc.fit(Xlist, Ylist, X0)
cc.infer()
cc.summary()
Model Summary:
=================================
CGDRO Aggregated Weights:
group   |      1      2
weight_ | 0.7985 0.2015
=================================
Coefficient Estimators:
Class 1 coefficients:
index |      1       2       3       4      5
coef_ | 0.0582 -0.1451 -0.0623 -0.0652 0.3645
Class 2 coefficients:
index |      1       2       3      4       5
coef_ | 0.3192 -0.3334 -0.1658 0.3339 -0.3211
=================================
Confidence Intervals:
Class 1 Confidence Intervals:
index |              1              2              3              4              5
CIs   | (-1.113,1.287) (-1.996,1.440) (-2.633,1.996) (-1.540,3.026) (-1.153,2.826)
Class 2 Confidence Intervals:
index |              1              2              3              4              5
CIs   | (-0.866,1.729) (-2.148,1.004) (-2.503,1.806) (-1.280,3.821) (-1.507,1.776)
## select the coefficient indices [3, 5] of class 2 for inference
cc.summary(
index = [3,5], class_index=2
)
Model Summary:
=================================
CGDRO Aggregated Weights:
group   |      1      2
weight_ | 0.7985 0.2015
=================================
Coefficient Estimators:
Class 2 coefficients:
index |       3       5
coef_ | -0.1658 -0.3211
=================================
Confidence Intervals:
Class 2 Confidence Intervals:
index |              3              5
CIs   | (-2.503,1.806) (-1.507,1.776)
We can obtain statistical inference results from CGDRO, including the CGDRO Aggregated Weights (learned weights for each group of source domains), Coefficient Estimators (the worst-case coefficient estimators on the target domain), and Confidence Intervals (valid confidence intervals for the target domain coefficients). In the summarized results above, group refers to each group of source domains; index refers to the index of the coefficients, starting from the intercept if intercept=True and otherwise from the first coefficient dimension; and Class ranges from $1$ to $K=C-1$.
Prediction¶
Make predictions on the target data (you do not have to specify the covariates used for prediction, since the target data is the default choice) and show the first 10 predicted softmax probabilities and labels.
pred_proba = cc.predict_proba()
print(pred_proba[:10, :])
[[0.46621458 0.33948019 0.19430523]
 [0.33735549 0.3403033  0.32234121]
 [0.49823435 0.25034083 0.25142482]
 [0.37073523 0.32970061 0.29956415]
 [0.24311714 0.18835182 0.56853105]
 [0.32211245 0.18882565 0.4890619 ]
 [0.4418154  0.42461165 0.13357295]
 [0.45198723 0.42602329 0.12198948]
 [0.36234295 0.34360026 0.29405679]
 [0.39832987 0.46336429 0.13830584]]
pred = cc.predict()
print(pred[:10])
[0 1 0 0 2 2 0 0 0 1]
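As a sanity check, the labels from `predict()` agree with the row-wise argmax of the probabilities from `predict_proba()`. Using the first row printed above:

```python
# First row of pred_proba printed above
row = [0.46621458, 0.33948019, 0.19430523]
label = max(range(len(row)), key=row.__getitem__)
print(label)  # 0, matching the first entry of pred
```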