Regression.linear.ld¶
In this module, we assume that the conditional outcome model in each source domain is a low-dimensional linear regression. For more details on the methods, please refer to CGDRO-Regression.
We can import the Regression.linear module with the code below:
# import Regression.linear module
from cgdro.Regression import linear
Now we give an example showing how to use Regression.linear.ld with three different loss functions:
- Reward-based loss
- Squared loss
- Regret-based loss
Module Arguments & Outputs¶
Regression.linear.ld¶
- intercept (bool, optional): whether to include an intercept in the outcome models. Defaults to False.
- delta (float, optional): ridge penalty level, non-negative. Defaults to 0.
- verbose (bool, optional): whether to print fitting information. Defaults to False.
Built-in functions in Regression.linear.ld:
| Built-in Functions | Description |
|---|---|
| fit() | Fit robust linear regression (low-dimensional) in the target domain. |
| predict() | Make robust predictions in the target domain. |
| infer() | Build confidence intervals for the target linear regression coefficients. |
| summary() | Summarize the results. |
fit()¶
Arguments:
- X_list (list of array-like): list of source-domain features; each element is n_i x d.
- y_list (list of array-like): list of source-domain labels; each element is n_i x 1.
- loss_type (str, optional): type of loss function used to compute the optimal aggregation weights; one of 'reward', 'squaredloss', 'regret'. Defaults to 'reward'.
- X0 (array-like, optional): target-domain features, n0 x d. If None, all sources' data are used. Defaults to None.
Outputs: enables the following attributes:
- .parameters:
  - "coef_": CGDRO aggregated coefficient estimators in the target domain;
  - "weight_": CGDRO aggregated weights of the source domains.
predict()¶
Arguments:
- X (array-like, optional): input features for prediction. If None, the target-domain data are used. Defaults to None.
Outputs:
- pred: linear predictions in the target domain.
infer()¶
Arguments:
- M (int, optional): number of resampling iterations. Defaults to 200.
- alpha (float, optional): significance level for the confidence intervals. Defaults to 0.05.
- alpha_thres (float, optional): threshold for generating samples. Defaults to 0.01.
Outputs: enables the following attributes:
- .CI: CGDRO aggregated confidence intervals for the target-domain coefficients.
summary()¶
Arguments:
- index (list or int, optional): index of interest in the coefficients. Defaults to None.
Outputs:
- Summary of CGDRO aggregated weights, estimators, and confidence intervals.
Example¶
Data Generating Process¶
In this example, we generate multi-source data with $3$ source domains, with $1,000$ samples in each source domain and $10,000$ samples in the target domain. The dimension of the parameters is $p=5$.
from cgdro.data import DataContainerSimu_linear_reg_lowd
from cgdro.geometry import *
# number of source groups = 3, with 1000 samples each
# sigma: source group 1,3: 0.5; source group 2: 2
# target sample size = 10000
# dimension p = 5
n_list = [1000, 1000, 1000]
N = 10000 # target sample size
data = DataContainerSimu_linear_reg_lowd(n_list=n_list, N=N, p=5)
data.generate_funcs_list(seed=0)
data.generate_data(seed=0)
Xlist = data.X_sources_list
Ylist = data.Y_sources_list
X0 = data.X_target
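The container above hides the simulation details. As a rough standalone illustration (this is a hypothetical sketch, not the actual DataContainerSimu_linear_reg_lowd implementation), a multi-source linear data-generating process with group-specific noise levels could look like:

```python
import numpy as np

def simulate_sources(n_list, N, p, sigmas, seed=0):
    """Hypothetical multi-source linear DGP (illustration only, not cgdro's code)."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)                          # shared base coefficient vector
    X_list, Y_list = [], []
    for n, sigma in zip(n_list, sigmas):
        beta_l = beta + 0.1 * rng.normal(size=p)       # each source model differs slightly
        X = rng.normal(size=(n, p))
        Y = X @ beta_l + sigma * rng.normal(size=n)    # group-specific noise level
        X_list.append(X)
        Y_list.append(Y)
    X0 = rng.normal(size=(N, p))                       # target domain: covariates only
    return X_list, Y_list, X0

Xlist_demo, Ylist_demo, X0_demo = simulate_sources(
    n_list=[1000, 1000, 1000], N=10000, p=5, sigmas=[0.5, 2.0, 0.5])
print([X.shape for X in Xlist_demo], X0_demo.shape)
```

The sigmas argument mirrors the comment above: groups 1 and 3 get noise 0.5 and group 2 gets noise 2.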
Implementation & Results¶
We implement the three loss functions in Regression.linear.ld: reward, squaredloss, and regret. Geometrically: reward: $f^{*}$ is the point within the convex hull of $\{f^{(l)}\}_{l\in[L]}$ closest to the origin; squaredloss: $f^{sq}$ corresponds to the source model with the highest noise level when that noise is substantially higher than in the other sources; regret: $f^{reg}$ is the center of the smallest ball enclosing all individual source models. The following are three examples of Regression.linear.ld, one for each loss type.
Note: in Regression.linear.ld, confidence intervals via infer() are available only when loss_type='reward'; for the other loss types, only point estimation and prediction are supported.
loss_type = reward¶
## First, instantiate the module class
## Then call fit() and infer()
## Note: infer() can be called to get confidence intervals only when loss_type='reward'
## For other loss_type values, only point estimation and prediction are available
reg = linear.ld()
reg.fit(Xlist, Ylist, X0, loss_type='reward')
reg.infer(alpha=0.05)
reg.summary()
Model Summary:
=================================
CGDRO Aggregated Weights:
group   |  1       2       3
weight_ |  0.4567  0.3451  0.1982
=================================
Coefficient Estimators:
index |  1        2        3        4        5
coef_ | -0.0655  -0.0433   0.0032  -0.0018   0.0997
=================================
Confidence Intervals:
index |  1                  2                 3                 4                 5
CI    | (-0.1283,-0.0027)  (-0.1080,0.0214)  (-0.0601,0.0665)  (-0.0666,0.0629)  (0.0351,0.1643)
CGDRO returns statistical inference results including the CGDRO Aggregated Weights (the learned weight for each group of source domains), the Coefficient Estimators (the worst-case coefficient estimates in the target domain), and the Confidence Intervals (valid confidence intervals for the target-domain coefficients). In the summary above, group indexes the source-domain groups and index indexes the coefficients, starting from the intercept if intercept=True and from the first coefficient dimension otherwise.
Make predictions on the target data (you do not need to specify the covariates used for prediction, since the target data is the default choice) and show the first 10 predicted values.
pred = reg.predict()
print(pred[:10])
[-0.13644197 -0.08692431 0.04184841 -0.18107761 -0.30044671 0.0026928 -0.30307077 -0.18239394 -0.06712825 0.20524958]
# Geometry view: the point of the convex hull of the source coefficients closest to the origin
beta_source = reg.beta_list
beta_ch, w_ch = nearest_on_convex_hull(beta_source)
print("Estimated coefficients on convex hull:", beta_ch)
print("Weights:", w_ch)
Estimated coefficients on convex hull: [-0.06592632 -0.0432436   0.00284681 -0.00168026  0.09944676]
Weights: [0.45700851 0.34576112 0.19723037]
We can see from the results above that $f^{*}$ is indeed the point within the convex hull of $\{f^{(l)}\}_{l\in[L]}$ closest to the origin.
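This geometry can also be checked standalone, independently of cgdro. The helper below is a hypothetical sketch (not part of the package) that finds the minimum-norm point of a convex hull with the Frank-Wolfe algorithm:

```python
import numpy as np

def min_norm_in_hull(vertices, iters=20000):
    """Frank-Wolfe for the point of conv(vertices) closest to the origin (hypothetical helper)."""
    V = np.asarray(vertices, dtype=float)       # (L, d): one source model per row
    w = np.full(V.shape[0], 1.0 / V.shape[0])   # start from uniform weights on the simplex
    for t in range(iters):
        f = w @ V                               # current point in the hull
        s = np.argmin(V @ f)                    # vertex minimizing the linearized objective
        gamma = 2.0 / (t + 2.0)                 # standard Frank-Wolfe step size
        w *= 1.0 - gamma
        w[s] += gamma
    return w @ V, w

# Toy check in 2D; the analytic minimum-norm point of this hull is (1.2, 0.6)
pts = np.array([[1.0, 1.0], [2.0, -1.0], [1.0, 3.0]])
f_star, w = min_norm_in_hull(pts)
print(f_star, w)
```

The returned weights play the same role as weight_ above: the convex-combination coefficients that realize the minimum-norm point.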
loss_type = squaredloss¶
reg = linear.ld()
reg.fit(Xlist, Ylist, X0, loss_type='squaredloss')
#reg.infer()
reg.summary()
Model Summary:
=================================
CGDRO Aggregated Weights:
group   |  1       2       3
weight_ |  0.0000  1.0000  0.0000
=================================
Coefficient Estimators:
index |  1        2        3        4        5
coef_ | -0.3487  -0.1735  -0.2884  -0.1579  -0.1389
Confidence Intervals not computed. Please run infer() method.
pred = reg.predict()
print(pred[:10])
[-0.15697078 0.20275954 -0.30645405 -0.57685222 -0.10223598 -0.03202532 -0.90429844 0.972492 1.15274749 0.42575315]
# Geometry view: the source group with sufficiently large noise dominates
reg.beta_list[1]
array([-0.34866929, -0.17351915, -0.2884043 , -0.15793023, -0.13894055])
We can see from the results above that $f^{sq}$ corresponds to the source model with the highest noise level, since that noise is substantially higher than in the other sources.
loss_type = regret¶
reg = linear.ld()
reg.fit(Xlist, Ylist, X0, loss_type='regret')
reg.summary()
Model Summary:
=================================
CGDRO Aggregated Weights:
group   |  1       2       3
weight_ |  0.3184  0.4467  0.2350
=================================
Coefficient Estimators:
index |  1        2        3        4        5
coef_ | -0.1004  -0.0816  -0.0368  -0.0537   0.0602
Confidence Intervals not computed. Please run infer() method.
pred = reg.predict()
print(pred[:10])
[-0.16569961 0.00258756 -0.03526818 -0.25353749 -0.30456777 0.02647563 -0.39066011 -0.00236404 0.13115212 0.20484504]
# Geometry view: the center of the minimum enclosing ball of the source coefficients
beta_source = reg.beta_list
beta_cr, r_cr, w_cr = circumcenter_3vectors(beta_source)
print("Estimated coefficients on center of minimum enclosing ball:", beta_cr)
print("Weights:", w_cr)
Estimated coefficients on center of minimum enclosing ball: [-0.09880893 -0.08368395 -0.03569643 -0.05683792  0.06023717]
Weights: [0.3107334  0.44558458 0.24368202]
We can see from the results above that $f^{reg}$ is the center of the smallest ball enclosing all individual source models.
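As with the convex-hull view, the regret geometry can be checked standalone. The helper below is a hypothetical sketch (again independent of cgdro) using the Badoiu-Clarkson iteration, which approximates the minimum enclosing ball center by repeatedly taking a shrinking step toward the current farthest point:

```python
import numpy as np

def min_enclosing_ball(points, iters=10000):
    """Badoiu-Clarkson iteration for an approximate minimum enclosing ball (hypothetical helper)."""
    P = np.asarray(points, dtype=float)
    c = P.mean(axis=0)                                   # start at the centroid
    for t in range(1, iters + 1):
        far = np.argmax(np.linalg.norm(P - c, axis=1))   # current farthest point
        c = c + (P[far] - c) / (t + 1)                   # shrinking step toward it
    r = np.linalg.norm(P - c, axis=1).max()
    return c, r

# Toy check in 2D; the exact minimum enclosing ball here has center (1, 0) and radius 1
pts = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
c, r = min_enclosing_ball(pts)
print(c, r)
```

For exactly three source models this recovers (approximately) the same center that circumcenter_3vectors computes above; the iterative form also extends to more than three sources.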