DALIB Algorithms¶

The adaptation subpackage contains definitions for the following domain adaptation algorithms:

Besides specific algorithms, this package also provides a recommended image classifier for each algorithms.

We provide benchmarks of different domain adaptation algorithms on Office-31, Office-Home and VisDA-2017 as follows. Note that Origin means the accuracy reported by the original paper, while Avg is the accuracy reported by DALIB.

Office-31 accuracy on ResNet-50

Methods	Origin	Avg	A → W	D → W	W → D	A → D	D → A	W → A
Source Only	76.1	79.5	75.8	95.5	99.0	79.3	63.6	63.8
DANN	82.2	86.4	91.7	97.9	100.0	82.9	72.8	73.3
DAN	80.4	83.7	84.2	98.4	100.0	87.3	66.9	65.2
JAN	84.3	87.3	93.7	98.4	100.0	89.4	71.2	71.0
CDAN	87.7	88.7	93.1	98.6	100.0	93.4	75.6	71.5
MCD	/	85.9	91.8	98.6	100.0	89.0	69.0	66.9
MDD	88.9	89.2	93.6	98.6	100.0	93.6	76.7	72.9

Office-Home accuracy on ResNet-50

Methods	Origin	Avg	Ar → Cl	Ar → Pr	Ar → Rw	Cl → Ar	Cl → Pr	Cl → Rw	Pr → Ar	Pr → Cl	Pr → Rw	Rw → Ar	Rw → Cl	Rw → Pr
Source Only	46.1	58.2	41.5	65.8	73.6	52.2	59.5	63.6	51.5	36.4	71.3	65.2	42.8	75.4
DANN	57.6	65.5	52.7	61.8	73.4	57.4	67.2	69.6	57.2	55.4	79.0	71.4	60.0	81.1
DAN	56.3	61.6	45.5	67.9	73.9	57.6	63.7	66.2	55.2	39.7	74.3	66.8	49.1	78.7
JAN	58.3	65.9	50.4	71.8	76.7	60.0	67.7	68.9	60.4	49.8	77.0	71.2	55.6	81.0
CDAN	65.8	68.8	54.4	70.9	77.9	61.6	71.1	71.9	62.3	54.9	80.7	75.1	60.8	83.7
MCD	/	67.8	51.7	72.2	78.2	63.7	69.5	70.8	61.5	52.8	78.0	74.5	58.4	81.8
MDD	68.1	69.5	56.2	74.9	78.8	63.4	72.5	72.6	63.8	54.6	80.0	73.5	60.1	83.7

VisDA-2017 accuracy on ResNet-50 and ResNet-101

Methods	Origin	DALIB	Origin	DALIB
Backbone	ResNet-50	ResNet-50	ResNet-101	ResNet-101
Source Only	/	55.1	52.4	58.3
DANN	/	72.6	57.4	72.9
DAN	/	60.6	61.1	64.8
JAN	61.6	64.9	/	68.0
CDAN	66.8	74.6	/	74.5
MCD	69.2	69.1	71.9	77.3
MDD	74.6	74.9	/	78.5

DomainNet accuracy on ResNet-101

Source Only	clp	inf	pnt	real	skt	Avg
clp	N/A	18.0	32.7	50.6	39.4	35.2
inf	35.7	N/A	31.1	50.0	26.5	35.8
pnt	41.1	17.8	N/A	56.8	35.0	37.7
real	48.6	22.9	48.8	N/A	36.1	39.1
skt	49.0	15.3	34.8	46.1	N/A	36.3
Avg	43.6	18.5	36.9	50.9	34.3	36.8

DANN	clp	inf	pnt	real	skt	Avg
clp	N/A	19.7	35.4	53.9	44.2	38.3
inf	26.7	N/A	23.8	28.8	23.7	25.8
pnt	37.2	18.7	N/A	51.1	36.0	35.8
real	50.6	22.1	47.9	N/A	39.0	39.9
skt	54.0	19.7	42.7	52.8	N/A	42.3
Avg	42.1	20.1	37.5	46.7	35.7	36.4

DAN	clp	inf	pnt	real	skt	Avg
clp	N/A	17.3	37.9	54.0	42.6	38.0
inf	34.9	N/A	33.4	46.5	29.9	36.2
pnt	43.9	17.7	N/A	55.9	39.3	39.2
real	50.1	20.0	48.6	N/A	38.4	39.3
skt	54.2	17.5	44.2	53.4	N/A	42.3
Avg	45.8	18.1	41.0	52.5	37.6	39.0

CDAN	clp	inf	pnt	real	skt	Avg
clp	N/A	20.8	40.0	56.1	45.5	40.6
inf	31.2	N/A	30.0	41.4	24.7	31.8
pnt	44.6	20.5	N/A	57.0	39.9	40.5
real	55.3	24.1	52.6	N/A	42.4	43.6
skt	56.7	21.3	46.2	55.0	N/A	44.8
Avg	47.0	21.7	42.2	52.4	38.1	40.3

MDD	clp	inf	pnt	real	skt	Avg
clp	N/A	21.2	42.9	59.5	47.5	42.8
inf	35.3	N/A	34.0	49.6	29.4	37.1
pnt	48.6	19.7	N/A	59.4	42.6	42.6
real	58.3	24.9	53.7	N/A	46.2	45.8
skt	58.7	20.7	46.5	57.7	N/A	45.9
Avg	50.2	21.6	44.3	56.6	41.4	42.8

DANN¶

class dalib.adaptation.dann.DomainAdversarialLoss(domain_discriminator: torch.nn.modules.module.Module, reduction: Optional[str] = 'mean')[source]¶

Bases: torch.nn.modules.module.Module

The Domain Adversarial Loss

Domain adversarial loss measures the domain discrepancy through training a domain discriminator. Given domain discriminator \(D\), feature representation \(f\), the definition of DANN loss is

\[\begin{split}loss(\mathcal{D}_s, \mathcal{D}_t) &= \mathbb{E}_{x_i^s \sim \mathcal{D}_s} log[D(f_i^s)] \\ &+ \mathbb{E}_{x_j^t \sim \mathcal{D}_t} log[1-D(f_j^t)].\\\end{split}\]

Parameters:

domain_discriminator (class:nn.Module object): A domain discriminator object, which predicts the domains of features. Its input shape is (N, F) and output shape is (N, 1)
reduction (string, optional): Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Inputs: f_s, f_t

f_s (tensor): feature representations on source domain, \(f^s\)
f_t (tensor): feature representations on target domain, \(f^t\)

Shape:

f_s, f_t: \((N, F)\) where F means the dimension of input features.
Outputs: scalar by default. If :attr:reduction is 'none', then \((N, )\).

Examples::

>>> from dalib.modules.domain_discriminator import DomainDiscriminator
>>> discriminator = DomainDiscriminator(in_feature=1024, hidden_size=1024)
>>> loss = DomainAdversarialLoss(discriminator, reduction='mean')
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(20, 1024), torch.randn(20, 1024)
>>> output = loss(f_s, f_t)

DAN¶

class dalib.adaptation.dan.MultipleKernelMaximumMeanDiscrepancy(kernels: Sequence[torch.nn.modules.module.Module], linear: Optional[bool] = False, quadratic_program: Optional[bool] = False)[source]¶

Bases: torch.nn.modules.module.Module

The Multiple Kernel Maximum Mean Discrepancy (MK-MMD) used in Learning Transferable Features with Deep Adaptation Networks

Given source domain \(\mathcal{D}_s\) of \(n_s\) labeled points and target domain \(\mathcal{D}_t\) of \(n_t\) unlabeled points drawn i.i.d. from P and Q respectively, the deep networks will generate activations as \(\{z_i^s\}_{i=1}^{n_s}\) and \(\{z_i^t\}_{i=1}^{n_t}\). The MK-MMD \(D_k (P, Q)\) between probability distributions P and Q is defined as

\[D_k(P, Q) \triangleq \| E_p [\phi(z^s)] - E_q [\phi(z^t)] \|^2_{\mathcal{H}_k},\]

\(k\) is a kernel function in the function space

\[\mathcal{K} \triangleq \{ k=\sum_{u=1}^{m}\beta_{u} k_{u} \}\]

where \(k_{u}\) is a single kernel.

Using kernel trick, MK-MMD can be computed as

\[\begin{split}\hat{D}_k(P, Q) &= \dfrac{1}{n_s^2} \sum_{i=1}^{n_s}\sum_{j=1}^{n_s} k(z_i^{s}, z_j^{s}) \\ &+ \dfrac{1}{n_t^2} \sum_{i=1}^{n_t}\sum_{j=1}^{n_t} k(z_i^{t}, z_j^{t}) \\ &- \dfrac{2}{n_s n_t} \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} k(z_i^{s}, z_j^{t}). \\\end{split}\]

Parameters:

kernels (tuple(nn.Module)): kernel functions.
linear (bool): whether use the linear version of DAN. Default: False
quadratic_program (bool): whether use quadratic program to solve \(\beta\). Default: False

Inputs: z_s, z_t

z_s (tensor): activations from the source domain, \(z^s\)
z_t (tensor): activations from the target domain, \(z^t\)

Shape:

Inputs: \((minibatch, *)\) where * means any dimension
Outputs: scalar

Note

Activations \(z^{s}\) and \(z^{t}\) must have the same shape.

Note

The kernel values will add up when there are multiple kernels.

Examples::

>>> from dalib.modules.kernels import GaussianKernel
>>> feature_dim = 1024
>>> batch_size = 10
>>> kernels = (GaussianKernel(alpha=0.5), GaussianKernel(alpha=1.), GaussianKernel(alpha=2.))
>>> loss = MultipleKernelMaximumMeanDiscrepancy(kernels)
>>> # features from source domain and target domain
>>> z_s, z_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> output = loss(z_s, z_t)

JAN¶

class dalib.adaptation.jan.JointMultipleKernelMaximumMeanDiscrepancy(kernels: Sequence[Sequence[torch.nn.modules.module.Module]], linear: Optional[bool] = True, thetas: Sequence[torch.nn.modules.module.Module] = None)[source]¶

Bases: torch.nn.modules.module.Module

The Joint Multiple Kernel Maximum Mean Discrepancy (JMMD) used in Deep Transfer Learning with Joint Adaptation Networks

Given source domain \(\mathcal{D}_s\) of \(n_s\) labeled points and target domain \(\mathcal{D}_t\) of \(n_t\) unlabeled points drawn i.i.d. from P and Q respectively, the deep networks will generate activations in layers \(\mathcal{L}\) as \(\{(z_i^{s1}, ..., z_i^{s|\mathcal{L}|})\}_{i=1}^{n_s}\) and \(\{(z_i^{t1}, ..., z_i^{t|\mathcal{L}|})\}_{i=1}^{n_t}\). The empirical estimate of \(\hat{D}_{\mathcal{L}}(P, Q)\) is computed as the squared distance between the empirical kernel mean embeddings as

\[\begin{split}\hat{D}_{\mathcal{L}}(P, Q) &= \dfrac{1}{n_s^2} \sum_{i=1}^{n_s}\sum_{j=1}^{n_s} \prod_{l\in\mathcal{L}} k^l(z_i^{sl}, z_j^{sl}) \\ &+ \dfrac{1}{n_t^2} \sum_{i=1}^{n_t}\sum_{j=1}^{n_t} \prod_{l\in\mathcal{L}} k^l(z_i^{tl}, z_j^{tl}) \\ &- \dfrac{2}{n_s n_t} \sum_{i=1}^{n_s}\sum_{j=1}^{n_t} \prod_{l\in\mathcal{L}} k^l(z_i^{sl}, z_j^{tl}). \\\end{split}\]

Parameters:

kernels (tuple(tuple(nn.Module))): kernel functions, where kernels[r] corresponds to kernel \(k^{\mathcal{L}[r]}\).
linear (bool): whether use the linear version of JAN. Default: False
thetas (list(Theta): use adversarial version JAN if not None. Default: None

Inputs: z_s, z_t

z_s (tuple(tensor)): multiple layers’ activations from the source domain, \(z^s\)
z_t (tuple(tensor)): multiple layers’ activations from the target domain, \(z^t\)

Shape:

\(z^{sl}\) and \(z^{tl}\): \((minibatch, *)\) where * means any dimension
Outputs: scalar

Note

Activations \(z^{sl}\) and \(z^{tl}\) must have the same shape.

Note

The kernel values will add up when there are multiple kernels for a certain layer.

Examples::

>>> feature_dim = 1024
>>> batch_size = 10
>>> layer1_kernels = (GaussianKernel(alpha=0.5), GaussianKernel(1.), GaussianKernel(2.))
>>> layer2_kernels = (GaussianKernel(1.), )
>>> loss = JointMultipleKernelMaximumMeanDiscrepancy((layer1_kernels, layer2_kernels))
>>> # layer1 features from source domain and target domain
>>> z1_s, z1_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # layer2 features from source domain and target domain
>>> z2_s, z2_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> output = loss((z1_s, z2_s), (z1_t, z2_t))

CDAN¶

class dalib.adaptation.cdan.ConditionalDomainAdversarialLoss(domain_discriminator: torch.nn.modules.module.Module, entropy_conditioning: Optional[bool] = False, randomized: Optional[bool] = False, num_classes: Optional[int] = -1, features_dim: Optional[int] = -1, randomized_dim: Optional[int] = 1024, reduction: Optional[str] = 'mean')[source]¶

Bases: torch.nn.modules.module.Module

The Conditional Domain Adversarial Loss

Conditional Domain adversarial loss measures the domain discrepancy through training a domain discriminator in a conditional manner. Given domain discriminator \(D\), feature representation \(f\) and classifier predictions \(g\), the definition of CDAN loss is

\[\begin{split}loss(\mathcal{D}_s, \mathcal{D}_t) &= \mathbb{E}_{x_i^s \sim \mathcal{D}_s} log[D(T(f_i^s, g_i^s))] \\ &+ \mathbb{E}_{x_j^t \sim \mathcal{D}_t} log[1-D(T(f_j^t, g_j^t))],\\\end{split}\]

where \(T\) is a multi linear map or randomized multi linear map which convert two tensors to a single tensor.

Parameters:

domain_discriminator (class:nn.Module object): A domain discriminator object, which predicts the domains of features. Its input shape is (N, F) and output shape is (N, 1)
entropy_conditioning (bool, optional): If True, use entropy-aware weight to reweight each training example. Default: False
randomized (bool, optional): If True, use randomized multi linear map. Else, use multi linear map. Default: False
num_classes (int, optional): Number of classes. Default: -1
features_dim (int, optional): Dimension of input features. Default: -1
randomized_dim (int, optional): Dimension of features after randomized. Default: 1024
reduction (string, optional): Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Note

You need to provide num_classes, features_dim and randomized_dim only when randomized is set True.

Inputs: g_s, f_s, g_t, f_t

g_s (tensor): unnormalized classifier predictions on source domain, \(g^s\)
f_s (tensor): feature representations on source domain, \(f^s\)
g_t (tensor): unnormalized classifier predictions on target domain, \(g^t\)
f_t (tensor): feature representations on target domain, \(f^t\)

Shape:

g_s, g_t: \((minibatch, C)\) where C means the number of classes.
f_s, f_t: \((minibatch, F)\) where F means the dimension of input features.
Output: scalar by default. If :attr:reduction is 'none', then \((minibatch, )\).

Examples::

>>> from dalib.modules.domain_discriminator import DomainDiscriminator
>>> num_classes = 2
>>> feature_dim = 1024
>>> batch_size = 10
>>> discriminator = DomainDiscriminator(in_feature=feature_dim, hidden_size=1024)
>>> loss = ConditionalDomainAdversarialLoss(discriminator, reduction='mean')
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # logits output from source domain adn target domain
>>> g_s, g_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> output = loss(g_s, f_s, g_t, f_t)

class dalib.adaptation.cdan.RandomizedMultiLinearMap(features_dim: int, num_classes: int, output_dim: Optional[int] = 1024)[source]¶

Bases: torch.nn.modules.module.Module

Random multi linear map

Given two inputs \(f\) and \(g\), the definition is

\[T_{\odot}(f,g) = \dfrac{1}{\sqrt{d}} (R_f f) \odot (R_g g),\]

where \(\odot\) is element-wise product, \(R_f\) and \(R_g\) are random matrices sampled only once and ﬁxed in training.

Parameters:

features_dim (int): dimension of input \(f\)
num_classes (int): dimension of input \(g\)
output_dim (int, optional): dimension of output tensor. Default: 1024

Shape:

f: (minibatch, features_dim)
g: (minibatch, num_classes)
Outputs: (minibatch, output_dim)

class dalib.adaptation.cdan.MultiLinearMap[source]¶

Bases: torch.nn.modules.module.Module

Multi linear map

Shape:

f: (minibatch, F)
g: (minibatch, C)
Outputs: (minibatch, F * C)

MCD¶

dalib.adaptation.mcd.classifier_discrepancy(predictions1: torch.Tensor, predictions2: torch.Tensor) → torch.Tensor[source]¶

The Classifier Discrepancy in Maximum Classiﬁer Discrepancy for Unsupervised Domain Adaptation. The classfier discrepancy between predictions \(p_1\) and \(p_2\) can be described as:

\[d(p_1, p_2) = \dfrac{1}{K} \sum_{k=1}^K | p_{1k} - p_{2k} |,\]

where K is number of classes.

Parameters:

predictions1 (tensor): Classifier predictions \(p_1\). Expected to contain raw, normalized scores for each class
predictions2 (tensor): Classifier predictions \(p_2\)

dalib.adaptation.mcd.entropy(predictions: torch.Tensor) → torch.Tensor[source]¶

Entropy of N predictions \((p_1, p_2, ..., p_N)\). The definition is:

\[d(p_1, p_2, ..., p_N) = -\dfrac{1}{K} \sum_{k=1}^K \log \left( \dfrac{1}{N} \sum_{i=1}^N p_{ik} \right)\]

where K is number of classes.

Parameters:

predictions (tensor): Classifier predictions. Expected to contain raw, normalized scores for each class

class dalib.adaptation.mcd.ImageClassifierHead(in_features: int, num_classes: int, bottleneck_dim: Optional[int] = 1024)[source]¶

Bases: torch.nn.modules.module.Module

Classifier Head for MCD. Parameters:

in_features (int): Dimension of input features

num_classes (int): Number of classes

bottleneck_dim (int, optional): Feature dimension of the bottleneck layer. Default: 1024

Shape:

Inputs: \((minibatch, F)\) where F = in_features.
Output: \((minibatch, C)\) where C = num_classes.

MDD¶

class dalib.adaptation.mdd.MarginDisparityDiscrepancy(margin: Optional[int] = 4, reduction: Optional[str] = 'mean')[source]¶

Bases: torch.nn.modules.module.Module

The margin disparity discrepancy (MDD) is proposed to measure the distribution discrepancy in domain adaptation.

The \(y^s\) and \(y^t\) are logits output by the main classifier on the source and target domain respectively. The \(y_{adv}^s\) and \(y_{adv}^t\) are logits output by the adversarial classifier. They are expected to contain raw, unnormalized scores for each class.

The definition can be described as:

\[\mathcal{D}_{\gamma}(\hat{\mathcal{S}}, \hat{\mathcal{T}}) = \gamma \mathbb{E}_{y^s, y_{adv}^s \sim\hat{\mathcal{S}}} \log\left(\frac{\exp(y_{adv}^s[h_{y^s}])}{\sum_j \exp(y_{adv}^s[j])}\right) + \mathbb{E}_{y^t, y_{adv}^t \sim\hat{\mathcal{T}}} \log\left(1-\frac{\exp(y_{adv}^t[h_{y^t}])}{\sum_j \exp(y_{adv}^t[j])}\right),\]

where \(\gamma\) is a margin hyper-parameter and \(h_y\) refers to the predicted label when the logits output is \(y\). You can see more details in Bridging Theory and Algorithm for Domain Adaptation.

Parameters:

margin (float): margin \(\gamma\). Default: 4
reduction (string, optional): Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Inputs: y_s, y_s_adv, y_t, y_t_adv

y_s: logits output \(y^s\) by the main classifier on the source domain
y_s_adv: logits output \(y^s\) by the adversarial classifier on the source domain
y_t: logits output \(y^t\) by the main classifier on the target domain
y_t_adv: logits output \(y_{adv}^t\) by the adversarial classifier on the target domain

Shape:

Inputs: \((minibatch, C)\) where C = number of classes, or \((minibatch, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss.
Output: scalar. If reduction is 'none', then the same size as the target: \((minibatch)\), or \((minibatch, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional loss.

Examples::

>>> num_classes = 2
>>> batch_size = 10
>>> loss = MarginDisparityDiscrepancy(margin=4.)
>>> # logits output from source domain and target domain
>>> y_s, y_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> # adversarial logits output from source domain and target domain
>>> y_s_adv, y_t_adv = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> output = loss(y_s, y_s_adv, y_t, y_t_adv)

class dalib.adaptation.mdd.ImageClassifier(backbone: torch.nn.modules.module.Module, num_classes: int, bottleneck_dim: Optional[int] = 1024, width: Optional[int] = 1024)[source]¶

Bases: torch.nn.modules.module.Module

Classifier for MDD. Parameters:

backbone (class:nn.Module object): Any backbone to extract 1-d features from data

num_classes (int): Number of classes

bottleneck_dim (int, optional): Feature dimension of the bottleneck layer. Default: 1024

width (int, optional): Feature dimension of the classifier head. Default: 1024

Note

Classifier for MDD has one backbone, one bottleneck, while two classifier heads. The first classifier head is used for final predictions. The adversarial classifier head is only used when calculating MarginDisparityDiscrepancy.

Note

Remember to call function step() after function forward() during training phase! For instance,

>>> # x is inputs, classifier is an ImageClassifier
>>> outputs, outputs_adv = classifier(x)
>>> classifier.step()

Inputs:

x (Tensor): input data

Outputs: (outputs, outputs_adv)

outputs: logits outputs by the main classifier
outputs_adv: logits outputs by the adversarial classifier

Shapes:

x: \((minibatch, *)\), same shape as the input of the backbone.
outputs, outputs_adv: \((minibatch, C)\), where C means the number of classes.

dalib.adaptation.mdd.shift_log(x: torch.Tensor, offset: Optional[float] = 1e-06) → torch.Tensor[source]¶

First shift, then calculate log, which can be described as:

\[y = \max(\log(x+\text{offset}), 0)\]

Used to avoid the gradient explosion problem in log(x) function when x=0.

Parameters:

x: input tensor
offset: offset size. Default: 1e-6

Note

Input tensor falls in [0., 1.] and the output tensor falls in [-log(offset), 0]