Commit 0bdf1d8

add the choice of kl_loss when co-training, update README
1 parent e4d8ea7 commit 0bdf1d8

8 files changed

Lines changed: 43 additions & 28 deletions

README.md

Lines changed: 27 additions & 28 deletions
@@ -1,9 +1,9 @@
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![arXiv](https://img.shields.io/badge/cs.CV-%09arXiv%3A2011.14660-red)](https://arxiv.org/abs/2011.14660)

-# SplitNet: Divide and Co-training
+# Divide and Co-training

-SplitNet achieves 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
+Divide and co-training achieve 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
 by dividing one existing large network into several small ones and co-training.

 ## Table of Contents
@@ -31,31 +31,30 @@ by dividing one existing large network into several small ones and co-training.
 <div align="justify">

 This is the code for the paper
-<a href="https://arxiv.org/abs/2011.14660">SplitNet: Divide and Co-training.</a>
+<a href="https://arxiv.org/abs/2011.14660">
+Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training.</a>
 <br />

-The width of a neural network matters since increasing
-the width will necessarily increase the model capacity. However,
-the performance of a network does not improve linearly
-with the width and soon gets saturated. To tackle this problem,
-we propose to increase the number of networks rather
-than purely scaling up the width. To prove it, one large network
-is divided into several small ones, and each of these
-small networks has a fraction of the original one’s parameters.
-We then train these small networks together and make
-them see various views of the same data to learn different
-and complementary knowledge. During this co-training process,
-networks can also learn from each other. As a result,
-small networks can achieve better ensemble performance
+The width of a neural network matters since increasing the width
+will necessarily increase the model capacity.
+However, the performance of a network does not improve linearly
+with the width and soon gets saturated.
+In this case, we argue that increasing the number of networks (ensemble)
+can achieve better accuracy-efficiency trade-offs than purely increasing the width.
+To prove it,
+one large network is divided into several small ones
+regarding its parameters and regularization components.
+Each of these small networks has a fraction of the original one's parameters.
+We then train these small networks together and make them see various
+views of the same data to increase their diversity.
+During this co-training process,
+networks can also learn from each other.
+As a result, small networks can achieve better ensemble performance
 than the large one with few or no extra parameters or FLOPs.
-This reveals that the number of networks is a new dimension
-of effective model scaling, besides depth/width/resolution.
 Small networks can also achieve faster inference speed
-than the large one by concurrent running on different devices.
-We validate the idea --- increasing the number of
-networks is a new dimension of effective model scaling ---
-with different network architectures on common benchmarks
-through extensive experiments.
+than the large one by concurrent running on different devices.
+We validate our argument with 8 different neural architectures on
+common benchmarks through extensive experiments.
 </div>

 <div align=center>
@@ -70,8 +69,8 @@ through extensive experiments.

 ## Features and TODO

-- [x] Support SplitNet with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
-Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet. Also support ResNeSt without SplitNet.
+- [x] Support divide and co-training with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
+Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet.
 - [x] Different data augmentation methods, i.e., mixup, random erasing, auto-augment, rand-augment, cutout
 - [x] Distributed training (tested with multi-GPUs on single machine)
 - [x] Multi-GPUs synchronized BatchNormalization
@@ -197,7 +196,7 @@ wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/
 - Download [The SVHN dataset](http://ufldl.stanford.edu/housenumbers/) (*Format 2: Cropped Digits*),
 put them in the `dataset/svhn` directory.

-- `cd` to `github` directory and clone the `SplitNet-Divide-and-Co-training` repo.
+- `cd` to `github` directory and clone the `Divide-and-Co-training` repo.
 For brevity, rename it as `splitnet`.


@@ -291,9 +290,9 @@ Then run
 ## Citations

 ```
-@misc{2020_SplitNet,
+@misc{2020_splitnet,
 author = {Shuai Zhao and Liguang Zhou and Wenxiao Wang and Deng Cai and Tin Lun Lam and Yangsheng Xu},
-title = {SplitNet: Divide and Co-training},
+title = {Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training},
 howpublished = {arXiv},
 year = {2020}
 }

miscs/fig1_width.png (31.5 KB)

miscs/fig2_framework.png (9.76 KB)

miscs/fig3_latency.png (118 KB)

miscs/res_cifar10.png (17.3 KB)

miscs/res_cifar100.png (54.9 KB)

miscs/res_imagenet.png (107 KB)

model/splitnet.py

Lines changed: 16 additions & 0 deletions
@@ -202,6 +202,7 @@ def __init__(self,
         self.models = nn.ModuleList(models)
         self.criterion = criterion
         if args.is_identical_init:
+            print("INFO:PyTorch: Using identical initialization.")
             self._identical_init()

         # data transform - use different transformers for different networks
@@ -222,6 +223,7 @@ def __init__(self,
         self.cot_weight_warm_up_epochs = args.cot_weight_warm_up_epochs
         # self.kl_temperature = args.kl_temperature
         self.cot_loss_choose = args.cot_loss_choose
+        print("INFO:PyTorch: The co-training loss is {}.".format(self.cot_loss_choose))
         self.num_classes = args.num_classes

     def forward(self, x, target=None, mode='train', epoch=0, streams=None):
@@ -335,6 +337,20 @@ def _co_training_loss(self, outputs, loss_choose, epoch=0):
             H_mean = (- p_mean * torch.log(p_mean)).sum(-1).mean()
             H_sep = (- p_all * F.log_softmax(outputs_all, dim=-1)).sum(-1).mean()
             cot_loss = weight_now * (H_mean - H_sep)
+
+        elif loss_choose == 'kl_seperate':
+            outputs_all = torch.stack(outputs, dim=0)
+            # repeat [1,2,3] like [1,1,2,2,3,3] and [2,3,1,3,1,2]
+            outputs_r1 = torch.repeat_interleave(outputs_all, self.split_factor - 1, dim=0)
+            index_list = [j for i in range(self.split_factor) for j in range(self.split_factor) if j!=i]
+            outputs_r2 = torch.index_select(outputs_all, dim=0, index=torch.tensor(index_list, dtype=torch.long).cuda())
+            # calculate the KL divergence
+            kl_loss = F.kl_div(F.log_softmax(outputs_r1, dim=-1),
+                               F.softmax(outputs_r2, dim=-1).detach(),
+                               reduction='none')
+            # cot_loss = weight_now * (kl_loss.sum(-1).mean(-1).sum() / (self.split_factor - 1))
+            cot_loss = weight_now * (kl_loss.sum(-1).mean(-1).sum() / (self.split_factor - 1))
+
         else:
             raise NotImplementedError
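
The pairing done by the `kl_seperate` branch can be hard to read from the tensor calls alone, so here is a minimal, framework-free sketch of the same arithmetic. It is an illustration, not the repository's code: it uses plain Python lists instead of tensors, 0-based indices instead of the 1-based ones in the diff comment, and the helper names `pair_indices`, `log_softmax`, `kl_div` and `pairwise_kl_loss` are hypothetical. Each network is paired with every other network as its target, the KL divergence is summed over classes, averaged over the batch, summed over pairs, and divided by `split_factor - 1`.

```python
import math


def pair_indices(split_factor):
    """Enumerate every ordered pair (student, teacher) with student != teacher."""
    # Mirrors repeat_interleave: each student index repeated split_factor - 1 times.
    students = [i for i in range(split_factor) for _ in range(split_factor - 1)]
    # Mirrors index_list in the diff: all other networks, in order, for each student.
    teachers = [j for i in range(split_factor) for j in range(split_factor) if j != i]
    return students, teachers


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_div(log_pred, log_target):
    """KL(target || pred), with both arguments given as log-probabilities."""
    return sum(math.exp(lt) * (lt - lp) for lp, lt in zip(log_pred, log_target))


def pairwise_kl_loss(outputs, weight=1.0):
    """outputs[k][b] is the logit vector of network k for sample b.

    Assumes at least two networks; with one network the divisor would be zero.
    """
    split_factor = len(outputs)
    batch_size = len(outputs[0])
    log_probs = [[log_softmax(sample) for sample in net] for net in outputs]
    students, teachers = pair_indices(split_factor)
    total = 0.0
    for s, t in zip(students, teachers):
        # kl_div sums over classes; the division averages over the batch.
        total += sum(kl_div(log_probs[s][b], log_probs[t][b])
                     for b in range(batch_size)) / batch_size
    # Sum over all pairs, normalized by the number of teachers per student.
    return weight * total / (split_factor - 1)


if __name__ == "__main__":
    print(pair_indices(3))  # ([0, 0, 1, 1, 2, 2], [1, 2, 0, 2, 0, 1])
    logits = [[[2.0, 0.5, -1.0]], [[0.1, 0.2, 0.3]], [[-1.0, 0.0, 1.0]]]
    print(pairwise_kl_loss(logits))
```

In the diff, the target probabilities are additionally wrapped in `.detach()`, so gradients flow only through the student side of each pair; a list-based sketch has no gradient graph and cannot show that part.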
