  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
  [![arXiv](https://img.shields.io/badge/cs.CV-%09arXiv%3A2011.14660-red)](https://arxiv.org/abs/2011.14660)

- # SplitNet: Divide and Co-training
+ # Divide and Co-training

- SplitNet achieves 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
+ Divide and co-training achieve 98.71% on CIFAR-10, 89.46% on CIFAR-100, and 83.60% on ImageNet (SE-ResNet-101, 64x4d, 320px)
  by dividing one existing large network into several small ones and co-training.

  ## Table of Contents
@@ -31,31 +31,30 @@ by dividing one existing large network into several small ones and co-training.
  <div align="justify">

  This is the code for the paper
- <a href="https://arxiv.org/abs/2011.14660">SplitNet: Divide and Co-training.</a>
+ <a href="https://arxiv.org/abs/2011.14660">
+ Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training.</a>
  <br />

- The width of a neural network matters since increasing
- the width will necessarily increase the model capacity. However,
- the performance of a network does not improve linearly
- with the width and soon gets saturated. To tackle this problem,
- we propose to increase the number of networks rather
- than purely scaling up the width. To prove it, one large network
- is divided into several small ones, and each of these
- small networks has a fraction of the original one’s parameters.
- We then train these small networks together and make
- them see various views of the same data to learn different
- and complementary knowledge. During this co-training process,
- networks can also learn from each other. As a result,
- small networks can achieve better ensemble performance
+ The width of a neural network matters since increasing the width
+ will necessarily increase the model capacity.
+ However, the performance of a network does not improve linearly
+ with the width and soon gets saturated.
+ In this case, we argue that increasing the number of networks (ensemble)
+ can achieve better accuracy-efficiency trade-offs than purely increasing the width.
+ To prove it,
+ one large network is divided into several small ones
+ regarding its parameters and regularization components.
+ Each of these small networks has a fraction of the original one's parameters.
+ We then train these small networks together and make them see various
+ views of the same data to increase their diversity.
+ During this co-training process,
+ networks can also learn from each other.
+ As a result, small networks can achieve better ensemble performance
  than the large one with few or no extra parameters or FLOPs.
- This reveals that the number of networks is a new dimension
- of effective model scaling, besides depth/width/resolution.
  Small networks can also achieve faster inference speed
- than the large one by concurrent running on different devices.
- We validate the idea --- increasing the number of
- networks is a new dimension of effective model scaling ---
- with different network architectures on common benchmarks
- through extensive experiments.
+ than the large one by concurrent running on different devices.
+ We validate our argument with 8 different neural architectures on
+ common benchmarks through extensive experiments.
  </div>
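
The idea summarized in the abstract can be sketched roughly as follows. This is an illustration only, not the repository's implementation: the function names, the example logits, and the choice of dividing channel widths by the square root of the number of networks are all assumptions made here. The sketch averages the class probabilities of several small networks to form the ensemble prediction, and checks that splitting one convolution layer this way keeps the total parameter count unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_network):
    """Average the class probabilities of several small networks.

    Returns the predicted class index and the averaged probabilities.
    """
    probs = [softmax(logits) for logits in logits_per_network]
    num_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return avg.index(max(avg)), avg


def conv_params(c_in, c_out, k=3):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def split_conv_params(c_in, c_out, num_networks, k=3):
    """Weight count of the same layer in ONE small network.

    Assumption (hypothetical rule for this sketch): channel widths are
    divided by sqrt(num_networks), so each small network holds roughly
    1 / num_networks of the original weights.
    """
    scale = math.sqrt(num_networks)
    return conv_params(round(c_in / scale), round(c_out / scale), k)


# Hypothetical logits from two small networks for one 3-class sample.
logits = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2]]
label, avg = ensemble_predict(logits)

# One 256 -> 256 layer, divided into 4 small networks of 128 -> 128.
large = conv_params(256, 256)                         # 589824 weights
small = split_conv_params(256, 256, num_networks=4)   # 147456 weights each
```

With these numbers, four small layers together hold exactly as many weights as the one large layer, which matches the claim of "few or no extra parameters" for this simplified case.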

  <div align=center>
@@ -70,8 +69,8 @@ through extensive experiments.

  ## Features and TODO

- - [x] Support SplitNet with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
- Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet. Also support ResNeSt without SplitNet.
+ - [x] Support divide and co-training with different models, i.e., ResNet, Wide-ResNet, ResNeXt, ResNeXSt, SENet,
+ Shake-Shake, DenseNet, PyramidNet (+Shake-Drop), EfficientNet.
  - [x] Different data augmentation methods, i.e., mixup, random erasing, auto-augment, rand-augment, cutout
  - [x] Distributed training (tested with multi-GPUs on single machine)
  - [x] Multi-GPUs synchronized BatchNormalization
@@ -197,7 +196,7 @@ wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/
  - Download [The SVHN dataset](http://ufldl.stanford.edu/housenumbers/) (*Format 2: Cropped Digits*),
  put them in the `dataset/svhn` directory.

- - `cd` to `github` directory and clone the `SplitNet-Divide-and-Co-training` repo.
+ - `cd` to `github` directory and clone the `Divide-and-Co-training` repo.
  For brevity, rename it as `splitnet`.

@@ -291,9 +290,9 @@ Then run
  ## Citations

293292```
- @misc{2020_SplitNet,
+ @misc{2020_splitnet,
    author = {Shuai Zhao and Liguang Zhou and Wenxiao Wang and Deng Cai and Tin Lun Lam and Yangsheng Xu},
- title = {SplitNet: Divide and Co-training},
+ title = {Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training},
    howpublished = {arXiv},
    year = {2020}
  }