Description
When modifying a model and training from scratch, it is often better to use group norm instead of batch norm, because detection models typically only fit small batch sizes on a single GPU.
If you simply take an existing detection model from mmdet and change "BN" to "GN", you will get errors, because the number of features in each layer must be an exact multiple of the "num_groups" attribute.
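For concreteness, this constraint can be reproduced directly with PyTorch's `torch.nn.GroupNorm` (a minimal sketch, independent of mmdet):

```python
import torch.nn as nn

# GroupNorm requires num_channels to be an exact multiple of num_groups.
nn.GroupNorm(num_groups=32, num_channels=64)  # fine: 2 channels per group

try:
    nn.GroupNorm(num_groups=32, num_channels=48)  # 48 is not divisible by 32
except ValueError as ex:
    print(ex)  # "num_channels must be divisible by num_groups"
```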
While it would be best to explicitly set num_groups for each layer, I think there is a simple heuristic that can make prototyping and swapping normalization layers easier. We could allow "num_groups" to be a special value (e.g. "auto"), in which case a reasonable number of groups is chosen automatically for the layer.
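The intended usage would then look something like the following; the `'auto'` value is the proposed (currently hypothetical) extension, and the config fragment is only illustrative:

```python
# Hypothetical config snippet: 'auto' would be resolved per layer inside
# build_norm_layer, so a single norm_cfg could be shared across components
# with different channel counts.
norm_cfg = dict(type='GN', num_groups='auto', requires_grad=True)
model = dict(
    backbone=dict(norm_cfg=norm_cfg),
    neck=dict(norm_cfg=norm_cfg),
)
```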
Consider the current build_norm_layer function:
```python
def build_norm_layer(cfg, num_features, postfix=''):
    """Build normalization layer.

    Args:
        cfg (dict): The norm layer config, which should contain:

            - type (str): Layer type.
            - layer args: Args needed to instantiate a norm layer.
            - requires_grad (bool, optional): Whether stop gradient updates.
        num_features (int): Number of input channels.
        postfix (int | str): The postfix to be appended into norm abbreviation
            to create named layer.

    Returns:
        (str, nn.Module): The first element is the layer name consisting of
            abbreviation and postfix, e.g., bn1, gn. The second element is the
            created norm layer.
    """
    if not isinstance(cfg, dict):
        raise TypeError('cfg must be a dict')
    if 'type' not in cfg:
        raise KeyError('the cfg dict must contain the key "type"')
    cfg_ = cfg.copy()

    layer_type = cfg_.pop('type')
    if layer_type not in NORM_LAYERS:
        raise KeyError(f'Unrecognized norm type {layer_type}')

    norm_layer = NORM_LAYERS.get(layer_type)
    abbr = infer_abbr(norm_layer)

    assert isinstance(postfix, (int, str))
    name = abbr + str(postfix)

    requires_grad = cfg_.pop('requires_grad', True)
    cfg_.setdefault('eps', 1e-5)
    if layer_type != 'GN':
        layer = norm_layer(num_features, **cfg_)
        if layer_type == 'SyncBN':
            layer._specify_ddp_gpu_num(1)
    else:
        assert 'num_groups' in cfg_
        layer = norm_layer(num_channels=num_features, **cfg_)

    for param in layer.parameters():
        param.requires_grad = requires_grad

    return name, layer
```

We could insert code after the `assert 'num_groups' in cfg_` line to allow the setting of "num_groups" to be "auto". Perhaps num_groups could be a dictionary that allows for more specific parameters related to which heuristic you want to choose, but in this case I just did something simple:
Enumerate all group counts that are valid for the layer (i.e. that evenly divide the number of features) and build a list of "info" dictionaries recording, for each candidate, the number of channels per group. I take the square root of the total number of features as the "ideal" value for both the number of groups and the channels per group. Each candidate is then scored by the product of the absolute differences between this ideal and its group count and between this ideal and its channels per group, and the candidate that minimizes this score is chosen (with 1 minus the number of groups as a tiebreaker, so ties go to more groups).
```python
if cfg_['num_groups'] == 'auto':
    valid_num_groups = [
        factor for factor in range(1, num_features)
        if num_features % factor == 0
    ]
    infos = [
        {'ng': ng, 'nc': num_features / ng}
        for ng in valid_num_groups
    ]
    ideal = num_features ** 0.5
    for item in infos:
        item['heuristic'] = abs(ideal - item['ng']) * abs(ideal - item['nc'])
    chosen = sorted(infos, key=lambda x: (x['heuristic'], 1 - x['ng']))[0]
    cfg_['num_groups'] = chosen['ng']
```

There are lots of ways you could automatically choose a setting for num_groups that is reasonable and feasible given num_features, but I found this method to work reasonably well.
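As a rough sanity check of what the heuristic picks, here is the same logic wrapped in a standalone helper (`auto_num_groups` is introduced purely for this illustration):

```python
def auto_num_groups(num_features):
    # Same heuristic as the snippet above, factored into a function.
    valid_num_groups = [
        factor for factor in range(1, num_features)
        if num_features % factor == 0
    ]
    ideal = num_features ** 0.5
    infos = [{'ng': ng, 'nc': num_features / ng} for ng in valid_num_groups]
    for item in infos:
        item['heuristic'] = abs(ideal - item['ng']) * abs(ideal - item['nc'])
    return sorted(infos, key=lambda x: (x['heuristic'], 1 - x['ng']))[0]['ng']


print(auto_num_groups(256))  # 16 -> 16 groups of 16 channels
print(auto_num_groups(64))   # 8  -> 8 groups of 8 channels
```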
If there is interest in a feature like this, I can make the PR.