- Title: Instance Segmentation by Jointly Optimizing Spatial Embeddings and Clustering Bandwidth
- Authors: Davy Neven, Bert De Brabandere, Marc Proesmans, Luc Van Gool
- Link: http://openaccess.thecvf.com/content_CVPR_2019/papers/Neven_Instance_Segmentation_by_Jointly_Optimizing_Spatial_Embeddings_and_Clustering_Bandwidth_CVPR_2019_paper.pdf
- Tags: Neural Network, Computer Vision, Instance Segmentation
- Year: 2019
- What
- Authors formulated a new loss function for proposal-free instance segmentation, suited for real-time performance, which directly optimizes the intersection-over-union of each instance by pulling pixels into an optimal, object-specific clustering region.
- How
- Instead of clustering pixels around their instance centroid with a fixed margin (which handles objects of very different sizes poorly), authors proposed to learn an instance-specific margin.
- In order to do so, authors proposed to use a Gaussian function for each instance, which converts the distance between a (spatial) pixel embedding e_i = x_i + o_i (the pixel coordinate plus its predicted offset) and the instance centroid C_k into a probability of belonging to that instance: phi_k(e_i) = exp(-||e_i - C_k||^2 / (2 * sigma_k^2)), where sigma_k is the learned, instance-specific bandwidth (see the sketch below).
- As opposed to using the standard cross-entropy loss function, authors opted for the Lovasz-hinge loss instead. Since this loss function is a (piecewise linear) convex surrogate to the Jaccard loss, it directly optimizes the intersection-over-union of each instance.
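A minimal sketch (PyTorch; tensor shapes, variable names, and the example values are my assumptions, not the authors' code) of how the per-instance Gaussian turns spatial embeddings into foreground probabilities, which the Lovasz-hinge loss then compares against the ground-truth mask:

```python
import torch

def instance_probability(embeddings, center, sigma):
    """phi_k(e_i) = exp(-||e_i - C_k||^2 / (2 * sigma_k^2)).

    embeddings: (2, H, W) spatial embeddings e_i = x_i + o_i
    center:     (2, 1, 1) instance centroid C_k
    sigma:      scalar tensor, learned instance-specific bandwidth
    Returns an (H, W) map of probabilities of belonging to instance k."""
    dist_sq = ((embeddings - center) ** 2).sum(dim=0)
    return torch.exp(-dist_sq / (2 * sigma ** 2))

# Toy usage with made-up shapes and values.
H, W = 64, 64
coords = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")).float()
offsets = 0.1 * torch.randn(2, H, W)        # predicted offsets o_i (network output)
embeddings = coords + offsets               # e_i = x_i + o_i
sigma = torch.tensor(5.0)                   # learned bandwidth (assumed value)

gt_mask = torch.zeros(H, W)
gt_mask[20:40, 20:40] = 1.0                 # ground-truth pixels of one instance

# Centroid taken as the mean embedding over the instance's pixels.
center = embeddings[:, gt_mask.bool()].mean(dim=1).view(2, 1, 1)
probs = instance_probability(embeddings, center, sigma)

# probs > 0.5 defines the instance-specific clustering region; during training
# these probabilities and gt_mask are compared with the Lovasz-hinge loss
# instead of cross-entropy.
```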
- Authors trained the seed map with a regression loss function: background pixels are regressed to zero and foreground pixels are regressed to the output of the Gaussian, so that a high seed value marks a pixel close to its instance center. A separate seed map is trained for each semantic class.
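A short sketch of the seed-map regression as described above (function and variable names are assumptions): background pixels are pulled toward 0 and foreground pixels toward their own instance's Gaussian value.

```python
import torch
import torch.nn.functional as F

def seed_loss(seed_map, gaussian_probs, foreground_mask):
    """Regression loss for a single class's seed map.

    seed_map:        (H, W) predicted seed values for this class
    gaussian_probs:  (H, W) phi_k(e_i) of each foreground pixel's own instance
    foreground_mask: (H, W) 1 where the pixel belongs to this class, else 0"""
    # Background target is 0; foreground target is the (detached) Gaussian output.
    target = torch.where(foreground_mask.bool(),
                         gaussian_probs.detach(),
                         torch.zeros_like(gaussian_probs))
    return F.mse_loss(seed_map, target)
```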
- As the base architecture, ERFNet was used: a dense-prediction encoder-decoder network optimized for real-time semantic segmentation. Authors converted the model into a 2-branch network by sharing the encoder part and having 2 separate decoders. The first branch predicts the offset and sigma values, with 3 or 4 output channels depending on whether sigma is a scalar or a 2-dimensional (per-axis) bandwidth. The other branch outputs N seed maps, one for each semantic class.
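A structural sketch of the 2-branch layout (the real ERFNet encoder and decoders are replaced by placeholder convolutions; only the output channel counts follow the text above):

```python
import torch.nn as nn

class TwoBranchNet(nn.Module):
    """Shared encoder with two decoders: one for offsets + sigma, one for seed maps."""
    def __init__(self, num_classes, sigma_channels=1):
        super().__init__()
        self.encoder = nn.Conv2d(3, 128, 3, padding=1)              # stand-in for the shared ERFNet encoder
        self.inst_decoder = nn.Conv2d(128, 2 + sigma_channels, 1)   # offsets (2) + sigma (1 or 2) -> 3 or 4 channels
        self.seed_decoder = nn.Conv2d(128, num_classes, 1)          # one seed map per semantic class

    def forward(self, x):
        feats = self.encoder(x)
        offsets_sigma = self.inst_decoder(feats)
        seed_maps = self.seed_decoder(feats)
        return offsets_sigma, seed_maps
```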
- Authors first pre-trained their models on 500 x 500 crops, taken out of the original 2048 x 1024 train images and centered around an object, for 200 epochs with a batch size of 12. This way, authors didn't spend too much computation time on background patches without any instances.
- Afterwards, authors fine-tuned the network for another 50 epochs on 1024 x 1024 crops with a batch size of 2 to increase the performance on the bigger objects that couldn't fit completely within the 500 x 500 crop. During this stage, authors kept the batch normalization statistics fixed.
- Authors used the Adam optimizer and polynomial learning rate decay (see the sketch below).
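The notes don't give the exact schedule; a common polynomial decay is lr = base_lr * (1 - epoch / max_epochs)^power, sketched here with an assumed base learning rate and power:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]            # placeholder parameters
optimizer = torch.optim.Adam(params, lr=5e-4)             # base learning rate is an assumption

max_epochs, power = 200, 0.9                               # power = 0.9 is an assumption
poly = lambda epoch: (1.0 - epoch / max_epochs) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for epoch in range(max_epochs):
    # ... run one training epoch, then decay the learning rate ...
    scheduler.step()
```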
- Results