Co-occurrent Features in Semantic Segmentation

Method

Co-occurrent Features: Given an N-channel CNN feature $X=\{x_1,\dots,x_N\}$ and a target feature $x_t$, the probability of a co-occurrent feature $x_c$ is $$ p(x_c|x_t)=\frac{e^{s(x_c,x_t)}}{\sum_{i=1}^N e^{s(x_i,x_t)}},\quad s(x_c,x_t)=u_{x_c}^T v_{x_t},\quad u_{x_c}=\Phi_c(x_c),\quad v_{x_t}=\Phi_t(x_t) $$
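
A minimal NumPy sketch of this softmax may help make the indices concrete. It assumes the N features are stacked as the rows of an array `X` of shape `(N, D)` and that $\Phi_c$ and $\Phi_t$ are plain linear maps (`W_c`, `W_t`); these names, shapes, and the choice of linear maps are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cooccurrence_prob(X, W_c, W_t):
    """Return P with P[t, c] = p(x_c | x_t) for features X of shape (N, D).

    W_c and W_t stand in for Phi_c and Phi_t (assumed linear here).
    """
    U = X @ W_c                             # u_{x_c}, shape (N, E)
    V = X @ W_t                             # v_{x_t}, shape (N, E)
    S = V @ U.T                             # S[t, c] = u_{x_c}^T v_{x_t}
    S = S - S.max(axis=1, keepdims=True)    # stabilise the exponentials
    P = np.exp(S)
    return P / P.sum(axis=1, keepdims=True) # normalise over c for each target t

# Toy inputs (hypothetical sizes).
rng = np.random.default_rng(0)
N, D, E = 6, 8, 4
X = rng.normal(size=(N, D))
W_c = rng.normal(size=(D, E))
W_t = rng.normal(size=(D, E))
P = cooccurrence_prob(X, W_c, W_t)
```

Each row of `P` is a distribution over the N candidate co-occurrent features for one target feature, so every row sums to 1.
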
Modeling co-occurrent features is a high-rank problem that a single softmax cannot capture, so the scene context is added as a contextual prior: $$ p(x_c|x_t)=\sum_{k=1}^K\pi^k\frac{e^{s^k(x_c,x_t)}}{\sum_{i=1}^N e^{s^k(x_i,x_t)}},\quad \pi^k=\frac{\exp(w_k^T\bar v_x)}{\sum_{k'=1}^K\exp(w_{k'}^T\bar v_x)},\quad \bar v_x=\sum_{i=1}^N \frac{v_{x_i}}{N} $$ Here $\pi^k$ is the prior, i.e. the mixture weight of the $k$-th component, $s^k$ is the similarity of the $k$-th component, and $\bar v_x$ captures global information.
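
The mixture can be sketched in NumPy as follows. The note does not say how $s^k$ differs between components, so this sketch assumes each component has its own co-occurrent projection (`W_c[k]`) while the target projection `W_t` is shared; that choice, the linear maps, and all names and shapes are assumptions made for illustration only.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def mixture_cooccurrence_prob(X, W_c, W_t, W_pi):
    """X: (N, D); W_c: (K, D, E); W_t: (D, E); W_pi: (K, E).

    Returns P of shape (N, N) with P[t, c] = p(x_c | x_t).
    """
    V = X @ W_t                                # v_{x_i}, shape (N, E)
    v_bar = V.mean(axis=0)                     # \bar v_x, global context, shape (E,)
    pi = softmax(W_pi @ v_bar)                 # mixture weights pi^k, shape (K,)
    U = np.einsum('nd,kde->kne', X, W_c)       # per-component u^k_{x_c}, (K, N, E)
    S = np.einsum('te,kce->ktc', V, U)         # S[k, t, c] = s^k(x_c, x_t)
    P_k = softmax(S, axis=-1)                  # one softmax per component and target
    return np.einsum('k,ktc->tc', pi, P_k)     # weighted sum over components

# Toy inputs (hypothetical sizes).
rng = np.random.default_rng(0)
N, D, E, K = 5, 8, 4, 3
X = rng.normal(size=(N, D))
W_c = rng.normal(size=(K, D, E))
W_t = rng.normal(size=(D, E))
W_pi = rng.normal(size=(K, E))
P_mix = mixture_cooccurrence_prob(X, W_c, W_t, W_pi)
```

Because the weights $\pi^k$ sum to 1 and each component is itself a softmax, every row of `P_mix` is still a valid distribution.
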
Aggregated Co-occurrent Feature Module: the probability $p$ computed above is used for channel attention: $$ z_t=\sum_{c=1}^N p(x_c|x_t)\phi_c $$
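
With the probabilities arranged as a matrix, the aggregation is a single matrix product. The sketch below assumes `P[t, c]` holds $p(x_c|x_t)$ and that the $\phi_c$ are stacked as rows of `Phi`; the note does not define $\phi_c$, so treating it as a feature vector attached to $x_c$ is an assumption.

```python
import numpy as np

def aggregate(P, Phi):
    """P: (N, N) with P[t, c] = p(x_c | x_t); Phi: (N, D). Returns Z: (N, D)."""
    return P @ Phi          # Z[t] = sum_c P[t, c] * Phi[c]

# Sanity check with uniform probabilities: every z_t becomes the mean of all phi_c.
N, D = 4, 3
Phi = np.arange(N * D, dtype=float).reshape(N, D)
P_uniform = np.full((N, N), 1.0 / N)
Z = aggregate(P_uniform, Phi)
```

In the uniform case all rows of `Z` equal the column-wise mean of `Phi`; with an identity `P`, `Z` would equal `Phi` unchanged.
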
Global Pooling Feature: global pooling followed by upsampling.
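
As a rough sketch of this branch, the code below assumes average pooling over a `(C, H, W)` feature map and upsampling back to the input resolution. Upsampling a 1×1 map yields a constant per channel, so broadcasting is used here; the pooling type and the output size are assumptions, since the note only says "global pooling" and "upsampling".

```python
import numpy as np

def global_pooling_feature(feat):
    """feat: (C, H, W). Returns a (C, H, W) map holding each channel's spatial mean."""
    C, H, W = feat.shape
    pooled = feat.mean(axis=(1, 2), keepdims=True)     # global average pool, (C, 1, 1)
    return np.broadcast_to(pooled, (C, H, W)).copy()   # "upsample" the 1x1 map to H x W

# Toy feature map (hypothetical sizes).
feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
g = global_pooling_feature(feat)
```
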

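This paper proposes a new form of self-attention: the input feature is mapped into two feature spaces, i.e., the target-vector space and the co-occurrent space, and similarities are computed across the two while taking the whole-scene information into account. The features are then aggregated in a non-local fashion.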