Tensor Low-Rank Reconstruction for
Semantic Segmentation
Abstract
Context information plays an indispensable role in the success of semantic segmentation. Recently, non-local self-attention based methods have proved effective for collecting context information. Since the desired context consists of spatial-wise and channel-wise attentions, a 3D representation is an appropriate formulation. However, these non-local methods describe 3D context information based on a 2D similarity matrix, where space compression may lead to missing channel-wise attention. An alternative is to model the contextual information directly without compression. However, this effort confronts a fundamental difficulty, namely the high-rank property of context information. In this paper, we propose a new approach to model the 3D context representations, which not only avoids the space compression but also tackles the high-rank difficulty. Inspired by tensor canonical-polyadic decomposition theory (i.e., a high-rank tensor can be expressed as a combination of rank-1 tensors), we design a low-rank-to-high-rank context reconstruction framework (i.e., RecoNet). Specifically, we first introduce the tensor generation module (TGM), which generates a number of rank-1 tensors to capture fragments of context features. Then we use these rank-1 tensors to recover the high-rank context features through our proposed tensor reconstruction module (TRM). Extensive experiments show that our method achieves state-of-the-art results on various public datasets. Additionally, our proposed method has more than 100 times lower computational cost than conventional non-local-based methods.
Keywords:
Semantic Segmentation, Low-Rank Reconstruction, Tensor Decomposition
1 Introduction
Semantic segmentation aims to assign pixel-wise predictions for a given image, which is a challenging task requiring fine-grained shape, texture and category recognition. The pioneering work, fully convolutional networks (FCN) [31], explores the effectiveness of deep convolutional networks in the segmentation task. Recently, more work has achieved great progress by exploring contextual information [33, 1, 25, 4, 51, 5, 37], in which non-local based methods are the recent mainstream [52, 49, 16].
These methods model the context representation by rating the element-wise importance for contextual tensors. However, the context features obtained in this way lack channel-wise attention, which is a key component of context. Specifically, for a typical non-local block, the 2D similarity map is generated by the matrix multiplication of two inputs with dimensions of $HW \times C$ and $C \times HW$, respectively. Note that the channel dimension $C$ is eliminated during the multiplication, which implies that only the spatial-wise attention is represented while the channel-wise attention is compressed. Therefore, these non-local based methods can collect fine-grained spatial context features but may sacrifice channel-wise context attention.
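The loss of the channel dimension can be seen from the shapes alone. The following is a minimal NumPy sketch, not the authors' implementation: the sizes are hypothetical, and the learned projections and softmax of a real non-local block are omitted.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
C, H, W = 8, 4, 5
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, H, W))

# Flatten the spatial dimensions: query is (HW, C), key is (C, HW).
query = feat.reshape(C, H * W).T
key = feat.reshape(C, H * W)

# The matrix product sums over the channel axis, so C no longer
# appears in the resulting similarity map.
similarity = query @ key
print(similarity.shape)  # (20, 20)
```

The resulting map only relates spatial positions to each other; no axis of it indexes channels.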
An intuitive idea for tackling this issue is to construct the context directly instead of using the 2D similarity map. Unfortunately, this approach confronts a fundamental difficulty because of the high-rank property of context features [49]. That is, the context tensor should be high-rank to have enough capacity, since contexts vary from image to image and this large diversity cannot be well represented by very limited parameters.
Inspired by tensor canonical-polyadic decomposition theory [20], i.e., a high-rank tensor can be expressed as a combination of rank-1 tensors, we propose a new approach for modeling high-rank contextual information in a progressive manner without channel-wise space compression. We show the workflow of non-local networks and RecoNet in Fig. 1. The basic idea is to first use a series of low-rank tensors to collect fragments of context features and then build them up to reconstruct fine-grained context features. Specifically, our proposed framework consists of two key components, the rank-1 tensor generation module (TGM) and the high-rank tensor reconstruction module (TRM). Here, TGM aims to generate the rank-1 tensors in the channel, height and width directions, which explore the context features in different views with low-rank constraints. TRM adopts tensor canonical-polyadic (CP) reconstruction to reconstruct the high-rank attention map, in which the co-occurrence contextual information is mined based on the rank-1 tensors from different views. The cooperation of these two components leads to effective and efficient high-rank context modeling.
We tested our method on five public datasets. In these experiments, the proposed method consistently achieves state-of-the-art results; in particular, RecoNet reaches the top-1 performance on PASCAL-VOC12 [13]. Furthermore, by incorporating the simple and clean low-rank features, our whole model has lower computation consumption (more than 100 times lower than non-local) compared to other non-local based context modeling methods.
The contributions of this work mainly lie in three aspects:

Our studies reveal a new path to context modeling, namely, context reconstruction from low-rank to high-rank in a progressive way.

We develop a new semantic segmentation framework, RecoNet, which explores the contextual information through tensor CP reconstruction. It not only keeps both spatial-wise and channel-wise attentions, but also deals with the high-rank difficulty.

We conduct extensive experiments to compare the proposed method with others on various public datasets, where it yields notable performance gains. Furthermore, RecoNet also has a lower computation cost, i.e., more than 100 times smaller than non-local based methods.
2 Related Work
Tensor Low-rank Representation. According to tensor decomposition theory [20], a tensor can be represented by a linear combination of a series of low-rank tensors. The reconstruction results of these low-rank tensors are the principal components of the original tensor. Therefore, tensor low-rank representation is widely used in computer vision tasks such as convolution speed-up [21] and model compression [46]. There are two tensor decomposition methods: Tucker decomposition and CP decomposition [20]. For the Tucker decomposition, the tensor is decomposed into a set of matrices and one core tensor. If the core tensor is diagonal, then Tucker decomposition reduces to CP decomposition. For the CP decomposition, the tensor is represented by a set of rank-1 tensors (vectors). In this paper, we apply this theory for reconstruction, namely reconstructing the high-rank contextual tensor from a set of rank-1 context fragments.
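The relation between the two decompositions can be checked numerically. The sketch below is illustrative only: the factor matrices `U_c`, `U_h`, `U_w` and the weights `lam` are hypothetical random values. It builds a Tucker reconstruction with a superdiagonal core and compares it with the corresponding CP sum of rank-1 tensors.

```python
import numpy as np

rng = np.random.default_rng(0)
r, C, H, W = 3, 4, 5, 6          # hypothetical sizes
U_c = rng.standard_normal((C, r))
U_h = rng.standard_normal((H, r))
U_w = rng.standard_normal((W, r))
lam = rng.standard_normal(r)

# Tucker form: an (r, r, r) core contracted with one factor matrix per mode.
core = np.zeros((r, r, r))
idx = np.arange(r)
core[idx, idx, idx] = lam        # only the superdiagonal is non-zero
tucker = np.einsum("abc,ia,jb,kc->ijk", core, U_c, U_h, U_w)

# CP form: a weighted sum of r rank-1 tensors built from matching columns.
cp = np.einsum("a,ia,ja,ka->ijk", lam, U_c, U_h, U_w)

same = np.allclose(tucker, cp)
print(same)
```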
Self-Attention in Computer Vision. Self-attention was first proposed in natural language processing (NLP) [38, 8, 44, 10]. It serves as a global encoding method that can merge long-distance features. This property is also important to computer vision tasks. Hu et al. propose SENet [18], exploiting channel information for better image classification through channel-wise attention. Woo et al. propose CBAM [40], which combines channel-wise attention and spatial-wise attention to capture rich features in CNNs. Wang et al. propose the non-local neural network [39]. It captures long-range dependencies of a feature map, which breaks the receptive field limitation of the convolution kernel.
Context Aggregation in Semantic Segmentation. Context information is important for semantic segmentation, and many researchers have explored context aggregation. The initial context harvesting method is to increase receptive fields, as in FCN [31], which merges features of different scales. Then feature pyramid methods [4, 51, 5] were proposed for better context collection. Although feature pyramids collect rich context information, the contexts are not gathered adaptively. In other words, the importance of each element in the contextual tensor is not discriminated. Self-attention-based methods were thus proposed to overcome this problem, such as EncNet [48], PSANet [52], APCNet [16], and CFNet [49]. Researchers have also proposed efficient self-attention methods such as EMANet [22], CCNet [19] and $A^2$-Net [6], which have lower computation consumption and GPU memory occupation. However, most of these methods suffer from channel-wise space compression due to the 2D similarity map. Compared to these works, our method differs essentially in that it uses 3D low-rank tensor reconstruction to capture long-range dependencies without sacrificing channel-wise attention.
3 Methodology
3.1 Overview
The semantic information prediction from an image is closely related to the context information. Due to the large variety of contexts, a high-rank tensor is required for the context feature representation. However, under this constraint, modeling the context features directly incurs a huge cost. Inspired by the CP decomposition theory, although the context prediction is a high-rank problem, we can separate it into a series of low-rank problems, which are easier to deal with. Specifically, we do not predict the context feature directly; instead, we generate its fragments. Then we build up a complete context feature using these fragments. The low-rank to high-rank reconstruction strategy not only maintains the 3D representation (for both channel-wise and spatial-wise attention), but also tackles the high-rank difficulty.
The pipeline of our model is shown in Fig. 2. It consists of the low-rank tensor generation module (TGM), the high-rank tensor reconstruction module (TRM), and the global pooling module (GPM), which harvests global context in both spatial and channel dimensions. We upsample the model output using bilinear interpolation before semantic label prediction.
In our implementation, multiple low-rank perceptrons are used to deal with the high-rank problem, by which we learn parts of the context information (i.e., context fragments). We then build the high-rank tensor via tensor reconstruction theory [20].
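Formulation: Assume we have $3r$ vectors in the C/H/W directions, $v_{ci} \in \mathbb{R}^{C \times 1 \times 1}$, $v_{hi} \in \mathbb{R}^{1 \times H \times 1}$ and $v_{wi} \in \mathbb{R}^{1 \times 1 \times W}$, where $i \in \{1, \dots, r\}$ and $r$ is the tensor rank. These vectors are the CP-decomposed fragments of $A \in \mathbb{R}^{C \times H \times W}$; the tensor CP rank-$r$ reconstruction is then defined as:
$$A = \sum_{i=1}^{r} \lambda_i \, v_{ci} \otimes v_{hi} \otimes v_{wi} \qquad (1)$$
where $\lambda_i$ is a scaling factor.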
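As an illustration of the CP reconstruction, the following NumPy sketch sums $r$ scaled rank-1 outer products into a $C \times H \times W$ tensor. The function name `cp_reconstruct`, the array layout (one fragment per row) and the random inputs are assumptions made for this example, not part of the paper's implementation.

```python
import numpy as np

def cp_reconstruct(v_c, v_h, v_w, lam):
    """Weighted sum of r rank-1 tensors.

    v_c: (r, C), v_h: (r, H), v_w: (r, W), lam: (r,). Returns (C, H, W).
    """
    return np.einsum("i,ic,ih,iw->chw", lam, v_c, v_h, v_w)

rng = np.random.default_rng(0)
r, C, H, W = 4, 6, 5, 3                     # hypothetical sizes
v_c = rng.random((r, C))
v_h = rng.random((r, H))
v_w = rng.random((r, W))
lam = np.full(r, 1.0 / r)                   # stand-in scaling factors

A = cp_reconstruct(v_c, v_h, v_w, lam)

# Same result written as an explicit loop over the r outer products.
A_loop = sum(
    lam[i] * v_c[i][:, None, None] * v_h[i][None, :, None] * v_w[i][None, None, :]
    for i in range(r)
)
print(A.shape, np.allclose(A, A_loop))
```

Only $r(C + H + W)$ fragment values and $r$ scaling factors are needed to describe the full $C \times H \times W$ tensor.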
3.2 Tensor Generation Module
In this section, we first provide some basic definitions and then show how to derive the low-rank tensors from the proposed module.
Context Fragments. We define context fragments as the outputs of the tensor generation module, namely the rank-1 vectors $v_{ci} \in \mathbb{R}^{C \times 1 \times 1}$, $v_{hi} \in \mathbb{R}^{1 \times H \times 1}$ and $v_{wi} \in \mathbb{R}^{1 \times 1 \times W}$ (as defined in the previous part) in the channel, height and width directions. Every context fragment contains a part of the context information.
Feature Generator. We define three feature generators: Channel Generator, Height Generator and Width Generator. Each generator is composed of a Pool-Conv-Sigmoid sequence. Global pooling is widely used in previous works [29, 51] as the global context harvesting method. Similarly, here we use global average pooling in the feature generators, obtaining the global context representation in the C/H/W directions.
Context Fragments Generation. In order to learn fragments of context information across the three directions, we apply the channel, height and width generators on top of the input feature. We repeat this process $r$ times, obtaining $3r$ learnable vectors $v_{ci}$, $v_{hi}$ and $v_{wi}$, where $i \in \{1, \dots, r\}$. All vectors are generated using independent convolution kernels. Each of them learns a part of the context information and is output as a context fragment. The TGM is shown in Fig. 3.
Non-linearity in TGM. Recall that TGM generates $3r$ rank-1 tensors and that these tensors are activated by the Sigmoid function, which rescales the values in the context fragments to [0, 1]. We add the non-linearity for two reasons. Firstly, each rescaled element can be regarded as the weight of a certain kind of context feature, which satisfies the definition of attention. Secondly, the context fragments should not be linearly dependent, so that each of them can represent different information.
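The Pool-Conv-Sigmoid sequence can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the helper `generate_fragments` is hypothetical, the weights are random rather than learned, and the convolution applied to a globally pooled vector is written as a matrix-vector product with one independent kernel per fragment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_fragments(x, weights, axis):
    """Pool-Conv-Sigmoid along one axis of x with shape (C, H, W).

    weights: (r, D, D) with D = x.shape[axis], one kernel per fragment.
    Returns r fragments of length D with values in (0, 1).
    """
    other_axes = tuple(a for a in range(3) if a != axis)
    pooled = x.mean(axis=other_axes)      # global average pooling -> (D,)
    return sigmoid(weights @ pooled)      # (r, D)

rng = np.random.default_rng(0)
r, C, H, W = 4, 8, 6, 5                   # hypothetical sizes
x = rng.standard_normal((C, H, W))        # stand-in input feature

v_c = generate_fragments(x, rng.standard_normal((r, C, C)), axis=0)
v_h = generate_fragments(x, rng.standard_normal((r, H, H)), axis=1)
v_w = generate_fragments(x, rng.standard_normal((r, W, W)), axis=2)
print(v_c.shape, v_h.shape, v_w.shape)
```

Because each fragment uses its own kernel and passes through a non-linearity, the $r$ fragments in a given direction are generally not scalar multiples of one another.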
3.3 Tensor Reconstruction Module
In this part, we introduce the context feature reconstruction and aggregation procedure. The entire reconstruction process is clean and simple, and is based on Equation 1. For a better interpretation, we first introduce the context aggregation process.
Context Aggregation. Different from previous works that only collect spatial or channel attention [52, 48], we collect the attention distribution in both directions simultaneously. The goal of TRM is to obtain the 3D attention map $A \in \mathbb{R}^{C \times H \times W}$, which keeps the response in both spatial and channel attention. After that, the context feature is obtained by element-wise product. Specifically, given an input feature $X = \{x_1, x_2, \dots, x_{CHW}\}$ and a context attention map $A = \{a_1, a_2, \dots, a_{CHW}\}$, the fine-grained context feature $Y = \{y_1, y_2, \dots, y_{CHW}\}$ is then given by:
$$Y = A \cdot X \iff y_i = a_i \cdot x_i, \quad i \in \{1, \dots, CHW\} \qquad (2)$$
In this process, every $a_i \in A$ represents the extent to which $x_i \in X$ is activated.
Low-rank Reconstruction. The tensor reconstruction module (TRM) tackles the high-rank property of the context feature. The full workflow of TRM is shown in Fig. 4, which consists of two steps, i.e., sub-attention map aggregation and global context feature reconstruction. Firstly, three context fragments $v_{c1} \in \mathbb{R}^{C \times 1 \times 1}$, $v_{h1} \in \mathbb{R}^{1 \times H \times 1}$ and $v_{w1} \in \mathbb{R}^{1 \times 1 \times W}$ are synthesized into a rank-1 sub-attention map $A_1 = v_{c1} \otimes v_{h1} \otimes v_{w1}$. This sub-attention map represents a part of the 3D context feature, and we show the visualization of some $A_i$ in the experimental results. Then, the other context fragments are reconstructed following the same process. After that we aggregate these sub-attention maps using a weighted mean:
$$A = \sum_{i=1}^{r} \lambda_i A_i \qquad (3)$$
Here $r$ is the tensor rank and $\lambda_i$ is a learnable normalization factor. Although each sub-attention map represents low-rank context information, their combination becomes a high-rank tensor. The fine-grained context features in both spatial and channel dimensions are obtained after Equation 3 and Equation 2.
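The two TRM steps and the subsequent element-wise product can be sketched in a few lines of NumPy. The fragments, weights and input below are random stand-ins, and the rank of the channel unfolding (the tensor reshaped to $C \times HW$) is used as a convenient proxy for how the aggregation raises the rank above that of a single sub-attention map.

```python
import numpy as np

rng = np.random.default_rng(0)
r, C, H, W = 4, 8, 6, 5                   # hypothetical sizes
v_c = rng.random((r, C))                  # stand-in context fragments
v_h = rng.random((r, H))
v_w = rng.random((r, W))
lam = rng.random(r)                       # stand-in for the learnable factors
x = rng.standard_normal((C, H, W))        # stand-in input feature

# Step 1: each triple of fragments gives one rank-1 sub-attention map.
sub_maps = np.einsum("ic,ih,iw->ichw", v_c, v_h, v_w)   # (r, C, H, W)

# Step 2: weighted aggregation of the sub-attention maps.
attention = np.tensordot(lam, sub_maps, axes=1)          # (C, H, W)

# The context feature is the element-wise product with the input.
context = attention * x

# Rank of the channel unfolding before and after aggregation.
rank_single = np.linalg.matrix_rank(sub_maps[0].reshape(C, H * W))
rank_sum = np.linalg.matrix_rank(attention.reshape(C, H * W))
print(rank_single, rank_sum)
```

The unfolding rank of the aggregated map is bounded above by $r$, which is one way to see why the tensor rank controls the capacity of the attention map.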
3.4 Global Pooling Module
The global pooling module (GPM) is commonly used in previous work [51, 49]. It is composed of a global average pooling operation followed by a $1 \times 1$ convolution. It harvests global context in both spatial and channel dimensions. In our proposed model, we apply GPM to further boost network performance.
3.5 Network Details
We use ResNet [17] as our backbone and apply the dilation strategy to the outputs of its Res4 and Res5 blocks. The output stride of our proposed network is thus 8. The output feature of the Res5 block is denoted as $X$. TGM+TRM and GPM are then added on top of $X$. Following previous works [48, 51], we also use an auxiliary loss after the Res4 block. We set its weight $\alpha$ to 0.2. The total loss $L$ is formulated as follows:
$$L = L_{main} + \alpha L_{aux} \qquad (4)$$
where $L_{main}$ is the loss on the final prediction and $L_{aux}$ is the auxiliary loss after the Res4 block.
Finally, we concatenate $X$ with the context feature map generated by TGM+TRM and the global context generated by GPM to make the final prediction.
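A minimal sketch of the loss weighting and the final concatenation is given below. It assumes the total loss is the main loss plus the auxiliary loss scaled by 0.2, and that the pooled global context is broadcast back to the spatial size before concatenation along the channel axis; the function `total_loss` and all tensors are hypothetical stand-ins, and the convolution inside GPM is omitted.

```python
import numpy as np

def total_loss(main_loss, aux_loss, aux_weight=0.2):
    """Main segmentation loss plus the down-weighted auxiliary loss."""
    return main_loss + aux_weight * aux_loss

C, H, W = 8, 6, 5                              # hypothetical sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))             # stand-in for the Res5 output
context = rng.random((C, H, W)) * x            # stand-in for the TGM+TRM output

# GPM pooling step; broadcasting restores the spatial size (an assumption).
global_vec = x.mean(axis=(1, 2))
global_ctx = np.broadcast_to(global_vec[:, None, None], (C, H, W))

# Concatenate the three feature maps along the channel axis.
fused = np.concatenate([x, context, global_ctx], axis=0)
print(fused.shape, total_loss(1.0, 0.5))
```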
3.6 Relation to Previous Approaches
Compared with non-local and its variants, which explore the pairwise relationship between pixels, the proposed method is essentially unary attention. Unary attention has been widely used in image classification, as in SENet [18] and CBAM [40]. It is also broadly adopted in semantic segmentation, as in DFN [45] and EncNet [48]. SENet is the simplest form of RecoNet. The 3D attention map of SENet, $A_{SE} \in \mathbb{R}^{C \times H \times W}$, is given by Formula (5):
$$A_{SE} = v_c \otimes v_h \otimes v_w \qquad (5)$$
RecoNet degenerates to SENet by setting the tensor rank $r = 1$. Meanwhile, $v_h = [1, 1, \dots, 1]^T \in \mathbb{R}^{H}$ and $v_w = [1, 1, \dots, 1]^T \in \mathbb{R}^{W}$. From Formula (5), it is observed that the weights in the H and W directions are the same, which implies that SENet only harvests channel attention while setting the same weights in the spatial domain. EncNet [48] is the updated version of SENet, which also uses the same spatial weights. Different spatial weights are introduced in CBAM, which extends Formula (5) to Equation 6.
$$A_{CBAM} = v_c \otimes A_s \qquad (6)$$
Here $A_{CBAM} \in \mathbb{R}^{C \times H \times W}$ is the 3D attention map of CBAM and $A_s \in \mathbb{R}^{H \times W}$ is its spatial attention. The spatial attention is considered in CBAM. However, a single rank-1 tensor cannot represent complicated context information. Consider an extreme case in which the spatial attention $A_s$ is CP-decomposed into 2 rank-1 tensors $v_h \in \mathbb{R}^{H}$ and $v_w \in \mathbb{R}^{W}$. Then $A_{CBAM}$ becomes a sub-attention map of RecoNet.
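Both special cases can be verified numerically. In the sketch below (random stand-in vectors, hypothetical sizes), an SENet-style map uses all-ones height and width vectors, so every spatial position shares the same channel weights, while a CBAM-style map whose spatial attention happens to be rank-1 coincides with a single rank-1 sub-attention map.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5                          # hypothetical sizes
v_c = rng.random(C)

# SENet-style attention: rank 1 with all-ones height and width vectors.
a_se = np.einsum("c,h,w->chw", v_c, np.ones(H), np.ones(W))
uniform_in_space = np.allclose(a_se, v_c[:, None, None])

# CBAM-style attention with a spatial map that is itself rank-1.
v_h, v_w = rng.random(H), rng.random(W)
spatial = np.outer(v_h, v_w)               # (H, W)
a_cbam = v_c[:, None, None] * spatial[None, :, :]

# The same tensor written as one rank-1 sub-attention map.
a_sub = np.einsum("c,h,w->chw", v_c, v_h, v_w)
print(uniform_in_space, np.allclose(a_cbam, a_sub))
```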
Unary attention has the advantage of being simple but effective, but it is also criticized for not being able to represent complicated features or for being able to represent features in only one direction (spatial/channel). RecoNet not only takes the advantage of simplicity and effectiveness from unary attention, but also delivers comprehensive feature representations from multiple views (i.e., the spatial and channel dimensions).
4 Experiments
Many experiments are carried out in this section. We use five datasets, PASCAL-VOC12, PASCAL-Context, COCO-Stuff, ADE20K and SIFT-Flow, to test the performance of RecoNet.
4.1 Implementation Details
RecoNet is implemented using Pytorch [32]. Following previous works [14, 48], synchronized batch normalization is applied. The learning rate scheduler is
We use multi-scale and flip evaluation with input scales of [0.75, 1, 1.25, 1.5, 1.75, 2.0] times the original scale. The evaluation metric we use is mean Intersection-over-Union (mIoU).
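For reference, mIoU can be computed from a confusion matrix as in the following sketch. The helper `mean_iou` and the toy label maps are illustrative; benchmark evaluation accumulates the confusion matrix over the whole dataset, and the multi-scale and flip averaging is not shown here.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union for integer label arrays of equal shape."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # confusion[t, p] counts pixels with ground truth t predicted as p.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    valid = union > 0          # skip classes absent from both pred and target
    return float((intersection[valid] / union[valid]).mean())

pred = [[0, 0, 1], [1, 2, 2]]
target = [[0, 1, 1], [1, 2, 0]]
score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoU is 1/3, 2/3 and 1/2, so the mean is 0.5
```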
4.2 Results on Different Datasets
PascalVoc12.
We first test RecoNet using PASCALVOC12 [13] dataset, a golden benchmark of semantic segmentation, which includes object categories and one background class. The dataset contains , , images for training, validation and testing. Our training set contains images from PASCAL augmentation dataset. The results are shown in Table 1. RecoNet reaches mIoU, surpassing current best algorithm using ResNet101 by , which is a large margin.
Class  FCN [31]  PSPNet [51]  EncNet [48]  APCNet [16]  CFNet [49]  DMNet [15]  RecoNet

aero  76.8  91.8  94.1  95.8  95.7  96.1  93.7 
bike  34.2  71.9  69.2  75.8  71.9  77.3  66.3 
bird  68.9  94.7  96.3  84.5  95.0  94.1  95.6 
boat  49.4  71.2  76.7  76.0  76.3  72.8  72.8 
bottle  60.3  75.8  86.2  80.6  82.8  78.1  87.4 
bus  75.3  95.2  96.3  96.9  94.8  97.1  94.5 
car  74.7  89.9  90.7  90.0  90.0  92.7  92.6 
cat  77.6  95.9  94.2  96.0  95.9  96.4  96.5 
chair  21.4  39.3  38.8  42.0  37.1  39.8  48.4 
cow  62.5  90.7  90.7  93.7  92.6  91.4  94.5 
table  46.8  71.7  73.3  75.4  73.0  75.5  76.6 
dog  71.8  90.5  90.0  91.6  93.4  92.7  94.4 
horse  63.9  94.5  92.5  95.0  94.6  95.8  95.9 
mbike  76.5  88.8  88.8  90.5  89.6  91.0  93.8 
person  73.9  89.6  87.9  89.3  88.4  90.3  90.4 
plant  45.2  72.8  68.7  75.8  74.9  76.6  78.1 
sheep  72.4  89.6  92.6  92.8  95.2  94.1  93.6 
sofa  37.4  64  59.0  61.9  63.2  62.1  63.4 
train  70.9  85.1  86.4  88.9  89.7  85.5  88.6 
tv  55.1  76.3  73.4  79.6  78.2  77.6  83.1 
mIoU  62.2  82.6  82.9  84.2  84.2  84.4  85.6 
Method  Backbone  mIoU 

CRFRNN [53]  -  74.7
DPN [30]  -  77.5
Piecewise [26]  -  78.0
ResNet38 [42]  -  84.9
PSPNet [51]  ResNet101  85.4 
DeepLabv3 [4]  ResNet101  85.7 
EncNet [48]  ResNet101  85.9 
DFN [45]  ResNet101  86.2 
CFNet [49]  ResNet101  87.2 
EMANet [22]  ResNet101  87.7 
DeeplabV3+ [5]  Xception  87.8 
DeeplabV3+ [5]  Xception+JFT  89.0 
RecoNet  ResNet101  88.5 
RecoNet  ResNet152  89.0 
Method  Backbone  mIoU 

FCN8s [31]  -  37.8
ParseNet [29]  -  40.4
Piecewise [26]  -  43.3
VeryDeep [41]  -  44.5
DeepLabv2 [3]  ResNet101  45.7 
RefineNet [25]  ResNet152  47.3 
PSPNet [51]  ResNet101  47.8 
MSCI [24]  ResNet152  50.3 
Ding et al. [12]  ResNet101  51.6 
EncNet [48]  ResNet101  51.7 
DANet [14]  ResNet101  52.6 
SVCNet [11]  ResNet101  53.2 
CFNet [49]  ResNet101  54.0 
DMNet [15]  ResNet101  54.4 
RecoNet  ResNet101  54.8 
Method  Backbone  mIoU 

FCN8s [31]  -  22.7
DeepLabv2 [3]  ResNet101  26.9 
RefineNet [25]  ResNet101  33.6 
Ding et al. [12]  ResNet101  35.7 
SVCNet [11]  ResNet101  39.6 
DANet [14]  ResNet101  39.7 
EMANet [22]  ResNet101  39.9 
RecoNet  ResNet101  41.5 
Method  pixel acc.  mIoU 

Sharma et al. [35]  79.6  -
Yang et al. [43]  79.8  -
FCN8s [31]  85.9  41.2 
DAGRNN+CRF [36]  87.8  44.8 
Piecewise [26]  88.1  44.9 
SVCNet [11]  89.1  46.3 
RecoNet  89.6  46.8 
Method  Backbone  mIoU 

RefineNet [25]  ResNet152  40.70 
PSPNet [51]  ResNet101  43.29 
DSSPN [23]  ResNet101  43.68 
SAC [34]  ResNet101  44.30 
EncNet [48]  ResNet101  44.65 
CFNet [49]  ResNet50  42.87 
CFNet [49]  ResNet101  44.89 
CCNet [19]  ResNet101  45.22 
RecoNet  ResNet50  43.40 
RecoNet  ResNet101  45.54 
Following previous work [48, 16, 49, 15, 14], we use a COCO-pretrained model during training. We first train our model on the MS-COCO [27] dataset for 30 epochs, where the initial learning rate is set to . Then the model is fine-tuned on the PASCAL augmentation training set for another 80 epochs. Finally, we fine-tune our model on the original VOC12 train+val set for an extra 50 epochs, with the initial learning rate set to 1e-5. The results in Table 6 show that RecoNet101 outperforms current state-of-the-art algorithms with the same backbone. Moreover, RecoNet also exceeds state-of-the-art methods that use a stronger backbone such as Xception [7]. By applying the ResNet152 backbone, RecoNet reaches 89.0% mIoU without adding extra data. The result is now in 1st place in the PASCAL-VOC12 challenge (http://host.robots.ox.ac.uk:8080/anonymous/PXWAVA.html).
PASCALContext.
[43] is a densely labeled scene parsing dataset includes object and stuff classes plus one background class. It contains images for training and images for testing. Following previous works [49, 48, 16], we evaluate the dataset with background class ( classes in total). The results are shown in Table 6. RecoNet performs better than all previous approaches that use nonlocal block such as CFNet and DANet, which implies that our proposed context modeling method is better than nonlocal block.
COCOStuff.
[2] is a challenging dataset which includes object and stuff categories. The dataset provides images for training and images for testing. The outstanding performance of RecoNet (as shown in Table 6) illustrates that the context tensor we modeled has enough capacity to represent complicated context features.
SIFTFlow.
[28] is a dataset that focuses on urban scene, which consists of images in training set and images for testing. The resolution of images is and semantic classes are annotated with pixellevel labels. The result in Table 6 shows that the proposed RecoNet outperforms previous stateoftheart methods.
Ade20k.
[54] is a large scale scene parsing dataset which contains K images annotated with semantic categories. There are K training images, K validation images and K test images. The experimental results are shown in Table 6. RecoNet shows better performance than nonlocal based methods such as CCNet [19]. The superiority on result means RecoNet can collect richer context information.
4.3 Ablation Study
In this section, we perform thorough ablation experiments to investigate the effect of different components in our method and the effect of different rank numbers. These experiments provide more insight into our proposed method. The experiments are conducted on the PASCAL-VOC12 set, and more ablation studies can be found in the supplementary material.
Different Components. In this part, we design several variants of our model to validate the contributions of different components. The experimental settings are the same as in the previous part. Here we have three main components: the global pooling module (GPM) and the tensor low-rank reconstruction module, which comprises TGM and TRM. For fairness, we fix the tensor rank $r = 64$. The influence of each module is shown in Table 7. According to our experimental results, the tensor low-rank reconstruction module contributes a 9.9% mIoU gain in network performance (68.7% to 78.6%), and the pooling module further improves mIoU by 0.6%. Then we use the auxiliary loss after the Res4 block. We finally get 81.4% mIoU by using GPM and TGM+TRM together. The result shows that the tensor low-rank reconstruction module dominates the overall performance.
Method  TGM+TRM  GPM  Auxloss  MS/Flip  FT  mIoU % 

ResNet50  68.7  
ResNet50  78.6  
ResNet50  79.2  
ResNet50  79.8  
ResNet101  81.4  
ResNet101  82.1  
ResNet101  82.9 
Tensor Rank. The tensor rank $r$ determines the information capacity of our reconstructed attention map. In this experiment, we use ResNet101 as the backbone. We sample $r$ from 16 to 128 to investigate the effect of the tensor rank. An intuitive thought is that the performance would improve as $r$ increases. However, our experimental results in Table 9 illustrate that a larger $r$ does not always lead to better performance, because we apply TGM+TRM on the input feature $X$, which has a maximum tensor rank of 64. An overly large $r$ may increase redundancy and lead to over-fitting, which harms network performance. Therefore, we choose $r = 64$ in our experiments.
Method  Tensor Rank  mIoU % 

RecoNet  16  81.2 
RecoNet  32  81.8 
RecoNet  48  81.4 
RecoNet  64  82.1 
RecoNet  80  81.6 
RecoNet  96  81.0 
RecoNet  128  80.7 
Method  SS  MS/Flip  FLOPs 

ResNet101      190.6G 
DeepLabV3+ [5]  79.45  80.59  +84.1G 
PSPNet [51]  79.20  80.36  +77.5G 
DANet [14]  79.64  80.78  +117.3G 
PSANet [52]  78.71  79.92  +56.3G 
CCNet [19]  79.51  80.77  +65.3G 
EMANet [22]  80.09  81.38  +43.1G 
RecoNet  81.40  82.13  +41.9G 
Comparison with Previous Approaches. In this paper, we use a deep-base ResNet as our backbone. Specifically, we replace the first $7 \times 7$ convolution in ResNet with three consecutive $3 \times 3$ convolutions. This design is widely adopted in semantic segmentation and serves as the backbone network of many prior works [51, 48, 49, 19, 22]. Since the implementation details and backbones vary across algorithms, and in order to compare our method with previous approaches in a fair manner, we implemented several state-of-the-art algorithms (listed in Table 9) based on our ResNet101 backbone and training setting. The results are shown in Table 9. We compare our method with feature pyramid approaches such as PSPNet [51] and DeepLabV3+ [5]. The evaluation results show that our algorithm surpasses these methods not only in mIoU but also in FLOPs. We also compare our method with non-local attention based algorithms such as DANet [14] and PSANet [52]. Notably, our single-scale result outperforms their multi-scale results, which implies the superiority of our method. Additionally, we compare RecoNet with other low-cost non-local methods such as CCNet [19] and EMANet [22], where RecoNet achieves the best performance with relatively small cost.
4.4 Further Discussion
We further design several experiments to show computational complexity of the proposed method, and visualize some subattention maps from the reconstructed context features.
Method  Channel  FLOPs  GPU Memory 

NonLocal [39]  512  19.33G  88.00MB 
APCNet [16]  512  8.98G  193.10MB 
RCCA [19]  512  5.37G  41.33MB 
$A^2$-Net [6]  512  4.30G  25.00MB
AFNB [55]  512  2.62G  25.93MB 
LatentGNN [50]  512  2.58G  44.69MB 
EMAUnit [22]  512  2.42G  24.12MB 
TGM+TRM  512  0.0215G  8.31MB 
Computational Complexity Analysis. Our proposed method is based on low-rank tensors and thus has a large advantage in computational consumption. Recall that the non-local block has a computational complexity of $O(CH^2W^2)$. In the TGM stage, we generate a series of learnable vectors using convolutions. The computational complexity is $O(C^2 + H^2 + W^2)$, while in the TRM stage, we reconstruct the high-rank tensor from these vectors and the complexity is $O(CHW)$ for each rank-1 tensor. Since $CHW \gg C^2 + H^2 + W^2$, the total complexity is $O(rCHW)$, which is much smaller than that of the non-local block. Here $r$ is the tensor rank. Table 10 shows the FLOPs and GPU occupation of TGM+TRM. From the table we can see that the cost of TGM+TRM is negligible compared with other non-local based methods. Our proposed method has about 900 times fewer FLOPs than the non-local block and more than 100 times fewer FLOPs than other non-local-based methods, such as $A^2$-Net [6] and LatentGNN [50]. Besides these methods, we calculate the FLOPs and GPU occupation of RCCA, AFNB and EMAUnit, which are the core components of CCNet [19], AsymmetricNL [55] and EMANet [22], respectively. It can be found that TGM+TRM has the lowest computational overhead.
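The ratios quoted above follow directly from the FLOPs column of the complexity table, and the difference in scaling can be illustrated with rough operation counts. In the sketch below, the two counting functions are simplifying assumptions (a pairwise similarity over all positions versus one outer product of size $C \times H \times W$), not measured FLOPs, and the feature sizes are hypothetical.

```python
# FLOPs ratios taken from the complexity table (values in GFLOPs).
ratio_nonlocal = 19.33 / 0.0215      # Non-Local vs. TGM+TRM, about 899
ratio_latentgnn = 2.58 / 0.0215      # LatentGNN vs. TGM+TRM, about 120

def nonlocal_ops(C, H, W):
    """Rough count for a pairwise (HW x HW) similarity over C channels."""
    return C * (H * W) ** 2

def rank1_ops(C, H, W):
    """Rough count for one rank-1 outer product of size C x H x W."""
    return C * H * W

# Doubling the spatial resolution: how much does each cost grow?
growth_nonlocal = nonlocal_ops(512, 128, 128) / nonlocal_ops(512, 64, 64)
growth_rank1 = rank1_ops(512, 128, 128) / rank1_ops(512, 64, 64)

print(round(ratio_nonlocal), round(ratio_latentgnn))
print(growth_nonlocal, growth_rank1)
```

Under these counts, doubling both spatial dimensions multiplies the pairwise cost by 16 but the per-fragment reconstruction cost only by 4.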
Visualization. In our proposed method, context features are constructed by the linear combination of sub-attention maps $A_i$, i.e., $A = \sum_{i=1}^{r} \lambda_i A_i$. Therefore, we visualize their heat maps to check which parts of the features they activate. We randomly select four sub-attention maps, as shown in Fig. 5. We can see that different sub-attention maps activate different parts of the image. For instance, in the last case, the four attention maps focus on the foreground, the horse, the person, and the background, respectively, which implies that the low-rank attention captures the context fragments and that RecoNet can capture long-range dependencies.
5 Conclusion
In this paper, we propose tensor low-rank reconstruction for context feature prediction, which overcomes the feature compression problem that occurred in previous works. We collect high-rank context information by using low-rank context fragments that are generated by our proposed tensor generation module. Then we use CP reconstruction to build up high-rank context features. We embed the fine-grained context features into our proposed RecoNet. The state-of-the-art performance on different datasets and the advantage in computational consumption show the success of our context collection method.
6 Appendix
6.1 More Experimental Results
We conduct experiments on the Cityscapes dataset [9], which is a well-known scene segmentation dataset that includes 19 semantic classes. It provides 2975/500/1525 images for training, validation and testing. Since the training setting for Cityscapes is very different from the implementation details presented in the main paper, we put the results in the supplementary material.
The input images are cropped to $512 \times 1024$ before input. The batch size we use is 8. Initially, the learning rate . An SGD optimizer with momentum = 0.9 and weight decay = 0.0005 is applied for training. The evaluation metrics and data augmentation strategies we use are the same as in the main paper.
Method  Backbone  mIoU 

PSANet [52]  ResNet101  80.1 
CFNet [49]  ResNet101  79.6 
AsymmetricNL [55]  ResNet101  81.3 
CCNet [19]  ResNet101  81.4 
DANet [14]  ResNet101  81.5 
ACFNet [47]  ResNet101  81.8 
RecoNet  ResNet101  82.3 
For the evaluation on set, we train 40K/100K iterations on set respectively. The testing results are shown in Table 11, which collects current state-of-the-art attention based methods. RecoNet achieves better performance than these approaches. The online hard example mining (OHEM) strategy is not used in our implementation since it is time-consuming. The result is available on the website (https://www.cityscapesdataset.com/anonymousresults/?id=7c7bfabc1026a9fd07b348bfd311c56a57ba0369969f3bd9fd9f036ce49a2934).
In order to validate the consistency of RecoNet, we conduct additional ablation experiments on the Cityscapes dataset. The tensor rank is set to for ablation. In Table 12, it can be found that TGM+TRM contributes a 5.8% mIoU improvement (73.1% to 78.9%), which dominates the other modules. The experimental results show that RecoNet is consistent across different datasets.
Method  TGM+TRM  GPM  Auxloss  MS/Flip  mIoU % 

ResNet50  73.1  
ResNet50  78.9  
ResNet50  79.4  
ResNet50  79.8  
ResNet101  80.5  
ResNet101  81.6 
6.2 More Visualization
Fig. 6 shows some results of RecoNet101 on the PASCAL-VOC12 dataset. The figure shows that RecoNet produces better qualitative results, especially at object boundaries, which also demonstrates the effectiveness of its context modeling.
References
 [1] Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoderdecoder architecture for image segmentation. IEEE TPAMI 39(12), 2481–2495 (2017)
 [2] Caesar, H., Uijlings, J., Ferrari, V.: COCOStuff: Thing and stuff classes in context. In: Proc. CVPR. pp. 1209–1218 (2018)
 [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI 40(4), 834–848 (2018)
 [4] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
 [5] Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoderdecoder with atrous separable convolution for semantic image segmentation. In: Proc. ECCV. pp. 801–818 (2018)
 [6] Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: A^ 2Nets: Double attention networks. In: Proc. NIPS. pp. 352–361 (2018)
 [7] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proc. CVPR. pp. 1251–1258 (2017)
 [8] Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Proc. NIPS. pp. 577–585 (2015)
 [9] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proc. CVPR. pp. 3213–3223 (2016)
 [10] Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T., Hu, G.: Attention-over-attention neural networks for reading comprehension. In: Proc. ACL (2017)
 [11] Ding, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G.: Semantic correlation promoted shape-variant context for segmentation. In: Proc. CVPR. pp. 8885–8894 (2019)
 [12] Ding, H., Jiang, X., Shuai, B., Liu, A.Q., Wang, G.: Context contrasted feature and gated multi-scale aggregation for scene segmentation. In: Proc. CVPR. pp. 2393–2402 (2018)
 [13] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), 303–338 (2010)
 [14] Fu, J., Liu, J., Tian, H., Fang, Z., Lu, H.: Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983 (2018)
 [15] He, J., Deng, Z., Qiao, Y.: Dynamic multi-scale filters for semantic segmentation. In: Proc. ICCV. pp. 3562–3572 (2019)
 [16] He, J., Deng, Z., Zhou, L., Wang, Y., Qiao, Y.: Adaptive pyramid context network for semantic segmentation. In: Proc. CVPR. pp. 7519–7528 (2019)
 [17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. CVPR. pp. 770–778 (2016)
 [18] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proc. CVPR. pp. 7132–7141 (2018)
 [19] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: Proc. ICCV. pp. 603–612 (2019)
 [20] Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Review (SIREV) 51(3) (2009)
 [21] Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In: Proc. ICLR (2015)
 [22] Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: Proc. ICCV. pp. 9167–9176 (2019)
 [23] Liang, X., Xing, E., Zhou, H.: Dynamic-structured semantic propagation network. In: Proc. CVPR. pp. 752–761 (2018)
 [24] Lin, D., Ji, Y., Lischinski, D., Cohen-Or, D., Huang, H.: Multi-scale context intertwining for semantic segmentation. In: Proc. ECCV. pp. 603–619 (2018)
 [25] Lin, G., Milan, A., Shen, C., Reid, I.: RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proc. CVPR. pp. 1925–1934 (2017)
 [26] Lin, G., Shen, C., Van Den Hengel, A., Reid, I.: Efficient piecewise training of deep structured models for semantic segmentation. In: Proc. CVPR. pp. 3194–3203 (2016)
 [27] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proc. ECCV. pp. 740–755 (2014)
 [28] Liu, C., Yuen, J., Torralba, A.: SIFT Flow: Dense correspondence across scenes and its applications. IEEE TPAMI 33(5), 978–994 (2011)
 [29] Liu, W., Rabinovich, A., Berg, A.C.: ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)
 [30] Liu, Z., Li, X., Luo, P., Loy, C.C., Tang, X.: Semantic image segmentation via deep parsing network. In: Proc. ICCV. pp. 1377–1385 (2015)
 [31] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proc. CVPR. pp. 3431–3440 (2015)
 [32] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Workshop (2017)
 [33] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: Proc. MICCAI. pp. 234–241 (2015)
 [34] Rui, Z., Sheng, T., Zhang, Y., Li, J., Yan, S.: Scale-adaptive convolutions for scene parsing. In: Proc. ICCV. pp. 2031–2039 (2017)
 [35] Sharma, A., Tuzel, O., Liu, M.Y.: Recursive context propagation network for semantic scene labeling. In: Proc. NIPS (2014)
 [36] Shuai, B., Zuo, Z., Wang, B., Wang, G.: Scene segmentation with DAG-recurrent neural networks. IEEE TPAMI 40(6), 1480–1493 (2018)
 [37] Sun, R., Zhu, X., Wu, C., Huang, C., Shi, J., Ma, L.: Not all areas are equal: Transfer learning for semantic segmentation via hierarchical region selection. In: Proc. CVPR. pp. 4360–4369 (2019)
 [38] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proc. NIPS. pp. 5998–6008 (2017)
 [39] Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proc. CVPR. pp. 7794–7803 (2018)
 [40] Woo, S., Park, J., Lee, J.Y., So Kweon, I.: CBAM: Convolutional block attention module. In: Proc. ECCV. pp. 3–19 (2018)
 [41] Wu, Z., Shen, C., Hengel, A.v.d.: Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885 (2016)
 [42] Wu, Z., Shen, C., Van Den Hengel, A.: Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition 90, 119–133 (2019)
 [43] Yang, J., Price, B., Cohen, S., Yang, M.H.: Context driven scene parsing with attention to rare classes. In: Proc. CVPR. pp. 3294–3301 (2014)
 [44] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proc. NAACL. pp. 1480–1489 (2016)
 [45] Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N.: Learning a discriminative feature network for semantic segmentation. In: Proc. CVPR. pp. 1857–1866 (2018)
 [46] Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: Proc. CVPR. pp. 7370–7379 (2017)
 [47] Zhang, F., Chen, Y., Li, Z., Hong, Z., Liu, J., Ma, F., Han, J., Ding, E.: ACFNet: Attentional class feature network for semantic segmentation. In: Proc. ICCV (2019)
 [48] Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context encoding for semantic segmentation. In: Proc. CVPR. pp. 7151–7160 (2018)
 [49] Zhang, H., Zhang, H., Wang, C., Xie, J.: Co-occurrent features in semantic segmentation. In: Proc. CVPR. pp. 548–557 (2019)
 [50] Zhang, S., He, X., Yan, S.: LatentGNN: Learning efficient non-local relations for visual recognition. In: Proc. ICML. pp. 7374–7383 (2019)
 [51] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proc. CVPR. pp. 2881–2890 (2017)
 [52] Zhao, H., Zhang, Y., Liu, S., Shi, J., Loy, C.C., Lin, D., Jia, J.: PSANet: Point-wise spatial attention network for scene parsing. In: Proc. ECCV. pp. 267–283 (2018)
 [53] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proc. ICCV. pp. 1529–1537 (2015)
 [54] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proc. CVPR. pp. 633–641 (2017)
 [55] Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X.: Asymmetric non-local neural networks for semantic segmentation. In: Proc. ICCV. pp. 593–602 (2019)