Efficient Mixed Transformer for Single Image Super-Resolution

Ling Zheng, Jinchen Zhu, Jinpeng Shi, Shizhuang Weng

Comparison of the trade-off between model performance and complexity on the Urban100 (×4) test set.

Abstract

Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, the lack of a locality mechanism and high complexity limit their application in the field of super-resolution (SR). To solve these problems, we propose a new method, the Efficient Mixed Transformer (EMT), in this study. Specifically, we propose the Mixed Transformer Block (MTB), consisting of multiple consecutive transformer layers, in some of which the Pixel Mixer (PM) replaces Self-Attention (SA). PM enhances local knowledge aggregation through pixel shifting operations, and it introduces no additional complexity because it has no parameters or floating-point operations. Moreover, we employ striped windows for SA (SWSA) to achieve efficient global dependency modelling by exploiting image anisotropy. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance.
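
To make these two components concrete, the PyTorch sketch below illustrates one plausible realisation under stated assumptions, not the authors' exact implementation: PM is assumed to circularly shift four channel groups by one pixel in the four spatial directions (leaving the remaining channels unchanged), and SWSA is represented only by a striped (non-square) window partition helper. The class and function names, the channel grouping, and the stripe size (4, 16) are illustrative choices rather than values taken from the paper.

import torch
import torch.nn as nn


class PixelMixer(nn.Module):
    """Sketch of a parameter-free pixel mixer based on channel-group shifts."""

    def __init__(self, shift: int = 1):
        super().__init__()
        self.shift = shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). Channels are split into five groups; the first four
        # are circularly shifted by `shift` pixels left/right/up/down and the
        # last group is kept unchanged, so each position mixes in information
        # from its neighbours without any learnable weights or FLOPs.
        g = x.size(1) // 5
        s = self.shift
        out = x.clone()
        out[:, 0 * g:1 * g] = torch.roll(x[:, 0 * g:1 * g], shifts=-s, dims=3)  # left
        out[:, 1 * g:2 * g] = torch.roll(x[:, 1 * g:2 * g], shifts=s, dims=3)   # right
        out[:, 2 * g:3 * g] = torch.roll(x[:, 2 * g:3 * g], shifts=-s, dims=2)  # up
        out[:, 3 * g:4 * g] = torch.roll(x[:, 3 * g:4 * g], shifts=s, dims=2)   # down
        return out


def striped_window_partition(x: torch.Tensor, stripe_h: int, stripe_w: int) -> torch.Tensor:
    """Split a feature map into non-square (striped) windows for attention.

    x: (B, C, H, W), with H divisible by stripe_h and W by stripe_w.
    Returns (B * num_windows, stripe_h * stripe_w, C), ready for window SA.
    """
    b, c, h, w = x.shape
    x = x.view(b, c, h // stripe_h, stripe_h, w // stripe_w, stripe_w)
    x = x.permute(0, 2, 4, 3, 5, 1).contiguous()
    return x.view(-1, stripe_h * stripe_w, c)


if __name__ == "__main__":
    feat = torch.randn(1, 40, 64, 64)
    print(PixelMixer()(feat).shape)                        # torch.Size([1, 40, 64, 64])
    print(striped_window_partition(feat, 4, 16).shape)     # torch.Size([64, 64, 40])

A striped rather than square window lets the attention span be longer along one image axis than the other, which is how the abstract's reference to image anisotropy would translate into the partition shape.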

Visual comparison of SR results: HR, SwinIR, NGswin, and EMT (ours).

In img_062 and img_076 of Urban100, most methods fail to recover the grid patterns in the images and generate artifacts of varying degrees, whereas EMT alleviates this problem and produces clearer SR results. In images 108005 and 78004 of BSD100, other methods over-smooth the local edges and significantly blur the high-frequency textures. In the last two subplots, the compared methods struggle to preserve the font structure, while our method reconstructs the fonts more faithfully. These results demonstrate the effectiveness of combining a locality mechanism with global dependency modelling in the SR task.