WetMapFormer: A Unified Deep CNN and Vision Transformer for Complex Wetland Mapping
Ali Jamali, Swalpa Kumar Roy, and Pedram Ghamisi
The Ramsar Convention of 1971 encourages wetland preservation, but it remains unclear how climate change will affect wetland extent and the associated biodiversity. Owing to their self-attention mechanism, vision transformers (ViTs) model global contextual information better and have become a powerful alternative to convolutional neural networks (CNNs). However, ViTs require very large training datasets to reach their full image classification performance, and gathering training samples for remote sensing applications is typically costly. In this study, we therefore develop a deep learning algorithm called WetMapFormer, which combines CNN and vision transformer architectures for precise mapping of wetlands at three pilot sites around Albert County, York County, and Grand Bay-Westfield in New Brunswick, Canada. WetMapFormer uses local window attention (LWA) rather than the conventional self-attention mechanism, which improves feature generalization within a local area while substantially reducing the computational cost of vanilla ViTs. We extensively evaluated the robustness of the proposed WetMapFormer with Sentinel-1 and Sentinel-2 satellite data and compared it with several CNN and vision transformer models: ViT, Swin Transformer, HybridSN, CoAtNet, a multimodel network, and ResNet. The proposed WetMapFormer achieves F1-scores of 0.94, 0.94, 0.96, 0.97, 0.97, 0.97, and 1 for the recognition of aquatic bed, freshwater marsh, shrub wetland, bog, fen, forested wetland, and water, respectively. Compared with other vision transformers, WetMapFormer limits receptive fields while adjusting translational invariance and equivariance characteristics. The code is available on GitHub.
International Journal of Applied Earth Observation and Geoinformation, 120, 103333, 2023-06-01.
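To illustrate the idea behind local window attention mentioned in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head with no learned query, key, or value projections (the feature vectors themselves play all three roles), and the function name `local_window_attention` and its `window` parameter are hypothetical. Each pixel attends only to its window × window neighbourhood, so the cost grows roughly as N·k²·C for N pixels, window size k, and C channels, rather than N²·C for full self-attention.

```python
import numpy as np


def local_window_attention(x, window=3):
    """Single-head attention restricted to a local neighbourhood.

    x: feature map of shape (H, W, C).
    Assumption for illustration: queries, keys and values are the raw
    features (no learned projections, no multi-head split).
    """
    h, w, c = x.shape
    pad = window // 2
    # Zero-pad so border pixels have a full window; padded cells are masked out.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    valid = np.pad(np.ones((h, w), dtype=bool), pad)
    out = np.empty_like(x, dtype=float)
    scale = 1.0 / np.sqrt(c)
    for i in range(h):
        for j in range(w):
            # Keys/values: the window centred on pixel (i, j).
            keys = xp[i:i + window, j:j + window].reshape(-1, c)
            mask = valid[i:i + window, j:j + window].reshape(-1)
            # Scaled dot-product scores between the centre pixel and its window.
            scores = keys @ x[i, j] * scale
            scores = np.where(mask, scores, -np.inf)
            # Numerically stable softmax over the valid window positions.
            scores = scores - scores.max()
            weights = np.exp(scores)
            weights /= weights.sum()
            out[i, j] = weights @ keys
    return out
```

For example, calling `local_window_attention(patch, window=3)` on a patch of shape (H, W, C) returns an array of the same shape in which each output vector is a weighted average of at most nine neighbouring feature vectors.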