Yaoming Cai, Zijia Zhang, Pedram Ghamisi, Behnood Rasti, Xiaobo Liu, and Zhihua Cai
Given the increasing diversity of available remote sensing (RS) data sources, multimodal-fusion land-cover classification has emerged as a promising direction in the Earth observation community. However, modern supervised multimodal deep learning methods rely heavily on large amounts of human-annotated training data. To address this issue, we propose a novel unsupervised method for multimodal RS land-cover clustering: Transformer-based Multimodal Prototypical Contrastive Clustering (TMPCC). It rests on three core designs. First, we design a multimodal Transformer that learns a shared representation space through adaptive interactions within and between modalities. Second, we introduce an online clustering mechanism based on unified prototype learning that scales to large multimodal datasets. Third, we employ a self-supervised training strategy that combines an instance contrastive loss with a clustering loss to enable efficient and effective model training. Together, these three designs allow training an end-to-end online clustering network that achieves state-of-the-art performance in multimodal RS data clustering, e.g., the highest clustering accuracy (92.28%) among existing methods on the hyperspectral-LiDAR Trento dataset. Our method scales well in both the number of modalities and the number of samples, and it generalizes to out-of-sample data. The code is available on GitHub.
Information Sciences, article 119655, 6 September 2023.
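To make the training strategy described in the abstract concrete, the following is a minimal NumPy sketch (not the authors' code) of how an instance contrastive (InfoNCE-style) loss can be combined with a prototype-based clustering loss. All function names, shapes, and the hard pseudo-label choice are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def instance_contrastive_loss(z1, z2, tau=0.1):
    """InfoNCE between two views of a batch: matching rows are positives."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / tau                     # (N, N) cosine similarities
    idx = np.arange(len(z1))
    log_probs = np.log(softmax(logits, axis=1))
    return -log_probs[idx, idx].mean()           # pull diagonal pairs together

def prototype_clustering_loss(z, prototypes, tau=0.1):
    """Cross-entropy against assignments to a set of cluster prototypes."""
    z, prototypes = l2_normalize(z), l2_normalize(prototypes)
    logits = z @ prototypes.T / tau              # (N, K) sample-prototype scores
    p = softmax(logits, axis=1)
    # Hard pseudo-labels from the current soft assignments (illustrative choice).
    targets = np.eye(prototypes.shape[0])[p.argmax(axis=1)]
    return -(targets * np.log(p + 1e-12)).sum(axis=1).mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))                    # view 1: fused embeddings
z2 = z1 + 0.05 * rng.normal(size=(8, 16))        # view 2: perturbed copy
prototypes = rng.normal(size=(4, 16))            # K = 4 learnable prototypes
total = instance_contrastive_loss(z1, z2) + prototype_clustering_loss(z1, prototypes)
```

In a full pipeline, `total` would be minimized jointly over the encoder and the prototypes; here both loss terms are non-negative cross-entropies, so the sketch only demonstrates their computation, not the end-to-end training.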