Deniz Engin & Yannis Avrithis

ViTiS

Figure: An overview of the proposed method.

Recent vision-language models are driven by large-scale pretrained models. However, adapting pretrained models to limited data presents challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language. We introduce a parameter-efficient method to address these challenges, combining multimodal prompt learning and a transformer-based mapping network, while keeping the pretrained models frozen. Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency in both zero-shot and few-shot settings. Our code is available online.
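To make the high-level idea concrete, the following PyTorch sketch illustrates the general recipe described in the abstract: the pretrained vision and text backbones stay frozen, learnable prompt vectors are prepended to the text input, and a small trainable transformer maps visual features into the text embedding space. This is a hypothetical illustration, not the authors' implementation; all class names, dimensions, and the stand-in backbones are assumptions made for this sketch.

# Minimal sketch (illustrative only): frozen backbones, learnable prompts,
# and a small trainable mapping network.
import torch
import torch.nn as nn

def frozen(module: nn.Module) -> nn.Module:
    """Freeze all parameters of a pretrained module."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()

class VisualMapper(nn.Module):
    """Trainable transformer mapping video features into the text embedding space."""
    def __init__(self, vis_dim=512, txt_dim=768, layers=2, heads=8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=txt_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, video_feats):                  # (batch, frames, vis_dim)
        return self.encoder(self.proj(video_feats))  # (batch, frames, txt_dim)

class PromptedVideoQA(nn.Module):
    """Frozen vision/text backbones; only the prompts and the mapper are trained."""
    def __init__(self, vision_encoder, text_encoder, vis_dim=512, txt_dim=768, num_prompts=10):
        super().__init__()
        self.vision_encoder = frozen(vision_encoder)
        self.text_encoder = frozen(text_encoder)
        self.mapper = VisualMapper(vis_dim, txt_dim)
        # Learnable prompt tokens prepended to the text-encoder input.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, txt_dim))

    def forward(self, video, question_embeds):       # question_embeds: (batch, seq, txt_dim)
        with torch.no_grad():
            video_feats = self.vision_encoder(video)  # frozen visual features
        visual_tokens = self.mapper(video_feats)
        prompts = self.prompts.expand(question_embeds.size(0), -1, -1)
        tokens = torch.cat([prompts, visual_tokens, question_embeds], dim=1)
        return self.text_encoder(tokens)              # answer head omitted for brevity

# Stand-in frozen backbones, just so the sketch runs end to end.
vision = nn.Linear(2048, 512)   # placeholder for a pretrained video encoder
text = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 8, batch_first=True), 2)
model = PromptedVideoQA(vision, text)
out = model(torch.randn(2, 16, 2048), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 46, 768]): 10 prompts + 16 visual tokens + 20 text tokens

In this setup only the prompt vectors and the mapping network receive gradients, which is what keeps the number of trainable parameters small relative to the frozen backbones.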

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 2804–2810, 2 October 2023.

IARAI Authors: Dr Yannis Avrithis
Research: Algorithms
Keywords: Computer Vision, Few-Shot Learning, Masked Language Modeling, Transformer, Vision-Language Model
