International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 61
Year of Publication: 2025
Authors: Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi
Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi. X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering. International Journal of Computer Applications. 186, 61 (Jan 2025), 51-58. DOI=10.5120/ijca2025924397
In recent years, the vision-and-language community has grown significantly, especially with the advent of large models. Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing that is both unique and difficult in its framing: it demands a holistic understanding of images and language and requires the model to integrate multiple modalities of data to produce accurate responses. The conventional VQA approach processes the entire image to answer a posed question, often missing nuanced contextual information. This work aims to improve VQA systems by incorporating external knowledge and analyzing the resulting performance. The study uses an MMLFT (MultiModal Late Fusion Transformer) with three pre-trained models for textual embeddings (BERT, RoBERTa, and ALBERT) and three pre-trained models for image encoding (ViT, DeiT, and BEiT). Experiments are conducted across combinations of these text and image encoders to assess the impact of incorporating external knowledge into the system. Captions produced by a pre-trained image captioning model, BLIP, serve as the external knowledge, and the investigation focuses on whether this addition improves the model's evaluation metrics. Although much work has been done on improving VQA models by adding external knowledge, this study is believed to be the first to approach the topic from a data-specific point of view, closely analyzing individual data entries and attempting to explain why the results improve or not. A simple but novel way to cheaply generate inferences about an image is also presented, showcasing its potential to assist future VQA tasks. The conclusion is that adding external information contributes to better results, but the mode of knowledge addition must be well constructed.
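The sketch below illustrates the general late-fusion idea the abstract describes: a pre-trained text encoder and a pre-trained image encoder are run separately, their pooled embeddings are concatenated, and a small classifier predicts the answer, while a BLIP-style caption is simply appended to the question as external knowledge. It is a minimal assumption-laden illustration, not the authors' MMLFT implementation; the chosen checkpoints, classifier size, and example caption are placeholders.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer, ViTModel


class LateFusionVQA(nn.Module):
    """Minimal late-fusion VQA head (illustrative, not the paper's MMLFT):
    encode text and image separately, concatenate pooled embeddings,
    and classify over a fixed answer vocabulary."""

    def __init__(self, num_answers: int):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)
        # Classifier width (512) is an arbitrary illustrative choice.
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(input_ids=input_ids,
                                      attention_mask=attention_mask).pooler_output
        image_feat = self.image_encoder(pixel_values=pixel_values).pooler_output
        fused = torch.cat([text_feat, image_feat], dim=-1)  # late fusion by concatenation
        return self.classifier(fused)


# External knowledge injection: pair the question with a caption
# (here a hypothetical BLIP output) so the text encoder sees both.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question = "What is the man holding?"
caption = "a man holding a red umbrella on a rainy street"  # would come from a BLIP captioner
text_inputs = tokenizer(question, caption, return_tensors="pt",
                        truncation=True, padding=True)
```

Encoding the question and caption as a sentence pair is only one way to inject the caption; the paper's experiments vary the text and image encoders (BERT/RoBERTa/ALBERT and ViT/DeiT/BEiT) around the same late-fusion pattern.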