Research Article

X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering

by Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 61
Year of Publication: 2025
Authors: Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi
10.5120/ijca2025924397

Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi. X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering. International Journal of Computer Applications. 186, 61 (Jan 2025), 51-58. DOI=10.5120/ijca2025924397

@article{ 10.5120/ijca2025924397,
author = { Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi },
title = { X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering },
journal = { International Journal of Computer Applications },
issue_date = { Jan 2025 },
volume = { 186 },
number = { 61 },
month = { Jan },
year = { 2025 },
issn = { 0975-8887 },
pages = { 51-58 },
numpages = {8},
url = { https://ijcaonline.org/archives/volume186/number61/x-vl-injecting-external-knowledge-into-vision-language-models-for-better-answering/ },
doi = { 10.5120/ijca2025924397 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Shyam Agarwal
%A Amey Bharat Gohil
%A Smit Nautambhai Modi
%T X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 61
%P 51-58
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In recent years, the vision-and-language community has grown significantly, especially with the advent of large models. Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing that is both unique and difficult in its framing: it demands a holistic understanding of images and language for accurate responses and requires the model to integrate multiple modalities of data. The conventional VQA approach processes the entire image to answer a posed question, often missing nuanced contextual information. This research work aims to improve VQA systems by incorporating external knowledge into the system and analyzing the resulting performance. The study uses a MultiModal Late Fusion Transformer (MMLFT) with three pre-trained models for textual embeddings, BERT, RoBERTa, and ALBERT, and three pre-trained models for image encoding, ViT, DeiT, and BEiT. Experiments are conducted across the possible combinations of these text and image encoders to assess the impact of incorporating external knowledge into the system. Captions from a pre-trained image captioning model, BLIP, serve as the external knowledge, and the investigation focuses on whether this addition improves the model's evaluation metrics. Although much work has been done on improving VQA models by adding external knowledge, this study is believed to be the first to approach the topic from a data-specific point of view, closely analyzing individual data entries and attempting to explain why the results do or do not improve. A simple but novel way to cheaply generate inferences about an image is also presented, showcasing its potential to assist future VQA tasks. The conclusion drawn is that adding external information contributes to better results, but the mode of knowledge addition needs to be well constructed.
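As a concrete illustration of the approach described in the abstract, the following is a minimal Python sketch of a late-fusion VQA model with caption-based knowledge injection, assuming the Hugging Face transformers library. The LateFusionVQA class, its fusion head, the placeholder answer-vocabulary size, and the blip_caption helper are illustrative assumptions, not the paper's MMLFT implementation; BERT and ViT checkpoints stand in for the encoder combinations studied.

# Minimal late-fusion VQA sketch (illustrative; not the paper's exact MMLFT code).
# External knowledge is injected by appending a BLIP-generated caption to the question.
import torch
import torch.nn as nn
from transformers import (AutoTokenizer, AutoModel, AutoImageProcessor,
                          BlipProcessor, BlipForConditionalGeneration)

class LateFusionVQA(nn.Module):
    """Encode the question (plus caption) and the image separately, then fuse late."""

    def __init__(self, text_name="bert-base-uncased",
                 image_name="google/vit-base-patch16-224-in21k",
                 num_answers=1000):  # answer-vocabulary size is a placeholder
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_name)
        self.text_encoder = AutoModel.from_pretrained(text_name)
        self.image_processor = AutoImageProcessor.from_pretrained(image_name)
        self.image_encoder = AutoModel.from_pretrained(image_name)
        fused = self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size
        # Simple two-layer fusion head over the concatenated [CLS] features (assumption).
        self.classifier = nn.Sequential(nn.Linear(fused, 512), nn.ReLU(),
                                        nn.Linear(512, num_answers))

    def forward(self, image, question, caption=""):
        # Late knowledge injection: concatenate the caption to the question text.
        text = f"{question} [SEP] {caption}" if caption else question
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True)
        text_feat = self.text_encoder(**tokens).last_hidden_state[:, 0]    # [CLS] embedding
        pixels = self.image_processor(image, return_tensors="pt")
        image_feat = self.image_encoder(**pixels).last_hidden_state[:, 0]  # [CLS] patch token
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

def blip_caption(image):
    """Generate a caption with a pre-trained BLIP model to use as external knowledge."""
    name = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    output = model.generate(**processor(image, return_tensors="pt"), max_new_tokens=30)
    return processor.decode(output[0], skip_special_tokens=True)

Swapping in RoBERTa, ALBERT, DeiT, or BEiT would amount to passing different checkpoint names to the constructor, which is one way the encoder combinations in the study could be enumerated; the actual MMLFT fusion mechanism and training details are described in the full paper.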

References
  1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  2. Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv, 1906.05317, 2019. Available: https://arxiv.org/abs/1906.05317.
  3. Zhuo Chen et al. LaKo: Knowledge-driven visual question answering via late knowledge-to-text injection. arXiv, 2022. Available: https://arxiv.org/abs/2207.12888.
  4. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems (NeurIPS), pages 2296–2304, 2015.
  5. Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, pages 3618–3623, 2015.
  6. Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision (ECCV), pages 529–545, 2014.
  7. Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
  8. Jena D. Hwang et al. COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. 2020.
  9. Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Learning cross-modality similarity for multinomial data. In IEEE International Conference on Computer Vision (ICCV), pages 2407–2414, 2011.
  10. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv, 2201.12086, 2022. Available: https://arxiv.org/abs/2201.12086.
  11. Haitian Lu. Open-ended generative commonsense question answering with knowledge graph-enhanced language models. Semantic Scholar, 2021. Available: https://www.semanticscholar.org/paper/Open-Ended-Generative-Commonsense-Question-with-Lu/a201c722c7de07b7354dda9cdabf9baf7e6e2ec0.
  12. Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NeurIPS), pages 1682–1690, 2014.
  13. Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015.
  14. Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, and Vered Shwartz. VLC-BERT: Visual question answering with contextualized commonsense knowledge. arXiv, 2022. Available: https://arxiv.org/abs/2210.13626.
  15. Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv, 2018. Available: https://arxiv.org/abs/1612.03975.
  16. Julia Vogel and Bernt Schiele. Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, 72(2):133–157, 2007.
  17. Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv, 2207.02696, 2022. Available: https://arxiv.org/abs/2207.02696.
Index Terms

Computer Science
Information Sciences
Machine Learning
Deep Learning
Artificial Intelligence

Keywords

Visual Question Answering (VQA), External knowledge injection, Multimodal fusion models, Common Sense Transformer