Research Article

X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering

by Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 61
Year of Publication: 2025
Authors: Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi
10.5120/ijca2025924397

Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi. X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering. International Journal of Computer Applications. 186, 61 (Jan 2025), 51-58. DOI=10.5120/ijca2025924397

@article{ 10.5120/ijca2025924397,
author = { Shyam Agarwal, Amey Bharat Gohil, Smit Nautambhai Modi },
title = { X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering },
journal = { International Journal of Computer Applications },
issue_date = { Jan 2025 },
volume = { 186 },
number = { 61 },
month = { Jan },
year = { 2025 },
issn = { 0975-8887 },
pages = { 51-58 },
numpages = {8},
url = { https://ijcaonline.org/archives/volume186/number61/x-vl-injecting-external-knowledge-into-vision-language-models-for-better-answering/ },
doi = { 10.5120/ijca2025924397 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Shyam Agarwal
%A Amey Bharat Gohil
%A Smit Nautambhai Modi
%T X-VL: Injecting External Knowledge into Vision-Language Models for Better Answering
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 61
%P 51-58
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In recent years, the vision-and-language community has grown significantly, especially with the advent of large models. Visual Question Answering (VQA) is a task at the intersection of computer vision and natural language processing that is both unique and difficult in its framing: it demands a holistic understanding of images and language for accurate responses and requires the model to integrate multiple modalities of data. The conventional VQA approach processes the entire image to answer a posed question, often missing nuanced contextual information. This research work aims to improve VQA systems by incorporating external knowledge into the system and analyzing the resulting performance. The study uses a MultiModal Late Fusion Transformer (MMLFT) with three pre-trained models for textual embeddings, BERT, RoBERTa, and ALBERT, and three pre-trained models for image encoding, ViT, DeiT, and BEiT. Experiments are conducted across the possible combinations of these text and image encoders to assess the impact of incorporating external knowledge into the system. Captions from a pre-trained image captioning model, BLIP, serve as the external knowledge, and the investigation focuses on whether this addition improves the model's evaluation metrics. Although much work has been done on improving VQA models by adding external knowledge, this study is believed to be the first to approach the topic from a data-specific point of view, closely analyzing individual data entries and attempting to explain why the results do or do not improve. A simple but novel way to cheaply generate inferences about an image is also presented, showcasing its potential to assist future VQA tasks. The conclusion drawn is that adding external information contributes to better results, but the mode of knowledge addition needs to be well constructed.
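As a concrete illustration of the approach described in the abstract, the following is a minimal Python sketch of a late-fusion VQA model with caption-based knowledge injection, assuming the Hugging Face transformers library. The LateFusionVQA class, its fusion head, the placeholder answer-vocabulary size, and the blip_caption helper are illustrative assumptions, not the paper's MMLFT implementation; BERT and ViT checkpoints stand in for the encoder combinations studied.

# Minimal late-fusion VQA sketch (illustrative; not the paper's exact MMLFT code).
# External knowledge is injected by appending a BLIP-generated caption to the question.
import torch
import torch.nn as nn
from transformers import (AutoTokenizer, AutoModel, AutoImageProcessor,
                          BlipProcessor, BlipForConditionalGeneration)

class LateFusionVQA(nn.Module):
    """Encode the question (plus caption) and the image separately, then fuse late."""

    def __init__(self, text_name="bert-base-uncased",
                 image_name="google/vit-base-patch16-224-in21k",
                 num_answers=1000):  # answer-vocabulary size is a placeholder
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(text_name)
        self.text_encoder = AutoModel.from_pretrained(text_name)
        self.image_processor = AutoImageProcessor.from_pretrained(image_name)
        self.image_encoder = AutoModel.from_pretrained(image_name)
        fused = self.text_encoder.config.hidden_size + self.image_encoder.config.hidden_size
        # Simple two-layer fusion head over the concatenated [CLS] features (assumption).
        self.classifier = nn.Sequential(nn.Linear(fused, 512), nn.ReLU(),
                                        nn.Linear(512, num_answers))

    def forward(self, image, question, caption=""):
        # Late knowledge injection: concatenate the caption to the question text.
        text = f"{question} [SEP] {caption}" if caption else question
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True)
        text_feat = self.text_encoder(**tokens).last_hidden_state[:, 0]    # [CLS] embedding
        pixels = self.image_processor(image, return_tensors="pt")
        image_feat = self.image_encoder(**pixels).last_hidden_state[:, 0]  # [CLS] patch token
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

def blip_caption(image):
    """Generate a caption with a pre-trained BLIP model to use as external knowledge."""
    name = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    output = model.generate(**processor(image, return_tensors="pt"), max_new_tokens=30)
    return processor.decode(output[0], skip_special_tokens=True)

Swapping in RoBERTa, ALBERT, DeiT, or BEiT would amount to passing different checkpoint names to the constructor, which is one way the encoder combinations in the study could be enumerated; the actual MMLFT fusion mechanism and training details are described in the full paper.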

References
  1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision (ICCV), 2015.
  2. Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv, 1906.05317, 2019. Available: https://arxiv.org/abs/1906.05317.
  3. Zhuo Chen et al. LaKo: Knowledge-driven visual question answering via late knowledge-to-text injection. arXiv, 2022. Available: https://arxiv.org/abs/2207.12888.
  4. Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In Advances in Neural Information Processing Systems (NeurIPS), pages 2296–2304, 2015.
  5. Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, pages 3618–3623, 2015.
  6. Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision (ECCV), pages 529–545, 2014.
  7. Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
  8. Jena D. Hwang et al. COMET-ATOMIC 2020: On symbolic and neural commonsense knowledge graphs. 2020.
  9. Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Learning cross-modality similarity for multinomial data. In IEEE International Conference on Computer Vision (ICCV), pages 2407–2414, 2011.
  10. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv, 2201.12086, 2022. Available: https://arxiv.org/abs/2201.12086.
  11. Haitian Lu. Open-ended generative commonsense question answering with knowledge graph-enhanced language models. Semantic Scholar, 2021. Available: https://www.semanticscholar.org/paper/Open-Ended-Generative-Commonsense-Question-with-Lu/a201c722c7de07b7354dda9cdabf9baf7e6e2ec0.
  12. Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems (NeurIPS), pages 1682–1690, 2014.
  13. Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In IEEE International Conference on Computer Vision (ICCV), pages 1–9, 2015.
  14. Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, and Vered Shwartz. VLC-BERT: Visual question answering with contextualized commonsense knowledge. arXiv, 2022. Available: https://arxiv.org/abs/2210.13626.
  15. Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv, 2018. Available: https://arxiv.org/abs/1612.03975.
  16. Julia Vogel and Bernt Schiele. Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, 72(2):133–157, 2007.
  17. Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv, 2207.02696, 2022. Available: https://arxiv.org/abs/2207.02696.
Index Terms

Computer Science
Information Sciences
Machine Learning
Deep Learning
Artificial Intelligence

Keywords

Visual Question Answering (VQA), External knowledge injection, Multimodal fusion models, Common Sense Transformer