Research Article

Multi-Feature Fusion for Video Captioning

by Yubo Jiang
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 181 - Number 48
Year of Publication: 2019
Authors: Yubo Jiang
DOI: 10.5120/ijca2019918660

Yubo Jiang. Multi-Feature Fusion for Video Captioning. International Journal of Computer Applications 181, 48 (Apr 2019), 47-53. DOI=10.5120/ijca2019918660

@article{10.5120/ijca2019918660,
  author     = {Yubo Jiang},
  title      = {Multi-Feature Fusion for Video Captioning},
  journal    = {International Journal of Computer Applications},
  issue_date = {Apr 2019},
  volume     = {181},
  number     = {48},
  month      = {Apr},
  year       = {2019},
  issn       = {0975-8887},
  pages      = {47-53},
  numpages   = {7},
  url        = {https://ijcaonline.org/archives/volume181/number48/30483-2019918660/},
  doi        = {10.5120/ijca2019918660},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
%0 Journal Article
%A Yubo Jiang
%T Multi-Feature Fusion for Video Captioning
%J International Journal of Computer Applications
%@ 0975-8887
%V 181
%N 48
%P 47-53
%D 2019
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Video captioning is a task that brings natural language processing and computer vision together. Typical approaches are based on the CNN-RNN framework, which uses a pre-trained Convolutional Neural Network (CNN) to extract image features and a Recurrent Neural Network (RNN) to generate captions word by word. However, most approaches use only the global video feature and lose spatial and motion information. To address this problem, a novel video captioning method based on multi-feature fusion is proposed. The method extracts spatial features, motion features, and video features for each frame, and all the features are fused to generate video captions. The fused features are fed into a long short-term memory (LSTM) network, which serves as the natural language generation module. Multiple language modules are trained on different feature combinations and then fused at a later stage: one model is selected to propose multiple candidate outputs for the current input, the other models compute the probability of each candidate, and the candidates' weighted probabilities determine which output is taken. The feature fusion in this method therefore includes both pre-fusion and post-fusion. Experiments on the standard MSVD test set show that fusing different types of features achieves higher evaluation scores, that fusing features of the same type does not score higher than a single feature, and that fine-tuning the pre-trained model with these features is not effective. The METEOR score reaches 0.302, 1.34% higher than the current best value, showing that this method can improve the accuracy of automatic video captioning.
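
The pre-fusion and post-fusion steps described above can be sketched in a few lines of code. The snippet below is a minimal illustration, not the paper's implementation: the model objects, their next_word_probs method, and the use of concatenation as the pre-fusion operator are assumptions introduced here for clarity.

import numpy as np

def fuse_frame_features(spatial, motion, video):
    # Pre-fusion sketch: combine per-frame spatial, motion, and global video
    # features into one vector for the LSTM language module. Concatenation is
    # an assumption; the abstract does not specify the fusion operator.
    return np.concatenate([spatial, motion, video], axis=-1)

def fused_next_word(models, weights, prefix, k=5):
    # Post-fusion sketch: the first (primary) model proposes the top-k
    # candidate words for the current prefix, every model re-scores those
    # candidates, and the candidate with the highest weighted probability
    # is emitted. `models` are hypothetical decoders exposing
    # next_word_probs(prefix) -> 1-D probability array over a shared vocabulary.
    probs = models[0].next_word_probs(prefix)
    candidates = np.argsort(probs)[-k:]                  # top-k proposals
    scores = np.zeros(len(candidates))
    for w, m in zip(weights, models):
        scores += w * m.next_word_probs(prefix)[candidates]
    return int(candidates[np.argmax(scores)])            # highest weighted score wins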

Index Terms

Computer Science
Information Sciences

Keywords

Feature Fusion, Video Captioning, Deep Learning, LSTM