Research Article

Multi-Feature Fusion for Video Captioning

by Yubo Jiang
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 181 - Number 48
Year of Publication: 2019
Authors: Yubo Jiang
DOI: 10.5120/ijca2019918660

Yubo Jiang. Multi-Feature Fusion for Video Captioning. International Journal of Computer Applications. 181, 48 (Apr 2019), 47-53. DOI=10.5120/ijca2019918660

@article{10.5120/ijca2019918660,
  author     = {Yubo Jiang},
  title      = {Multi-Feature Fusion for Video Captioning},
  journal    = {International Journal of Computer Applications},
  issue_date = {Apr 2019},
  volume     = {181},
  number     = {48},
  month      = {Apr},
  year       = {2019},
  issn       = {0975-8887},
  pages      = {47-53},
  numpages   = {7},
  url        = {https://ijcaonline.org/archives/volume181/number48/30483-2019918660/},
  doi        = {10.5120/ijca2019918660},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
%0 Journal Article
%A Yubo Jiang
%T Multi-Feature Fusion for Video Captioning
%J International Journal of Computer Applications
%@ 0975-8887
%V 181
%N 48
%P 47-53
%D 2019
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Video captioning is a task that brings natural language processing and computer vision together. Typical approaches follow a CNN-RNN pipeline: a pre-trained Convolutional Neural Network (CNN) extracts image features and a Recurrent Neural Network (RNN) generates the caption word by word. However, most of these approaches use only a global video feature and lose the spatial and motion information. To address this problem, a novel video captioning method based on multi-feature fusion is proposed. The method extracts spatial features, motion features, and global video features from each frame and fuses them to generate the caption. The fused features are fed into a Long Short-Term Memory (LSTM) network, which serves as the language generation module. Multiple language models are trained on different feature combinations and fused at a later stage: one model is selected to produce several candidate outputs for the current input, the other models then compute the probability of each candidate, and the weighted candidate with the highest probability is taken as the output. The feature fusion therefore covers both pre-fusion and post-fusion. Experiments on the standard MSVD test set show that fusing different types of features yields higher evaluation scores, that fusing features of the same type does not score higher than a single feature, and that using the features to fine-tune the pre-trained model is not effective. The METEOR score reaches 0.302, 1.34% higher than the previous best, showing that the method improves the accuracy of automatic video description.
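
As a reading aid, the post-fusion step described in the abstract can be sketched in a few lines of Python: one decoder proposes candidate next words, the remaining decoders rescore each candidate, and the weighted candidate with the highest probability is emitted. This is a minimal sketch only; the function late_fusion_step, the toy vocabulary, the distributions p_spatial, p_motion and p_global, and the fusion weights are illustrative assumptions and are not taken from the paper.

import numpy as np

# Toy shared vocabulary over which every decoder predicts the next word.
VOCAB = ["a", "man", "is", "playing", "guitar", "<eos>"]

def late_fusion_step(primary_probs, other_probs, weights, k=3):
    # Take the top-k candidate words proposed by the selected (primary) model.
    candidates = np.argsort(primary_probs)[::-1][:k]
    # Rescore every candidate with all models and combine with the given weights.
    all_probs = [primary_probs] + list(other_probs)
    scores = {int(idx): sum(w * p[idx] for w, p in zip(weights, all_probs))
              for idx in candidates}
    # Emit the candidate with the highest weighted probability.
    best = max(scores, key=scores.get)
    return VOCAB[best], scores[best]

# Hypothetical next-word distributions from three decoders trained on
# different feature combinations (spatial, motion, global video features).
p_spatial = np.array([0.05, 0.40, 0.20, 0.15, 0.15, 0.05])
p_motion  = np.array([0.05, 0.25, 0.15, 0.30, 0.20, 0.05])
p_global  = np.array([0.10, 0.30, 0.20, 0.20, 0.15, 0.05])

word, score = late_fusion_step(p_spatial, [p_motion, p_global], weights=[0.4, 0.3, 0.3])
print(word, round(score, 3))   # -> man 0.325

In the actual method each distribution would come from an LSTM decoder conditioned on a different fused feature set, and the weights would be chosen on a validation split; the toy numbers above only show the mechanics of the weighted rescoring.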

Index Terms

Computer Science
Information Sciences

Keywords

Feature Fusion
Video Captioning
Deep Learning
LSTM