International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 176, Number 39
Year of Publication: 2020
Authors: Oluwafemi Oriola
DOI: 10.5120/ijca2020920503
Oluwafemi Oriola. Exploring N-gram, Word Embedding and Topic Models for Content-based Fake News Detection in FakeNewsNet Evaluation. International Journal of Computer Applications. 176, 39 (Jul 2020), 25-30. DOI=10.5120/ijca2020920503
FakeNewsNet is a repository of two novel datasets, PolitiFact and GossipCop, which are used to evaluate fake news detection techniques. Unlike other extensively studied benchmark fake news datasets, the FakeNewsNet datasets incorporate news content, social context, and dynamic information, which can be used to study fake news propagation, detection, and mitigation. Existing works on FakeNewsNet have focused on one-hot encoding, social contexts such as user-based models, and dynamic information such as news propagation models. However, n-gram, word embedding, and topic models of news content, which have performed well in other contexts, have not been explored. This paper therefore explores n-gram, word embedding, and topic models of news content for the evaluation of the FakeNewsNet datasets. A unigram-based n-gram model, a skip-gram word2vec-based word embedding model, and a Latent Dirichlet Allocation-based topic model are extracted after preprocessing the datasets. The features are weighted by TF-IDF to overcome the shortcomings of the individual models and analyzed using Logistic Regression. The evaluation of the models and their hybrids shows that the n-gram model outperforms the word embedding and topic models. Specifically, the n-gram model records accuracy, precision, recall, and F1-score of 0.80, 0.79, 0.78, and 0.79, respectively, for PolitiFact, and 0.82, 0.75, 0.79, and 0.77, respectively, for GossipCop. Comparison with benchmark approaches also shows that the n-gram model performs better.
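To make the described pipeline concrete, the sketch below builds the three content-based feature models (TF-IDF-weighted unigrams, skip-gram word2vec document vectors, and LDA topic distributions) and evaluates each with Logistic Regression. It is a minimal illustration using scikit-learn and gensim; the dataset loading, preprocessing, and hyperparameters (vector size, number of topics, train/test split) are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch of the three content-based feature models evaluated with
# Logistic Regression. Dataset loading and hyperparameters are assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def ngram_features(train_texts, test_texts):
    # Unigram model weighted by TF-IDF.
    vec = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
    return vec.fit_transform(train_texts), vec.transform(test_texts)


def word2vec_features(train_texts, test_texts, dim=100):
    # Skip-gram word2vec (sg=1); each document is the mean of its word vectors.
    tokenized = [t.split() for t in train_texts]
    w2v = Word2Vec(sentences=tokenized, vector_size=dim, sg=1, min_count=2, seed=1)

    def doc_vec(text):
        vecs = [w2v.wv[w] for w in text.split() if w in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    return (np.vstack([doc_vec(t) for t in train_texts]),
            np.vstack([doc_vec(t) for t in test_texts]))


def lda_features(train_texts, test_texts, n_topics=20):
    # LDA topic model; each document is represented by its topic distribution.
    counts = CountVectorizer(stop_words="english")
    X_train = counts.fit_transform(train_texts)
    X_test = counts.transform(test_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=1)
    return lda.fit_transform(X_train), lda.transform(X_test)


def evaluate(X_train, y_train, X_test, y_test):
    # Logistic Regression classifier with the four reported metrics.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = clf.predict(X_test)
    return {"accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred)}


# Usage (placeholder: loading PolitiFact/GossipCop news content is not shown):
# train_x, test_x, train_y, test_y = ...  # preprocessed texts and fake/real labels
# for extract in (ngram_features, word2vec_features, lda_features):
#     Xtr, Xte = extract(train_x, test_x)
#     print(extract.__name__, evaluate(Xtr, train_y, Xte, test_y))
```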