International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 176 - Number 14
Year of Publication: 2020
Authors: Osanyin Quadri A., Ajose-Ismail B. M.
10.5120/ijca2020920071 |
Osanyin Quadri A., Ajose-Ismail B. M. A Neural Network Language Document Representation Technique for Web-Page Classification. International Journal of Computer Applications. 176, 14 (Apr 2020), 38-43. DOI=10.5120/ijca2020920071
Assigning a web page to the correct category is becoming increasingly cumbersome because of the influx of digital documents on the World Wide Web. The performance of applications such as web directories, question answering systems, and web content filtering systems depends on the performance of automatic web page classification systems. According to the extant literature, the performance of a web page classification system depends on an adequate textual representation of the web content. Several statistical document representation techniques, such as bag-of-words models, n-gram models, and topic models, have been proposed to capture the real semantics of web documents, but they are fraught with challenges such as semantic mismatch and words with multiple meanings. This paper therefore proposes a recent neural network language model (Doc2Vec), which uses document embeddings, to solve the document representation problem in web page classification systems. The results obtained confirm the earlier assumption that Doc2Vec performs robustly on very high-dimensional text such as web documents and that it also captures the real semantics of the web document.