International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 63 - Number 4
Year of Publication: 2013
Authors: Shrijina Sreenivasan, B. Lakshmipathi
DOI: 10.5120/10455-5163
Shrijina Sreenivasan, B. Lakshmipathi. An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models. International Journal of Computer Applications. 63, 4 (February 2013), 33-37. DOI=10.5120/10455-5163
With the massive use of the internet and search engines, web spam has become a major problem. Web spam can be detected by analyzing features of web pages and categorizing each page as spam or non-spam. The proposed work uses unsupervised learning algorithms to characterize web pages by their link-based and content-based features, comparing the different sources of information in the source page and the target page of a link. The first unsupervised technique considered is the Hidden Markov Model (HMM), which captures the different browsing patterns of users: users may reach a page not only through direct hyperlinks but also by typing URLs or by opening multiple windows. Unsupervised techniques have no predefined classes to map outcomes to; as a result, they estimate all possible probabilities of relation between the source and target pages. This helps attain higher efficiency in web spam detection even when the dataset is small. The other unsupervised methods applied to the same task are the Self-Organizing Map (SOM) and Adaptive Resonance Theory (ART). Finally, the performance of all the techniques is evaluated, and the techniques are presented in increasing order of their performance metric.
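
As an illustration of what comparing content between a source and a target page might look like, the following is a minimal sketch, not the authors' implementation. It assumes each page (or the text around a link) is represented as a smoothed unigram language model and that the difference is measured with Kullback-Leibler divergence; the abstract does not state the measure used, and the function names `unigram_model`, `kl_divergence`, and `source_target_divergence` are hypothetical.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary.

    Assumption: whitespace tokenization and add-alpha smoothing; the
    paper's actual preprocessing is not described in the abstract.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def source_target_divergence(source_text, target_text):
    """Divergence between the language models of a source and a target text."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocab)
    q = unigram_model(target_text, vocab)
    return kl_divergence(p, q)
```

Under these assumptions, a divergence near zero suggests the source and target texts are about similar content, while a large value suggests a mismatch. Such a value could serve as one content-based feature fed to an unsupervised model such as an HMM, SOM, or ART network.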