Research Article

An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models

by Shrijina Sreenivasan, B. Lakshmipathi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 63 - Number 4
Year of Publication: 2013
Authors: Shrijina Sreenivasan, B. Lakshmipathi
10.5120/10455-5163

Shrijina Sreenivasan, B. Lakshmipathi. An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models. International Journal of Computer Applications. 63, 4 (February 2013), 33-37. DOI=10.5120/10455-5163

@article{ 10.5120/10455-5163,
author = { Shrijina Sreenivasan and B. Lakshmipathi },
title = { An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models },
journal = { International Journal of Computer Applications },
issue_date = { February 2013 },
volume = { 63 },
number = { 4 },
month = { February },
year = { 2013 },
issn = { 0975-8887 },
pages = { 33-37 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume63/number4/10455-5163/ },
doi = { 10.5120/10455-5163 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:13:17.716987+05:30
%A Shrijina Sreenivasan
%A B. Lakshmipathi
%T An Unsupervised Model to detect Web Spam based on Qualified Link Analysis and Language Models
%J International Journal of Computer Applications
%@ 0975-8887
%V 63
%N 4
%P 33-37
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the massive use of the internet and of search engines, web spam has become a major problem. Web spam can be detected by analyzing features of web pages and classifying each page as spam or non-spam. The proposed work uses unsupervised learning algorithms to characterize web pages by their link-based and content-based features, comparing the different sources of information in the source page and the target page of a link. The first unsupervised technique considered is the Hidden Markov Model, which captures the different browsing patterns of users: users do not reach pages only through direct hyperlinks, but may also jump from one page to another by typing URLs or by opening multiple windows. Unsupervised techniques have no predefined classes to map outcomes to. As a result, they estimate all possible probabilities of relation between the source and target page, which helps to attain higher efficiency in web spam detection even when the dataset is small. The other unsupervised methods applied to the same task are the Self Organizing Map (SOM) and Adaptive Resonance Theory (ART). Finally, the performance of all the techniques is evaluated and the techniques are presented in increasing order of their performance metric.
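
As an illustration of comparing sources of information in the source and target page, the following minimal Python sketch scores the disagreement between two unigram language models with the Kullback-Leibler divergence. The abstract does not specify the divergence measure, the smoothing, or the tokenization, so the KL divergence, the add-alpha smoothing, whitespace tokenization, and the names `unigram_model` and `kl_divergence` are all assumptions made for this example and should not be read as the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Return an add-alpha smoothed unigram distribution over vocab.

    Smoothing (an assumption of this sketch) keeps every probability
    non-zero so the divergence below is always finite.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(source_text, target_text, alpha=1.0):
    """KL(source || target) in bits over the joint vocabulary.

    source_text: one source of information from the linking page,
                 for example the anchor text (hypothetical choice).
    target_text: one source of information from the linked page,
                 for example its title or body (hypothetical choice).
    """
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_model(source_text, vocab, alpha)
    q = unigram_model(target_text, vocab, alpha)
    return sum(p[w] * math.log2(p[w] / q[w]) for w in vocab)


# Illustrative strings only; they are not taken from any dataset.
anchor = "cheap flights to paris"
related_target = "book cheap flights to paris and other cities"
unrelated_target = "online casino poker bonus win money now"

related_score = kl_divergence(anchor, related_target)
unrelated_score = kl_divergence(anchor, unrelated_target)
```

In this sketch a larger divergence means the link text and the linked page disagree more, which is the kind of content-based feature an unsupervised model could take as input alongside link-based features.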

Index Terms

Computer Science
Information Sciences

Keywords

Link analysis, Unsupervised Learning Techniques, Web spam Detection