Improving web page clustering using Probabilistic Latent Semantic Analysis

Research Article

Published on April 2012 by Lalit A. Patil, S M. Kamalapur

Emerging Trends in Computer Science and Information Technology (ETCSIT2012)

Foundation of Computer Science USA

ETCSIT - Number 4

April 2012

Authors: Lalit A. Patil, S M. Kamalapur

Lalit A. Patil, S M. Kamalapur. Improving web page clustering using Probabilistic Latent Semantic Analysis. Emerging Trends in Computer Science and Information Technology (ETCSIT2012). ETCSIT, 4 (April 2012), 1-4.

@article{
  author = { Lalit A. Patil and S M. Kamalapur },
  title = { Improving web page clustering using Probabilistic Latent Semantic Analysis },
  journal = { Emerging Trends in Computer Science and Information Technology (ETCSIT2012) },
  issue_date = { April 2012 },
  volume = { ETCSIT },
  number = { 4 },
  month = { April },
  year = { 2012 },
  issn = { 0975-8887 },
  pages = { 1-4 },
  numpages = { 4 },
  url = { /proceedings/etcsit/number4/5982-1025/ },
  publisher = { Foundation of Computer Science (FCS), NY, USA },
  address = { New York, USA }
}

%0 Proceeding Article
%1 Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%A Lalit A. Patil
%A S M. Kamalapur
%T Improving web page clustering using Probabilistic Latent Semantic Analysis
%J Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%@ 0975-8887
%V ETCSIT
%N 4
%P 1-4
%D 2012
%I International Journal of Computer Applications

Abstract

Traditional clustering algorithms are usually based on the bag-of-words (BOW) approach. A well-known disadvantage of the BOW model is that it ignores the semantic relationships among words. As a result, if two documents use different sets of core words to represent the same topic, they may be assigned to different clusters, even though the core words they use are probably synonyms or semantically associated in some other way. Conventional web page clustering techniques are often used to reveal the functional similarity of web pages. Tagging can help improve clustering performance, and several efforts have been made to explore social tagging for clustering. However, tag-based web page clustering has some drawbacks. To our knowledge, all existing approaches that exploit tag information for web page clustering assume that all web pages are tagged, which is a somewhat restrictive assumption. In a more realistic setting, tags can be expected to be available for only a small number of web pages. In this paper, we propose a new web page grouping approach based on the Probabilistic Latent Semantic Analysis (PLSA) model. An iterative algorithm based on the maximum likelihood principle is employed to overcome the aforementioned computational shortcoming.
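As a rough illustration of the iterative maximum-likelihood fitting mentioned in the abstract, the following is a minimal sketch of the standard EM updates for PLSA, where each document's word distribution is modelled as P(w|d) = sum over z of P(z|d) P(w|z). It is not the authors' implementation: the function names (`normalize`, `plsa`, `log_likelihood`), the toy vocabulary, and the counts are all hypothetical, and tag information is not modelled. The toy data is chosen so that two documents about the same topic share only one word ("engine"), which is the situation where plain BOW matching tends to struggle.

```python
import math
import random


def normalize(row):
    """Scale non-negative numbers to sum to 1 (uniform if all are zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    Returns (p_z_d, p_w_z): P(z|d) for each document, P(w|z) for each topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialisation; every row is a probability distribution.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z|d,w), proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                posterior = normalize(joint)
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += n * posterior[z]
                    new_w_z[z][w] += n * posterior[z]
        # M-step: renormalise expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood of the counts under P(w|d) = sum_z P(z|d) P(w|z)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                mix = sum(p_z_d[d][z] * p_w_z[z][w]
                          for z in range(len(p_w_z)))
                ll += n * math.log(mix)
    return ll


# Hypothetical toy corpus: rows are pages, columns follow VOCAB.
VOCAB = ["car", "automobile", "engine", "recipe", "cooking", "kitchen"]
COUNTS = [
    [2, 0, 1, 0, 0, 0],  # page 0: car, engine
    [0, 2, 1, 0, 0, 0],  # page 1: automobile, engine
    [0, 0, 0, 2, 0, 1],  # page 2: recipe, kitchen
    [0, 0, 0, 0, 2, 1],  # page 3: cooking, kitchen
]

if __name__ == "__main__":
    p_z_d, p_w_z = plsa(COUNTS, n_topics=2)
    # Assign each page to its most probable latent topic.
    clusters = [max(range(2), key=lambda z: row[z]) for row in p_z_d]
    print("cluster per page:", clusters)
    print("log-likelihood:", log_likelihood(COUNTS, p_z_d, p_w_z))
```

Each EM iteration cannot decrease the log-likelihood, but the procedure may stop at a local optimum, so the resulting clusters can depend on the random initialisation (`seed`) and on the chosen number of topics.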

Index Terms

Computer Science

Information Sciences

Keywords

web page