International Conference in Computational Intelligence
Foundation of Computer Science USA
ICCIA - Number 6
March 2012
Authors: Lalit A. Patil, S M. Kamalapur, Dhananjay Kanade
Lalit A. Patil, S M. Kamalapur, Dhananjay Kanade. Web Page Clustering using Latent Semantic Analysis. International Conference in Computational Intelligence. ICCIA, 6 (March 2012), 21-25.
Web mining techniques such as clustering help organize web content into subject-based categories so that search and retrieval remain efficient and manageable. Traditional web page clustering typically uses only the page content (usually the page text) in a feature-vector representation such as bag of words or term frequency/inverse document frequency (TF-IDF), and then applies a standard clustering algorithm (e.g. K-means, suffix tree clustering, query-directed clustering). However, the advent of social bookmarking websites such as StumbleUpon and Delicious has led to a huge amount of user-generated content associated with web pages. For example, users can provide captions for images on the internet and tags for web pages and other media content they regularly browse. Such user-generated content can therefore provide useful information in various forms, either as meta-data or more explicitly as tags. In multi-view learning, the features can be split into two subsets, each of which alone is sufficient for learning. For unsupervised learning algorithms, multiple views of the data can often help in extracting better features. Canonical Correlation Analysis (CCA) is an unsupervised feature extraction technique that finds dependencies between two (or more) views of the data by maximizing the correlations between the views in a shared subspace. CCA, however, has drawbacks. The first approach is based on an annotation-based probabilistic latent semantic analysis (LSA) over document-word and tag-word co-occurrence matrices.
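
The traditional text-only pipeline mentioned in the abstract (a TF-IDF feature vector per page, followed by K-means) can be sketched as below. This is a minimal illustration and not the authors' implementation: the function names `tfidf_vectors` and `kmeans`, the whitespace tokenizer, the smoothed IDF formula, the choice of the first k pages as initial centroids, and the four toy pages are all assumptions made for the example. It uses only the Python standard library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors), one vector per document."""
    tokenised = [doc.lower().split()for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Smoothed IDF (an assumption; other weighting variants are common).
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=20):
    """Plain K-means; returns one cluster label per input vector."""
    # Assumption: deterministic start from the first k vectors, for reproducibility.
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy "pages": two about sport, two about programming.
pages = [
    "football match goal team league",
    "python code compiler programming language",
    "team wins league football cup",
    "programming language python tutorial code",
]
vocab, vectors = tfidf_vectors(pages)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two sports pages share one label and the two programming pages share the other. User-provided tags are not used here; the sketch covers only the page-text view that the abstract describes as the traditional baseline.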