Research Article

An Approach to Discovery and Re-ranking of Educational content from the World Wide Web using Latent Dirichlet Allocation

Published in 2011 by Jagadish V, Hariharan G, Geetha T V
Artificial Intelligence Techniques - Novel Approaches & Practical Applications
Foundation of Computer Science USA
AIT - Number 4
2011
Authors: Jagadish V, Hariharan G, Geetha T V

Jagadish V, Hariharan G, Geetha T V. An Approach to Discovery and Re-ranking of Educational content from the World Wide Web using Latent Dirichlet Allocation. Artificial Intelligence Techniques - Novel Approaches & Practical Applications. AIT, 4 (2011), 14-20.

author = { Jagadish V, Hariharan G, Geetha T V },
title = { An Approach to Discovery and Re-ranking of Educational content from the World Wide Web using Latent Dirichlet Allocation },
journal = { Artificial Intelligence Techniques - Novel Approaches & Practical Applications },
issue_date = { 2011 },
volume = { AIT },
number = { 4 },
year = { 2011 },
issn = 0975-8887,
pages = { 14-20 },
numpages = 7,
url = { /specialissues/ait/number4/2844-225/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
%0 Special Issue Article
%1 Artificial Intelligence Techniques - Novel Approaches & Practical Applications
%A Jagadish V
%A Hariharan G
%A Geetha T V
%T An Approach to Discovery and Re-ranking of Educational content from the World Wide Web using Latent Dirichlet Allocation
%J Artificial Intelligence Techniques - Novel Approaches & Practical Applications
%@ 0975-8887
%N 4
%P 14-20
%D 2011
%I International Journal of Computer Applications

With the tremendous increase in the amount of digital data available, educators are forced to author content for learning and teaching in their classes. This has created a need to facilitate automatic discovery of learning resources from the World Wide Web. In this work, we present a novel approach for discovering content from the web for e-learning. We argue that, in an e-learning scenario, retrieval of redundant content from the web is a serious problem because it does not satisfy the requirements of a typical learner. Furthermore, the retrieved content should cover all topics in the learner's syllabus. Sense disambiguation should be performed during information retrieval from the web so that the results correspond to the learner's actual domain of interest. This work presents a domain-ontology-based re-querying approach for query expansion to discover content from open corpus sources. We use the Latent Dirichlet Allocation (LDA) model for unsupervised classification of document segments to aid students and educators. Having identified topics at the granularity of document segments in an unsupervised fashion, we argue that internal topic transitions in a resource retrieved from the web can be exploited to provide relevant and personalized content. In addition, we propose a re-ranking scheme that orders search engine results to maximize topic coverage and minimize redundancy among the retrieved results. We also evaluate the effectiveness of the proposed method for information retrieval and show that it yields greater coverage of topics from the web without redundancy.
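The abstract does not spell out the re-ranking scheme, so the following is only a minimal sketch of one way such a scheme could work, not the authors' method. It assumes each retrieved page has already been reduced to a set of topic ids (for example, by thresholding LDA topic proportions of its segments) and greedily picks the page that adds the most topics not yet covered. The function name `rerank_for_coverage`, the page names, and the topic ids are all hypothetical.

```python
from typing import Dict, List, Set


def rerank_for_coverage(
    doc_topics: Dict[str, Set[int]],
    original_order: List[str],
) -> List[str]:
    """Greedily reorder results so each pick adds the most unseen topics."""
    remaining = list(original_order)
    covered: Set[int] = set()
    ranked: List[str] = []
    while remaining:
        # Gain = number of topics a page adds beyond those already covered.
        # max() returns the first maximal item, so ties keep the original order.
        best = max(remaining, key=lambda d: len(doc_topics[d] - covered))
        ranked.append(best)
        covered |= doc_topics[best]
        remaining.remove(best)
    return ranked


# Hypothetical input: topic ids assumed to come from an LDA model.
doc_topics = {
    "page_a": {0, 1},
    "page_b": {0, 1},  # same topics as page_a, i.e. redundant
    "page_c": {2, 3},
    "page_d": {1, 4},
}
search_order = ["page_a", "page_b", "page_c", "page_d"]
reranked = rerank_for_coverage(doc_topics, search_order)
print(reranked)
```

In this toy input, `page_b` repeats the topics of `page_a`, so the greedy pass moves it to the end while pages that introduce new topics move up. A fuller scheme would likely weight topics by their proportions and by the syllabus, which this sketch leaves out.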

Index Terms

Computer Science
Information Sciences


Latent Dirichlet Allocation, Re-ranking, Topic coverage, Redundancy