Research Article

Comparison on the Effectiveness of Different Statistical Similarity Measures

by Safa’a I. Hajeer
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 53 - Number 8
Year of Publication: 2012
Authors: Safa’a I. Hajeer
10.5120/8440-2224

Safa’a I. Hajeer. Comparison on the Effectiveness of Different Statistical Similarity Measures. International Journal of Computer Applications. 53, 8 (September 2012), 14-19. DOI=10.5120/8440-2224

@article{ 10.5120/8440-2224,
author = { Safa’a I. Hajeer },
title = { Comparison on the Effectiveness of Different Statistical Similarity Measures },
journal = { International Journal of Computer Applications },
issue_date = { September 2012 },
volume = { 53 },
number = { 8 },
month = { September },
year = { 2012 },
issn = { 0975-8887 },
pages = { 14-19 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume53/number8/8440-2224/ },
doi = { 10.5120/8440-2224 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:53:35.189098+05:30
%A Safa’a I. Hajeer
%T Comparison on the Effectiveness of Different Statistical Similarity Measures
%J International Journal of Computer Applications
%@ 0975-8887
%V 53
%N 8
%P 14-19
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document retrieval is the process of matching a stated user query against a set of free-text records (documents); it is one major technique for organizing and managing information. This project studied which of the different statistical similarity measures in IR is most effective for document retrieval, using a unified set of documents. The results show that the Cosine Similarity Measure outperforms the other seven measures (Inner Product, Dice Coefficient, Jaccard Coefficient, Inclusion Similarity Coefficient, Overlap Coefficient Measure, Euclidean Distance Measure, and Manhattan Distance Measure (City Block Distance)) for both languages, with a precision of 38% and a recall of 53.2% on the Arabic collection, and a precision of 25% and a recall of 65% on the English collection.
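As a rough illustration of the measures named in the abstract, the sketch below applies common textbook forms of seven of them to term-weight vectors, together with set-based precision and recall. The paper's exact formulas and weighting scheme are not given on this page, so the function names, the toy vectors, and the vector-space generalizations of Dice, Jaccard, and Overlap are assumptions. The Inclusion Similarity Coefficient is left out because its definition varies between sources.

```python
import math

# Assumption: the query and each document are equal-length lists of
# non-negative term weights (e.g. term frequencies or tf-idf values).

def inner_product(x, y):
    """Sum of pairwise products of the term weights."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Inner product normalized by the two vector lengths."""
    denom = math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    return inner_product(x, y) / denom if denom else 0.0

def dice(x, y):
    """Twice the inner product over the sum of squared lengths."""
    denom = inner_product(x, x) + inner_product(y, y)
    return 2 * inner_product(x, y) / denom if denom else 0.0

def jaccard(x, y):
    """Inner product over (sum of squared lengths minus inner product)."""
    dot = inner_product(x, y)
    denom = inner_product(x, x) + inner_product(y, y) - dot
    return dot / denom if denom else 0.0

def overlap(x, y):
    """Inner product over the smaller of the two squared lengths."""
    denom = min(inner_product(x, x), inner_product(y, y))
    return inner_product(x, y) / denom if denom else 0.0

def euclidean(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City block distance; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(x, y))

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    # Hypothetical toy data: three index terms, one query, two documents.
    query = [1, 1, 0]
    docs = {"d1": [1, 0, 1], "d2": [2, 1, 0]}
    ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
    print("cosine ranking:", ranked)
    print("precision, recall:", precision_recall(ranked[:1], ["d2"]))
```

Note that the first five functions are similarities (rank in descending order), while the two distance measures need ascending order when used for ranking.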

Index Terms

Computer Science
Information Sciences

Keywords

Information Retrieval (IR)
Vector space model
Ranking algorithm
Similarity Measures