International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 107 - Number 11
Year of Publication: 2014
Authors: Aadesh Neupane
DOI: 10.5120/18799-0315
Aadesh Neupane. Development of Nepali Character Database for Character Recognition based on Clustering. International Journal of Computer Applications. 107, 11 (December 2014), 42-46. DOI=10.5120/18799-0315
Character recognition requires a large, reliable dataset on which to apply recognition algorithms and build efficient models. For the Nepali language, no such character dataset exists for character recognition research, at least in the public domain. Nepali has 36 consonant characters and 12 vowel characters, and each vowel can modify each consonant. In all, there can be a total of 446 characters, including the Nepali numerals. Manually creating a dataset of Nepali characters therefore requires considerable effort, cost, and time. This paper describes a way of creating a Nepali character dataset using a semi-supervised clustering approach that minimizes effort and time. An existing segmentation algorithm [1] is also optimized to segment Nepali characters from both handwritten and scanned Nepali text. Complex features are extracted from the segmented characters by applying the Discrete Cosine Transform and the wavelet transform. These features are then used to create a database of Nepali characters using pHash and k-means clustering. At present, the database contains 38,493 characters distributed among 52 clusters.
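The hashing and clustering step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each segmented character is already a small square grayscale grid, computes a DCT-based perceptual hash in the spirit of pHash, and groups the hashes with a plain k-means whose starting centroids come from a few hand-labelled examples (one possible reading of "semi-supervised"). The wavelet features are omitted, and the names `dct_1d`, `dct_2d`, `phash`, `hamming`, `nearest`, and `kmeans`, as well as the `keep` parameter, are hypothetical.

```python
import math


def dct_1d(values):
    """Unnormalised DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(values[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def dct_2d(image):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def phash(image, keep=4):
    """Perceptual hash: threshold low-frequency DCT terms at their median.

    `keep` (an assumed parameter) is the side of the low-frequency block;
    the DC term is dropped so overall brightness does not dominate.
    """
    coeffs = dct_2d(image)
    low = [coeffs[r][c] for r in range(keep) for c in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    return tuple(1 if value > median else 0 for value in low)


def hamming(a, b):
    """Number of positions at which two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(point, centroids[i])),
    )


def kmeans(points, seeds, iters=10):
    """Plain k-means started from `seeds` (assumed hand-labelled examples)."""
    centroids = [tuple(seed) for seed in seeds]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return [nearest(p, centroids) for p in points], centroids
```

In use, each segmented character image would be passed through `phash`, and the resulting bit tuples handed to `kmeans` together with the hashes of a few labelled characters as seeds; `hamming` gives a quick way to inspect how close two characters are before clustering.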