Research Article

A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag

by Vipul Saluja, Jyotshna Dongardive
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 59
Year of Publication: 2025
Authors: Vipul Saluja, Jyotshna Dongardive
DOI: 10.5120/ijca2025926001

Vipul Saluja, Jyotshna Dongardive. A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag. International Journal of Computer Applications. 187, 59 (Nov 2025), 36-43. DOI=10.5120/ijca2025926001

@article{ 10.5120/ijca2025926001,
author = { Vipul Saluja and Jyotshna Dongardive },
title = { A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag },
journal = { International Journal of Computer Applications },
issue_date = { Nov 2025 },
volume = { 187 },
number = { 59 },
month = { Nov },
year = { 2025 },
issn = { 0975-8887 },
pages = { 36-43 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number59/a-pos-tagged-corpus-for-dogri-development-and-annotation-using-dogritag/ },
doi = { 10.5120/ijca2025926001 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2025-11-18T21:11:27.059583+05:30
%A Vipul Saluja
%A Jyotshna Dongardive
%T A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 59
%P 36-43
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper describes the creation of a linguistically selected, manually annotated part-of-speech (POS) tagged corpus for Dogri. Dogri is a low-resource Indo-Aryan language spoken in the Indian Union Territory of Jammu and Kashmir and in some regions of Pakistan. Despite its sizeable number of speakers and its official recognition, Dogri is poorly represented in Natural Language Processing (NLP), owing to the absence of resources such as defined tag sets, annotated corpora, and annotation-specific tools. To fill this gap, a POS-tagged Dogri corpus was developed from a domain-specific subset of data from the Linguistic Data Consortium for Indian Languages (LDC-IL). The corpus contains about 25,000 sentences (approximately 400,000 tokens). Annotation was carried out on DogriTag, a specialized web platform developed for this purpose that supports audit tracking and semi-automated tag suggestions. Annotation quality was assessed through inter-annotator agreement analysis, which yielded a Cohen's Kappa score of 0.89, indicating strong agreement. The resource provides a foundation for building Dogri NLP tools such as POS taggers, syntactic parsers, and morphological analyzers. Future work will include expanding the tag set, using pretrained language models for cross-lingual transfer, and covering more domains.
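
For readers unfamiliar with the agreement statistic reported in the abstract, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators who tagged the same tokens. It is an illustration only, not the authors' evaluation code: the function name `cohen_kappa`, the tag labels, and the ten-token toy data are all assumptions made for this example and are not drawn from the Dogri corpus or the DogriTag tag set.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("Both annotators must label the same non-empty token list.")

    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both annotators.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: for each tag, the product of the two annotators'
    # marginal proportions, summed over all tags.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: placeholder tags, not real corpus annotations.
annotator_1 = ["NN", "VB", "NN", "JJ", "NN", "VB", "PSP", "NN", "JJ", "VB"]
annotator_2 = ["NN", "VB", "NN", "JJ", "NN", "VB", "PSP", "JJ", "JJ", "VB"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```

In this toy case the annotators agree on 9 of 10 tokens (observed agreement 0.90), while agreement expected by chance is 0.28, so Kappa is (0.90 - 0.28) / (1 - 0.28), roughly 0.861. Kappa is lower than raw agreement because it discounts matches that would occur by chance given each annotator's tag distribution.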

Index Terms

Computer Science
Information Sciences

Keywords

Dogri language; part-of-speech tagging; ILPOSTS; low-resource NLP; annotated corpus; Indian languages; inter-annotator agreement; linguistic annotation; web-based annotation tool; Indo-Aryan languages