A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag

Vipul Saluja; Jyotshna Dongardive


International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Number 59

Year of Publication: 2025

Authors: Vipul Saluja, Jyotshna Dongardive

10.5120/ijca2025926001

Vipul Saluja, Jyotshna Dongardive. A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag. International Journal of Computer Applications. 187, 59 (Nov 2025), 36-43. DOI=10.5120/ijca2025926001

@article{10.5120/ijca2025926001,
  author = {Vipul Saluja and Jyotshna Dongardive},
  title = {A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag},
  journal = {International Journal of Computer Applications},
  issue_date = {Nov 2025},
  volume = {187},
  number = {59},
  month = {Nov},
  year = {2025},
  issn = {0975-8887},
  pages = {36-43},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume187/number59/a-pos-tagged-corpus-for-dogri-development-and-annotation-using-dogritag/},
  doi = {10.5120/ijca2025926001},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2025-11-18T21:11:27.059583+05:30
%A Vipul Saluja
%A Jyotshna Dongardive
%T A POS-Tagged Corpus for Dogri: Development and Annotation using DogriTag
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 59
%P 36-43
%D 2025
%I Foundation of Computer Science (FCS), NY, USA

Abstract

This paper describes the process used to create a linguistically selected, manually annotated Part-of-Speech (POS) tagged corpus for Dogri. Dogri is a low-resource Indo-Aryan language spoken in the Indian Union Territory of Jammu and Kashmir and in some regions of Pakistan. Despite its sizeable speaker population and official recognition, Dogri is poorly represented in Natural Language Processing (NLP) because it lacks resources such as defined tag sets, annotated corpora, and annotation-specific tools. To fill this gap, a POS-tagged Dogri corpus was developed from a domain-specific subset of the Linguistic Data Consortium for Indian Languages (LDC-IL). The corpus contains about 25,000 sentences (approximately 400,000 tokens). Annotation was carried out on DogriTag, a specialized web platform developed for this purpose that tracks audits and makes semi-automated tag suggestions. Annotation quality was checked with an inter-annotator agreement analysis, which yielded a Cohen's Kappa score of 0.89, indicating strong agreement between annotators. The resource provides a foundation for building Dogri NLP tools such as POS taggers, syntactic parsers, and morphological analyzers. Future work will include extending the tag set, using pretrained language models for cross-lingual transfer, and covering more domains.
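To make the agreement figure concrete, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators who tagged the same tokens. It is not the authors' evaluation code: the function name `cohens_kappa`, the toy tag sequences, and the tag labels (`NN`, `VM`, `JJ`, `PSP`) are illustrative assumptions and are not taken from the Dogri corpus or the DogriTag platform.

```python
from collections import Counter


def cohens_kappa(tags_a, tags_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(tags_a)

    # Observed agreement: share of tokens that received the same tag.
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement, estimated from each annotator's tag frequencies.
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical tag throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations of ten tokens (assumed for illustration).
annotator_1 = ["NN", "VM", "NN", "JJ", "PSP", "NN", "VM", "NN", "JJ", "VM"]
annotator_2 = ["NN", "VM", "NN", "NN", "PSP", "NN", "VM", "NN", "JJ", "VM"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a frequent tag that both annotators use often contributes less than the raw match rate suggests. In the toy data the annotators agree on 9 of 10 tokens, yet kappa is lower than 0.9 for this reason.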


Index Terms

Computer Science

Information Sciences

Keywords

Dogri language; Part-of-speech tagging; ILPOSTS; low-resource NLP; annotated corpus; Indian languages; inter-annotator agreement; linguistic annotation; web-based annotation tool; Indo-Aryan languages