| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 69 |
| Year of Publication: 2025 |
| Authors: S. Winston Cruz, G. Roch Libia Rani |
| DOI: 10.5120/ijca2025926157 |
S. Winston Cruz, G. Roch Libia Rani. Web Scraping Localized Parallel Multilingual Help Content in Indian Languages. International Journal of Computer Applications. 187, 69 (Dec 2025), 35-42. DOI=10.5120/ijca2025926157
The need for multilingual corpora has grown sharply with developments in web mining and large language models (LLMs). Extracting multilingual data from websites is one way of building parallel corpora, and controlled use of web scraping is a useful technique for creating such a corpus. Among the various types of localized content on the web, which also include raw machine translations, content produced by human translators and reviewers is the most useful, followed by machine-translated content that has undergone human post-editing. The help center documents and the terms and conditions documents that websites publish in different languages fall into these two categories. In this paper, such content is manually identified and the issues in scraping it are discussed. Python code that uses the BeautifulSoup library to extract these materials in Indian languages such as Hindi, Kannada and Tamil is presented. The concerns related to aligning the extracted content in parallel with its English source are then discussed. Finally, details of the sample parallel corpus extracted are analyzed and presented.
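As a rough illustration of the extraction and alignment steps the abstract describes, the following minimal sketch parses an English help page and its Hindi counterpart with BeautifulSoup and pairs their segments by position. It is not the authors' code: the inline HTML snippets, the function names `extract_segments` and `align_by_position`, and the choice of `h1` and `p` tags are all assumptions made for illustration, and position-based pairing only works when both pages share the same structure.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippets standing in for an English help page
# and its Hindi localization (a real scraper would download these).
EN_HTML = """
<html><body><article>
  <h1>Reset your password</h1>
  <p>Open the settings page.</p>
  <p>Select  "Reset password".</p>
</article></body></html>
"""

HI_HTML = """
<html><body><article>
  <h1>अपना पासवर्ड रीसेट करें</h1>
  <p>सेटिंग पेज खोलें।</p>
  <p>"पासवर्ड रीसेट करें" चुनें।</p>
</article></body></html>
"""


def extract_segments(html, tags=("h1", "p")):
    """Return the non-empty text of the given tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    segments = []
    for node in soup.find_all(list(tags)):
        # Collapse runs of whitespace left over from the page layout.
        text = " ".join(node.get_text().split())
        if text:
            segments.append(text)
    return segments


def align_by_position(source_segments, target_segments):
    """Pair segments one-to-one; refuse to guess when the counts differ."""
    if len(source_segments) != len(target_segments):
        raise ValueError(
            f"segment counts differ: {len(source_segments)} "
            f"vs {len(target_segments)}"
        )
    return list(zip(source_segments, target_segments))


pairs = align_by_position(extract_segments(EN_HTML), extract_segments(HI_HTML))
for en, hi in pairs:
    # Tab-separated output is one common plain-text layout for parallel data.
    print(f"{en}\t{hi}")
```

Raising an error on mismatched segment counts, rather than truncating, is a deliberate choice in this sketch: a localized page that adds or drops a paragraph would otherwise shift every later pair out of alignment without any warning.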