Research Article

Web Scraping Localized Parallel Multilingual Help Content in Indian Languages

by S. Winston Cruz, G. Roch Libia Rani
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 69
Year of Publication: 2025
Authors: S. Winston Cruz, G. Roch Libia Rani
10.5120/ijca2025926157

S. Winston Cruz, G. Roch Libia Rani. Web Scraping Localized Parallel Multilingual Help Content in Indian Languages. International Journal of Computer Applications. 187, 69 (Dec 2025), 35-42. DOI=10.5120/ijca2025926157

@article{ 10.5120/ijca2025926157,
author = { S. Winston Cruz and G. Roch Libia Rani },
title = { Web Scraping Localized Parallel Multilingual Help Content in Indian Languages },
journal = { International Journal of Computer Applications },
issue_date = { Dec 2025 },
volume = { 187 },
number = { 69 },
month = { Dec },
year = { 2025 },
issn = { 0975-8887 },
pages = { 35-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number69/web-scraping-localized-parallel-multilingual-help-content-in-indian-languages/ },
doi = { 10.5120/ijca2025926157 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2025-12-24T19:35:38.327193+05:30
%A S. Winston Cruz
%A G. Roch Libia Rani
%T Web Scraping Localized Parallel Multilingual Help Content in Indian Languages
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 69
%P 35-42
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The need for multilingual corpora has grown sharply with developments in web mining and large language models (LLMs). Extracting multilingual data from websites is one way of developing parallel corpora, and controlled use of web scraping is a useful technique for creating such a corpus. Among the various types of localized content on the web, which include raw machine translations, content produced by human translators and reviewers is the most useful, followed by machine-translated content that has undergone human post-editing. The help center documents and the terms and conditions documents that websites publish in different languages fall under these categories. In this paper, such content is manually identified and the issues in scraping it are discussed. Python code that uses the BeautifulSoup library to extract these materials in Indian languages such as Hindi, Kannada and Tamil is presented. The concerns related to aligning the extracted content in parallel with its English source are then discussed. Finally, details of the sample parallel corpus extracted are analyzed and presented.
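
The extraction and alignment steps summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the two HTML snippets, their made-up sample sentences, and the helper names `extract_segments` and `align_by_position` are assumptions introduced here for illustration. It assumes the localized pages share the same markup structure, so that segments can be paired by position, and it parses hard-coded strings rather than fetching live pages.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for an English help page and its Hindi counterpart.
# The sentences are invented sample text, not content taken from any website.
PAGES = {
    "en": (
        "<html><body><nav>Home</nav><article><h1>Wake iPad</h1>"
        "<p>Press the top button.</p><p>Swipe up to unlock.</p>"
        "</article></body></html>"
    ),
    "hi": (
        "<html><body><nav>होम</nav><article><h1>iPad सक्रिय करें</h1>"
        "<p>शीर्ष बटन दबाएँ।</p><p>अनलॉक करने के लिए ऊपर स्वाइप करें।</p>"
        "</article></body></html>"
    ),
}


def extract_segments(html: str) -> list[str]:
    """Return the visible text of headings, paragraphs and list items."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict to the main content container when present, which skips
    # navigation text that is not part of the help content.
    container = soup.find("article") or soup
    segments = []
    for tag in container.find_all(["h1", "h2", "h3", "p", "li"]):
        text = tag.get_text(" ", strip=True)
        if text:
            segments.append(text)
    return segments


def align_by_position(source: list[str], target: list[str]) -> list[tuple[str, str]]:
    """Pair segments one-to-one; refuse to guess when the counts differ."""
    if len(source) != len(target):
        raise ValueError(
            f"segment counts differ: {len(source)} source vs {len(target)} target"
        )
    return list(zip(source, target))


if __name__ == "__main__":
    english = extract_segments(PAGES["en"])
    hindi = extract_segments(PAGES["hi"])
    for en_text, hi_text in align_by_position(english, hindi):
        print(f"{en_text}\t{hi_text}")
```

Positional pairing is the simplest possible alignment strategy and fails as soon as one language version adds, drops or merges a segment, which is why the sketch raises an error on a count mismatch instead of producing silently misaligned pairs. Any real use would also need to respect each site's terms of service and robots.txt before fetching pages.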

Index Terms

Computer Science
Information Sciences
Parallel corpora
Indian languages
web mining

Keywords

Web scraping
BeautifulSoup
localization
Tamil
Kannada
Hindi