| International Journal of Computer Applications |
| Foundation of Computer Science (FCS), NY, USA |
| Volume 187 - Number 82 |
| Year of Publication: 2026 |
| Authors: Amitesh Kumar Jha, Rajwant Singh Rao |
| DOI: 10.5120/ijca2026926434 |
Amitesh Kumar Jha, Rajwant Singh Rao. Beyond Single-Scale Vision Transformers: Multi-Scale Feature Fusion for Robust Scene and Document Text Recognition. International Journal of Computer Applications. 187, 82 (Feb 2026), 29-42. DOI=10.5120/ijca2026926434
Transformer-based Optical Character Recognition (OCR) systems have recently demonstrated strong performance by modeling long-range dependencies in text images. However, most existing approaches rely on single-scale visual representations, which limits their robustness to variable font sizes, degraded characters, and complex document layouts. This study proposes a Multi-Scale Feature-Based Transformer (MSFT-OCR) that explicitly integrates fine-, mid-, and coarse-scale visual features using scale-aware attention mechanisms. Through inter-scale attention, the proposed architecture lets character-level details interact with global word-level context. Extensive experiments on scene text and document OCR benchmarks (IIIT5K-Words, SVT, and IAM) demonstrate that the proposed method consistently outperforms single-scale Transformer models in terms of character accuracy (CA, %), word accuracy (WA, %), and normalized edit distance (NED, %). Ablation studies and attention visualizations further validate the effectiveness of multi-scale modeling in text recognition.
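The abstract names its evaluation metrics but does not define them. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. It assumes that CA is one minus the total edit distance divided by the total number of reference characters, that WA is the share of exact-match predictions, and that NED is reported as one minus the mean length-normalized edit distance, so that higher is better for all three. The function names `edit_distance` and `ocr_metrics` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_metrics(predictions, references):
    """Return CA, WA and NED as percentages (assumed definitions, see lead-in)."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty prediction and reference lists")

    total_edits = 0   # summed edit distance over all samples
    total_chars = 0   # summed reference length over all samples
    exact = 0         # number of predictions identical to their reference
    ned_sum = 0.0     # summed per-sample normalized edit distance

    for pred, ref in zip(predictions, references):
        dist = edit_distance(pred, ref)
        total_edits += dist
        total_chars += len(ref)
        exact += int(pred == ref)
        longest = max(len(pred), len(ref))
        ned_sum += dist / longest if longest else 0.0

    n = len(references)
    ca = max(0.0, 1.0 - total_edits / total_chars) if total_chars else 0.0
    return {
        "CA": 100.0 * ca,
        "WA": 100.0 * exact / n,
        "NED": 100.0 * (1.0 - ned_sum / n),
    }
```

Calling `ocr_metrics(["hello", "wor1d"], ["hello", "world"])` scores one exact match and one single-character substitution. Papers differ on whether NED is normalized by the reference length or by the longer of the two strings, so the paper's own definitions should take precedence over this sketch.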