International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Number 47
Year of Publication: 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
DOI: 10.5120/ijca2025925785

Dimitrios Papakyriakou and Ioannis S. Barbounakis. Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters. International Journal of Computer Applications 187(47):6-17, October 2025. DOI=10.5120/ijca2025925785
The increasing demand for lightweight, energy-efficient deep learning models at the edge has fueled interest in training convolutional neural networks (CNNs) directly on ARM-based CPU clusters. This study examines the feasibility and performance constraints of distributed training for the compact SqueezeNet v1.1 architecture, implemented with an MPI-based parallel framework on a Beowulf cluster of Raspberry Pi devices. Experimental evaluation across up to 24 Raspberry Pi nodes (48 MPI processes) reveals a sharp trade-off between training acceleration and model generalization: while wall-clock training time improves by more than 11× under increased parallelism, test accuracy collapses to chance-level performance (≈10%) as the data partition per process becomes excessively small. This behavior marks a statistical scaling limit beyond which computational gains are offset by learning inefficiency. The findings are consistent with the statistical bottlenecks identified by Shallue et al. (2019) [11], extending their observations from large-scale GPU/CPU systems to energy-constrained ARM-based edge clusters. They underscore the importance of balanced task decomposition in CPU-bound environments and contribute new insights into the interplay between model compactness, data sparsity, and parallel training efficiency in edge-AI systems. This framework also provides a viable low-power platform for real-time SNN research on edge devices.
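The abstract describes an MPI-based data-parallel scheme in which every process trains a full model replica on its own shard of the data, so the per-process partition shrinks as the process count grows (for example, a 50,000-image training set split across 48 ranks leaves only about 1,042 images per rank). The paper's actual implementation is not reproduced here; the following is a minimal sketch of that pattern, assuming mpi4py and PyTorch with torchvision's SqueezeNet v1.1. The dataset (CIFAR-10), hyperparameters, and epoch count are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch of MPI data-parallel SGD: every rank holds a full model
# replica, trains on a 1/size shard of the data, and averages gradients
# with Allreduce after each batch so all replicas stay in sync.
import torch
import torch.nn.functional as F
from mpi4py import MPI
from torchvision import datasets, transforms
from torchvision.models import squeezenet1_1

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

torch.manual_seed(0)  # same seed -> identical initial weights on every rank
model = squeezenet1_1(num_classes=10)      # weights=None by default: train from scratch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Rank 0 downloads once; the rest wait, then everyone loads from disk.
if rank == 0:
    datasets.CIFAR10("data", train=True, download=True)
comm.Barrier()
full = datasets.CIFAR10("data", train=True, transform=transforms.ToTensor())

# Strided shard: rank k keeps samples k, k+size, k+2*size, ...
# This is the per-process partition that shrinks as more ranks are added.
shard = torch.utils.data.Subset(full, range(rank, len(full), size))
loader = torch.utils.data.DataLoader(shard, batch_size=32, shuffle=True)

def average_gradients(model):
    """Sum each gradient across ranks in place, then divide by world size."""
    for p in model.parameters():
        if p.grad is None:
            continue
        buf = p.grad.numpy()               # shares memory with the CPU tensor
        comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
        p.grad.div_(size)

for epoch in range(2):                     # epoch count is illustrative
    for x, y in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        average_gradients(model)           # synchronize replicas before the update
        optimizer.step()
    if rank == 0:
        print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```

A script like this would be launched with, e.g., mpiexec -n 48 --hostfile hosts python train.py (script name and host list are placeholders). Because the Allreduce keeps every replica identical after each step, the only quantity that changes with scale is the shard size len(full) // size, which is exactly the statistical bottleneck the abstract describes.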