International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Number 47
Year of Publication: 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
DOI: 10.5120/ijca2025925785

Dimitrios Papakyriakou and Ioannis S. Barbounakis. Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters. International Journal of Computer Applications 187(47):6-17, October 2025. DOI=10.5120/ijca2025925785
The increasing demand for lightweight, energy-efficient deep learning models at the edge has fueled interest in training convolutional neural networks (CNNs) directly on ARM-based CPU clusters. This study examines the feasibility and performance constraints of distributed training for the compact SqueezeNet v1.1 architecture, implemented with an MPI-based parallel framework on a Beowulf cluster of Raspberry Pi devices. Experimental evaluation across up to 24 Raspberry Pi nodes (48 MPI processes) reveals a sharp trade-off between training acceleration and model generalization: while wall-clock training time improves by more than 11× under increased parallelism, test accuracy collapses to chance-level performance (≈10%) as the data partition per process becomes excessively small. This behavior marks a statistical scaling limit beyond which computational gains are offset by learning inefficiency. The findings are consistent with the statistical bottlenecks identified by Shallue et al. (2019) [11], extending their observations from large-scale GPU/CPU systems to energy-constrained ARM-based edge clusters. They underscore the importance of balanced task decomposition in CPU-bound environments and contribute new insights into the interplay between model compactness, data sparsity, and parallel training efficiency in edge-AI systems. This framework also provides a viable low-power platform for real-time SNN research on edge devices.
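The abstract describes an MPI-based data-parallel scheme in which every process trains a full model replica on its own shard of the data, so the per-process partition shrinks as the process count grows (for example, a 50,000-image training set split across 48 ranks leaves only about 1,042 images per rank). The paper's actual implementation is not reproduced here; the following is a minimal sketch of that pattern, assuming mpi4py and PyTorch with torchvision's SqueezeNet v1.1. The dataset (CIFAR-10), hyperparameters, and epoch count are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch of MPI data-parallel SGD: every rank holds a full model
# replica, trains on a 1/size shard of the data, and averages gradients
# with Allreduce after each batch so all replicas stay in sync.
import torch
import torch.nn.functional as F
from mpi4py import MPI
from torchvision import datasets, transforms
from torchvision.models import squeezenet1_1

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

torch.manual_seed(0)  # same seed -> identical initial weights on every rank
model = squeezenet1_1(num_classes=10)      # weights=None by default: train from scratch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Rank 0 downloads once; the rest wait, then everyone loads from disk.
if rank == 0:
    datasets.CIFAR10("data", train=True, download=True)
comm.Barrier()
full = datasets.CIFAR10("data", train=True, transform=transforms.ToTensor())

# Strided shard: rank k keeps samples k, k+size, k+2*size, ...
# This is the per-process partition that shrinks as more ranks are added.
shard = torch.utils.data.Subset(full, range(rank, len(full), size))
loader = torch.utils.data.DataLoader(shard, batch_size=32, shuffle=True)

def average_gradients(model):
    """Sum each gradient across ranks in place, then divide by world size."""
    for p in model.parameters():
        if p.grad is None:
            continue
        buf = p.grad.numpy()               # shares memory with the CPU tensor
        comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
        p.grad.div_(size)

for epoch in range(2):                     # epoch count is illustrative
    for x, y in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        average_gradients(model)           # synchronize replicas before the update
        optimizer.step()
    if rank == 0:
        print(f"epoch {epoch}: last batch loss {loss.item():.3f}")
```

A script like this would be launched with, e.g., mpiexec -n 48 --hostfile hosts python train.py (script name and host list are placeholders). Because the Allreduce keeps every replica identical after each step, the only quantity that changes with scale is the shard size len(full) // size, which is exactly the statistical bottleneck the abstract describes.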