Research Article

Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 47
Year of Publication: 2025
DOI: 10.5120/ijca2025925785

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters. International Journal of Computer Applications. 187, 47 (Oct 2025), 6-17. DOI=10.5120/ijca2025925785

@article{ 10.5120/ijca2025925785,
author = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
title = { Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2025 },
volume = { 187 },
number = { 47 },
month = { Oct },
year = { 2025 },
issn = { 0975-8887 },
pages = { 6-17 },
numpages = { 12 },
url = { https://ijcaonline.org/archives/volume187/number47/deep-learning-for-edge-ai-squeezenet-cnn-training-on-distributed-arm-based-clusters/ },
doi = { 10.5120/ijca2025925785 },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}
%0 Journal Article
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-based Clusters
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 47
%P 6-17
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The increasing demand for lightweight, energy-efficient deep learning models at the edge has fueled interest in training convolutional neural networks (CNNs) directly on ARM-based CPU clusters. This study examines the feasibility and performance constraints of distributed training for the compact SqueezeNet v1.1 architecture, implemented with an MPI-based parallel framework on a Beowulf cluster of Raspberry Pi devices. Experimental evaluation across up to 24 Raspberry Pi nodes (48 MPI processes) reveals a sharp trade-off between training acceleration and model generalization. While wall-clock training time improves by more than 11× under increased parallelism, test accuracy deteriorates significantly, collapsing to chance-level performance (≈10%) as the data partition per process becomes excessively small. This behavior highlights a statistical scaling limit beyond which computational gains are offset by learning inefficiency. The findings are consistent with the statistical bottlenecks identified by Shallue et al. (2019) [11], extending their observations from large-scale GPU/CPU systems to energy-constrained ARM-based edge clusters. These results underscore the importance of balanced task decomposition in CPU-bound environments and contribute new insights into the interplay between model compactness, data sparsity, and parallel training efficiency in edge-AI systems. The framework also provides a viable low-power platform for real-time SNN research on edge devices.
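The accuracy collapse described in the abstract follows directly from the arithmetic of an even data-parallel split: each MPI rank trains on only 1/P of the dataset, so per-rank sample counts shrink linearly with process count. The sketch below (plain Python, no MPI dependency) illustrates this; the 50,000-sample, 10-class training set is an assumption for illustration (consistent with the ≈10% chance-level accuracy quoted above, but not stated in the abstract), and the 48-process figure matches the paper's largest configuration.

```python
# Illustrative sketch of per-rank data-partition sizes under strong scaling.
# ASSUMPTION: a 10-class, 50,000-sample training set (e.g. CIFAR-10-sized);
# the paper's abstract does not name the dataset.

def shard_size(n_samples: int, n_procs: int) -> int:
    """Samples seen by each MPI rank under an even data-parallel split."""
    return n_samples // n_procs

def steps_per_epoch(n_samples: int, n_procs: int, local_batch: int) -> int:
    """Optimizer steps per epoch when each rank uses the same local batch."""
    return max(1, shard_size(n_samples, n_procs) // local_batch)

if __name__ == "__main__":
    n_samples = 50_000   # assumed training-set size
    local_batch = 32     # assumed per-rank batch size
    for n_procs in (1, 8, 16, 32, 48):
        print(f"{n_procs:>2} ranks -> "
              f"{shard_size(n_samples, n_procs):>6} samples/rank, "
              f"{steps_per_epoch(n_samples, n_procs, local_batch):>4} steps/epoch")
```

At 48 ranks each process sees only ~1,041 samples, so each class is represented by roughly 100 examples per rank; the sketch makes concrete why statistical efficiency, not compute, becomes the binding constraint.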

References
  1. Shi, W., Cao, J., Zhang, Q., Li, Y., & Xu, L. (2016). Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5), 637–646. https://doi.org/10.1109/JIOT.2016.2579198
  2. Li, S., Xu, L. D., & Zhao, S. (2018). 5G Internet of Things: A survey. Journal of Industrial Information Integration, 10, 1–9. https://doi.org/10.1016/j.jii.2018.01.005
  3. Sze, V., Chen, Y. H., Yang, T. J., & Emer, J. S. (2017). Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740
  4. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. https://arxiv.org/abs/1704.04861
  5. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv Preprint. https://doi.org/10.48550/arXiv.1602.07360
  6. Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2017). Pruning filters for efficient convnets. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1608.08710
  7. Ramesh, S., & Chakrabarty, K. (2021). Challenges and opportunities in training deep neural networks on edge devices. ACM Transactions on Embedded Computing Systems (TECS), 20(5s), 1–26. https://doi.org/10.1145/3477084
  8. Raspberry Pi 4 Model B. [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/
  9. Raspberry Pi 4 Model B specifications. [Online]. Available: https://magpi.raspberrypi.com/articles/raspberry-pi-4-specs-benchmarks.
  10. Masters, D., & Luschi, C. (2018). Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612. https://arxiv.org/abs/1804.07612
  11. Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., & Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112), 1–49. http://jmlr.org/papers/v20/18-789.html
  12. Ben-Nun, T., & Hoefler, T. (2019). Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys, 52(4), 1–43. https://doi.org/10.1145/3320060
Index Terms

Computer Science
Information Sciences

Keywords

SqueezeNet, Distributed Deep Learning, Edge Computing, Raspberry Pi Cluster, Beowulf Cluster, ARM Architecture, MPI (Message Passing Interface), Low-Power AI, Strong Scaling, Model Generalization, Statistical Scaling Limit