Research Article

Pipeline for Real-time Anomaly Detection in Log Data Streams using Apache Kafka and Apache Spark

by Poojitha G., Sowmyarani C. N.
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 182 - Number 24
Year of Publication: 2018
Authors: Poojitha G., Sowmyarani C. N.
10.5120/ijca2018917942

Poojitha G., Sowmyarani C. N. Pipeline for Real-time Anomaly Detection in Log Data Streams using Apache Kafka and Apache Spark. International Journal of Computer Applications. 182, 24 (Oct 2018), 8-13. DOI=10.5120/ijca2018917942

@article{ 10.5120/ijca2018917942,
author = { Poojitha G. and Sowmyarani C. N. },
title = { Pipeline for Real-time Anomaly Detection in Log Data Streams using Apache Kafka and Apache Spark },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2018 },
volume = { 182 },
number = { 24 },
month = { Oct },
year = { 2018 },
issn = { 0975-8887 },
pages = { 8-13 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume182/number24/30079-2018917942/ },
doi = { 10.5120/ijca2018917942 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:12:19.131933+05:30
%A Poojitha G.
%A Sowmyarani C. N.
%T Pipeline for Real-time Anomaly Detection in Log Data Streams using Apache Kafka and Apache Spark
%J International Journal of Computer Applications
%@ 0975-8887
%V 182
%N 24
%P 8-13
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Anomaly detection is among the most critical tasks in building a trustworthy and secure system. Its aim is to detect significant deviations of system behavior from normal behavior. The approach is widely applied to static data, for instance dumps of log data. Most systems, however, require real-time detection in order to limit the harm caused by an anomaly that goes unnoticed or is detected late. Recent implementations of anomaly detection are mostly based on self-learning methods, and machine learning has significantly transformed the field. One family of methods relies on clustering algorithms. The implementation discussed in this paper uses a time-series evaluation approach for anomaly detection. The paper explains the pipeline built for anomaly detection and the visualization of the results.
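
To make the idea of time-series evaluation concrete, the following is a minimal sketch and not the authors' implementation. It assumes log events have already been aggregated into per-interval counts, and it swaps in a simple trailing-window z-score test in place of whatever model the paper uses. The function name detect_anomalies, the window and threshold parameters, and the sample counts are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(counts, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations (illustrative only)."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(counts):
        # Only evaluate once a full window of past values is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Guard against a flat window, where sigma is zero.
            if abs(value - mu) > threshold * max(sigma, 1e-9):
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical per-minute log-event counts; the spike at index 7 is synthetic.
counts = [100, 102, 98, 101, 99, 100, 103, 250, 101, 99]
print(detect_anomalies(counts))
```

In a streaming setting the same check would run on each new count as it arrives rather than on a complete list. A learned forecaster could replace the trailing mean as the source of the expected value, with the deviation test left unchanged.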

Index Terms

Computer Science
Information Sciences

Keywords

Anomaly detection, real-time, Elastic Stack, Long Short-Term Memory, Apache Spark