Research Article

Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means

by Satish Gopalani, Rohan Arora
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 113 - Number 1
Year of Publication: 2015
Authors: Satish Gopalani, Rohan Arora
DOI: 10.5120/19788-0531

Satish Gopalani, Rohan Arora. Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications. 113, 1 (March 2015), 8-11. DOI=10.5120/19788-0531

@article{ 10.5120/19788-0531,
author = { Satish Gopalani and Rohan Arora },
title = { Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means },
journal = { International Journal of Computer Applications },
issue_date = { March 2015 },
volume = { 113 },
number = { 1 },
month = { March },
year = { 2015 },
issn = { 0975-8887 },
pages = { 8-11 },
numpages = {4},
url = { https://ijcaonline.org/archives/volume113/number1/19788-0531/ },
doi = { 10.5120/19788-0531 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:49:48.008191+05:30
%A Satish Gopalani
%A Rohan Arora
%T Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means
%J International Journal of Computer Applications
%@ 0975-8887
%V 113
%N 1
%P 8-11
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Big Data has long been a topic of fascination for Computer Science enthusiasts around the world, and it has gained even more prominence in recent times with the continuous explosion of data from sources such as social media and the quest of tech giants for deeper analysis of their data. This paper discusses two frameworks, Hadoop Map Reduce and the recently introduced Apache Spark, both of which provide a processing model for analyzing big data. Although both options are based on the concept of Big Data, their performance varies significantly with the use case being implemented. This is what makes the two options worth analyzing with respect to their variability and variety in the dynamic field of Big Data. In this paper we compare the two frameworks and provide a performance analysis using a standard machine learning algorithm for clustering (K-Means).

Index Terms

Computer Science
Information Sciences

Keywords

Big data, Hadoop, HDFS, Map Reduce, Spark, Mahout, MLlib, Machine learning, K-Means