Research Article

Performance Tuning and Scheduling of Large Data Set Analysis in Map Reduce Paradigm by Optimal Configuration using Hadoop

by Sasiniveda. G, Revathi. N
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 70 - Number 21
Year of Publication: 2013
Authors: Sasiniveda. G, Revathi. N
10.5120/12195-8335

Sasiniveda. G, Revathi. N. Performance Tuning and Scheduling of Large Data Set Analysis in Map Reduce Paradigm by Optimal Configuration using Hadoop. International Journal of Computer Applications. 70, 21 (May 2013), 37-41. DOI=10.5120/12195-8335

@article{ 10.5120/12195-8335,
author = { Sasiniveda. G, Revathi. N },
title = { Performance Tuning and Scheduling of Large Data Set Analysis in Map Reduce Paradigm by Optimal Configuration using Hadoop },
journal = { International Journal of Computer Applications },
issue_date = { May 2013 },
volume = { 70 },
number = { 21 },
month = { May },
year = { 2013 },
issn = { 0975-8887 },
pages = { 37-41 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume70/number21/12195-8335/ },
doi = { 10.5120/12195-8335 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:33:29.861153+05:30
%A Sasiniveda. G
%A Revathi. N
%T Performance Tuning and Scheduling of Large Data Set Analysis in Map Reduce Paradigm by Optimal Configuration using Hadoop
%J International Journal of Computer Applications
%@ 0975-8887
%V 70
%N 21
%P 37-41
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data analysis is an important function in cloud computing, allowing huge amounts of data to be processed over very large clusters. Hadoop is a software framework for large-scale data analysis. It provides the Hadoop Distributed File System (HDFS), which is designed to stream very large data sets, while the analysis and transformation of those data sets is performed using the MapReduce paradigm. MapReduce is a programming model widely used for processing large data sets, and it is a popular way to handle data in the cloud environment because of its excellent scalability and good fault tolerance. However, the Hadoop MapReduce system is often unfair in its allocation of resources, and the proposed Mapper Reducer System achieves a dramatic improvement. The proposed Mapper Reducer function uses a mean shift clustering based algorithm to analyze the data set and executes jobs with better performance by choosing an optimal configuration of mappers and reducers based on the size of the data sets. It also lets users view the status of a job and localize errors in scheduled jobs. In this way the performance tuning properties of optimized scheduled jobs are used efficiently. As a result, system cost, energy usage and management complexity are substantially lowered, and the performance of the system increases.
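
The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea: group data sets by size with a one-dimensional, flat-kernel mean shift, then pick a mapper and reducer count per group. The sample sizes, the bandwidth, the 64 MB split size and the reducer table are all illustrative assumptions, not values taken from the paper.

```python
import math

def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-9):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    within `bandwidth` until it stops moving. Returns one mode per point."""
    modes = []
    for p in points:
        x = float(p)
        for _ in range(max_iter):
            window = [q for q in points if abs(q - x) <= bandwidth]
            new_x = sum(window) / len(window)
            converged = abs(new_x - x) < tol
            x = new_x
            if converged:
                break
        modes.append(x)
    return modes

def group_modes(modes, bandwidth):
    """Merge modes that lie within `bandwidth` of each other into clusters."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

def map_tasks_for(size_mb, split_mb=64):
    """One map task per input split; the 64 MB split size is an assumption."""
    return math.ceil(size_mb / split_mb)

# Hypothetical reducer counts for the smallest, middle and largest size class.
REDUCERS_BY_RANK = [1, 4, 16]

# Hypothetical data set sizes in MB; clustering is done on log10(size).
sizes_mb = [64, 80, 100, 900, 1000, 1100, 9000, 10000]
log_sizes = [math.log10(s) for s in sizes_mb]

modes = mean_shift_1d(log_sizes, bandwidth=0.3)
centers, labels = group_modes(modes, bandwidth=0.3)
ranks = {i: sorted(centers).index(c) for i, c in enumerate(centers)}

for size, label in zip(sizes_mb, labels):
    rank = ranks[label]
    reducers = REDUCERS_BY_RANK[min(rank, len(REDUCERS_BY_RANK) - 1)]
    print(size, "MB -> class", rank, "mappers", map_tasks_for(size), "reducers", reducers)
```

Clustering on the logarithm of the size keeps data sets of the same order of magnitude together; a real tuner would calibrate the bandwidth and the per-class settings against measured job run times.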

Index Terms

Computer Science
Information Sciences

Keywords

Cloud Computing, Hadoop Distributed File System, Performance Tuning, Mean Shift Clustering, Amazon Web Services