International Conference on Simulations in Computing Nexus
Foundation of Computer Science USA
ICSCN - Number 3
May 2014
Authors: R. Manopriya, C. P. Saranya |
R. Manopriya, C. P. Saranya. A Survey on Workload Classification and Job Scheduling by using Johnson's Algorithm under Hadoop Environment. International Conference on Simulations in Computing Nexus. ICSCN, 3 (May 2014), 11-14.
Big data deals with large datasets and focuses on storing, sharing and processing the data. Organisations face difficulties in creating, manipulating and managing such datasets. For example, on a social media site such as Facebook, every post on a page can receive likes, shares and comments each second, which creates large datasets that are troublesome to store and process. Big data involves massive volumes of both structured and unstructured data. The major problem in the big data community is workload classification and the scheduling of jobs with respect to the disks. Existing MapReduce-based approaches identify the computation time of individual jobs on a machine rather than minimizing the overall computation time of the entire set of jobs. The MapReduce algorithm is first applied to reduce the larger datasets to a smaller output dataset. MapReduce processes data in two phases: map and reduce. In the map phase, the given radar input dataset is split into individual key-value pairs and an intermediate output is obtained; in the reduce phase, those key-value pairs undergo shuffle and sort operations. Intermediate files created by map tasks are written to local disk, and output files are written to the Hadoop distributed file system. Different types of jobs are assigned to different disks for scheduling. Johnson's algorithm is used to obtain an optimal schedule, one with the minimum overall completion time, among the different jobs submitted to the Hadoop environment. Job type and data locality are two important factors in the job scheduling process. The performance of individual disks is analysed on the basis of the size of the dataset and the number of nodes formed.
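
The scheduling idea can be illustrated with a minimal sketch. It assumes that "Johnson's algorithm" here means Johnson's rule for a two-stage flow shop, with the map phase as the first stage and the reduce phase as the second, and that the duration of each phase is known in advance for every job. The abstract does not spell out these details, so the function names `johnson_order` and `makespan` and the job durations below are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

# A job is (name, map_time, reduce_time). The durations are assumed to be
# known beforehand; the values used below are invented for illustration.
Job = Tuple[str, int, int]


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by Johnson's rule for a two-stage flow shop.

    Jobs whose map time does not exceed their reduce time run first,
    in increasing order of map time. The remaining jobs run last,
    in decreasing order of reduce time.
    """
    front = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    back = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: j[2], reverse=True)
    return front + back


def makespan(order: List[Job]) -> int:
    """Completion time of the last reduce phase for a given job order.

    Assumes one job at a time occupies each stage, and a job's reduce
    phase cannot start before its own map phase has finished.
    """
    map_done = 0
    reduce_done = 0
    for _, map_time, reduce_time in order:
        map_done += map_time
        reduce_done = max(reduce_done, map_done) + reduce_time
    return reduce_done


if __name__ == "__main__":
    jobs = [("A", 3, 6), ("B", 5, 2), ("C", 1, 2), ("D", 6, 6), ("E", 7, 5)]
    ordered = johnson_order(jobs)
    print([name for name, _, _ in ordered])
    print("submission order makespan:", makespan(jobs))
    print("Johnson order makespan:", makespan(ordered))
```

For these five hypothetical jobs, the rule yields the order C, A, D, E, B with a makespan of 24 time units, compared with 27 when the jobs run in submission order. The sketch models only the ordering decision; it does not account for data locality, job type, or the assignment of jobs to disks, which the abstract names as further factors.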