March 9th - March 15th
- ee09119
- Mar 15, 2015
- 2 min read
At the start of this week, the local cluster was reconfigured to take into account the actual amount of RAM and number of vCPUs per node and per container. Our Hadoop cluster uses the capacity scheduler, which by default limits the number of ApplicationMasters (jobs) running simultaneously in the cluster to a percentage of the available resources. The default value limited the number of concurrent jobs to 1, which makes no sense for our study. After altering the capacity scheduler configuration file, up to 4 jobs can now run simultaneously, which is more appropriate for our case.
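As a back-of-the-envelope illustration of why the default capped us at a single job, the sketch below approximates how the capacity scheduler's yarn.scheduler.capacity.maximum-am-resource-percent property bounds the number of concurrent ApplicationMasters; the helper function and all numbers are illustrative assumptions, not our actual cluster figures.

```python
# Rough sketch of how maximum-am-resource-percent caps the number of
# concurrently running ApplicationMasters (i.e. jobs).
# All figures are illustrative, not measurements from our cluster.

def max_concurrent_ams(cluster_memory_mb, am_memory_mb, max_am_resource_percent):
    """Memory reserved for ApplicationMasters divided by the memory one AM requests."""
    am_pool_mb = cluster_memory_mb * max_am_resource_percent
    # Assume at least one AM is always admitted so a job is never starved entirely.
    return max(1, int(am_pool_mb // am_memory_mb))

# Example: 4 nodes x 8 GB of RAM, 2 GB requested per ApplicationMaster.
print(max_concurrent_ams(32768, 2048, 0.1))  # default 10% -> roughly 1 job at a time
print(max_concurrent_ams(32768, 2048, 0.4))  # raised to 40% -> up to about 6 jobs
```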
With this configuration, the Load Balancer can successfully burst jobs to the public Cloud when the local cluster does not have enough containers to process a newly submitted job. However, basing the burst decision solely on the number of available containers leads to erroneous decisions. Imagine the situation in which the local cluster is idling and a new job is submitted that requires more containers than the whole cluster offers: the job gets burst to the Cloud, which is undesirable since the local cluster is idling. To counter this situation, the Load Balancer first considers the number of available containers and, if they are not enough for the submitted job, checks whether the cluster is fully utilized. If it is not, the job is processed with fewer containers than it requires, which leads to delays but incurs no additional cost (as opposed to processing in the Cloud). Currently, to check whether the cluster is fully utilized, the Load Balancer compares the actual utilization against a hardcoded threshold. It would be interesting to make this procedure more dynamic.
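A minimal sketch of this decision logic, assuming a hypothetical decide_placement helper and an illustrative 80% utilization threshold (the actual hardcoded value in the Load Balancer may differ):

```python
# Hypothetical sketch of the burst decision described above; names and the
# threshold are illustrative, not the actual Load Balancer implementation.
UTILIZATION_THRESHOLD = 0.8  # stand-in for the hardcoded value mentioned in the text

def decide_placement(required_containers, available_containers, cluster_utilization):
    """Return 'local' or 'cloud' for a newly submitted job."""
    if available_containers >= required_containers:
        return "local"      # enough free containers: keep the job local
    if cluster_utilization < UTILIZATION_THRESHOLD:
        # Cluster is not fully utilized (e.g. idling): run locally with fewer
        # containers than requested; slower, but no additional Cloud cost.
        return "local"
    return "cloud"          # cluster is saturated: burst the job to the Cloud

# Example: a job needing 12 containers while only 3 are free on an idling cluster.
print(decide_placement(required_containers=12, available_containers=3,
                       cluster_utilization=0.1))  # -> 'local'
```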
A lot of research has also been done regarding the estimation of job completion time. There are many papers on predicting job completion time in MapReduce, providing analytical models that calculate lower and upper bounds. However, after testing them in our cluster, these analytical models only seem to produce estimates close to the real measured values when the job whose completion time is being predicted runs by itself in the cluster, i.e., no other jobs run simultaneously (a single application/job in the cluster). The analytical models are based on the job profile (average Map time, average Reduce time, etc.), which varies when other jobs are running in the cluster, mainly because data locality is degraded and the shuffle phase takes longer. Furthermore, predicting how long a newly submitted job will take to complete based on the currently available resources (containers) can lead to pessimistic predictions when the running jobs are very short and unstable (in terms of container/resource utilization). These are the challenges that this study is currently facing and intends to address in the following weeks.
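For context, the per-phase makespan bounds that this line of work commonly builds on have roughly the following shape, given the number of tasks, the number of containers, and the average and maximum task durations taken from the job profile; the numbers below are placeholders, not measurements from our cluster.

```python
# Sketch of the classic lower/upper makespan bounds used in MapReduce
# completion-time models. Profile values below are placeholders.

def phase_bounds(num_tasks, num_slots, avg_duration, max_duration):
    """Bounds on the time to run num_tasks tasks on num_slots containers."""
    lower = num_tasks * avg_duration / num_slots
    upper = (num_tasks - 1) * avg_duration / num_slots + max_duration
    return lower, upper

# Example: 64 map tasks on 8 containers, averaging 20 s with a 35 s straggler,
# followed by 32 reduce tasks averaging 40 s with a 70 s straggler.
map_low, map_up = phase_bounds(64, 8, 20.0, 35.0)
red_low, red_up = phase_bounds(32, 8, 40.0, 70.0)

# A rough job-level estimate sums the per-phase bounds; in practice the shuffle
# overlaps with the map phase and contention skews the profile, which is exactly
# where these estimates start to drift from what we measure.
print(f"completion time between {map_low + red_low:.0f}s and {map_up + red_up:.0f}s")
```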
NEXT OBJECTIVES: Meet with Professor Ricardo Morla to discuss how to approach the described issues and work out a plan to tackle them.