March, 2nd - March, 8th
- ee09119
- Mar 8, 2015
- 3 min read
The second week of this study was focused on the development of the Proof-of-Concept for the inter-cluster Cloud Bursting solution. Initially, the Proof-of-Concept was purely based on hardcoded decisions. The application queries the user for the job type, the job input file and the number of jobs to submit. (This behaviour is not intended for the final solution, but is very useful for debugging.) After the user submitted the requested data, the Load Balancer checked the number of submitted jobs to decide whether to launch the jobs locally or to burst them to the public Cloud. If the number of submitted jobs was higher than a given hardcoded threshold, the local data center was filled to capacity and the remaining jobs were burst to a locally simulated public Cloud. The decision was based on a hardcoded threshold mainly because the goal of this phase was to test whether the SSH commands to the cluster's Resource Manager were working as intended. During the first tests, I concluded that establishing an SSH connection to issue simple commands from the Python application was straightforward. However, issuing multiple jobs to run in parallel in the background was problematic at first: as soon as the jobs were submitted and the SSH connection was closed, job execution was killed. Having found a solution for this issue, it was immediately possible to see the simple initial Proof-of-Concept working as intended.
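As an illustration only (the post does not detail the exact fix that was used), a common way to keep a remotely submitted job alive after the SSH session ends is to detach it from the session, for example with nohup, output redirection and a trailing ampersand. Below is a minimal Python sketch using the paramiko library; the host, credentials and job command are hypothetical placeholders:

```python
# Minimal sketch: submit a job over SSH so it survives the disconnect.
# Host, credentials and the job command are hypothetical placeholders.
import paramiko

def submit_job(host, user, key_path, job_cmd):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=user, key_filename=key_path)

    # nohup + output redirection + '&' detaches the job from the SSH session,
    # so closing the connection no longer kills it.
    detached_cmd = "nohup {} > job.log 2>&1 &".format(job_cmd)
    client.exec_command(detached_cmd)
    client.close()

submit_job("resource-manager.local", "hadoop", "~/.ssh/id_rsa",
           "hadoop jar wordcount.jar input.txt output/")
```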
After successfully developing a basic Proof-of-Concept, the next step was to discover a way to retrieve the cluster's resource status. The Load Balancer needs to be dynamic and must only burst jobs to the public Cloud if the local data center cannot comply with the job deadline. To be able to estimate whether the local data center can achieve job deadline compliance, the Load Balancer must know the available cluster capacity. Our study found that the Hadoop YARN Resource Manager provides a RESTful API that returns the information we're looking for. Through this API, our Load Balancer will be able to retrieve resource utilization information such as the current cluster capacity, the number of running and pending apps (jobs in our case), the cluster's state (running or stopped), used memory and vCPUs, the number of used containers, and many other values.
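For reference, the kind of query involved looks like the sketch below, which reads the ResourceManager's cluster metrics endpoint. The ResourceManager address is a placeholder (8088 is the default web port); the specific fields used by the Load Balancer may differ.

```python
# Sketch: query the YARN ResourceManager REST API for cluster metrics.
import requests

RM_URL = "http://resource-manager.local:8088"  # placeholder address

def get_cluster_metrics():
    resp = requests.get(RM_URL + "/ws/v1/cluster/metrics")
    resp.raise_for_status()
    return resp.json()["clusterMetrics"]

metrics = get_cluster_metrics()
print("Running apps:        ", metrics["appsRunning"])
print("Pending apps:        ", metrics["appsPending"])
print("Allocated memory (MB):", metrics["allocatedMB"])
print("Available memory (MB):", metrics["availableMB"])
print("Allocated containers: ", metrics["containersAllocated"])
```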
With this information, the Proof-of-Concept was upgraded to periodically check the cluster's resources and calculate the average resource usage. For now I am only concerned with the number of used containers, even though I have access to more information. Furthermore, I have developed an algorithm to estimate the maximum number of containers required to process a given input file. When a job is submitted, the Load Balancer analyzes its input file and estimates the maximum number of required containers. Then, the Load Balancer computes the number of free containers in the cluster and checks whether there are enough of them for the newly submitted job. If there are, the job is launched locally; if not, the job is burst to the public Cloud (a sketch of this decision logic follows).
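The sketch below illustrates the placement decision under assumed details: the container estimate here simply divides the input size by an assumed split size, which is only an illustrative stand-in for the estimation algorithm described above, and the cluster capacity is a hypothetical constant.

```python
# Sketch of the Load Balancer's placement decision (assumed details).
import math
import os

SPLIT_SIZE_MB = 128       # assumed HDFS block/split size
TOTAL_CONTAINERS = 16     # assumed total container capacity of the cluster

def estimate_max_containers(input_path):
    """Illustrative guess: one map container per input split."""
    size_mb = os.path.getsize(input_path) / (1024 * 1024)
    return max(1, math.ceil(size_mb / SPLIT_SIZE_MB))

def place_job(input_path, containers_in_use):
    required = estimate_max_containers(input_path)
    free = TOTAL_CONTAINERS - containers_in_use
    if required <= free:
        return "local"         # enough free containers in the local cluster
    return "public-cloud"      # otherwise burst to the (simulated) public Cloud
```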
NEXT OBJECTIVES: The algorithm for container estimation must be further tested. Due to incorrectly set parameters in the Hadoop configuration files, Hadoop thinks it has more RAM than it actually has. As a result, the number of containers reported as in use does not match the number actually in use, and moreover, the maximum number of containers is never reached (even though the cluster's used capacity is at 100%). The Hadoop configuration parameters need to be corrected and new tests need to be run. Afterwards, work on the Network Gatherer will begin.