top of page

Cloud Bursting for Apache Hadoop YARN

This page introduces our proposed Cloud Bursting solution for a local private data center running Apache Hadoop YARN. It starts by describing the type of job which our Hadoop cluster will handle and follows by detailing how the solution will be implemented. 

Packet Capture (PCAP)

The monitoring and analyzing of network traffic is a crucial task of every network administrator. It provides a means to detect network bottlenecks, maintain the efficiency of the network, solve network problems, and determine if security policies are being followed.

 

Packet Capture (PCAP) is an application programming interface (API) used to capture network traffic that is traversing over a network. It is implemented in the libcap library, which is also used by many open source and commercial tools such as packet sniffers, network monitors, network intrusion detection systems, and others. PCAP allows the collection of all packets on a network and supports storage of captured packets for later analysis.

 

In our study, we use our Hadoop YARN cluster and the MapReduce programming paradigm to process PCAP files and perform batched network analysis. We implement a traffic monitoring system with batched analysis through the submission of small MapReduce jobs. Our traffic gatherer captures small batches of data and immediately submits a MapReduce job to analyze the batch of network data and extract useful network information.  

Map-Intensive Network Analysis Job

There are many issues that can be detected and solved through network analysis. Our aim is to provide the Network Administrator with a means to detect network intrusions through packet payload inspection. We envision a deep packet inspection mechanism that is essentially based on the idea of string comparisons. With the PCAP input format, Hadoop is able to find the packets in the PCAP file, thus enabling the inspection of their payloads to search for patterns that suggest an intrusion or some sort of malware. 

 

Most of the processing is intended to run during the map phase. The aggregation of the results, that is, the aggregation of the occurrences where a pattern was found is done during the reduce phase. The resulting MapReduce job is a map-intensive job as most of the processing is done in the map phase. 

Architecture for the proposed solution

The implementation of an online traffic monitoring system with real-time analysis usually requires more powerful hardware to handle the real-time processing and analyzing of data. Our study implements a batched traffic monitoring system using MapRe- duce jobs in a Hadoop YARN cluster. We use a network traffic gatherer that continuously gathers batches of network traffic and feeds them to the Hadoop cluster.

 

The idea is to launch small MapReduce jobs each time a batch of network data is collected. Thus, the size of the launched jobs at a given moment fully depends on the amount of generated network traffic at that moment. If the amount of generated network traffic grows beyond the capacity of our local cluster, analysis of the traffic will be slowed down. So, to guarantee that our infrastructure can handle peaks of network traffic without delaying the analyses, we propose a Cloud Bursting capable infrastructure that automatically bursts workload when the local cluster is overloaded or cannot launch further jobs without delaying them. 

 

The figure below depicts the architecture of our Cloud Bursting capable solution:

Our solution collects varying amounts of network traffic and stores them as PCAP files. The PCAP files are then fed to the inter-cluster Load Balancer. The inter-cluster Load Balancer is the key element that powers our Cloud Bursting solution. As batches of network data arrive, the Load Balancer has to decide which cluster is suitable to process the arriving batches. To make this decision, it has to rely on pre-specified criteria or metrics that are configured by an administrator and have threshold values. There are a wide variety of metrics that can be used to determine when to trigger a burst to the public cloud. The most common involve system-level metrics such as CPU utilization, RAM usage, and disk/network utilization, or application-level metrics such as application response time. For our scenario, our bursting algorithm will focus on four specific metrics: cost of bursting, resource utilization, batch size, and more importantly the guarantee that jobs are not delayed by the simultaneous execution of other jobs. 

 

The fundamental criterion is the guarantee that jobs execute as if the whole cluster was ded- icated to them. To ensure that jobs are not delayed, other metrics need to be taken into account. Naturally, resource utilization and network traffic batch size are both fundamental metrics to mon- itor, as these allow the Load Balancer to estimate if the local launch of the job will not delayed it or if additional remote resources are required. Usage of additional remote resources brings about another important metric: the cost of using remote resources. In this sense, it is crucial to note that usage of local resources should always be preferred, as it incurs no cost. The allocation of re- sources in a public cloud such as Amazon AWS, however, does incur costs. Thus, if the execution of a job can be performed without delays using local resources only, the allocation of additional resources in a public cloud should be discarded. 

 

In the end, our bursting mechanism will need to consider these metrics and attempt to mini- mize delays and the number of bursts, thus maximizing local resource usage and minimizing costs.

bottom of page