
Motivation

In the age of the Internet of Things, all sorts of devices are turning into connected gadgets that can share information over the Internet and improve our lives. Today there are millions of these devices generating unabating amounts of data that need to be processed and analyzed. These large data sets are also known as Big Data. The ability to extract value from Big Data is a major asset for competing companies. However, the processing of Big Data is unsuitable for traditional computing systems. Apache Hadoop is a framework designed to handle these large data sets. Yet, as the number of connected devices grows, so does the amount of streamed data to be processed. One of the biggest issues with this increasing amount of streamed data is its unpredictable variability. Existing infrastructures might not be able to handle unpredictable surges in demand for processing power, leading to unwanted delays. Many companies have already invested enough in their data centers to meet most of their needs, making it impractical to buy extra hardware just to handle the occasional peak in demand for processing power. Migrating the whole application to a public Cloud infrastructure, though possibly cheaper than expanding the private data center because servers are only rented when needed, is still expensive and could compromise private and sensitive data. We propose a hybrid solution that offers the best of both worlds: it addresses the need for elasticity when unpredicted demand for processing power appears, while preserving local privacy where needed.

Objective

Our study presents the state of the art in High Availability techniques that can be deployed in a Hadoop cluster to reduce downtime and delays and to increase throughput. More importantly, it focuses on a specific technique called Cloud Bursting, which enhances throughput and avoids server overloads by complementing a local private infrastructure with additional on-demand resources from a public Cloud. This hybrid Cloud model is a trending architecture that many service-providing companies are adopting, as it offers scalability when the need arises. We propose a Cloud Bursting solution for a Hadoop cluster that processes and analyzes network traffic through the Packet Capture (PCAP) application programming interface (API). Our solution is powered by a custom inter-cluster Load Balancer that decides when to burst jobs to the public Cloud based on a set of metrics.
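To make the job side of the solution more concrete, the sketch below shows a minimal Hadoop MapReduce job in the spirit of the PCAP traffic analysis described above. It deliberately does not commit to a specific PCAP InputFormat: it assumes the captured packets have already been flattened into text records of the form timestamp,srcIP,dstIP,bytes (a hypothetical intermediate format, not the project's actual input path) and simply aggregates the number of bytes sent per source IP.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Aggregates bytes per source IP from flattened packet records (illustrative only). */
public class TrafficBySource {

    public static class PacketMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Expected record layout: timestamp,srcIP,dstIP,bytes
            String[] fields = line.toString().split(",");
            if (fields.length == 4) {
                context.write(new Text(fields[1]), new LongWritable(Long.parseLong(fields[3])));
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text srcIp, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            context.write(srcIp, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "traffic-by-source");
        job.setJarByClass(TrafficBySource.class);
        job.setMapperClass(PacketMapper.class);
        job.setCombinerClass(SumReducer.class); // byte counts are associative, so a combiner is safe
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Jobs of this shape are what the Load Balancer schedules either on the local cluster or, when bursting, on the public Cloud.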

Main challenges

We have encountered two main challenges while developing the proposed solution:

 

  • Optimizing Hadoop's performance for our PCAP MapReduce jobs

  • Developing a completion time prediction algorithm for the PCAP MapReduce jobs

 

The biggest and most concerning challenge is the development of the completion time prediction algorithm, which is one of the key elements of the Load Balancer. While there are a few analytical models for estimating job completion times, they do not seem to fit well when more than one job is running concurrently in the cluster. For the time being, we have chosen a simpler approach based on the input file size and the analytical models. However, we believe the best solution to this issue is to use machine learning and develop an algorithm capable of learning by itself from the jobs that get executed. Since this can be very time consuming, it has been put on hold.
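As an illustration of that simpler approach, the sketch below estimates a job's completion time from the size of its input batch, in the spirit of the analytical models mentioned above: the number of map tasks is derived from the input size and the split size, and the map and reduce phases are assumed to run in waves over the available task capacity. All constants are placeholders, not values measured on our cluster.

```java
/** Rough, size-based completion-time estimate for a MapReduce job (illustrative sketch). */
public class CompletionTimeEstimator {

    // Placeholder figures; in practice these would come from profiling previous runs.
    private static final long SPLIT_SIZE_BYTES = 128L * 1024 * 1024; // one map task per HDFS split
    private static final double AVG_MAP_TASK_SECONDS = 20.0;
    private static final double AVG_REDUCE_TASK_SECONDS = 45.0;

    /**
     * Estimates completion time assuming the job has the cluster's map and reduce
     * capacity to itself and that tasks run in full waves.
     */
    public static double estimateSeconds(long inputBytes, int reduceTasks,
                                         int mapCapacity, int reduceCapacity) {
        long mapTasks = Math.max(1, (inputBytes + SPLIT_SIZE_BYTES - 1) / SPLIT_SIZE_BYTES);
        double mapWaves = Math.ceil((double) mapTasks / mapCapacity);
        double reduceWaves = Math.ceil((double) reduceTasks / reduceCapacity);
        return mapWaves * AVG_MAP_TASK_SECONDS + reduceWaves * AVG_REDUCE_TASK_SECONDS;
    }

    public static void main(String[] args) {
        // Example: a 10 GB PCAP batch, 8 reducers, 14 map and 7 reduce containers locally.
        double estimate = estimateSeconds(10L * 1024 * 1024 * 1024, 8, 14, 7);
        System.out.printf("Estimated completion time: %.0f s%n", estimate);
    }
}
```

As noted above, an estimate of this kind assumes the job runs alone and degrades once several jobs compete for the same containers.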


How the proposed solution will be evaluated

To evaluate our proposed Cloud Bursting solution, a fixed-size local Hadoop YARN cluster will be used. For now, we have chosen eight nodes (one Master and seven Slaves), but this setup may change throughout the study should we conclude that a different size is more appropriate. Initial testing will use another local (but segregated) Hadoop cluster standing in for the public Cloud.

 

The fundamental criterion is the guarantee that jobs execute as if the whole cluster were dedicated to them. To ensure that jobs are not delayed, other metrics need to be taken into account. Naturally, resource utilization and network traffic batch size are both fundamental metrics to monitor, as they allow the Load Balancer to estimate whether launching the job locally would delay it or whether additional remote resources are required. The use of additional remote resources brings about another important metric: their cost. In this sense, it is crucial to note that local resources should always be preferred, as they incur no extra cost.
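To illustrate how these metrics could come together, the sketch below shows one possible shape for the Load Balancer's burst decision: a job stays on the free local cluster whenever its completion-time estimate meets the batch deadline and the cluster still has headroom, and is burst to the public Cloud otherwise. The thresholds and figures are illustrative, not the values used in our evaluation.

```java
/** Illustrative sketch of the inter-cluster Load Balancer's burst decision. */
public class BurstDecision {

    /**
     * @param estLocalSeconds    completion-time estimate if the job runs locally
     * @param deadlineSeconds    latest acceptable completion time for this batch
     * @param clusterUtilization fraction of local resources currently in use (0.0 to 1.0)
     * @param maxUtilization     utilization above which the local cluster is considered overloaded
     * @return true if the job should be burst to the public Cloud
     */
    public static boolean shouldBurst(double estLocalSeconds, double deadlineSeconds,
                                      double clusterUtilization, double maxUtilization) {
        // Local resources incur no extra cost, so they are always the first choice.
        boolean localOnTime = estLocalSeconds <= deadlineSeconds;
        boolean localHasHeadroom = clusterUtilization <= maxUtilization;
        return !(localOnTime && localHasHeadroom);
    }

    public static void main(String[] args) {
        // A 15-minute estimate against a 10-minute deadline on an overloaded cluster: burst.
        System.out.println(shouldBurst(900, 600, 0.92, 0.85)); // true
        // A 5-minute estimate with plenty of local headroom: keep the job local.
        System.out.println(shouldBurst(300, 600, 0.60, 0.85)); // false
    }
}
```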
