Conclusions

In this dissertation, we introduce the importance of embracing Big Data and present Apache Hadoop, a framework capable of handling such large data sets. We define High Availability and present the state of the art in high availability techniques that can be implemented in Hadoop clusters. We then describe the problem with investing in highly available, more powerful hardware to improve the availability of private data centers and to deal with unexpected peaks in computing demand. Not only is investing in local resources expensive, but it also usually leads to underutilization. The hybrid cloud model and Cloud Bursting are introduced as solutions capable of providing the scalability required for peaks in demand, while minimizing investment costs and maximizing resource utilization.

We present a Cloud Bursting-capable solution for a Hadoop cluster running MapReduce jobs to process and analyze network traffic. The analysis of the captured network traffic is performed by our map-intensive signature matching job, which, based on a set of rules from the Snort community, is able to detect intrusions or similar attacks. Our Cloud Bursting-capable solution is an inter-cluster Load Balancer that bursts jobs when the local cluster is overloaded and cannot process further jobs without delaying them. To achieve this, the Load Balancer closely monitors the local cluster’s resource utilization, the batch size of the captured network traffic, and the possibility of delays to decide whether or not to burst a job. To develop our Load Balancer, we create a map-intensive job that simulates the signature matching logic and use it together with synthetic traffic generated by Iperf to design a model of the job’s behavior, based on the concept of map waves. This model, though simple, is essential to the Load Balancer, because it allows the Load Balancer to estimate wave and job completion times, which inform the bursting decision. We thoroughly describe our solution and detail the development of the Load Balancer.
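As an illustration of the map wave concept summarized above, the following is a minimal sketch, not the Load Balancer's actual implementation. It assumes that a batch is divided into fixed-size input splits, that each split yields one map task, that each free container runs one map task per wave, and that every wave takes roughly the same time. All function and parameter names (`estimate_map_waves`, `wave_seconds`, `deadline_seconds`, and so on) are hypothetical and chosen only for this example.

```python
import math


def estimate_map_waves(batch_size_mb, split_size_mb, free_containers):
    """Estimate how many map waves a batch needs (hypothetical helper).

    Assumes one map task per input split and one map task per free
    container in each wave.
    """
    if free_containers <= 0:
        raise ValueError("at least one free container is required")
    map_tasks = math.ceil(batch_size_mb / split_size_mb)
    return math.ceil(map_tasks / free_containers)


def estimate_job_seconds(batch_size_mb, split_size_mb, free_containers,
                         wave_seconds):
    """Estimate job completion time as waves times the time per wave.

    Assumes every wave takes about wave_seconds, which ignores the
    reduce phase and scheduling overhead.
    """
    waves = estimate_map_waves(batch_size_mb, split_size_mb, free_containers)
    return waves * wave_seconds


def should_burst(batch_size_mb, split_size_mb, free_containers,
                 wave_seconds, deadline_seconds):
    """Decide whether to burst a job (illustrative rule only).

    Bursts when no container is free, or when the estimated local
    completion time exceeds the deadline, for example the time until
    the next batch arrives.
    """
    if free_containers <= 0:
        return True
    estimate = estimate_job_seconds(batch_size_mb, split_size_mb,
                                    free_containers, wave_seconds)
    return estimate > deadline_seconds
```

Under these assumptions, a 1024 MB batch with 128 MB splits and four free containers yields eight map tasks in two waves. With 30-second waves the job is estimated at 60 seconds, so it would run locally against a 60-second deadline, whereas a batch twice as large would be burst.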

Finally, we submit our solution to a series of test cases and present the results in well-defined sections. A simple version of the Load Balancer, where only container utilization and the execution of other jobs are considered, is tested against a more advanced version with all of the features. We use synthetic traffic generated by Iperf to validate the concept and the development steps of our Cloud Bursting-capable Load Balancer. It becomes clear that the need for the Load Balancer is real, as peaks in network traffic can lead to severe delays in the execution of the jobs that follow. Our Load Balancer then undergoes testing with the signature matching job together with real network traffic captured at the router in our network labs. Lastly, we test our Load Balancer in a heterogeneous cluster with real traffic. While using the simple Load Balancer is always better than using no Load Balancer at all, we conclude that the more advanced version provides the best results by minimizing the number of bursts and maximizing local resource usage.

Thus, we believe that the Cloud Bursting technique is in fact a viable option to improve the availability of Hadoop clusters aimed at analyzing network traffic with our map-intensive signature matching MapReduce job. We are convinced that this subject should be studied further and that our solution should be developed further to better suit clusters with different specifications. In particular, our solution could be upgraded to a full application that provides system administrators with a graphical user interface and powerful tools to automatically adapt our solution to the cluster’s specifications upon installation. Among other features, it should be able to determine the cluster’s steady state limit, indicate whether the cluster needs an upgrade based on the number of bursts, provide visual and detailed alerts when attacks occur, and display cluster utilization statistics.


Pedro Rocha Gonçalves
