
March 16th - March 22nd

  • ee09119
  • Mar 25, 2015
  • 2 min read

During this week, we realized that the data locality of the PCAP MapReduce jobs was far from ideal. It made no sense to us that the jobs were launching rack-local tasks (i.e. tasks that run on a different node than the one holding their data) when free containers existed on the nodes where the data was located. The total number of containers in the cluster was more than double the number required to run the job. We therefore analyzed what was actually happening and what could be done to improve the data locality.
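Hadoop exposes the relevant numbers through its job counters (`DATA_LOCAL_MAPS`, `RACK_LOCAL_MAPS`, `OTHER_LOCAL_MAPS` in the job summary). A minimal sketch of the ratio we were monitoring, with hypothetical counter values:

```python
def locality_ratio(data_local, rack_local, other_local=0):
    """Fraction of map tasks that ran on a node holding their input split.

    The arguments mirror Hadoop's JobCounter values DATA_LOCAL_MAPS,
    RACK_LOCAL_MAPS and OTHER_LOCAL_MAPS from the job summary.
    """
    total = data_local + rack_local + other_local
    return data_local / total if total else 0.0

# Hypothetical counters from one job execution: 38 node-local maps,
# 10 rack-local maps -> roughly 0.79 data locality.
print(locality_ratio(data_local=38, rack_local=10))
```

A perfectly placed job would report a ratio of 1.0 (zero rack-local tasks).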

After reading a document from Hortonworks®, it became clear that our cluster was still not correctly configured: roughly two containers should be allowed per physical disk. The VMs in our cluster were therefore resized so that only one VM runs on each physical machine, occupying its full resources. As a consequence, the number of virtual Hadoop worker nodes decreased, leaving fewer containers in the cluster. Running the same jobs now fully utilizes the cluster, since fewer containers are available to process tasks in parallel, but this turned out to improve data locality, with some job executions displaying zero rack-local tasks. It also became clear that the bigger the job (the size of the input file and, as a consequence, the number of splits to process), the more likely Hadoop is to achieve good data locality.
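The published Hortonworks sizing guideline caps the container count per worker node by disks, cores and memory; the "two containers per disk" rule is one term of that cap. A sketch of the rule, where the exact coefficients and the example node specs are assumptions taken from the published guide rather than from our cluster:

```python
def max_containers(cores, disks, ram_gb, min_container_gb=2):
    """Upper bound on YARN containers for one worker node, following the
    Hortonworks sizing guideline: roughly two containers per physical
    disk, capped by CPU cores and by available memory divided by the
    minimum container size.
    """
    return min(2 * cores, int(1.8 * disks), int(ram_gb / min_container_gb))

# Hypothetical worker node: 8 cores, 4 physical disks, 32 GB of RAM.
# The disk term (1.8 * 4 = 7) is the binding limit here.
print(max_containers(cores=8, disks=4, ram_gb=32))
```

With one VM per physical machine, this bound is computed once per machine instead of being split (and effectively inflated) across several co-located VMs sharing the same disks.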

Furthermore, we concluded that obtaining good estimates of job completion time will require a smart self-learning predictor, one that cannot be based solely on analytical models. The best solution would be an online machine learning algorithm capable of learning from past job executions and predicting from them. After meeting with Professor Ricardo Morla, we agreed to deal with this later, since such algorithms can be very complex. For now we will focus on a simpler prediction model and improve it if time allows.
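As an illustration of what even the simple end of that spectrum looks like, here is a minimal predictor that refits a closed-form least-squares line (completion time vs. input size) after every observed run. This is purely a sketch of the idea, not the model we will implement:

```python
class RuntimePredictor:
    """Toy online predictor: fits time = slope * size + intercept by
    ordinary least squares over all runs observed so far."""

    def __init__(self):
        self.runs = []  # list of (input_size_bytes, seconds) pairs

    def observe(self, size, seconds):
        """Record one finished job execution."""
        self.runs.append((size, seconds))

    def predict(self, size):
        """Estimate completion time for a job with the given input size."""
        n = len(self.runs)
        if n < 2:
            return self.runs[0][1] if self.runs else 0.0
        sx = sum(s for s, _ in self.runs)
        sy = sum(t for _, t in self.runs)
        sxx = sum(s * s for s, _ in self.runs)
        sxy = sum(s * t for s, t in self.runs)
        denom = n * sxx - sx * sx
        if denom == 0:  # all observed sizes identical: fall back to mean
            return sy / n
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        return slope * size + intercept
```

A real predictor would need more features than input size alone (number of splits, available containers, locality), which is exactly why we are deferring the machine-learning version.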

NEXT OBJECTIVES: Start automating the packet capture and load balancing sequence with the current Proof-of-Concept.



© 2015 by Pedro Rocha Gonçalves. Proudly created with Wix.com
