Parallelization was introduced to overcome the limits of a single computing unit in large-scale data processing. Two approaches are common: batch processing and stream processing. For batch processing, MapReduce has become the most popular programming paradigm because it keeps programming simple by reducing it to a map procedure and a reduce procedure. In our work, we adapt the MapReduce paradigm to a specific domain: TIN parallelization of large-scale LiDAR datasets.
To enable efficient TIN parallelization of large-scale datasets, we address two issues: the data dependencies between parallel workers and the bottleneck in the shuffle phase of MapReduce. We introduce a triangulation approach with a convex boundary to reduce the number of vertices processed by each parallel worker. The convex boundary triangulation approach selects only the set of convex boundary triangles located around the boundary area during triangulation. As a result, the parallel workers can work independently of each other, because the vertices processed in the boundary area are not affected by the vertices of other subset areas. Time performance therefore improves, because the triangulation accesses fewer vertices.
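The idea of giving each worker its own subset area and a convex boundary can be illustrated with a short sketch. It is not the triangulation method itself: it assumes that subset areas are vertical strips of equal width and that the convex boundary of a subset can be approximated by the convex hull of its points (computed here with Andrew's monotone chain). The function names `map_to_subsets`, `convex_hull`, and `boundary_vertices_per_subset` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order.

    Collinear points on the hull edges are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def map_to_subsets(points, x_min, x_max, n_subsets):
    """Map step (assumed layout): assign each point to a vertical strip."""
    width = (x_max - x_min) / n_subsets
    subsets = defaultdict(list)
    for x, y in points:
        idx = min(int((x - x_min) / width), n_subsets - 1)
        subsets[idx].append((x, y))
    return subsets


def boundary_vertices_per_subset(points, x_min, x_max, n_subsets):
    """For each subset, keep only the vertices on its convex boundary.

    Each subset is handled without looking at any other subset, which is
    the independence property the paragraph above describes.
    """
    subsets = map_to_subsets(points, x_min, x_max, n_subsets)
    return {idx: convex_hull(pts) for idx, pts in subsets.items()}


# Example input: a 4 x 4 grid of points split into two strips.
grid = [(float(x), float(y)) for x in range(4) for y in range(4)]
boundaries = boundary_vertices_per_subset(grid, 0.0, 4.0, 2)
for idx in sorted(boundaries):
    print(idx, len(boundaries[idx]), "boundary vertices")
```

In this sketch each strip holds eight points, but only the hull vertices of a strip are candidates for the boundary, so the number of vertices that matter near a boundary is smaller than the subset size.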
In addition, we propose a resource allocation strategy that manages the use of cache, memory, and disk to reduce the communication bottleneck in the data transfer of the shuffle process. The strategy gives priority in resource usage to the tasks that hold a larger fraction of the data than the other tasks. Tasks with more data in the shuffle process can thus complete their execution more efficiently.
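A minimal sketch of such a priority rule is given below. It assumes a simple greedy policy: tasks are sorted by the amount of shuffle data they hold, and each task takes space from the fastest storage tier that still has capacity. The function name `allocate_shuffle_resources`, the tier names, and the capacities are hypothetical; the actual strategy may weigh the tiers differently.

```python
def allocate_shuffle_resources(task_sizes, tier_capacity):
    """Greedy allocation sketch (assumed policy, not the exact strategy).

    task_sizes:    dict mapping task id -> amount of shuffle data (any unit)
    tier_capacity: list of (tier_name, capacity) ordered fastest first,
                   e.g. cache, then memory, then disk
    Returns a dict mapping task id -> list of (tier_name, amount) placements.
    """
    remaining = [[name, capacity] for name, capacity in tier_capacity]
    plan = {}
    # Tasks holding a larger fraction of the data are served first.
    by_size = sorted(task_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for task_id, size in by_size:
        need = size
        placement = []
        for tier in remaining:
            if need == 0:
                break
            take = min(need, tier[1])
            if take > 0:
                placement.append((tier[0], take))
                tier[1] -= take
                need -= take
        if need > 0:
            raise ValueError(f"not enough capacity for task {task_id!r}")
        plan[task_id] = placement
    return plan


# Example input: three shuffle tasks and three storage tiers (made-up numbers).
tasks = {"t1": 10, "t2": 60, "t3": 30}
tiers = [("cache", 50), ("memory", 40), ("disk", 1000)]
for task_id, placement in allocate_shuffle_resources(tasks, tiers).items():
    print(task_id, placement)
```

With these example numbers, the largest task `t2` receives the whole cache plus part of the memory, while the smallest task `t1` is placed on disk.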
To evaluate our work, we conduct experiments with a number of LiDAR datasets from NCALM Lab, Berkeley, USA. We compare the number of vertices processed by our proposed method with that of other grid/bucket approaches. In addition, we measure the time improvement of our proposed method while varying the number of subset areas and scaling up the number of datasets. Finally, we provide a simulation of the proposed resource allocation strategy to measure the performance improvement in the bottleneck phase.