Data has overwhelmed the digital world in terms of volume, variety, and velocity. Individuals, business organizations, computational science simulations, and experiments produce huge volumes of data on a daily basis. Often, this data is shared among geographically distributed data centers for storage and analysis. Data transfer tools, however, face unprecedented challenges in moving such huge volumes of data across geo-distributed data centers in a timely manner.
Faults are one of the major challenges in distributed environments: hardware, network, and software components may fail at any instant. High-speed, fault-tolerant data transfer frameworks are therefore vital for moving data efficiently between data centers. In this thesis, we propose a novel Bloom filter-based data-aware probabilistic fault tolerance (DAFT) mechanism to recover efficiently from such failures. We also propose a data- and layout-aware fault tolerance (DLFT) mechanism to handle DAFT's false-positive matches effectively. We evaluate the data transfer and recovery time overheads that the proposed fault tolerance mechanisms impose on overall data transfer performance. The experimental results demonstrate that DAFT and DLFT recover from faults efficiently while minimizing memory, storage, computation, and recovery time overheads; furthermore, we observe negligible impact on overall data transfer performance.
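To make the idea concrete, the following is a minimal sketch, not the implementation evaluated in this thesis, of how a Bloom filter can act as a compact, periodically persisted log of completed transfer blocks: after a crash, only blocks absent from the filter are retransmitted. The names BloomFilter and blocks_to_retransmit are illustrative, and the sizing formulas are the standard ones for Bloom filters.

    import hashlib
    import math

    class BloomFilter:
        """Minimal Bloom filter: k hash probes into an m-bit array."""

        def __init__(self, n_items, fp_rate=0.01, seed=b""):
            # Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
            self.m = max(1, math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2))
            self.k = max(1, round(self.m / n_items * math.log(2)))
            self.seed = seed          # distinct seeds yield independent filters
            self.bits = bytearray((self.m + 7) // 8)

        def _probes(self, item):
            # Double hashing: derive all k probe positions from one SHA-256 digest.
            d = hashlib.sha256(self.seed + item).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item):
            for p in self._probes(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

    def blocks_to_retransmit(all_block_ids, completed):
        """Resume a failed transfer: skip block IDs recorded in the filter.

        Bloom filters never yield false negatives, so every unfinished
        block is retransmitted; a false positive wrongly skips a block,
        which is exactly the case a layout-aware cross-check must catch.
        """
        return [b for b in all_block_ids if b not in completed]

In this sketch, the filter costs a few bits per block regardless of block size, which is why such a log remains cheap to persist even for very large transfers.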
Protecting the integrity of data against failures of the various intermediate components on the end-to-end data transfer path is a salient requirement of big data transfer tools. Although most of these components provide some degree of data integrity, they are either too expensive or inefficient at recovering corrupted data. This necessitates application-level end-to-end integrity verification during data transfer. However, owing to the sheer size of the data, supporting end-to-end integrity verification in big data transfer tools incurs computation, memory, and storage overheads. In this thesis, we propose a cross-referencing Bloom filter-based data integrity verification framework for big data transfer systems. This framework has three advantages over state-of-the-art data integrity techniques: lower computation overhead, lower memory overhead, and zero false-positive errors for a bounded number of elements. We evaluate the computation, memory, recovery time, and false-positive overheads of the proposed framework and compare them with state-of-the-art solutions. The evaluation results show that the framework detects and recovers from integrity errors efficiently while eliminating the false positives of the Bloom filter data structure. In addition, we observe negligible computation, memory, and recovery overheads for all workloads.
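To illustrate the verification step, again as a sketch under stated assumptions rather than the framework itself, destination-side checksums can be tested against two source-side Bloom filters built with independent hash seeds. Because a Bloom filter has no false negatives, genuine corruption is always detected; the residual risk is a false-positive match that lets a corrupted file pass, and requiring agreement from both filters shrinks that probability (the cross-referencing design proposed in this thesis eliminates it entirely for a bounded number of elements). Reusing the BloomFilter sketch above, with illustrative names:

    import hashlib

    def file_checksum(path, chunk_size=1 << 20):
        """Stream a file through SHA-256 so huge files never sit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()

    def find_corrupted(dest_paths, source_filter, cross_filter):
        """Return destination files whose checksums the source never recorded."""
        corrupted = []
        for path in dest_paths:
            digest = file_checksum(path)
            # Membership must hold in BOTH independently seeded filters;
            # a corrupted file slips through only if it is a false
            # positive in each, which is far less likely than in one.
            if digest not in source_filter or digest not in cross_filter:
                corrupted.append(path)
        return corrupted

At the sender, each file's checksum would be added to both filters before transmission; any file flagged by find_corrupted is then retransmitted, which is the recovery path whose time overhead the evaluation measures.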