Flexible Partitioning for Selective Binary Theta-Joins in a Massively Parallel Setting
Jun 1, 2018·
,,·
1 min read
Ioannis Koumarelas
Athanasios Naskos
Anastasios Gounaris

Abstract
Efficient join processing plays an important role in big data analysis. In this work, we focus on generic theta joins in a massively parallel environment, such as MapReduce and Spark. Theta joins are notoriously slow due to their inherent quadratic complexity, even when their selectivity is low (e.g., 1%). The main performance bottleneck differs between cases and is due to any combination of the following factors: amount of data being shuffled, memory load on reducers, or computation load on reducers.
We propose an ensemble-based partitioning approach that tackles all three aspects. In this way, we can save communication cost, better respect the memory and computation limitations of reducers, and reduce overall execution time. The key idea behind our partitioning is to cluster join key values using two techniques: matrix rearrangement and agglomerative clustering. These techniques can run either in isolation or in combination.
We present extensive experimental results using both band queries on real data and arbitrary synthetic predicates. Our approach can save up to 45% of communication cost and reduce the computation load of a single reducer by up to 50% in band queries, while for arbitrary theta predicates the savings reach 74% and 80%, respectively. Apart from being effective, the potential benefits can be estimated before execution from metadata, enabling informed partitioning decisions. Finally, our solutions are flexible and can account for any weighted combination of the three bottleneck factors.
Type
Publication
In Distributed and Parallel Databases (Springer), 2018
Note
Click the Cite button above to enable visitors to import publication metadata into their reference management software.
Note
Create your slides in Markdown - click the Slides button to check out the example.
Add supplementary notes, full text, or examples here. You can include code, math, and images.