Distributed Joins, The Basics

This post is really about some base understanding that will be needed for some later posts on performance methods.  I find that to understand database performance it helps to think through how things are done at the most basic level.  So when we do a join in a distributed system like Netezza, there are really only a few options it has to break the join up across SPUs (or snippet processors).

First, in database joins it's always best to think of the query plan as only joining 2 things at any time.  For a 3 table join, say tables A, B, and C, you can join A and B, and then that result (as a data stream or a temp table) joins to table C.  For that reason I'll only ever talk about joining 2 things at a time.
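Just to make that concrete, here is a tiny Python sketch (purely my own illustration, nothing to do with how Netezza actually executes it) of a 3 table join done as two pairwise joins:

```python
# Illustrative only: a 3-table join is really two 2-way joins.
a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (2, "b2")]
c = [(1, "c1"), (2, "c2")]

# Step 1: join A and B on the first column.
ab = [(ka, va, vb) for (ka, va) in a for (kb, vb) in b if ka == kb]

# Step 2: that intermediate result joins C -- again only 2 things at a time.
abc = [(k, va, vb, vc) for (k, va, vb) in ab for (kc, vc) in c if k == kc]
print(abc)
```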

When joining 2 tables or data sets, the only way to split up the work is to have each snippet processor get part of one table and a full copy of the second table (known as a broadcast), or part of one table and a matching part of the other table, which is basically a distributed join.

If one of the tables is small, then broadcasting that small table to all of the snippet processors works very well.  For example, if we had a small customer table and a large orders table, each snippet processor could hold a fairly even slice of orders (even a random one), and then each snippet processor gets a full copy of customer.  Then it does some sort of nice efficient join like a hash join and all works well.
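Here is a minimal Python sketch of that broadcast idea (again, my own illustration of the concept, not Netezza internals): orders is split evenly across snippet processors, and every snippet processor hash-joins its slice against a full copy of the small customer table.

```python
customers = [(1, "Acme"), (2, "Globex"), (3, "Initech")]      # small table
orders = [(101, 1), (102, 3), (103, 2), (104, 1), (105, 3)]   # (order_key, customer_key)

NUM_SPUS = 2

# Any fairly even split of orders will do -- here, simple round-robin.
order_slices = [orders[i::NUM_SPUS] for i in range(NUM_SPUS)]

results = []
for spu_orders in order_slices:
    # Each snippet processor builds a hash table on its full ("broadcast") copy of customer...
    cust_hash = {ck: name for (ck, name) in customers}
    # ...and probes it with its local slice of orders.
    for (order_key, customer_key) in spu_orders:
        if customer_key in cust_hash:
            results.append((order_key, customer_key, cust_hash[customer_key]))

print(sorted(results))
```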

If both of the tables are large then it becomes impractical to send a copy of one of the tables to all of the snippet processors, so it must do a distributed join.  Each snippet processor needs a slice of orders and a matching slice of customer.  The key phrase is "matching slice".  In this case the natural "slice" is on customer key, assuming this is the primary key of customer and a foreign key in orders.  When looking at distributed joins you will notice the "matching" set will be on all or some of the join keys or values.  It seems to use only "some" of the join keys when a subset still spreads the data evenly, taking advantage of a distribution the data already has, either from the way the table is physically distributed or from previous snippets (steps) in the plan.  It is important to note that these are the equijoin keys and values (like cust.customer_key = ord.customer_key), and NOT any of the expression-join (exprjoin in the plan file) values (like ord.date between cust.start_date and cust.end_date).
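To show what "matching slice" means, here is another small Python sketch (illustrative only, with a made-up spu_for helper): both tables are split by hashing the equijoin key, so every order lands on the same snippet processor as its customer, and each one can join its own slices independently.

```python
from collections import defaultdict

customers = [(1, "Acme"), (2, "Globex"), (3, "Initech"), (4, "Umbrella")]
orders = [(101, 1), (102, 3), (103, 2), (104, 1), (105, 4)]   # (order_key, customer_key)

NUM_SPUS = 2

def spu_for(customer_key):
    # Hashing the join key decides which snippet processor owns this key.
    return hash(customer_key) % NUM_SPUS

cust_slices = defaultdict(list)
order_slices = defaultdict(list)
for row in customers:
    cust_slices[spu_for(row[0])].append(row)
for row in orders:
    order_slices[spu_for(row[1])].append(row)

# Each snippet processor joins only its own matching slices.
results = []
for spu in range(NUM_SPUS):
    local_cust = {ck: name for (ck, name) in cust_slices[spu]}
    for (order_key, customer_key) in order_slices[spu]:
        results.append((order_key, local_cust[customer_key]))

print(sorted(results))
```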

Given our example, to do the distributed join we need both customer and orders distributed on customer_key.  Let's say the query is only joining these 2 tables.  If both are already distributed on customer_key then you have a perfect colocated join: no data has to move, and each snippet processor works on its set of customers and has the orders for those customers right there on its dedicated disk.  This is ideal.  If we have customer distributed on customer_key and orders distributed on order key, then the orders table will need to be redistributed on customer_key on the fly, so each snippet processor can have that "matching" set of customers and orders.  This is a single redistribution.  If both tables are distributed on random, then both tables need to be redistributed on customer_key.  This is a double redistribution.
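The three cases fall straight out of comparing each table's current distribution key to the join key.  A tiny sketch (my own illustration, not plan-file output, and redistribution_needed is a made-up name):

```python
def redistribution_needed(cust_dist, orders_dist, join_key="customer_key"):
    moves = 0
    if cust_dist != join_key:
        moves += 1   # customer must be redistributed on the join key
    if orders_dist != join_key:
        moves += 1   # orders must be redistributed on the join key
    return ["colocated join", "single redistribution", "double redistribution"][moves]

print(redistribution_needed("customer_key", "customer_key"))  # colocated join
print(redistribution_needed("customer_key", "order_key"))     # single redistribution
print(redistribution_needed("random", "random"))              # double redistribution
```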

There is one very obvious tip you can already take from this: the less redistribution, the better.  Colocated joins are faster than single redistributions, which are faster than double redistributions.  In this example, having both customer and orders distributed on customer_key gives a performance boost when joining these 2 tables.
