So I think in previous posts the need for distribution and therefore the benefit for colocation of data to handle joins has been pretty well covered. But there are several other scenarios where data needs to be distributed to process.
First one that comes to mind are count distincts. Take for example a count of distinct users or customers like count(distinct customer_id) perhaps when doing analysis on purchase transactions. An example sql might be
select count(distinct customer_id) from purchase_transaction
For argument we’ll say that purchase_transaction has random distribution. So Netezza has to break up the job. It can’t just count distinct on every spu because you can’t just add up the numbers at the end because some customers might be on more than one spu. The way to make it where you can just add them up is to make sure a customer can’t be on more than one spu, and that means you need the data distributed by customer. That’s exactly what the query plan will do, redistribute on customer. But of course if they data is already distributed on customer, then you save having to do a redistribution.
This is just one example of where Netezza needs all of a grouping of data together. Another is windowing analytic functions like rank , dense_rank, lead, lag, etc. They always have a partition over set. To do these partition overs, you need all the data for each partition on one spu. If you partition over on customer_id, reusing our above example, and perhaps want to do a rank on dollar amount for each customer, again it will redistribute the data to get each slice of customer data located together on a single spu.
Still another example is any group by aggregate function with group by columns. If the data set is small enough and all the aggregate functions are summable, it may try to merge each spu’s results together on the host. But look at your plan files, often if the result set will be large, it will distribute by some or often all of the group by columns.
Select distincts are in may ways like a group by without any aggregate functions. So to get like values together , it will redistribute on all or most of the column set.
A key thing to realize is netezza doesn’t need to distribute purely on all of the distinct, group by , or partition over columns. It just needs to make sure that all of that set of data is there local on the spu. For example if you had zipcode and customer id. Lets say we know that customer is a pure child of zipcode, in that a customer only has one zipcode and a zipcode has many customers. It doesn’t need to redistribute on customer_id if it’s already distributed at the zipcode level. But it has to know this relationship. So that means in distinct or group by to make sure that zipcode is included. Or in a partition over statement partition over zipcode, customer_id. What I’ve seen if the optimizer sees that it is already distributed at that higher, courser grain AND it sees that that grain’s cardinality has things pretty evenly spread out, it will usually just leave things at that distribution. It can do this because if customers don’t cross zipcodes, and zipcodes don’t cross spus, then customers don’t cross spus. What I do see though is if it’s going to need to do redistributions, it would redistribute on all of the column values. I see this especially on group bys. Sometimes joins and partition overs are a little smarter if there is a common higher grain to handle both.
So to summarize, in addition to joins , colocation of data through table distribution or redistribution is also needed for count distincts (or any agg distinct), partition overs in analytic functions, and column grouping in group by or select distincts. A really good table distribution strategy will find a column, perhaps like a user_id or customer_id that can commonly be used for joins and these other calculations. If you can find a good common distribution like that things will run very very fast.