Distribution, Not Just for Joins

In previous posts I think the need for distribution, and therefore the benefit of colocating data to handle joins, has been pretty well covered. But there are several other scenarios where data needs to be distributed in a particular way to be processed.

The first one that comes to mind is count distincts. Take for example a count of distinct users or customers, like count(distinct customer_id), perhaps when doing analysis on purchase transactions. An example SQL statement might be

select count(distinct customer_id) from purchase_transaction

For argument's sake we'll say that purchase_transaction has random distribution, so Netezza has to break up the job. It can't just count distinct on every spu and add up the numbers at the end, because some customers might be on more than one spu. The way to make it so you can just add them up is to make sure a customer can't be on more than one spu, and that means you need the data distributed by customer. That's exactly what the query plan will do: redistribute on customer. But of course if the data is already distributed on customer, then you save having to do a redistribution.
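
If you control the DDL, you can bake that distribution in when the table is created. A minimal sketch, assuming a hypothetical purchase_transaction layout (the column list here is illustrative, not from any real schema):

create table purchase_transaction
(
    transaction_id bigint,
    customer_id    bigint,
    dollar_amount  numeric(12,2),
    purchase_date  date
)
distribute on (customer_id);  -- all of a customer's rows land on one spu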

This is just one example of where Netezza needs all of a grouping of data together. Another is windowing analytic functions like rank, dense_rank, lead, lag, etc. They always have a partition-over set, and to compute these partition overs you need all the data for each partition on one spu. If you partition over customer_id, reusing our example above, and perhaps want to rank the dollar amounts for each customer, again it will redistribute the data to get each slice of customer data located together on a single spu.
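
For instance, ranking each customer's transactions by dollar amount might look like this (a sketch reusing the hypothetical columns from above):

select customer_id,
       transaction_id,
       dollar_amount,
       rank() over (partition by customer_id order by dollar_amount desc) as amount_rank
from purchase_transaction

If the table is already distributed on customer_id, every partition is already sitting whole on a single spu and no redistribution is needed.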

Still another example is any aggregate query with group by columns. If the data set is small enough and all the aggregate functions are summable, it may try to merge each spu's results together on the host. But look at your plan files: often, if the result set will be large, it will distribute by some, or often all, of the group by columns.
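
As a hypothetical example of the large-result case, a customer-level rollup like the one below would typically show a redistribution on the group by column in the plan file, unless the table is already distributed that way:

select customer_id,
       count(*)           as purchase_count,
       sum(dollar_amount) as total_spend
from purchase_transaction
group by customer_id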

Select distincts are in many ways like a group by without any aggregate functions. So to get like values together, it will redistribute on all or most of the column set.
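
For example, a deduplication like this behaves, distribution-wise, just like a group by on the same column:

select distinct customer_id
from purchase_transaction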

A key thing to realize is Netezza doesn't need to distribute on exactly the full set of distinct, group by, or partition-over columns. It just needs to make sure that all of each group's data is local on one spu. For example, say you have zipcode and customer_id, and we know that customer is a pure child of zipcode: a customer has only one zipcode, and a zipcode has many customers. Then it doesn't need to redistribute on customer_id if the data is already distributed at the zipcode level. But it has to know this relationship, so make sure zipcode is included in the distinct or group by column list, or in a windowing function write partition by zipcode, customer_id.

What I've seen is that if the optimizer sees the data is already distributed at that higher, coarser grain AND it sees that that grain's cardinality has things pretty evenly spread out, it will usually just leave things at that distribution. It can do this because if customers don't cross zipcodes, and zipcodes don't cross spus, then customers don't cross spus. What I do see, though, is that if it is going to need a redistribution anyway, it will redistribute on all of the column values. I see this especially on group bys. Sometimes joins and partition overs are a little smarter when there is a common higher grain that can handle both.
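
To make that concrete, here is a sketch assuming purchase_transaction also carried a zipcode column and was distributed on zipcode. Both of the following should be able to run without a redistribution, because every customer's rows are already together on one spu:

select zipcode, customer_id, sum(dollar_amount) as total_spend
from purchase_transaction
group by zipcode, customer_id

select zipcode, customer_id, dollar_amount,
       rank() over (partition by zipcode, customer_id
                    order by dollar_amount desc) as amount_rank
from purchase_transaction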

So to summarize: in addition to joins, colocation of data through table distribution or redistribution is also needed for count distincts (or any distinct aggregate), partition overs in analytic functions, and column grouping in group bys or select distincts. A really good table distribution strategy will find a column, perhaps something like a user_id or customer_id, that can commonly be used both for joins and for these other calculations. If you can find a good common distribution like that, things will run very, very fast.

 


4 Responses to Distribution, Not Just for Joins

  1. Anoop says:

    Hi,
    In my table I don't find a perfect match for a distribution key, since I am storing all of the client data in one table. It is not a good idea to distribute on client-specific data, because for some clients it will have a value and for others it will be null. So I used random distribution, but I feel like if I could find one distribution column it would be faster. Could you provide more insight?

    • NZGuy says:

      So generally you look for a distribution that helps your joins. Usually that drives you to primary keys and foreign keys. But I'm not sure how you join on a column where you allow nulls. Are you saying that for your client data you have no single primary key? Good modeling would probably be to have a surrogate key column unique for each row, and your natural keys might be sparsely populated across several different columns. Or if some are unknown, you use something like a -1… but then you can have a skew problem (I have a post on dealing with that). But the bottom line is you don't want nulls in your PKs or FKs, or in the columns you join on.
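
      A minimal sketch of that surrogate-key idea (table and column names are hypothetical, and how you populate the key is up to your load process):

      create table client_data
      (
          client_row_sk  bigint not null,  -- surrogate key, unique per row, never null
          client_a_id    bigint,           -- natural keys, sparsely populated
          client_b_code  varchar(20),
          client_payload varchar(1000)
      )
      distribute on (client_row_sk);

      Distributing on the surrogate key guarantees an even spread, though it only helps joins that actually use that key.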

      As far as general approach goes, you're trying to help your large-table-to-large-table joins. The only way Netezza can do these joins is to have the tables/datasets on a common distribution. Go look at the plan file for your query (I use nzadmin to look at these). After the table scan you may see, at the end of the snippet, something like

      1[06]: spu DownloadTableNode distribute into link 2163400540, numDistKeys=1 keepStore=0 flags=0x0 dsIdDist=0 keys=[ 3 ]

      Right before that you should see a project node with a numbered list of columns that are moving to the next snippet. Those numbers correspond to the numbers in the key list of your redistribution. As an example, the table above may be on random distribution but needs to be redistributed on cust_id (let's say that is what attribute 3 is). That means shipping your data across the network to get it distributed to where it can be joined, which is a bit of a slow process. If the data was already sitting in this distribution, you wouldn't need to do this redistribution and it would be much faster. So I would start by looking at common query plans, look for redistribution steps like the one above, and see if you can pre-distribute the data the way it is needed by setting the table distribution, avoiding these redist steps.
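
      If you can't change the original table's DDL, one way to pre-distribute is a CTAS copy (table and column names here are hypothetical):

      create table purchase_fact_by_cust as
      select *
      from purchase_fact
      distribute on (cust_id);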

  2. Anoop says:

    Hi,
    How do you generate a random number between .7 and 1 in Netezza? Is there an option you can give the random function to do so?

    • NZGuy says:

      All the random number generators I'm aware of return a number between 0 and 1. Getting it into the range you want is simple math, so there are no parameters for this.
      You have an offset of .7 and a range width of .3. So multiply random() by the width and add the offset:

      select (random()*.3)+.7
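
      The same pattern works for any range: random() * (max - min) + min. For example, for a value between 2 and 5:

      select (random() * 3) + 2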
