This post builds on my previous post, Distributed Joins, The Basics. We will extend the example from a customer table and an orders table to a customer table, an orders table, and an order_items table. Logically, customer key is the primary key of customer and a foreign key in orders, and order key is the primary key of orders and a foreign key in order_items. We will assume all of these tables are too big to broadcast.
To execute the distributed joins, customer and orders first need to be distributed on customer key: either both are already physically distributed that way, or one or both are redistributed on the fly. Then the custord result set and order_items need to be distributed on order key. Since custord comes out of the first join distributed on customer key, it HAS TO be redistributed on order key, and order_items either redistributes to order key or is already physically distributed that way.
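The mechanics above can be sketched in a few lines of Python. This is a simplified model, not any vendor's actual implementation: rows are hash-partitioned across a handful of nodes by a distribution column, and a join can run locally only when matching join-key values land on the same node in both inputs.

```python
# Minimal model of MPP hash distribution. Assumptions: 4 nodes, rows are
# dicts, and Python's hash() stands in for the engine's distribution hash.

N_NODES = 4

def node_for(key):
    """Map a distribution-key value to a node via hashing."""
    return hash(key) % N_NODES

def distribute(rows, key_col):
    """Partition rows across nodes by hashing the chosen column."""
    nodes = {n: [] for n in range(N_NODES)}
    for row in rows:
        nodes[node_for(row[key_col])].append(row)
    return nodes

def colocated(a_nodes, b_nodes, join_col):
    """True if every join-key value's rows sit on the same node in both inputs."""
    where = {}
    for n, rows in a_nodes.items():
        for r in rows:
            where[r[join_col]] = n
    return all(n == where.get(r[join_col], n)
               for n, rows in b_nodes.items() for r in rows)

# Toy data: 100 orders spread over 10 customers, 3 items per order.
orders = [{"order_key": o, "customer_key": o % 10} for o in range(100)]
order_items = [{"order_key": o, "item": i} for o in range(100) for i in range(3)]

# custord (the customer-to-orders join result) is distributed on customer_key,
# but order_items is distributed on order_key, so matching rows end up on
# different nodes and one side must be redistributed before the second join.
custord_nodes = distribute(orders, "customer_key")
items_nodes = distribute(order_items, "order_key")
print(colocated(custord_nodes, items_nodes, "order_key"))  # False

# After redistributing custord on order_key, the join is local on every node.
print(colocated(distribute(orders, "order_key"), items_nodes, "order_key"))  # True
```

The second `colocated` check is the point: the join only avoids data movement once both sides share the join key as their distribution key.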
Looking at this model, it seems impossible to avoid at least some on-the-fly redistribution of data, and possibly several redistributions. But there is a simple way to avoid it. Even though it is not logically needed, carry the customer key redundantly through to the order_items table. Then we can distribute all three tables on customer key, and each snippet processor has a set of customers, all their orders, and all their order_items together on its dedicated disk.
But there is a catch. Join customer to orders on customer key, then orders to order_items on order key, and the optimizer will redistribute on order key in the second step anyway, because it doesn't know that the order items belong to the customer and are already local to the snippet processor. The solution is simple: join customer to orders on customer key, then join orders to order_items on order key (needed for the relation) AND customer key (needed for distribution). The optimizer will figure out that everything is already together by customer key, that this key alone spreads things out well, and that nothing needs to move. It will do the join between customer and orders, and then, right in the same snippet with the same distribution, join the custord result set to order_items. To take advantage of the distribution, it is critical to always include the customer key in the join, even when just joining orders to order_items.
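The planner rule being exploited here can be stated compactly: a table needs no redistribution for a join if its distribution key is a subset of the join columns. The sketch below is a hypothetical illustration of that rule, not a real optimizer API.

```python
# Hypothetical planner check (illustrative names, not a real API): a table
# must be redistributed unless it is already distributed on some subset of
# the columns being joined on.

def needs_redistribution(dist_key, join_cols):
    """True if the table's distribution key is not covered by the join columns."""
    return not set(dist_key) <= set(join_cols)

# Joining custord to order_items on order_key alone: custord is distributed
# on customer_key, which is not among the join columns, so it must move.
print(needs_redistribution(("customer_key",), ("order_key",)))  # True

# Adding customer_key to the join condition lets the optimizer see that the
# existing customer_key distribution already satisfies the join: no movement.
print(needs_redistribution(("customer_key",), ("order_key", "customer_key")))  # False
```

This is why the redundant customer key in the join predicate matters: it changes nothing about which rows match, but it gives the optimizer the information it needs to keep the join local.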
Even if the distribution is not on this carried-through key, having it in the model and using it in joins reduces the number of redistributions. Say everything is distributed randomly. Without the key carried through, the database will redistribute both customer and orders on customer key, then redistribute the custord result set on order key and redistribute order_items on order key as well: three tables and one intermediate result set redistributed, four movements in total. If we carry the key, it will redistribute all three tables on customer key and join them in one snippet: three movements. Avoiding the redistribution of the intermediate result set makes the query faster.
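Tallying the movements in the two plans makes the saving concrete; this is just the arithmetic from the paragraph above, under the assumption that every table starts out randomly distributed.

```python
# Data movements per plan, assuming all tables start randomly distributed.

# Without the carried-through key: customer and orders redistribute on
# customer_key, then the custord intermediate result and order_items each
# redistribute on order_key.
without_carried_key = ["customer", "orders", "custord (intermediate)", "order_items"]

# With customer_key carried into order_items: all three base tables
# redistribute on customer_key once, and both joins then stay local.
with_carried_key = ["customer", "orders", "order_items"]

print(len(without_carried_key), len(with_carried_key))  # 4 3
```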
Carrying keys through your model, using them for common distribution, and making sure to always include those distribution keys in your joins can greatly improve performance by maintaining good colocation through several joins.