This post builds on concepts introduced in the Distributed Joins, The Basics and Distributed Joins, Modeling for Colocation posts. I’m assuming some familiarity with table skew, where the chosen distribution key has some values with significantly more rows than average, so that one SPU or disk belonging to a snippet processor holds more data than the rest. I also talked about how a distributed join between two large tables needs them distributed, or redistributed, on some or all of the join keys. What is not always obvious is that even when a table itself has a nice, even distribution, a join may redistribute that table in temp space on join keys that do NOT distribute evenly.
I’ll start with our simple Customer and Order table example. Let’s say that for whatever reason customer is distributed by customer_key and order is distributed by order_key. You look at the table distribution and everything looks nice and even, so you think all is well. But then you join these two large tables and it’s just not as fast as you expected. A very common situation, especially in data marts or data warehouses, is orders with no customer. Maybe these were cash orders, or for some reason the customer was never identified, so you assign them all to a single “Unknown” customer record. Now suppose 20% of all orders have that “Unknown” customer. To execute this join, the system naturally redistributes the Orders table by customer_key, and suddenly 20% of the order data lands on one snippet processor’s disk. In my environment, a 10400, there are 432 active SPUs, meaning that with even distribution any one of them should hold only about 0.23% of the data. So roughly, that snippet processor has 87 times the data, and 87 times the work to do, of the average snippet. And guess what: you are only as good as your slowest snippet processor, so that snippet in your query plan will take about 87 times as long as it would if everything were evenly distributed during that particular snippet. This has nothing to do with how evenly things are distributed in the table itself.
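One way to spot this before the join bites you is to profile the join key itself rather than the table’s distribution key. A minimal sketch, assuming the customer/orders example above (the table name orders and column customer_key are illustrative):

```sql
-- Find the heaviest values of the key the join will redistribute on.
-- A single value holding a large share of the rows (like the -1
-- "Unknown" customer) predicts skew after on-the-fly redistribution.
select customer_key, count(*) as cnt
from orders
group by customer_key
order by cnt desc
limit 10;
```

On a 432-SPU system, even distribution puts about 0.23% of the rows on each data slice, so any key value holding 20% of the rows means roughly 87x the average work for whichever snippet processor receives that hash value.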
Another situation that is hard to follow is understanding the distribution at each step, because you can hit skew when the results of the table A to table B step are joined with the results of the table C to table D step. This matters because even when you think a table isn’t big enough to need redistribution for a join, and you expect it to be broadcast, in more complex queries it might first be joined to something else big and then joined on a key with a skew problem.
So, how do we solve this? One approach is to keep these dimension-type tables small enough that they always broadcast. A quick hint: you can set enable_factrel_planner = true and then set factrel_size_threshold high, like 15000000, which says anything under 15 million rows is more of a dimension, so broadcast or redistribute it, while tables over 15 million rows are big facts the planner should try to leave in place rather than redistribute. Also try setting spu_to_spu_broadcast to something small like .1 or .001 to help encourage broadcasting of smaller tables (these can be set in the config or at the session level). But as mentioned above, if these small tables act as a sort of intersection point between large tables, i.e. everything ties to them, you may still hit a skew problem like the one described.
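At the session level, the settings mentioned above look like this (values are the ones from the discussion; tune them for your own system):

```sql
-- Treat tables under 15M rows as dimensions (broadcast/redistribute
-- candidates); leave larger fact tables in place where possible.
set enable_factrel_planner = true;
set factrel_size_threshold = 15000000;

-- A small broadcast cost factor further encourages the planner to
-- broadcast small tables instead of redistributing large ones.
set spu_to_spu_broadcast = 0.001;
```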
The prescribed solution is to scatter your “Unknown” keys over numerous records by using a range of keys. For example, if -1 is the standard key for your single “Unknown” customer, you may need 10,000 “Unknown” customer records with keys ranging from -1 to -10,000. You will then have to spread your child or fact-type records out over these many “unknown” parent or dimension rows.
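A minimal sketch of spreading the fact rows, assuming the example tables above and Netezza’s random() function (which returns a double between 0 and 1); the exact key range you get depends on how the cast truncates, so treat the arithmetic as illustrative:

```sql
-- Scatter fact rows that point at the single -1 "Unknown" customer
-- across roughly 10,000 negative keys. The matching "Unknown" rows
-- (keys -1 through -10000) must also exist in the customer dimension.
update orders
set customer_key = -(1 + cast(random() * 9999 as int))
where customer_key = -1;
```

After this, a redistribution of orders on customer_key hashes the former hot spot across many data slices instead of one.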
Another approach is to break up your query: in one pass, join everything except the “Unknown” customer to the nicely spread-out customer keys; in a second pass, handle just the “Unknown” customer, hopefully broadcasting that one (or few) low-cardinality record(s); then union the results back together.
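A sketch of that two-pass rewrite, again assuming the example table and column names:

```sql
-- Pass 1: the well-distributed keys join normally.
select o.order_key, c.customer_name
from orders o
join customer c
  on o.customer_key = c.customer_key
where o.customer_key <> -1
union all
-- Pass 2: the skewed key is handled without redistributing orders;
-- with a single known "Unknown" record there is nothing to look up.
select o.order_key, 'Unknown' as customer_name
from orders o
where o.customer_key = -1;
```

The second branch never forces the 20% of skewed rows onto one data slice, because no join (and therefore no redistribution on customer_key) is needed for them.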
Unfortunately, both of these methods can be a little problematic for certain reporting tools. I’m currently working on ways to spread the unknowns sort of virtually, without having 9,999 extra records in the customer table, but I haven’t completed the testing on that yet. Hopefully I’ll have more in the future. But understanding the problem is half the battle. A little tip for tracking this type of skew down: make physical versions of the tables or intermediate results, distributed the same way as in the join step or snippet giving you problems. (Note that the snippet redistributing into the skewed distribution is slow, in addition to the joining snippet itself.) Once you have the results or table copies physically redistributed, look at their distribution. If you see a skew problem there, you have a process skew problem during that step of your original query.
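That diagnostic tip can be sketched as follows, assuming the example tables (the table name orders_by_cust is hypothetical) and Netezza’s per-row datasliceid value:

```sql
-- Materialize the intermediate result using the same distribution the
-- suspect join step would create on the fly.
create table orders_by_cust as
select *
from orders
distribute on (customer_key);

-- datasliceid reveals which data slice holds each row; a lopsided
-- count here confirms process skew in that step of the original query.
select datasliceid, count(*) as cnt
from orders_by_cust
group by datasliceid
order by cnt desc
limit 10;
```

If one data slice shows a row count far above the rest, the corresponding snippet processor is doing that many times the average work in your real query.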
There are other things that can cause process skew, which I’ll discuss in later posts, but this data skew after on-the-fly redistribution is the most common.