How to “Undo” a Delete, Update, or Insert in Netezza

The great thing about Netezza is it does “soft” deletes until the groom process runs.  This means your deleted data is really still there.  This is of course very important if you just deleted something you didn’t mean too.  And since “updates” are really just a delete of the old values and an insert of the new behind the scenes, you can actually “undo” updates too.

So first a little about how Netezza handles transactions.  Every insert, update, or delete transaction is assigned a sequential transaction id or xid.   You can see this id by querying the column createxid .  Also there is a column to indicate deleted rows called deletexid.  This is set to 0 if this is a readable, not deleted row.  When the row gets deleted, this column gets populated with the transaction id assigned to the delete or update statement.  And as I mentioned above , updates are handled by doing a delete of the old record and inserting the entire new row with the new values.

Under normal circumstances when you run a select you will not see rows that have a deletexid not equal to zero.  But there is a simple session variable that can be set that allows you to see these deleted rows.  The simple command to run in your favorite query tool is …

set show_deleted_records = true

So to see what’s there you can then just select from your table and include the 2 transaction id columns, like ..

select createxid,deletexid, *

from your_table

If you want to see just the deleted rows , just select where deletexid is not zero

select createxid,deletexid, *

from your_table

where deletexid !=0

Once you can see your deleted data, and figure out which transaction you are trying to undo, you can simply re-insert the data

insert into your_table

select * from your_table

where deletexid=233443; –transaction id from delete.

To undo an update, just re-insert the deleted rows and delete the inserted rows.

insert into your_table

select * from your_table

where deletexid=233443 ;–transaction id from update

delete from your_table

where createxid=233443; –transaction id from update

And I’m stating the obvious, but to undo an insert, you don’t even need to show deleted rows, just delete by the transaction id

delete from your_table

where createxid=233443; –transaction id from insert

Do be warned, you cannot undo a truncate table!


Posted in SQL Tricks | Tagged , , , | 13 Comments

Efficient Data Upload Method Via External Table

If you ever have tried to upload data to Netezza using convention query tool methods which create individual inserts, you will know this can be EXTREMELY SLOW.  The way to upload data efficiently is to create an on-the-fly external table on your local datafile and use that in a CTAS (create table as) or an insert into a table.  This is similar to the data download method in the previous post.  In my example I create a temp table, but it could also be a perm table. This example is for comma delimited with backslash as the escape character, but you can change this to match your source file.  If you cut and paste these examples, becareful your quotes don’t become open and close quotes.

Via ODBC Like Aginity Workbench


SELECT * FROM EXTERNAL ‘C:\\Users\\ValuedCustomer\\customer dim dump.csv’ ( CUSTOMER_FIRST_NAME VARCHAR(50) NOT NULL,






LOGDIR ‘C:\\Users\\ValuedCustomer\\AppData\\Local\\Temp\\mylog.tmp\\’ DELIMITER ‘,’ Y2BASE 2000 ENCODING ‘internal’ REMOTESOURCE ‘ODBC’ ESCAPECHAR ‘\’ )

Via JDBC Like with Squirrel SQL


SELECT * FROM EXTERNAL ‘C:\\Users\\ValuedCustomer\\customer dim dump.csv’ ( CUSTOMER_FIRST_NAME VARCHAR(50) NOT NULL,






LOGDIR ‘C:\\Users\\ValuedCustomer\\AppData\\Local\\Temp\\mylog.tmp\\’ DELIMITER ‘,’ Y2BASE 2000 ENCODING ‘internal’ REMOTESOURCE ‘JDBC’ ESCAPECHAR ‘\’ )

Posted in Performance Tuning for Netezza, SQL Tricks | Tagged , , , , , | 10 Comments

Returning and Saving Large Result Sets Locally

Some of the common SQL Query tools like Squirrel Sql can have issues when it tries to return large result sets.  This seems to be because it needs to hold them in memory.  There are times you need to just bring back large result sets in a file, local to your client computer.  Believe it or not you can create an external table back to a local file via JDBC or ODBC.

If you are using a JDBC SQL tool like Squirrel , then this syntax works.


Just run this in a query window and go find your file in the directory you specified.

Doing this on an ODBC tool like Aginity Workbench (plug to my buddy Oleg, where I got this method!) is the same with just ODBC instead of JDBC…


Posted in SQL Tricks | Tagged , , , , , | 1 Comment

An Integer Check That Doesn’t Blow Up

Sometimes you load in data sourcing from charactor data into a numeric or integer field.  The challenge is any non-numeric data will blowup your entire SQL with something like an atoi (alpha to integer) error.  You really need to check the data or do something to prevent the one bad element from blowing up everything.

My favorite method is actually to use the Netezza supplied SQLToolkit and use a simple regular expression to only load a numeric.  An example is

select regexp_extract(my_source_column,’^[0-9]{1,18}$’)

from …

This will only return an integer of up to 18 digits. You can also use this in a condition by comparing it back to the original source column.

Of course often the SQLToolkit is not installed or there is a desire to not use regular expressions.   You can use a test method of doing a to_number , cast back to a varchar, compare to the original and use an istrue command to return a true or false.

Select ..



istrue(cast(to_number(sourcecolumn,’999999999999999999′) as varchar(20)) = sourcecolumn)

Conversely you can use the statement in your selected columns and use the true or false returned in a case statement if you wish to handle the non-numeric somehow, like return zero.

Below are some test examples to see how it works.

select istrue(cast(to_number(’124b5′,’999999999999999999′) as varchar(20)) = ’124b5′)

select istrue(cast(to_number(’1245′,’999999999999999999′) as varchar(20)) = ’1245′)

Posted in SQL Tricks | Tagged , , , , | Leave a comment

A Method for Handling Process Skew with Unknown Defaults

Back in the post Distributed Joins, Process Skew I mentioned a common cause of process skew were default values such as -1 for an “Unknown” customer creating skew when a redistribution happened on that customer key and there was a significant number of unknowns or -1 records.  I also mentioned one solution was to spread the unknowns across a range of negative values.  And I also mentioned I was working on a solution that would sort of virtualize this spread without having all the extra unknown negative values in you dimension.  Well this is that solution in detail.

First I want my solution to support several default values , so mine with support a -1 though -9, or 9 defaults.  I did not want to physically have to use anything other than the single default value in my fact.  And I didn’t want anything other than the single default row in my dimension .

So first I just wanted a dummy table with 10000 rows in it to generate data.




LIMIT 10000;

Then I want to create a table I’m calling a multiplexer table





Then I generate of range of negative values for each default.  As example I’ll assume we have a -1 unknown.  Others can be added in the same way










I’ll add a second set for -2 just so it’s clear how that would work









So it’s clear, all the values in the -1 set end in 1 and all the values in the -2 set end in 2.  The -1 set ranges from -1 to -99991  , 10000 values.

Below is the example.

First part is the MCUST sub select, which in practice should probably be hidden in a view.  You outer join to the multiplexer table , and use the 10000 values from the  multiplex table when you match on the single base -1 value, and use the original key for all the other keys. This is done with a simple nvl


This becomes the new cust_key with the -1 values now scattered from -1 to -99991 in steps of 10.

Then on the fact you take a large, high cardinality integer column ranging past 10000 and evenly distributed, and leverage it to generate the range of negative values. If you have a primary key on the table that’s probably a good one to use. If their is a chance that leveraged column has negatives, you need to wrap it with an abs() . To break it down, a MOD with a divisor of 10000, with result in a range of 0-9999 .  It is then multiplied by 10 to give a range from 0 to 99990 in steps of 10, and then it subtracts 1 for the -1 defaults, and it would need to subtract -2 for the -2 defaults.  Use a decode (or case if you like) to generate the range for -1, or -2, or up to your 9 defaults. This gives the same range of negatives as the dimension.


This calculation on the FACT table becomes the new customer key.  This too could be hidden in a view to make it easier to use with reporting tools.  I’ve tested it and it distributed on this decode/mod calculated value.  I did a test on a fact with 10% of the rows on a -1 unknown.  On a Twinfin 24 with 192 dataslices this solution gave an 18.5x performance gain.


















Posted in Performance Tuning for Netezza | Tagged , , | Leave a comment

Distribution, Not Just for Joins

So I think in previous posts the need for distribution and therefore the benefit for colocation of data to handle joins has been pretty well covered.  But there are several other scenarios where data needs to be distributed to process.

First one that comes to mind are count distincts.  Take for example a count of distinct users or customers like count(distinct customer_id) perhaps when doing analysis on purchase transactions.  An example sql might be

select count(distinct customer_id)  from purchase_transaction

For argument we’ll say that purchase_transaction has random distribution.  So Netezza has to break up the job.  It can’t just count distinct on every spu because you can’t just add up the numbers at the end because some customers might be on more than one spu.  The way to make it where you can just add them up is to make sure a customer can’t be on more than one spu, and that means you need the data distributed by customer.  That’s exactly what the query plan will do, redistribute on customer.  But of course if they data is already distributed on customer, then you save having to do a redistribution.

This is just one example of where Netezza needs all of a grouping of data together.  Another is windowing analytic functions like rank , dense_rank, lead, lag, etc.  They always have a partition over set.  To do these partition overs, you need all the data for each partition on one spu.  If you partition over on customer_id, reusing our above example, and perhaps want to do a rank on dollar amount for each customer, again it will redistribute the data to get each slice of customer data located together on a single spu.

Still another example is any group by aggregate function with group by columns.  If the data set is small enough and all the aggregate functions are summable, it may try to merge each spu’s results together on the host.  But look at your plan files, often if the result set will be large, it will distribute by some or often all of the group by columns.

Select distincts are in may ways like a group by without any aggregate functions.  So to get like values together , it will redistribute on all or most of the column set.

A key thing to realize is netezza doesn’t need to distribute purely on all of the distinct, group by , or partition over columns.  It just needs to make sure that all of that set of data is there local on the spu.  For example if you had zipcode and customer id.  Lets say we know that customer is a pure child of zipcode, in that a customer only has one zipcode and a zipcode has many customers.   It doesn’t need to redistribute on customer_id if it’s already distributed at the zipcode level.  But it has to know this relationship.  So that means in distinct or group by to make sure that zipcode is included.  Or in a partition over statement partition over zipcode, customer_id.   What I’ve seen if the optimizer sees that it is already distributed at that higher, courser grain  AND it sees that that grain’s cardinality has things pretty evenly spread out, it will usually just leave things at that distribution.  It can do this because if customers don’t cross zipcodes, and zipcodes don’t cross spus, then customers don’t cross spus.    What I do see though is if it’s going to need to do redistributions, it would redistribute on all of the column values.  I see this especially on group bys.  Sometimes joins and partition overs are a little smarter if there is a common higher grain to handle both.

So  to summarize, in addition to joins , colocation of data through table distribution or redistribution is also needed for count distincts (or any agg distinct), partition overs in analytic functions,  and column grouping in group by or select distincts.  A really good table distribution strategy will find a column, perhaps like a user_id or customer_id  that can commonly be used for joins and these other calculations.  If you can find a good common distribution like that things will run very very fast.


Posted in Performance Tuning for Netezza | Tagged , , , , | 4 Comments

Distributed Joins, Process Skew

This post will build on concepts introduced in the Distributed Joins, The Basics and Distributed Joins, Modeling for Colocation posts.  I’m assuming some familiarity with table skew, where a distribution key is chosen where some key values has significantly more rows than average, so that one spu or disk belonging to a snippet processor has more data than the average.  And I talked about distributed joins joining 2 large tables needing to have them distributed or redistribute them on some or all of the join keys.  But what is not always obvious is that just because a nice even distribution was achieve for the table, a join may cause that table to redistribute in temp space on join keys that do NOT have even distribution. 

I’ll start with our simple Customer and Order table example.  Let’s just say for what ever reason customer is distributed by customer_key and order is distributed by order_key.  You look at the table distribution and everything looks nice and even.  So you think all is well.  But then you join these 2 large tables and it’s just not as fast as you thought.   A very common problem, especially in data marts or data warehouses, is you had orders with perhaps no customer.  Maybe these were cash orders or just for some reason have no identified customer.  So you assign these to an “Unknown” customer record.  But maybe you find that 20% of all orders have an “Unknown” customer.  Well what can happen is to execute this join it naturally redistributes Orders table by customer_key.  But now 20% of my order data is on one snippet processor disk.  In my environment with a 10400 there are 432 active spus meaning that if there is even distribution any one should only have 0.23% of the data.  So ruffly this snippet processor has 87 times the data, and 87 times the work to do of the average snippet.  So guess what, you are only as good as your slowest snippet processor and you will find this snippet in your query plan will take 87 times as long as it would  everything was evenly distributed during that particular snippet.  Nothing to do with how evenly things are distributed in the table.

Another hard to follow situation is understanding the distribution in each step, because you could hit skew when it joins the results of the table A to table B step with the results of the table C to table D step.  This is important because even when you think a table isn’t big enough to need to be distributed for a join, you expect it to be broadcast, in more complex queries it might first be joined to something else big and then joined on the key with a skew problem. 

So, how to solve.  Well one approach is to keep some of these dimension type tables small so they always broadcast.  A quick hint, you can set enable_factrel_planner = true and then set factrel_size_threshold high like 15000000, which says anything under 15 million rows is more of a dimension so broadcast or redistribute it, and tables over 15M rows are big facts so try to leave in place and don’t redistribute.  And also try setting spu_to_spu_broadcast to something small like .1 or .001 to help encourage broadcasting of smaller tables (these can be set in the config or at the session level).  But like mentioned above, if these small tables act as sort of an intersection point between large tables, i.e. everything ties to them, you may still hit a skew problem like above. 

The prescribed solution is to scatter your “Unknown” keys over numerous records by having a range of keys.  For example, if -1 is the standard key for your single “Unknown” customer, you may need to have 10,000 “Unknown” customer records and have the keys range from -1 to -10,000.  You will then have to spread out your child or fact type records over these many “unknown” parent or dimension rows.

Another approach is to break up your query where you join everything but the “Unknown” to the nice spread out customer keys, and in a second pass handle just the “Unknown” customer hopefully broadcasting that 1 or few low cardinality records and then union the results back together. 

Unfortunately both of these methods can be a little problematic for certain reporting tools.  I’m currently working on some ways to spread the unknowns sort of virtually without having 9999 extra records in the customer table,  but I haven’t completed the testing on that yet.  Hopefully I’ll have more in the future.  But understanding the problem is half the battle.  A little tip for tracking this type of skew down,  actually make physical versions of the tables or intermediate results distributed as it is in the join step or snippet giving you problems.  Actually the snippet redistributing into the skewed distribution is slow in addition to the joining snippet.  One you have the results or table copies physically redistributed, look at the distribution.  If you see a skew problem here, you have a process skew problem during that step in your original query. 

There are other things that can cause process skew that I’ll discuss in later posts, but this data skew after on the fly redistribution is the most common.

Posted in Best Practices, Performance Tuning for Netezza | Tagged , , , , | 8 Comments

Distributed Joins, Modeling for Colocation

This post will build on my previous post Distributed Joins, The Basics.   So we will extend the example from a customer table and an orders table, to a customer table , orders table, and order_items table.  Logically the customer skey  is the primary key in customer and foreign key in orders, and order key is the primary key in orders and the foreign key in order items.  We will assume all of these tables are too big to broadcast. 

To execute a distributed join first customer and orders need to be distributed on customer key.  Either they are both already distributed that way or one or both are redistributed.  That custord result set and order_items need to be distributed on order key.  Since custord is currently sitting with a customer key distribution from the first join, it HAS TO redistribute to be distributed on order key, and order_items either redistributes to order key, or is already physically distributed that way.

Looking at this model it seems impossible to avoid at least some on the fly redistribution of data, and possibly several redistributions.  But there is a simple way to avoid this.  Even though it is not logically needed, carry the customer key redundantly through to the order_items table.   Then we could distribute all three tables on customer key and each snippet processor has a set of customers, all their orders, and all their order_items together on it’s dedicated disk. 

But there is a catch ,  join customer to orders on customer key, and then orders to order_items on order key and it will redistribute on order key in the second step anyway because it doesn’t know that the order items belong to the customer and are already local to the snippet processor.  The solution is simple.  Join customer to orders on customer key, and then orders to order_items on order key (needed for the relation) AND customer key (needed for distribution).  What the optimizer will do is figure out that I already have everything together by customer key, and that key alone spreads things out well, so I don’t need to move things around.  It will then do the join between customer and orders, and nicely right in the same snippet with the same distribution do the join between the custord result set to order_items.  It’s critical to take advantage of the distribution to always include the customer key in the join even when just joining orders to order_items .

Even if the distribution is not on this carried through key, having it in the model and using it in joins reduces the number of redistributions.  Let’s say everything is distributed on random.  When we don’t have the key carried through, it will redistribute both customer and order on customer key, and then redistribute that custord result set on order key and also redistribute order_items by order key.  So that’s 3 tables and one result set redistributed.  If we carry the key, it will distribute all 3 on customer and then join them in one snippet.  You avoid the redistribution of the result set which makes it faster. 

Carrying keys through your model, using them for common distribution, and making sure to alway include those distribution keys in your joins can greatly improve performance by achieving good colocation through several joins.

Posted in Best Practices, Performance Tuning for Netezza | Tagged , , , , | 3 Comments

Distributed Joins, The Basics

This post is really about some base understanding that will be needed to understand some later performance methods.  I find to understand database performance it is helpful to understand and think how things are done at the most basic level.  So when we do a join in a distributed system like Netezza there are really only a few options it has to break up the join across spus (or snippet processors).  

First in database joins it’s alway best to think of the query plan as only joining 2 things at any time.  For a 3 table join, say tables A, B, and C  you can join A and B and then that result in a data stream or a temp table joins to table C.  For that reason I’ll only ever talk about joining 2 things at a time.

When joining 2 tables or data sets, the only way to split up the work is to have a snippet processor have part of one table  and a full copy of the second table (know as a broadcast) or part of one table and a matching part of another table, basically a distributed join. 

If one of the tables is small, then broadcasting of that small table to all of the snippet processors works very well.  Like if we had a small customer table, and a large orders table we could have a snippet processor that had any fairly even distribution slice of orders, even random, and then each snippet processor gets a full copy of customer.  Then it does some sort of nice efficient join like a hash join and all works well. 

If both of the tables are large then it becomes impractical to send a copy of one of the tables to all of the snippet processors, so it must do a distributed join.  Each snippet processor needs a slice of orders and a matching slice of customer.  The key work is “matching slice”.  In this case the natural “slice” is on customer key, assuming this is the primary key of customer and a foreign key in orders.  When looking at distributed joins you will notice the “match” set will be on all or some of the join keys or values.  It seems to allow the use of “some” of the join keys when a subset allows an even spread of data, and takes advantage of the data already having that distribution from the way it is physically distributed, or the way it was distributed from previous snippets (steps) in the plan.  Important to note, these are the equijoin keys and values (like cust.customer_key =ord.customer_key),  and NOT any of the expression-join (exprjoin in the plan file) values (like between cust.start_date and cust.end_date) .  

Given our example , to do the distributed join, we need both customer and orders distributed on customer_key.  Let’s say the query is only joining these 2 tables, then if both are already distributed on customer_key then you have a perfect colocated join, no data has to move, and each snippet processor works on it’s set of customers and has the orders for that customer right there on it’s dedicated disk.  This is ideal.  If we have customer distributed by customer_key and orders distributed by orders key, then the orders table will need to redistribute by customer_key on the fly, so each snippet processor can have that “matching” set of customers and orders.  This is single redistribution.  If both tables are distributed on random, then both tables need to be redistributed on customer_key.  This is double redistribution

There is one very obvious tip you can already get from this.  The less redistribution, the better.  So colocated joins are faster than single redistributions, which are faster than double redistributions.  In this example, to have both customer and orders distributed on customer_key has a performance boost when joining these 2 tables.

Posted in Best Practices, Performance Tuning for Netezza | Tagged , , , , , , | Leave a comment

Alternate Method For Sequences

So if you’ve used Netezza sequences you may have noticed they are all over the place, and far from sequential.  This is just the nature of a share nothing massively parallel architecture.  But if you want sequential sequences that automatically keep in sync with your table you can leverage the ability of Netezza to do table counts REALLY fast, and the row_number logic covered in ROWNUM alternative post.  Here is an example


   WHERE …) A

The ordering clause is only needed to give you some control over what order the sequences are generated.  You just cross product to a SQL getting the max value, then used the row_number method and add that to that max value.  I’ve used this on tables with millions rows.  Larger tables should probably stick to the sequence object.  The great thing about this method is you don’t have to worry about ever getting your sequence out of sync with the table.

Posted in SQL Tricks | Tagged , , | Leave a comment