About

I’m a platform architect who has spent the last 2 years building an enterprise data warehouse on a Netezza 10400 at a digital media company.  We have large data volumes and very complex data.  My job is to make the platform do things that push the envelope on performance and data translation.  During these 2 years I’ve figured out technical solutions that are at best poorly documented.  I’ve had numerous lunches with other Netezza customers to help them with their challenges.  Often I’ve been told “You should really start a blog”.  So here I am.

 

Just an addition: it’s now been over 3 years and we are on an IBM Netezza TwinFin 24. We also have a Cruiser 8 as a DR and archive server. I refuse to use the new IBM names.

 

43 Responses to About

  1. Muttrox says:

    Gee – so this is what you do all day. I understand about half of this. More than I expected.

  2. Brad Terrell says:

    I like how your header image suggests that Netezza is the Ferrari of data warehouse solutions. :-)

  3. Christine Smiley says:

    Trying to get a quick answer to a question about cross-database joins. If I have 2 databases running on the same Netezza server, and am running a join that crosses the databases, is this typically any less efficient than if the exact same tables I am joining were within the same database? Based on queries I have run, I am not seeing any noticeable difference, but someone I know is claiming it would be a horrible thing to do and is implying it would crash the system. Would love your thoughts on it.

    • NZGuy says:

      Cross-database joins are fine on Netezza. We do it all the time as standard practice. It’s physically one set of storage, one set of temp, and one database engine. I think the confusion comes from other platforms like Oracle, where a database instance is a separate set of storage, executables, temp, etc., and data has to first be copied on the fly to one database before a join can be performed. In Netezza a “database” is much more like a schema in Oracle. The only limits I’ve seen are that you can only insert, update, or delete in the database you are locally logged into, and you can only reference sequences locally.
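
      A minimal sketch of what that looks like, with made-up database, table, and column names (cross-database objects are referenced as database..table):

      -- run while connected to SALES_DB; CRM_DB is another database on the same appliance
      select o.order_id, c.customer_name
      from orders o
      join crm_db..customers c
        on o.customer_id = c.customer_id;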

  4. Sodi says:

    Hi,
    We created a few tables with primary keys (though the constraints aren’t enforced), and whenever we try to add or drop a column we have to recreate the table, which resulted in the data disappearing (still in dev, though).

    Is this a bug in Netezza? Not sure why an alter table forces us to recreate a table that has a PK on it.

    Appreciate your help on this.

    Thanks
    Sodi

    • NZGuy says:

      Adding columns in Netezza does require recreating the table. I assume you are talking about adding a column and not adding a constraint to an existing column. It has to do with the physical layout of the data across the SPUs and the fact that Netezza doesn’t chain rows like Oracle does. And BTW, even though Oracle allows alters by chaining rows (each updated row lives in 2 physical places), that hurts read performance, and it was best practice to recreate the table there too.

      Any time you want to change a table, the best method is to create a new table with a CTAS (Create Table As) statement, rename the original table to some other name, like with an _old suffix, and then rename the new table to the name of the original table. You could also just create the new table with a create statement, populate it with an insert/select statement, and then do the renames.
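
      A rough sketch of the CTAS-and-rename approach, using a hypothetical my_table and a hypothetical new_col (adjust the columns and distribution key to your table):

      -- build the new structure and copy the data in one pass
      create table my_table_new as
      select a, b, cast(null as integer) as new_col
      from my_table
      distribute on (a);

      -- swap the names; keep the old table around until you've verified the copy
      alter table my_table rename to my_table_old;
      alter table my_table_new rename to my_table;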

  5. Kajal says:

    Issue with Not Null constraint in Netezza and Informatica:

    We are facing a bit of an odd issue here. While running the Informatica mapping, either NZ or Infa is not honouring the table’s constraint. I have a column which is not null, but when I run the session with a null value, the Informatica log shows data inserted into the Netezza table, yet when I query the NZ db I see no rows. This is very alarming and I’m not sure how to fix it, as we can’t always filter out the NULL data in the SQ.
    Does this issue lie with Informatica or Netezza? Ideally Infa should read the definition of the TGT table and act on it, but it’s not happening consistently.

    The behavior is not consistent. It’s happening with only 2 or 3 tables; we have some tables where Informatica throws an error saying a null value can’t be inserted into a not-null column.

    Let me know if you have any feedback on this.

    Thanks
    Kajal

    • NZGuy says:

      Because there is no data in the table, instead of data with nulls in the columns, I tend to think this is an Informatica issue. We currently do not use the PowerExchange connector with Informatica and Netezza. We do 99.9% full Pushdown Optimization with ODBC, and then straight ODBC for a handful of other very low volume mappings. Is this happening with PowerExchange? Also, what version of NPS are you running?

  6. Kajal says:

    Thanks for the reply.

    Our dev and prod servers are TwinFin 3; our data load is not so huge, so the client is fine with version 3. To load from Informatica, we are using the ODBC connector for Netezza for all types of table loads (moderate to low volume).

    We observed another odd behaviour with this mapping that looks like an Informatica issue: when we move the mapping from the individual developer folder to another folder and run the session, it works as expected (throwing an error when it finds a NULL value in a column where it expects not-null).

    I am still not able to identify the exact issue and why Infa doesn’t throw the error. Is it happening because we are using Netezza?

    Regarding the Pushdown Optimization, do you apply the logic in the target or the source DB?

  7. NZGuy says:

    Actually that “3” is 3 S-Blades, the size, not the NPS software version. We are running NPS 6.0.5. You can see this, I believe, in the database properties in SQuirreL; I think Aginity Workbench has something similar, and it should be in NzAdmin. So either way, it sounds like an Informatica issue if it works in one folder and not another. I’ll ask one of our Infa guys if he’s seen this.

    So we do full pushdown, which turns the source SQL into a view and creates a select/insert. As a standard we do not cross databases. We always go from a source DB to a flat file on network-attached storage in one set of ETL, then mount the files as an external table and do a full pushdown to select from the external table and insert into an internal table as another set of ETL. We do this flat-file interface to give complete operational and developmental independence between our source systems and our EDW. It also aids us in reloads because we can just go back to the flat files.
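
    As a rough idea of what that second set of ETL looks like (file path, table, and column names here are made up; the external-table options vary by file format):

    -- expose the flat file on the NAS mount as an external table
    create external table ext_orders (
        order_id bigint,
        order_ts timestamp,
        amount numeric(12,2)
    )
    using (
        dataobject ('/mnt/etl_nas/orders_20140101.dat')
        delimiter '|'
    );

    -- the full-pushdown step is then just a bulk select/insert inside Netezza
    insert into stg_orders
    select * from ext_orders;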

  8. Kajal says:

    hello,
    We figured out the issue; it was with the NZ ODBC driver setup. It’s now behaving as expected when the session gets bad input data.

    Thanks
    Kajal

    • sree says:

      Hi Kajal,

      We are facing a similar issue with Informatica. Can you please tell us how to set up the NZ ODBC driver to handle bad input data in a session?

      Thanks,
      sree

  9. Naveen says:

    Hello,
    Please need your help.

    Our target is a Netezza database, the source is a .csv file, and we are using Informatica 8.6.0.

    In the Workflow Manager, in the Relational Connection Browser, I am not able to get Netezza as the database. (I am able to see other databases like Oracle, Sybase, Informix, DB2.)

    Do I need to add any drivers, and how do I get them?

    Thank You.

    • NZGuy says:

      I think you need to install the ODBC drivers. Netezza has an FTP site ftp://ntzftp.netezza.com/ where you can find the drivers, but someone at your company will need to tell you the username and password. Also, if you are talking about your local Windows machine, I’ve been told there is a quirk where on 64-bit Windows you need to install both the 64-bit and the 32-bit ODBC drivers before it will show.

      • Gary Colbran says:

        The issue with 64/32-bit is that 32-bit applications running on 64-bit Windows require access to the 32-bit drivers, which install in SysWOW64. You then have to use SysWOW64\odbcad32.exe to configure the 32-bit data sources.
        On 64-bit Windows the default ODBC admin handles the 64-bit ones.

  10. G Bear says:

    I am looking at implementing ONE Netezza TwinFin 12 to support 2 separate lines of business, using separate databases: one will be primarily for in-database analytics where data will continually grow, and the other for data transformation and aggregation where data will be loaded and deleted. Should I be concerned about the fact that these databases will use ONE instance instead of the traditional approach where I can have separate instances?

    • NZGuy says:

      So on conventional databases like Oracle when you set up a separate instance you allocate separate memory, and usually use separate physical disks, though sometimes this is only logically separate on the same physical array. No doubt this reduces the amount of contention between the 2 instances, but you are also subdividing your memory and disk spindles so no individual database can use all the resources. Also having 2 separate instances allows for 2 separate, perhaps optimized configurations.

      When you have 2 databases on Netezza, it’s really like 2 schemas on the same instance. I don’t think the ideal configuration for your 2 purposes would be any different. And really it’s not a lot different from many or most environments where you run ETL and reporting in the same warehouse. Also worth noting, many in-database analytic solutions (Fuzzy Logix, for example) use custom UDFs (user-defined functions). These are installed at the database level.

      You mention one area where the data will load and grow. The other will be loaded and deleted. As long as you are mindful of taking advantage of zone maps by having your data sorted on something useful like a timestamp, you will still be able to select a subset of data by date quickly, because it will only read the blocks for the dates you are pulling. This scales very well and works much like Oracle partitions, just easier to use. On the load-and-delete side, either actually do a truncate table, or run groom to clean up your soft-deleted rows, and you will be fine. So I don’t have concerns about these applications co-existing on the same TF12.
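
      As a rough illustration of both points (hypothetical table and column names):

      -- loading in date order keeps the zone maps tight, so date-range scans
      -- only touch the blocks for the dates requested
      insert into web_events
      select event_ts, account_id, event_type
      from ext_web_events
      order by event_ts;

      -- after heavy deletes, reclaim the soft-deleted rows
      groom table web_events;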

      That said, you do need to think about and manage your 2 applications competing for the same resources. One aspect may be scheduling. You might not really have an issue if your ETL happens overnight and your analytics during the day. But assuming that is not the case, you will want to enable GRA (Guaranteed Resource Allocation). This will help allocate and dedicate resources to different user groups. We are in the process of implementing this, so I can’t yet speak to its limitations.
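
      For what it’s worth, resource groups are defined something like the following. The group names, percentages, and users are made up, and since we haven’t finished rolling this out ourselves, check the exact syntax against your NPS release:

      -- assumed syntax: give each workload a guaranteed slice of the box
      create group analytics_grp with resource minimum 30 resource maximum 100;
      create group etl_grp with resource minimum 50 resource maximum 100;

      -- attach users to their resource group
      alter user analytics_user with in resourcegroup analytics_grp;
      alter user etl_user with in resourcegroup etl_grp;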

      The one thing that may be more of a challenge is allocating storage between the 2 applications. There is a hinted promise of setting storage limits in NPS 7, but they set it at an overall level, not at a dataslice level. So the issue is that even with modest storage limits, a table with extreme skew could fill up a dataslice and make your ETL processing fail. The only real solution I see at this time is to run monitoring scripts that alert you if an analytic user (assuming they can create their own tables) creates a skewed table that fills up a dataslice. This may also be more of a training issue, getting your users to monitor what they create.

      So I think it is definitely possible to do what you are wanting, but it will take some management of resources. However, you might also think about how you handle outages (planned or otherwise). Is it possible that having 2 TwinFin 6s would let each also act as a disaster recovery (DR) environment for the other?

  11. Swati says:

    I’m looking for a system table or query that can give details when there is contention on any data slice. Where can I get system IO details in Netezza?

    • NZGuy says:

      I don’t know of any way to do this remotely; it seems to require host access. There is a host utility that will spit out a row of 1s and 0s, each digit representing a dataslice or SPU process. It will create a long row for each active query. Though it doesn’t break out IO specifically, IO is usually the bottleneck, and it will probably be IO to and from temp. Basically this is looking for process skew. The command is
      nzsqa responders -sys

      • Gary Colbran says:

        select datasliceid, count(*) from your_table group by datasliceid order by datasliceid desc; will give you your distribution, and therefore your “contention”. There is no contention per se on Netezza because of the locking methodology, but you will get data skew (shown by the above query) or processing skew (which you cannot identify without knowing your data).
        IO will therefore be related to data volumes. An alternative is to use the admin tool (list all tables, check the skew number), which can list all tables in the system.

  12. Sam says:

    How do we avoid loading duplicate records in Netezza?
    In Oracle we have unique constraints and primary keys, which prevent loading duplicate records. What is the best way to achieve this?

    Thanks in advance

  13. NZGuy says:

    Even back when we did BI in Oracle we turned off constraints because they prevented bulk loading, primarily because uniqueness testing requires an index. FYI, we used a partition-exchange loading method with Informatica bulk loading into an indexless table.

    The simplest answer is to enforce constraints in your ETL. One method we use is to write the select for the insert statement in a way that guarantees uniqueness, and then do a not-exists check or an outer join to the target table, where the target primary key is null, to verify that what you are inserting doesn’t already exist in the target table. Make sure to include any common distribution key in your join and it should be pretty efficient.

    The other method we use is a simple post-load audit comparing count(*) and count(distinct primary_key_field), for either the entire table or, on target fact or transaction-level tables, usually some smaller date range. Then raise an error when the 2 counts don’t match. We also test foreign-key relations by comparing the count of the child table to the count of the child table joined to all its parents.
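
    Roughly, the post-load audits look like this (table, key, and date column names are placeholders):

    -- uniqueness audit: the two counts must match or the load raises an error
    select count(*) as row_cnt,
           count(distinct order_key) as key_cnt
    from fact_orders
    where load_date = current_date;

    -- foreign-key audit: child count vs. child-joined-to-parent count
    select
        (select count(*) from fact_orders) as child_cnt,
        (select count(*)
         from fact_orders f
         join dim_customer c on f.customer_key = c.customer_key) as child_join_cnt;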

  14. Sundeep says:

    Hi NzGuy,

    You are doing a great job sharing your knowledge. I have a couple of doubts:

    1.) We are migrating from DB2 9.1 (AIX) to Netezza, and I was wondering if there is any better method of moving data from the DB2 DB to the Netezza DB than doing a flat-file export from DB2 and loading those files into Netezza. Can we connect to DB2 from the Netezza database?

    2.) Similarly, can we create a federated database in Netezza which can connect to, say, DB2 z/OS objects using something like synonyms (like in DB2 UDB, where we can connect using nicknames)?

    Thanks in advance

    • NZGuy says:

      1) Pretty much you are stuck with setting up external tables on flat files. However, we have a very smart Russian guy where I work who was able to reference a Linux named pipe in an external table, and I believe do the same on the Oracle source side, streaming data by inserting directly into the external table on Oracle and extracting it in Netezza (see the rough sketch below). I believe he even wrote a C program to facilitate this. IBM has also been working on some replication solutions, so maybe they have something that can go DB2 to Netezza.

      2) There is nothing I know of in Netezza that will cross databases like Oracle DB Links and their old Transparent Gateway product did. That usually wasn’t that great anyway, because if you dug into what was happening under the covers, it had to pull the data local first and was limited in the filtering it could push across. If you are set on taking this approach you might want to investigate Hadoop and a data integration product like Datameer. Basically Datameer pulls everything into a central Hadoop environment and finishes the integration there.
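
      Regarding the named-pipe trick in 1), the Netezza side is roughly this (a sketch with made-up names; a separate process on the host has to be writing delimited rows into the pipe while the insert runs):

      -- the external table points at a named pipe instead of a regular flat file
      create external table ext_pipe_orders (
          order_id bigint,
          amount numeric(12,2)
      )
      using (
          dataobject ('/tmp/db2_to_nz.pipe')
          delimiter '|'
      );

      -- reads from the pipe as the source system writes into it
      insert into stg_orders
      select * from ext_pipe_orders;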

      • Sundeep says:

        Thanks For your reply,

        I am not as smart and active as the Russian guy, so I won’t take on that much pain :P.
        I will get in touch with the IBM Software Group to ask if we can use Q Replication or another replication tool to move the data.

        P.S. I found a tool, WinSQL, which says it can move data from any DB to any DB, but I think what it is doing in the back end is taking an export and then importing it.

        Thanks

  15. Mahesh says:

    Hi, I’m an Informatica guy and we use NZ as our DWH layer. We are using full pushdown in one of our Informatica mappings, but it seems that a number of records are not loaded into the target. There is no difference in the data types, only in the data values that are loaded into the target. But when I re-run the job with only the unloaded records, they load on the second attempt.
    Also, the Infa log shows a count that matches the source table (NZ), but the actual NZ table does not show the full count. Please suggest whether there is a bug in Infa (using pushdown) connecting with NZ.

    Thanks,
    Mahesh

    • NZGuy says:

      I have to wonder if you are truly achieving full pushdown. If you are, the Informatica connection should be creating a view in NZ with all the SQL logic and then doing a simple select from the view (perhaps with a sort) and an insert into the target table. Make sure it is actually doing this. Capture the view definition. Capture the SQL. You should then be able to run the same select/insert from a query tool after you recreate the view. Because it is a straight single-transaction select/insert, it is hard to see how rows would be missing unless there is something wrong with the SQL logic in the view.
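
      One way to capture the view definition from a query tool; I believe the _v_view catalog view carries the text, and the filter pattern below is just a guess at how your pushdown views are named:

      select viewname, definition
      from _v_view
      where viewname like 'INFA%';  -- match whatever naming pattern your pushdown views use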

    • Tomik says:

      As NZGuy wrote, it is important to check the SQL generated by pushdown. When you set invalid values for properties you can end up with strange SQL like 'AND NULL = NULL'; in our case it turned out that a column was not set as a lookup.

  16. Bob says:

    Hi,
    I need help with migrating settings. I’ve recently upgraded to a new machine and want to migrate my settings from my old machine install of Aginity for Netezza to my new one. How do I do this? Specifically I want to migrate all my Connections profiles. Thanks in advance.

    • NZGuy says:

      This is more of a workbench question than a Netezza question. I always just add them manually. You guys might have too many to make that practical. I’ll email you the email address of the developer who wrote Aginity workbench and you can contact him directly.

  17. Thomas Petersen says:

    Currently we are converting to Netezza from Oracle and are just about done. Switching from Toad to Aginity is a real letdown. Developers do not have access to the Netezza machine, so can you tell me how to do little things like tell whether a table has stats on it and when they were last refreshed? How about when a stored procedure was last compiled? Is there a list of system tables we might get access to?

    • NZGuy says:

      I would try NzAdmin, the IBM Netezza Administrator tool. We have all users install it. It’s definitely good for looking at running queries, looking at plan files, seeing if you have stats on tables, and most importantly for us, killing your own runaway queries. Give it a try.

  18. Mark says:

    Ever hear of anyone trying to increase storage capacity by replacing all the 1TB disks with something larger? 2TB, 3TB disks? It would be cheaper than buying another twinfin.

    • NZGuy says:

      Won’t work. It’s an appliance, so the software would have to support it, and no doubt IBM would no longer support you IF you even got it to come back up. In fact, on some of the Mustang 10xxx series they used bigger drives but the software capped them to the same size as the older drives (700GB drives capped to 400GB). FYI, they are just now stopping sales of TwinFins; it’s all Striper now. We just got a TF96 on a deal, and a new Striper 2 rack. If you want more TwinFin, talk to your rep; they may make you a crazy good deal.

  19. Mark says:

    Is there a way I can set up a Netezza event to run a script at /path/scriptname, passing it variables?

    Do I need the event to run a stored procedure or user defined function to accomplish this?

    Basically, it would be great if there were a simple way to have Netezza events call the common, reusable, parameter-driven scripts that we have developed for our production environments and converted to work with Netezza databases. Our production jobs call these to generate enterprise events and notifications through our corporation’s production support/monitoring infrastructure. It would be a lot simpler and easier if we could set up our Netezza production environments to operate like our other database production environments.

    Thus far, it has been like pulling teeth trying to figure this one out with this database. Either the documentation does not exist or I simply have not read it yet or have been looking in the wrong place. Hopefully, this will be, as it is in other databases, fairly simple and straightforward to do. Thanks in advance.

    • NZGuy says:

      Sorry to take so long to get to this. When I need to script something up outside of our ETL tool, frankly I just drive this type of parameter out of a table in the DB. There may be a better way to set a variable like you can in Oracle; I just haven’t taken the time to find it, since putting the parameters in a table has worked.
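
      Roughly what that looks like, with made-up table and parameter names; a script can then pull the value back with nzsql:

      create table etl_parameter (
          param_name  varchar(128),
          param_value varchar(1000)
      ) distribute on random;

      insert into etl_parameter values ('ALERT_EMAIL', 'dw-support@example.com');

      -- in a shell script, something like:
      --   nzsql -A -t -c "select param_value from etl_parameter where param_name = 'ALERT_EMAIL'"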

  20. David says:

    I’m looking for a method of bulk loading a directory of files (daily weather station data). Each file has the same definition. Would like to know what methods you would suggest. Thanks!

    • NZGuy says:

      We have a similar problem in that we load about 5000 weblog files a day. The short answer is that we wrote a Unix script that pipes the output, in our case of a gunzip of the files, into an nzload command. This was critical for performance, because we didn’t have to gunzip the files to disk first. Each startup of the nzload command has overhead, so we actually build a list of files to gunzip into the standard input of nzload and do batches at a time, instead of 1 at a time. It may not be well advertised that nzload can read from standard input on a command line, but this method has been working for 5 years across 3 generations of Netezza appliances. Here is the command line from a rather more involved shell script. If your files are not gzipped, just use something like a cat of the files.

      gunzip -c `grep "$i" $vApacheLogFileList` | nzload -host $NZHOST -db $db -maxErrors 1000000000 -fillRecord -t $db_table -delim ' ' -quotedValue 'DOUBLE' -escapeChar '\' -lf $NZLOG -bf $NZBAD >> $LOG_FILE 2>&1

  21. Nitesh says:

    Q. We have implemented a data lake (a PDM based on the IBM Banking DW) on a Netezza appliance, and we use DataStage as the ETL tool.

    I have to design multiple ETL flows for data (events: opening an account; transactions: paying at the grocery store, receiving your salary, etc.) coming from the same source system. These would run every 20 minutes and load several different tables plus a common table (columns: surrogate key, source system id, unique id in the source system (account number)).

    The trouble lies in

    - The ETL flows must be able to run in parallel
    - When they run in parallel, they could load the same reference row into the common table, and we can’t cover that possibility even with a lookup against the database in DS before loading the incremental load into NZ.

    As a solution, I am looking at https://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.netezza.use.doc/topics/netezza_config_prinmary_key_validation.html

    However, I was wondering if it will result in a performance hit due to (I am assuming) the bulk load dropping down to a row-by-row load because of the duplicate check.

    +

    Could you suggest alternate solutions?

    Thanks
    NC

    • NZGuy says:

      I think even the method you linked to will not protect against duplicates if you are concurrently running 2 loads into the same target table. This is because an insert of a value by one load won’t be seen by another session until that batch insert completes and implicitly commits. So if 2 loads load the same account, not yet existing in the target, at the same time, I believe you will have dups. You need to serialize the load into this common table. There is a method I use all the time, and I would guess something similar is done in the method you linked, where you do not have to keep existing keys out of your source: you outer join your source select to the target table and add a condition where the target key is null.

      So roughly

      Insert into account_target
      select s.* from account_source s
      left join account_target t
      on s.account_num = t.account_num
      where t.account_num is null;

      This will not insert an account_num that already exists in the target, and it does so as a bulk insert.

      Hope this helps.

      • Nitesh says:

        Hi,

        Thanks for your response.

        The response you gave is actually the problem I am facing. However, the issue at our site is that we have a one-commit principle, due to which we can’t load the account_target (from your example) during the processing of the ETL flow, only as the last step of the flow. Thus the issue….

        I suggested using an intermediate table which we commit/write to during the execution of the job by using the netezza connector’s brute force solution https://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.5.0/com.ibm.swg.im.iis.conn.netezza.use.doc/topics/netezza_config_prinmary_key_validation.html

        and then using this intermediate table to load the facts/relationship tables.

        So, with brute force, I ensure that only one of the conflicting/potential duplicates finds its way to the DB, and the rest of the jobs that build the relationship tables/sub-types use the surrogate key which made it to the intermediate table.

        However, my solution was turned down by the architect :( and I am now looking for alternatives…..

        From experience, do you know of another approach, apart from

        - Having a separate flow for the common table/lookup
        - serializing the ETL flows….

        Thanks :)

  22. Shraddha says:

    I need to do a string search on all the database objects in all the databases on a given server. E.g., when querying _v_object it returns the objects in the selected database and not all the databases on the server. Could you please suggest what I can do, other than connecting to all the databases one by one to get the desired result?

    • NZGuy says:

      So it’s a privilege issue. I can see objects from any database while connected to any database in _v_object. Privs are something I really don’t deal with much, but I would guess it might be something like having LIST at the global level.
