Types of partition
Key based partition
Determines partition based on key value(s). One or more keys with different data type are supported.
Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.
Partition based on modulus of key divided by the number of partitions. Key is an Integer type.
Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.
Matches DB2 EEE partitioning
Example: This partition is used when loading data into the DB2 table. It takes the partition key from the loading DB2 table and inserts the records effectively. If the partition key is defined in the DB2 database then it takes that Partition key otherwise it defaults to primary key.
This figure gives the clear view of Key based Partitioning and repartitioning.
Ascential parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance. However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk - based on the downstream process data partitioning needs.
The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.
Example: Assume 8 nodes are allocated to store the records then 1st record will go into the first node and the 2nd record will go into the second node ……8th record will go into the eighth node and the 9th record will go into the first node and so on….
Send all rows down all partitions
Example: Assume 8 nodes are allocated, then in all the 8 nodes all the records will be passed.
Preserve the same partitioning.
Example: Two stages in a job (Sort and Dataset). In sort stage you have done “Hash” partition and in the dataset you have given “Same” partition. In the dataset the data will be preserved with the hash partition.
Parallel jobs can be executed in two ways
In the first slot of the below figure sequential execution is shown. Parallel job can be executed in two processing, SMP and MPP, in the second and third slot of the below figure it is shown.
Best allocation of Partitions in DataStage for storage area
|S. No||No of Ways||Volume of Data||Best way of Partition||Allocation of Configuration File (Node)|
|DB2 EEE extraction in serial||Low||-||1|
|2.||DB2 EEE extraction in parallel||High||Node number = current node (key)||64 (Depends on how many nodes are allocated)|
|3.||Partition or Repartition in the Stages of DataStage||Any||Modulus (It should be single key that to integer)|
Hash (Any number of keys with different data type)
|8 (Depends on how many nodes are allocated for the job)|
|4.||Writing into DB2||Any||DB2||-|
|5.||Writing into Dataset||Any||Same||1,2,4,8,16,32,64 etc… (Based on the incoming records it writes into it.)|
|6.||Writing into Sequential File||Low||-||1|
Best allocation of Partitions in DataStage for each stage
|S. No||Stage||Best way of Partition||Important points|
|1.||Join||Left and Right link: Hash or Modulus||All the input links should be sorted based on the joining key and partitioned with higher key order.|
|2.||Lookup||Main link: Hash or same|
Reference link: Entire
|Both the links need not be in the sorted order|
|3.||Merge||Master and update link: Hash or Modulus||All the input links should be sorted based on the merging key and partitioned with higher key order. Pre-sort makes merge “lightweight” for memory.|
|4.||Remove Duplicate, Aggregator||Hash or Modulus||If the input link is in sorted order based on the key it will perform better.|
|5.||Sort||Hash or Modulus||Sorting happens after partitioning|
|6.||Transformer, Funnel, Copy, Filter||Same||None|
|7.||Change Capture||Left and Right link: Hash or Modulus||Both the input links should be in the sorted order based on the key and partitioned with higher key order.|
These core tutorials will help you to learn the fundamentals of DataStage.
Step 1: (Serial extraction with proper partition)
In this job, extraction is made serial in both the DB2 stages. All the records will reside in 1 partition. As input link1 of the Sort stage has been “hash” partitioned, 1 partition will split into 8 partitions based on the given key(s) and records are also sorted based on the joining key(s). In Join stage for link2, “same” partition is given and on the other side for link3 with the same key(s) “hash” partition and internal sorting is done so that records from 1 partition will split into 8 partitions and also sorted within the partition. Now joining will be done effectively and the records are captured in the dataset with the “same” partition, which will maintain the records in 8 partitions. For 1 partition, 1 page will be created. So, for the entire job 32 pages will be created.
This will be one of the best ways to design the job for huge volume of data. But, extracting the records in serial makes performance to go down. Multiple instances can be enabled in this job and can be called in parallel to attain parallelism.
Step 2: (Parallel extraction with proper partition)
This job is same like the previous job but extraction alone made in parallel. In extracting the records in the DB2 side, 64 partitions will be created.
This will be one of the best ways to design the jobs for huge volume of data. But, the number of database connections been established is huge.
As the extraction is done in serial mode in both the DB2 stages, the allocated node will be 1 that is all the records will be in one partition and in all the remaining stages have the “same” partition, so the number of partition is retained as 1 in all the stages. So, the page created for this job will be 4.
This job will work fine for the low records. But for the huge records, it will hang. Since the entire job is in serial mode.
Here the extraction is made parallel in both the DB2 stages. So, the allocated node will be 64 and in all the other stage the partition are “same” so for every stage the allocated node will be 64. So the page created for this job will be 256.
For less or huge records this job will work very fast but it will occupy maximum space in the system. For this job wastage of memory (Paging Issue may come up) happens because 256 pages will be created and some times this kind of job will make the entire system to come down.
Unpredictable thing may happen here because it is all depend on the stages it will decide on its own since it is an auto partition.
Never use auto partition in the job.
In this case performance wise no issues but will loss the data on the Join stage because on the Link2 it is coming as 8 nodes and on the other side Link3 in 64 nodes so DataStage cannot join the data accurately since Join Stage is depended on the master link. In this case Link2 is the master link so it will join only with the 8 node and it will leave the other 56 node which is coming from the Link3. Therefore data coming from that 56 node will not be considered.
Never join the link without having same partitioning key on both sides otherwise data loss may occur and also sort the records before join.
For Indepth Knowledge on DataStage, click on below