Types of Partition

Ratings:

(4.3)

Share this blog:

Types of partition

Hash
Modulus
DB2
Auto
Random
Range
Round Robin
Entire
Same

Key-based partition

Hash
Modulus
DB2

Hash

Determines partition based on key-value(s). One or more keys with different data types are supported.

Example: Key is State. All “CA” rows go into one partition; all “MA” rows go into one partition. Two rows of the same state never go into different partitions.

Modulus

Partition based on the modulus of key divided by the number of partitions. The Key is an Integer type.

Example: Key is OrderNumber (Integer type). Rows with the same order number will all go into the same partition.

DB2

Matches DB2 EEE partitioning

Example: This partition is used when loading data into the DB2 table. It takes the partition key from the loading DB2 table and inserts the records effectively. If the partition key is defined in the DB2 database then it takes that Partition key otherwise it defaults to the primary key.

This figure gives a clear view of Key-based Partitioning and repartitioning.

DB2

Ascential parallel technology operates by a divide-and-conquer technique, splitting the largest integration jobs into subsets ("partition parallelism") and flowing these subsets concurrently across all available processors ("pipeline parallelism"). This combination of pipeline and partition parallelism delivers true linear scalability (defined as an increase in performance proportional to the number of processors) and makes hardware the only mitigating factor to performance. However, downstream processes may need data partitioned differently. Consider a transformation that is based on customer last name, but the enriching needs to occur on zip code - for house-holding purposes - with loading into the warehouse based on customer credit card number (more on parallel database interfaces below). With dynamic data re-partitioning, data is re-partitioned on-the-fly between processes - without landing the data to disk-based on the downstream process data partitioning needs.

Keyless partition

Round Robin
Entire
Same

Round Robin

The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round-robin method always creates approximately equal-sized partitions. This method is the one normally used when DataStage initially partitions data.

Example: Assume 8 nodes are allocated to store the records then 1^st record will go into the first node and the 2^nd record will go into the second node ……8^th record will go into the eighth node and the 9^th record will go into the first node and so on….

Entire

Send all rows down all partitions

Example: Assume 8 nodes are allocated, then in all the 8 nodes all the records will be passed.

Same

Preserve the same partitioning.

Example: Two stages in a job (Sort and Dataset). In the sort stage, you have done “Hash” partition and in the dataset, you have given the “Same” partition. In the dataset, the data will be preserved with the hash partition.

Application Execution:

Parallel jobs can be executed in two ways

Sequential
Parallel

In the first slot of the below figure sequential execution is shown. A parallel job can be executed in two processing, SMP and MPP, in the second and third slots of the below figure it is shown.

Application Execution

Best allocation

Best allocation of Partitions in DataStage for storage area

S. No	No of Ways	Volume of Data	Best way of Partition	Allocation of Configuration File (Node)
	DB2 EEE extraction in serial	Low	-	1
2.	DB2 EEE extraction in parallel	High	Node number = current node (key)	64 (Depends on how many nodes are allocated)
3.	Partition or Repartition in the Stages of DataStage	Any	Modulus (It should be a single key that to integer) Hash (Any number of keys with different data type)	8 (Depends on how many nodes are allocated for the job)
4.	Writing into DB2	Any	DB2	-
5.	Writing into Dataset	Any	Same	1,2,4,8,16,32,64 etc… (Based on the incoming records it writes into it.)
6.	Writing into Sequential File	Low	-	1

Best allocation of Partitions in DataStage for each stage

S. No	Stage	Best way of Partition	Important points
1.	Join	Left and Right link: Hash or Modulus	All the input links should be sorted based on the joining key and partitioned with higher key order.
2.	Lookup	Main link: Hash or same Reference link: Entire	Both the links need not be in the sorted order
3.	Merge	Master and update link: Hash or Modulus	All the input links should be sorted based on the merging key and partitioned with higher key order. Pre-sort makes merge “lightweight” for memory.
4.	Remove Duplicate, Aggregator	Hash or Modulus	If the input link is in sorted order based on the key it will perform better.
5.	Sort	Hash or Modulus	Sorting happens after partitioning
6.	Transformer, Funnel, Copy, Filter	Same	None
7.	Change Capture	Left and Right link: Hash or Modulus	Both the input links should be in the sorted order based on the key and partitioned with higher key order.

Inclined to build a profession as Datastage Developer? Then here is the blog post on, explore Datastage Training

Appropriate Ways

Step 1: (Serial extraction with proper partition)

In this job, extraction is made serial in both the DB2 stages. All the records will reside in 1 partition. As input link1 of the Sort stage has been “hash” partitioned, 1 partition will split into 8 partitions based on the given key(s), and records are also sorted based on the joining key(s). In the Join stage for link2, the “same” partition is given, and on the other side for link3 with the same key(s) “hash” partition and internal sorting is done so that records from 1 partition will split into 8 partitions and also sorted within the partition. Now joining will be done effectively and the records are captured in the dataset with the “same” partition, which will maintain the records in 8 partitions. For 1 partition, 1 page will be created. So, for the entire job, 32 pages will be created.

Serial extraction

Suggestion:

This will be one of the best ways to design the job for a huge volume of data. But, extracting the records in serial makes performance to go down. Multiple instances can be enabled in this job and can be called in parallel to attain parallelism.

Step 2: (Parallel extraction with proper partition)

This job is the same as the previous job but extraction alone made in parallel. In extracting the records in the DB2 side, 64 partitions will be created.

Parallel extraction

Suggestion:

This will be one of the best ways to design jobs for a huge volume of data. But, the number of database connections been established is huge.

Step 1:

As the extraction is done in a serial mode in both the DB2 stages, the allocated node will be 1 that is all the records will be in one partition and in all the remaining stages have the “same” partition, so the number of partition is retained as 1 in all the stages. So, the page created for this job will be 4.

step 1

Suggestion:

This job will work fine for the low records. But for the huge records, it will hang. Since the entire job is in serial mode.

Step 2:

Here the extraction is made parallel in both the DB2 stages. So, the allocated node will be 64 and in all the other stage the partition are “same” so for every stage the allocated node will be 64. So the page created for this job will be 256.

Step 2

Suggestion:

For less or huge records this job will work very fast but it will occupy maximum space in the system. For this job wastage of memory (Paging Issue may come up) happens because 256 pages will be created and sometimes this kind of job will make the entire system to come down.

Step 3:

An unpredictable thing may happen here because it all depends on the stages it will decide on its own since it is an auto partition. Step 3

Suggestion

Never use auto partition in the job.

Step 4:

In this case performance wise no issues but will lose the data on the Join stage because on the Link2 it is coming as 8 nodes and on the other side Link3 in 64 nodes so DataStage cannot join the data accurately since Join Stage is depended on the master link. In this case, Link2 is the master link so it will join only with the 8 nodes and it will leave the other 56 nodes which is coming from the Link3. Therefore data coming from that 56 nodes will not be considered.

Step 4

Suggestion:

Never join the link without having the same partitioning key on both sides otherwise data loss may occur and also sort the records before joining.

For an in-depth Knowledge on DataStage, click on below

You liked the article?

Like : 0

Vote for difficulty

Current difficulty (Avg): Medium

Recommended Courses

1/6

About Author

Name

TekSlate

Author Bio

TekSlate is the best online training provider in delivering world-class IT skills to individuals and corporates from all parts of the globe. We are proven experts in accumulating every need of an IT skills upgrade aspirant and have delivered excellent services. We aim to bring you all the essentials to learn and master new technologies in the market with our articles, blogs, and videos. Build your career success with us, enhancing most in-demand skills in the market.

Stay Updated

Get stories of change makers and innovators from the startup ecosystem in your inbox

Related Blogs

DataStage Tutorial

DataStage Interview Questions

Introduction to DataStage

ETL Project Phases in Data Stage

Architecture of Data Stage

Roles and Responsibilities of ETL Developer in Data Stage