DataStage Overview

DataStage is a comprehensive ETL tool that provides end-to-end ERP solutions.

Some of the most popular ETL tools are:

  • DSPX (DataStage PX): the leader among ETL tools; started in 2006
  • Informatica
  • ODI
  • SAS (ETL Studio)
  • BODI
  • Ab Initio

 

History of DataStage

DataStage has more than 12 years of history. The first release was in 1997.

1997: VMARK (UK)

  • Mr. Lee Scheffler is known as the father of DataStage.
  • DataStage was called Data Integrator during 1997 → Torrent (Data Integrator)

IBM acquired Informix, along with its database, in 2000.

  • 2000: Ascential DataStage Server Edition

It is the combination of Informix + Data Integrator.

  • 2000: Ascential DataStage Server + Orchestrate

  • Orchestrate is an ETL tool with extensive parallel capabilities.
  • It runs only on UNIX flavors.

 

UNIX flavors

  • AIX
  • Linux
  • HP-UX
  • Sun Solaris

→ Through the combination with Orchestrate, DataStage acquired parallel capabilities.

  • Version 6
  • Versions 6 to 7.5.1
  • Parallel environment

 

Ascential DataStage PX (Parallel Extender)

It can be configured only on UNIX flavors.

→ Up to version 7.5.1, the server components could be configured only on UNIX flavors.

December 2004

  • 7.5 * 2 ASDSPX + MKS Toolkit
  • ASDSPX: Ascential DataStage PX
  • MKS Toolkit: creates a virtual, UNIX-like environment on Windows XP in which DataStage can run

  • So the MKS Toolkit makes it possible to run DataStage on Windows.
  • 7.5 * 2 ASDSPX + MKS Toolkit can perform only data transformation.

→ MKS Toolkit → Ascential Suite components

Release

(a) ProfileStage

(b) QualityStage

(c) AuditStage

(d) MetaStage

(e) DataStage PX

(f) DataStage TX, as software

2005

→ IBM acquired the whole of Ascential → IBM DataStage Enterprise Edition 7.5 * 2 → (used by 50% of users)

2006

→ IBM WebSphere DataStage & QualityStage 8.0.1 → IDE (Integrated Environment) → (used by 40% of users)

Integrated environment of

(a) ProfileStage

(b) QualityStage

(c) AuditStage

(d) MetaStage

(e) DataStage PX

2009

→ IBM InfoSphere DataStage & QualityStage 8.0.1 → improved web services, and the server changed → (used by 10% of users)

Features of Data Stage

 

  1. Any to Any
  2. Platform Independent
  3. Node Configuration
  4. Partition Parallelism
  5. Pipeline Parallelism

Any to Any

Reads the data from any Source and loads it to any Target.

Any SRC    ↔   Any Target  

Platform Independent

A job designed for one platform can be executed on another.

  • A platform can generally be either software or hardware.
  • In DataStage, the platform is defined with respect to hardware.

Hardware environment

  • Uniprocessing environment

  Hard disk → CPU → RAM

  • Symmetric Multi-Processing (SMP)

  A single hard disk served by multiple CPUs; an SMP system can have 32–64 CPUs.

  • Massively Parallel Processing (MPP)

  A collection of different SMPs.

  Node Configuration

  • The best feature of DataStage
  • It is a technique for creating logical CPUs

  Node → logical CPU (or) instance of a physical CPU

→ It is software that creates virtual CPUs.

  • DataStage is executed on logical CPUs.
  • To run a job in DataStage, we require at least 1 node.

Ex: ETL

Uniprocessing

Hard disk → CPU → RAM

  • To access 1000 records, it takes 10 min.

SMP

  • To access 1000 records with 4 CPUs, it takes 2.5 min.

Node configuration:

  • Uniprocessing → virtual SMP

A uniprocessor system does not use the maximum capability of its CPU, so node configuration is software that divides it into different nodes. That boosts the capabilities and utilization of the CPU.
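The timings in the example (10 min on one CPU, 2.5 min on four) correspond to ideal linear scaling. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and it assumes the work divides evenly across nodes with no coordination overhead, which real jobs rarely achieve.

```python
def ideal_elapsed_minutes(single_node_minutes: float, nodes: int) -> float:
    """Elapsed time if the work splits evenly across nodes with no overhead.

    This is an idealized model (an assumption), not a DataStage formula.
    """
    if nodes < 1:
        raise ValueError("a job needs at least 1 node")
    return single_node_minutes / nodes


# Same figures as the example: 1000 records take 10 minutes on one node.
for nodes in (1, 2, 4):
    print(nodes, "node(s):", ideal_elapsed_minutes(10, nodes), "min")
```

With 4 nodes the model gives 2.5 minutes, matching the SMP figure quoted in the example.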

Partition Parallelism

→ Horizontal combining

  → Combining primary rows with secondary rows with respect to key column values


Partitioning

It is a technique of distributing the records across the nodes, based on partitioning techniques.

Partitioning Techniques

  • In addition to the eight partitioning techniques, there is a 9th technique known as ‘Auto’.

  Note:

  • Partitioning techniques play an important role in performance tuning.
  • A key-based technique ensures that rows with the same key column value are collected in the same partition.
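To make the key-based guarantee concrete, here is a small sketch in plain Python (not DataStage code) that contrasts a keyless round-robin split with a key-based split that takes the key modulo the node count. The function names and the sample records are hypothetical and chosen only for illustration.

```python
def round_robin_partition(records, nodes):
    """Keyless: deal records out to the nodes in turn."""
    parts = [[] for _ in range(nodes)]
    for i, rec in enumerate(records):
        parts[i % nodes].append(rec)
    return parts


def key_partition(records, key, nodes):
    """Key-based: the partition depends only on the key value."""
    parts = [[] for _ in range(nodes)]
    for rec in records:
        parts[rec[key] % nodes].append(rec)
    return parts


def partitions_holding(parts, key, value):
    """Indexes of the partitions that contain at least one row with this key value."""
    return [i for i, part in enumerate(parts) if any(r[key] == value for r in part)]


# Hypothetical sample rows: employee number and department number.
records = [
    {"eno": 11, "dno": 10},
    {"eno": 12, "dno": 20},
    {"eno": 13, "dno": 10},
    {"eno": 14, "dno": 30},
    {"eno": 15, "dno": 20},
]

rr = round_robin_partition(records, 3)
kb = key_partition(records, "dno", 3)
print(partitions_holding(rr, "dno", 10))  # [0, 2]: the key is split across nodes
print(partitions_holding(kb, "dno", 10))  # [1]: the key stays on one node
```

Round robin balances the row counts but scatters department 10 over two partitions, while the key-based split keeps every row of a department together, which is what a later join or aggregation on that key relies on.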

Ex:

EMP

DNO = key column (primary key of the second table)

E NO   E Name   DNO
11     a        10
12     b        20
13     c        10
14     d        30
15     e        20

D NO   D Name   Loc
10     ACE      Hyd
20     Meter    Sec
30     Sales    Eng

When the two tables are combined using a horizontal combination, rows with the same key column value are collected in the same partition.
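The horizontal combination in this example can be sketched in plain Python as follows. This is an illustration under stated assumptions, not DataStage code: the department table is called DEPT here (a name not in the source), its values are transcribed as read from the example, three nodes are assumed, and the key is partitioned by taking DNO modulo the node count.

```python
# Rows transcribed from the example: (ENO, EName, DNO) and (DNO, DName, Loc).
EMP = [(11, "a", 10), (12, "b", 20), (13, "c", 10), (14, "d", 30), (15, "e", 20)]
DEPT = [(10, "ACE", "Hyd"), (20, "Meter", "Sec"), (30, "Sales", "Eng")]
NODES = 3  # assumed node count


def partition_by_key(rows, key_index, nodes):
    """Key-based partitioning: the same key value always lands on the same node."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[row[key_index] % nodes].append(row)
    return parts


def join_partition(emp_rows, dept_rows):
    """Join one node's EMP rows to the DEPT rows held on that same node."""
    dept_by_dno = {d[0]: d for d in dept_rows}
    return [e + dept_by_dno[e[2]][1:] for e in emp_rows if e[2] in dept_by_dno]


emp_parts = partition_by_key(EMP, 2, NODES)
dept_parts = partition_by_key(DEPT, 0, NODES)
joined = [join_partition(e, d) for e, d in zip(emp_parts, dept_parts)]

for node, rows in enumerate(joined):
    print(f"node {node}: {rows}")
```

Because both inputs are partitioned on the same key, each node can join its own rows without looking at any other node, and no EMP row loses its match.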

Repartitioning

Data that has already been partitioned is partitioned once again.

Ex:

EName   Dno   Loc
A       10    AP
B       20    TN
C       10    TN
D       20    KN
E       30    TN
F       10    KN
G       20    AP

  • Partitioning and repartitioning are automatic processes in DataStage.
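A minimal sketch of repartitioning, using the rows from the example. It assumes the data is first partitioned on Dno and then repartitioned on Loc; the source does not state which keys are used, so both choices, the node count of 3, and the CRC-based bucket function are illustrative assumptions.

```python
import zlib

# (EName, Dno, Loc) rows from the example.
ROWS = [
    ("A", 10, "AP"), ("B", 20, "TN"), ("C", 10, "TN"), ("D", 20, "KN"),
    ("E", 30, "TN"), ("F", 10, "KN"), ("G", 20, "AP"),
]


def stable_bucket(value, nodes):
    """Deterministic bucket for any value (Python's built-in hash() of str varies per run)."""
    return zlib.crc32(str(value).encode("utf-8")) % nodes


def partition(rows, key_index, nodes):
    """Key-based partitioning on the column at key_index."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[stable_bucket(row[key_index], nodes)].append(row)
    return parts


def repartition(parts, key_index, nodes):
    """Redistribute already-partitioned rows on a different key."""
    flat = [row for part in parts for row in part]
    return partition(flat, key_index, nodes)


by_dno = partition(ROWS, 1, 3)       # first pass: same Dno on the same node
by_loc = repartition(by_dno, 2, 3)   # second pass: same Loc on the same node

for node, rows in enumerate(by_loc):
    print(f"node {node}: {rows}")
```

After the second pass every Loc value sits on exactly one node, which is what a downstream stage keyed on Loc would need.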

  Reverse Partitioning

  • Reverse partitioning is collecting the data from the nodes.
  • It happens in only one situation: parallel to sequential.

Reverse partitioning is also called collecting.

Different Collecting Methods

  1. Ordered
  2. Round Robin
  3. Sort – Merge
  4. Auto
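The first three collecting methods can be sketched in plain Python as below. This is an illustration of the ideas, not DataStage code: the function names and sample partitions are hypothetical, and the sort-merge collector assumes each partition is already sorted.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # sentinel for partitions that run out early


def collect_ordered(parts):
    """Ordered: read all of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(parts))


def collect_round_robin(parts):
    """Round robin: take one record from each partition in turn."""
    out = []
    for group in zip_longest(*parts, fillvalue=_MISSING):
        out.extend(x for x in group if x is not _MISSING)
    return out


def collect_sort_merge(parts, key=None):
    """Sort-merge: merge pre-sorted partitions into one sorted stream."""
    return list(heapq.merge(*parts, key=key))


parts = [[1, 5, 9], [2, 3], [4, 6, 7, 8]]
print(collect_ordered(parts))      # [1, 5, 9, 2, 3, 4, 6, 7, 8]
print(collect_round_robin(parts))  # [1, 2, 4, 5, 3, 6, 9, 7, 8]
print(collect_sort_merge(parts))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

All three return every record exactly once; they differ only in the order in which the sequential stage sees the rows.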

 

Pipeline Parallelism

Extraction, transformation, and loading are performed simultaneously.

Pipe link

A channel through which data moves from one stage to another stage
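One way to picture a pipe link is as a stream that hands each row to the next stage as soon as it is ready. The sketch below uses Python generators for this; the stage functions are hypothetical, and generators only interleave the stages row by row inside a single thread, so this illustrates the streaming behavior rather than true concurrent execution.

```python
def extract(rows, log):
    """Source stage: emit rows one at a time."""
    for row in rows:
        log.append(("extract", row))
        yield row


def transform(rows, log):
    """Transform stage: consume a row from the upstream link and pass it on."""
    for row in rows:
        log.append(("transform", row))
        yield row.upper()


def load(rows, log):
    """Target stage: write each row as it arrives."""
    target = []
    for row in rows:
        log.append(("load", row))
        target.append(row)
    return target


log = []
target = load(transform(extract(["a", "b", "c"], log), log), log)
print(target)   # ['A', 'B', 'C']
print(log[:3])  # [('extract', 'a'), ('transform', 'a'), ('load', 'A')]
```

The log shows that the first row is loaded before the second row is even extracted: no stage waits for the previous stage to finish the whole data set.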


Traditional Batch Processing

(Server jobs)

Sequential processing

Ex: Suppose we have 3 instructions:

I1 – Fetch (F), Decode (D), Execute (E), Write back (W)

I2 – F, D, E, W

I3 – F, D, E, W

→ In sequential processing, each instruction completes all four steps before the next one starts.

Parallel Processing

Running all instructions in parallel:

T1  T2  T3  T4  T5  T6  T7  T8
F   D   E   W
    F   D   E   W
        F   D   E   W
            F   D   E   W
                F   D   E   W
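The saving shown in the chart can be worked out with a short calculation. The sketch below assumes every stage takes exactly one time slot and that a new instruction can start in every slot; the function names are hypothetical.

```python
STAGES = ["F", "D", "E", "W"]


def sequential_cycles(instructions, stages=len(STAGES)):
    """Each instruction finishes all stages before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages=len(STAGES)):
    """A new instruction enters the pipeline every time slot."""
    return stages + instructions - 1 if instructions else 0


def pipeline_chart(instructions):
    """One text row per instruction, one column per time slot."""
    width = pipelined_cycles(instructions)
    rows = []
    for i in range(instructions):
        cells = [" "] * width
        for s, name in enumerate(STAGES):
            cells[i + s] = name
        rows.append("".join(cells))
    return rows


print(sequential_cycles(5), pipelined_cycles(5))  # 20 8
for row in pipeline_chart(5):
    print(row)
```

For the five rows in the chart, sequential processing would need 20 time slots while the pipeline finishes in 8 (T1 to T8); for the three instructions I1 to I3 the figures are 12 and 6.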

 

The core differences between versions 7.5 * 2 and 8.0.1 of DataStage

  • Client components

    • 7.5 * 2: 4 client components (DS Designer, DS Director, DS Manager, DS Admin)
    • 8.0.1: 5 client components (DS Designer, DS Director, DS Admin, Web Console)

  • Users

    • 7.5 * 2: OS-dependent (OS users become DataStage users)
    • 8.0.1: OS-independent (users can be created in DataStage, but with one dependency)

  • Repository

    • 7.5 * 2: file-based repository (folder)
    • 8.0.1: database repository (default is DB2)

  • Administration

    • 7.5 * 2: no web-based administration
    • 8.0.1: web-based administration

  • Architecture components

    • 7.5 * 2: 2 architecture components (server, client)
    • 8.0.1: 5 architecture components (common user interface, common repository, common engine, common connectivity, common shared services)

  • Phases

    • 7.5 * 2: can perform phases 3 and 4
    • 8.0.1: can perform phases 1, 2, 3, and 4

  • Tiers

    • 7.5 * 2: 2-tier
    • 8.0.1: N-tier

    Note:

Features of Manager in 7.5 * 2 are integrated into Designer in 8.0.1.

(a) In 7.5 * 2, the user ID and password used to log in for authentication are created in the O.S.; O.S. users become DataStage users.

(b) In 8.0.1, they are created in the DataStage environment.

  • Repository

In 7.5 * 2, everything is stored in a folder in the form of files.

  • 8.0.1

Data is organized in 2 layers:

  • Global repository → database → more security
  • Local repository → folder → performance
  • In 8.0.1, an admin can work from home, that is, by using the Web Console component.

4. (a) In 7.5 * 2, it is 2-tier:

S → server

C → client machine

(b) In 8, we can have multiple servers/engines but only 1 repository.

R – C1, C – C2, E1 – C3, E2 – C4, E3 – C4 ... En – Cn → n-tier: the components can be configured on n machines.

Client components of 7.5 * 2 and 8.0.1

7.5 * 2 Designer

  • Create jobs: Mainframe Jobs (MF), Server Jobs (SJ), Sequence Jobs (SJ), Parallel Jobs (PJ)
  • Compile
  • Run
  • Multiple job Compile

  Director

  • Views

    • Jobs
    • Status
    • Logs

  • Monitor
  • Batch Jobs
  • unlock jobs
  • Message Handling
  • Schedule jobs

Manager   

  • Import / Export
  • Node Configuration

Admin

  • Create projects
  • Delete projects
  • Organize project

8.0.1 Designer

  • Create jobs: Mainframe Jobs (MF), Server Jobs (SJ), Sequence Jobs (SJ), Parallel Jobs (PJ)
  • Compile
  • Run
  • Multiple job Compile
  • Import / Export
  • Node Configuration
  • Advanced Find
  • Performance Analysis
  • Estimate Resource

Director

  • Views
  • Jobs
  • Status
  • Logs
  • Monitor
  • Batch Jobs
  • unlock jobs
  • Message Handling
  • Schedule jobs

Admin  

  • Create projects
  •  Delete projects
  • Organize projects

Web Console

  • Security  Services
  • Reporting Services
  • Logging Services
  • Scheduling Services
  • Domain Management
  • Session Management

Information Analyzer / Console for IBM Information Server

Data profiling  (CA, PA, FA, Baseline, Cross-domain)
