DataStage Overview
DataStage is a comprehensive ETL tool that provides end-to-end ERP solutions.
Some of the most popular ETL tools are:
- DSPX → leader of ETL tools, started in 2006
- Informatica
- ODI
- SAS (ETL STUDIO)
- BODI
- Ab Initio
History of DataStage
DataStage has more than 12 years of history.
The first release was in 1997.
1997 – VMARK – UK
- Mr. Lee Scheffler → father of DataStage
- DataStage was called Data Integrator during 1997 → Torrent (Data Integrator)
In 2000, IBM acquired Informix along with its database.
- 2000: Ascential DataStage Server Edition
(It is the combination of Informix + Data Integrator.)
- 2000: Ascential DataStage Server + Orchestrate
- Orchestrate is an ETL tool with extensive parallel capabilities
- It is executed only on UNIX flavors
UNIX flavors
- AIX
- Linux
- HP-UX
- Sun Solaris
→ Due to the combination with Orchestrate, DataStage acquired parallel capabilities
- Version 6
- Versions 6 – 7.5.1
- Parallel environment
Ascential DataStage PX (Parallel Extender)
It can be configured only on UNIX flavors.
→ Up to version 7.5.1, server components are configured only on UNIX flavors.
December 2004
- 7.5x2 ASDSPX + MKS Toolkit
ASDSPX → Ascential DataStage PX
MKS Toolkit → creates a virtual environment (like UNIX) in Windows XP to run DataStage
- So, the MKS Toolkit makes it possible to run DataStage on Windows.
- 7.5x2 ASDSPX + MKS Toolkit can perform only data transformation
→ MKS Toolkit → Ascential Suite components
Release
(a) ProfileStage
(b) QualityStage
(c) AuditStage
(d) MetaStage
(e) DataStage PX
(f) DataStage TX as software
2005
→ IBM acquired all of Ascential → IBM DataStage Enterprise Edition 7.5x2 → (used by 50% of users)
2006
→ IBM WebSphere DataStage & QualityStage 8.0.1 → IDE (Integrated Environment) → (used by 40% of users)
Integrated environment of:
(a) ProfileStage
(b) QualityStage
(c) AuditStage
(d) MetaStage
(e) DataStage PX
2009
→ IBM InfoSphere DataStage & QualityStage 8.0.1 → improved web services, and the server has changed → (used by 10% of users)
Features of Data Stage
- Any to Any
- Platform Independent
- Node Configuration
- Partition Parallelism
- Pipeline Parallelism
Any to Any
DataStage reads data from any source and loads it into any target.
Any source ↔ any target
Platform Independent
A job designed for one OS can be executed on another.
→ A platform can generally be either software or hardware.
→ In DataStage, platform is with respect to hardware.
Hardware environments
Uniprocessing environment
Hard disk → CPU → RAM
Symmetric Multi-Processing (SMP)
Can have 32–64 CPUs, that is, a hard disk with multiple CPUs
Massively Parallel Processing (MPP)
- A collection of different SMPs
Node Configuration
- The best feature of DataStage
- It is a technique for creating logical CPUs
Node → logical CPU (or) instance of a (physical) CPU
→ It is software that creates virtual CPUs
- DataStage is executed on logical CPUs
- To run a job in DataStage, at least one node is required.
Example: ETL
Uniprocessing
Hard disk → CPU → RAM
- To access 1000 records, it takes 10 minutes.
SMP
- To access 1000 records with 4 CPUs, it takes 2.5 minutes.
Node configuration:
- Uniprocessing → virtual SMP
The OS does not use the maximum capabilities of the CPU, so node configuration is software that divides the CPU into different nodes. That is, it boosts the capabilities and utilization level of the CPU.
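The arithmetic in the example above (10 minutes on one CPU, 2.5 minutes on four) can be sketched as a simple division. This is only an idealized model: it assumes the work splits perfectly evenly across nodes with no coordination overhead, which real jobs rarely achieve. The function name `ideal_elapsed_minutes` is a hypothetical helper for illustration, not part of DataStage.

```python
def ideal_elapsed_minutes(single_node_minutes: float, node_count: int) -> float:
    """Elapsed time under perfect linear scaling (an assumption, not a guarantee)."""
    if node_count < 1:
        # The text states that a job needs at least one node to run.
        raise ValueError("a job needs at least one node")
    return single_node_minutes / node_count


# 1000 records take 10 minutes on a single node in the example above.
for nodes in (1, 2, 4):
    print(nodes, "node(s):", ideal_elapsed_minutes(10, nodes), "minutes")
```

With four nodes the model gives 2.5 minutes, matching the SMP figure in the example.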
Partition Parallelism
→ Horizontal combining
→ Combining primary rows with secondary rows with respect to key column values
Partitioning
It is a technique of distributing records across the nodes, based on a partitioning technique.
- In addition to the eight partitioning techniques, there is a ninth, known as ‘Auto’.
Note:
- Partitioning techniques play an important role in performance tuning.
- A key-based technique ensures that rows with the same key column values are collected in the same partition.
Example:
EMP
DNO = key column (primary key of the department table)
ENO | EName | DNO |
11 | a | 10 |
12 | b | 20 |
13 | c | 10 |
14 | d | 30 |
15 | e | 20 |
DEPT
DNO | DName | Loc |
10 | ACE | Hyd |
20 | Meter | Sec |
30 | Sales | Eng |
When the two tables are combined using a horizontal combine, rows with the same key column values are collected in the same partition.
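A minimal sketch of key-based partitioning, using the employee and department rows from the example above. It assumes a modulus-style rule (key value modulo node count) and three nodes, both chosen only for illustration. The function names are hypothetical and this is not how DataStage is implemented internally. The point is that both inputs are partitioned on DNO, so each node can combine its own rows without looking at any other node.

```python
from collections import defaultdict

# Rows copied from the example tables above.
EMP = [(11, "a", 10), (12, "b", 20), (13, "c", 10), (14, "d", 30), (15, "e", 20)]
DEPT = [(10, "ACE", "Hyd"), (20, "Meter", "Sec"), (30, "Sales", "Eng")]


def partition_by_key(rows, key_index, node_count):
    """Send each row to node (key % node_count); equal keys share a node."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index] % node_count].append(row)
    return partitions


def join_per_node(emp_parts, dept_parts, node_count):
    """Combine EMP and DEPT rows node by node, using only local data."""
    joined = []
    for node in range(node_count):
        dept_lookup = {dept[0]: dept for dept in dept_parts.get(node, [])}
        for eno, ename, dno in emp_parts.get(node, []):
            if dno in dept_lookup:
                _, dname, loc = dept_lookup[dno]
                joined.append((eno, ename, dno, dname, loc))
    return joined


NODES = 3  # assumed node count for this sketch
emp_parts = partition_by_key(EMP, key_index=2, node_count=NODES)
dept_parts = partition_by_key(DEPT, key_index=0, node_count=NODES)
result = join_per_node(emp_parts, dept_parts, NODES)
for row in sorted(result):
    print(row)
```

If the two inputs were partitioned on different columns, an employee and its department could land on different nodes and the local combine would miss the match, which is why key-based partitioning matters here.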
Repartitioning
The partitioned data is partitioned once again.
Ex:
EName | Dno | Loc |
A | 10 | AP |
B | 20 | TN |
C | 10 | TN |
D | 20 | KN |
E | 30 | TN |
F | 10 | KN |
G | 20 | AP |
- Partitioning and repartitioning are automatic processes in DataStage.
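To illustrate the idea with the rows from the table above, the sketch below first partitions on Dno and then redistributes the same rows on Loc. The two-node setup, the CRC32-based hash, and the helper names are all assumptions made for this example; DataStage performs this step itself, as noted above.

```python
import zlib

# Rows copied from the example table above: (EName, Dno, Loc).
ROWS = [
    ("A", 10, "AP"), ("B", 20, "TN"), ("C", 10, "TN"), ("D", 20, "KN"),
    ("E", 30, "TN"), ("F", 10, "KN"), ("G", 20, "AP"),
]


def stable_hash(value) -> int:
    """Deterministic hash (Python's built-in hash() of str varies per process)."""
    return zlib.crc32(str(value).encode("utf-8"))


def partition(rows, key_index, node_count):
    """Distribute rows so that equal key values land on the same node."""
    parts = [[] for _ in range(node_count)]
    for row in rows:
        parts[stable_hash(row[key_index]) % node_count].append(row)
    return parts


def repartition(parts, key_index, node_count):
    """Redistribute already-partitioned rows on a different key column."""
    all_rows = [row for part in parts for row in part]
    return partition(all_rows, key_index, node_count)


by_dno = partition(ROWS, key_index=1, node_count=2)    # first pass: key is Dno
by_loc = repartition(by_dno, key_index=2, node_count=2)  # second pass: key is Loc

for node, part in enumerate(by_loc):
    print("node", node, part)
```

After the second pass every row with the same Loc sits on the same node, even though rows with the same Dno may now be spread across nodes.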
Reverse Partitioning
- Reverse Partitioning is collecting the data from the nodes.
- It happens in only one situation: going from parallel to sequential.
Reverse partitioning is also called collecting.
Different Collecting Methods
- Ordered
- Round Robin
- Sort – Merge
- Auto
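The first three collecting methods listed above can be sketched in a few lines. This is a simplified model that assumes each partition is a plain Python list and, for sort-merge, that each partition is already sorted. The function names are hypothetical, and Auto is left out because it simply lets the tool choose a method.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # marks exhausted partitions during round robin


def collect_ordered(partitions):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def collect_round_robin(partitions):
    """Read one row from each partition in turn, skipping exhausted ones."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


def collect_sort_merge(partitions):
    """Merge partitions that are each already sorted into one sorted stream."""
    return list(heapq.merge(*partitions))


partitions = [[3, 8], [1, 9, 10], [5]]
print(collect_ordered(partitions))      # [3, 8, 1, 9, 10, 5]
print(collect_round_robin(partitions))  # [3, 1, 5, 8, 9, 10]
print(collect_sort_merge(partitions))   # [1, 3, 5, 8, 9, 10]
```

Only sort-merge preserves a global sort order, which is why it requires each input partition to be sorted on the same key.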
Pipeline Parallelism
Extraction, transformation, and loading are performed simultaneously.
Pipe link
A channel through which data moves from one stage to another stage
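A rough way to picture rows flowing through linked stages is a chain of Python generators, where each row is handed to the next stage as soon as it is produced rather than after the whole input has been read. This is an analogy only: generators interleave the stages in one thread and do not run them truly concurrently the way a parallel engine would. The stage functions and the `log` list are hypothetical names for this sketch.

```python
def extract(rows, log):
    """Source stage: emit rows one at a time."""
    for row in rows:
        log.append(f"extract {row}")
        yield row


def transform(rows, log):
    """Processing stage: consume a row as soon as it arrives and pass it on."""
    for row in rows:
        log.append(f"transform {row}")
        yield row * 10


def load(rows, log):
    """Target stage: write each transformed row as it arrives."""
    target = []
    for row in rows:
        log.append(f"load {row}")
        target.append(row)
    return target


log = []
target = load(transform(extract([1, 2, 3], log), log), log)
print(target)
print(log)
```

The log shows the first row being loaded before the second row is even extracted, which is the record-at-a-time behavior that distinguishes a pipeline from batch processing.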
Traditional Batch Processing
(Server jobs)
Sequential processing
Example: Suppose we have 3 instructions
I1 – Fetch (F), Decode (D), Execute (E), Write back (W)
I2 – F, D, E, W
I3 – F, D, E, W
→ In sequential processing, each instruction completes all four phases before the next one starts.
Parallel Processing
Running all transactions in parallel
   | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 |
I1 | F  | D  | E  | W  |    |    |    |    |
I2 |    | F  | D  | E  | W  |    |    |    |
I3 |    |    | F  | D  | E  | W  |    |    |
I4 |    |    |    | F  | D  | E  | W  |    |
I5 |    |    |    |    | F  | D  | E  | W  |
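The table above has five rows spread over eight time slots, which matches the usual pipelining arithmetic: with k phases and n instructions, sequential execution needs n × k time slots, while an ideal pipeline needs n + k − 1. The sketch below computes both and prints the staggered schedule. It assumes every phase takes exactly one time slot and that there are no stalls; the function names are illustrative.

```python
STAGES = ["F", "D", "E", "W"]  # the four phases used in the example above


def sequential_cycles(instructions: int, stages: int = len(STAGES)) -> int:
    """Each instruction finishes all phases before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int = len(STAGES)) -> int:
    """A new instruction starts every time slot once the pipeline is running."""
    if instructions == 0:
        return 0
    return instructions + stages - 1


def pipeline_schedule(instructions: int):
    """Build one row per instruction; row i is shifted right by i time slots."""
    total = pipelined_cycles(instructions)
    rows = []
    for i in range(instructions):
        row = [""] * total
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage
        rows.append(row)
    return rows


print("sequential:", sequential_cycles(5), "time slots")
print("pipelined: ", pipelined_cycles(5), "time slots")
for row in pipeline_schedule(5):
    print(" | ".join(cell or " " for cell in row))
```

For five instructions this gives 20 time slots sequentially versus 8 when pipelined.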
The core differences between versions 7.5x2 and 8.0.1 of DataStage
7.5x2 | 8.0.1 |
4 client components | 5 client components |
OS-dependent (OS users become DataStage users) | OS-independent (users can be created in DataStage, with a one-time OS dependency) |
File-based repository (folder) | Database repository (default is DB2) |
No web-based administration | Web-based administration |
 | 5 architecture components |
Can perform phases 3, 4 | Can perform phases 1, 2, 3, 4 |
2-tier | N-tier |
Note:
1. The features of Manager in 7.5x2 are integrated into Designer in 8.0.1.
2. (a) In 7.5x2, the user IDs used to log in for authentication are created in the OS; OS users become DataStage users.
(b) In 8.0.1, they are created in the DataStage environment.
3. Repository
In 7.5x2, everything is stored in a folder in the form of files.
In 8.0.1, data is organized in 2 layers:
- Global repository → database → more security
- Local repository → folder → performance
- In 8.0.1, the admin can work from home, that is, by using the Web Console component.
4. (a) In 7.5x2, it is 2-tier:
S → server
C → client machine
(b) In 8, we can have multiple servers/engines but only 1 repository.
R – C1, C – C2, E1 – C3, E2 – C4, E3 – C4, ..., En – Cn → n-tier: components can be configured on n machines.
Client components of 7.5x2 and 8.0.1
7.5x2
Designer
- Create jobs: Mainframe jobs (MF), Server jobs (SJ), Sequence jobs (SJ), Parallel jobs (PJ)
- Compile
- Run
- Multiple job Compile
Director
- Views
- Jobs
- Status
- Logs
- Monitor
- Batch Jobs
- Unlock jobs
- Message handling
- Schedule jobs
Manager
- Import / Export
- Node Configuration
Admin
- Create projects
- Delete projects
- Organize projects
8.0.1
Designer
- Create jobs: Mainframe jobs (MF), Server jobs (SJ), Sequence jobs (SJ), Parallel jobs (PJ)
- Compile
- Run
- Multiple job Compile
- Import / Export
- Node Configuration
- Advanced Find
- Performance Analysis
- Estimate Resource
Director
- Views
- Jobs
- Status
- Logs
- Monitor
- Batch Jobs
- Unlock jobs
- Message handling
- Schedule jobs
Admin
- Create projects
- Delete projects
- Organize projects
Web Console
- Security Services
- Reporting Services
- Logging Services
- Scheduling Services
- Domain Management
- Session Management
Information Analyzer / Console for IBM Information Server
- Data profiling (CA, PA, FA, Baseline, Cross-domain)