Introduction to DataStage
Welcome to DataStage Tutorials. The objective of these tutorials is to gain an understanding of the IBM DataStage Tool. In these tutorials, we will cover topics such as DataStage Architecture, Job Sequencing in DataStage, Containers & Joins in DataStage, etc.
- Introduction to DataStage
- Partitioning Technique in DataStage
- Staging Area in DataStage
- Viewing Scheduled Jobs in DataStage
- Scheduling Jobs in DataStage
- Head Tail and Sample in DataStage
- Configuring Oracle Database in DataStage
- String Functions in DataStage
- Project Architecture in DataStage
- Keyless Technique in DataStage
DataStage has three types of jobs: server jobs, parallel jobs, and sequence jobs.
New Features In DataStage
DataStage has continued to enhance its capabilities for managing data quality and data integration solutions. DataStage 8.0 introduced many new features that make project development and maintenance easier. These enhancements include data quality management, new connectivity methods, and support for implementing slowly changing dimensions.
What is IBM Information Server?
IBM Information Server consists of the following components: WebSphere DataStage and QualityStage, WebSphere Information Analyzer, Federation Server, and Business Glossary, along with common administration, logging and reporting. These components are designed to provide more efficient ways to manage metadata and develop ETL solutions. Components can be deployed based on client need.
Top Ten Features
1. The Metadata Server
With the Hawk release, DataStage introduced common administration, logging and reporting, which improves the metadata reporting available compared to prior releases.
2. Quality Stage
Data quality is critical for data integration projects. In earlier releases, separate products such as MetaStage and QualityStage added a lot of overhead in installation, training and implementation. With the new release of QualityStage, integration projects that use standardization, matching and survivorship to improve quality are more accessible and easier to build. Developers can also design jobs with data transformation stages and data quality stages in the same session. Reflecting this, the Designer is called the DataStage and QualityStage Designer in the current release.
3. Frictionless Connectivity and Connection Objects
Managing connectivity information and propagating it between environments added development and maintenance overhead. In earlier releases, the development team could spend considerable time resolving database connectivity issues. DataStage 8 helps by providing frictionless connectivity and connection objects, which make it easier to connect to remote databases, ensure reusability and reduce the risk of data issues caused by wrong connectivity information.
4. Parallel job range lookup
It is useful to have different options for accessing lookup data, and looking up over a range can improve performance when the data falls into known ranges. Range lookup has been merged into the existing lookup form and is easy to use.
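As a rough illustration of what a range lookup does, the Python sketch below matches a source value against reference rows that each carry a lower and an upper bound. The band table, its labels, and the `range_lookup` helper are hypothetical and are not DataStage APIs; in DataStage the equivalent behaviour is configured in the Lookup stage rather than coded.

```python
from bisect import bisect_right

# Hypothetical reference data: (low, high, label), sorted by low, non-overlapping.
BANDS = [
    (0, 999, "small"),
    (1000, 9999, "medium"),
    (10000, 99999, "large"),
]
LOWS = [low for low, _, _ in BANDS]


def range_lookup(value):
    """Return the label whose [low, high] range contains value, else None."""
    # Find the last band whose lower bound is <= value.
    idx = bisect_right(LOWS, value) - 1
    if idx < 0:
        return None
    low, high, label = BANDS[idx]
    # Confirm the value does not fall past the band's upper bound.
    return label if low <= value <= high else None


orders = [250, 1000, 50000, 100000]
print([range_lookup(v) for v in orders])
```

The last value falls outside every band, so it has no match; in a job, a row like this would typically be handled by the lookup failure option.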
5. Slowly Changing Dimension Stage
Data warehouse developers previously had to develop complex jobs to implement slowly changing dimensions. DataStage 8 makes this easier with three enhancements: surrogate key generation, the Slowly Changing Dimension stage, and updates passed to in-memory lookups. That's it for me with DBMS-generated keys; I'm only doing the keys in the ETL job from now on! DataStage server jobs have the hash file lookup, where you can read and write to it at the same time; parallel jobs will have the updateable lookup.
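To make the idea concrete, here is a minimal Python sketch of type 2 slowly changing dimension logic with ETL-generated surrogate keys and an in-memory lookup that is updated as rows are processed. It is only a conceptual model, not how the stage is implemented; the column names (`sk`, `cust_id`, `city`) and the `apply_scd2` helper are assumptions made for the example.

```python
from itertools import count


def apply_scd2(dimension, incoming, key_gen, load_date):
    """Apply type 2 changes: expire the current row and add a new version."""
    # In-memory lookup of the current version of each business key.
    current = {row["cust_id"]: row for row in dimension if row["is_current"]}
    for rec in incoming:
        existing = current.get(rec["cust_id"])
        if existing is not None and existing["city"] == rec["city"]:
            continue  # attribute unchanged, nothing to do
        if existing is not None:
            # Close off the old version instead of overwriting it.
            existing["is_current"] = False
            existing["end_date"] = load_date
        new_row = {
            "sk": next(key_gen),  # surrogate key generated in the ETL job
            "cust_id": rec["cust_id"],
            "city": rec["city"],
            "start_date": load_date,
            "end_date": None,
            "is_current": True,
        }
        dimension.append(new_row)
        current[rec["cust_id"]] = new_row  # keep the lookup up to date
    return dimension


dim = [{"sk": 1, "cust_id": "C1", "city": "Leeds",
        "start_date": "2020-01-01", "end_date": None, "is_current": True}]
keys = count(start=2)
apply_scd2(dim, [{"cust_id": "C1", "city": "York"},
                 {"cust_id": "C2", "city": "Bath"}], keys, "2024-06-01")
for row in dim:
    print(row)
```

A type 1 change would simply overwrite `city` on the existing row, without adding a new version.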
6. Read-only access to locked jobs
This new feature allows developers to open a job that is already open in another developer's session. The copy opened this way is read-only. This reduces the time developers spend waiting when a job is locked by another user. New enhancements also let you unlock a job associated with a disconnected session from the web console more easily than in prior releases.
7. Session Disconnection
With this feature, an administrator can disconnect sessions and unlock jobs.
8. Improved SQL Builder
This feature reduces the effort spent synchronizing the SQL SELECT list with the DataStage column list, which helps prevent column mismatches. In addition, the ODBC Connector lets you build complex queries with the GUI, including adding columns and a WHERE clause to the statement.
9. Improved job startup times
With this enhancement, starting a large number of small parallel jobs has less impact on long-running DataStage jobs. Connectivity and resource allocation for parallel jobs have improved, and the load is balanced based on job requirements.
10. Common logging
With this new feature, DataStage has introduced common logging of DataStage job logs, which makes it easier to search the logs. DataStage has also introduced time-based and record-based job monitoring.
Change Data Capture
These are add-on products (at an additional fee) that attach themselves to source databases and perform change data capture. Most source system database owners I've come across don't like you playing with their production transactional database and will not let you near it with a ten-foot pole, but I guess there are exceptions:
- Microsoft SQL Server
- DB2 for z/OS
There are three ways to get incremental feeds on the Information Server: the CDC products for DataStage, the Replication Server (renamed Information Integrator: Replication Edition, does DB2 replication very well) and the change data capture functions within DataStage jobs such as the parallel CDC stage.
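The third option, change capture inside a job, boils down to comparing a "before" data set with an "after" data set on a key. The Python sketch below shows that comparison in its simplest form. The `capture_changes` helper and the change labels are illustrative assumptions, not the actual change codes or interface of the parallel CDC stage.

```python
def capture_changes(before, after):
    """Compare two snapshots keyed by primary key and list the differences."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key, row))
        elif before[key] != row:
            changes.append(("update", key, row))
    for key, row in before.items():
        if key not in after:
            changes.append(("delete", key, row))
    return changes


before = {1: {"name": "Ann", "tier": "gold"}, 2: {"name": "Bob", "tier": "silver"}}
after = {1: {"name": "Ann", "tier": "platinum"}, 3: {"name": "Cy", "tier": "bronze"}}
for change in capture_changes(before, after):
    print(change)
```

Only the changed rows are passed downstream, which is what makes the feed incremental.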
These are the functions that are not in DataStage 8:
- dssearch command line function
- dsjob "-import"
- Version Control tool
- Released jobs
- Oracle 8i native database stages
The loss of the Version Control tool is not a big deal as the import/export functions have been improved. Building a release file as an export in version 8 is easier than building it in the Version Control tool in version 7.
The common connection objects functionality means the very wide range of DataStage database connections is now available across Information Server products.
Latest supported databases for version 8:
- DB2 8.1, 8.2 and 9.1
- Oracle 9i, 10g, 10gR2 (not Oracle 8)
- SQL Server 2005 plus stored procedures.
- Teradata v2r5.1, v2r6.0, v2r6.1 (DB server) / 8.1 (TTU) plus Teradata Parallel Transport (TPT) and stored procedures and macro support, reject links for bulk loads, restart capability for parallel bulk loads.
- Sybase ASE 15, Sybase IQ 11.5, 12.5, 12.7
- Informix 10 (IDS)
- SAS 612, 8.1, 9.1 and 9.1.3
- IBM WS MQ 6.1, WS MB 5.1
- Netezza v3.1
- ODBC 3.5 standard and level 3 compliant
- UniData 6 and UniVerse?
- Red Brick?
Version 8 adds a new stage from the IBM software family, new stages from new partners, and the convergence of QualityStage functions into DataStage. Apart from the SCD stage, these all come at an additional cost.
- WebSphere Federation and Classic Federation
- Netezza Enterprise Stage
- SFTP Enterprise Stage
- iWay Enterprise Stage
- Slowly Changing Dimension: for type 1 and type 2 SCDs.
- Six QualityStage stages
New Functions in Existing Stages
- Complex Flat File Stage: Multi-Format File (MFF) support in addition to existing COBOL file support.
- Surrogate Key Generator: the key source is a new feature of this stage; keys are maintained via an integrated state file or a DBMS sequence.
- Lookup Stage: Range Look-up is a new function equivalent to the SQL BETWEEN operator. Lookup against a range of values was difficult to implement in previous DataStage versions. With this functionality in the Lookup stage, you can easily compare a source column to a range bounded by two lookup columns, or a lookup column to a range bounded by two source columns.
- Transformer Stage: new surrogate key functions Initialize() and GetNextKey().
- Enterprise FTP Stage: you can now choose between FTP and SFTP transfer.
- Secure FTP (SFTP): select this option if you want to transfer files between computers over a secured channel. SFTP uses the SSH (Secure Shell) protected channel for data transfer between computers over a non-secure network such as a TCP/IP network. Before you can use SFTP to transfer files, you should configure the SSH connection without any passphrase for RSA authentication.
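The state-file approach mentioned for the Surrogate Key Generator in the list above can be pictured as a small file that remembers the highest key handed out so far. The following Python sketch is an assumption-laden illustration of that idea only: the file format, the `next_keys` helper and the file name are invented for the example and do not reflect the real state file layout.

```python
import os
import tempfile


def next_keys(state_path, how_many):
    """Reserve a block of surrogate keys and persist the new high-water mark."""
    last = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            last = int(fh.read().strip() or 0)
    keys = list(range(last + 1, last + 1 + how_many))
    # Store the highest key issued so the next run continues from there.
    with open(state_path, "w") as fh:
        fh.write(str(last + how_many))
    return keys


state_file = os.path.join(tempfile.mkdtemp(), "customer_sk.state")
print(next_keys(state_file, 3))  # first run starts the sequence
print(next_keys(state_file, 2))  # a later run continues it
```

Because the high-water mark lives outside the database, keys stay unique across job runs without a DBMS sequence.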
New Database Connector Functions
This is a big area of improvement.
LOB/BLOB/CLOB Data: pictures, documents, etc. of any size can now be moved between databases. The connector can transfer large objects (LOBs) using inline or reference methods. However, a connector is the only stage that supports reference methods, so another connector is needed to transfer the LOB inline later in the job.
Reject Links: the connector has its own reject-handling function, which eliminates the need to add a Modify or Transformer stage for capturing SQL errors or for aborting jobs. You can specify either a number of rejected rows or a percentage of rejected rows as the threshold for terminating the job run.
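The reject behaviour described above can be sketched in Python as a loader that diverts failing rows to a reject list and aborts once a count or percentage threshold is crossed. Everything here is hypothetical (`load_with_rejects`, `write_row`, and the parameter names); the real connector exposes these choices as stage properties, and this sketch simplifies matters by evaluating the percentage after every row.

```python
def load_with_rejects(rows, write_row, max_rejects=None, max_reject_pct=None):
    """Write rows, divert failures to a reject list, abort past a threshold."""
    rejects = []
    for processed, row in enumerate(rows, start=1):
        try:
            write_row(row)
        except ValueError as err:
            # Keep the failing row and the reason, like a reject link would.
            rejects.append((row, str(err)))
        if max_rejects is not None and len(rejects) > max_rejects:
            raise RuntimeError(f"aborted: {len(rejects)} rows rejected")
        if (max_reject_pct is not None
                and 100 * len(rejects) / processed > max_reject_pct):
            raise RuntimeError(f"aborted: reject rate above {max_reject_pct}%")
    return rejects


def write_row(row):
    # Stand-in for a database insert that fails on a bad value.
    if row["qty"] < 0:
        raise ValueError("qty must be non-negative")


rows = [{"qty": 5}, {"qty": -1}, {"qty": 7}]
print(load_with_rejects(rows, write_row, max_rejects=1))
```

With `max_rejects=0` the same input would abort on the second row instead of returning the reject list.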
Schema Reconciliation: Connector has a schema reconciliation function that automatically compares DataStage schemas to external-resource schemas such as a database. Schemas include data types, attributes and field lengths. Based on the reconciliation rules that you specify, runtime errors or extra transformation on mismatched schemas can be avoided.
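As a simple picture of what schema reconciliation checks, the Python sketch below compares a design-time column definition with the definition found in the database and reports mismatches in type, length and presence. The dictionary layout and the `reconcile` helper are assumptions for illustration; the connector's actual reconciliation rules are set through its properties.

```python
def reconcile(design_schema, db_schema):
    """Return a list of mismatches between two {column: (type, length)} maps."""
    issues = []
    for col, (dtype, length) in design_schema.items():
        if col not in db_schema:
            issues.append(f"{col}: missing in database")
            continue
        db_type, db_length = db_schema[col]
        if dtype != db_type:
            issues.append(f"{col}: type {dtype} vs {db_type}")
        elif length > db_length:
            issues.append(f"{col}: length {length} exceeds {db_length}")
    return issues


design = {"id": ("integer", 10), "name": ("varchar", 50), "note": ("varchar", 200)}
database = {"id": ("integer", 10), "name": ("varchar", 30)}
print(reconcile(design, database))
```

Catching these mismatches before the run is what avoids the runtime errors mentioned above.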
Improved SQL Builder that supports more database types.
The connector is the best stage to use for your database because it gives the maximum parallel performance and offers more features compared to a database
Test button: the Test button on connectors allows developers to test database connections without having to view the data or run the job.
Connectors are for accessing external data sources and can be used to read, write, look up and filter data or simply to test the database connectivity during job design.
Drag and drop your configured database connections onto jobs.
Before and after SQL defined per job or per node with a failure handling option. Neater than previous versions.
DataStage 8 gives you access to the latest versions of databases that DataStage 7 may never get. Extra functions on all connectors include improved reject handling, LOB support and easier stage configuration.
Note the database compatibility for the Metadata Server repository is the latest versions of the three DBMS engines. DB2 is an optional extra in the bundle if you don't want to use an existing database.
IBM Information Server does not support the Database Partitioning Feature (DPF) for use in the repository layer. DB2 Restricted Enterprise Edition 9 is included with IBM Information Server and is an optional part of the installation; however, its use is restricted to hosting the IBM Information Server repository layer, and it cannot be used for other applications.
- Oracle 10g
- SQL Server 2005
Different enterprise packs are available in version 8. These packs are:
SAP BW Pack
- BAPI: (Staging Business API) loads from any source to BW.
- OpenHub: extract data from BW.
SAP R/3 Pack
- ABAP: (Advanced Business Application Programming) auto-generate ABAP, Extraction Object Builder, SQL Builder, load and execute ABAP from DataStage, CPI-C data transfer, FTP data transfer, ABAP syntax check, background execution of ABAP.
- IDoc: create source system, IDoc listener for extract, receive IDocs, send IDocs.
- BAPI: BAPI explorer, import-export Tables Parameters Activation, call and commit BAPI.
Siebel Pack
- EIM: (Enterprise Integration Manager) interface tables
- Business Component: access business views via Siebel Java Data Bean
- Direct Access: use a metadata browser to select data to extract
- Hierarchy: for extracts from Siebel to SAP BW.
Oracle Applications Pack
- Oracle flex fields: extract using enhanced processing techniques.
- Oracle reference data structures: simplified access using the Hierarchy Access component.
- Metadata browser and importer
DataStage Pack for PeopleSoft Enterprise
- Import business metadata via a metadata browser.
- Extract data from PeopleSoft tables and trees.
JD Edwards Pack
- Standard ODBC calls
- Pre-joined database tables via business views
These packs can be used by the server and/or parallel jobs to interact with other coding languages. This lets you access programming modules or functions within a job:
- Java Pack: produce or consume rows for DataStage parallel or server jobs using a Java transformer.
- Web Service Pack: Access web services operations in a Server job transformer or Server routine.
- XML Pack: Read, write or transform XML files in parallel or server jobs.
The DataStage stages, custom stages, transformer functions, and routines will usually be faster at transforming data than these packs; however, the packs are useful for reusing existing code.
Database OPEN and CLOSE Commands
The native parallel database stages provide options for specifying OPEN and CLOSE commands. These options allow commands (including SQL) to be sent to the database before (OPEN) or after (CLOSE) all rows are read/written/loaded to the database. OPEN and CLOSE are not offered by plug-in database stages.
For example, the OPEN command could be used to create a temporary table, and the CLOSE command could be used to select all rows from the temporary table and insert into a final target table.
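The temporary-table pattern just described can be mimicked in Python with the standard `sqlite3` module. This is only an analogy for the order of operations (setup statement, load, completion statement); the table names and the `load_with_open_close` helper are made up for the example, and the SQL you would place in a stage's OPEN and CLOSE properties depends on your database.

```python
import sqlite3


def load_with_open_close(conn, rows):
    """Run a setup statement, load rows, then run a completion statement."""
    # "OPEN": runs once before any row is written.
    conn.execute("CREATE TEMP TABLE stage_orders (id INTEGER, amount REAL)")
    # Main load into the temporary table.
    conn.executemany("INSERT INTO stage_orders VALUES (?, ?)", rows)
    # "CLOSE": runs once after all rows are written.
    conn.execute("INSERT INTO orders SELECT id, amount FROM stage_orders")
    conn.execute("DROP TABLE stage_orders")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
load_with_open_close(conn, [(1, 9.5), (2, 20.0)])
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

The target table only ever sees complete batches, because rows reach it in a single statement after the load has finished.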
As another example, the OPEN command can be used to create a target table, including database-specific options (tablespace, logging, constraints, etc.) not possible with the “Create” option. In general, don’t let EE generate target tables unless they are used for temporary storage. There are few options for specifying Create table options, and doing so may violate data-management (DBA) policies.
It is important to understand the implications of specifying a user-defined OPEN and CLOSE command. For example, when reading from DB2, a default OPEN statement places a shared lock on the source. When specifying a user-defined OPEN command, this lock is not sent – and should be specified explicitly if appropriate.
Further details are outlined in the respective database sections of the Orchestrate Operators Reference which is part of the Orchestrate OEM documentation.
DataStage Designer
DataStage Designer is used to design ETL jobs. Some of the functions it provides are listed below:
- Create DS jobs
- Create and use parameters within jobs
- Insert and link stages
- Configure stage and job properties
- Load and save table definitions
- Save and compile DS jobs
- Run jobs
Logging In to DS Designer
The ‘Attach to Project’ window is used to log in to DS.
DS Log On Window
Note: Do not use the ‘Omit’ option while working in the UNIX environment. This option is used for Windows authentication and should not be used when DS is run on UNIX.
The DataStage Job
Starting DataStage
The screen below displays when the user successfully logs in.
DS Job Selection
Select ‘New Parallel Job’ from the new job window. Note: Options to choose from ‘Existing’ jobs or from ‘Recent’ jobs are available from the tabs of the same name.
DataStage EE Canvas
A typical DS Enterprise Edition canvas looks like the example below.
DS Canvas: Typical DataStage Parallel Job
DS Stages and Usage
DataStage stages are divided into two categories:
1. Active Stages
2. Passive Stages
Active Stages: Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.
Ex: Transformer, Aggregator, Sort, Remove Duplicates, Switch, etc.
Passive Stages: A passive stage handles access to databases for the extraction or writing of data.
Ex: Sequential File, File Set, Data Set, DB2, Oracle, Hashed File stages
The look and feel of the DataStage and QualityStage canvas remains the same, but the new functions are major enhancements over the previous version. Data Connection Object, Parameter Set, Range Look-up and Slowly Changing Dimension are all designed to simplify design, cut implementation effort and reduce cost. Advanced Find provides a good way to do impact analysis, an important step in project management. Resource Estimation is equally important for project planning. Meanwhile, the Performance Analysis tool is another useful feature that can be used throughout the lifecycle of a job. By knowing what causes a performance bottleneck, production support groups can better cope with ever-shrinking batch windows.
While Advanced Find will not perform a Replace function and SQL Builder will not let us build complex SQL, all the changes in version 8 have a positive impact on job development, production support and project management. Combined with the features offered in Information Server, existing customers who are looking to upgrade and new DataStage clients will both benefit from the new enhancements.