Introduction to Datastage

DataStage is an ETL tool that uses graphical notation to build data integration solutions. It is available in three editions: SE (Server Edition), EE (Enterprise Edition), and MVS (Multiple Virtual Storage). A DataStage job role calls for knowledge of data warehousing, ETL, data storage, configuration, job design, and the various stages and modules in DataStage. DataStage is used to integrate multiple systems and to process high volumes of data, and its popularity stems from its user-friendly graphical front end for designing jobs.

The following write-up is intended to familiarize you with DataStage interview questions and answers. It will help you clear foundation-level interviews for DataStage-related jobs, and the real-time questions will help you get shortlisted for a position.

DataStage Interview Questions

Let’s begin with IBM DataStage Interview questions and answers.

Q1) What is DataStage?

Ans: DataStage is a tool developed by IBM for designing, developing, and executing applications that populate the tables of a data warehouse or data mart. It is software for Windows servers that extracts data from source databases and loads it into data warehouses. DataStage is an important part of the IBM WebSphere Data Integration Suite.

Q2) What are the types of stages?


  • General Objects

  • Data Quality stages

  • Development and Debug stages

  • Database connectors

  • Restructure stages

  • Real-time stages

  • Sequence activities

Q3) What are the components of DataStage?

Ans: DataStage has a number of client and server components. It has four main components, namely:

  • Datastage Designer

  • Datastage Director

  • Datastage Manager

  • Datastage Administrator

Q4) Explain a few features of DataStage.


  • Extracts data from any number and type of databases.

  • Handles all the metadata definitions required to define your data warehouse.

  • Lets you view and modify table definitions at any point during the design of your application.

  • Aggregates data.

  • Lets you modify the SQL SELECT statements used to extract data.

  • Transforms data easily, with a set of predefined transforms and functions.

  • Lets you convert your data, and you can extend the functionality by defining your own transforms.

  • Loads the data warehouse.

Q5) What are the jobs available in DataStage?


  • Server job

  • Parallel job

  • Sequencer job

  • Container job

Q6) State the difference between Server job and parallel job

Ans: Server jobs process data sequentially, while parallel jobs process data in parallel for I/O processing (the Parallel Extender works on the principle of pipeline and partition parallelism).

Q7) Where is the DataStage repository stored?

Ans: DataStage stores its repository in an IBM UniVerse database.

Q8) Who are the DataStage clients or users?

Ans: The DataStage tool can be used by:

  • DataStage Administrator

  • DataStage Designer

  • DataStage Manager

  • DataStage Director

Q9) Explain Metadata.

Ans: Data about data is known as metadata. For example, a table definition describing the structure of a table is metadata.

Datastage Technical Interview Questions

Q10) State maps and locales


  • Maps: Maps define the character sets that can be used in a project.
  • Locales: Locales define the local formats for dates, times, sorting order, etc. that can be used in a project.

Q11) Differentiate between DataStage and Informatica.


  • DataStage supports parallel processing which Informatica doesn’t.

  • Links are objects in DataStage, but in Informatica it is port-to-port connectivity.

  • In Informatica, it is easy to implement Slowly Changing Dimensions which is a little bit complex in DataStage.

  • DataStage doesn’t support complete error handling.

Q12) State the types of server components.


  • Repository

  • DataStage Server

  • DataStage Package Installer

Q13) Define UniVerse stage

Ans: A UniVerse stage extracts data from or loads data into a UniVerse database using SQL. It is used to represent a data source, an aggregation step, or a target data table.

Q14) What is a dynamic array in DataStage?

Ans: Dynamic arrays map the structure of DataStage file records to character string data. Any character string can be a dynamic array. A dynamic array is a character string containing elements that are substrings separated by delimiters.
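
A dynamic array can be sketched in plain Python by treating it as a delimited string. The delimiter characters below mirror DataStage's field and value marks, but the code is illustrative, not a DataStage API:

```python
# A dynamic array is one string whose elements are substrings
# separated by system delimiters: field marks at the top level,
# value marks within a field.
FM = chr(254)  # field mark
VM = chr(253)  # value mark

record = FM.join(["Smith", VM.join(["order1", "order2"]), "London"])

fields = record.split(FM)     # top-level fields
orders = fields[1].split(VM)  # values inside the second field
# orders -> ['order1', 'order2']
```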

Q15) What are the components of Ascential Datastage?


  • Client Components: Administrator, Director, Manager, and Designer.
  • Server Components: Repository, Server, and Plugins.

Q16) What are the two types of containers used in DataStage?

Ans: DataStage has two types of containers - Local Container and Shared Container


Q17) What is the hash file stage in DataStage?

Ans: The hash file stage uses a binary hashed file for lookups; keyed access gives better performance than scanning a file sequentially.

Q18) State stage variables

Ans: Stage variables are temporary variables created in the Transformer stage for intermediate calculations.

Q19) How can you convert columns to rows in DataStage?

Ans: Columns can be converted into rows in DataStage using the Pivot stage.
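
Conceptually, the Pivot stage turns a set of repeating columns into separate rows. A minimal Python sketch of that behavior (the column names are invented for illustration):

```python
# Each input row (id, q1, q2, q3) becomes one output row per
# quarter column, mimicking a horizontal pivot of columns to rows.
rows = [{"id": 1, "q1": 100, "q2": 110, "q3": 90}]

pivoted = [
    {"id": r["id"], "quarter": q, "sales": r[q]}
    for r in rows
    for q in ("q1", "q2", "q3")
]
# pivoted -> three rows for id 1, one per quarter
```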

Q20) What is the merge stage?

Ans: The Merge stage combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record.
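
The Merge stage's behavior can be sketched in Python: match update records to master records on a key, then widen each master row with any extra update columns (the data and key name are invented for illustration):

```python
# Merge a sorted master data set with an update data set on "id":
# the output keeps every master column plus extra update columns.
master = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
updates = [{"id": 1, "city": "Oslo"}]

upd_by_key = {u["id"]: u for u in updates}
merged = [{**m, **upd_by_key.get(m["id"], {})} for m in master]
# merged -> [{'id': 1, 'name': 'Ann', 'city': 'Oslo'},
#            {'id': 2, 'name': 'Bob'}]
```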

Q21) State the term repository.

Ans: A repository is a central store that contains all the information required to build a data mart or data warehouse.

Q22) What is the work of a DataStage server?

Ans: A DataStage server runs executable jobs that extract, transform, and load data into a data warehouse.

Q23) State a few advantages of DataStage.

Ans: A data warehousing strategy has the following advantages :

  • Capitalizes on the potential value of the organization’s information.

  • Improves the quality and accessibility of data.

  • Combines valuable archive data with the latest data in operational sources.

  • Increases the amount of information available to users.

  • Reduces the requirement of users to access operational data.

  • Reduces the strain on IT departments, as they can produce one database to serve all user groups.

  • Allows new reports and studies to be introduced without disrupting operational systems.

  • Promotes users to be self-sufficient.

Q24) We have two containers in DataStage – local and shared containers, what does it mean?

Ans: Local Container is local to the particular job in which we have developed the container and Shared Container can be used in any other jobs.

Q25) I have stored my data at orabulk stage, what does it mean?

Ans: The OraBulk stage is used to bulk-load an Oracle target database. Storing data at the OraBulk stage means your data is loaded into the Oracle target database.

Q26) Define Active and Passive stages.

Ans: A passive stage is used for data extraction and loading.

An active stage is used to implement and process business rules.


Q27) What do you mean by data repository?

Ans: MetaData is data about the data. It contains-

  • Query statistics

  • ETL statistics

  • Business subject area

  • Source information

  • Target information

  • Source to target mapping Information

Q28) What is meant by the join stage and merge stage?

Ans: The join stage performs join operations on two or more data sets input to the stage and then outputs the resulting data set.

The merge stage combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any required additional columns from each update record.

Q29) What is the default cache size in DataStage?

Ans: The default cache size in DataStage is 256 MB.

Scenario-based DataStage Interview Questions

Q30) Can you reuse the components? If yes, then how?

Ans: The components in DataStage can be reused by using shared and local containers.

Q31) How can we eliminate duplicate rows?

Ans: DataStage provides us with the capability to eliminate duplicate rows in EE (Enterprise Edition) only. Using EE of DataStage, the elimination of duplicate rows can be done based on a key column.
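
The key-based de-duplication performed here can be sketched as keeping the first record seen for each key value (data and column names invented for illustration):

```python
# Keep the first record for each value of the key column "id",
# as duplicate removal on key-sorted input does.
rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]

seen, unique = set(), []
for r in rows:
    if r["id"] not in seen:
        seen.add(r["id"])
        unique.append(r)
# unique -> [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'c'}]
```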

Q32) What command line functions are used to import and export the DataStage jobs?

Ans: dsimport.exe command-line function is used to import the DataStage components

dsexport.exe command-line function is used to export the DataStage components

Q33) Did you use conditional scheduling in your project? What is the use of it?

Ans: Yes, I used conditional scheduling in my project. Conditional scheduling can be implemented using a Sequencer job.

Q34) What can we do with the DataStage Director?

Ans: Validating, scheduling, executing, and monitoring jobs (server jobs).

Q35) What is a Hash file and a Sequential file?

Ans: A hash file stores data using a hash algorithm and a hash key value, while a sequential file has no key value for storing data. Because of the hash key, searching a hash file is faster than searching a sequential file.
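
The performance difference can be sketched in Python: a dict models hashed, keyed access, while a list models a sequential file that must be scanned row by row (the key format is invented):

```python
# Hashed lookup vs sequential scan over the same 100,000 records.
records = [("K%05d" % i, i) for i in range(100_000)]

hashed = dict(records)     # hash-file style: constant-time keyed access
value = hashed["K09999"]   # direct lookup by key

# sequential-file style: scan rows until the key matches
value_seq = next(v for k, v in records if k == "K09999")
# value and value_seq are both 9999
```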

Q36) What is meant by NLS in DataStage?

Ans: NLS stands for National Language Support.

Q37) Explain the integrity stage?

Ans: The integrity stage, also known as the quality stage, assists in integrating different types of data from various sources.

Q38) Describe the Oconv() and Iconv() functions in DataStage?

Ans: In DataStage, the OConv() and IConv() functions are used to convert data from one format to another, for example conversions of time, date, radix, Roman numerals, and ASCII values. IConv() is mostly used to convert formats for the system to understand, while OConv() is used to convert formats for users to understand.
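
The Iconv/Oconv idea can be mirrored with Python's datetime module: parse an external date string into an internal value ("Iconv") and format the internal value back for display ("Oconv"). The formats below only mirror the concept; real DataStage conversion codes such as "D/YMD" are not modeled here.

```python
from datetime import date, datetime

def iconv_date(text: str) -> date:
    """External -> internal: parse a display-format date string."""
    return datetime.strptime(text, "%d/%m/%Y").date()

def oconv_date(d: date) -> str:
    """Internal -> external: format the date for users."""
    return d.strftime("%d %b %Y")

d = iconv_date("25/12/2024")
# oconv_date(d) -> '25 Dec 2024'
```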

Q39) Explain the BCP stage?

Ans: The BCP stage is used to bulk-load large amounts of data into a single target table in Microsoft SQL Server.

Q40) What is the need for a link partitioner and link collector in DataStage?

Ans: In DataStage, the Link Partitioner is used to split data into multiple partitions using a chosen partitioning method. The Link Collector is used to gather data from many partitions back into a single stream and save it in the target table.
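
The split/gather flow can be sketched in Python: hash rows into N partitions (the partitioner role) and then flatten the partitions back into one stream (the collector role). The modulo-hash and row shape are illustrative only.

```python
# Partition 10 rows into 4 partitions by key, then collect them.
N = 4
rows = [{"id": i} for i in range(10)]

partitions = [[] for _ in range(N)]
for r in rows:                      # partitioner: hash on the key
    partitions[r["id"] % N].append(r)

collected = [r for part in partitions for r in part]  # collector
# every input row comes back out, grouped by partition
```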

Q41) State the difference between DataStage and DataStage TX?

Ans: DataStage is an ETL (Extract, Transform, and Load) tool, while DataStage TX is an EAI (Enterprise Application Integration) tool.

Q42) What is the use of the Surrogate key?

Ans: In DataStage, a surrogate key is used instead of a unique business key. It is mostly used to retrieve data faster and uses an index to perform the retrieval operation. A surrogate key is a 4-byte integer that replaces the transaction/business/OLTP key in the dimension table; a 4-byte integer key can address roughly 2 billion records.

The surrogate key integrates data where the natural primary key may not serve well. It also helps with index maintenance, joins, table size, key updates, disconnected inserts, and partitioning.
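
A surrogate-key generator can be sketched as a lookup that assigns the next sequential integer to each new business key and reuses it afterwards (the function and key names are invented for illustration):

```python
# Map business/OLTP keys to small sequential surrogate integers.
surrogates = {}

def surrogate_key(business_key):
    """Return the existing surrogate, or allocate the next integer."""
    if business_key not in surrogates:
        surrogates[business_key] = len(surrogates) + 1
    return surrogates[business_key]

k1 = surrogate_key("CUST-0042")   # -> 1
k2 = surrogate_key("CUST-0099")   # -> 2
k3 = surrogate_key("CUST-0042")   # -> 1 again, same business key
```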

Q43) How will you declare a constraint in DataStage?

Ans: Constraints are expressions and you can specify a constraint for each output link from a transformer stage. You can also specify that a particular link is to act as an otherwise link and catch those rows that have failed to satisfy the constraints on all other output links.

  • Select an output link and click the constraints button

  • Double-click the output link's constraint entry field

  • Choose Constraints from the background or header shortcut menus

  • A dialog box will appear that allows you either to define constraints for any of the transformer output links or to define a link as an otherwise link

Q44) How to execute a DataStage job from the command line?


  • Using the "dsjob" command line utility. The command is:
  • dsjob -run -jobstatus projectname jobname

Q45) What components are there in DataStage architecture?

Ans: DataStage has three main components:

Client component

  • DataStage Administrator
  • DataStage Manager
  • DataStage Designer
  • DataStage Director

Server component

  • DataStage Engine
  • Metadata Repository
  • Package Installer

DataStage administrator

  • Used to create the project
  • Contains a set of properties

Q46) Explain Types of Fact Tables?


  • Factless Fact: Contains only foreign keys to the dimension tables, with no measures.
  • Additive Fact: Measures can be added across all dimensions.
  • Semi-Additive Fact: Measures can be added across some dimensions but not others, e.g. percentages and discounts.
  • Non-Additive Fact: Measures cannot be added across any dimension, e.g. averages.
  • Conformed Fact: Two fact tables are conformed when they define the same measures in the same way across the same conformed dimensions.

Q47) What are the types of dimension tables?


  • Conformed Dimension: A dimension table connected to more than one fact table, where the granularity defined in the dimension table is common across those fact tables.
  • Junk Dimension: A dimension table that contains only flags.
  • Monster Dimension: A dimension in which changes occur rapidly.
  • Degenerate Dimension: A dimension attribute, such as an invoice or order number, stored in the fact table itself rather than in a separate dimension table.

Q48) What are Macros?

Ans: They are built from DataStage functions and do not require arguments. A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and links and stages belonging to the current job. These can be used in expressions (for example for use in Transformer stages), job control routines, filenames and table names, and before/after subroutines.

Q49) What is Conformed Dimension?

Ans: In DataStage, a conformed dimension is a dimension that has the same meaning to every fact with what it relates. It can exist as a single dimension table that relates to multiple fact tables within the same data warehouse, or as identical dimension tables in separate data marts.

Q50) Difference between Validated Ok and Compiled in DataStage

Ans: Validating a job is all about running the job in ‘Check only’ mode.

The checks performed are:

  • The connections are established to data sources or data warehouse

  • The SELECT statements are prepared

  • Intermediate files are opened in Hashed file, UniVerse or ODBC stages

The compilation is done by:

  • Checking the DataStage engine to determine whether all needed properties are given

  • Upon compilation, DataStage checks the validity of all given properties

Most Frequently Asked Questions in MNC DataStage Interviews

Let us now take a quick look at a few real-time DataStage interview questions asked in various MNC interviews. You will find many of the answers in the questions above, but several are scenario-based questions whose answers depend on your own logic and knowledge of DataStage.

  • Tell me about your current Project?
  • What is the Difference between server jobs and Parallel jobs?
  • How can you capture the Duplicate values if I have Duplicate Records?
  • How do you Remove the Dataset in Unix?
  • What is the difference between Hash file and Lookup File set stages?
  • What is the default padding character?
  • What are the partition techniques available in your last project?
  • What is meant by preserving partitioning mode?
  • What is the default execution order of the sequential file?
  • Can we use a sequential file as a source to hash files? Have you done it? If yes, what error will it give?
  • How many input links can you give to a Transformer stage?
  • Can aggregator and transformer stages be used to sort the data? How?
  • What is the use of Union in Oracle? If I write a query select * from EMP UNION select * from DEPT, is it executed well?
  • What is the difference between Union and Union All?
  • What is the difference between Replace and Truncate?
  • How do you get the system Date in oracle?
  • How do you design PX Routines?
  • What are the stages you used in the Job sequencer?
  • What is the use of Wait-For-File activity?
  • What are the triggers available in sequencer activity?
  • How many Exception handler activities can you call to one sequencer?
  • How can you Execute jobs in the command line?
  • What are the third party tools you used in your project?
  • Explain the situation where you have applied SCD in your project
  • Tell me the syntax of the configuration file?
  • How many Nodes configuration files did you use in your last project?
  • What is the command to see the list of files?
  • What is the Use of ls –ltr Command?
  • How to kill a job in UNIX?
  • What is the command to send mail to the corresponding person when my job got aborted?

Final Words

The write-up above gives you a glance at DataStage production support interview questions. We hope it helps you land your dream job in DataStage!