Performance Tuning in DataStage

Environment Variable Settings

DataStage EE provides a number of environment variables to control how jobs operate on a UNIX system.  In addition to providing required information, environment variables can be used to enable or disable various DataStage features, and to tune performance settings.

Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.

  • $APT_CONFIG_FILE = filepath: Specifies the full pathname to the EE configuration file.
  • $APT_DUMP_SCORE = 1: Outputs the EE score dump to the DataStage job log, providing detailed information about the actual job flow, including operators, processes, and datasets. Extremely useful for understanding how a job actually ran in the environment.
  • $OSH_ECHO = 1: Includes a copy of the generated osh in the job's DataStage log. Starting with v7, this option is enabled when the "Generated OSH visible for Parallel jobs in ALL projects" option is enabled in DataStage Administrator.
  • $APT_RECORD_COUNTS = 1: Outputs record counts to the DataStage job log as each operator completes processing. The count is per operator per partition.
  • $APT_PM_SHOW_PIDS = 1: Places entries in the DataStage job log showing the UNIX process ID (PID) for each process started by a job. Does not report PIDs of DataStage "phantom" processes started by Server shared containers.
  • $APT_BUFFER_MAXIMUM_TIMEOUT = 1: Maximum buffer delay in seconds.
  • $APT_THIN_SCORE = 1 (DataStage 7.0 and earlier only): Significantly reduces memory usage for very large (more than 100 operator) jobs.
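To make these recommendations concrete, the sketch below shows how the project-wide defaults above might be expressed as UNIX shell exports, for example in the dsenv file or a wrapper script; the configuration file path is a placeholder, not a recommendation:

    # Sketch only: example values for the recommended job-level defaults.
    export APT_CONFIG_FILE=/apps/ds/config/4node.apt    # hypothetical path to the EE configuration file
    export APT_DUMP_SCORE=1                             # write the EE score dump to the job log
    export OSH_ECHO=1                                   # include the generated osh in the job log
    export APT_RECORD_COUNTS=1                          # per-operator, per-partition record counts
    export APT_PM_SHOW_PIDS=1                           # log the UNIX PID of each job process
    export APT_BUFFER_MAXIMUM_TIMEOUT=1                 # maximum buffer delay, in seconds

In practice these are normally set at the project level in DataStage Administrator or per job in Designer; the shell form is shown only to make the values explicit.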

 

Additional Environment Variable Settings

Ascential recommends setting the following environment variables on an as-needed basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific EE stages.

NOTE: The environment variable settings in this section are only examples. Set values that are optimal to your environment.

Sequential File Stage Environment Variables

  • $APT_EXPORT_FLUSH_COUNT = [nrows]: Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for real-time applications, but there is a small performance penalty from increased I/O.
  • $APT_IMPORT_BUFFER_SIZE, $APT_EXPORT_BUFFER_SIZE = [Kbytes]: Define the size of the I/O buffer for Sequential File reads (imports) and writes (exports) respectively. The default is 128 (128 KB), with a minimum of 8. Increasing these values on heavily loaded file servers may improve performance.
  • $APT_CONSISTENT_BUFFERIO_SIZE = [bytes]: In some disk array configurations, setting this variable to a value equal to the read/write size in bytes can improve performance of Sequential File import/export operations.
  • $APT_DELIMITED_READ_SIZE = [bytes]: Specifies the number of bytes the Sequential File (import) stage reads ahead to find the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. Set it to a lower value when reading from streaming inputs (e.g. socket, FIFO) to avoid blocking.
  • $APT_MAX_DELIMITED_READ_SIZE = [bytes]: By default, Sequential File (import) reads ahead 500 bytes to find the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4x each time) up to 100,000 bytes. This variable controls the upper bound, which is 100,000 bytes by default. When more than 500 bytes of read-ahead is desired, use this variable instead of $APT_DELIMITED_READ_SIZE.
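As an illustration only (the values are hypothetical and the FIFO scenario is an assumption, not a recommendation), a near-real-time job reading from a named pipe might override the Sequential File defaults like this:

    # Sketch: per-job Sequential File overrides for a streaming (FIFO) input.
    export APT_EXPORT_FLUSH_COUNT=1      # flush the export buffer every row for low latency
    export APT_DELIMITED_READ_SIZE=100   # small read-ahead so reads from the FIFO do not block
    export APT_IMPORT_BUFFER_SIZE=256    # 256 KB import buffer for a heavily loaded file server
    export APT_EXPORT_BUFFER_SIZE=256    # 256 KB export buffer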

Oracle Environment Variables

  • $ORACLE_HOME = [path]: Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by Oracle scripts.
  • $ORACLE_SID = [sid]: Specifies the Oracle service name, corresponding to a TNSNAMES entry.
  • $APT_ORAUPSERT_COMMIT_ROW_INTERVAL = [num], $APT_ORAUPSERT_COMMIT_TIME_INTERVAL = [seconds]: These two environment variables work together to specify how often target rows are committed for target Oracle stages with the Upsert method. Commits are made whenever the time interval has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
  • $APT_ORACLE_LOAD_OPTIONS = [SQL*Loader options]: Specifies the Oracle SQL*Loader options used in a target Oracle stage with the Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).
  • $APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM = 1: When set, a target Oracle stage with the Load method limits the number of players to the number of datafiles in the table's tablespace.
  • $APT_ORA_WRITE_FILES = [filepath]: Useful for debugging Oracle SQL*Loader issues. When set, the output of a target Oracle stage with the Load method is written to files instead of invoking SQL*Loader; the filepath names the file containing the SQL*Loader commands.
  • $DS_ENABLE_RESERVED_CHAR_CONVERT = 1: Allows DataStage to handle Oracle databases that use the special characters # and $ in column names.
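For example (all values below are illustrative assumptions, not recommendations), a job using the Oracle Upsert method could adjust the default commit behavior like this:

    # Sketch: Oracle-related overrides for a single upsert job.
    export ORACLE_HOME=/u01/app/oracle/product/10.2.0   # hypothetical Oracle install directory
    export ORACLE_SID=DWPROD                            # hypothetical service name from TNSNAMES
    export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=20000      # commit every 20,000 rows ...
    export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=10        # ... or every 10 seconds, whichever comes first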

Job Monitoring Environment Variables

  • $APT_MONITOR_TIME = [seconds]: In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.
  • $APT_MONITOR_SIZE = [rows]: Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (e.g. 1000000).
  • $APT_NO_JOBMON = 1: Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.
  • $APT_RECORD_COUNTS = 1: Prints record counts in the job log as each operator completes processing. The count is per operator per partition.
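For instance, switching from time-based to size-based monitoring for a long-running job could look like the following sketch (the 1,000,000-row threshold is just an example):

    # Sketch: enable size-based job monitoring instead of time-based monitoring.
    unset APT_MONITOR_TIME              # remove the time-based trigger
    export APT_MONITOR_SIZE=1000000     # report progress every 1,000,000 records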

Configuration Files

The configuration file tells DataStage Enterprise Edition how to exploit underlying system resources (processing, temporary storage, and dataset storage). In more advanced environments, the configuration file can also define other resources such as databases and buffer storage. At runtime, EE first reads the configuration file to determine what system resources are allocated to it, and then distributes the job flow across these resources.

When you modify the system, by adding or removing nodes or disks, you must modify the DataStage EE configuration file accordingly. Since EE reads the configuration file every time it runs a job, it automatically scales the application to fit the system without having to alter the job design.

There is not necessarily one ideal configuration file for a given system because of the high variability between the way different jobs work. For this reason, multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources. At runtime, the configuration file is specified through the environment variable $APT_CONFIG_FILE.
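As a sketch (the configuration file paths, project name, and job name below are hypothetical, and it assumes $APT_CONFIG_FILE has been added to the job as an environment-variable parameter), the same job can be run against different configuration files from the dsjob command line:

    # Sketch: run one job against different configuration files.
    dsjob -run -param '$APT_CONFIG_FILE=/apps/ds/config/6node.apt' dwproj LoadWarehouse   # full parallel run
    dsjob -run -param '$APT_CONFIG_FILE=/apps/ds/config/2node.apt' dwproj LoadWarehouse   # constrained run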

Logical Processing Nodes

The configuration file defines one or more EE processing nodes on which parallel jobs will run. EE processing nodes are a logical rather than a physical construct. For this reason, it is important to note that the number of processing nodes does not necessarily correspond to the actual number of CPUs in your system.

Within a configuration file, the number of processing nodes defines the degree of parallelism and the resources that a particular job will use to run. It is up to the UNIX operating system to actually schedule and run the processes that make up a DataStage job across physical processors. A configuration file with a larger number of nodes generates a larger number of processes that use more memory (and perhaps more disk activity) than a configuration file with a smaller number of nodes. While the DataStage documentation suggests creating half the number of nodes as physical CPUs, this is a conservative starting point that is highly dependent on system configuration, resource availability, job design, and other applications sharing the server hardware. For example, if a job is highly I/O dependent or dependent on external (e.g. database) sources or targets, it may be appropriate to have more nodes than physical CPUs.

For typical production environments, a good starting point is to set the number of nodes equal to the number of CPUs. For development environments, which are typically smaller and more resource-constrained, create smaller configuration files (e.g. 2-4 nodes). Note that even in the smallest development environments, a 2-node configuration file should be used to verify that job logic and partitioning will work in parallel (as long as the test data can sufficiently identify data discrepancies).
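For reference, a minimal two-node development configuration file might look like the sketch below (the host name and file system paths are placeholders for your own environment):

    { /* hypothetical 2-node development configuration */
      node "node1" {
        fastname "devhost"
        pools ""
        resource disk "/devfs/ds/disk" {pools ""}
        resource scratchdisk "/devfs/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "devhost"
        pools ""
        resource disk "/devfs/ds/disk" {pools ""}
        resource scratchdisk "/devfs/ds/scratch" {pools ""}
      }
    }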

Optimizing Parallelism

The degree of parallelism of a DataStage EE application is determined by the number of nodes you define in the configuration file. Parallelism should be optimized rather than maximized. Increasing parallelism may better distribute your work load, but it also adds to your overhead because the number of processes increases. Therefore, you must weigh the gains of added parallelism against the potential losses in processing efficiency. The CPUs, memory, disk controllers and disk configuration that make up your system influence the degree of parallelism you can sustain.

Keep in mind that the closest equal partitioning of data contributes to the best overall performance of an application running in parallel. For example, when hash partitioning, try to ensure that the resulting partitions are evenly populated. This is referred to as minimizing skew.

When business requirements dictate a partitioning strategy that is excessively skewed, remember to change the partition strategy to a more balanced one as soon as possible in the job flow. This will minimize the effect of data skew and significantly improve overall job performance.

Configuration File Examples

Given the large number of considerations for building a configuration file, where do you begin? For starters, the default configuration file (default.apt) created when DataStage is installed is appropriate for only the most basic environments.

The default configuration file has the following characteristics:

  • number of nodes = ½ number of physical CPUs
  • disk and scratch disk storage use subdirectories within the DataStage install filesystem

You should create and use a new configuration file that is optimized to your hardware and file systems. Because different job flows have different needs (CPU-intensive, memory-intensive, disk-intensive, database-intensive, sort-heavy, or needing to share resources with other jobs, databases, or applications), it is often appropriate to have multiple configuration files, each optimized for a particular type of processing.

With the synergistic relationship between hardware (number of CPUs, speed, cache, available system memory, number and speed of I/O controllers, local vs. shared disk, RAID configurations, disk size and speed, network configuration and availability), software topology (local vs. remote database access, SMP vs. clustered processing), and job design, there is no definitive science for formulating a configuration file. This section attempts to provide some guidelines based on experience with actual production applications.


IMPORTANT: Follow the order of all sub-items within the individual node specifications in the example configuration files in this section.

Example for Any Number of CPUs and Any Number of Disks

Assume you are running on a shared-memory multi-processor system, an SMP server, which is the most common platform today. Let’s assume these properties:

  • computer host name "fastone"
  • 6 CPUs
  • 4 separate file systems on 4 drives named /fs0, /fs1, /fs2, /fs3

You can adjust the sample to match your precise environment.

The configuration file you would use as a starting point would look like the one below. Assuming that the system load from processing outside of DataStage is minimal, it may be appropriate to create one node per CPU as a starting point.

In the following example, the way disk and scratchdisk resources are handled is the important consideration.

    { /* config files allow C-style comments */
      /* Config files do not have flexible syntax. Keep all the sub-items of the
       * individual node specifications in the order shown here. */
      node "n0" {
        pools ""                                  /* on an SMP, node pools aren't used often */
        fastname "fastone"
        resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 */
        resource scratchdisk "/fs1/ds/scratch" {}
        resource scratchdisk "/fs2/ds/scratch" {}
        resource scratchdisk "/fs3/ds/scratch" {}
        resource disk "/fs0/ds/disk" {}           /* start with fs0 */
        resource disk "/fs1/ds/disk" {}
        resource disk "/fs2/ds/disk" {}
        resource disk "/fs3/ds/disk" {}
      }
      node "n1" {
        pools ""
        fastname "fastone"
        resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
        resource scratchdisk "/fs2/ds/scratch" {}
        resource scratchdisk "/fs3/ds/scratch" {}
        resource scratchdisk "/fs0/ds/scratch" {}
        resource disk "/fs1/ds/disk" {}           /* start with fs1 */
        resource disk "/fs2/ds/disk" {}
        resource disk "/fs3/ds/disk" {}
        resource disk "/fs0/ds/disk" {}
      }
      node "n2" {
        pools ""
        fastname "fastone"
        resource scratchdisk "/fs2/ds/scratch" {} /* start with fs2 */
        resource scratchdisk "/fs3/ds/scratch" {}
        resource scratchdisk "/fs0/ds/scratch" {}
        resource scratchdisk "/fs1/ds/scratch" {}
        resource disk "/fs2/ds/disk" {}           /* start with fs2 */
        resource disk "/fs3/ds/disk" {}
        resource disk "/fs0/ds/disk" {}
        resource disk "/fs1/ds/disk" {}
      }
      node "n3" {
        pools ""
        fastname "fastone"
        resource scratchdisk "/fs3/ds/scratch" {} /* start with fs3 */
        resource scratchdisk "/fs0/ds/scratch" {}
        resource scratchdisk "/fs1/ds/scratch" {}
        resource scratchdisk "/fs2/ds/scratch" {}
        resource disk "/fs3/ds/disk" {}           /* start with fs3 */
        resource disk "/fs0/ds/disk" {}
        resource disk "/fs1/ds/disk" {}
        resource disk "/fs2/ds/disk" {}
      }
      node "n4" {
        pools ""
        fastname "fastone"
        /* We have now rotated through starting with a different disk, but the fundamental
         * problem in this scenario is that there are more nodes than disks. So what do we
         * do now? The answer: something that is not perfect. We are going to repeat the
         * sequence. You could shuffle differently, i.e. use /fs0 /fs2 /fs1 /fs3 as an order,
         * but that most likely won't matter. */
        resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 again */
        resource scratchdisk "/fs1/ds/scratch" {}
        resource scratchdisk "/fs2/ds/scratch" {}
        resource scratchdisk "/fs3/ds/scratch" {}
        resource disk "/fs0/ds/disk" {}           /* start with fs0 again */
        resource disk "/fs1/ds/disk" {}
        resource disk "/fs2/ds/disk" {}
        resource disk "/fs3/ds/disk" {}
      }
      node "n5" {
        pools ""
        fastname "fastone"
        resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
        resource scratchdisk "/fs2/ds/scratch" {}
        resource scratchdisk "/fs3/ds/scratch" {}
        resource scratchdisk "/fs0/ds/scratch" {}
        resource disk "/fs1/ds/disk" {}           /* start with fs1 */
        resource disk "/fs2/ds/disk" {}
        resource disk "/fs3/ds/disk" {}
        resource disk "/fs0/ds/disk" {}
      }
    } /* end of entire config */

The configuration file above follows a "give every node all the disks" pattern, listing the file systems in a different order on each node to minimize I/O contention. This configuration method works well when the job flow is complex enough that it is difficult to determine and precisely plan for good I/O utilization.

Within each node, EE does not "stripe" the data across multiple filesystems. Rather, it fills the disk and scratchdisk filesystems in the order specified in the configuration file. In the six-node example above, the order of the disks is purposely shifted for each node in an attempt to minimize I/O contention.

Even in this example, giving every partition (node) access to all the I/O resources can cause contention, but EE attempts to minimize this by using fairly large I/O blocks.

This configuration style works for any number of CPUs and any number of disks since it doesn't require any particular correspondence between them. The heuristic here is: “When it’s too difficult to figure out precisely, at least go for achieving balance.”

Example that Reduces Contention

The alternative to the first configuration method is more careful planning of the I/O behavior to reduce contention. You can imagine this could be hard given our hypothetical 6-way SMP with 4 disks because setting up the obvious one-to-one correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to be good for overall performance since we create a hotspot.

We could give every CPU two disks and rotate them around, but that would be little different than the previous strategy. So, let’s imagine a less constrained environment with two additional disks:

  • computer host name "fastone"
  • 6 CPUs
  • 6 separate file systems on 6 drives named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5

Now a configuration file for this environment might look like this:

    {
      node "node1" {
        fastname "fastone"
        pools ""
        resource disk "/fs0/ds/data" {pools ""}
        resource scratchdisk "/fs0/ds/scratch" {pools ""}
      }
      node "node2" {
        fastname "fastone"
        pools ""
        resource disk "/fs1/ds/data" {pools ""}
        resource scratchdisk "/fs1/ds/scratch" {pools ""}
      }
      node "node3" {
        fastname "fastone"
        pools ""
        resource disk "/fs2/ds/data" {pools ""}
        resource scratchdisk "/fs2/ds/scratch" {pools ""}
      }
      node "node4" {
        fastname "fastone"
        pools ""
        resource disk "/fs3/ds/data" {pools ""}
        resource scratchdisk "/fs3/ds/scratch" {pools ""}
      }
      node "node5" {
        fastname "fastone"
        pools ""
        resource disk "/fs4/ds/data" {pools ""}
        resource scratchdisk "/fs4/ds/scratch" {pools ""}
      }
      node "node6" {
        fastname "fastone"
        pools ""
        resource disk "/fs5/ds/data" {pools ""}
        resource scratchdisk "/fs5/ds/scratch" {pools ""}
      }
    } /* end of entire config */

While this is the simplest scenario, it is important to realize that no single player, stage, or operator instance on any one partition can go faster than the single disk it has access to.

You could combine strategies by adding in a node pool where disks have a one-to-one association with nodes. These nodes would then not be in the default node pool, but a special one that you would specifically assign to stage / operator instances.
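A sketch of that idea follows (the pool name "bigio" and the node names are hypothetical): the extra nodes declare only the named pool, so they fall outside the default pool and are used only by stages explicitly constrained to "bigio".

    /* Sketch: two additional nodes in a dedicated "bigio" node pool,
     * each with a one-to-one association to its own file system. */
    node "io1" {
      fastname "fastone"
      pools "bigio"                                     /* not in the default "" pool */
      resource disk "/fs4/ds/data" {pools ""}
      resource scratchdisk "/fs4/ds/scratch" {pools ""}
    }
    node "io2" {
      fastname "fastone"
      pools "bigio"
      resource disk "/fs5/ds/data" {pools ""}
      resource scratchdisk "/fs5/ds/scratch" {pools ""}
    }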

Other configuration file tips:

  • Consider avoiding the disk(s) that your input files reside on. Often those disks will be hotspots until the input phase is over. If the job is large and complex, this is less of an issue since the input part is proportionally less of the total work.
  • Ensure that the different file systems mentioned as disk and scratchdisk resources hit disjoint sets of spindles, even if they are located on a RAID system. Do not trust high-level RAID/SAN monitoring tools, as their "cache hit ratios" are often misleading.
  • Never use NFS file systems for scratchdisk resources. Know what is real and what is NFS: real disks are directly attached, or are reachable over a SAN (storage-area network: dedicated, just for storage, low-level protocols).
  • Proper configuration of scratch and resource disk (and the underlying filesystem and physical hardware architecture) can significantly affect overall job performance. Beware of using NFS (and often SAN) filesystem space for disk resources. For example, your final result files may need to be written out to the NFS disk area, but that does not mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool and constrain the result sequential file or data set to reside there, while letting intermediate storage go to local or SAN resources, not NFS (see the sketch below).
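As an illustration of that last point (the pool name "final" and the NFS mount point are hypothetical), a disk resource can be placed in a named disk pool so that only the stages constrained to that pool write to NFS:

    /* Sketch: reserve an NFS-backed disk resource for final results only.
     * Intermediate data sets keep using the local default-pool disk resources. */
    node "node1" {
      fastname "fastone"
      pools ""
      resource disk "/fs0/ds/data" {pools ""}           /* default pool: intermediate data sets */
      resource disk "/nfs/results/ds" {pools "final"}   /* named pool: final result data sets */
      resource scratchdisk "/fs0/ds/scratch" {pools ""}
    }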
