Data Stage Homework Help, Assignment Help, Online Tutor, Tutoring

Submit Assignment

* *

Payment Online Assignment Help

Homework Help

sfsf

Data Stage

DataStage is actually two separate things.

In production (and, of course, in development and test environments) DataStage is just another application on the server, an application which connects to data sources and targets and processes ("transforms") the data as they move through the application. Therefore DataStage is classed as an "ETL tool", the initials standing for extract, transform and load respectively.

DataStage "jobs", as they are known, can execute on a single server or on multiple machines in a cluster or grid environment. Like all applications, DataStage jobs consume resources: CPU, memory, disk space, I/O bandwidth and network bandwidth.

DataStage also has a set of Windows-based graphical tools that allow ETL processes to be designed, the metadata associated with them managed, and the ETL processes monitored. These client tools connect to the DataStage server because all of the design information and metadata are stored on the server. On the DataStage server, work is organized into one or more "projects".

There are also two DataStage engines, the "server engine" and the "parallel engine".

The server engine is located in a directory called DSEngine whose location is recorded in a hidden file called /.dshome (that is, a hidden file called .dshome in the root directory) and/or as the value of the environment variable DSHOME. (On Windows-based DataStage servers the folder name is Engine, not DSEngine, and its location is recorded in the Windows registry rather than in /.dshome.)

The parallel engine is located in a sibling directory called PXEngine whose location is recorded in the environment variable APT_ORCHHOME and/or in the environment variable PXHOME.

DataStage Engines

The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs on the server. The parallel engine results from acquisition of Orchestrate, a parallel execution technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be distributed over multiple logical "processing nodes" whether these are in a single machine or multiple machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed without change to the design of the job.