Loading a data warehouse involves processing a huge number of records, so even finely tuned ETL jobs take a considerable amount of time. Parallel processing in ODI helps reduce the overall time taken to load your data warehouse. In this article, we look at the different ways to achieve parallel processing in ODI and discuss in detail inter-entity parallel processing, which ODI provides out of the box.
Parallel processing can be achieved in ODI at two levels.
- Entity level parallelism: An entity is a database table such as a dimension or a fact. Entities can be loaded in parallel in an ODI package using Asynchronous mode.
- Slice level parallelism: A slice is a chunk of data. The data to be loaded into a single entity can be sliced on various criteria, and the slices loaded into the warehouse in parallel. This has to be developed in ODI by customizing a few components.
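To make the idea of a slice concrete, here is a minimal Python sketch of the concept only, not of ODI itself. It assumes a hypothetical numeric key (`order_id`) as the slicing criterion, and `load_slice` is a stand-in for whatever customized component would load one slice.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_rows(rows, slice_count, key):
    """Split rows into slice_count chunks using a numeric key (hypothetical criterion)."""
    slices = [[] for _ in range(slice_count)]
    for row in rows:
        slices[row[key] % slice_count].append(row)
    return slices


def load_slice(rows):
    """Stand-in for loading one slice into the target entity; returns the row count."""
    return len(rows)


# Ten illustrative rows for a single entity, split into three slices.
rows = [{"order_id": i} for i in range(10)]
slices = slice_rows(rows, slice_count=3, key="order_id")

# Each slice is loaded by its own worker, so the slices load in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    loaded_counts = list(pool.map(load_slice, slices))

print(loaded_counts)
```

Every row lands in exactly one slice, so the slices can be loaded independently into the same entity.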
This article discusses entity level parallelism in detail.
Entity Level Parallelism:
In ODI packages, a step can be executed in one of two modes: Synchronous or Asynchronous.
A synchronous execution will serialize the scenario execution with other steps in the package: ODI executes the scenario, and only after its execution is completed, runs the next step.
An asynchronous execution will only invoke the scenario but will immediately execute the next step in the calling package: the scenario will then run in parallel with the next step. You can use this option to start multiple scenarios concurrently: they will all run in parallel, independently of one another.
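The difference between the two modes can be illustrated with a small Python sketch. This is an analogy built on threads, not ODI code: `run_scenario` and `start_scenario` are hypothetical stand-ins, and the sleep durations are arbitrary.

```python
import threading
import time

events = []  # records the order in which things happen


def run_scenario(name, seconds, started):
    """Stand-in for a scenario: logs its start, works for a while, logs its end."""
    events.append(f"{name} start")
    started.set()
    time.sleep(seconds)
    events.append(f"{name} end")


def start_scenario(name, seconds, synchronous):
    """Invoke a scenario; block until it finishes only in synchronous mode."""
    started = threading.Event()
    worker = threading.Thread(target=run_scenario, args=(name, seconds, started))
    worker.start()
    started.wait()          # the invocation itself has happened
    if synchronous:
        worker.join()       # synchronous: the caller waits for completion
    return worker


# Synchronous: the next step runs only after scenario A has ended.
start_scenario("A", 0.05, synchronous=True)
events.append("step after A")

# Asynchronous: the next step runs while scenario B is still working.
worker_b = start_scenario("B", 0.2, synchronous=False)
events.append("step after B")
worker_b.join()

print(events)
```

In the recorded events, "step after A" appears after "A end", while "step after B" appears between "B start" and "B end".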
Before we implement parallel processing between the entities of a data warehouse, it is very important to identify the dependent entities within a functional area of the warehouse.
To understand this better, let us assume we have the Orders star schema in our warehouse.
The Orders star schema has one fact, “Order Fact”, and five dimensions: “Time Period”, “Geo Location”, “Sales Org”, “Product”, and “Customer”.
As a rule in data warehousing, facts depend on dimensions. So, the order of loading should be
- 1. Dimensions
- 2. Facts
Now, within each group, the entities can be run in parallel. This can be achieved in ODI Packages using asynchronous mode.
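The grouping can also be derived mechanically from a dependency map. The following Python sketch is one possible way to do that, assuming the dependencies of the Orders star are written down by hand; `load_groups` is a hypothetical helper, not an ODI feature.

```python
def load_groups(dependencies):
    """Group entities into levels; each level depends only on earlier levels."""
    remaining = {entity: set(deps) for entity, deps in dependencies.items()}
    loaded = set()
    groups = []
    while remaining:
        # Entities whose dependencies are all loaded can run in parallel now.
        ready = sorted(e for e, deps in remaining.items() if deps <= loaded)
        if not ready:
            raise ValueError("circular dependency between entities")
        groups.append(ready)
        loaded.update(ready)
        for entity in ready:
            del remaining[entity]
    return groups


dimensions = ["Time Period", "Geo Location", "Sales Org", "Product", "Customer"]

# Dimensions depend on nothing; the fact depends on every dimension.
dependencies = {name: [] for name in dimensions}
dependencies["Order Fact"] = dimensions

groups = load_groups(dependencies)
for level, entities in enumerate(groups, start=1):
    print(level, entities)
```

For the Orders star this yields two groups: the five dimensions first, then the fact. Entities inside one group are candidates for asynchronous execution.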
To achieve this, two packages are created in ODI: one to load the dimensions and the other to load the facts.
In the ODI package for loading dimensions, the dimension scenarios are laid out as a sequence of steps, so it looks as if the dimensions run one after another. Parallel execution is achieved through the options set in the properties window of each scenario.
“Asynchronous” mode is enabled for the Load Geo Dim scenario. Hence, the ODI package starts the next step without waiting for it to finish. Similarly, Asynchronous mode is enabled for each of the other dimension scenarios. The last step, OdiWaitForChildSession, waits until all the dimensions in this package are loaded before the session “Load Dimensions” is marked as complete.
The Load Fact package contains only one fact. So, the ODI package executes it in Synchronous mode, which is the default.
The RUN ALL package contains these two scenarios as steps executed in synchronous mode. RUN ALL is scheduled, so when it fires, the LOAD DIMENSIONS scenario is invoked first and loads the dimensions in parallel. The LOAD FACTS scenario is invoked only after all the dimensions are loaded, which maintains the dependency between dimensions and facts.
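The overall flow can be modelled in a few lines of Python. This is only a sketch of the control flow under the assumptions stated in the comments: the function names are hypothetical, each load is simulated with a short sleep, and the call to `wait` plays the role that the OdiWaitForChildSession step plays in the package.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

DIMENSIONS = ["Time Period", "Geo Location", "Sales Org", "Product", "Customer"]
log = []  # entities in the order in which their loads finished


def load_entity(name):
    """Stand-in for one load scenario (simulated with a short sleep)."""
    time.sleep(0.05)
    log.append(name)
    return name


def load_dimensions():
    """Start every dimension load asynchronously, then wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(DIMENSIONS)) as pool:
        child_sessions = [pool.submit(load_entity, d) for d in DIMENSIONS]
        wait(child_sessions)            # barrier: all dimension loads are done
        for session in child_sessions:
            session.result()            # re-raise any error from a child load


def load_facts():
    """Only one fact, so it is simply loaded synchronously."""
    load_entity("Order Fact")


def run_all():
    """Two synchronous steps: dimensions first, then facts."""
    load_dimensions()
    load_facts()


run_all()
print(log)
```

The dimensions may finish in any order relative to one another, but the fact is always loaded last.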
Data warehouse loading times can be reduced to a great extent by using this out-of-the-box ODI feature. For larger warehouses, slice parallelism and database parallelism techniques are used to reduce the loading times further.