May 20, 2010

Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Ø Subject-Oriented: It can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Ø Integrated: It integrates data from multiple data sources.

Ø Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Ø Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.


Functional view of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

What it takes to implement a successful data warehouse project?

1. Tools: - To accomplish a goal, make sure the proper tools are selected. The selection of business intelligence tools and the selection of the data warehousing team.

Database/Hardware Selections:

ü Scalability:

o How can the system grow as your data storage needs grow?

o Which RDBMS and hardware platform can handle large sets of data most efficiently?

ü Parallel Processing Support:

ü RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that is hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent.


ETL (Extraction, Transformation, and Loading) - Tool Selection

ü Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.

ü Data cleansing (correcting (or removing) corrupt or inaccurate records) needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.

ü Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.

ETL Tool Functionalities:

Ø Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece.

Ø Ability to read directly from your data source:

Ø Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata.


OLAP - Online analytical processing:

(Multidimensional OLAP - make it easy for users to look at the data from multiple dimensions. ) and ROLAP (Relational OLAP):

Ø Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help loading the data into the cubes as quickly as possible.

Ø Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.

Ø Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.

Ø Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important.

Ø Metadata support: Because OLAP tools aggregates the data into the cube and sometimes serves as the front-end tool, it is essential that it works with the metadata strategy/tool you have selected.


Reporting Tool Selection

Ø Number of reports

Ø Desired Report Distribution Mode (email or over the browser )

Ø Ad Hoc Report Creation (Will the users be able to create their own ad hoc reports?)


Reporting Tool Functionalities

ü Front End: Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.

ü Data source connection capabilities - Two types of data sources:

I. The relationship database

II. The OLAP multidimensional data source

ü Scheduling and distribution capabilities

ü Security Features:

ü Customization: Provide easy way to pre-set the reports to look exactly the way that adheres to the corporate standard

ü Export capabilities: Excel/CSV, PDF, XML etc.


Buy vs. Build decision

ü Cost

ü Implementation Time

ü Documentation

ü Functionality /Features

ü Tailored for the exact needs

ü Reliance on third-party


Ø User technical skills

Ø Requirements

Ø Available budget

Ø Time

Ø Support


Data Warehouse Team

Ø Project Manager

Ø DBA

Ø Technical Architect from the backend hardware/software to the client desktop configurations.

Ø ETL Developer

Ø Front End Developer

Ø OLAP Developer

Ø Trainer

Ø Data Modeler

Ø QA Group

No comments: