May 20, 2010

Data Warehousing – Concepts

Dimensional Data Model

Some of the terms commonly used in this modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
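As a rough illustration in generic SQL (the table and column names here are assumptions, not part of the original example), such a lookup table could be declared and populated like this:

CREATE TABLE lookup_quarter (
    quarter_key   INTEGER PRIMARY KEY,   -- unique ID that identifies the quarter
    quarter_label VARCHAR(10)            -- how the quarter is represented on a report
);

INSERT INTO lookup_quarter (quarter_key, quarter_label) VALUES (1, 'Q1 2001');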

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.


Star schema

In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table.

All measures in the fact table are related to all the dimensions that fact table is related to. In other words, they all have the same level of granularity.

A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.


Sample star schema

Let's look at an example: Assume our data warehouse keeps store sales data, and the different dimensions are time, store, product, and customer. In this case, the figure on the left represents our star schema. The lines between two tables indicate that there is a primary key / foreign key relationship between the two tables. Note that different dimensions are not related to one another.
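A minimal sketch of this example in generic SQL (the table and column names such as fact_sales and dim_time are illustrative assumptions, not taken from the figure) might look like this:

-- Dimension (lookup) tables, one per dimension
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, calendar_date DATE, month VARCHAR(10), year INTEGER);
CREATE TABLE dim_store    (store_key    INTEGER PRIMARY KEY, store_name VARCHAR(50));
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name VARCHAR(50));
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name VARCHAR(50));

-- Fact table: one foreign key per dimension plus the measure, all at the same grain
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time (time_key),
    store_key    INTEGER REFERENCES dim_store (store_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    customer_key INTEGER REFERENCES dim_customer (customer_key),
    sales_amount DECIMAL(12,2)
);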

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Sample snowflake schema

For example, consider a Time dimension that consists of 2 different hierarchies:

1. Year → Month → Day
2. Week → Day

We will have 4 lookup tables in a snowflake schema: A lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. A sample snowflake schema illustrating the above relationships in the Time Dimension is shown to the right.
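A rough sketch of those four lookup tables in generic SQL (names are illustrative assumptions) could look like this:

CREATE TABLE lookup_year  (year_key  INTEGER PRIMARY KEY, year_value INTEGER);
CREATE TABLE lookup_month (month_key INTEGER PRIMARY KEY, month_name VARCHAR(10),
                           year_key  INTEGER REFERENCES lookup_year (year_key));   -- Year → Month
CREATE TABLE lookup_week  (week_key  INTEGER PRIMARY KEY, week_number INTEGER);
CREATE TABLE lookup_day   (day_key   INTEGER PRIMARY KEY, calendar_date DATE,
                           month_key INTEGER REFERENCES lookup_month (month_key),  -- Month → Day
                           week_key  INTEGER REFERENCES lookup_week (week_key));   -- Week → Day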

The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

Slowly Changing Dimension

This is a common problem particular to data warehousing. In a nutshell, it applies to cases where an attribute for a record varies over time.

e.g.,

Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:

Customer Key | Name      | State
1001         | Christina | Illinois

At a later date (January 2003), she moved to Los Angeles, California. How should ABC Inc. now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are three ways to solve this type of problem:

Type 1: The new record replaces the original record. No trace of the old record exists. This is easy to implement, but history is lost.

Customer Key | Name      | State
1001         | Christina | California
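In generic SQL (assuming an illustrative table named customer with the columns shown above), a Type 1 change is a simple overwrite:

-- Type 1: overwrite the attribute in place; the old value is lost
UPDATE customer
SET    state = 'California'
WHERE  customer_key = 1001;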

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people. This keeps history, but more data will be stored.

Customer Key | Name      | State
1001         | Christina | Illinois
1005         | Christina | California
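In generic SQL (same illustrative customer table as above), a Type 2 change adds a new row under a new surrogate key, leaving the original row untouched:

-- Type 2: insert a new record; the original record (key 1001) is preserved
INSERT INTO customer (customer_key, name, state)
VALUES (1005, 'Christina', 'California');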

Type 3: The original record is modified to reflect the change. This does not increase the size on disk and allows us to keep part of the history, but it cannot keep the full history when an attribute changes more than once.

Customer Key | Name      | Original State | Current State | Effective Date
1001         | Christina | Illinois       | California    | 15-JAN-2003
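In generic SQL (illustrative names again), a Type 3 change updates the extra columns on the existing record:

-- Type 3: keep the original value, record the current value and the effective date
UPDATE customer
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'
WHERE  customer_key = 1001;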

Conceptual data model

A conceptual data model identifies the highest-level relationships between the different entities. The only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities.

Features of a conceptual data model include:

Ø Includes the important entities and the relationships among them

Ø No attribute is specified

Ø No primary key is specified


Logical data model

This model describes the data in as much detail as possible, without regard to how it will be physically implemented in the database.

Features of a logical data model include:

Ø Includes all entities and relationships among them

Ø All attributes for each entity are specified

Ø The primary key for each entity is specified

Ø Foreign keys (keys identifying the relationship between different entities) are specified

Ø Normalization occurs at this level

The steps for designing the logical data model are as follows:

ü Specify primary keys for all entities

ü Find the relationships between different entities

ü Find all attributes for each entity

ü Resolve many-to-many relationships

ü Normalization


Physical data model

This model represents how the model will be built in the database. A physical database model shows all table structures, including column name, column data type, column constraints, primary key, foreign key, and relationships between tables.

Features of a physical data model include:

Ø Specification of all tables and columns

Ø Foreign keys are used to identify relationships between tables

Ø Denormalization may occur based on user requirements

Ø Physical considerations may cause the physical data model to be quite different from the logical data model

The steps for physical data model design are as follows:

Ø Convert entities into tables

Ø Convert relationships into foreign keys

Ø Convert attributes into columns

Ø Modify the physical data model based on physical constraints / requirements


Data integrity

Data integrity refers to the validity of data, meaning data is consistent and correct. In a data warehouse or a data mart, there are three areas where data integrity needs to be enforced:

Database level - We can enforce data integrity at the database level. Common ways of enforcing data integrity include the following (a minimal sketch follows the list):

· Referential integrity

· Primary key / Unique constraint

· Not NULL vs NULL-able

· Valid Values - Only allowed values are permitted in the database.
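In generic SQL, a simplified version of the illustrative fact table used earlier could carry all of these controls (table and column names are assumptions):

CREATE TABLE fact_sales (
    time_key     INTEGER NOT NULL REFERENCES dim_time (time_key),   -- referential integrity
    store_key    INTEGER NOT NULL REFERENCES dim_store (store_key), -- NOT NULL vs NULL-able
    sales_amount DECIMAL(12,2) NOT NULL CHECK (sales_amount >= 0),  -- valid values only
    PRIMARY KEY (time_key, store_key)                               -- primary key / unique constraint
);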

ETL process - For each step of the ETL process, data integrity checks should be put in place to ensure that source data is the same as the data in the destination. Most common checks include record counts or record sums.
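For example, a simple post-load check (generic SQL; the staging and target table names are assumptions) compares record counts and a record sum between source and destination:

SELECT
    (SELECT COUNT(*)          FROM stage_sales) AS source_row_count,
    (SELECT COUNT(*)          FROM fact_sales)  AS target_row_count,
    (SELECT SUM(sales_amount) FROM stage_sales) AS source_amount_sum,
    (SELECT SUM(sales_amount) FROM fact_sales)  AS target_amount_sum;
-- The load is suspect if the source and target figures do not match.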

Access level - We need to ensure that data is not altered by any unauthorized means either during the ETL process or in the data warehouse. To do this, there needs to be safeguards against unauthorized access to data (including physical access to the servers), as well as logging of all data access history.


What is OLAP?

OLAP is commonly characterized as Fast Analysis of Shared Multidimensional Information (FASMI).

For people on the business side, the key feature in the above definition is 'Multidimensional': in other words, the ability to analyze metrics along different dimensions such as time, geography, gender, product, etc.

In the OLAP world, there are mainly two different types:

MOLAP - Multidimensional OLAP

In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages:

Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.

Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages:

Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself.

Requires additional investment: Cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.

ROLAP - Relational OLAP

This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
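For instance, slicing the sales measure down to a single store is nothing more than an extra condition in the WHERE clause (generic SQL against the illustrative star schema sketched earlier; the store name is made up):

SELECT d.year, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_time  d ON f.time_key  = d.time_key
JOIN   dim_store s ON f.store_key = s.store_key
WHERE  s.store_name = 'Chicago Downtown'    -- the "slice"
GROUP BY d.year;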

Advantages:

Can handle large amounts of data

Can leverage functionalities inherent in the relational database

Disadvantages:

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.


HOLAP - Hybrid OLAP

HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Data Warehouse - Data Warehousing Project Cycle

Requirement Gathering

· A list of reports / cubes to be delivered to the end users by the end of this current phase.

· Updated project plan that clearly identifies resource loads and milestone delivery dates.

Environment Setup

Set up a development environment and a production environment including different processes (such as ETL, OLAP Cube, and reporting).

Outcome: Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software

Data Modeling

A good data model will allow the data warehousing system to grow easily, as well as allow for good performance. In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model.

ü Identification of data sources.

ü Logical data model.

ü Physical data model

ETL Design Phase

This can easily take up 50% of the data warehouse implementation cycle or longer. The reason is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.

Ø Data Mapping Document

Ø ETL Script / ETL Package in the ETL tool


OLAP Cube/Report Design

Usually the design of the OLAP cube can be derived from the Requirement Gathering phase.

Ø Documentation specifying the OLAP cube dimensions and measures.

Ø Actual OLAP cube / report.


Front End Development (Web based) and Deployment Documentation


Report Development

ü User customization

ü Report delivery

ü Access privileges


Ø Report Specification Documentation.

Ø Reports set up in the front end / reports delivered to user's preferred channel.


Performance Tuning and Query Optimization:

1. ETL - Tune the ETL process as much as possible

2. Query Processing

Ø Understand how your database is executing your query

Ø Retrieve as little data as possible

Ø Store intermediate results - store intermediate results in a temporary table so that later steps can reuse them instead of recomputing them

Query optimization strategies:

· Use Index

· Aggregate Table - Pre-populate tables at higher summary levels so that less data needs to be scanned (see the sketch after this list).

· Vertical Partitioning - Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.

· Horizontal Partitioning - Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.

· De-normalization - The process of de-normalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.

· Server Tuning - Each server has its own parameters, and often tuning server parameters so that it can fully take advantage of the hardware resources can significantly speed up query performance.
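As an illustration of the aggregate table strategy above (generic SQL; the CREATE TABLE ... AS SELECT syntax and the table names are assumptions that vary by database), sales can be pre-summarized to the month level so reports at that grain avoid scanning the detailed fact table:

-- Pre-populated aggregate table at the year/month level
CREATE TABLE agg_sales_by_month AS
SELECT d.year, d.month, SUM(f.sales_amount) AS sales_amount
FROM   fact_sales f
JOIN   dim_time d ON f.time_key = d.time_key
GROUP BY d.year, d.month;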

3. Report Delivery - Network traffic, Server setup and front-end code

Quality Assurance

Ø QA Test Plan

Ø QA verification that the data warehousing system is ready to go to production


Backup strategy


Rollout To Production - Go Live.

Take care to address the user education needs.


Production Maintenance

Consistent availability of the data warehousing system to the end users

Incremental Enhancements

Ø Change management documentation

Ø Actual change to the data warehousing system

Additional Observations

Ø Overall Implementation Time and effort

Ø How To Measure Success

Each subsection discussed above needs to address the following:

Ø Task Description - What needs to be accomplished?

Ø Time Requirement - Rough estimate of time.

Ø Deliverables - One or more documents which describe the steps and results of the task.

Ø Possible Pitfalls - Things to watch out for.

Recipes for Data Warehousing Project Failure

Ø Focusing On Ideology Rather Than Practicality

Ø Making The Process Unnecessarily Complicated

Ø Lack of Clear Ownership

Ø Not Understanding Proper Protocol - failing to understand the organization's protocol

Ø Not Fully Understanding Project Impact Before The Project Starts

Ø Blindly Sticking To Certain Standards

Ø Bad Project Management


Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

Ø Subject-Oriented: It can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Ø Integrated: It integrates data from multiple data sources.

Ø Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Ø Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.


Functional view of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

What does it take to implement a successful data warehouse project?

1. Tools: To accomplish the goal, make sure the proper tools are selected. This covers both the selection of business intelligence tools and the selection of the data warehousing team.

Database/Hardware Selections:

ü Scalability:

o How can the system grow as your data storage needs grow?

o Which RDBMS and hardware platform can handle large sets of data most efficiently?

ü Parallel Processing Support:

ü RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that are hardware-platform dependent. As a result, bugs and bug fixes are often hardware dependent.


ETL (Extraction, Transformation, and Loading) - Tool Selection

ü Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.

ü Data cleansing needs (i.e., correcting or removing corrupt or inaccurate records): Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.

ü Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.

ETL Tool Functionalities:

Ø Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece.

Ø Ability to read directly from your data source:

Ø Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata.


OLAP (Online Analytical Processing) Tool Selection:

Whether the tool is MOLAP (Multidimensional OLAP, which makes it easy for users to look at the data from multiple dimensions) or ROLAP (Relational OLAP), consider the following:

Ø Ability to leverage parallelism supplied by the RDBMS and hardware: This would greatly increase the tool's performance, and help load the data into the cubes as quickly as possible.

Ø Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.

Ø Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.

Ø Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important.

Ø Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.


Reporting Tool Selection

Ø Number of reports

Ø Desired Report Distribution Mode (email or over the browser)

Ø Ad Hoc Report Creation (Will the users be able to create their own ad hoc reports?)


Reporting Tool Functionalities

ü Front End: Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.

ü Data source connection capabilities - Two types of data sources:

I. The relational database

II. The OLAP multidimensional data source

ü Scheduling and distribution capabilities

ü Security Features:

ü Customization: Provides an easy way to pre-set reports so they look exactly the way that adheres to the corporate standard

ü Export capabilities: Excel/CSV, PDF, XML etc.


Buy vs. Build decision

ü Cost

ü Implementation Time

ü Documentation

ü Functionality /Features

ü Tailored for the exact needs

ü Reliance on third-party


Ø User technical skills

Ø Requirements

Ø Available budget

Ø Time

Ø Support


Data Warehouse Team

Ø Project Manager

Ø DBA

Ø Technical Architect - covering everything from the backend hardware/software to the client desktop configurations.

Ø ETL Developer

Ø Front End Developer

Ø OLAP Developer

Ø Trainer

Ø Data Modeler

Ø QA Group