Slowly Changing Dimensions
Javier Ríos

Javier Ríos

Business Intelligence Consultant | Marketing Expert | Data Specialist

Other Articles:

Slowly Changing Dimensions

Share on linkedin
LinkedIn
Share on facebook
Facebook
Share on twitter
Twitter
Share on whatsapp
WhatsApp

When working on the dimensional modeling of a Data Warehouse, we must know and understand the nature of the information within the context of the project and the needs of the business to define if it can be overwritten or if it must keep a historical record of all its changes.

By default, a transactional system can modify the information. However, it retains only the latest version of the data affected by the change. The way to keep in our Data Warehouse a history of all the changes that allow us to explore the specific value of data at a given time is by applying the Slowly Changing Dimensions (SCD) methodologies.

A Slowly Changing Dimension will change slowly over time without a defined pattern rather than at predetermined intervals.

Conclusions

To understand the importance of SCDs, it is best to look at an example. A company that sells cell phones divides the devices into 3 ranges: high, medium, and basic. In this sector, technology evolves quickly, so that a terminal that is high-end (for example), a few months later may be considered mid-range with the arrival of new devices.

The company’s ERP transactional system stores the information in a standardized database. On the one hand, we have the “sales” table, where each transaction is recorded. Another “product” table contains the technical information of the mobiles, including a field indicating the range.

SCD Intro 2

The marketing department, when loading the data into the Data Warehouse, needs to know how many cell phones are sold each year in each of the ranges. What happens when a cell phone changes range? By modifying this data in the product table, all your past sales also change to the new range, falsifying the information and making it impossible to know this data with certainty.

Solution with Slowly Changing Dimensions

A possible solution to the previous problem would be to use the methodology, adding to the product table two date fields that indicate respectively the start and end date of validity of a record and adding a new row/version of the record each time a change is applied to the range.

Continuing with the example, the product table would initially have a structure like this:

Slowly Changing Dimensions Intro1

Applying the above solution, the new data structure of the product table would be as follows for the same cell phone (SA-23) once it is decided to change its range:

It is important to note that a new SK (surrogate key) column has also been added, as the previously used natural product ID key will no longer be unique.

Thanks to this solution, it will be possible to know the range of each product at the exact moment of sale.

Is it necessary to always reflect the different versions?

It will always depend on the nature of the data. If for example, the data that we change is the attribute “Name” of the cell phone because we had written it incorrectly, it will not be necessary to have another row/version with the corrected name since what we want is that it always appears correctly, even in the sales that were made previously.

This leads us to the existence of different types of SCD.

Types of Slowly Changing Dimensions

There are different types of SCD depending on the methodology used in each case.

– Type 0. The data never changes. The facts are associated with the original value of the attribute.

– Type 1. Overwriting. A new attribute value overwrites a previous value… The facts are associated with the current value of the attribute. There is no history of changes in the data.

– Type 2. Add a new dimension row with the new attribute value. The method saves a history of each change, overwriting the “end date” and “current version” attributes of the previous version. It is also necessary to add a surrogate key. Events are associated with the active value of the attribute at the time the event occurs. There are several ways to implement this type of SCD, the most common and complete being the following:

The main disadvantage is that the natural key is repeated so it is necessary to remove the restriction of using a unique value. This implies that it is necessary to add a surrogate key.

– Type 3. Add a new column to keep only the current value and the previous value of the attribute. When a change is detected, the current value is moved to the previous value column and the new value overwrites the current value column. The facts are associated with the current and previous attribute values.

Its main disadvantage is that the number of states is limited to the current and previous state.

– Type 4. Use a mini-dimensional table derived from a main historical table. The history table contains all attributes, including a “start date” and a subrogated key. A new row is created for each version. The mini-dimensional table contains only the most recent version of each attribute with all its columns, only some columns or columns derived from these. The fact table will be associated with both the mini-dimensional table and the history table.

– Type 5. It is a combination of type 1 and type 4. In practice, it would consist of adding a copy of the mini-dimensional table related in this case directly to the historical table so as not to have to necessarily go through the fact table to perform a query on it.

– Tipo 6. It is a combination of types 1, 2, and 3. Type 1 attribute overwrite, type 2 additional dimensional rows and type 3 “historical value” (or previous value) overwrite are used. The “historical value” will coincide with the “current value”, so all previous statements are overwritten. The events are associated with the active value of the attribute at the time the event occurs and with the current values.

– Type 7. Following the functionality of type 6, an additional table containing only the current version is used to improve reporting performance. The events are associated with both the active value of the attribute at the time the event occurs and the current values.

What type to choose?

It will depend entirely on the needs of the project. Moreover, in the same table, different SCD types can be applied for each column.

For example, in a table where we have among other data the city of an employee and his date of birth, we could apply a type 2 for the city and a type 1 for the date of birth.

Therefore, we need to analyze and understand for each dimension:

– Which attributes are constant

– Which attributes are variables

– In which attributes should we save a complete version history or save only the current and previous version.

It is also interesting to know the current and expected future data volume as it may help in the choice of the type of SCD.

However, when in doubt and the absence of other constraints, maintaining a version history will always be a more flexible and reversible solution.

Conclusion

Slowly Changing Dimensions methodologies are a must when planning data loading in the Data Warehouse. But it is also essential to know the specific needs of the project and of each of the attributes to decide when their application is required and the appropriate type of SCD.

Now you know the raison d’être of the SCD and its implications. In the next article, we will discuss how to implement it at a practical level.

References

Deja un comentario

Your email address will not be published. Required fields are marked *