Temporal Data Update Methodologies for Data Warehousing



INTRODUCTION
In a competitive business environment, successful businesses are data driven. Data warehouse architecture selection is founded on business needs (Ariyachandra and Watson 2010). Business executives want to make both strategic and tactical business decisions (Brobst et al. 2008) with accurate information at the right time. The accuracy of information depends on detailed data as well as time-varying data. A data warehouse with time-varying data is instrumental in strategic decision making. The business requirements for temporal data go beyond what is typical of conventional database implementations.
Customer transactions keep changing over time with changing customer behavior patterns (Apeh and Gabrys 2013). Temporal data is concerned with time-varying data, which means that each version of a record is relevant to some moment in time (Inmon et al. 2001; Martin and Abello 2003; Snodgrass 2010). The temporal aspects normally consist of valid time and transaction time. Valid time defines the time period when a particular tuple is true in modeled reality, while transaction time defines the time period when that particular tuple is captured in the database (Martin and Abello 2003; Torp et al. 2000).
A temporal data warehouse is significantly different from an operational database in many respects (Shin 2003). Operational source systems are usually non-temporal and maintain only the current state of data, as opposed to the complete history of data with transaction lineage (Bruckner and Tjoa 2002; Rahman 2008a). Data warehouses, in contrast, are maintained to hold large volumes of historical data.
Data management and warehousing are considered the foundation of business intelligence (BI) and analytics (Chen et al. 2012). During the last decade data warehousing has achieved prominence: scattered databases and data marts are being consolidated into more useful data warehouses. The advent of new information technologies and techniques such as temporal data warehousing presents unique opportunities for firms to enhance their customer agility (Roberts and Grover 2012). It also speaks to the maturity of data warehousing technologies (Sen et al. 2006). Temporal data warehousing has gained prominence among different stakeholders, including suppliers, business users, and researchers, because of user popularity and management patronage (Jensen 2000).
"A temporal data warehouse is a repository of historical information, originating from multiple, autonomous, (sometimes) heterogeneous and non-temporal sources. It is available for queries and analysis (such as data mining) not only to users interested in current information but also to those interested in researching past information to identify relevant trends (Amo and Alves 2000)." W.H. Inmon defines a temporal data warehouse as "a collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data (Inmon 2002)." In this definition, time-varying means the possibility of keeping different values of the same record according to its changes over time (Malinowski and Zimányi 2006).
Temporal data warehouses provide a history of serialized changes to data identified by times when changes occurred (Golfarelli and Rizzi 2009). This allows for querying the current state as well as past states of a record (Fegaras and Elmasri 1998). Conventional databases provide users only current state of data which is true as of a single point in time (Ozsoyoglu and Snodgrass 1995). Users of a data warehouse are not only interested in the current state of data, but also in the transaction lineage as to how a particular record has evolved over time (Bruckner and Tjoa 2002). A record inserted in a database is never physically deleted (Chountas et al. 2004). A new record or a new version of an existing record is always added to reflect a transaction lineage for that data. Thus an evolving history of data is maintained in the temporal data warehouse.
Temporal data has important applications in many domains (Jensen 1999; Jestes 2012), and applications in most of those domains can benefit from a temporal data warehouse (Thomas and Datta 2001; Yang and Widom 1998): banking, retail sales, financial services, medical records, inventory management, telecommunications, and reservation systems. In the case of a bank account, an account holder's balance changes after each transaction. The amount or description of a financial document changes for business purposes. Such data is often valuable to different stakeholders and should be stored in both its current state and all previously current states.
Although there are clear benefits and demand for temporal database management systems (DBMS), only a few are commercially available (Snodgrass 2010; Torp 1998). Most current commercial databases are non-temporal and hence do not provide a special temporal query language, a temporal data definition language, or a temporal data manipulation language (Bellatreche and Wrembel 2013; Kaufmann 2013; Mkaouar et al. 2011; TimeConsult 2013).
In the absence of a temporal DBMS, we argue that an effort should be made to take advantage of current commercial databases and allow for handling multiple versions of data including past, current, and future states of data. This can be done with application coding for handling multiple versions of data. The current commercial relational databases with a high-level language such as SQL are mature enough to manage complex data transformations (Stonebraker et al. 2005) and also have performance improvement measures, such as various efficient algorithms for indexing. The improvements in the area of disk storage technology and the declining cost of data storage (Chaudhuri et al. 2011) have also made it possible to efficiently store and manage temporal data with all transaction lineages (Ahn and Snodgrass 1986; Torp 1998).
A temporal database implementation can be achieved by extending a non-temporal data model into a temporal data model and building temporal support into applications. Two timestamp fields need to be added to each table of the conventional data model: a 'row effective timestamp' and a 'row expired timestamp', which hold date and time values identifying each row's status as past, current, or future.
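As a concrete illustration, the two timestamp columns can be grafted onto a conventional table as in the following sketch. SQLite via Python is used here as a stand-in for a commercial warehouse DBMS, and the fin_doc table and its columns are hypothetical names chosen for illustration, not the article's actual model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A hypothetical non-temporal table extended with the two timestamp columns.
# row_eff_ts marks when a version becomes current; row_expr_ts holds the
# high value '9999-12-31 12:00:00' while the version is still active.
conn.execute("""
    CREATE TABLE fin_doc (
        fin_doc_nbr  INTEGER NOT NULL,
        doc_amt      REAL,
        row_eff_ts   TEXT NOT NULL,
        row_expr_ts  TEXT NOT NULL DEFAULT '9999-12-31 12:00:00',
        -- the effective timestamp joins the key so multiple versions coexist
        PRIMARY KEY (fin_doc_nbr, row_eff_ts)
    )
""")
conn.execute(
    "INSERT INTO fin_doc (fin_doc_nbr, doc_amt, row_eff_ts) "
    "VALUES (10001, 250.0, '2013-01-05 09:00:00')"
)
# The current state is simply the row carrying the high expiry value.
current = conn.execute(
    "SELECT doc_amt FROM fin_doc "
    "WHERE fin_doc_nbr = 10001 AND row_expr_ts = '9999-12-31 12:00:00'"
).fetchone()
print(current)  # (250.0,)
```

Making the effective timestamp part of the primary key is what lets every version of a record share the same business key without violating uniqueness.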
Data warehouses are refreshed at certain time intervals with data from different operational databases. To keep data warehouse runs efficient and to maintain consistent data in the warehouse, it is important that data arrive in the warehouse in a timely fashion and be loaded via batch cycle runs. Since a data warehouse consists of thousands of tables across multiple subject areas, table refreshes must be done in order of dependencies via batch cycles. Batch refreshes have proven to be an efficient method of loading from the standpoint of performance (Brobst et al. 2008) and data consistency. Another aspect of storing data in data warehouses is that data is initially captured in staging subject areas (Ejaz and Kenneth 2004; Hanson and Willshire 1997), with a one-to-one relation between operational source tables and data warehouse staging area tables. Analytical subject areas are refreshed from the staging area tables; an analytical subject area refresh requires collecting data from more than one subject area, or from more than one table in a particular staging subject area.
The purpose of this article is to discuss implementations such as temporal data update methodologies, viewing of data consistently, coexistence of load and query against the same table, performance improvement of load and report queries, and maintenance of views. The intended result is a temporal data warehouse that can be used concurrently to load new data and allow various reporting applications to return results consistent with their selected time slice.

LITERATURE REVIEW
Data warehouse refreshes have been a research topic for more than a decade. The research is mostly related to storing and maintaining the current state of data, which fails to provide data lineage information. Discarding updates between two refresh points in time with periodic complete reloads leads to a loss of transaction lineage (Vavouras et al. 1999). Most previous work on data warehousing focused on design issues, data maintenance strategies in connection with relational view materialization (Huq et al. 2010), and implementation aspects (Chen et al. 2010; Kim et al. 2004; Samtani et al. 1998; Widom 1995). Little research work has been done to date on capturing transaction lineage and the temporal view maintenance problem (Yang and Widom 1998), and most of the previous research ignores the temporal aspects of data warehousing (Bruckner and Tjoa 2002). There is a consensus that Information Systems research must respond with theoretical contributions and attempt to solve the current and anticipated problems of practitioners (Sein et al. 2011). There is a need for mechanisms to store transaction lineage in conventional databases. Current commercial database systems provide little built-in capability to capture transaction lineage or to support a query language for temporal data management (Mahmood et al. 2010). To date, only a few companies have started providing time-referenced data storage functionality and SQL facilities in their DBMS systems (Chau and Chittayasothorn 2008; Snodgrass 2010). In data warehouses, data comes from many sources and warehouse refreshes happen several times a day. The data warehouse is a shared environment, and its data is typically used by many applications, each of which may need a different time slice of data. Data warehouses must therefore cope with the temporal granularities of data (Terenziani 2012).
Temporal data warehouses raise many issues, including consistent aggregation in the presence of time-varying data, temporal queries of multidimensional data, storage methods, and temporal view materialization. In this article, we focus on an innovative (Downes and Nunes 2013; Ramiller and Swanson 2009) approach for dealing with transaction lineage and storing it with timestamp granularities. We present methodologies for refreshing data warehouses with time-varying data via batch cycles. This is suitable for large data warehouses with hundreds of subject areas and thousands of tables where refreshes occur in a span of one- to four-hour windows. We propose the use of conventional extract-transform-load (ETL) tools to extract data from source systems and load the staging subject areas in the data warehouse without performing any transformation tasks. As soon as staging tables are refreshed, the data warehouse software performs transformations to insert new rows into the actual data warehouse (analytical subject area) tables and also updates those tables by applying row expired timestamps to the preexisting rows that correspond to the newly arrived rows. ETL represents the most important stage of (temporal) data warehouse design, as 70% of the risk and effort is attributed to this stage (Berkani et al. 2013). We also examine the possibility of using metadata tables (Rahman et al. 2012) to recompile views based on subject area refresh timestamps. Finally, we show that there are opportunities to use performance improvement features, such as indexing (Li et al. 2010), in conventional commercial non-temporal databases (Kaur et al. 2013) to load and query temporal data, as these features are very important for handling large volumes of transaction lineage data.

TEMPORAL DATA UPDATE METHODOLOGIES
During the past two decades, several dozen temporal data models have been generated, all with timestamps as integral components (Torp et al. 2000). There are still very few commercial temporal databases on the market (Chau and Chittayasothorn 2008; Snodgrass 2010), perhaps due to the complex nature of temporal data. This article presents a technical outline of how to update conventional commercial databases with temporal data. The goal is to make sure data consistency is maintained and that load and query performance is not compromised. Updating data warehouses with temporal data is a mechanism for storing the lineage of data in the tables. It captures all changes made to a data row over time (transaction lineage). Deleted rows are not physically deleted; they are labeled as expired instead. Updated rows are handled by expiring the existing rows and inserting the new versions of the rows. Both current and historical time slices are available to any user by manipulating view filter criteria with less-than-or-equal-to (<=) predicates, because each version of a row shares the same key (Ahn and Snodgrass 1986). Temporal data must be evaluated by joins with inequality predicates, rather than the equality predicates used in conventional data evaluation techniques (Gao et al. 2005).
A temporal data warehouse provides a consistent view of data for customer queries while data is being loaded into the same table being queried. It provides transaction lineage of data and mechanisms to harmonize (aggregate and synchronize) data based on time slicing. The business needs for several different time slices of data can be met using a temporal data warehouse. Figure 1 presents a temporal data warehousing environment. The data comes from operational databases, and the source systems provide a change timestamp for temporal data. In the data warehouse, source data initially lands in staging subject areas and is then moved to downstream layers of the data warehouse, such as layers 2 (L2) and 3 (L3). L2 subject areas are shared subject areas used by any application that needs the data, while an L3 subject area is dedicated to a specific application. Each L3 subject area provides an analytical environment used for reporting and business intelligence (BI) purposes.

Figure 1: A Typical Temporal Data Warehousing Environment.
To implement the temporal data update methodology, the data model will have four additional columns ('row effective date', 'row effective time', 'row expired date', and 'row expired time') to mark the row effective and row expired date/time against each row in the table. To make each row unique, the row effective date and time columns need to be part of the primary key. The data from the operational databases will arrive in a timely fashion via flat files. The cycle refresh time interval can be 30 minutes or one, two, three, or four hours, etc., based on the needs of the business organization. The data manipulation (DML) code (I/U/D) and the data row change timestamp are provided by the source system (Malinowski and Zimányi 2006) in the data files. The data row change timestamp is used as the row effective date and time. '9999-12-31 12:00:00' is used as the row expired timestamp; the presence of this high value indicates an 'active' or current row. Time stamping is used to tag each record with the moment in time when the record is created or passed from one environment to another (Inmon et al. 2001).
Immediately after the staging tables are loaded from the source system, data warehouse SQL is used to process and expire rows if multiple versions have arrived in a file. It is likely that during a cycle refresh a data file will contain multiple versions of rows for a particular record, with the same primary key. In that case, each previous version is expired with the row effective timestamp of the immediately following version of that row. Only the current version keeps the expired timestamp value '9999-12-31 12:00:00'. For example, if the same key has a DML code 'I' (insert) followed by 'U' (update), only the row with 'U' is treated as the active row. This ensures rows are expired in the staging table when multiple versions of rows arrive via a source data file in a given cycle. Next, the rows with both 'U' and 'I' are inserted into the final target table. Following the insert into the target table, all rows with 'D' (delete) in the staging table are used to expire the corresponding rows in the target table; 'D' rows are not deleted physically in the target table. All existing rows with DML code 'U' are expired in the target table and new rows inserted. All these steps are used to perform incremental refreshes (Rahman 2010) with temporal data. In the case of a full refresh as the initial load, the current versions of rows are used to perform the load: the change date of the current row is used as the row effective timestamp and '9999-12-31 12:00:00' as the expired timestamp. Among the highlighted rows (with the same key: fin_doc_nbr & fscl_yr_nbr) in Table 1, the last row is the current row, which is active with row expired date and time '9999-12-31 12:00:00'.
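The within-cycle expiry step described above can be sketched as follows. This is a hedged illustration in Python, not the article's actual load SQL; the row layout and field names are assumptions:

```python
from collections import defaultdict

HIGH_TS = "9999-12-31 12:00:00"

def expire_versions(staged_rows):
    """Expire older versions of each key within one cycle's staging data.

    staged_rows: list of dicts with 'key', 'row_eff_ts', and 'dml_code'.
    Each earlier version of a key is expired with the effective timestamp
    of the immediately following version; only the latest version of a
    key keeps the high expiry value that marks it as current.
    """
    by_key = defaultdict(list)
    for row in staged_rows:
        by_key[row["key"]].append(row)
    for versions in by_key.values():
        versions.sort(key=lambda r: r["row_eff_ts"])
        for prev, nxt in zip(versions, versions[1:]):
            prev["row_expr_ts"] = nxt["row_eff_ts"]
        versions[-1]["row_expr_ts"] = HIGH_TS
    return staged_rows

# An 'I' row followed by a 'U' row for the same key in one cycle:
rows = [
    {"key": 10002, "row_eff_ts": "2013-02-01 08:00:00", "dml_code": "I"},
    {"key": 10002, "row_eff_ts": "2013-02-01 09:30:00", "dml_code": "U"},
]
expire_versions(rows)
print(rows[0]["row_expr_ts"])  # 2013-02-01 09:30:00 -- 'I' row expired at the 'U' row's time
print(rows[1]["row_expr_ts"])  # 9999-12-31 12:00:00 -- only the 'U' row stays active
```

In the warehouse itself this logic would run as set-based SQL over the staging table before the insert into the target table, but the version-chaining rule is the same.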

Process of Loading Derived Tables
In data warehouses, quite often derived tables are created for analytical purposes. These tables are loaded by pulling data from multiple tables and performing various aggregations, summations, and other computations. The response time of standard repetitive queries can be improved significantly if the answers to a complex reporting query are stored in a simple table with keyed access (Gardner 1998). These table structures are created in such a way that they fulfill the reporting needs of different business applications. The report tools point to these tables via simple (SELECT *) views. When loading the derived tables by pulling data from primary source tables and dimension or header tables, row effective timestamp ranges need to be generated for the dimension/header/secondary source table to relate to the row effective date and time on the primary/fact table rows. As an example of loading staging and analytical subject area tables, initial full refreshes in staging subject areas are performed using database utilities such as FastLoad and MultiLoad. The downstream analytical subject areas are refreshed with data from the staging tables. Since both source (staging) and target (analytical) tables reside in the same data warehouse, database-specific software such as stored procedures and macros can be used to update the analytical subject areas.
SQL in the form of stored procedures can be used to load data into derived tables. To do a full or incremental refresh, when joining source primary and secondary or dimension tables, or joining line and header tables, the table join has to be based on the primary key plus the row_eff_ts column. As both primary and secondary/dimension tables hold transaction lineage, it is important to establish a one-to-one relationship for the transaction lineage data using the row_eff_ts columns. In this case the secondary/dimension table row_eff_ts must be less than or equal to the row_eff_ts column of the primary source table. Note that the secondary/dimension table data must arrive from the source at the same time as, or during a refresh previous to, the primary source table data in order to match the primary source table row effective timestamp; otherwise data will be filtered out when loading the target table. This can happen when the primary source table and the secondary/dimension tables are joined with an 'inner join' (instead of a left outer join). In Table-A, for doc_nbr 10002, we can see that every time changed data comes from the source, the most recent row for a key is updated with a new date (row_expr_dt) and time (row_expr_tm). The most current row for a particular key (e.g., fin_doc = 10002) displays row_expr_dt as '9999-12-31'.
The SQL in Figure 4 shows how the join relating to row_eff_ts should appear:

Figure 4: Temporal Relation between primary source and dimension tables.
The 'and' clause highlighted in Figure 4 is a less-than-or-equal-to predicate that generates the one-to-one relationship for the transaction lineage data of the two tables. The join preserves the one-to-one relationship even when the lower and upper bounds of the timestamp intervals are not the same in the joined tables. The dimension table may have different lower and upper bounds of timestamps for the current record, compared to the primary source or fact table, as dimension data changes slowly.
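A minimal sketch of such an inequality-predicate join, in the spirit of Figure 4, might look like this. SQLite via Python is used as a stand-in for the warehouse DBMS, and the fact/dim table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (doc_nbr INT, amt REAL, row_eff_ts TEXT);
    CREATE TABLE dim  (doc_nbr INT, status TEXT, row_eff_ts TEXT, row_expr_ts TEXT);
    INSERT INTO fact VALUES (10002, 100.0, '2013-02-01 09:30:00');
    -- two dimension versions; only one covers the fact row's effective time
    INSERT INTO dim VALUES (10002, 'OPEN',   '2013-01-01 00:00:00', '2013-02-01 09:00:00');
    INSERT INTO dim VALUES (10002, 'POSTED', '2013-02-01 09:00:00', '9999-12-31 12:00:00');
""")
row = conn.execute("""
    SELECT f.doc_nbr, f.amt, d.status
    FROM fact f
    JOIN dim  d
      ON d.doc_nbr = f.doc_nbr
     AND d.row_eff_ts  <= f.row_eff_ts   -- inequality predicate, not equality
     AND d.row_expr_ts >  f.row_eff_ts   -- dimension version covers the fact time
""").fetchone()
print(row)  # (10002, 100.0, 'POSTED')
```

An equality join on row_eff_ts would return nothing here, because the dimension version became effective earlier than the fact row; the <= predicate is what ties each fact version to the dimension version that was current at its effective time.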

VIEWING CONSISTENT DATA IN REPORTING ENVIRONMENT
The biggest challenge in creating an integrated data model in a data warehouse is to provide a consistent view (Radeschutz et al. 2013) of the data across subject areas. A data warehouse with temporal data must be able to manage data in a shared, reliable, time-sliced, and efficient way (Hartmann et al. 2012). Aggregation and synchronization of data across subject areas present unique challenges. View maintenance in a temporal data warehouse is complicated (Amo and Alves 2000) because views have to deal with multiple versions of data in a table. We propose separate application-specific views to allow applications to have a consistent view of data, with separate time filters as needed by the users.
Several different business application needs can be met by providing separate sets of views with different timestamp filters as required. Business users do not want to see data showing up in their reports while it is being appended or changed in a given table. Users would also like all updates to related tables in a subject area to be completed before a report shows any of that new cycle's data; they want to maintain "data latency" in this particular case. Report users want to see data based on the most recent cycle refresh that occurred across all tables in the subject area(s).
Application-specific views with timestamp filters are defined and dynamically recompiled right after each individual application's upstream subject area refreshes are completed, per service level agreement (SLA). The view filters are based on 'row effective date', 'row effective time', 'row expired date', and 'row expired time'. The begin and end timestamps associated with each subject area refresh are captured in the data warehouse metadata model (Kim et al. 2000) during the cycle refresh and later used for view recompilation.

Base Views
The base views that point to the target table will have the 'lock for access' locking mechanism defined in the view. This allows row-level access to the table even while the table is being updated. The views are defined with the timestamp filter row_eff_ts <= last_refresh_ts and are recompiled at the end of each cycle refresh. Figure 5 shows a complete set of data after a cycle refresh followed by a view swap. The Base-View for Table A shows all change history for a particular key.
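The base-view recompilation described above can be sketched as follows. SQLite via Python is used as a stand-in; SQLite has no 'lock for access' mechanism, so only the timestamp-filter aspect is illustrated, and all table and view names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fin_doc (
        fin_doc_nbr INT, doc_amt REAL, row_eff_ts TEXT, row_expr_ts TEXT
    );
    -- one row from the completed cycle, one from a cycle still loading
    INSERT INTO fin_doc VALUES (10001, 250.0, '2013-03-01 06:00:00', '9999-12-31 12:00:00');
    INSERT INTO fin_doc VALUES (10003,  75.0, '2013-03-01 10:05:00', '9999-12-31 12:00:00');
""")
# Captured in metadata at the end of the last completed cycle refresh.
last_refresh_ts = "2013-03-01 10:00:00"
# "Recompile" the base view with the completed refresh timestamp baked in;
# rows loaded by the in-flight cycle (after 10:00:00) stay invisible.
conn.executescript(f"""
    DROP VIEW IF EXISTS fin_doc_base_v;
    CREATE VIEW fin_doc_base_v AS
        SELECT * FROM fin_doc WHERE row_eff_ts <= '{last_refresh_ts}';
""")
visible = conn.execute("SELECT fin_doc_nbr FROM fin_doc_base_v").fetchall()
print(visible)  # [(10001,)]
```

Dropping and re-creating the view with a new literal timestamp plays the role of the view recompilation step at the end of each cycle.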

Source Application Views
Source application business views are defined on top of the base views (for dirty reads). These views provide data with filters based on application needs, for example, row_eff_ts <= application_reqd_asof_ts and row_expr_ts > application_reqd_asof_ts. These views are recompiled at the end of the last cycle refresh of an application, and row uniqueness is maintained via these business views. Figure 5 shows the Business-View for Table A; the view shows the most current version of rows for each key (e.g., doc_nbr) value.
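A business view with the as-of filters above can be sketched like this (again SQLite via Python as a stand-in, with hypothetical names). Note how the pair of predicates yields exactly one version per key for the chosen time slice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc (doc_nbr INT, amt REAL, row_eff_ts TEXT, row_expr_ts TEXT);
    -- three versions of the same document over time
    INSERT INTO doc VALUES (10002, 100.0, '2013-02-01 08:00:00', '2013-02-01 09:30:00');
    INSERT INTO doc VALUES (10002, 120.0, '2013-02-01 09:30:00', '2013-02-02 14:00:00');
    INSERT INTO doc VALUES (10002, 150.0, '2013-02-02 14:00:00', '9999-12-31 12:00:00');
""")
# The application's required as-of timestamp (application_reqd_asof_ts).
asof = "2013-02-02 00:00:00"
conn.execute(f"""
    CREATE VIEW doc_bus_v AS
        SELECT doc_nbr, amt FROM doc
        WHERE row_eff_ts <= '{asof}' AND row_expr_ts > '{asof}'
""")
rows = conn.execute("SELECT * FROM doc_bus_v").fetchall()
print(rows)  # [(10002, 120.0)] -- the one version current at the as-of time
```

The earlier version fails the row_expr_ts test and the later one fails the row_eff_ts test, so each key surfaces exactly once, which is what maintains row uniqueness in the business view.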
The data warehouse is a shared environment. Some subject areas are application specific while others are shared by more than one application. Each application has its own SLA for data freshness, and there are dependencies between subject area refreshes. All these factors make applications use different time slices of data. For example, finance-related subject areas may run six times a day, such as at 2:00am, 6:00am, 10:00am, 2:00pm, 6:00pm, and 10:00pm. On the other hand, capital-related subject areas may run three times a day, such as at 3:00am, 11:00am, and 7:00pm. Both applications share some common subject areas, and each uses some application-specific subject areas, too. This requires data to be "frozen" via different sets of views on the proper time slice to make data consistent and available per each application's specific business needs and SLA. For example, a finance application might want to display finance data right after the finance subject areas load (e.g., 10:00am), while the capital application would not want to display just-refreshed data at that moment, because capital analysis is based on the previous load, or because the analysis cannot begin until the other related subject areas have completed loading. In that case, a capital analyst will use data based on a separate set of views with the previous refresh timestamp filter specified in the view definition. The finance analysts will see finance data up to the latest refresh via a different set of views. This way, data as of a point in time can be obtained across multiple tables or multiple subject areas, resolving consistency and synchronization issues, and the two applications are provided data freshness based on their individual SLAs.
Report users normally want to see data based on the most recent cycle refresh that occurred across all tables in the subject area(s). For that particular time slice, they would like the data demographics to remain as-is for analysis purposes, so they may be provided with business views for each table that show data up to a point in time as needed. The reports will not see any new data being loaded as part of the current cycle refresh until the cycle is finished. Report users will run queries against business views with the timestamp filters shown in Figures 6 and 7:

Query Performance
In acquiring data warehouse database systems, customers take cost control and performance optimization as the critical evaluation criteria (Feinberg and Beyer 2010). Performance refers to a product's primary operating characteristics (Garvin 1987). To improve database query performance, commercial databases provide several efficient index types. The row effective date columns may have a partitioned primary index (PPI) defined on them. This makes queries faster, as the partitioned primary index pulls rows based on partition number instead of requiring a full table scan. When a query is run with filters on PPI columns, the DBMS directly pulls data from the relevant partition(s) instead of scanning the whole table.
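The same effect can be observed in miniature with an ordinary secondary index. SQLite (used here via Python as a stand-in) offers no partitioned primary indexes, so this hedged sketch uses a plain index on the row effective date to show a date-restricted query avoiding a full table scan; the table and index names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc (doc_nbr INT, row_eff_dt TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO doc VALUES (?, ?, ?)",
    [(i, "2013-0%d-01" % (i % 9 + 1), float(i)) for i in range(1000)],
)
# SQLite has no PPI; an ordinary index on the row effective date plays an
# analogous role for queries that restrict on that column.
conn.execute("CREATE INDEX doc_eff_dt_ix ON doc (row_eff_dt)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM doc WHERE row_eff_dt = '2013-03-01'"
).fetchall()
print(plan[0][3])  # e.g. SEARCH doc USING INDEX doc_eff_dt_ix (row_eff_dt=?)
```

The query plan reports an index search rather than a table scan, which is the small-scale analogue of the partition elimination the article measures on PPI tables.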
Based on a SQL scorecard on both PPI and non-PPI tables, it was found that SQL uses only 33% of the resources to pull rows from a PPI table relative to a non-PPI table; run time drops in roughly the same proportion. The potential gain from partitioning a table is the ability to read a small subset of the table instead of the entire table; queries that specify a restrictive condition on the partitioning column avoid full table scans. By defining a PPI on 'row effective date', report query performance was found to be four times faster, with CPU consumption reduced to about 33% of the original. Figure 8 shows a comparison of query response time and computational resource savings between PPI and no-PPI queries. The first query was run to pull 53K rows with no PPI defined; the response time was eight seconds and CPU consumption was 29 seconds (row one). The same query was run against the same table with a PPI defined on row effective date; the response time was one second and resource consumption was two seconds (row two). A second query was run to pull 424K rows with no PPI defined; the response time was 25 seconds and resource consumption was 101 CPU seconds (row three). The same query run with a PPI defined on row effective date had a response time of four seconds and resource consumption of 33 seconds (row four).
There are many techniques to improve the performance (Rahman 2013) of data warehouse queries, ranging from commercial database indexes to query optimization. A number of indexing strategies have been proposed for data warehouses in the literature and are heavily used in practice.

ENSURING DATA QUALITY IN LOADING AND VIEWING
Joshua Boger, CEO and founder of Vertex Pharmaceuticals, once said: "I've never made a bad decision. I've just had bad data" (Pisano et al. 2006). This speaks to the importance of providing quality data to users for internal business decision making and external regulatory compliance (Bai et al. 2012; Baskarada and Koronios 2013). Data quality is essential for business intelligence success (Isik et al. 2013). Better data quality has a positive influence on sales, profit making, and value added (Xiang et al. 2013), and system quality has a positive influence on data quality and information quality (Hwang and Xu 2008). During the load process, from the operational source to the temporal data warehouse, and while performing transformation and loading from the staging area to the analytical area of the data warehouse, utmost care must be taken to avoid data corruption. Data quality might be compromised for many reasons. When joining multiple source tables, if the join operators do not consider inequality predicates, the integrity of timestamp-based data might be compromised. Data quality might also be compromised by a failure to maintain referential integrity (Ordonez and Garcia-Garcia 2008). Data quality might further be compromised while retrieving data from analytical tables and viewing information via reporting, business intelligence, and data mining tools; this can happen due to missing join criteria or bad transformation logic. All care must be taken to ensure the most accurate data retrieval. This is especially true in the case of temporal data, as it is stored with different versions of time-referenced data.

COEXISTENCE OF LOAD AND QUERY-RUN
During the load process by the stored procedures, the DBMS will use a 'write' lock, by default, to perform DML during the new cycle refresh. The report SQL can be defined with a 'lock for access' lock to retrieve the rows for a specific time slice. The access lock is compatible with the write lock, so queries can run while the table is being loaded. To make sure that report users see consistent data, they will be provided business views based on the last completed data refresh. Every time all related subject area refreshes for a particular application are completed, the views will be recompiled with new timestamps, or the views will be pointed to a metadata (Rahman et al. 2012) table, via a join condition, to get the most recent refresh timestamp and use it as a filter in report queries.
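The metadata-table alternative, where the view joins to a refresh-timestamp table instead of being recompiled, can be sketched as follows (SQLite via Python as a stand-in; the table, view, and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doc (doc_nbr INT, amt REAL, row_eff_ts TEXT);
    CREATE TABLE refresh_md (subject_area TEXT, last_refresh_ts TEXT);
    INSERT INTO doc VALUES (1, 10.0, '2013-03-01 06:00:00');
    INSERT INTO doc VALUES (2, 20.0, '2013-03-01 10:05:00');  -- current cycle, still loading
    INSERT INTO refresh_md VALUES ('finance', '2013-03-01 10:00:00');
    -- The view joins to the metadata table, so no recompilation is needed:
    -- updating last_refresh_ts after a cycle completes moves the time slice.
    CREATE VIEW doc_v AS
        SELECT d.doc_nbr, d.amt
        FROM doc d
        JOIN refresh_md m
          ON m.subject_area = 'finance'
         AND d.row_eff_ts <= m.last_refresh_ts;
""")
rows_before = conn.execute("SELECT doc_nbr FROM doc_v").fetchall()
print(rows_before)  # [(1,)] -- the in-flight cycle's row is not yet visible
conn.execute("UPDATE refresh_md SET last_refresh_ts = '2013-03-01 11:00:00'")
rows_after = conn.execute("SELECT doc_nbr FROM doc_v ORDER BY doc_nbr").fetchall()
print(rows_after)  # [(1,), (2,)] -- one metadata update exposes the finished cycle
```

A single-row update to the metadata table atomically advances the time slice for every view that joins to it, which is the appeal of this design over per-view recompilation.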

CONCLUDING REMARKS
In this article, we have proposed temporal data update methodologies for data warehousing. The goal is to provide mechanisms for capturing transaction lineage for each record in data warehouse tables. We identified the key areas of temporal data warehouse refreshes based on practical experience in data warehouse implementation. Many database applications need to retain all previous and current states of data from accountability and traceability perspectives (Jensen and Lomet 2001). We provided methodologies to extract operational data from heterogeneous operational sources and to capture it in staging areas of the data warehouse with transaction lineage. We showed how to pull source data from more than one staging table in the data warehouse, how to perform join operations with inequality predicates across multiple joins, and how to load the results into analytical subject area tables. We also provided SQL syntax to pull and load data from staging area tables to analytical tables while maintaining transaction lineage.
We proposed load methodologies using database-specific load utilities and software components such as stored procedures and macros. We showed a one-time full refresh and subsequent incremental refreshes of temporal data in a conventional data warehouse. To implement the temporal data update methodologies, we proposed a data model with four additional columns ('row effective date', 'row effective time', 'row expired date', and 'row expired time') to mark the row effective and row expired date/time against each row in the table. To make each row unique, we proposed that the row effective date and time columns be part of the primary key.
The transaction lineage captured in the data warehouse allows different applications and the analytical community to read whatever subset of data they need, based on different time slices. Temporal joins have been presented for temporal relations, and suggestions have been made to take advantage of the several index types offered by current commercial databases.
We proposed several business views based on timestamp filters for use by different applications. Application-specific business views are defined with timestamps as needed by individual applications and are dynamically recompiled right after each application's upstream subject area refreshes are completed, per service level agreement (SLA). The temporal data update methodologies presented in this article should sufficiently meet the needs of application owners and customers, as they take into consideration several factors such as providing a common data source with different time-varying slices of data.