What are data warehousing concepts?

Data warehouse

Classification

Providing information is and remains an essential aspect of management support and business intelligence systems. The collection, aggregation, and selection of decision-relevant information unfolds its full potential especially when it rests on consistent, company-wide data management.

This is where the data warehouse concept comes in: it calls for a central database, separate from the existing systems, to support dispositive (analytical) tasks [Gluchowski, Gabriel, Dittmar 2008, p. 118; Kemper, Baars, Mehanna 2010, p. 19]. Ideally, such a database is company-wide or group-wide and covers the information needs of a wide variety of user groups. For technical reasons it makes sense to decouple such a central data warehouse (DW) from the data-supplying upstream systems and to operate it on a separate platform; this separation is sometimes even regarded as a constituent feature of data warehouse solutions [Hahne 2014, p. 2]. Decoupling relieves the operational systems on the one hand and, on the other, opens up the option of optimizing the analysis-oriented system for the needs of evaluations and reports [Gluchowski, Gabriel, Dittmar 2008, p. 118].

The first use of the term can be traced back to the late 1980s, when Devlin and Murphy sketched the data warehouse as a central data collection point that relieves users of the technical details of data integration so that they can concentrate on using the prepared content [Devlin, Murphy 1988, p. 61].

William Inmon contributed greatly to the popularity of the data warehouse concept by formulating the four ideal characteristics of subject orientation, integration, time variance, and non-volatility, which are still largely valid today [Inmon 2005]. With this content-based orientation, the task associated with a data warehouse solution is defined as periodically collecting subject-oriented and integrated (in the sense of unified) information from different sources, maintaining it over long periods of time and with a time reference, and preparing it so that it can be made available to decision-makers as needed [Gluchowski, Gabriel, Dittmar 2008, p. 121; Kemper, Baars, Mehanna 2010, p. 21].

Design of a data warehouse solution

When setting up a data warehouse, business, organizational, and technical design aspects must all be carefully considered. From a business and organizational point of view, it must be determined which information is to be held in the data store, at which level of aggregation, and which employees should be able to access it. In addition, it must be clarified what exactly is to be understood by individual terms, what the individual figures are composed of, what they represent, and how they are calculated [Gluchowski 2016, pp. 230-232].

Beyond that, a viable technical implementation concept must be developed with the aim of systematically merging the atomic data from the diverse and heterogeneous operational upstream systems. To this end, connections to these systems are established periodically or ad hoc in order to extract the relevant data, which is then cleansed by a variety of processing mechanisms and stored in a structure that matches the requirements. Integrating the data in one system means that uniform access to a very wide range of content becomes possible. Since, ideally, all management applications in a company work with this data, there is only one "version of the truth" [Gluchowski, Gabriel, Dittmar 2008, p. 124], i.e. no deviating figures can appear in different reports and evaluations, even across departments.
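The extract-cleanse-load cycle described above can be illustrated with a minimal Python sketch. The source records, field names, and cleansing rules (German number and date formats in one system, ISO formats in the other) are purely illustrative assumptions, not taken from any specific product:

```python
from datetime import date

# Illustrative raw records from two heterogeneous upstream systems;
# field names, codes, and values are assumptions for this sketch.
source_a = [{"cust": "C1", "rev": "1.200,50", "day": "02.01.2024"}]
source_b = [{"customer_id": "C2", "revenue": 980.0, "date": "2024-01-02"}]

def transform_a(rec):
    """Harmonize system A's German number/date formats to the unified schema."""
    return {
        "customer_id": rec["cust"],
        "revenue": float(rec["rev"].replace(".", "").replace(",", ".")),
        "date": date(*reversed([int(p) for p in rec["day"].split(".")])),
    }

def transform_b(rec):
    """System B already delivers ISO dates and numeric revenue."""
    return {
        "customer_id": rec["customer_id"],
        "revenue": float(rec["revenue"]),
        "date": date.fromisoformat(rec["date"]),
    }

# Load: append the unified records to the central (here: in-memory) store.
warehouse = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
```

After the transformation step, both records share one schema, one number format, and one date representation; this is exactly the harmonization that makes a single "version of the truth" possible.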

Architecture and components

At the beginning of the discussion about the data warehouse concept, the correct arrangement of the individual components was still fiercely debated. Over time, the hub-and-spoke architecture has established itself as a suitable way of setting up a decision-oriented IT landscape [Hahne 2014, p. 10f.]. The terminology stems from the fact that the arrangement of the components is reminiscent of a hub with spokes.

The data flow starts with the operational and external information systems that supply the raw data material, passes through the components used for the extraction, transformation, and loading of data (ETL), and initially reaches the enterprise (or core) data warehouse. Here, processed and aggregated data (for example on a daily basis) from all areas of the company is held with a long history, usually spanning several years. The volume of data stored in the core data warehouse is therefore usually very extensive and can reach the high terabyte range.

With such a large volume of data, interactive and especially multidimensional views of the available data stock with rich navigation functionality cause considerable problems. After all, according to the OLAP requirements, arbitrary rotations and slices, along with analysis functions, are to be offered. For this reason, data extracts are formed for further processing; these can be understood as personal, application-specific, functional, or problem-specific segments of the central data warehouse database and are referred to as data marts [Kemper, Baars, Mehanna 2010, p. 22].
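The OLAP operations mentioned above can be sketched on a tiny fact table in plain Python. The dimensions (product, region, quarter) and all figures are made-up examples; a "slice" fixes one dimension member, while a roll-up aggregates the measure along one dimension:

```python
from collections import defaultdict

# A tiny fact table: (product, region, quarter, revenue).
# All dimension members and figures are illustrative assumptions.
facts = [
    ("P1", "North", "Q1", 100), ("P1", "South", "Q1", 80),
    ("P2", "North", "Q1", 60),  ("P1", "North", "Q2", 120),
]

def slice_cube(facts, quarter):
    """A 'slice': fix one dimension member, keep the remaining dimensions."""
    return [(p, r, v) for p, r, q, v in facts if q == quarter]

def rollup(facts, dim):
    """Aggregate revenue along one dimension (0=product, 1=region, 2=quarter)."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[dim]] += row[3]
    return dict(totals)
```

For example, `slice_cube(facts, "Q1")` yields the three Q1 rows, and `rollup(facts, 0)` totals revenue per product. A data mart typically materializes such problem-specific slices and aggregates in advance.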

When it comes to the interaction between a data warehouse and the associated data marts, several basic forms can be distinguished. Very often the data marts are fed from the central data warehouse and store partial extracts of the entire data stock separately in physical form.

Figure 1 visualizes the different storage components of a data warehouse solution together with the associated data flows.

 

Fig. 1: Hub-and-spoke architecture

In addition to the storage components for the problem-specific data, a data warehouse architecture also includes an archiving system that serves both data archiving and data backup. The data backup is required to restore the contents of a data warehouse in the event of program or system errors. The data volumes to be stored in a data warehouse can reach a considerable size over the course of its period of use. To keep the volume within tolerable limits, archiving systems remove atomic and aggregated data from the data warehouse database without losing it for later analyses. Technologically, part of the data from the data warehouse database is relocated to slower storage media, which can either be accessed directly or from which the original database can be regenerated at any time.
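The relocation of old data described above can be sketched with two SQLite databases standing in for the warehouse and the slower archive medium. Table and column names, the cutoff date, and the sample rows are illustrative assumptions:

```python
import sqlite3

# Two stores: the DW database and a separate archive (illustrative setup).
dw = sqlite3.connect(":memory:")
archive = sqlite3.connect(":memory:")
for con in (dw, archive):
    con.execute("CREATE TABLE sales (id INTEGER, day TEXT, amount REAL)")

dw.executemany("INSERT INTO sales VALUES (?, ?, ?)",
               [(1, "2015-03-01", 10.0), (2, "2023-06-01", 20.0)])

def archive_before(cutoff):
    """Move rows older than `cutoff` out of the DW without losing them."""
    rows = dw.execute(
        "SELECT id, day, amount FROM sales WHERE day < ?", (cutoff,)
    ).fetchall()
    archive.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    dw.execute("DELETE FROM sales WHERE day < ?", (cutoff,))
    return len(rows)

archive_before("2020-01-01")  # relocates the 2015 row to the archive
```

The archived rows remain queryable in the archive store, so the original database could be regenerated by copying them back.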

In addition to the decision-oriented data, a data warehouse also contains metadata, which is managed in a meta database system. Ideally, this is not limited to purely technical information, which, besides its documentation function, also serves to control data warehouse operation. Rather, the meta database system also stores business information that helps end users to use analysis applications more effectively, to interpret their results, and to find the data relevant to them. This category of semantic metadata includes, for example, documentation on predefined queries and reports as well as explanations of technical terms and terminology.
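A minimal sketch of such a metadata repository entry, combining technical and semantic information, might look as follows; all table, column, and measure names as well as the definition text are illustrative assumptions:

```python
# One entry of a (hypothetical) meta database: technical fields describe
# where and how the figure is stored, semantic fields explain what it means.
metadata = {
    "net_revenue": {
        "table": "sales_mart",                                    # technical
        "column": "net_revenue",                                  # technical
        "type": "DECIMAL(15,2)",                                  # technical
        "definition": "Gross revenue minus rebates and returns",  # semantic
        "owner": "Controlling",                                   # semantic
        "source_system": "ERP",                                   # semantic
    }
}

def describe(measure):
    """Let an end user look up what a figure in a report actually means."""
    entry = metadata[measure]
    return f"{measure}: {entry['definition']} (source: {entry['source_system']})"
```

A call such as `describe("net_revenue")` answers exactly the interpretation question raised above: which business rule produced the figure, and from which system it originates.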

However, it should be noted that the hub-and-spoke architecture concept presented is only one possible architecture variant. In the practical design, there are also solutions that deviate from this and, for example, operate without data marts or with independent data marts [Kemper, Baars, Mehanna 2010, p. 22; Seiter 2017, p. 85].

In addition to the classic hub-and-spoke architecture, a logical division of the DW database into layers has become increasingly established in recent years; in its entirety, this is referred to as the data warehouse layer architecture [Hahne 2014, pp. 16-23]. Even though the individual levels are sometimes named differently and the number of layers can vary, an ideal-typical architecture with a total of four data layers is presented here.

Fig. 2: DW layer architecture

The acquisition layer receives untransformed raw data from internal and external data sources. Such a data area, known as the staging area, was already in use in the context of the hub-and-spoke architecture. Especially when the delivered raw data arrives as text files, it is also referred to as a landing zone. If the raw data is stored permanently so that the extraction history remains available for later use, one speaks of a persistent staging area.

The integration layer corresponds to the classic enterprise or core data warehouse and, with its processed data, is the target of the transformation and harmonization processes, holding their results in complete history. In contrast to earlier solutions, the data here is now often kept at the greatest possible level of detail, i.e. down to the document level of the data-supplying upstream systems. The total volume of data stored in this component can grow into the three-digit terabyte range and beyond, especially since the data is retained over a period of several years. By providing all relevant company data in detailed and integrated form, the integration layer combines a collection and integration function with, as the hub for the downstream layers, a distribution function.

The propagation layer implements uniform business logic and business rules by determining derived characteristics and key figures centrally, in one place. This centralized derivation prevents different calculation rules from being applied on the way to the downstream data marts. Depending on the application, this layer can also be dispensed with.
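The idea of deriving key figures once, centrally, can be sketched as a single Python function through which all records pass on their way to the data marts. The contribution-margin formula and the sample record are illustrative assumptions:

```python
# Sketch of a propagation layer: derived key figures are computed once,
# in one central rule, so every downstream data mart uses the same logic.
def derive(row):
    """Apply the company-wide business rule in a single place."""
    out = dict(row)
    # Illustrative derivation rule; the formula is an assumption.
    out["contribution_margin"] = row["revenue"] - row["variable_costs"]
    return out

integration_layer = [{"product": "P1", "revenue": 500.0, "variable_costs": 320.0}]
propagation_layer = [derive(r) for r in integration_layer]
```

Because every data mart is fed from `propagation_layer` rather than computing the margin itself, no two reports can disagree on how the figure was calculated.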

The reporting layer makes data available to the business departments, prepared in accordance with their business tasks. In general, this data is stored in different, task-related data marts, usually with multidimensional data structures; the concerns of self-service are also increasingly taken into account here. Because the requirements of the departments (business requirements) are highly volatile, good adaptability through high agility must be ensured.

In summary, a data warehouse is to be understood as a company-wide concept that, as a logically central data store, offers a uniform and consistent database for the various dispositive applications and is operated separately from the operational databases.

Literature

Gluchowski, Peter; Gabriel, Roland; Dittmar, Carsten: Management Support Systems and Business Intelligence. Computer-aided information systems for managers and decision-makers. 2nd edition. Berlin et al.: Springer, 2008.

Gluchowski, Peter: Development tendencies in analytical information systems, in: Chamoni, Peter; Gluchowski, Peter (eds.): Analytical Information Systems. Business intelligence technologies and applications. Berlin et al.: Springer, 2016, pp. 225-238.

Hahne, Michael: Modeling of Business Intelligence Systems, Heidelberg: dpunkt, 2014.

Inmon, William H.: Building the Data Warehouse. 4th edition. New York: Wiley, 2005.

Kemper, Hans-Georg; Baars, Henning; Mehanna, Walid: Business Intelligence - Basics and Practical Applications. 3rd edition, Wiesbaden: Vieweg, 2010.

Seiter, Mischa: Business Analytics. Effective use of advanced algorithms for corporate management, Munich: Vahlen, 2017.

Authors


 

Prof. Dr. Peter Gluchowski, Chemnitz University of Technology, Faculty of Economics, Professorship System Development / Application Systems, Thüringer Weg 7/225, 09126 Chemnitz
