What does data warehousing mean

Data warehouse

Summary

A data warehouse is an integrated, structured and historical collection of all data available in a company or division. Internal and external data flow into this data pool, which is consolidated, condensed, analyzed and processed for decision-making. This should also include the knowledge and experience of the employees. The main benefit lies in the variety of possible evaluations, but also requires structured and target-oriented data acquisition.

Reasons for using a data warehouse

Knowledge-based services - such as research, development, design, marketing, logistics and support - are increasingly determining the success of companies. It is therefore of fundamental importance to adequately manage the knowledge available in the company for the optimal provision of these services and to make it available to the users concerned in a timely and usable form.

In every company, however, vast amounts of data accumulate every day and it is becoming more and more difficult to use this data profitably. For this it is necessary to filter out the necessary information from this mountain of data and thus create a basis for secure strategic decisions. This is difficult for two reasons:

  • On the one hand, a company's data is mostly distributed across all departments.
  • On the other hand, the data is available in different formats (from the Excel sheet to the Word document to the e-mail database or even to the file box in the cupboard).

If you want to extract usable information from data, you have to

  1. summarize the most diverse data sources,
  2. prepare in a form suitable for data analysis and
  3. make it available to users.

This is the driving force behind building a data warehouse. The number of employees who have to make decisions has increased significantly in recent years compared to the group of users of the earlier management information systems. The resulting need to provide data in a decentralized manner is another reason to set up a data warehouse.

Another reason is the direct addressing of customers through the new media, such as the Internet. The amount of electronic customer contacts exceeds everything that was previously available in terms of data in internal sources in terms of quality and quantity. Ever faster markets require correspondingly short (re-) action times. More informed and more critical customers require a more flexible offer that is better tailored to their individual needs. In order to be able to make the right decisions in this environment, those responsible need meaningful facts.

Characteristics of a data warehouse

Due to different prerequisites and requirements, a data warehouse is not to be understood as a comprehensive standard software, but always as a company-specific solution. A data warehouse is different from transaction-oriented operational systems[1]

  • subject-oriented: Data is oriented towards subject or topic areas and not process or function-oriented as in operational systems, i.e. the internal processes and functions are of subordinate interest for the development of the database of the data warehouse and therefore have only a small influence on its structure. The company data must be viewed from different perspectives, i.e. the decision-maker must be able to derive relevant knowledge from the company data under different dimensions. Frequently considered dimensions are the company, the product, the regional, the customer or the time structure as well as business parameters (e.g. sales, contribution margins, etc.).
  • fully integrated: There are clear naming conventions, units of measurement, etc. for the same facts. A company-wide integration of data in a uniformly designed system is sought in order to avoid data redundancies and related inconsistencies in the data system.
  • nonvolatile: Changes to field contents remain traceable by historizing information. The process of the company is reflected by historical information.
  • period related: Data are arranged according to periods, as queries are usually based on time periods. While the focus of operational systems is on the precise point of view of data, this is about time-related data, which represent the development of the company over a certain period of time and are used to identify and examine trends.
  • not volatile[2]: The values ​​of a data warehouse should only be changed subsequently in exceptional cases. For this reason, almost all data access is only read.

The content orientation of the data warehouse can thus be summarized from the point of view of tasks as follows: A data warehouse has the task of periodically collecting topic-oriented and integrated (in the sense of standardized) information over long periods of time and with reference to time to support decision-makers from various internal and external sources, to prepare usage-related and to make available as needed.[3]

General data warehouse architecture

A data warehouse is a "company-wide concept which, as a logically central storage facility, offers a uniform and consistent database for the various applications for management support and which is operated separately from the operational databases".[4] As a central database, it collects entries from the operational systems and from external sources at regular intervals, consolidates them, filters out unimportant things, arranges and compresses them, provides them with additional descriptive information, so-called metadata, and uses analysis tools to display relevant information.

Data warehousing is an evolutionary process: you cannot buy a data warehouse. It has to be built. There is therefore no simple standard design for a data warehouse, but the size and appearance of a data warehouse vary according to the company structure and the requirements of the users. Conception and creation can therefore grow into a lengthy and expensive process.

When implementing a data warehouse solution, it must be ensured that the information required for corporate management comes from the various operational areas (e.g. production, sales) or external sources such as news or online services. In accordance with the basic idea of ​​the data warehouse, this information must first be combined and reshaped before it is transformed and finally collected and processed in the "data warehouse".

The basic structure that most data warehouse solutions follow is shown in Fig. 1.


Fig. 1: General data warehouse architecture

In contrast to the operational databases that are specialized in their area of ​​expertise, a data warehouse should enable a global view. To do this, it must be able to offer as much information as possible. In addition to the current data, it therefore also contains older values. New information is integrated periodically. The information warehouse collects the information from all available sources, which can include all types of data management systems. In addition to modern relational database management systems, this includes, for example

  • hierarchical databases,
  • Documents that are scattered somewhere in the corporate network,
  • Groupware systems such as Lotus Notes,
  • Standard applications such as SAP ERP,
  • modern object-oriented databases and
  • external sources.

Very heterogeneous data sources come into question as external data sources, such as news services from business associations, market, opinion and trend research institutes, external databases and the Internet. These are of interest because many evaluations and analyzes that are created on the basis of the company's internal data and only become significant for the decision-maker when they are compared with external data. This central storage of company-external data in the data warehouse ensures that all decision-makers work with the same database. In addition, the effort involved in acquiring knowledge is minimized.

Components of a data warehouse

A typical data warehouse comprises the actual data warehouse database, suitable transformation programs for data acquisition from the company's internal and external sources, an archiving system and a meta database system as components.[5]

  • Data warehouse database: The database represents the data warehouse in the sense that it contains both current and historical data in different compression levels.
  • Extraction, transformation and loading programs (ETL): These should ensure uniformity when the data is transferred to the data warehouse. To do this, they must include functions for extracting data from a wide variety of operational systems, for the actual transformation of this data, as well as its transport and loading into the data warehouse. Due to the heterogeneity of the different sources, these programs must not only master the various database idioms, but also the different network protocols of the operating systems and the data structures of the individual applications. The extracted data is validated by the transformation programs and corrected if necessary. Integrator programs collect the data from the various sources into a common view. In a further step, they are reduced to the required amount of information. Particularly intelligent programs go one step further. If you find interesting values ​​(e.g. sales figures particularly high or low), you independently search for the causes and provide them at the same time. Ideally, these transformation programs are the only interface to the operational data processing systems and the external data sources. The rules, assignments and definitions resulting from the transformation and extraction processes form the basis for the meta-database.
  • Archiving system: This is mainly used for data backup and archiving. The data backup is carried out to restore the data warehouse in the event of a program or system error. The aim of the archiving system is to ensure short response times and the fast, simple and, above all, comprehensive provision of knowledge through efficient storage and processing of large amounts of data, even for complex ad hoc analyzes.
  • Meta database system: Metadata describe the meaning of the stored data - they thus represent "data about data". But they also describe the sources, processes and structures of a data warehouse.[6] The meta database system supports the end user in finding the data relevant to his task solution and also provides the necessary background information about data sources, transformations and compression. In addition, it also supports the employees in the IT department who are responsible for the operation of the data warehouse. For data warehouse management, the meta database system provides all the information necessary to control the transformation processes, as the metadata defines all information flows from the source to the target databases.

Evaluation and analysis

As a company-wide instrument, a data warehouse should flexibly support all decision-makers. They should be able to find their information in the most intuitive way possible. They are supported by a wide variety of end-user tools - the spectrum ranges from simple query tools to integrated packages that offer a whole hodgepodge of analysis methods. Well-known products, for example report generators or statistics packages, carry the title "Data Warehouse" as well as new developments, such as OLAP tools. The tools for decision-makers and decision-makers can be roughly assigned to the following categories:

  • Report and query generators
  • statistics
  • Document retrieval
  • active information filters
  • geographic information systems (GIS)
  • Online Analytic Processing (OLAP)
  • Data mining
  • Text mining
  • Management information systems
  • Decision support systems
  • department-specific tools
  • industry-specific tools.

Report and query tools are among the most common tools, especially for accessing relational databases. The widespread use of standards for these data managers (SQL[7], ODBC[8]) certainly contributed to this. In addition to the products specializing in query and report generation, you can use the generators of the familiar desktop database to access any ODBC data source. Nowadays nobody has to use SQL directly anymore. Graphical point-and-click generators or query-by-example tools, in which users generate queries from sample data, make work easier.

Statistics programs enable a wide variety of evaluation methods - from the calculation of arithmetic mean and standard deviation to t-test, regression and time series analysis.

These can be used to create summaries for large amounts of data or to examine the degree of dependency between two factors. This can be used for a wide variety of purposes: For example, the development manager looks for connections between materials and certain types of defects, while the marketing manager looks for correlations between advertising and buying behavior. In addition to the traditional all-purpose statistics packages, there are now also specialized solutions. With modeling tools, processes are described mathematically in order to better assess them. International companies, for example, have to hold stocks of money or other collateral that are similarly "liquid" to cash. In principle, you can park these values ​​in any country in which you are represented. By simulating how the exchange rates of the currencies in question and what is happening on the stock exchange will affect your portfolio, you can make additional profits or minimize losses. Optimization tools help to find out which input values ​​of a process maximize or minimize its results with given resource constraints. For example, it is possible to calculate which product mix leads to maximum profit for a given machine park, certain raw materials and prices.

Document retrieval systems To cope with the flood of files created with word processors, spreadsheets, etc., all documents of a company are stored in their own central database. There the user can search for all files that contain a certain key term, are of a certain file type, were created within a specified period of time or were created by a certain employee.

Active information filters automatically search the databases for information. For this purpose, the user specifies certain topics or events that he would like to be informed about in the future - for example, that the sales income of a region is below a certain value. The filter programs send so-called agents on their way. These applications comb through the databases for the objects they are looking for at regular intervals. They deliver their results regularly; If you wish, they can also sound the alarm immediately, e.g. if there is a threat. Document retrieval systems also have such active research aids.

Geographic information systems (GIS for short) represent data in their geographical dimension. In so-called thematic maps, these color certain areas differently depending on the data material. In this way, for example, the sales of a product can be processed very clearly by postcode area. Location decisions can also be prepared with the help of these data cards. In this way, taking into account logistical information such as road networks, rivers and branches, you can determine the ideal location for a central warehouse. If the user stores regional demographic and purchasing power-related data, they can suggest new branch locations.

OLAP describes a software technology that enables operational analysts and managers a multidimensional view of the data in the data warehouse. For this purpose, various dimensions, such as product, region or month, are stored in a data cube (or on the axes of an n-dimensional space) and the analyst can select the criteria that are relevant to him and combine them as required. For example, the area manager selects the dimension combination "Sales of all products and months for an area", while the product manager takes a closer look at "Sales of all areas and months for one product" (see Fig. 2).



Fig. 2: Flexible data access

While OLAP gives the answer to specific questions, it goes with Data miningabout the discovery of usable relationships and structures in the mostly very extensive databases of the data warehouse.[9] Data mining is a central process in the context of Knowledge Discovery in Databases (KDD), the process of acquiring knowledge from databases. The core of every data mining system are analysis algorithms, whereby an analysis unit is always composed of parts of the pattern recognition and the pattern description.

  • With pattern recognition, large stocks of data objects are used to locate hidden relationships between individual data fields. Patterns in the structure of the data objects under consideration are to be developed. Patterns can include all relationships and regularities between records and data fields.
  • The main task of the pattern description is to use a generated pattern or a hypothesis about relationships and regularities and to describe it as precisely as possible with the help of algorithms.
  • In addition to uncovering unknown relationships in subordinate databases, data mining tools should also make the automated prediction of trends, behavior and patterns on the basis of stored data.

As part of Knowledge Discovery in Databases (KDD), Text mining as a supplement to data mining with the analysis of semi-structured or unstructured text databases. The main goal is to largely automatically identify meaningful patterns and content from large document collections and present them to the user in compressed form. Due to the qualitative nature of the data, the problems are algorithmically more complex than in classic data mining. The text mining system first prepares the documents for further analysis, then the texts have to be structured, anomalies visualized and the most important facts filtered out (extraction of the core information). A very important area of ​​text mining is structuring, the aim of which is to provide an overview of the content of large document collections, to identify hidden commonalities, to enable information to be absorbed quickly and to find similar documents more easily. Text mining does not completely eliminate the need to read texts. The system leads the user to potentially interesting statements. The benefit is that the facts hidden in an extensive amount of documents can be made accessible in decision-making situations with less effort.[10]

Since the tools discussed so far take some time both in terms of familiarization and use, managers who usually do not have enough time to familiarize themselves with how the complicated tools work Management information systems at. These offer simple methods of querying or generating information. In this way, certain performance values ​​can be queried using ready-made analysis templates. The price for the ease of use, however, is the limited flexibility of these tools.

The applications presented so far can be used in many different areas of companies. Department-specific tools on the other hand, are tailored to the requirements of certain parts of the company. Financial analysis tools are specialists in tasks such as budgeting, forecasting, consolidation, and reporting. In addition, there are own solutions for the areas of market analysis and for planning production and sales. Industry-specific tools however, are geared to the needs of certain branches of the economy, such as banking and healthcare, insurance, retail, telephone companies or freight forwarding companies.

In addition to the potential for technical use, the business potential of a data warehouse system lies primarily in better decisions based on more efficient information provision, in increasing competitiveness through the early detection of trends and in improving customer service and satisfaction based on the historical and current data provided by the data warehouse Knowledge of the customer.[11]

Footnotes

[1] Cf. Mertens, P .; Wieczorrek, H. W. (2000), p. 20.

[2] Volatility: the degree to which data changes over the course of normal use.

[3] Gabriel, R .; Chamoni, P .; Gluchowski, P. (2000); P. 78.

[4] Gabriel, R .; Chamoni, P .; Gluchowski, P. (2000), p. 76.

[5] Cf. Mucksch, H .; Behme, W. (1998), p. 40.

[6] Cf. Mertens, P .; Wieczorrek, H. W. (2000), p. 151.

[7] SQL = Structured Query Language (standard language for access to relational databases).

[8] ODBC = Open Database Connectivity (Microsoft interface standard).

[9] Cf. Mertens, P .; Wieczorrek, H. W. (2000), p. 212 ff.

[10] Meier, M .; Beckh, M. (2000), p. 165 ff.

[11] Cf. Mucksch, H .; Behme, W. (1998), p. 86 ff.

Literature tips

Gabriel, R .; Chamoni, P .; Gluchowski, P.: Data Warehouse and OLAP - Analysis-Oriented Information Systems for Management, in: Zfbf - Schmalenbach's Journal for Business Research, February 2000, pp. 74-93.

Meier, M .; Beckh, M.: Text Mining, in: Wirtschaftsinformatik, April 2000, pp. 165-167.

Mertens, P .; Wieczorrek, H. W.: Data X strategies. Data warehouse, data mining and operational systems for practice, Berlin / Heidelberg 2000.

Muksch, H .; Behme, W.: The data warehouse concept. Architecture - Data Models - Applications, Wiesbaden 1998.

First-time author

Cornelia Putzhammer

<img src="http://vg08.met.vgwort.de/na/94ddc8f646a2497696a797b358f315c1" width="1" height="1" alt="">