
# 4 Distributions: is this still normal? Or already skewed?

How unintentionally fitting that I am writing this part of the Data Literacy series on what is, at least so far, the hottest day of the year.

I can therefore hardly avoid referencing the "climate" debate as we talk about "distributions" today and how they are visualized.


We will look at the "normal distribution", deal with asymmetrical distributions, and examine some climate data along the way. After that, we can hopefully give a competent answer when asked:


Is this heat actually still normal (or is it all already skewed)?


So let's start with the first data literacy question: What is a distribution (in a mathematical sense)?

Mathematical distributions: an example to get you started

Distributions describe the arrangement of individual data points, broken down according to the observed or predicted frequency of their occurrence.


Let's start with an example:


Let's assume there is a test that checks how well you know the subject of «data literacy». We'll run this test once before you've worked through this blog series and once after.


The remaining 1,999 readers are tested in the same way and rated on a scale of 0-100 points.

To check how terrific this blog series is, let's look at the “before” test results and compare them with the “after” results.


In summary, our test parameters look like this:

  • Pre-test to determine the initial data literacy level
  • Post-test to determine the data literacy level after the blog series
  • Test results from 0 to 100
  • Total number of participants: 2,000


Let's start with the evaluation of the before tests:


In order to analyze the test results, a distribution is defined here. For this purpose, I have divided the possible test results into different value ranges that you can see on the X-axis.


The respective number of test participants is plotted on the Y-axis. This means that each bar represents the number of participants whose test result is in the respective value range.


This gives us a first graphical impression of the level of knowledge on the subject of data literacy before working through the blog series.
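The binning behind such a histogram can be sketched in a few lines of Python. The scores below are made up for illustration, since the real test data obviously isn't shown here:

```python
# Count participants per value range, as in the histogram described above.
# The scores are hypothetical example data.

def bin_scores(scores, bin_width=10, max_score=100):
    """Return counts per value range, e.g. {'0-10': 2, '11-20': 1, ...}."""
    counts = {}
    for lo in range(0, max_score, bin_width):
        hi = lo + bin_width
        label = f"{lo + 1}-{hi}" if lo else f"0-{hi}"
        counts[label] = sum(1 for s in scores if lo < s <= hi or (lo == 0 and s == 0))
    return counts

scores = [12, 18, 25, 33, 34, 47, 52, 55, 58, 85, 95, 97]
print(bin_scores(scores))
```

Each key is one bar on the X-axis, each value the bar height on the Y-axis.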


The second distribution overview shows the results of the post-test that the participants completed after they had worked through all the blog articles (this test will actually be available in a few weeks ...):


In this second bar chart you can immediately see that the values have "shifted to the left". This means that the distribution of the test results has shifted in the direction of better test results.


Let's take a closer look at both diagrams:

First of all, it is noticeable that the value ranges in the pre-test peak at 590 participants, while in the post-test the peak is 610. How come?


The areas into which I have divided the possible test results have not changed. In the post-test, there are obviously more participants who are at a similar level of knowledge (= in the same range of values).


Much more interesting, however, is *which* range you are in: here you can see that we now have significantly more participants in the "81-90" range. The top range, "91-100", has also grown significantly.


Is this increase in participants with high data literacy now the consequence of the blog series? I can hope so, of course, but the change in distribution alone doesn't tell me that. (In order to verify this, we would need a correlation analysis - we will only come to that in the next article.)


All I see is the movement to the left - so clearly better test results.

Mathematical distributions and their visualization

After we have seen and understood the topic of «distribution» using the example, we can delve a little deeper.


Let us recall the definition of "distribution" again:

Distributions describe the arrangement of individual data points, broken down according to the observed or predicted frequency of their occurrence.


This means:

We obtain a distribution in the analytical sense by collecting and summarizing (see -> aggregation) data. It is then visualized grouped into defined value ranges.


Particularly interesting in practice: comparing the distributions of actual and forecast data!


From my point of view as a marketer, a forecast distribution would be interesting for marketing campaigns, for example:


Which age group clicks how often? Which income groups, which places of residence give the most clicks? I could then compare this with the data actually measured later and thus improve my forecasts in the future.

Visualization of distributions: histograms & box plots

In my little example at the beginning, I used a histogram to represent the distributions. There you can read the distribution from the different bar heights, which represent the number of results within each defined value range.


A second way to visualize distributions is the box plot. To create one, we need several types of aggregation:

  • Minimum (lower "whisker")
  • Maximum (upper "whisker")
  • Median

and new:

  • First (lower) quartile
  • Third (upper) quartile


The lower and upper "whiskers" - also known as antennae - represent the minimum and maximum. These values are self-explanatory (otherwise -> aggregations).


The median is the middle value (*not* the mean!) of the data series. (Memory gap? -> aggregations.)

The box spans from the lower quartile to the upper quartile and thus represents the middle 50% of the data values.
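These five aggregations can be computed directly with Python's `statistics` module. A minimal sketch with made-up results (note that `statistics.quantiles` uses one of several common quartile conventions, so other tools may report slightly different quartiles):

```python
# The five numbers behind a box plot, computed from example data.
import statistics

def five_number_summary(data):
    q1, median, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}

results = [55, 61, 64, 68, 70, 72, 75, 78, 81, 84, 90, 96]
print(five_number_summary(results))
```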

Let's look again at the distribution of the "after" test results (the score 0-100 is shown here as a percentage on the Y-axis):


If you look closely, you will naturally ask yourself: Why are there points below the minimum? Why are these «outliers» not shown within the minimum and maximum in a box plot?


Good question! Box plots are quite a unique visualization method. They show the data that are considered "reasonable" for data analysis.


Data points that lie very far below or above the box are considered outliers. "Very far" specifically means: more than 1.5 times the length of the box away from it.
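This 1.5 × box-length rule can be sketched directly in Python (again with made-up numbers; the length of the box is the interquartile range, IQR):

```python
# Flag data points more than 1.5 box lengths (1.5 * IQR) outside the box.
import statistics

def find_outliers(data):
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1              # length of the box
    lower = q1 - 1.5 * iqr     # anything below this is an outlier
    upper = q3 + 1.5 * iqr     # anything above this is an outlier
    return [x for x in data if x < lower or x > upper]

print(find_outliers([3, 70, 72, 74, 75, 76, 78, 80, 82]))  # the 3 is flagged
```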


Of course, that doesn't mean that outliers don't count! No question, every value counts - but outliers are considered separately here and checked to see whether they are relevant for our analyses.


Continuous and Discrete Distributions: Which Data Types Are Analyzed?

There are two types of distributions: continuous and discrete (the terms sound familiar? Great! Otherwise -> data types).


Continuous distributions can take on values at an INFINITE number of points within a data range. A classic example is temperature, which can be not only 40 °C or 41 °C (not an arbitrary choice these days ...), but also 30.5 °C or 30.547837374 °C.


In our example of the "before" and "after" tests, we also have a continuous distribution: a test participant can achieve not only 80% or 81% of the total points, but also 80.11112% (whether that would make sense is left open here).


Discrete distributions, on the other hand, are limited; the data points could theoretically be counted. An example would be the "number of days with temperatures above 35 °C".

A value like "2.5 days" is impossible - either it is warmer than 35 °C or it is not (... and today it definitely is!).
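The step from the continuous quantity (daily maximum temperature) to the discrete count (hot days) can be sketched like this, with hypothetical temperatures:

```python
# A continuous measurement becomes a discrete count.
# Daily maximum temperatures in °C for one hypothetical week:
daily_max = [28.4, 31.0, 35.2, 36.7, 29.9, 35.0, 37.1]

# A day either exceeds 35 °C or it does not - no "2.5 days" possible.
hot_days = sum(1 for t in daily_max if t > 35)
print(hot_days)
```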


To cool off, let's look at some historical data:

Source: Watson.ch (https://www.watson.ch/schweiz/wetter/190343660-rekord Temperaturen-so-oft-hat-man-in-der-schweiz-schon-35-grad-gemessen)


I have aggregated the above table once, i.e. summed the hot days for each 10-year block (the table does not begin until 1959 because Locarno was only added in 1935 and Sion in 1958):


Well - a tendency is visible. If you add earlier years, you can also see significantly warmer decades in the 40s and 50s. But even then: it has never been above 35 °C in Switzerland as often as in the last two decades.


BUT, now to the initial question: Is that still normal?

Let us first take a look at the best-known and most "normal" (pardon!) of all distributions.

Normal distribution & standard deviation: bell-shaped, symmetrical, good!

The normal distribution is also known as the "bell curve" or Gaussian normal distribution.


Typical is:

  • the bell-shaped curve
  • half of the results are below the mean (!), the other half above.
  • Mean value = median = mode value (see also: aggregation post)


The normal distribution plays an important role in the most widely used form of data analysis: the determination of the standard deviation. This is an absolute standard tool and essential for «data literate» users.

The standard deviation measures the spread of a "population" of data, i.e. how strongly the data points are "distributed".


A low standard deviation indicates that the scatter of the data points is not very large and that they are therefore close to the mean. If the standard deviation is high, the data points are widely scattered.


In statistics, the standard deviation is denoted by the lowercase Greek letter sigma (σ), the mean by a lowercase mu (µ).


But what does the dispersion tell us in terms of data analytics?


As an example, let's look at the distribution of body weight in men (very gentlemanly, I'll leave the ladies out of it!). Let's assume that the weight is normally distributed:

  • The mean value is µ = 84 kg
  • The standard deviation is σ = 5.4 kg


This means:

  • +/- 1σ = 78.6 kg - 89.4 kg
  • +/- 2σ = 73.2 kg - 94.8 kg


In the case of a normal distribution, there are:

  • 68% of men in the range +/- 1σ
  • 95% of men in the range +/- 2σ
  • 99.7% of all men in the range +/- 3σ


Let's visualize this using the bell curve:

Here we see how powerful this analysis tool is. With normal distributions, 95% of the data points are covered within two standard deviations.
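These shares can also be checked empirically by simulating normally distributed weights with the µ and σ from above - a minimal sketch:

```python
# Empirically check the 68/95/99.7 rule with simulated body weights.
import random

random.seed(42)
mu, sigma = 84, 5.4  # mean and standard deviation from the example above
weights = [random.gauss(mu, sigma) for _ in range(100_000)]

for k in (1, 2, 3):
    share = sum(1 for w in weights if abs(w - mu) <= k * sigma) / len(weights)
    print(f"within +/-{k} sigma: {share:.1%}")
```

The printed shares land very close to 68%, 95% and 99.7%.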


This works both for whole data sets and for samples. If samples are well selected, they can be used to infer the distribution of the entire data set - the population.

But are all data normally distributed? Of course not! Now things are getting skewed ...

Asymmetrical distributions: now it's getting crooked!

Most data distributions do not show a Gaussian bell shape but are asymmetrical - they are "skewed" distributions. In mathematical terms, this means that the data are not distributed evenly around the mean.


If a distribution is «right-skewed», one speaks of positive skew. It falls off more flatly on the right side than on the left, so that the mean lies above the mode. It can look like this, for example:

Conversely, the «left-skewed» distribution:

This is negative skewness: the distribution falls off more flatly on the left than on the right. Here the mean lies below the mode.
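The text above compares mean and mode; a quick way to see the same pull of the tail in code is to compare mean and median, which drift apart in the same direction. A sketch with made-up samples:

```python
# For skewed data, the tail pulls the mean away from the median:
# right-skewed -> mean above median, left-skewed -> mean below median.
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]            # long tail to the right
left_skewed = [-15, -9, -5, -4, -3, -3, -3, -2, -2, -1]   # long tail to the left

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```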

Other properties of distributions

Finally, we will discuss a few more properties of distributions - the most important ones, but not the complete list ...


Bimodal distribution


"Bi" means "two", and we know "modal" from the mode. With a bimodal distribution there are two clusters, i.e. two peaks in the diagram.


We see here a global mode with 55 counts and a second mode with 54 counts.


Multimodal distribution

Here - as you can already guess - it is a distribution with several «peaks». So there are several clusters.


A multimodal distribution always means that it is not a normal distribution.


How «normal» is today's hot day?

With the newly acquired knowledge about distributions, let's take another look at the historical distribution of temperatures in Switzerland:


Admittedly, this is a very rough aggregation. But even if you break the data down again or add more cities - we now know that this is a «negative skew».


In the case of temperatures, "normal" would of course not be a bell-shaped normal distribution, but a multimodal distribution with sometimes more, sometimes fewer hot days.


The negative skewness shows, however, that the mean (here µ = 24.83 hot days per decade) lies below the mode (58 hot days per decade).


So we can say without any judgment: It is evidently getting hotter in Switzerland and that is not normal!

Key takeaways

  • Distributions describe the arrangement of individual data points, broken down according to the observed or predicted frequency of their occurrence.
  • Histograms or box plots are used to visualize distributions.
  • There are continuous and discrete distributions.
  • A normal distribution looks like a bell curve. Here 95% of the data points lie within 2σ of the mean (1σ -> 68%, 3σ -> 99.7%).
  • There are also asymmetrical distributions (left-skewed and right-skewed) as well as bimodal and multimodal distributions with several «peaks».



I hope the analysis toolbox has been filled up a little more with this article - as I said, the final test is in preparation! ;)


The next post will be about correlations and - more importantly! - causalities.