Multiple linear regression

The first part of this series of articles (simple linear regression) dealt with the case in which the dependent variable y is influenced by only one explanatory variable x. In practice, however, relationships are often more complex and the dependent variable y is influenced by several factors; in that case, we turn to the multiple linear regression model.

Multiple vs. multivariate regression model
It should be noted at this point that a multiple and a multivariate regression model are not the same thing. A multiple regression model examines the influence of several independent variables \(x_k\) on one dependent variable \(y\). In a multivariate model there are several dependent variables. For example, the influence of various explanatory variables on the turnover and the profit of a company could be examined simultaneously: two model equations (one for turnover, one for profit) would be estimated at the same time. Further information on the topic of multivariate procedures can be found here.

Representation and interpretation of the multiple linear regression model

The general equation of the model with k explanatory variables is

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 \cdot x_{i1} + \hat{\beta}_2 \cdot x_{i2} + \hat{\beta}_3 \cdot x_{i3} + \ldots + \hat{\beta}_k \cdot x_{ik} $$

The coefficients are interpreted similarly to the simple linear regression model. In the multiple linear regression model, however, \(\hat{\beta}_0\) is no longer the intersection of a regression line with the y-axis. Due to the additional independent variables, the above equation describes a hyperplane in a (k + 1)-dimensional space. A visual example of a model with two independent variables is shown in Figure 1. \(\hat{\beta}_0\) can be interpreted as the value of y that occurs when all independent variables are equal to 0. As in the simple model, the coefficients of the independent variables indicate what effect a change in the corresponding variable by one unit has on the expected value of y (if all other independent variables are held constant). For example, if the value of the coefficient \(\hat{\beta}_3\) is -0.35, this means that if \(x_3\) increases by one unit, the expected value of y decreases by exactly 0.35. \(\hat{\beta}_3\) is therefore equal to the partial derivative of the expected value of y with respect to \(x_3\):

$$ \frac{\partial E(y)}{\partial x_3} = \beta_3 $$

As in the simple linear regression model, the residual is the distance between the estimated \(\hat{y}_i\) and the actual \(y_i\). The method of least squares can be applied analogously in the multiple linear regression model to estimate the coefficients. As in the case of simple regression, this method delivers an "optimal" result (see Best Linear Unbiased Estimator, "BLUE"), provided that certain assumptions are met.
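The least squares fit can be sketched in a few lines of Python. The following is a minimal illustration with simulated data (all variable names and the true coefficients, including the -0.35 used above as an interpretation example, are made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulate three independent variables and a known linear relationship:
# y = 2.0 + 1.5*x1 - 0.8*x2 - 0.35*x3 + noise
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] - 0.35 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Least squares estimation: prepend a column of ones for the intercept beta_0
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# The estimates should lie close to the true values [2.0, 1.5, -0.8, -0.35]
print(beta_hat.round(2))
```

The estimated coefficient for \(x_3\) can then be read exactly as described above: an increase of \(x_3\) by one unit changes the expected value of y by roughly -0.35, holding the other variables constant.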

Figure 1: Regression plane of a model with two independent variables.


Multiple linear regression model assumptions

1. The independent variables have a linear influence on y

$$ y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \beta_3 \cdot x_{i3} + \ldots + \beta_k \cdot x_{ik} + e_i \quad \text{with } i = 1, \ldots, N $$

Whether this assumption is violated can be checked for each individual explanatory variable using a scatter diagram or residual plot.

What if the assumption of a linear relationship is violated?

If the assumption of the linear influence of x on y is fulfilled, the scatter plot and the associated residual plot look something like this:

In the residual plot (right), the points should scatter unsystematically ("white noise"), i.e. there should be no systematic structures in the errors.

In the next example diagram, the assumption of the linear influence of x on y is violated - instead there is a quadratic relationship.

The residual plot (right) clearly shows a tendency to overestimate for smaller and larger values of the dependent variable and a tendency to underestimate in the middle range. A remedy here would be to include the squared variable in the model or, in the case of more complicated relationships, to model a spline.
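This diagnosis and remedy can be sketched numerically (a minimal sketch with simulated data; the quadratic coefficient 0.7 is an assumption of the example): when the true relationship is quadratic, the residuals of a purely linear fit correlate strongly with \(x^2\), and adding the squared term removes that structure.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = 1.0 + 0.5 * x + 0.7 * x**2 + rng.normal(scale=0.5, size=400)  # truly quadratic

# Fit a purely linear model and inspect the residuals
X_lin = np.column_stack([np.ones_like(x), x])
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
resid = y - X_lin @ beta_lin

# Systematic structure: the residuals correlate strongly with x^2
corr_lin = np.corrcoef(resid, x**2)[0, 1]
print(f"correlation of linear-fit residuals with x^2: {corr_lin:.2f}")

# Remedy: include the squared term in the design matrix
X_quad = np.column_stack([np.ones_like(x), x, x**2])
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
corr_quad = np.corrcoef(y - X_quad @ beta_quad, x**2)[0, 1]
print(f"after adding x^2: {corr_quad:.2e}")  # essentially zero
```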

2. The expected value of the error term is 0

As with the simple linear regression model, the expected value of the error term is assumed to be 0; equivalently, the expected value of y equals the systematic part of the model.

$$ E(e_i) = 0 \Leftrightarrow E(y_i) = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \beta_3 \cdot x_{i3} + \ldots + \beta_k \cdot x_{ik} $$

This assumption is usually unproblematic as long as the model contains the constant \(\beta_0\).
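Why the constant makes this assumption unproblematic can be verified directly: with an intercept column in the design matrix, the least squares residuals sum to zero by construction (a small numpy check on arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # includes intercept
y = rng.normal(size=100)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# With a constant in the model, the residuals average to zero
# up to floating point error
print(resid.mean())
```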

3. Homoscedasticity

The variance of the error term e (and hence of y) is assumed to be constant, i.e. homoscedastic.

$$ Var(e_i) = \sigma^2 = Var(y_i) $$


How do I recognize heteroscedasticity?

A violation of the assumption of constant variance can be recognized from the scatter diagram or residual plot. The following diagram shows a model with heteroscedastic variance.

What are the consequences of heteroscedasticity in our model?

The least squares estimators can then no longer be regarded as the best estimators. Caution is also advised with the standard errors of the estimators: they are no longer correct. However, this problem only matters when confidence intervals are computed or hypotheses are tested. A two-stage estimation procedure (weighted least squares) can be used as a remedy: the model is first estimated "normally", then estimated again with case weights, each row being given a weight that is inversely proportional to the size of its estimated residual from the first stage.
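The two-stage idea can be sketched as follows (a simplified feasible weighted least squares sketch on simulated data; instead of weighting by each raw residual directly, the residual magnitude is smoothed by regressing it on x first, which is a common variant and an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(1, 10, size=n)
# Heteroscedastic errors: the standard deviation grows with x
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=n)

X = np.column_stack([np.ones(n), x])

# Stage 1: ordinary least squares
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

# Stage 2: estimate the error scale as a function of x, then reweight
# each row inversely to its estimated error variance
gamma, *_ = np.linalg.lstsq(X, np.abs(resid), rcond=None)
sigma_hat = np.clip(X @ gamma, 1e-6, None)   # fitted error scale, kept positive
w = 1.0 / sigma_hat**2
sw = np.sqrt(w)

# Weighted least squares = OLS on rows scaled by sqrt(weight)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

print("OLS:", beta_ols.round(2), " WLS:", beta_wls.round(2))
```

Both estimates are unbiased here; the point of the second stage is that the weighted estimator is more efficient and, more importantly, yields valid standard errors.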

4. The covariance between the individual errors is 0.

$$ Cov(e_i, e_j) = 0 \quad \text{for } i \ne j $$

This assumption can be violated, e.g., in the case of time series, or a violation can be an indication of a non-linear relationship. Closely related is the "endogeneity problem", in which there is a relationship between the residuals and an explanatory variable.
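A common check for first-order autocorrelation in time series residuals is the Durbin-Watson statistic, which is easy to compute by hand (a minimal sketch with simulated errors; the AR coefficient 0.8 is an assumption of the example):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 indicate no first-order
    autocorrelation; values toward 0 indicate positive, toward 4
    negative autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
white = rng.normal(size=2000)          # uncorrelated ("white noise") errors
ar = np.zeros(2000)                    # positively autocorrelated AR(1) errors
for t in range(1, 2000):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print(f"white noise: {dw_white:.2f}")  # close to 2
print(f"AR(1):       {dw_ar:.2f}")    # well below 2
```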

The endogeneity problem

If the covariance between an independent variable x and the error term e is not 0, i.e. if x and e are correlated, then one speaks of endogeneity. The causes of endogeneity include the following:

  • Important explanatory variables were not taken into account ("omitted variables")
  • The occurrence of simultaneous causality (several equations describe a relationship)
  • Measurement error in an explanatory variable
  • Unobserved heterogeneity (so-called individual effects)
  • Etc.

What is the consequence of endogeneity?

When the endogeneity problem occurs, the least squares method can no longer be used for the estimation. The Hausman test can be used to find out whether there is endogeneity in the model. One common remedy is the use of so-called instrumental variables ("IV" for short). For individual effects, the use of a panel data model provides consistent estimates.
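The instrumental variables idea can be illustrated with a minimal two-stage least squares (2SLS) sketch (simulated data; all coefficients and the unobserved confounder u are assumptions of the example). An instrument z must be correlated with the endogenous x but not with the error term:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000
z = rng.normal(size=n)   # instrument: drives x but is unrelated to the error
u = rng.normal(size=n)   # unobserved confounder causing endogeneity
x = 1.0 + z + u + rng.normal(scale=0.5, size=n)
e = u + rng.normal(scale=0.5, size=n)     # error correlated with x
y = 2.0 + 1.5 * x + e                     # true slope is 1.5

ones = np.ones(n)

# Naive OLS is biased upward because Cov(x, e) > 0
X = np.column_stack([ones, x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stage 1: regress the endogenous x on the instrument z
Z = np.column_stack([ones, z])
gamma, *_ = np.linalg.lstsq(Z, x, rcond=None)
x_hat = Z @ gamma

# Stage 2: regress y on the fitted values from stage 1
X2 = np.column_stack([ones, x_hat])
beta_iv, *_ = np.linalg.lstsq(X2, y, rcond=None)

print("OLS slope:", round(beta_ols[1], 2), " IV slope:", round(beta_iv[1], 2))
```

The OLS slope is visibly biased away from 1.5, while the 2SLS estimate recovers it.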

5. No multicollinearity

The independent variables \(x_k\) are not random variables, and they must not be (exact) linear functions of another independent variable.

If there are linear dependencies between two or more explanatory variables, one speaks of multicollinearity. The higher the multicollinearity, the more unstable the model becomes. Unusually high standard errors can indicate that multicollinearity is present; these lead to a loss of significance in the statistical tests within the model.

What can be done against multicollinearity?

The affected variables can usually be easily identified using a correlation matrix. Once this has happened, there are two possible solutions:

  • If the explanatory variables in such a group can be interpreted similarly in terms of content and are very highly correlated with one another (absolute Pearson correlation > 0.8 or 0.9), one variable from the group can be selected as a "proxy" that represents the group in the model. Example: in a regression model for a call center, the explanatory variables include the monthly call volume of a customer, the total duration of the calls and the monthly invoice amount. All three variables correlate strongly with one another (Pearson correlation > 0.9); only one of them is included in the model as a proxy for the "intensity of use".
  • If the above approach does not lead to success (e.g. because the relationships are more complex, or the correlations between the variables are too low, so that too much information would be lost if variables were removed entirely), the variables can be condensed with the help of a principal component or factor analysis into a smaller number of uncorrelated components or factors, which are then included in the model instead of the original variables.
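Besides the correlation matrix, the variance inflation factor (VIF) is a common diagnostic: it measures how well each explanatory variable can be predicted from the others, with values above roughly 10 often taken as a warning sign. A minimal sketch (the call-center-like variables and their correlation strengths are simulated assumptions):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress each column
    on all the others (plus an intercept) and return 1 / (1 - R^2)."""
    out = []
    n, k = X.shape
    ones = np.ones((n, 1))
    for j in range(k):
        yj = X[:, j]
        Xo = np.column_stack([ones, np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Xo, yj, rcond=None)
        resid = yj - Xo @ beta
        r2 = 1.0 - resid.var() / yj.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
n = 300
calls = rng.normal(100, 20, size=n)
duration = 3.0 * calls + rng.normal(scale=10, size=n)  # nearly collinear with calls
other = rng.normal(size=n)
X = np.column_stack([calls, duration, other])

print(np.corrcoef(X, rowvar=False).round(2))  # correlation matrix
v = vif(X)
print(v.round(1))  # large values flag the collinear pair, ~1 for 'other'
```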

6. Normal distribution of residuals (optional)

$$ e_i \sim N(0, \sigma^2) $$

The normal distribution of the residuals is an optional additional assumption.

This assumption is not necessary for the least squares method. The normal distribution of the residuals only matters when hypothesis tests or confidence intervals are to be interpreted. So if a model is only used to make a forecast (hypothesis tests are irrelevant), this assumption is negligible. It is also important to clarify that the assumption of normality relates to the residuals. In practice, one often comes across the (wrong) opinion that the assumption must hold for the explanatory variables or the dependent variable.

How can the assumption be checked?

The residuals from the model can be checked for normal distribution using a QQ plot. If the number of cases is sufficient, a test for normal distribution (e.g. the Shapiro-Wilk test) can alternatively be carried out.
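Both checks are readily available in Python, assuming SciPy is installed (a minimal sketch; the simulated "residuals" are assumptions of the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
resid_normal = rng.normal(size=200)               # plausible residuals
resid_skewed = rng.exponential(size=200) - 1.0    # clearly non-normal residuals

# Shapiro-Wilk test: small p-values indicate a deviation from normality
stat_n, p_n = stats.shapiro(resid_normal)
stat_s, p_s = stats.shapiro(resid_skewed)
print(f"normal residuals:  p = {p_n:.3f}")
print(f"skewed residuals:  p = {p_s:.3g}")   # very small: normality rejected

# QQ plot data: theoretical vs. sample quantiles
# (pass the arrays to matplotlib to draw the actual plot)
(osm, osr), (slope, intercept, r) = stats.probplot(resid_normal, dist="norm")
```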

What to do if the residuals are not normally distributed, but significance tests are relevant?

Occasionally the problem can be solved by including important, previously neglected explanatory variables in the model. If this does not lead to success, all other assumptions should be critically checked again (in particular linearity, homoscedasticity and uncorrelatedness). If the problem persists, it should be checked whether the right model class has been selected. Linear regression is often used in situations in which other models would actually be preferable; if the dependent variable is, for example, a count variable, a Poisson regression would be more appropriate.
