Are stratified and proportional samples equal

Stratified random sample

Transcript

Contents
- What is this module about?
- Definition of a stratified random sample
- Drawing a stratified sample
- Step 1: Division of the elements of the population into strata
- Step 2: Selection of a random sample from each individual stratum
- Advantages of stratified samples
- Disadvantages of stratified samples
- The selection of the stratification factor
- The distribution of the sample among the strata
- Summary

What is this module about?

This module defines stratified random samples and shows how they are drawn. The basic idea is to divide the population into strata, so the module explains what a stratification factor is and how to choose one. Stratification can produce a stratification effect, which is also explained. Finally, the difference between proportional and disproportional allocation is described.

Example: Applications of stratified random sampling
- A municipal citizen survey was to measure satisfaction with the city administration. It was assumed that the geographical distance to the administrative offices plays a role and that answers would differ with distance. For sampling, the city was therefore divided into city districts and a random sample was drawn in each district. This corresponds to a stratified sample with the city districts as strata.
- A survey of satisfaction with public service providers in Europe assumes that satisfaction varies from country to country. Each country should therefore be considered separately, which is ensured by stratifying by country. A separate random sample of public service providers is then drawn in each country, and satisfaction with the respective service companies is recorded.

Definition of a stratified random sample

In stratified random sampling, the population is first divided into subpopulations, called strata. A random sample is then drawn from each stratum. The resulting sample is called a stratified random sample. It is important that every element of the population belongs to exactly one stratum and that sampling is carried out independently in each stratum. The formal definition reads a bit more "elegantly":

Formal definition of the stratified random sample
The $N$ elements of a population are divided into $H$ disjoint strata of sizes $N_1, \dots, N_H$ with $\sum_{h=1}^{H} N_h = N$, where $N$ is the total number of elements in the population. A random sample of size $n_h$ is then drawn from each stratum $h$ independently. The stratified random sample then contains $n = \sum_{h=1}^{H} n_h$ elements.

Example: Applications of stratified random sampling
- In the 1995 Chicago Marathon, a stratified sample of 1000 runners was drawn for doping control from all runners who crossed the finish line. The strata were defined by finishing time (under two and a half hours, two and a half to four hours, over four hours). In the first stratum the proportion of doped runners was 25%, in the second 7% and in the third 3%.
- A two-stage stratified sample was drawn for the PISA study, the international comparison of school system performance in different countries. In a first step, the school systems of the participating states were stratified by region (states, provinces, cantons, etc.) and by school type. Within these subdivisions (in Germany: school types within the federal states), the schools were then drawn by random sampling. In a second step, the students within the schools were selected by random sampling.
- If one examines the mortality of newborns within the first week after birth, clear differences can be observed between maternity hospitals.
The size of the clinic is of particular importance, with size defined as the number of births per year. Mortality in large clinics is significantly lower than in small clinics. To investigate the causes of these differences, it is advisable to stratify the sample by the size of the maternity hospital.

Drawing a stratified sample

A stratified random sample is drawn in two steps:

- Division of the elements of the population into strata
- Selection of a random sample from each individual stratum

Step 1: Division of the elements of the population into strata

Stratified random sampling requires a variable that allows the population to be divided into strata. With regard to the variables of interest, the strata should be as homogeneous as possible internally and differ from one another as much as possible. Basic information about the selection units is therefore required in order to assign them to strata: the value of the stratification variable must be known for every element of the sampling population.

A variable used to divide a population into strata is called a stratification factor. Stratification factors can often be taken from the registers or databases that are available for the population. Typical stratification factors are spatial or demographic variables such as electoral district, residential area, gender or age, because this information is usually known. Other frequently used stratification factors that are known for administrative reasons are, for example, company affiliation, school or school class.

The strata are formed according to the values of the stratification factor, so that selection units with the same value belong to the same stratum. The population is thereby divided into non-overlapping strata.

Step 2: Selection of a random sample from each individual stratum

After the strata have been determined, random samples are drawn within the strata. The same techniques are used as for random sampling in general. For example, simple random samples can be drawn in one of the ways described in the section "Procedure".

The random samples are drawn independently of one another in each stratum. If, for example, we distinguish three strata in the population, our stratified sample consists of the elements of three completely independent samples.
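The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the module's own procedure: the function name `stratified_sample`, the district labels and the population of 60 residents are hypothetical, and a simple random sample without replacement is assumed within each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, sizes, seed=None):
    """Draw an independent simple random sample from every stratum.

    population -- iterable of elements
    stratum_of -- function returning the stratum label of an element
    sizes      -- dict: stratum label -> number of elements to draw
    """
    rng = random.Random(seed)

    # Step 1: assign every element to exactly one stratum.
    strata = defaultdict(list)
    for element in population:
        strata[stratum_of(element)].append(element)

    # Step 2: simple random sample without replacement in each stratum.
    return {label: rng.sample(members, sizes[label])
            for label, members in strata.items()}


# Hypothetical population: 60 residents spread over three city districts.
residents = [(i, ("north", "centre", "south")[i % 3]) for i in range(60)]
sample = stratified_sample(residents, lambda r: r[1],
                           {"north": 5, "centre": 5, "south": 5}, seed=42)
```

The result is a dictionary with one sample per stratum, which mirrors the idea that the stratified sample consists of several separate samples.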

Advantages of stratified samples

Using stratified samples instead of simple random samples has a number of advantages and a few disadvantages.

Advantages of stratified random sampling
A stratified sample almost always allows a more precise estimate of population parameters than a simple random sample. This holds whenever the strata in the population and the samples are sufficiently large. Since this condition is practically always met by professionally used samples (especially population samples), stratified samples are preferable to unstratified samples of the same size. In addition, stratification ensures that even small subpopulations are represented in the sample in sufficient numbers for data analysis.

Advantages of stratified samples outweigh the disadvantages
If one weighs the advantages and disadvantages against each other, the advantages of stratified samples clearly predominate. For a given sample size and under realistic conditions, estimates from stratified samples can be expected to be more precise than estimates from unstratified samples. This gain in precision is called the "stratification effect".

Example: Stratified sample
If, for example, a survey is to take place within the Bundeswehr and one is specifically interested in women, the population should be stratified by gender. Otherwise one would have to draw a very large simple random sample of soldiers in order to have at least a few women in the sample.
A very similar problem arises when examining the low proportion of women among professors. In the nineties, for example, women accounted for less than one sixth of habilitations in the Federal Republic of Germany. A sample of habilitation candidates should therefore be stratified by gender if the data analysis is meant to support statements about female candidates and their career prospects.
Stratification rules out from the outset samples that contain no elements from a stratum of interest. If one draws a series of random samples of 160 voting districts each from all voting districts of the FRG (this roughly corresponds to the procedure for drawing a sample from the residents' registration offices for the FRG), more than 40% of these samples contain no voting district from the state of Bremen. If the sample is stratified by federal state, such a sample is impossible from the outset.

Disadvantages of stratified samples

Sometimes the information for an appropriate stratification factor is not accessible. If, for example, one wanted to draw a stratified sample with "state of health" as the stratification variable, there would be no suitable sampling frame for drawing it. In this case one would probably stratify by "age" instead, since age is highly correlated with state of health. In practice, therefore, geographical or demographic characteristics are often chosen as stratification factors.

Finally, it must be emphasized that stratified random sampling always requires knowledge of population parameters. At the very least, the sizes of the strata in the population must be known. This is one of the reasons why sampling statisticians are very interested in censuses: the design of stratified samples requires census data. In general, the less one knows about the population, the less finely one can and should stratify.

As already mentioned, a stratified sample almost always allows a more precise estimate of population parameters than a simple random sample. If samples were drawn a large number of times from the same population, the estimates from stratified samples would differ less from one another than those from simple random samples. This variation between samples can be characterized more precisely by calculating the variance of the estimates. You got to know this in the module "Estimation" under the term "variance of an estimator". The square root of this variance is called the "standard error of the estimate" or, for short, the "standard error" (see the section "Illustration of the development of a characteristic value distribution"). The greatest advantage of stratified samples is the reduction of the standard error. This effect is called the "stratification effect".

Reduction of the standard error through stratification: stratification effect
Stratified samples almost always have smaller standard errors than simple random samples. This holds whenever the strata in the population and the samples are sufficiently large.
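The repeated-sampling argument can be checked with a small simulation. The following sketch is an illustration under invented assumptions: the population (two equally large strata with means of about 20 and 65), the sample size of 20 and the function name `compare_standard_errors` are all hypothetical. It draws many simple random samples and many stratified samples and compares the standard deviation of the resulting mean estimates, which approximates the standard error.

```python
import random
import statistics


def compare_standard_errors(draws=2000, n=20, seed=1):
    """Return (SE of simple random samples, SE of stratified samples)."""
    rng = random.Random(seed)

    # Hypothetical population: two equally large, internally homogeneous strata.
    stratum_a = [rng.gauss(20, 2) for _ in range(500)]
    stratum_b = [rng.gauss(65, 2) for _ in range(500)]
    population = stratum_a + stratum_b

    srs_means, stratified_means = [], []
    for _ in range(draws):
        # Simple random sample of n elements from the whole population.
        srs_means.append(statistics.fmean(rng.sample(population, n)))

        # Stratified sample: n/2 elements from each stratum. Because the
        # strata are equally large, this allocation is proportional and the
        # plain mean of the combined sample is a valid estimate.
        combined = rng.sample(stratum_a, n // 2) + rng.sample(stratum_b, n // 2)
        stratified_means.append(statistics.fmean(combined))

    return statistics.stdev(srs_means), statistics.stdev(stratified_means)


se_srs, se_stratified = compare_standard_errors()
```

With strata this different and this homogeneous, the stratified standard error comes out far smaller than the one for simple random samples; with weaker separation the gain shrinks.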
The extent to which the standard error is reduced (i.e. the size of the stratification effect) depends on two factors:
- the size of the differences between the strata
- the degree of homogeneity within the strata

The stratification effect is larger the greater the differences between the strata and the more homogeneous each individual stratum is.

The easiest way to understand the stratification effect is to compare the samples that are possible under simple random sampling with those that are possible under stratified sampling. Let us take our student example to extremes and assume a population that consists of people aged around 20 (students) on the one hand and people over 60 (non-students) on the other. If "student vs. non-student" is used as the stratification criterion, the design of a stratified sample rules out from the outset some of the samples that are possible under simple random sampling, namely all samples that contain elements from only one stratum. With a simple random sample, it is possible that the entire sample contains only people over 60 years of age. The estimate of the population mean based on such a sample would be far from the actual population mean, and even a few such samples would noticeably increase the variance of the estimates. Such extreme samples are not possible with stratified samples, so the variance of the estimates, and with it the standard error, is reduced.

The aim of stratification must therefore be to define the strata so that, with regard to the characteristic of interest, they differ from one another as much as possible while the elements within each stratum are as similar as possible. The stratification applet (ab4.jar) is intended to illustrate this gain from stratification, the stratification effect.

The selection of the stratification factor

The choice of the stratification factor, i.e. the characteristic by which the population is stratified, depends on the aim of the stratification. As a rule, the aim is either to reduce the standard error or to ensure that small subpopulations are sufficiently represented in the sample.

To exploit the stratification effect, it must first be clear for which variable estimates are to be made. The values of this variable are of course not known in the population, otherwise they would not have to be estimated. The population is therefore stratified by another variable that is believed to be related to the variable of interest. In general, one looks for stratification criteria that lead to strata with large differences in means and little internal variation with regard to the variable of interest. The strata should therefore differ strongly from one another (large differences in means), and the elements within a stratum should differ little from one another (small standard deviation within a stratum).
The better a stratification fulfills these two criteria, the larger the stratification effect and thus the more precise the estimate based on the stratified sample. Sometimes data from previous surveys, e.g. from the previous year, are available and can be used for stratification. Often, however, one has to use stratification variables that are known for all elements of the population but correlate only weakly with the variables that are actually of interest. In population surveys these are often characteristics that reflect regional affiliation. Regional disparities abound in population samples. For example, the supply of social infrastructure facilities differs regionally, as does the unemployment rate. Since such living conditions can affect the behavior of the population in many ways, almost all supraregional studies and even many municipal studies stratify by region of residence.

Different stratification criteria are used in the Federal Republic of Germany for these regionally stratified samples. Common examples are stratifications by
- old and new federal states
- federal state
- administrative district
- district
- municipality size class
- municipality
- city district

When sampling the general population, stratification is frequently based on several criteria rather than just one. Large samples therefore sometimes use a large number of strata. In the USA, for example, the official unemployment rate is estimated with a survey, the "Current Population Survey" (CPS). The CPS is based on information from households, which are divided into 792 strata. In addition to regional characteristics, the unemployment rate of the previous month, the number of single mothers and the proportion of multi-person households among all households are used as stratification criteria. Full details on all aspects of the CPS are documented separately.

If the aim of the stratification is to ensure that certain groups are included in the sample in sufficient numbers, one will as a rule stratify directly by this grouping variable. Around 5% of the population over 65 years of age live in old people's homes. Although this is more than 50,000 people overall, the proportion of retirement home residents in a simple random sample, even one limited to those over 65, is likely to be too small to yield enough cases for statistical analysis. One will therefore use residence in a retirement home as a stratification characteristic.
The distribution of the sample among the strata

Another important decision when preparing a stratified sample is how to distribute the desired total sample size among the individual strata. For a given sample size, it must be decided how many elements of the sample should come from each stratum. Basically, one can choose between proportional and disproportional allocation.

Proportional allocation
A proportionally stratified sample is obtained when the share of a stratum in the total sample equals the share of this stratum in the population. The share of the h-th stratum in the population is $N_h / N$, and the share of the h-th stratum in the sample is $n_h / n$. For a proportionally stratified sample, the following applies: $n_h / n = N_h / N$. For a desired sample size $n$, the sizes of the individual stratum samples are then calculated as $n_h = n \cdot N_h / N$.

Disproportional allocation
If the stratum samples are not selected in proportion to the distribution of the strata in the population, one speaks of a disproportionately stratified sample. The shares of the strata in the population and in the sample then do not correspond.
A disproportionately stratified sample can be useful, for example, if a certain population group is to be deliberately overrepresented in the sample ("oversampling") in order to obtain enough interviews from this group. If results for particular small subpopulations are of interest, a proportionally stratified sample often does not guarantee that enough members of these subpopulations end up in the sample.

Example: Use of a proportional stratification
From the 1000 undergraduate students of a faculty, 200 are to be selected for a random sample. The semester number is to serve as the stratification criterion. The enrollment records yield the following distribution of the students over the semesters.

Semester   Number of students   Proportion
1          400                  0.4
2          300                  0.3
3          200                  0.2
4          100                  0.1
Total      1000                 1

With proportional stratification, the shares of the strata in the population correspond to the shares in the sample. The required number of students per stratum is therefore calculated as follows:

Semester   Number of students in the sample
1          0.4 x 200 = 80
2          0.3 x 200 = 60
3          0.2 x 200 = 40
4          0.1 x 200 = 20

In the second step, a simple random sample selects 80 of the 400 students in the first semester, 60 of the 300 students in the second semester, and so on.

Example: Use of a disproportional stratification
Experience has shown that many students change their field of study after the first semester. If a faculty would like to make statements about all students but wants to focus the analysis on the potential change of subject, it is advisable to draw a disproportional sample. If 200 of the 1000 undergraduate students of the faculty are to be selected for a random sample stratified disproportionally by semester number, the question arises of how the disproportional shares should be determined, given the same distribution of students:

Semester   Number of students   Share
1          400                  0.4
2          300                  0.3
3          200                  0.2
4          100                  0.1
Total      1000                 1

In this case one could decide, more or less arbitrarily, that twice as many first-semester students should be selected as students from each of the other semesters. Since there are four semesters and we want to select twice as many people from one semester as from each of the others, we divide the total sample size by 4 + 1. This gives 200/5 = 40. For each of semesters 2 to 4, 40 students are selected by a separate simple random sample, and for the first semester 2 x 40 = 80.

Semester   Number in    Share of the semester   Share of the semester's   Share of the semester
           the sample   in all students         students in the sample    in the sample
1          80           0.4                     80/400 = 0.2              80/200 = 0.4
2          40           0.3                     40/300 = 0.13             40/200 = 0.2
3          40           0.2                     40/200 = 0.2              40/200 = 0.2
4          40           0.1                     40/100 = 0.4              40/200 = 0.2

As you can see, with this allocation the shares of the strata in the sample do not match the shares of the strata in the population (cf. columns three and five). Furthermore, the share of a stratum's population elements that is drawn into the sample differs from stratum to stratum and is not equal to the stratum's share of the population (cf. columns three and four). Before you read any further, make sure you understand the last sentence.

The idea of proportional and disproportional stratification is illustrated with two figures. In each, the population consists of 48 elements, which we have divided into three strata. The strata have 24, 16 and 8 elements in the population. A sample of n = 12 is to be drawn. The proportional stratification looks like this:

[Figure: proportional stratification]

With a disproportional stratification, however, the following allocation of the sample elements would be possible:

[Figure: disproportional stratification]
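The columns of the disproportional student example can be recomputed with a short sketch. The function names `allocate_by_factors` and `share_table` are hypothetical helpers introduced here for illustration. The relative factors (first semester counted twice) follow the "divide by 4 + 1" reasoning of the example, and the semester sizes 400, 300, 200 and 100 are taken from the worked figures.

```python
def allocate_by_factors(n, factors):
    """Split n among strata in proportion to freely chosen relative factors.

    Integer division is used, so n should be divisible by the factor total.
    """
    total = sum(factors.values())
    return {h: n * f // total for h, f in factors.items()}


def share_table(strata_sizes, sample_sizes):
    """Per stratum: (share in population, sampling fraction, share in sample)."""
    N = sum(strata_sizes.values())
    n = sum(sample_sizes.values())
    return {h: (strata_sizes[h] / N,                 # N_h / N
                sample_sizes[h] / strata_sizes[h],   # n_h / N_h
                sample_sizes[h] / n)                 # n_h / n
            for h in strata_sizes}


students = {1: 400, 2: 300, 3: 200, 4: 100}

# Twice as many first-semester students as from each other semester.
sample_sizes = allocate_by_factors(200, {1: 2, 2: 1, 3: 1, 4: 1})
table = share_table(students, sample_sizes)
```

Comparing the first and third entry of each tuple shows where the sample share departs from the population share, and the second entry shows that the sampling fraction varies between semesters.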

If we draw a proportional sample, the share of each stratum in the sample must correspond to its share in the population. The first stratum (green) makes up half of the population, so it should also make up half of the sample: 6 elements of this stratum are selected for the sample. These 6 correspond to 25% of the elements in this stratum. The second stratum (blue) makes up 1/3 of the population, so its share in the sample should also be 1/3. 1/3 of 12 is 4, so 4 elements are selected for the sample.

Stratum   Stratum   Share in     Sample size      Share in sample   Sample size        Share in sample
          size      population   (proportional)   (proportional)    (disproportional)  (disproportional)
green     24        0.50         6                0.50              4                  0.33
blue      16        0.33         4                0.33              4                  0.33
red       8         0.17         2                0.17              4                  0.33

With a disproportional stratification, the shares of the strata in the sample can be chosen freely, so other considerations must guide the choice. If, for example, one wants to make statements about the differences between the strata, then (assuming equal variances of the characteristic of interest in the strata) an allocation with the same number of elements per stratum would be optimal. With n = 12 and three strata we therefore choose 4 elements (12/3 = 4) per stratum. Each stratum thus has a share of 33% in the sample. The following applet is intended to clarify the different allocation variants.

Applet: disproportional drawing (d3d.jar)

The following table shows the number of major hospitals in Illinois in four regional strata.

Stratum   Number of hospitals
1         44
2         116
3         48
4         47
Total     255

A proportionally stratified sample of 51 hospitals is to be drawn. Calculate the required number of hospitals for each stratum.

Solution
With proportional allocation, the share of each stratum in the population is multiplied by the total size of the planned sample. The result must of course be rounded to the nearest whole number.

Stratum   Share of the stratum in the population   Number of hospitals to be drawn
1         44/255 = 0.173                           0.173 * 51 = 8.8 (9)
2         116/255 = 0.455                          0.455 * 51 = 23.2 (23)
3         48/255 = 0.188                           0.188 * 51 = 9.6 (10)
4         47/255 = 0.184                           0.184 * 51 = 9.4 (9)
Sum       1                                        51
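The hospital exercise can be reproduced with a few lines of Python. This is a sketch under one assumption that should be stated clearly: only the size of the first stratum (44) and the total (255) appear explicitly, so the sizes 116, 48 and 47 used here are inferred from the shares 0.455, 0.188 and 0.184 in the solution. The function name `proportional_allocation` is a hypothetical helper.

```python
def proportional_allocation(n, strata_sizes):
    """Allocate n to strata in proportion to their size, n_h = n * N_h / N,
    rounded to the nearest whole number."""
    N = sum(strata_sizes.values())
    return {h: round(n * N_h / N) for h, N_h in strata_sizes.items()}


# Stratum sizes: 44 is given; 116, 48 and 47 are inferred from the shares.
hospitals = {1: 44, 2: 116, 3: 48, 4: 47}
allocation = proportional_allocation(51, hospitals)
total_drawn = sum(allocation.values())
```

Here the rounded values happen to add up to the planned 51. Rounding each stratum separately does not guarantee this in general, so the sum is worth checking and adjusting by hand if it is off by one.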

The allocation (proportional or disproportional) of the stratified sample also affects the precision of the estimates.

Proportional allocation
Estimates based on a proportionally stratified sample almost never have larger, and often have smaller, standard errors than estimates from a simple random sample. This applies to every stratification factor.

Disproportional allocation
There are situations in which a disproportionately stratified sample is preferable. If some strata in the population show a higher variance than others, such strata are represented less well in the sample when all strata are sampled at the same rate. The greater variance means greater sampling variability, which can be reduced by increasing the share of this stratum in the sample.
It can be shown that, for a given sample size n, the variance of the estimator is lowest when the sample is allocated to the strata in proportion to the product of stratum size and population standard deviation in the stratum. If we divide the population into a total of $H$ strata of sizes $N_h$, then for a sample of size $n$ the "optimal" size for stratum $h$ is

$n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{H} N_k \sigma_k}$

where $\sigma_h$ is the standard deviation in stratum $h$. Such an allocation is optimal (statisticians speak of "optimal allocation") only with regard to the standard error of one variable. Since more than one variable is usually of interest, compromises have to be made in practical sampling.

Random samples are characterized by the fact that the selection probabilities of the elements of the population are known and greater than zero. In the simplest case, that of simple random samples, the selection probabilities are the same for all elements. In a stratified random sample, the selection probability of an element depends on whether the sample is proportionally or disproportionately stratified.
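Returning to the optimal allocation described above, a small sketch may make the rule concrete. The formula used is the standard one in which the stratum sample size is proportional to the product of stratum size and stratum standard deviation. The function name `optimal_allocation` and the two strata "A" and "B" with their sizes and standard deviations are invented for illustration, and the results are left unrounded.

```python
def optimal_allocation(n, strata_sizes, strata_sds):
    """Allocate n so that n_h is proportional to N_h * sigma_h."""
    products = {h: strata_sizes[h] * strata_sds[h] for h in strata_sizes}
    total = sum(products.values())
    return {h: n * p / total for h, p in products.items()}


# Hypothetical strata: B is smaller but three times as variable as A.
sizes = {"A": 600, "B": 400}
sds = {"A": 5.0, "B": 15.0}
optimal = optimal_allocation(100, sizes, sds)

# With equal standard deviations the rule reduces to proportional allocation.
equal_sd = optimal_allocation(100, sizes, {"A": 1.0, "B": 1.0})
```

In this made-up case the smaller but more variable stratum B receives about two thirds of the sample, whereas proportional allocation would give it 40%.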
Selection probabilities for proportionally stratified samples
In proportionally stratified samples, the share of each stratum in the sample corresponds to the share of the stratum in the population, i.e. for each stratum h:

(1) $n_h / n = N_h / N$

where $n_h$ is the number of elements in the random sample within stratum $h$ and $N_h$ is the number of all elements of the population in stratum $h$. Within each stratum, each element $i$ of the population has the chance

(2) $p_{hi} = n_h / N_h$

of getting into the sample, since independent simple random samples are drawn in each stratum. The number of sample elements in stratum $h$ was determined by

(3) $n_h = n \cdot N_h / N$

Substituting the expression from equation (3) for $n_h$ in equation (2) gives

(4) $p_{hi} = \frac{n \cdot N_h / N}{N_h} = \frac{n}{N}$

The selection probabilities for all elements of a proportionally stratified sample are therefore the same (cf. EPSEM samples).

Selection probabilities for disproportionately stratified samples
With disproportional stratification, the shares of the strata in the sample do not correspond to the shares of the strata in the population, i.e.

(5) $n_h / n \neq N_h / N$

With disproportionately stratified samples, too, random samples (usually simple ones) are drawn from the individual strata, so that the selection probability for each element is

(6) $p_{hi} = n_h / N_h$

The selection probabilities are therefore the same only within a stratum. They differ between strata. For this reason, sample statistics from disproportionately stratified samples cannot be used directly to estimate population parameters. Disproportionately stratified samples must be "weighted".

Weighting for disproportionately stratified samples
One can imagine that one element of a sample stands for many other elements of the population. If, for example, a population of 8000 people consists of 1000 students and 7000 non-students and we take a sample of 100 people from each group, then each student in the sample stands for ten students and each non-student for 70 non-students. Such factors are commonly referred to as "sample weights". In stratified samples they can differ between elements. The factor

(7) $w_{hi} = N_h / n_h$

is therefore the weight of element $i$ in stratum $h$.
As you can see, the weight is the reciprocal of the selection probability:

(8) $w_{hi} = 1 / p_{hi} = N_h / n_h$

For the example with the stratification by students and non-students, this gives for the students $w = 1000 / 100 = 10$ and correspondingly for the non-students $w = 7000 / 100 = 70$.

With a proportionally stratified sample of the same total size of 200, the sample would consist of 25 students and 175 non-students (make sure you can calculate these numbers yourself). The weighting factors for a proportional sample would be $w = 1000 / 25 = 40$ for the students and $w = 7000 / 175 = 40$ for the non-students.
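The weights of the student example can be recomputed in a few lines. This is a minimal sketch: the function name `sample_weights` is a hypothetical helper, and the numbers are those of the example (1000 students, 7000 non-students, samples of 100 and 100 or of 25 and 175).

```python
def sample_weights(strata_sizes, sample_sizes):
    """Weight per stratum: w_h = N_h / n_h, the reciprocal of n_h / N_h."""
    return {h: strata_sizes[h] / sample_sizes[h] for h in strata_sizes}


population = {"students": 1000, "non-students": 7000}
disproportional = {"students": 100, "non-students": 100}
proportional = {"students": 25, "non-students": 175}

w_disproportional = sample_weights(population, disproportional)
w_proportional = sample_weights(population, proportional)

# Summed over all sampled elements, the weights reproduce the population size.
represented = sum(w_disproportional[h] * disproportional[h] for h in population)
```

The last line is a useful plausibility check: if every sampled element stands for as many population elements as its weight says, the weights must add up to the 8000 people in the population.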

For students and non-students the weighting factors would each be exactly 40 (since the selection probability is the reciprocal of the weight, the selection probability is p = 1/40; this equals n/N, here 200/8000, i.e. 0.025). With a proportional sample, the weighting factors are the same in all strata and for all elements of the sample. One speaks of a "self-weighting sample", since no differing weighting factors have to be taken into account when making estimates.

If, on the other hand, one actually draws a disproportional sample as described above from the population of 8000 people (population mean 37.98), and if the sample means differ between the students and the non-students, then the weighting factors must be taken into account. The formula for estimating the population mean of a variable from a stratified sample is:

(9) $\bar{y}_w = \frac{\sum_{h=1}^{H} \sum_{i=1}^{n_h} w_{hi} \, y_{hi}}{\sum_{h=1}^{H} \sum_{i=1}^{n_h} w_{hi}}$

If the correct weights are used in this example and the mean is estimated with formula (9), the deviation from the actual value, at 0.87, is very small and a consequence of the sampling variability to be expected in samples. If the sample were not weighted (or weighted with the same weight for all elements), the deviation from the population mean would be almost seven times greater than with correct weighting.

[Table: overview of the example, giving for students and non-students the mean, the share of the sample, the incorrect and the correct weighting factor, and the overall mean with incorrect and with correct weighting]

It is therefore crucial that the correct weighting factors are used in estimation if the stratification is disproportional. As has been seen, calculating these factors is very simple as long as the sizes of the strata in the population are known.

Summary

Stratified random samples serve two purposes:
- The precision of the estimate of a parameter is increased compared to simple random samples.
- Small subpopulations can be sufficiently represented through suitable stratification.

Drawing a stratified random sample means drawing random samples within the different strata. Whether the shares of the strata in the sample should correspond to the shares in the population or not (proportional or disproportional stratification) depends on the purpose of the sample. The disadvantages of a stratified sample are that some additional information about the population is required for drawing it and that, with disproportional stratification, analyzing the results is slightly more time-consuming.

In a stratified random sample,
a) a random sample of strata is drawn
b) a random sample is drawn from each stratum after a justified division of the population into strata
c) first a random sample of strata and then another random sample within certain strata is drawn
d) a population is randomly divided into strata, from which a random sample is then drawn.
Solution: b)

a) Which of the following variables should not be used as a stratification factor? Explain your decision. Constituency, residential area, mental health, school
b) In how many strata can an element occur after the population has been divided?
Solution:
a) To divide a population into strata, information about the population must be available. Stratification factors can often be taken from the registers or databases that are available for the population. It therefore makes no sense to use a variable as a stratification factor if little or only hard-to-access information about it can be expected for the population. This can be assumed for information on mental health.
b) The population is divided into non-overlapping strata. Each element may appear in only one stratum.

What are the advantages of stratified random sampling?
a) the most cost-effective method in every case
b) optimization of the estimates with regard to all characteristics of the study
c) more precise estimation of population parameters than with a simple random sample
d) optimization of the homogeneity within the strata
e) adequate consideration of small subpopulations
Solution: c), e)

a) Explain the difference between a proportionally stratified and a disproportionately stratified random sample.
b) What is meant by an optimal disproportional allocation?
Solution:
a) A proportionally stratified sample is obtained when the share of a stratum in the total sample equals the share of this stratum in the population. If the shares of the strata in the population and in the sample do not correspond, one speaks of a disproportionately stratified sample.
b) For a given sample size n, the total sample size is distributed among the individual strata in such a way that the variance of the estimator is minimal.

One speaks of a "self-weighting sample" if ...
a) you can assign the weights yourself
b) weighting factors must be calculated so that the estimate is optimal
c) no differing weighting factors have to be taken into account in estimates
d) the stratification effect ensures that weighting can be dispensed with.
Solution: c)

The population of working doctors in 1999 consisted of men and women in the numbers reported by the Federal Statistical Office (2002: 72). You want to carry out a study of the weekend work performed. To do this, you draw a proportional sample of 300 doctors stratified by gender, using the 1999 data as information about the population. How many men and women are in the sample? Round your result.
Solution: 190 men, 110 women.

Source: Federal Statistical Office (2002): Zahlenkompass 2002 für die Bundesrepublik Deutschland. Stuttgart: Metzler-Poeschel.

(c) Project New Statistics 2003, Free University Berlin, Center for Digital Systems