
# Modelling Sample Attrition in "Value-Added" Models of Learning

We introduce a method to adjust for dropout while calculating learning gained by a cohort over a school year.

# Authors

### Mesele Araya

University of Cambridge

### Dawit Tibebu Tiruneh

University of Cambridge

### John Hoddinott

Cornell University

As primary school enrollment rates continue to increase in low-income countries, attention is shifting away from factors that affect attendance and towards those that affect learning. Put simply, it is not enough to get children into school; it is of considerable policy importance to understand what affects how much they learn. A common approach to this issue is to test children at the start and end of the school year and estimate a “value-added” model. The outcome variable is the end-of-year test score, modelled as a function of the start-of-year test score and other variables, including child, parental, teacher, and school characteristics.

Estimating value-added models, however, is not as straightforward as it might seem. One issue that frequently arises is that children may drop out over the course of the school year, may be ill, or may have their schooling interrupted by factors such as floods or civil conflict. We call this sample attrition. Attrition is potentially problematic in two ways: (1) the smaller sample size at the end of the year reduces statistical power; and (2) where sample attrition is non-random, there is a risk that parameter estimates in the value-added model will be biased. In our Research for Improving Systems of Education (RISE) Ethiopia 2018-19 survey data, 19 percent of children (that is, 723 children out of an initial sample of 3,901) who took our mathematics test at the start of Grade 4 in October 2018 did not complete the end-of-year test in May 2019. This blog post explains how we address this sample attrition in our RISE-Ethiopia 2018-19 survey. While our results are specific to our data, the method is generalisable.
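As a quick check on the figures quoted above, the attrition rate and the remaining sample size can be recomputed directly. The variable names are illustrative only; the two counts are the ones given in the text.

```python
# Counts quoted in the post: children tested at baseline and those missing at endline.
baseline_n = 3901
attrited_n = 723

retained_n = baseline_n - attrited_n
attrition_rate = attrited_n / baseline_n

print(f"Retained at endline: {retained_n}")
print(f"Attrition rate: {attrition_rate:.1%}")
```

The rate works out to about 18.5 percent, which the text rounds to 19 percent.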

## A nine-step approach to creating sample attrition weights

The approach we take, based on work by Fitzgerald, Gottschalk, and Moffitt (Journal of Human Resources, 1998), is to create sample attrition weights. These are constructed by estimating two models of attrition (that is, loss to follow-up): one where attrition is a function of the variables that we think affect our outcome of interest (here, end-of-year math test scores); and a second where attrition is a function of those same variables plus variables that we think affect only the likelihood that the child stays in school. We estimate predicted probabilities of remaining in school from both models; Fitzgerald, Gottschalk, and Moffitt show that the weights are the ratio of these predicted probabilities.
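Written out, the weight for child $i$ can be sketched as below. The notation is ours and is an assumption for illustration: $R_i = 1$ if the child completes the end-of-year test, $X_i$ are the variables that affect the outcome, and $Z_i$ are the variables that affect only whether the child stays in school.

```latex
w_i \;=\; \frac{\widehat{\Pr}\left(R_i = 1 \mid X_i\right)}
               {\widehat{\Pr}\left(R_i = 1 \mid X_i, Z_i\right)}
```

A child whose $Z_i$ makes them less likely to remain than their $X_i$ alone would suggest has a denominator smaller than the numerator, so receives a weight above one.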

There are nine steps to doing this:

1. Identify variables that we think affect the outcome. Call these "X"
2. Identify variables that we think influence attrition but do not directly affect the outcome. Call these "Z"
3. Run a regression where the dependent variable is the attrition variable. Include both "X" and "Z"
4. Test whether we can reject the null hypothesis that the coefficients on the "Z" variables are jointly zero
5. Predict the probability that each child remains in the sample (that is, does not attrit) and store these predicted values
6. Run a regression where the dependent variable is the attrition variable. Include only "X"
7. Predict the probability that each child remains in the sample and store these predicted values
8. Construct attrition weights where these = predicted values from only "X" ÷ predicted values from "X" and "Z"
9. Run regressions with and without these weights
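The weight-building steps in this list (Steps 3 and 5 to 8) can be sketched in a few lines. This is a minimal illustration on synthetic data, not the RISE data or the authors' code: the variables `x`, `z`, `stay`, the coefficients used to simulate retention, and the helper `lpm_fitted` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical synthetic data standing in for the survey.
x = rng.normal(size=(n, 2))                         # "X": variables that affect the outcome
z = rng.integers(0, 2, size=(n, 1)).astype(float)   # "Z": affects retention only

# Retention indicator: 1 if the child is observed at endline, 0 otherwise.
p_true = np.clip(0.8 + 0.03 * x[:, 0] - 0.10 * z[:, 0], 0.01, 0.99)
stay = (rng.random(n) < p_true).astype(float)


def lpm_fitted(design, y):
    """Fit a linear probability model by OLS and return the fitted values."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta


const = np.ones((n, 1))
p_xz = lpm_fitted(np.hstack([const, x, z]), stay)   # Steps 3 and 5: X and Z
p_x = lpm_fitted(np.hstack([const, x]), stay)       # Steps 6 and 7: X only

# Step 8: weight = prediction from X only / prediction from X and Z.
weights = p_x / p_xz

print(f"weights range from {weights.min():.2f} to {weights.max():.2f}")
```

One caveat with a linear probability model is that fitted values are not guaranteed to lie strictly between zero and one, so it is worth checking that the denominator is positive before taking the ratio.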

## Step 1

For the purposes of illustration, we assume that end-of-year math scores are a function of:

• child’s age and sex;
• household wealth; and
• whether the Grade 4 math teacher scored in the upper tertile of the distribution of scores on the math test that we administered to math teachers.

Collectively, these are the X variables.

## Step 2

We identify three variables that we think might affect whether the child is present for the end-of-year math test but do not directly affect math test scores. These are:

• a dummy variable equaling one if the child has lived in the locality for five years or less – the idea here is that children (or their families) who are relatively recent arrivals in the area are less “connected” and thus more likely to drop out of school;
• a dummy variable equaling one if the road to the school is a mud track – the idea being that children may be less likely to be in school at the end of the year if it is physically difficult for them to get to school; and
• the number of older children in the household attending school – the logic here is that if other children in the household are attending school, the child in the RISE sample is more likely to attend school.

These are the Z variables.

## Step 3

Run the attrition regression with both the X and Z variables as regressors. Here the dependent variable equals one if the child completed the endline test and zero otherwise. We use a linear probability model; our standard errors account for clustering at the level of the sampling unit, the school. A negative coefficient means that the variable is associated with a lower likelihood that the child completed the end-of-year test; a positive coefficient means that the variable is associated with a higher likelihood of test completion.

| Variable | Coefficient | Standard error | P value |
| --- | --- | --- | --- |
| Child has lived in the locality for five years or less (=1 if yes) | -0.053 | 0.024 | 0.03 |
| Road to school is a mud track (=1 if yes) | -0.033 | 0.021 | 0.12 |
| Number of older children in household attending school | 0.013 | 0.013 | 0.02 |
| Start of year math score | 0.0004 | 0.000007 | <0.01 |
| Age | -0.019 | 0.004 | <0.01 |
| Sex (=1 if boy) | -0.026 | 0.012 | 0.03 |
| Household wealth index | 0.013 | 0.007 | 0.05 |
| Math teacher scored in the highest tertile on math content test | 0.006 | 0.022 | 0.80 |
| Constant | 0.856 | 0.056 | <0.01 |

Children who are recent arrivals to the locality are less likely to be tested at endline (put differently, they are more likely to drop out). Children are also less likely to be tested at endline when the road to the school is of poor quality (i.e., a mud track), though this coefficient is just outside the usual significance levels. Children in households with older children also attending school are more likely to complete the endline test (put differently, they are less likely to drop out).
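For readers unfamiliar with clustered standard errors in a linear probability model, the sandwich calculation can be sketched as follows. This is an assumption-laden illustration on synthetic data: the school identifiers, the single regressor `x`, and the simulated retention process are hypothetical, and no finite-sample correction is applied, whereas statistical packages usually apply one.

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools, per_school = 40, 25
school = np.repeat(np.arange(n_schools), per_school)  # hypothetical school ids
n = school.size

# Hypothetical retention data with a shock shared by children in the same school.
x = rng.normal(size=n)
school_shock = rng.normal(scale=0.05, size=n_schools)[school]
p_true = np.clip(0.8 + 0.03 * x + school_shock, 0.01, 0.99)
stay = (rng.random(n) < p_true).astype(float)

# Linear probability model fitted by OLS.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, stay, rcond=None)
resid = stay - design @ beta

# Sandwich estimator: (X'X)^-1 [sum over schools of s_g s_g'] (X'X)^-1,
# where s_g is the sum of x_i * residual_i within school g.
bread = np.linalg.inv(design.T @ design)
k = design.shape[1]
meat = np.zeros((k, k))
for g in range(n_schools):
    in_school = school == g
    score = design[in_school].T @ resid[in_school]
    meat += np.outer(score, score)

cluster_cov = bread @ meat @ bread
cluster_se = np.sqrt(np.diag(cluster_cov))

print("coefficients:", np.round(beta, 3))
print("cluster-robust standard errors:", np.round(cluster_se, 3))
```

Summing the scores within each school before taking the outer product is what allows residuals of children in the same school to be correlated.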

## Step 4

The F statistic on the joint significance of the three Z variables is 3.85 with a p-value of 0.011. This gives us confidence that we have variables that affect the likelihood of completing the end-of-year math test.
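The logic of the joint test can be sketched by comparing the fit of the attrition model with and without the Z variables. Note that this is the classical F statistic, which assumes homoskedastic, independent errors; it is a simpler stand-in and not necessarily the cluster-adjusted test behind the figure reported above. The data and the helper `rss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1500

# Hypothetical data: two X variables, three Z variables, and a retention indicator.
x = rng.normal(size=(n, 2))
z = rng.normal(size=(n, 3))
p_true = np.clip(0.8 + 0.02 * x[:, 0] - 0.05 * z[:, 0], 0.01, 0.99)
stay = (rng.random(n) < p_true).astype(float)


def rss(design, y):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


const = np.ones((n, 1))
unrestricted = np.hstack([const, x, z])   # X and Z
restricted = np.hstack([const, x])        # X only, i.e. Z coefficients forced to zero

q = z.shape[1]               # number of restrictions being tested
k = unrestricted.shape[1]    # parameters in the unrestricted model
rss_u = rss(unrestricted, stay)
rss_r = rss(restricted, stay)

f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - k))
print(f"F({q}, {n - k}) = {f_stat:.2f}")
```

A p-value would come from the upper tail of the F distribution with `q` and `n - k` degrees of freedom, for example via `scipy.stats.f.sf(f_stat, q, n - k)` if SciPy is available.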

## Steps 5, 6, and 7

These are omitted for brevity but are available on request.

## Step 8

Calculate the attrition weights. Having done so, we can graph their distribution using a kernel density function. There is some variability in the weights – they range from 0.88 to 1.12 – but the mass of the distribution is around 1. This hints at the possibility that the attrition-weighted regression estimates will not differ too much from the unweighted regressions.

## Step 9

Lastly, we estimate our value-added model without and with the sample attrition weights.
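A weighted regression of this kind can be sketched as weighted least squares, where each child's squared residual is multiplied by their weight. Everything below is hypothetical: the data are simulated, the weights are random placeholders drawn from the 0.88 to 1.12 range rather than estimated attrition weights, and `wls` is an illustrative helper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical value-added data; coefficients are made up for illustration.
baseline = rng.normal(500, 100, size=n)
teacher_top = rng.integers(0, 2, size=n).astype(float)
endline = 90 + 0.8 * baseline + 15 * teacher_top + rng.normal(0, 50, size=n)

# Placeholder weights; in practice these come from the attrition models.
weights = rng.uniform(0.88, 1.12, size=n)

design = np.column_stack([np.ones(n), baseline, teacher_top])


def wls(design, y, w):
    """Weighted least squares: minimise sum_i w_i * (y_i - x_i'b)^2."""
    root_w = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return beta


beta_unweighted = wls(design, endline, np.ones(n))
beta_weighted = wls(design, endline, weights)

print("unweighted:", np.round(beta_unweighted, 2))
print("weighted:  ", np.round(beta_weighted, 2))
```

Scaling each row by the square root of its weight and then running ordinary least squares gives the same coefficients as solving the weighted normal equations directly.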

Here are the regression results, where the outcome variable is the end-of-year math score without the weights.

#### Table 1: Regression results without sample attrition weights

| Variable | Coefficient | Standard error | P value |
| --- | --- | --- | --- |
| Start of year math score | 0.78 | 0.023 | <0.01 |
| Age | 3.48 | 0.98 | <0.01 |
| Sex (=1 if boy) | 2.80 | 2.76 | 0.31 |
| Household wealth index | 1.93 | 1.82 | 0.29 |
| Math teacher scored in the highest tertile on math content test | 16.27 | 4.89 | <0.01 |
| Constant | 86.84 | 13.94 | <0.01 |

After controlling for baseline test scores, child age, sex, and household wealth index, having a teacher in the top tertile of scores of the math test we administered to teachers is associated with an increase of 16 points on the endline math score.

And here are the regression results, where the outcome variable is the end-of-year math score, with the attrition weights.

#### Table 2: Regression results with sample attrition weights

| Variable | Coefficient | Standard error | P value |
| --- | --- | --- | --- |
| Start of year math score | 0.78 | 0.023 | <0.01 |
| Age | 3.46 | 0.98 | <0.01 |
| Sex (=1 if boy) | 2.83 | 2.77 | 0.30 |
| Household wealth index | 1.97 | 1.83 | 0.28 |
| Math teacher scored in the highest tertile on math content test | 16.33 | 4.88 | <0.01 |
| Constant | 87.00 | 13.90 | <0.01 |

The weighted parameter estimate for math teacher content knowledge is virtually the same as the unweighted estimate. So, in this case, sample attrition does not appear to be biasing these associations.
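To put "virtually the same" in numbers, the gap between the two teacher-knowledge coefficients can be compared with the standard error. This is an informal back-of-the-envelope check using the values from Tables 1 and 2, not a formal test of equality between the two estimates.

```python
# Teacher-knowledge coefficient and standard error from Table 1 (unweighted)
# and Table 2 (weighted).
unweighted_coef, unweighted_se = 16.27, 4.89
weighted_coef, weighted_se = 16.33, 4.88

difference = weighted_coef - unweighted_coef
relative_to_se = difference / unweighted_se

print(f"Difference: {difference:.2f} points")
print(f"As a share of the unweighted standard error: {relative_to_se:.1%}")
```

The difference of 0.06 points is roughly 1 percent of one standard error, far smaller than the sampling uncertainty in either estimate.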

It is important that researchers estimating these value-added models construct sample attrition weights using data appropriate to the settings where their study takes place. To that end, the method described here is relatively straightforward to implement.

RISE blog posts and podcasts reflect the views of the authors and do not necessarily represent the views of the organisation or our funders.