New York University
Not Drawn to Scale? RCTs and Education Reform in Developing Countries
Randomized controlled trials (RCTs) have recently come under attack for yet another reason: many (if not most) of the interventions that have been found successful in RCTs have not been adopted by governments, let alone taken to scale.
This fact is indisputable. One need only take a look at the policy lessons from the RCTs conducted by the Abdul Latif Jameel Poverty Action Lab (J-PAL) across education, health, governance, finance, and crime to confront the undeniable disconnect between the interventions that have been found most (cost-)effective and those at the top of government agendas.
To blame this entirely on governments (and not on RCTs) would be incorrect. At least in education, some of the earliest and best-known RCTs were implemented with relatively small convenience samples by well-functioning NGOs, often outside of school hours (e.g., Pratham’s reading camps in Uttar Pradesh, India), and sometimes even outside of the formal school system (e.g., Seva Mandir’s teacher attendance experiment in non-formal education centers in Rajasthan). Admittedly, this first generation of RCTs was limited in what it could tell us about whether interventions that were effective for specific populations when implemented with high fidelity would have the same impact when expanded to entire states or countries and implemented by regular teachers in formal schools during the school day. Ironically, it was an RCT of a contract-teacher reform in which the mode of implementation was randomized (either through an NGO or the government) that first illustrated this shortcoming most clearly.
Yet, in spite of the merits of the underlying facts, most criticisms of the potential of RCTs to identify interventions that can be taken to scale suffer from one or more problems: (a) they ignore the multiple purposes of RCTs; (b) they are based on an overly simplistic view of how interventions should scale; and (c) they fail to offer a clear alternative research strategy for informing system reform.
RCTs, you had one job
A fundamental problem with many critiques is that they assume that RCTs have a single purpose: to find out “what works” so we can take it to scale. This expectation is not entirely unfounded. Much of the early hype around RCTs was fueled by frustration with correlational studies on the one hand, and by the achievements of randomized clinical trials on the other. And it would be disingenuous to deny that some of the early talks, articles, and books by proponents of RCTs either explicitly encouraged or implicitly allowed for the unrealistic expectation that randomized trials would do for education what they had done for health.
Yet, however justified, this single-minded understanding of the purpose of RCTs is incomplete. In education, RCTs have been helpful in challenging conventional wisdom (e.g., “class size reductions are always good”), understanding the binding constraints of school systems (e.g., why textbooks do not necessarily lead to more learning, parts 1 and 2), establishing a “proof of concept” (e.g., performance pay for teachers), testing the external validity of successful interventions (e.g., contract teachers in Kenya and India), evaluating large-scale reforms (e.g., private school vouchers), and shedding light on questions of human behavior of interest to economists (e.g., intra-household bargaining in education decisions). To measure the success of RCTs over the past 20 years solely on the basis of whether they led to scale-ups misses some of the most interesting insights that these studies have generated.
To be fair, some of the RCTs mentioned above were not originally conceived for these purposes. Their authors genuinely believed they were testing an intervention that could be taken to scale. Thus, it may seem dishonest to impute these purposes after the fact. Yet, regardless of their intended purposes, these studies show that the insights produced by two decades of randomized evaluations are greater than the sum of the individual impact estimates.
Curb your enthusiasm (for scale)
Perhaps the biggest problem with existing critiques of RCTs is that they seem to be based on an overly simplistic model of how interventions should be scaled up. The model has two periods: in the first period, researchers evaluate an intervention and find it successful; in the second period, the government learns about the evaluation results and takes the intervention to scale.
This model may describe how input-based interventions in health could be scaled up (e.g., deworming or chlorine dispensers). Yet, for most initiatives worth scaling in education, which require reforming pedagogy and governance—demanding changes in the institutionalized, day-to-day activities of bureaucrats and teachers—this model seems overly naïve.
As Innovations for Poverty Action’s (IPA) founder Dean Karlan explained in his testimony to the U.S. Congress (video here and full text here), fighting poverty is a process. This process is much messier than the two-period model in some people’s minds or than researchers care to acknowledge publicly. At a minimum, it entails: (a) understanding the binding constraints of the system; (b) identifying an intervention that addresses those constraints; and (c) experimenting with different models of taking this intervention to scale.
This process is best exemplified by Pratham’s “Teaching at the Right Level” (TaRL) approach, the only education intervention that J-PAL South Asia actively seeks to scale up. The story that often gets told about TaRL is that it is the only education program that was evaluated in multiple RCTs and found successful across different contexts. Yet, the story of TaRL is actually much “messier”, and it is this mess that offers valuable insights into how interventions evaluated through RCTs can be taken to scale.
The idea of TaRL can actually be traced back to two of J-PAL’s earliest education RCTs, which were crucial in understanding the binding constraints of school systems in developing countries. The first was an RCT of remedial education and computer-assisted learning developed by the NGO Pratham in Maharashtra and Gujarat, India in 2001-2003; the second an RCT of ability tracking in Western Province, Kenya in 2005-2007. These studies drew attention to the fact that, in many low- and middle-income countries, teachers face both student groups with heterogeneous ability and high-stakes exams (typically, at the end of secondary school). Thus, teachers have strong incentives to target their instruction to their highest-performing students, since they are the ones most likely to take (and pass) these exams. The results of the RCT in India suggested that low-performing students were so far behind grade level that they could not understand what was taught to them at school. The results of the RCT in Kenya implied that all students, not just low-performing students, benefited from more homogeneous peer groups, if/when such groups helped teachers tailor instruction. (These two studies were so influential in Abhijit Banerjee and Esther Duflo’s thinking that the education chapter in their book Poor Economics was entitled “Top of the Class.”)
The proof of concept of TaRL can also be traced back to these early studies. The impressive impact of the remedial education and computer-assisted learning in Maharashtra and Gujarat suggested that tailoring instruction to children’s actual ability levels was a promising way of counteracting the adverse effects of highly heterogeneous, exam-driven school systems. Soon thereafter, an RCT of remedial learning camps led by community volunteers recruited by Pratham in Uttar Pradesh, India in 2005-2006 provided further confirmation of the potential of personalized instruction, while highlighting the difficulty of ensuring that children attend out-of-school activities.
The multiple iterations of what eventually became known as TaRL were assessed in two lesser-known RCTs that were pivotal in figuring out the optimal way to increase the level of personalization in the school system. The first RCT in Bihar and Uttarakhand in 2008-2010 compared four alternative ways of delivering a pedagogical approach developed by Pratham that required teachers to assess children’s learning levels using simple tools, group students based on their performance on those tests, and lead different activities for each group. This study found that public school teachers were capable of implementing the pedagogical approach outside school by themselves, or inside school when trained by Pratham staff and monitored by government officials, but they were unwilling and/or unable to implement it during school hours when they were by themselves (presumably, due to the aforementioned exam-driven culture that encourages teachers to follow the curriculum). The second RCT in Haryana in 2012-2013 tested a fifth alternative method of implementation and found that public school teachers could deliver this pedagogical approach during school hours, but only with intensive support from Pratham.
Nine years and five papers since the first RCT, the approach was branded as “Teaching at the Right Level” and J-PAL South Asia began advocating for its scale-up in the two versions that produced learning gains: (a) as a learning camp led by Pratham staff, which includes 50 three-hour sessions during the school day (in 10- or 20-day increments); or (b) as a teacher-led model, which includes one hour per day during the school day, with monitoring and mentoring by government officials. In both cases, Pratham is responsible for monitoring that the approach is implemented as intended. (The results from the multiple experiments are summarized in a joint paper by the authors of all the impact evaluations. The reflections from the decade-long journey of taking TaRL from a “proof of concept” to a scalable policy are documented in a companion paper by the same authors.)
The results of J-PAL’s scale-up efforts seem well worth the wait. The Indian state of Gujarat first tried TaRL in 310 schools in 2014-2015, then scaled it up to 2,000 schools in 2015-2016, and plans to take it to 4,000 schools in 2016-2017. Andhra Pradesh started with 1,800 schools this year and plans to take TaRL to 8,000 schools next year. Finally, Jharkhand and Delhi plan to take TaRL to 12,000 and 400 schools, respectively, in 2016-2017.
Given that the two problems that TaRL addresses—grade-based instruction and high-stakes exams—are not unique to India, TaRL is now also being taken to scale in other developing countries. Yet, the different approaches taken in these scale-ups are also illustrative of how the reality of scale-ups departs from the naïve two-period model.
In contexts where the binding constraints of the school system closely resemble those of India, where TaRL was conceived and prototyped, the emphasis has been on ensuring that it is implemented in a way that complies with the lessons learned from the evidence. For example, in Zambia, the Ministry of General Education’s rollout of the “Catch-up Program” (as TaRL is known in the country) will be implemented in Grades 3-5 across 80 schools in four districts in the Eastern and Southern provinces. J-PAL Africa will not conduct an impact evaluation of the program, under the assumption that it will translate well to the Zambian context. Instead, it will conduct a process evaluation to verify that the key components of the TaRL model (e.g., assessment of students, regrouping of students by performance levels, and differentiated activities by group) are implemented as intended and that children’s learning increases as expected. This evaluation also aims to help the government take ownership of the program and start holding teachers accountable for it. (For a thoughtful discussion of how to assess whether a program found effective in one context holds promise in another context, watch this excellent talk by Rachel Glennerster and/or read the article she subsequently wrote with Mary Ann Bates.)
In contexts where the needs of the system differ from those of India, the emphasis has been on testing the principles on which TaRL is based. For example, in Ghana, IPA has evaluated four versions of the “Teacher Community Assistant Initiative” (TCAI) in Grades 1-3 across a nationally representative sample of 500 schools in 42 districts: (a) one in which TCAs provide two hours of remedial instruction for the lowest-performing students during school; (b) another in which TCAs provide similar instruction after school; (c) a third in which TCAs pull a random subset of students out of the class for a few hours to review the teacher’s lesson; and (d) a fourth in which civil service teachers are trained to provide small-group instruction to low performers. The purpose of this evaluation was to understand which application of the principles of the TaRL model would be the most cost-effective to scale within the public school system in Ghana (see cost-effectiveness report here). As Karthik Muralidharan writes in his chapter on education in the Handbook of Field Experiments, “in trying to learn across contexts… it may be more appropriate to focus on principles that have been validated in multiple settings rather than the point estimates of specific studies.”
Many might find this road from research to policy to be too long and/or too messy, but no other education intervention that improves learning has either a more solid evidence base or a larger number of children reached through scale-ups. As Karthik Muralidharan and Paul Niehaus argue in a new article, “researchers have devoted more effort to persuading their institutional partners to randomize (for internal validity) than to be representative (for external validity)”, and several education policies (e.g., across-the-board pay increases or private school vouchers) have been evaluated directly at scale through randomized roll-outs. (Karthik Muralidharan makes a compelling case for this approach in this talk – starting at 1.10:28). Yet, the experience of TaRL suggests that the type of fundamental changes that school systems in developing countries need in pedagogy and governance often requires a more iterative process of prototyping and fine-tuning, or as Esther Duflo recently put it in her Ely lecture to the American Economic Association, adopting “the mindset of a plumber”.
RCTs: Turned down for what?
An important, yet oft-neglected, part of this debate is that critics of RCTs have thus far been unable to find a clearly superior alternative for identifying interventions for scale. There are many experiments underway worth tracking, including the Research for Improving Systems of Education (RISE) initiative, a GBP 27.6 million program funded by the United Kingdom’s Department for International Development (DFID) and Australia’s Department of Foreign Affairs and Trade (DFAT), which has recruited some of the leaders in the field to combine quasi-experimental and experimental studies with diagnostics of education systems and qualitative descriptions of the political process of the adoption of and resistance to education reforms. The World Bank’s Early Learning Partnership (ELP) is leading a similar project for early childhood development. Yet, for the most part, existing innovations along these lines have developed complements (not substitutes) to RCTs. And if the (brief) history of RCTs in education in developing countries has taught us anything, it is that we should proceed with a healthy dose of humility and realism in setting our expectations for the value-added of new approaches.
Ultimately, the debate on the potential of RCTs to identify interventions for scale resembles parallel discussions about other criticisms of RCTs, in which the arguments are based on truthful claims: they have been useful in moving the field forward, but they have yet to converge on clear alternatives. This should be sobering for “randomistas” and skeptics alike.
RISE blog posts and podcasts reflect the views of the authors and do not necessarily represent the views of the organisation or our funders.