AJ IMPACT EVALUATION CONSULTING


  • Power Analysis

    Power Analysis Explained: Why Sample Size Matters

    Understanding statistical power is crucial for rigorous impact evaluation

    By Aubrey Jolex | November 26, 2025

    You’re designing an impact evaluation. You’ve identified
    your research question, chosen your outcome measures, and decided on a randomized controlled trial (RCT)
    design. Now comes a critical question that many organizations get wrong:

    How many people do you need in your study?

    This is where power analysis comes in—one of the most important (and most misunderstood)
    concepts in impact evaluation.

    What Is Statistical Power?

    In simple terms, statistical power is the probability that your study will detect an
    effect if one actually exists.

    Think of it like a metal detector:

    • A high-powered detector can find small coins buried deep underground
    • A low-powered detector will only find large metal objects near the surface
    • With a weak detector, you might walk right over buried treasure and never know it was there

    In evaluation, power determines whether you’ll be able to detect your program’s true
    impact. A well-powered study can detect small but meaningful effects. An underpowered study might miss
    real impacts entirely.

    The Four Key Concepts

    Every power analysis involves four interrelated components:

    1. Sample Size (N)

    How many people are in your study

    2. Effect Size

    How large an impact you expect (or want to detect)

    3. Significance Level (α)

    Your threshold for calling a result “statistically significant” (usually 5%)

    4. Power (1-β)

    The probability of detecting a true effect (typically 80% or 90%)

    These four are mathematically linked. If you know three, you can calculate the fourth.
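    As a sketch of that link, here is how you might solve for sample size once the other three are fixed, using a standard normal-approximation formula for comparing two means (the function name and the example values are illustrative, not from our toolkit):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, mde, alpha=0.05, power=0.80):
    """Required sample size per arm for comparing two means.

    Normal-approximation formula:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / MDE) ** 2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = z.inv_cdf(power)            # power requirement
    return ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)

# Fix effect size (0.2 SD), alpha, and power; solve for N:
print(n_per_group(sigma=1.0, mde=0.2))  # 393 per arm
```

    Plug in your own outcome SD and minimum detectable effect: tightening alpha or raising power pushes the required N up, and halving the MDE roughly quadruples it.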

    Try Our Power Calculator

    Skip the complex formulas! Use our RCT Workflow Toolkit to calculate sample sizes
    instantly.

    ✓ Statistical power analysis and sample size determination
    ✓ Support for continuous and binary outcomes
    ✓ Individual and cluster randomization with ICC calculations
    ✓ Interactive power curves and baseline adjustments

    Launch Power
    Calculator →

    Why Sample Size Matters: A Real Example

    Let’s say you’re evaluating a girls’ education program. You believe the program increases secondary
    school enrollment by 10 percentage points (from 60% to 70%).

    Scenario 1: Small Sample (N=100)

    • Treatment group: 50 girls
    • Control group: 50 girls
    • Only about an 18% chance of detecting the +10pp effect

    Result: The evaluation will most likely conclude “no significant effect” even though the
    program works

    ✓ Scenario 2: Adequate Sample (N=700)

    • Treatment group: 350 girls
    • Control group: 350 girls
    • About an 80% chance of detecting the +10pp effect

    Result: The evaluation has sufficient power to detect the program’s impact

    The difference? Sample size. Too small, and you’re flying blind.
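    To see where power figures like these come from, here is a rough two-sided normal-approximation calculation for comparing two proportions (an illustrative sketch, not our toolkit’s code; exact numbers shift slightly depending on the formula variant used):

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test for a difference in proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treat * (1 - p_treat) / n_per_arm)
    return z.cdf(abs(p_treat - p_control) / se - z_alpha)

# Girls' education example: enrollment rises from 60% to 70%
print(round(power_two_proportions(0.60, 0.70, n_per_arm=50), 2))   # ≈ 0.18
print(round(power_two_proportions(0.60, 0.70, n_per_arm=350), 2))  # ≈ 0.8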

    The Consequences of Being Underpowered

    When studies are underpowered, several bad things happen:

    1. False Negatives (Type II Errors)

    Your program actually works, but your evaluation fails to detect it. You conclude the program is
    ineffective and shut it down, wasting a potentially valuable intervention.

    Real example: An early childhood
    education program genuinely improved child development, but the evaluation had only 100 children and
    failed to detect statistical significance. The program was defunded. Years later, a larger study
    with 800 children showed strong positive effects.

    2. Wasted Resources

    You spent money, time, and effort implementing an evaluation that was doomed from the start. All that
    investment in data collection, analysis, and reporting yields inconclusive results.

    3. Publication Bias

    Journals and funders favor statistically significant results. Underpowered studies that show “no effect”
    are less likely to be published, even if they were conducted rigorously.

    4. Incorrect Conclusions

    Sometimes, underpowered studies do find statistically significant results—but these are often false
    positives or wildly inflated effect sizes. This misleads future program designers.

    Conducting a Power Analysis: Step-by-Step

    Here’s how to do a power analysis for your evaluation:

    Step 1: Define Your Primary Outcome

    What is the one most important outcome you’re measuring? Examples: test scores,
    household income, clinic visits, business profit.

    Step 2: Estimate Baseline Variance

    How much does this outcome vary in your population? Get this from existing data, baseline
    surveys, published studies, or pilot data.

    Step 3: Define Minimum Detectable Effect (MDE)

    What’s the smallest program impact that would be practically meaningful to
    detect? This isn’t about what you hope for—it’s about what matters.

    Step 4: Choose Alpha and Power

    Standard choices: alpha (α) = 0.05 and power = 0.80 (an 80% chance of detecting a true effect),
    or 0.90 when missing a real effect would be especially costly.

    Step 5: Calculate Required Sample Size

    Use our power analysis toolkit to determine
    your needed sample size.

    Step 6: Adjust for Real-World Factors

    Account for attrition (dropout), clustering (village/school randomization), and stratification.

    Use Our Free RCT Field Flow
    Toolkit

    Comprehensive platform for
    managing your entire RCT lifecycle—including power calculations

    Power Calculations

    Statistical power analysis with ICC calculations and interactive
    power curves

    Randomization

    Treatment assignment with balance diagnostics and validation tools

    Analysis & Results

    Statistical analysis with treatment effects and heterogeneity
    analysis

    Practical Example: Sample Size Calculation

    Scenario: Evaluating a Savings Program

    • Outcome: Household savings
    • Standard deviation: 1,400 Philippine pesos
    • Mean household savings: 800 Philippine pesos
    • MDE: 100 Philippine pesos (minimum meaningful impact)
    • Alpha: 0.05
    • Power: 0.80

    Result: You need approximately 3,077 households per group (6,154
    total) to detect a 100-peso difference in savings with 80% power.
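    You can reproduce this figure with the standard two-sample formula (a sketch using only the Python standard library):

```python
from math import ceil
from statistics import NormalDist

z = NormalDist()
z_alpha = z.inv_cdf(1 - 0.05 / 2)   # ≈ 1.96 for alpha = 0.05, two-sided
z_beta = z.inv_cdf(0.80)            # ≈ 0.84 for 80% power

sigma, mde = 1400, 100              # pesos: outcome SD and minimum detectable effect
n_per_arm = ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)
print(n_per_arm, 2 * n_per_arm)     # 3077 6154
```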

    Adjustments for Real-World Factors

    Attrition

    If you expect to lose 10% of your sample at follow-up, inflate your enrolled sample so that the
    remaining respondents still meet the requirement:

    Example: N × 90% = 6,154
             N = 6,154 / 0.90
             N ≈ 6,838
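    In code, the attrition inflation is a one-liner (a sketch; rounding up keeps the analysable sample at or above the requirement):

```python
from math import ceil

required_n = 6154      # analysable sample from the savings example
attrition = 0.10       # expected 10% dropout by follow-up
enrolled_n = ceil(required_n / (1 - attrition))
print(enrolled_n)      # 6838
```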

    Clustering

    If randomizing groups (villages, schools), use design effect:

    Design effect = 1 + (m-1) × ICC
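    A quick sketch of how the design effect inflates a sample size requirement (the cluster size and ICC below are assumed illustrative values, not taken from the savings example):

```python
from math import ceil

def design_effect(m, icc):
    """Variance inflation from cluster randomization: DEFF = 1 + (m - 1) * ICC."""
    return 1 + (m - 1) * icc

deff = design_effect(m=20, icc=0.05)    # 20 respondents per cluster, ICC = 0.05
n_individual = 6154                     # unclustered requirement from the savings example
print(round(deff, 2), ceil(n_individual * deff))  # 1.95 12001
```

    Even a modest ICC nearly doubles the required sample here, which is why ignoring clustering is such a costly mistake.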

    Stratification

    Stratifying on baseline covariates can increase power, letting you detect smaller effects with
    the same sample size.

    Common Power Analysis Mistakes

    Mistake 1

    Doing power analysis AFTER data collection

    Power analysis must be done before you start. Post-hoc power analysis
    is statistically meaningless.

    Mistake 2

    Powering for multiple outcomes

    Pick your primary outcome and power for that. Other outcomes are
    exploratory.

    Mistake 3

    Using unrealistic effect sizes

    Be realistic based on prior research and theoretical expectations.

    Mistake 4

    Ignoring clustering

    If you randomize groups, account for it in your power calculations.

    What If You Can’t Afford the Required Sample?

    Power analysis might reveal that you need 2,000 participants, but you can only afford 500. What now?

    Option 1: Accept Lower Power

    Document that your study is underpowered. You might still detect large effects, but you’ll miss
    small-to-moderate effects.

    Option 2: Focus on Larger Effect Sizes

    Design your program to have bigger impacts. Instead of a light-touch intervention, implement
    something more intensive.

    Option 3: Use More Efficient Designs

    Strategies like stratification, baseline covariates, or within-subject designs can increase power
    without increasing sample size.

    Option 4: Postpone Until You Have Resources

    Sometimes it’s better to wait and do a properly powered study than to proceed with an
    underpowered one.

    Conclusion: Don’t Skip the Power Analysis

    Power analysis is not an optional luxury—it’s a fundamental requirement for any rigorous impact
    evaluation. Skipping it is like building a house without checking if the foundation can support the
    structure.

    Key Takeaways

    • Always conduct power analysis before starting your evaluation
    • Be realistic about effect sizes you want to detect
    • Account for attrition, clustering, and other real-world factors
    • Don’t proceed with an underpowered study unless you accept the risks

    Remember: An underpowered evaluation is
    worse than no evaluation at all. It wastes resources and generates misleading conclusions.

    Ready to Power Your Evaluation?

    Get started with our tools and expert guidance

    Use Our Toolkit

    Access the RCT Field Flow platform for power calculations, randomization, and complete RCT
    management.

    Launch Toolkit

    Free Consultation

    Need help with power analysis for your evaluation? Schedule a free consultation to discuss your study
    design.

    Book Now →

    More Resources

    Explore our blog for more guides on impact evaluation, RCT design, and statistical methods.

    Read Blog →

    About the Author

    Aubrey Jolex is the founder of AJ Impact Evaluation Consulting, specializing in rigorous impact
    evaluation for development programs. With 7+ years of experience with leading research organizations
    such as IPA, IFPRI and IITA, Aubrey has designed and powered a number of impact evaluations across
    multiple countries.

  • “5 Common RCT Design Mistakes (And How to Avoid Them)”







    5 Common RCT Design Mistakes

    And actionable strategies to avoid them in your evaluation

    By Aubrey Jolex | February 10, 2025 | 15 min read

    Randomized controlled trials (RCTs) are the gold
    standard for impact evaluation—when done correctly. But we’ve reviewed dozens of RCT
    designs over our years in the field, and we see the same mistakes repeatedly.

    These aren’t minor technical issues. They’re fundamental flaws that can invalidate your entire
    evaluation, waste resources, and lead to incorrect conclusions about program effectiveness.

    Mistake #1: Insufficient Statistical Power

    The Problem: You design an RCT with too small a sample to detect meaningful program
    effects.

    Real Example

    An NGO randomizes 20 schools (1,000 students). Sounds big enough? Wrong. After
    accounting for clustering, this design has only 35% power. Even if the program
    works, there’s a 65% chance the evaluation will conclude “no significant effect.”
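    The intuition is effective sample size: clustering makes 1,000 students behave like far fewer independent observations. Here is the arithmetic (the ICC of 0.10 is an assumed illustrative value, not taken from the example above):

```python
# Effective sample size under cluster randomization
m = 50                      # students per school (1,000 students / 20 schools)
icc = 0.10                  # assumed intra-cluster correlation (illustrative)
deff = 1 + (m - 1) * icc    # design effect
n_effective = 1000 / deff
print(round(n_effective))   # ~169 "independent" observations
```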

    The Fix: Proper Power Analysis

    Don’t guess your sample size. Use our RCT Field Flow Toolkit to calculate
    exactly what you need.

    • Calculate sample size for individual & cluster RCTs
    • Account for attrition and baseline correlation
    • Visualize power curves interactively

    Run Power
    Analysis →

    Mistake #2: Poor Randomization Implementation

    The Problem: Randomization is compromised by field staff or logistical errors,
    undermining the entire design.

    Common Failures

    • Staff “randomly” assigning based on need
    • Swapping participants after assignment
    • Using predictable patterns (every other person)

    How to Avoid It

    • Centralize randomization (computer-based)
    • Blind staff to assignment when possible
    • Lock assignment lists immediately

    Mistake #3: Measuring Outcomes Too Soon (Or
    Late)

    The Problem: You collect data before effects materialize or after they’ve faded.

    Too Soon

    Measuring employment 3 months after training. Job search takes time—you’ll miss the impact.

    Too Late

    Measuring health 5 years after short-term vitamin supplements. Effects may have faded.

    Rule of Thumb: Map your theory of change carefully.
    Education often needs 1-2 years; health might need 6-24 months depending on the outcome.

    Mistake #4: Ignoring Implementation Fidelity

    The Problem: You assume the program was implemented as designed, but it wasn’t. A
    “null result” might just mean the program never actually happened.

    Monitor Your Fieldwork Real-Time

    Use the Monitoring Dashboard in our toolkit to track implementation fidelity as
    it happens.

    • Track submissions per enumerator
    • Verify intervention delivery
    • Detect quality issues immediately

    Explore
    Dashboard →

    Mistake #5: Multiple Testing Without
    Correction

    The Problem: Testing 30 outcomes and reporting only the 2 “significant” ones. This
    is statistical cherry-picking.

    The Solution

    1. Pre-specify primary outcomes: Pick 1-2 main goals.
    2. Adjust p-values: Use Bonferroni or Benjamini-Hochberg corrections.
    3. Create indices: Combine related measures (e.g., “Empowerment Index”) to
      reduce the number of tests.
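    The Bonferroni correction in step 2 is simple enough to sketch in a few lines (Benjamini-Hochberg is similar in spirit but ranks the p-values first and controls the false discovery rate instead):

```python
# Bonferroni: multiply each p-value by the number of tests (capped at 1).
def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

raw = [0.001, 0.02, 0.04, 0.30]   # illustrative p-values from four outcomes
print(bonferroni(raw))            # [0.004, 0.08, 0.16, 1.0]
```

    Note how the 0.04 result — “significant” on its own — no longer clears the 0.05 bar once you account for testing four outcomes.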

    Bonus: Skipping Pre-Registration

    Pre-registration prevents “p-hacking” and “outcome switching.” Always register your study design and
    analysis plan on the AEA RCT Registry or OSF before collecting endline data.

    Conclusion

    Great RCT design requires care, technical knowledge, and commitment to rigor. Avoid these five
    mistakes:

    Checklist for Success

    • Conduct proper power analysis
    • Implement randomization with integrity
    • Time your measurement appropriately
    • Monitor implementation fidelity
    • Correct for multiple testing

    Avoid These Mistakes with Our Toolkit

    Tools designed to ensure rigor at every step

    RCT Field Flow

    Comprehensive platform for power analysis, randomization, and field monitoring.

    Launch Toolkit

    Expert Guidance

    Don’t risk your evaluation. Schedule a consultation to review your design.

    Book Consultation →

    Design Guides

    Read more about RCT design best practices in our blog.

    Read More →

    About the Author

    Aubrey Jolex has designed and implemented dozens of RCTs across Asia and Africa with 7+ years of
    experience at IFPRI. Learn from real-world experience—avoid costly mistakes in your
    evaluation.

  • “Why Rigorous Impact Evaluation Matters for Development”







    Why Rigorous Impact Evaluation Matters

    Moving beyond good intentions to measure true impact

    By Aubrey Jolex | November 23, 2025 | 10 min read

    Every year, billions of dollars flow into
    international
    development programs. Organizations are driven by a genuine desire to create positive change. Yet a
    critical question often goes unanswered: Are these programs actually working?

    This is where rigorous impact evaluation comes in—and why it matters more than ever.

    The Problem: Good Intentions Aren’t Enough

    Development practitioners work hard. But here’s the hard truth: activity doesn’t equal
    impact.

    Activity

    10,000 people attended training.
    50 schools received textbooks.
    100 farmers adopted new
    seeds.

    Impact?

    Did savings increase?
    Did test scores improve?
    Did farm income rise?

    Without rigorous evaluation, we’re operating in the dark. We might be wasting resources on programs
    that
    don’t work, or worse, cause harm.

    What Makes Evaluation “Rigorous”?

    A rigorous impact evaluation answers a specific causal question: Did this program cause the
    observed outcomes?

    The gold standard is the Randomized Controlled Trial (RCT). When that’s not
    feasible,
    Quasi-Experimental Designs (QEDs) like RDD or DID can provide credible evidence.

    Measure What Matters

    Don’t guess about your impact. We help organizations design and implement rigorous evaluations.

    • RCT and Quasi-Experimental Design
    • Power Analysis & Sample Size Calculation
    • Data Quality Assurance & Analysis

    Schedule Free
    Consultation

    Common Evaluation Pitfalls

    Before-and-After

    Comparing outcomes over time without a control group ignores external factors (economy,
    weather).

    Self-Selection

    Comparing volunteers to non-volunteers is biased because volunteers are more motivated.

    Small Samples

    Underpowered studies fail to detect real effects, leading to false “no impact” conclusions.

    Why Rigorous Evaluation Matters

    1. Learn What Works: Discover the truth about your
      program’s effectiveness.
    2. Stop Wasting Money: Cut losses on ineffective
      programs
      and double down on what works.
    3. Improve Programs: Learn which components drive
      impact
      and for whom.
    4. Build Credibility: Funders demand evidence.
      Rigorous
      proof gives you a competitive edge.
    5. Contribute to Knowledge: Help the global
      development
      community learn.

    Common Objections (And Why They’re Wrong)

    “Evaluation is too expensive”

    Reality: It costs 5-10% of the budget. Is it worth saving 5% to waste the other
    95%
    on a program that doesn’t work?

    “We know it works—we see it every day”

    Reality: Humans are biased. We remember successes and forget failures. We need
    objective data.

    “Randomization is unethical”

    Reality: When resources are scarce, a lottery is the fairest way to allocate
    them.

    Conclusion: Evidence Matters

    Rigorous impact evaluation isn’t a luxury. It’s a moral imperative. We owe it to the people we serve
    to
    ensure our programs actually improve their lives.

    Start Your Evaluation Journey

    • Think about evaluation from Day One
    • Invest in Power Analysis
    • Partner with Technical Experts
    • Commit to Learning (even from failure)

    Ready to Measure Your Impact?

    Get the tools and expertise you need

    Free Consultation

    Discuss your program and evaluation needs with an expert.

    Book Now →

    RCT Field Flow

    Our all-in-one toolkit for managing rigorous evaluations.

    Launch Toolkit

    Learn More

    Read our guides on Power Analysis, RCT Design, and more.

    Read Blog →

    About the Author

    Aubrey Jolex is the founder of AJ Impact Evaluation Consulting, specializing in rigorous impact
    evaluation for development programs. With 7+ years of experience at IFPRI and 6+ peer-reviewed
    publications, Aubrey helps organizations generate credible evidence of program impact.

  • Quasi-Experimental Designs








    Quasi-Experimental Designs

    Rigorous impact evaluation when RCTs aren’t possible

    By Aubrey Jolex | February 15, 2025 | 14 min read

    Randomized controlled trials (RCTs) are the gold standard
    for impact evaluation. But let’s be honest: sometimes an RCT just isn’t feasible.

    Maybe your program is already running, political constraints prevent withholding services, or you’re
    evaluating a national policy. Does this mean you can’t conduct rigorous impact evaluation?
    No.

    Enter quasi-experimental designs (QEDs)—methods that approximate experimental conditions
    without random assignment. When implemented carefully, QEDs can provide credible causal evidence.

    When to Use Quasi-Experimental Designs

    ✓ Feasibility

    Randomization is politically or ethically impossible.

    ✓ Timing

    Program is already implemented (too late for RCT).

    ✓ Scale

    Universal rollout prevents creating a control group.

    Four Common Quasi-Experimental Designs

    1. Regression Discontinuity Design (RDD)

    Best for: Programs with strict eligibility cutoffs (e.g., test scores, income
    thresholds).

    RDD compares people just above vs. just below the cutoff. If the cutoff is arbitrary, people on either
    side are likely very similar, making the difference in outcomes attributable to the program.

    Example: A scholarship for students scoring ≥80. Compare students scoring 79 vs. 81:
    they are nearly identical in ability, so any difference in later success is plausibly due to the scholarship.

    2. Difference-in-Differences (DID)

    Best for: Policy changes where you have pre- and post-data for treatment and comparison
    groups.

    DID compares the change over time in the treatment group vs. the change in the
    comparison group. This removes time-invariant differences and common trends.
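    The DID estimator itself is just a double subtraction over four cell means (the numbers below are made up for illustration):

```python
# Difference-in-differences with four cell means (illustrative numbers)
treat_pre, treat_post = 40.0, 55.0   # treatment group outcome means
comp_pre, comp_post = 42.0, 47.0     # comparison group outcome means

did = (treat_post - treat_pre) - (comp_post - comp_pre)
print(did)  # 15 - 5 = 10: the program's estimated effect
```

    The comparison group’s change (+5) stands in for what would have happened to the treatment group anyway; this is the parallel-trends assumption you must defend.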

    3. Propensity Score Matching (PSM)

    Best for: When you have rich data on all factors influencing participation.

    PSM creates a comparison group by matching each treated person with a non-participant who has similar
    characteristics (age, education, motivation, etc.).

    4. Instrumental Variables (IV)

    Best for: When you have a variable (instrument) that affects treatment but not outcomes
    directly.

    IV uses “natural experiments” (like distance to a clinic or lottery numbers) to isolate causal effects.

    📊 Need Advanced Analysis?

    QEDs require sophisticated statistical analysis to be credible. Our Analysis &
    Results
    module and consulting services can help.

    • Rigorous statistical modeling (RDD, DID, PSM)
    • Robustness checks and sensitivity analysis
    • Clear interpretation of causal claims

    Discuss Your Analysis
    Needs →

    Which Design Should You Use?

    1. Eligibility Cutoff?

    Yes → Consider RDD (Regression Discontinuity)

    2. Pre/Post Data?

    Yes → Consider DID (Difference-in-Differences)

    3. Rich Covariates?

    Yes → Consider PSM (Propensity Score Matching)

    4. Valid Instrument?

    Yes → Consider IV (Instrumental Variables)

    Strengthening Your Design

    Since QEDs rely on assumptions, you need to work harder to prove your results are robust:

    • Combine Methods: Use DID + PSM together for stronger evidence.
    • Falsification Tests: Test for effects on “placebo” outcomes that shouldn’t change.
    • Sensitivity Analysis: Show that results hold under different assumptions.
    • Transparency: Be honest about limitations and assumptions.

    Conclusion

    Quasi-experimental designs offer rigorous causal inference when RCTs aren’t feasible. But they’re not a
    free lunch—they require strong assumptions that must be justified and tested.

    Key Takeaways

    • Choose design based on available data and variation
    • Understand and defend your assumptions (Parallel Trends, Continuity)
    • Test robustness extensively with placebo tests
    • Be honest about limitations

    Expert Evaluation Support

    Whether RCT or QED, we help you measure what matters

    📅 Free Consultation

    Not sure which design fits your program? Let’s discuss your options.

    Book Now →

    🛠️ Analysis Tools

    Explore our toolkit for data management and analysis support.

    Launch Toolkit
    →

    📚 Methods Guide

    Read more about evaluation methodologies in our blog.

    Read Blog →

    About the Author

    Aubrey Jolex has designed and implemented both experimental and quasi-experimental evaluations across
    multiple countries. Let’s find the right approach for your context.


  • “Randomized Controlled Trials 101: The Gold Standard for Impact Evaluation”







    RCTs 101: The Gold Standard for Impact Evaluation

    Why randomization is the most powerful tool for measuring impact

    By Aubrey Jolex | February 1, 2025 | 12 min read

    Randomized controlled trials (RCTs) are the gold
    standard for measuring program impact—but what makes them so powerful? And how do you actually
    design and implement one correctly?

    Whether you’re evaluating a health intervention, education program, or poverty reduction strategy,
    understanding RCTs is essential for rigorous impact evaluation.

    What is an RCT?

    An RCT is an experimental design where participants are randomly assigned to either a treatment group
    (receives the program) or a control group (does not receive the program).

    The Key Principle: Randomization

    Random assignment ensures that, on average, treatment and control groups are identical except for the
    program itself. This means any difference in outcomes can be attributed to the program—not to
    pre-existing differences between groups.

    Why Randomization Matters

    Without randomization, you might compare program participants (who
    self-selected or were chosen) to non-participants. But these groups likely differ in motivation,
    resources, or other characteristics—making it impossible to isolate the program’s true impact.

    The Anatomy of an RCT

    1. Baseline Survey

    Collect data on participants before program starts

    2. Randomization

    Randomly assign participants to treatment or control

    3. Program Implementation

    Deliver the intervention to treatment group only

    4. Endline Survey

    Measure outcomes for both groups after program

    5. Analysis

    Compare treatment vs control group outcomes

    6. Reporting

    Document findings and program impact

    Types of Randomization

    Individual Randomization

    Assign individual people to treatment or control. Best when: the intervention is individual-level
    (e.g., a scholarship or training course)

    Cluster Randomization

    Assign groups (schools, villages, clinics) to treatment or control. Best when: the intervention
    operates at the group level or spillovers are a concern

    Implementing Randomization with Integrity

    Common Threats to Validity

    Warning: Compromised Randomization

    • Staff changing assignments based on “need”
    • Participants swapping between groups
    • Selective attrition from one group

    Best Practices for Clean Randomization

    Centralize Assignment

    Use computer-based randomization, not manual selection

    Blind When Possible

    Keep staff unaware of assignment until after baseline

    Lock Assignments

    Do not allow changes after randomization

    Document Everything

    Record randomization procedure and any deviations
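    These practices can be combined in a minimal sketch of centralized, reproducible assignment (the seed and ID range are illustrative; generate the list once, then lock it):

```python
import random

def randomize(ids, seed=20250201):
    """Reproducible 1:1 assignment: shuffle with a fixed seed, split in half."""
    rng = random.Random(seed)       # fixed seed documents the procedure
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": sorted(shuffled[:half]),
            "control": sorted(shuffled[half:])}

arms = randomize(range(1, 401))     # 400 participant IDs, 200 per arm
print(len(arms["treatment"]), len(arms["control"]))  # 200 200
```

    Because the seed is recorded, anyone can re-run the script and verify the assignment list was never tampered with.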

    Power Analysis: Getting Sample Size Right

    A critical step in RCT design is determining how many participants you need. Too few, and you won’t
    be able to detect program effects; too many, and you waste resources.

    Design Your RCT Like a Pro

    Use our RCT Field Flow Toolkit to document your research design, intervention
    logic, and prepare for randomization.

    • Centralized study planning hub
    • Document intervention logic and theory of change
    • Prepare for power calculations and randomization

    Start
    Designing →

    Common RCT Design Challenges

    Ethical Concerns

    Solution: Offer control group the program after study, use lottery for oversubscribed
    programs

    Spillovers

    Solution: Use cluster randomization, ensure sufficient distance between treatment/control

    Attrition

    Solution: Track participants carefully, over-sample in baseline, analyze differential
    attrition

    Conclusion

    RCTs provide the most credible evidence of program impact when done correctly. Key principles:

    RCT Success Checklist

    • Conduct proper power analysis
    • Implement randomization with integrity
    • Collect baseline data before randomization
    • Monitor implementation fidelity
    • Minimize and track attrition
    • Pre-register your analysis plan

    Ready to Design Your RCT?

    Get the tools and guidance you need

    Use Our Toolkit

    Comprehensive platform for power analysis, randomization, and complete RCT management.

    Launch Toolkit

    Free Consultation

    Get expert guidance on your RCT design and implementation strategy.

    Book Now →

    More Resources

    Explore our blog for more guides on RCT design and impact evaluation.

    Read Blog →

    About the Author

    Aubrey Jolex has designed and implemented dozens of RCTs across Asia and Africa with 7+ years of
    experience at IFPRI. Learn from real-world experience to implement rigorous evaluations.