The Bayesian Reproducibility Project

Alexander Etz on why we need a better metric for "success" in reproducibility.

Based on these two metrics, the headlines are accurate: Over half of the replications “failed”. But these two reproducibility metrics are either invalid (comparing significance levels across experiments) or very vague (confidence interval agreement). They also only offer binary answers: A replication either “succeeds” or “fails”, and this binary thinking leads to absurd conclusions in some cases like those mentioned above. Is replicability really so black and white? I will explain below how I think we should measure replicability in a Bayesian way, with a continuous measure that can find reasonable answers with replication effects near zero with wide CIs, effects near the original with tight CIs, effects near zero with tight CIs, replication effects that go in the opposite direction, and anything in between.

Daniel Lakens: Power of replications in the Reproducibility Project

Nice take.

For now, it means 35 out of 97 replicated effects have become quite a bit more likely to be true. We have learned something about what predicts replicability. For example, at least for some indicators of replication success, “Surprising effects were less reproducible” (take note, journalists and editors of Psychological Science!). For the studies that did not replicate, we have more data, which can inform not just our statistical inferences, but also our theoretical inferences. The Reproducibility Project demonstrates large scale collaborative efforts can work, so if you still believe in an effect that did not replicate, get some people together, collect enough data, and let me know what you find.

Why you should use omega-squared instead of eta-squared

Nice post from Daniel Lakens.

The table shows the bias. With four groups of n = 20, a One-Way ANOVA with a medium effect (true η² = 0.0588) will overestimate the true effect size on average by 0.0347, for an average observed η² of 0.0588 + 0.0347 = 0.0935. We can see that for small effects (η² = 0.0099) the bias is actually larger than the true effect size (up to ANOVAs with 70 participants in each condition).

When there is no true effect, η² from small studies can easily give the wrong impression that there is a real small-to-medium effect, purely due to this bias. The p-value would typically not be statistically significant, but the overestimation becomes a problem if you ignore the p-value and focus only on estimation.
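The bias is easy to see in a quick simulation (my own Python sketch, not Lakens's original R code): with four groups of n = 20 and no true effect at all, the average observed η² sits well above zero, while ω² stays near its true value of zero.

```python
import numpy as np

# Simulate a one-way ANOVA with four groups of n = 20 and NO true effect,
# then compare the average observed eta-squared and omega-squared.
rng = np.random.default_rng(42)
k, n, sims = 4, 20, 5000

eta_sq, omega_sq = [], []
for _ in range(sims):
    data = rng.normal(size=(k, n))  # all group means are truly equal
    grand_mean = data.mean()
    ss_between = n * ((data.mean(axis=1) - grand_mean) ** 2).sum()
    ss_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum()
    ss_total = ss_between + ss_within
    ms_within = ss_within / (k * (n - 1))
    eta_sq.append(ss_between / ss_total)
    omega_sq.append((ss_between - (k - 1) * ms_within) / (ss_total + ms_within))

print(f"mean eta-squared:   {np.mean(eta_sq):.4f}")    # noticeably above 0
print(f"mean omega-squared: {np.mean(omega_sq):.4f}")  # close to 0
```

The ω² formula subtracts the expected contribution of sampling noise from the between-groups sum of squares, which is exactly the part η² wrongly counts as effect.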

Daniel Lakens: The perfect t-test

Great idea from Daniel Lakens—an R script that helps you properly compare two groups.

The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increase the use of novel statistical procedures. Download the script here:

How many participants should you collect? An alternative to the N * 2.5 rule

Another great post from Daniel Lakens.

  1. Determine the maximum sample size you are willing to collect (e.g., N = 400).
  2. Plan equally spaced analyses (e.g., four looks at the data, after 50, 100, 150, and 200 participants per condition in a two-sample t-test).
  3. Use alpha levels for each of the four looks at your data that control the Type 1 error rate (e.g., for four looks: 0.018, 0.019, 0.020, and 0.021; for three looks: 0.023, 0.023, and 0.024; for two looks: 0.031 and 0.030).
  4. Calculate one-sided p-values and JZS Bayes Factors (with a scale r on the effect size of 0.5) at every analysis. Stop when the effect is statistically significant and/or JZS Bayes Factors > 3. Stop when there is support for the null hypothesis based on a JZS Bayes Factor < 0.3. If the results are inconclusive, continue. In small samples (e.g., 50 participants per condition) the risk of Type 1 errors when accepting the null using Bayes Factors is relatively high, so always interpret results from small samples with caution.
  5. When the maximum sample size is reached without providing convincing evidence for the null or alternative hypothesis, interpret the Bayes Factor while acknowledging the Bayes Factor provides weak support for either the null or the alternative hypothesis. Conclude that based on the power you had to observe a small effect size (e.g., 91% power to observe a d = 0.3) the true effect size is most likely either zero or small.
  6. Report the effect size and its 95% confidence interval, and interpret it in relation to other findings in the literature or to theoretical predictions about the size of the effect.
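Step 4 can be sketched in a few lines of Python (my own translation, not Lakens's script). The JZS Bayes factor is computed here by numerically integrating the two-sample formula from Rouder et al. (2009) with a Cauchy prior scale of r = 0.5, and the one-sided p-value is compared against the alpha level for the current look; the function and variable names are mine.

```python
import math
import numpy as np
from scipy import stats
from scipy.integrate import quad

def jzs_bf10(t, n1, n2, r=0.5):
    """Two-sample JZS Bayes factor (alternative vs. null), Cauchy prior scale r.
    Numerical integration of the Rouder et al. (2009) formula."""
    nu = n1 + n2 - 2
    N = n1 * n2 / (n1 + n2)  # effective sample size
    def integrand(g):        # g is the mixing variance of the Cauchy prior
        return ((1 + N * g) ** -0.5
                * (1 + t ** 2 / ((1 + N * g) * nu)) ** (-(nu + 1) / 2)
                * r / math.sqrt(2 * math.pi)
                * g ** -1.5 * math.exp(-r ** 2 / (2 * g)))
    numerator, _ = quad(integrand, 0, np.inf)
    denominator = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)
    return numerator / denominator

def look_decision(x, y, alpha_this_look):
    """One interim look: stop for the effect, stop for the null, or continue."""
    t, p_two_sided = stats.ttest_ind(x, y)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    bf = jzs_bf10(t, len(x), len(y))
    if p_one_sided < alpha_this_look or bf > 3:
        return "stop: evidence for the effect", bf
    if bf < 1 / 3:
        return "stop: evidence for the null", bf
    return "continue sampling", bf

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1, 50)  # simulated data with a true effect of d = 0.5
y = rng.normal(0.0, 1, 50)
decision, bf = look_decision(x, y, alpha_this_look=0.018)  # first of four looks
print(decision, round(bf, 2))
```

In practice you would call `look_decision` at each planned look with that look's alpha level, and only fall through to steps 5 and 6 if the maximum sample size is reached without a stop.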

Why big data is in trouble: they forgot about applied statistics

Turns out doing a proper statistical analysis actually takes some thought and knowledge.

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

Always use Welch's t-test instead of Student's t-test

Helpful post from Daniel Lakens with simulations and explanations.

Take home message of this post: We should use Welch’s t-test by default, instead of Student’s t-test, because Welch's t-test performs better than Student's t-test whenever sample sizes and variances are unequal between groups, and gives the same result when sample sizes and variances are equal.
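A quick simulation along the lines of Lakens's (this is my own sketch, using scipy rather than his R code) makes the point: under the null, when the smaller group also has the larger variance, Student's t-test rejects far more than 5% of the time, while Welch's test stays close to the nominal rate.

```python
import numpy as np
from scipy import stats

# Null simulation: both group means are 0, but the smaller group (n = 20)
# has the larger standard deviation (3 vs. 1). Count false positives.
rng = np.random.default_rng(7)
sims = 5000
student_fp = welch_fp = 0
for _ in range(sims):
    a = rng.normal(0, 3, 20)
    b = rng.normal(0, 1, 80)
    if stats.ttest_ind(a, b, equal_var=True).pvalue < 0.05:   # Student
        student_fp += 1
    if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:  # Welch
        welch_fp += 1

print(f"Student Type 1 error rate: {student_fp / sims:.3f}")  # far above 0.05
print(f"Welch Type 1 error rate:   {welch_fp / sims:.3f}")    # close to 0.05
```

In scipy, switching between the two tests is just the `equal_var` flag of `stats.ttest_ind`, so defaulting to Welch costs nothing.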

What does a Bayes factor look like?

Helpful visualization from Felix Schönbrodt.

To summarize: Whether strong evidence “hits you between the eyes” depends on many things – the kind of test, the kind of visualization, the sample size. Sometimes a BF of 2.5 seems obvious, and sometimes it is hard to spot a BF > 100 by eyeballing alone. Overall, I’m glad that we have a numeric measure of strength of evidence and do not have to rely on eyeballing only.