The principle is that whenever one finds something using a computational analysis that fits with one’s predictions or seems like a “cool” finding, they should assume that it’s due to an error in the code rather than reflecting reality. Having made this assumption, one should then do everything they can to find out what kind of error could have resulted in the effect. This is really no different from the strategy that experimental scientists use (in theory), in which upon finding an effect they test every conceivable confound in order to rule them out as a cause of the effect. However, I find that this kind of thinking is much less common in computational analyses. Instead, when something “works” (i.e. gives us an answer we like) we run with it, whereas when the code doesn’t give us a good answer then we dig around for different ways to do the analysis that give a more satisfying answer. Because we will be more likely to accept errors that fit our hypotheses than those that do not due to confirmation bias, this procedure is guaranteed to increase the overall error rate of our research. If this sounds a lot like p-hacking, that’s because it is.
Social scientists often seek to demonstrate that a construct has incremental validity over and above other related constructs. However, these claims are typically supported by measurement-level models that fail to consider the effects of measurement (un)reliability. We use intuitive examples, Monte Carlo simulations, and a novel analytical framework to demonstrate that common strategies for establishing incremental construct validity using multiple regression analysis exhibit extremely high Type I error rates under parameter regimes common in many psychological domains. Counterintuitively, we find that error rates are highest—in some cases approaching 100%—when sample sizes are large and reliability is moderate. Our findings suggest that a potentially large proportion of incremental validity claims made in the literature are spurious. We present a web application (http://jakewestfall.org/ivy/) that readers can use to explore the statistical properties of these and other incremental validity arguments. We conclude by reviewing SEM-based statistical approaches that appropriately control the Type I error rate when attempting to establish incremental validity.
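To get a feel for the argument in that abstract, here is a minimal Monte Carlo sketch. It is not the authors' code or their web application. The setup is my own assumption: two correlated latent constructs, an outcome that depends only on the first, and two noisy measures with equal reliability. The function names, the default correlation of 0.5, and the normal-approximation cutoff of 1.96 are all illustrative choices.

```python
import math
import random


def t_for_second_predictor(y, x1, x2):
    """t statistic for the x2 slope in an OLS fit of y on x1 and x2 (with intercept)."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    x1c = [v - m1 for v in x1]
    x2c = [v - m2 for v in x2]
    s11 = sum(a * a for a in x1c)
    s22 = sum(b * b for b in x2c)
    s12 = sum(a * b for a, b in zip(x1c, x2c))
    s1y = sum(a * v for a, v in zip(x1c, yc))
    s2y = sum(b * v for b, v in zip(x2c, yc))
    det = s11 * s22 - s12 * s12
    # Solve the 2x2 normal equations for the two slopes.
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    rss = sum((v - b1 * a - b2 * b) ** 2 for v, a, b in zip(yc, x1c, x2c))
    sigma2 = rss / (n - 3)  # three estimated parameters: intercept and two slopes
    se_b2 = math.sqrt(sigma2 * s11 / det)
    return b2 / se_b2


def false_positive_rate(n, reliability, rho=0.5, n_sims=500, seed=0):
    """Share of simulated studies that 'find' incremental validity for x2.

    Assumed data-generating model (hypothetical): the outcome depends only on
    construct 1, so any significant x2 slope is a construct-level false positive.
    """
    rng = random.Random(seed)
    # Error SD chosen so that var(true) / var(observed) equals the reliability.
    err_sd = math.sqrt((1 - reliability) / reliability)
    hits = 0
    for _ in range(n_sims):
        y, x1, x2 = [], [], []
        for _ in range(n):
            t1 = rng.gauss(0, 1)
            t2 = rho * t1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
            y.append(t1 + rng.gauss(0, 1))
            x1.append(t1 + rng.gauss(0, err_sd))
            x2.append(t2 + rng.gauss(0, err_sd))
        # 1.96 is the large-sample normal cutoff, used here as an approximation.
        if abs(t_for_second_predictor(y, x1, x2)) > 1.96:
            hits += 1
    return hits / n_sims
```

Under these assumptions, `false_positive_rate(500, 1.0)` should sit near the nominal 5%, while `false_positive_rate(500, 0.7)` should come out far higher, and raising `n` should make it worse, which matches the counterintuitive pattern the abstract describes.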
Good summary and interesting background on the ASA's statement on p-values.
Let’s be clear. Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail. We hoped that a statement from the world’s largest professional association of statisticians would open a fresh discussion and draw renewed and vigorous attention to changing the practice of science with regards to the use of statistical inference.
Good reminder (and a good analogy) from Dorothy Bishop on reporting all variables we test:
Quite simply p-values are only interpretable if you have the full context: if you pull out the 'significant' variables and pretend you did not test the others, you will be fooling yourself - and other people - by mistaking chance fluctuations for genuine effects.
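A quick way to see the point is to simulate it. The sketch below is my own illustration, not something from Bishop's post: every variable is pure noise, each one is tested with a simple z-test (known standard deviation of 1, which is an assumption made to keep the code short), and we count how often at least one variable comes out "significant". If the tests are independent, the expected share is 1 - 0.95^k for k variables, which is about 0.64 for k = 20.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def chance_of_spurious_hit(n_variables, n_per_group=20, n_studies=1000,
                           alpha=0.05, seed=1):
    """Share of null studies in which at least one variable reaches p < alpha."""
    rng = random.Random(seed)
    # Standard error of a difference in means when sd = 1 is known (assumption).
    se = math.sqrt(2 / n_per_group)
    studies_with_hit = 0
    for _ in range(n_studies):
        for _ in range(n_variables):
            # Both groups come from the same distribution: there is no real effect.
            mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
            mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
            if two_sided_p((mean_a - mean_b) / se) < alpha:
                studies_with_hit += 1
                break  # one "significant" variable is enough to report
    return studies_with_hit / n_studies
```

Comparing `chance_of_spurious_hit(1)` with `chance_of_spurious_hit(20)` shows why a reported p-value means little without knowing how many other variables were tested and left out.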
The most widely used task fMRI analyses use parametric methods that depend on a variety of assumptions. While individual aspects of these fMRI models have been evaluated, they have not been evaluated in a comprehensive manner with empirical data. In this work, a total of 2 million random task fMRI group analyses have been performed using resting state fMRI data, to compute empirical familywise error rates for the software packages SPM, FSL and AFNI, as well as a standard non-parametric permutation method. While there is some variation, for a nominal familywise error rate of 5% the parametric statistical methods are shown to be conservative for voxel-wise inference and invalid for cluster-wise inference; in particular, cluster size inference with a cluster defining threshold of p = 0.01 generates familywise error rates up to 60%. We conduct a number of follow up analyses and investigations that suggest the cause of the invalid cluster inferences is spatial auto correlation functions that do not follow the assumed Gaussian shape. By comparison, the non-parametric permutation test, which is based on a small number of assumptions, is found to produce valid results for voxel as well as cluster wise inference. Using real task data, we compare the results between one parametric method and the permutation test, and find stark differences in the conclusions drawn between the two using cluster inference. These findings speak to the need of validating the statistical methods being used in the neuroimaging field.
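For readers who have not met permutation tests, here is a toy sketch of the general idea. It is not the method or code used in the paper. I am assuming a one-sample design, random sign flipping of each subject's data (which requires errors that are symmetric around zero), a plain mean as the test statistic, and the maximum statistic across "voxels" to control the familywise error rate. The function name and data layout are hypothetical.

```python
import random


def sign_flip_max_test(data, n_perm=1000, seed=0):
    """Familywise-corrected p-values from a sign-flipping max-statistic test.

    data: list of subjects, each a list with one value per voxel.
    Returns one corrected p-value per voxel.
    """
    rng = random.Random(seed)
    n_subj = len(data)
    n_vox = len(data[0])

    def abs_means(signs):
        # Absolute group mean at every voxel after applying per-subject signs.
        return [
            abs(sum(s * row[v] for s, row in zip(signs, data)) / n_subj)
            for v in range(n_vox)
        ]

    observed = abs_means([1] * n_subj)

    # Null distribution of the largest statistic anywhere in the "image".
    null_max = []
    for _ in range(n_perm):
        signs = [rng.choice((-1, 1)) for _ in range(n_subj)]
        null_max.append(max(abs_means(signs)))

    # Counting the observed data as one permutation keeps every p-value above zero.
    return [
        (1 + sum(m >= obs for m in null_max)) / (1 + n_perm)
        for obs in observed
    ]
```

The design choice worth noting is that the null distribution is built from the data themselves, so nothing is assumed about the shape of the spatial autocorrelation, which is where the abstract says the parametric cluster methods go wrong.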
Seems like an important paper, and a very cool website.
In this paper, we argue that advocacy of CIs is based on a folk understanding rather than a principled understanding of CI theory. We outline three fallacies underlying the folk theory of CIs, and place these in the philosophical and historical context of CI theory proper. Through an accessible example adapted from the statistical literature, we show how CI theory differs from the folk theory of CIs. Finally, we show the fallacies of confidence in the context of a CI advocated and commonly used for ANOVA and regression analysis, and discuss the implications of the mismatch between CI theory and the folk theory of CIs.
Alexander Etz on why we need a better metric for "success" in reproducibility.
Based on these two metrics, the headlines are accurate: Over half of the replications “failed”. But these two reproducibility metrics are either invalid (comparing significance levels across experiments) or very vague (confidence interval agreement). They also only offer binary answers: A replication either “succeeds” or “fails”, and this binary thinking leads to absurd conclusions in some cases like those mentioned above. Is replicability really so black and white? I will explain below how I think we should measure replicability in a Bayesian way, with a continuous measure that can find reasonable answers with replication effects near zero with wide CIs, effects near the original with tight CIs, effects near zero with tight CIs, replication effects that go in the opposite direction, and anything in between.
Great idea from Daniel Lakens—an R script that helps you properly compare two groups.
The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increase the use of novel statistical procedures. Download the script here: https://github.com/Lakens/Perfect-t-test.
Good reminder that there is a lot more to improving the quality of science than p values.
P values are an easy target: being widely used, they are widely abused. But, in practice, deregulating statistical significance opens the door to even more ways to game statistics — intentionally or unintentionally — to get a result. Replacing P values with Bayes factors or another statistic is ultimately about choosing a different trade-off of true positives and false positives. Arguing about the P value is like focusing on a single misspelling, rather than on the faulty logic of a sentence.
He even has a mythical origin story. He was raised in Greece, the home of Pythagoras and Euclid, by physician-researchers who instilled in him a love of mathematics. By seven, he quantified his affection for family members with a "love numbers" system. ("My mother was getting 1,024.42," he said. "My grandmother, 173.73.")
and thoughts on how to improve science:
Recently there’s increasing emphasis on trying to have post-publication review. Once a paper is published, you can comment on it, raise questions or concerns. But most of these efforts don’t have an incentive structure in place that would help them take off. There’s also no incentive for scientists or other stakeholders to make a very thorough and critical review of a study, to try to reproduce it, or to probe systematically and spend real effort on re-analysis. We need to find ways people would be rewarded for this type of reproducibility or bias checks.