As far as significance testing goes (or anything else that does essentially the same thing as significance testing), I have long thought that the best approach in most situations is likely to be estimating a standardized effect size, with a 95% confidence interval about that effect size. There's nothing really new there--mathematically you can shuffle back and forth between them--if the p-value for a 'nil' null is <.05, then 0 will lie outside of a 95% CI, and vice versa. The advantage of this, in my opinion, is psychological; that is, it makes salient information that exists but that people can't see when only p-values are reported. For example, it is easy to see that an effect is wildly 'significant', but ridiculously small; or 'non-significant', but only because the error bars are huge whereas the estimated effect is more or less what you expected. These can be paired with raw values and their CIs.
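To make the correspondence concrete, here is a minimal sketch (my own made-up data, purely for illustration) showing that for a one-sample t-test against a 'nil' null, p < .05 exactly when 0 falls outside the 95% CI built from the same t distribution:

```python
# Minimal illustration of the p-value / CI duality for a one-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.3, scale=1.0, size=40)  # made-up data for illustration

# One-sample t-test against 0 (the 'nil' null)
t_stat, p_value = stats.ttest_1samp(x, popmean=0.0)

# 95% CI for the mean, built from the same t distribution
se = x.std(ddof=1) / np.sqrt(len(x))
half_width = stats.t.ppf(0.975, df=len(x) - 1) * se
ci = (x.mean() - half_width, x.mean() + half_width)

# These always agree: p < .05 exactly when 0 lies outside the 95% CI
print(p_value < 0.05, not (ci[0] <= 0 <= ci[1]))
```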
Is it more informative to report the raw difference with its CI, or a standardized effect such as d = −1.6 ± .5? I tend to think that there can still be value in reporting both, and functions can be written to compute these so that it's very little extra work, but I recognize that opinions will vary. At any rate, the first part of my response is that point estimates with confidence intervals should replace p-values.
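As a sketch of how little extra work this is, here are a couple of hypothetical helper functions (the names and the normal-approximation CI for d are just one reasonable choice, not the only way to do it) that return a two-sample Cohen's d and the raw mean difference, each with a 95% CI:

```python
# Hypothetical helpers (purely illustrative) for reporting a standardized
# effect size and the raw difference, each with a 95% CI.
import numpy as np
from scipy import stats

def cohens_d_with_ci(x, y, alpha=0.05):
    """Two-sample Cohen's d with an approximate (1 - alpha) CI
    (normal approximation to the sampling distribution of d)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    sp = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    d = (x.mean() - y.mean()) / sp
    se_d = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se_d, d + z * se_d)

def raw_diff_with_ci(x, y, alpha=0.05):
    """Raw mean difference with a Welch-type (1 - alpha) CI."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    diff = x.mean() - y.mean()
    se = np.sqrt(vx + vy)
    df = (vx + vy)**2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    t = stats.t.ppf(1 - alpha / 2, df)
    return diff, (diff - t * se, diff + t * se)

# Example with simulated data
rng = np.random.default_rng(0)
group_a, group_b = rng.normal(0.0, 1, 50), rng.normal(0.4, 1, 50)
print(cohens_d_with_ci(group_a, group_b))
print(raw_diff_with_ci(group_a, group_b))
```

An exact interval based on the noncentral t distribution would be preferable in practice; the point is only that once functions like these exist, reporting both quantities costs almost nothing.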
On the other hand, I think a bigger question is whether the thing that significance testing does is what we really want. I think the real problem is that for most people analyzing data (i.e., practitioners rather than statisticians), significance testing can become the entirety of data analysis. It seems to me that the most important thing is to have a principled way to think about what is going on with our data, and null hypothesis significance testing is, at best, a very small part of that. Let me give an imaginary example (I acknowledge that this is a caricature, but unfortunately, I fear it is somewhat plausible):
Bob conducts a study, gathering data on something-or-other. He
expects the data will be normally distributed, clustering tightly
around some value, and intends to conduct a one-sample t-test to see
if his data are 'significantly different' from some pre-specified
value. After collecting his sample, he checks to see if his data are
normally distributed, and finds that they are not. Instead, they do
not have a pronounced lump in the center but are relatively high over a given
interval and then trail off with a long left tail. Bob worries about
what he should do to ensure that his test is valid. He ends up doing
something (e.g., a transformation, a non-parametric test, etc.), and
then reports a test statistic and a p-value.
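To make the caricature concrete, Bob's entire analysis might amount to something like the following (a purely hypothetical sketch with invented data; it is not meant as a recommendation):

```python
# A hypothetical sketch of Bob's rote pipeline: check normality, then
# mechanically swap in a non-parametric test if the check fails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = 5 - rng.gamma(shape=2.0, scale=1.0, size=60)  # left-skewed: long left tail
mu0 = 3.0  # the pre-specified value Bob wants to test against

if stats.shapiro(data).pvalue > 0.05:
    stat, p = stats.ttest_1samp(data, popmean=mu0)  # data 'look normal': t-test
else:
    stat, p = stats.wilcoxon(data - mu0)            # otherwise: Wilcoxon signed-rank

print(stat, p)  # Bob reports these and moves on; the shape of the data is never examined
```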
I hope this doesn't come off as nasty. I don't mean to mock anyone, but I think something like this does happen occasionally. Should this scenario occur, we can all agree that it is poor data analysis. However, the problem isn't that the test statistic or the p-value is wrong; we can posit that the data were handled properly in that respect. I would argue that the problem is that Bob is engaged in what Cleveland called "rote data analysis". He appears to believe that the only point is to get the right p-value, and thinks very little about his data outside of pursuing that goal. He could even have switched over to my suggestion above and reported a standardized effect size with a 95% confidence interval, and it wouldn't have changed what I see as the larger problem (this is what I meant by doing "essentially the same thing" by a different means). In this specific case, the fact that the data didn't look the way he expected (i.e., weren't normal) is real information--it's interesting, and very possibly important--but that information is essentially just thrown away. Bob doesn't recognize this because of the focus on significance testing. To my mind, that is the real problem with significance testing.
Let me address a few other perspectives that have been mentioned, and I want to be very clear that I am not criticizing anyone.
- It is often mentioned that many people don't really understand p-values (e.g., thinking they're the probability the null is true, etc.). It is sometimes argued that, if only people would use the Bayesian approach, these problems would go away. I believe that people can approach Bayesian data analysis in a manner that is just as incurious and mechanical. However, I think that misunderstanding the meaning of p-values would be less harmful if no one thought getting a p-value was the goal.
- The existence of 'big data' is generally unrelated to this issue. Big data only make it obvious that organizing data analysis around 'significance' is not a helpful approach.
- I do not believe the problem is with the hypothesis being tested. If people only wanted to see if the estimated value is outside of an interval, rather than if it's equal to a point value, many of the same issues could arise. (Again, I want to be clear I know you are not 'Bob'.)
- For the record, I want to mention that my own suggestion from the first paragraph (standardized effect sizes with confidence intervals) does not address the core issue, as I tried to point out above.
For me, this is the core issue: what we really want is a principled way to think about what happened, and what that means in any given situation is not cut and dried--nor is it clear or easy to impart to students in a methods class. Significance testing, by contrast, has a lot of inertia and tradition behind it: in a stats class it's clear what needs to be taught and how, and students and practitioners can develop a conceptual schema for understanding the material, along with a checklist / flowchart (I've seen some!) for conducting analyses. So significance testing can naturally evolve into rote data analysis without anyone being dumb or lazy or bad. That is the problem.