P-values Broke Scientific Statistics—Can We Fix Them?


[♪ INTRO] A little over a decade ago, a neuroscientist
stopped by a grocery store on his way to his lab to buy a large Atlantic salmon. The fish was placed in an MRI machine, and then it completed what was called an
“open-ended mentalizing task” where it was asked to determine the emotions that were being experienced by different people in photos. Yes, the salmon was asked to do that. The
dead one from the grocery store. But that’s not the weird part. The weird part is that researchers found that so-called significant activation occurred in neural tissue in a couple places in the dead fish. Turns out, this was a little bit of a stunt. The researchers weren’t studying the mental abilities of dead fish; they wanted to make a point about statistics,
and how scientists use them. Which is to say, stats can be done wrong,
so wrong that they can make a dead fish seem alive. A lot of the issues surrounding scientific
statistics come from a little something called a p-value. The p stands for probability, and it refers to the probability that you would have gotten the results you did just by chance. There are lots of other ways to provide statistical
support for your conclusion in science, but p-value is by far the most common, and, I mean, it’s literally what scientists mean when they report that their findings are “significant”. But it’s also one of the most frequently misused and misunderstood parts of scientific research. And some think it’s time to get
rid of it altogether. The p-value was first proposed by a statistician
named Ronald Fisher in 1925. Fisher spent a lot of time thinking about how to determine if the results of a study were really meaningful. And, at least according to some accounts, his big breakthrough came after a party in the early 1920s. At this party there was a fellow scientist
named Muriel Bristol, and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured. She only liked her tea when the milk was added
first. Fisher didn’t believe she could really taste
the difference, so he and a colleague designed an experiment to test her assertion. They made eight cups of tea, half of which
were milk first, and half of which were tea first. The order of the cups was random, and, most
importantly, unknown to Bristol, though she was told there would be four of each cup. Then, Fisher had her taste each tea one by
one and say whether that cup was milk or tea first. And to Fisher’s great surprise, she went
8 for 8. She guessed correctly every time which cup was tea-first and which was milk-first! And that got him to thinking, what are the
odds that she got them all right just by guessing? In other words, if she really couldn’t taste
the difference, how likely would it be that she got them all right? He calculated that there are 70 possible orders for the 8 cups if there are four of each mix. Therefore, the probability that she’d guess the right one by luck alone is 1 in 70. Written mathematically, the p-value is about 0.014. That, in a nutshell, is a p-value: the probability that you’d get that result if chance is the only factor. In other words, there’s really no relationship between the two things you’re testing, in this case, how tea is mixed versus how it tastes, but you could still wind up with data that suggest there is a relationship. Of course, the definition of “chance” varies depending on the experiment, which is why p-values depend a lot on experimental design. Say Fisher had only made 6 cups, 3 of each tea mix. Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20, a p-value of 0.05.
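If you want to check that arithmetic yourself, a small Python sketch like the one below will do it. This is just an illustration, none of it comes from Fisher or the video, and the variable names are my own: it counts the possible orderings with math.comb, then cross-checks the answer with SciPy's implementation of Fisher's exact test, which handles exactly this kind of design.

    # A quick check of the tea-tasting arithmetic. Requires Python 3.8+ and SciPy.
    from math import comb
    from scipy.stats import fisher_exact

    # 8 cups, 4 of each kind: number of ways to choose which 4 are "milk first"
    orderings_8 = comb(8, 4)      # 70
    p_8_cups = 1 / orderings_8    # ~0.014: chance of calling all 8 correctly by luck

    # The smaller 6-cup design
    orderings_6 = comb(6, 3)      # 20
    p_6_cups = 1 / orderings_6    # 0.05

    print(f"8 cups: {orderings_8} orderings, p = {p_8_cups:.3f}")
    print(f"6 cups: {orderings_6} orderings, p = {p_6_cups:.3f}")

    # Cross-check: Bristol got all 4 milk-first and all 4 tea-first cups right,
    # which gives this 2x2 table for Fisher's exact test.
    table = [[4, 0],
             [0, 4]]
    _, p_one_sided = fisher_exact(table, alternative="greater")
    print(f"Fisher's exact test (one-sided): p = {p_one_sided:.3f}")  # ~0.014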
Fisher went on to describe an entire field of statistics based on this idea, which we now call Null Hypothesis Significance Testing. The “null hypothesis” refers to the experiment’s
assumption of what “by chance” looks like. Basically, researchers calculate how likely
it is that they’d have gotten the data that they did, assuming the effect they’re testing
for doesn’t exist. Then, if the results are extremely unlikely
to occur if the null hypothesis is true, they can infer that it isn’t. So, in statistical speak, with a low enough p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis they had as the explanation for the results.
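One way to make “what chance looks like” concrete is to simulate it. The sketch below is my own illustration, not anything from the video, and the trial count is arbitrary: it pretends the taster is purely guessing which four cups were milk-first, repeats the eight-cup experiment many times, and counts how often a pure guesser aces it. The fraction should hover around the analytic 1 in 70.

    # Simulating the null hypothesis: a taster who is purely guessing which
    # 4 of the 8 cups were milk-first.
    import random

    def guessing_taster_aces_it(rng):
        """One run under the null: a blind guess of which 4 cups are milk-first."""
        cups = range(8)
        truth = set(rng.sample(cups, 4))   # which cups really were milk-first
        guess = set(rng.sample(cups, 4))   # the taster's guess, unrelated to the truth
        return guess == truth

    rng = random.Random(1)
    n_trials = 200_000
    hits = sum(guessing_taster_aces_it(rng) for _ in range(n_trials))

    print(f"Simulated chance of acing all 8 cups by guessing: {hits / n_trials:.4f}")
    print(f"Analytic value: {1 / 70:.4f}")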
The question becomes: how low does a p-value have to be before you can reject that null hypothesis? Well, the standard answer used in science
is less than 1 in 20 odds, or a p-value below 0.05. The problem is, that’s an arbitrary choice. It also traces back to Fisher’s 1925 book,
where he said 1 in 20 was quote “convenient”. A year later, he admitted the cutoff was somewhat
subjective, but that 0.05 was generally his personal preference. Since then, the 0.05 threshold has become
the gold standard in scientific research. A p of less than 0.05, and your results are
quote “significant”. It’s often talked about as determining whether
or not an effect is real. But the thing is, a result with a p-value of 0.049 isn’t more true than one with a p-value of 0.051. It’s just ever so slightly
less likely to be explained by chance or sampling error. This is really key to understand. You’re
not more right if you get a lower p-value, because a p-value says nothing about how correct
your alternate hypothesis is. Let’s bring it back to tea for a moment. Bristol aced Fisher’s 8-cup study by getting them all correct, which as we noted, has a
p-value of 0.014, solidly below the 0.05 threshold. But it being unlikely that she randomly guessed
doesn’t prove she could taste the difference. See, it tells us nothing about other possible
explanations for her correctness. Like, if the teas had different colors rather
than tastes. Or she secretly saw Fisher pouring each cup! Also, it still could have been a one-in-seventy
fluke. And sometimes, arguably even often, 1 in 20 is not a good enough threshold to really rule out that a result is a fluke. Which brings us back to that seemingly undead
fish. The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed. See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume. So for the fish, they took
each of these units and compared the data before and after the pictures were shown to
the fish. That means even though they were just looking
at one dead fish’s brain before and after, they were actually making multiple comparisons,
potentially, thousands of them. The same issue crops up in all sorts of big
studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods, or behavioral studies where participants fill out surveys
with dozens of questions. In all cases, even though any individual comparison is unlikely to turn up a false positive, with enough comparisons, you’re bound to find some. There are statistical solutions for this problem,
of course, which are simply known as multiple comparison corrections. Though they can get fancy, they usually amount to lowering the threshold for p-value significance.
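To see how much difference that makes, here's a toy demonstration, purely illustrative, with a made-up voxel count and sample sizes: run thousands of comparisons on pure noise, where the null hypothesis is true every single time, and count how many come out “significant” at 0.05 before and after a Bonferroni correction, one common way of lowering that threshold.

    # Thousands of comparisons on pure noise: nothing real is going on,
    # yet plenty of p-values still land below 0.05.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(42)

    n_comparisons = 5000   # e.g., thousands of little chunks of fish brain
    n_per_group = 20       # "before" and "after" measurements per comparison
    alpha = 0.05

    before = rng.normal(size=(n_comparisons, n_per_group))
    after = rng.normal(size=(n_comparisons, n_per_group))
    p_values = ttest_ind(before, after, axis=1).pvalue

    uncorrected = np.sum(p_values < alpha)                 # roughly 5% of 5000
    bonferroni = np.sum(p_values < alpha / n_comparisons)  # usually 0

    print(f"'Significant' results out of {n_comparisons} pure-noise comparisons:")
    print(f"  uncorrected (p < 0.05): {uncorrected}")
    print(f"  Bonferroni-corrected:   {bonferroni}")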
And to their credit, the researchers who looked at the dead salmon also ran their data with multiple comparison corrections; when they did, their data were no longer significant. But not everyone uses these corrections. And though individual studies might give various
reasons for skipping them, one thing that’s hard to ignore is that researchers are under a lot of pressure to publish their work, and significant results are more likely to get
published. This can lead to p-hacking: the practice of
analyzing or collecting data until you get significant p-values. This doesn’t have to be intentional, because researchers make many small choices that lead to different results, like we saw with 6 versus
8 cups of tea. This has become such a big issue because,
unlike when these statistics were invented, people can now run tests lots of different
ways fairly quickly and cheaply, and just go with what’s most likely to get their work published.
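One common flavor of this is “peeking”: testing the data after every new batch of participants and stopping the moment p dips below 0.05. The rough simulation below, again my own illustration with made-up numbers, shows how that inflates the false positive rate well past the nominal 5%, even when there's no real effect at all.

    # Optional stopping: keep adding participants and re-testing, and stop the
    # moment p < 0.05. There is no real effect here, yet this "works" far more
    # often than 5% of the time.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)

    def peeking_finds_significance(batch_size=10, max_batches=10, alpha=0.05):
        group_a, group_b = [], []
        for _ in range(max_batches):
            # Both groups come from the same distribution: the null is true.
            group_a.extend(rng.normal(size=batch_size))
            group_b.extend(rng.normal(size=batch_size))
            if ttest_ind(group_a, group_b).pvalue < alpha:
                return True   # stop early and report a "significant" finding
        return False

    n_studies = 2000
    false_positives = sum(peeking_finds_significance() for _ in range(n_studies))
    print(f"False positive rate with peeking: {false_positives / n_studies:.1%}")
    # Typically lands well above the nominal 5% with these settings.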
Because of all of these issues surrounding
p-values, some are arguing that we should get rid of them altogether. And one journal has
totally banned them. And many who say we should ditch the p-value are pushing for an alternative statistical system called Bayesian statistics. P-values, by definition, only examine null
hypotheses. The result is then used to infer if the alternative is likely. Bayesian statistics actually look at the probability
of both the null and alternative hypotheses. What you wind up with is a ratio of how strongly the data support one explanation compared to the other. This is called a Bayes factor. And this is a much better answer if you want to know how likely you are to be wrong.
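As a rough picture of what a Bayes factor looks like, here's a sketch using a deliberately simplified binomial model of the tea test, treating each of the 8 cups as an independent call rather than Fisher's fixed four-and-four design, so read it as an illustration of the idea, not a reanalysis of the original experiment.

    # A back-of-the-envelope Bayes factor for the tea-tasting data, under a
    # simplified model: 8 independent cups, each called correctly or not.
    from scipy.integrate import quad

    n_cups, n_correct = 8, 8

    # Null hypothesis: she's guessing, so each call is right with probability 0.5.
    prob_data_given_null = 0.5 ** n_cups   # 1/256 for 8 out of 8

    # Alternative: her true hit rate is unknown, so give it a flat prior on [0, 1]
    # and average the likelihood over that prior. (The binomial coefficient is 1
    # for 8-of-8 and would cancel in the ratio anyway.)
    prob_data_given_alt, _ = quad(
        lambda rate: rate ** n_correct * (1 - rate) ** (n_cups - n_correct), 0, 1
    )   # exactly 1/9 for 8 out of 8

    bayes_factor = prob_data_given_alt / prob_data_given_null
    print(f"Bayes factor, alternative vs. null: {bayes_factor:.1f}")   # ~28

Note that the answer depends on the prior you put on her hit rate; the flat prior here is just the simplest choice, and that sensitivity to priors is one of the standard objections to Bayes factors.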
This system was around when Fisher came up
with p-values. But, depending on the dataset, calculating Bayes factors can require some serious computing power, power that wasn’t available at the time,
since, y’know, it was before computers. Nowadays, you can have a huge network of computers
thousands of miles from you to run calculations while you throw a tea party. But the truth is, replacing p-values with
Bayes factors probably won’t fix everything. A loftier solution is to completely separate
a study’s publishability from its results. This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method, and the journal decides whether to publish before seeing your results. That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out
the way researchers hoped, or whether a p-value or Bayes factor was more or less than some arbitrary threshold. This sort of idea isn’t widely used yet, but it may become more popular as statistical significance comes under sharper criticism. In the end, hopefully, all this controversy
surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don’t really show. And that will make things more accessible
for all of us who want to read and understand science, and keep any more zombie fish from
showing up. Now, before I go make myself a cup of Earl
Grey, milk first, of course, I want to give a special shout out to today’s President
of Space, SR Foxley. Thank you so much for your continued support! Patrons like you give
us the freedom to dive deep into complex topics like p-values, so really, we can’t thank
you enough. And if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at Patreon.com/SciShow. Cheerio! [♪ OUTRO]

100 thoughts on “P-values Broke Scientific Statistics—Can We Fix Them?”

  1. I remember complaining about this in psychology research when I was in school. "So you have a 1 in 20 chance of seeing a result… and you're comparing subjects on 20 metrics… and one of them is significant?" One thing I would add in defense of p values though is that lower thresholds of .01 or .001 are often used especially in the "hard sciences." Also, you can't blame a technique for when people abuse it.

  2. The fish was making "eyes" at me the whole time during the MRI. How do you tell a dead fish I'm just not that in to you?

  3. As an English woman, I know i should be horrified at the prospect of tea made milk first. However, as a coffee drinker, I'm not that bothered.

  4. This is what some of us have been shouting from the wilderness for years. If you want to see my father foam at the mouth, just mention the phrase "P value" and sit back to watch the fun. I am VERY glad that this has finally become a widespread topic of conversation!

  5. Years ago I failed my statistics class multiple times and really never understood the null hypothesis. Thanks for clearing that up for me 😀

  6. This was a good video! However, Fisher's choice of a 95% confidence threshold was not entirely arbitrary. He worked in a variety of different areas and so was well versed in the challenges of conducting practical field research. A question that must be frequently asked is how to get the most reliability for a given research budget.

    The most obvious avenue for investment is a larger sample size, either to increase the precision of my estimates (smaller confidence interval) or to get a higher level of confidence in my existing error bars. You have diminishing marginal returns on both. If we limit ourselves to the relationship of n (number of observations, often the big driver of cost for a study) to standard error and to confidence, we find that the curves are non-linear and elbow around 1.5<Z<2.5. Surprise, surprise: that's right where the three classic Z scores are (for 90%, 95%, and 99% confidence). Going much lower won't lower your costs all that much (relative to the benefits you forgo), but increasing your Z score beyond this point quickly gets very expensive for not that much gain. Let's also not forget that, for confirmatory research, a confidence level of 100% is impossible. (Because even censuses are samples when you're doing confirmatory research.) So throwing n at a problem won't ever solve it, no matter how big a budget you have.

    Plus, Z = 1.96 ≈ 2 makes the standard error calculations (or figuring out the necessary sample size to get a required level of precision for your parameter estimate) very easy to do in your head. This was valuable back before computers, sure, but it's still valuable to this day when you're in a budget meeting looking at different scenarios for potential sponsors. These questions can often come out of left field, so it's nice to be able to give a solid answer on the spot. Similarly, if you're in the field and see some important phenomenon that needs scrutiny, you may not have time to do all the calculations. And yet you still have to make sure you have enough observations for later analysis. This found art won't wait for you, but being able to estimate to within an order of magnitude how much data you will need to collect can save you millions of dollars and years of time trying for the do-over later.

    But it's about more than just maximizing the value of n. Random sample error has nice neat measures, so we fixate on those. But it's often a small part of overall error. Better study design, more rigorous measurement, better control of the study conditions, all help reduce error also. But they all cost, and are all subject to diminishing marginal returns as well. The same goes for using the increased budget to do more studies and answer more research questions. So there's a sweet spot in each avenue of investment. Sure, I could increase n by ~75% to raise confidence from 95% to 99%. But it's not a good use of my budget if some hard-to-measure but potentially fatal flaw in my research isn't fixed first.

    So while I agree that we're overly fixated on p-values and the 95% confidence threshold, those rules were created for some very good reasons, many of which still stand to this day.

  7. I've heard the "Milk before the Tea" thing before. In another context, and it was coffee, not tea. The saying goes, "add the acid to the base, not the base to the acid". I didn't study much chemistry in college, but have been doing this myself ever since, and would testify that it is 'better'. Especially if you've ever seen the little 'creamer' floaties in a cup of coffee that you just added creamer to. I never seemed to get the floaties if I put the creamer in first and poured the coffee in after… Food for thought….

  8. The solution is to trash statistics altogether. It is all questionable, and with so many liars in the world, it is not to be trusted.

  9. Nobody in this thread is addressing the real elephant in the room. Why is there a difference depending on how you mix your tea? (I say that jokingly, but maybe there is something to it. Some unknown tea hypothesis maybe?)

  10. The sad thing is that when you ask about the statistical facts and what was within the study, as well as the actual results, you can come to your own conclusion without all the "if this is false then mine is true!" black and white. Your brain does its own statistical probability and doesn't think about a "p value". That's how we should approach statistics – comparing only the truthful facts and not concluding certain ones are true (no matter how unlikely) simply because another possible event is considered so unlikely that it's not true.

    Confusing-but-false statistics annoy me simply because.. You don't prove yourself right by proving someone else wrong. You still can be wrong.

  11. I work at a popular coffee shop, if you insist that you can tell the difference between pouring the milk or coffee first, you're getting decaf.

  12. This is why I hate science. “Scientists” are presented with irrefutable data that contradicts their preassumptions, & insist their religious beliefs are correct. !

  13. Cooking wise, it seems completely reasonable that she could taste the difference. I don't like tea, so I won't be trying it, but if you think about making ice cream and having to add a little of the hot milk mixture to the egg mixture so that the eggs don't cook, the same would be logical for tea. Adding the tea to the milk would warm the milk slower and make it less likely to curdle ever so slightly.

  14. As a personal preference when I did my Political Science capstone, I didn't accept any p-values greater than 0.01 as statistically significant. I think it's a much more robust indicator.

  15. It doesn't matter how many cups there are or what order. All that matters is how many options. Every cup presented can either be guessed right or wrong. There is a 50% chance every time.

  16. Not that they are all that great in general, but p-values didn't break scientific statistics, it's the way they've been used. Unfortunately we've ended up with this bizarre chimera of Neyman-Pearson and Fisher, with many of the worst aspects retained.  

    Neyman-Pearson makes sense when one thinks about their general use case of quality control statistics, where there's a well-defined population (e.g., "all widgets produced in this factory in the last work day") and sampling plan ("randomly sample the boxes from the line and randomly sample the widgets from the box"). In this case, the hypothesis has a meaningful action, such as "Do we agree that the production of the last work day is acceptable or does it need to be re-examined?" Unlike Fisher, Neyman-Pearson focused on specifying a test a priori and considering the consequences of different kinds of errors, too. Fisher totally blew that off in his theory and only focused on the null hypothesis. While there's a lot to be said for Bayesian statistics more broadly, Bayes Factors are kind of bizarre too. While they take the alternative hypothesis into account, unlike the p-value, they average over possible alternatives, which can be highly dependent on the choice of priors.

    Confidence intervals and credible intervals (the Bayesian version) are much more useful and are focused on the actual outcome variables rather than some highly indirect measure like the p-value.

  17. Get rid of the p value. Require experiments to be judged before they are used. Also require a certain language to be used when presenting results… A language the common man can understand.

  18. Throw a tea party while you wait for results, but what about coffee drinkers? Especially those who prefer to put the milk in first!

  19. I can't even concentrate listening to this video when I can't take my eyes off that disgusting nose and booger infested ring hanging from her proboscis.

  20. This is incorrect. Correction for multiple comparisons doesn't fix the issues with p values. 2 step isn't going to work either, nor Bayes. The use of stats in science is largely just incorrect. You need to do science, not data analysis; the two are not the same.

  21. On the topic of video format, is it my imagination or is the pace of speaking used in presenting a bit slower than in previous videos? Is that a response to criticism on previous fast-paced talking videos, just trying something new, or an attempt to get more watch time out of videos?
    Consider me politely curious.

  22. One thing to keep in mind is that the “gold standard threshold” of .05 depends a lot on your field of study (social sciences use higher p values like .05, while things like cutting-edge physics use much smaller p values, <.0001).

  23. SAMPLE SIZZZZZEE!!!

    The more samples of data you get the more accurate.
    You can never get an accurate probability, because there's always science to do.
    Your sample size can be the entire population to get an estimate, but then how much of that is flawed data.
    Polling 100% is still less than 100% accurate probability for any factor.

  24. I agree that any chosen p-value is arbitrary, but I don't agree that it should be discarded. Instead, the acceptable level, and consideration given to the value obtained, should be based on the study conducted, and its purpose. Say I'm validating a titration. I know that when I titrate the solvent, I get a response of 0 ppm. When I add a known solute, I get a response that agrees with that of standard addition – I can safely assume the response is a result of the solvent, and is representative of the capability of the titration method. If I repeat the test several times, calculate the bias, uncertainty, p-value, etc. I can have confidence in those values, due to the level of control given by the design of the experiment. Now, it's important that I repeat the test and collect enough data to suffice for the purpose the test will be used for. In other words, if lives depend on the response being very precise, I better repeat the test many times, before I accept the results, or put the method into practice. And, I need to base my limit of acceptance on the uncertainty of those values during industrial practices. I've seen this done time and again, and it works. Of course, your p-value may never be the actual likelihood that your values are true, but it is a literal calculation of the likelihood that they are, it increases based on the number of tests conducted, and you have prior knowledge and the experiment limited by design. I would agree that p-value should be taken with a grain of salt when applied to the example of milk in tea, because the experiment sucks. As mentioned in the video, the variables are numerous, there are few controls, it was not a quantitative (it was qualitative) test, etc. I don't mean to imply that qualitative experiments can't be good experiments, or that they can't get use out of p-value, but it limits the setup and consideration that can be given to p-value.

  25. There is a really good book about this, specifically talking about medical research. Check out "Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions"

  26. I'm gonna regret asking this.. Is there really a difference between having your milk first and pour in the tea, or to have your tea first and pour in your milk?

  27. Excellent. A must for every high school student being introduced to probability and something to show university students after being sprayed with current social science research "gold standard" techniques.

  28. Another issue is that journals don't like to publish negative results. Negative results are interesting. So researchers have to keep hunting until they get significant positive results.

  29. IF you're gonna use teabags then Milk first is best!
    let the tea steep for a few minutes in the milk and add almost boiling water to that to avoid blanching the leaves.

    or just do tea properly and have loose leaf.

  30. Of course, to rule out the possibility of a P-value based on luck, an experiment would be run several times using different subjects. It doesn't make a luck result impossible, but it does make it unlikely.

  31. I've seen references to P-hacking in two unrelated places in the past two days, therefore, I can conclude that there must be some recent event that has gotten everyone talking about it.

  32. It seems to me that with the actual combinatorics proof unsolved, there is no way to constitute this idea of "chance", which is totally unscientific: it denies cause and effect. Would you not say that the p value is always, initially, between 0-100%, and that the actual p value is what it tends to reduce to, or something like that?

  33. You didn't talk about the fact that Bayesian statistics require the dreaded prior probability. In order to perform Bayesian statistics, you must first assign probabilities to each of your hypotheses before you can collect any data. A lot of people are very opposed to that, since it feels like it deviates from the objectivity of science.

  34. How about lowering the p-value to .005? https://www.nature.com/articles/s41562-017-0189-z https://link.springer.com/article/10.1007/s13164-019-00440-1

  35. The way it mixes is different since milk is denser than tea. When the milk is spread out at the bottom of the cup it lets the tea mix in it better than pouring the milk in second. You can see the difference a lot better with coffee and creamer. Unless they're actually stirring it after pouring it, it makes a difference.

  36. P-values are a cheap cop-out for those who can't invest enough time, thought, and effort into proper statistical analysis. They provide an easy “yes or no” comparison when determining what a study means. Too bad they are so meaningless and misleading.

  37. I'm wondering how much of an issue this "p-value can be wrong" thing actually is. The purpose of peer review is to weed out methodological flaws despite the statistical power of their results. As the authors of this video are well aware, the introduction of a paper that uses p-values is a treasure trove of often dozens of foundational studies, usually repeated over and over, that underpin the ideas the authors are exploring in a new way. The only possible benefit I can see from a "new way" of statistical analysis is to weed out anomalous novel results that get picked up in popular culture from lower impact factor journals before repeatability has been established.

  38. So what did humans do to get the science from measurements such as the whole blood vessels around the world thing, or how lungs are the size of half a football stadium.. If scientists always use logic then why does it feel like scientists are just screwing with us..

  39. Though I've had statistics at grad level and work with data, I always need to rephrase in my head the meaning of "null hypothesis" to understand the statistic I'm computing. P of 0.05 just means 1 in 20 chance. Put that way it sounds arbitrary. Thanks for the excellent explanation. Never heard of the reason for a Bayesian analysis before but it makes sense.

  40. "P-value, the probability that you'd get that result if chance is the only factor.". This is the clearest, most straightforward definition of the term I've ever come across. I tutor basic statistics and I'm definitely borrowing this definition to tell students what the P-value means and why it's not quite the same thing as the probability that your hypothesis is true. That one phrase has made it far more clear to me why this is the case, which will help me explain it. The textbook the school uses emphasises that the P-value is NOT the probability that the hypothesis is correct, but it doesn't clearly why.
