[♪ INTRO] A little over a decade ago, a neuroscientist stopped by a grocery store on his way to his lab to buy a large Atlantic salmon. The fish was placed in an MRI machine, and then it completed what was called an “open-ended mentalizing task,” where it was asked to determine the emotions being experienced by different people in photos. Yes, the salmon was asked to do that. The dead one from the grocery store. But that’s not the weird part. The weird part is that the researchers found that so-called significant activation occurred in neural tissue in a couple of places in the dead fish. Turns out, this was a bit of a stunt. The researchers weren’t studying the mental abilities of dead fish; they wanted to make a point about statistics, and how scientists use them. Which is to say, stats can be done wrong, so wrong that they can make a dead fish seem alive.

A lot of the issues surrounding scientific statistics come from a little something called a p-value. The p stands for probability, and it refers to the probability that you would have gotten the results you did just by chance. There are lots of other ways to provide statistical support for a conclusion in science, but the p-value is by far the most common, and, I mean, it’s literally what scientists mean when they report that their findings are “significant”. But it’s also one of the most frequently misused and misunderstood parts of scientific research. And some think it’s time to get rid of it altogether. The p-value was first proposed by a statistician

named Ronald Fisher in 1925. Fisher spent a lot of time thinking about how to determine whether the results of a study were really meaningful. And, at least according to some accounts, his big breakthrough came after a party in the early 1920s. At this party there was a fellow scientist named Muriel Bristol, and reportedly, she refused a cup of tea from Fisher because he had added the milk after the tea was poured. She only liked her tea when the milk was added first.

Fisher didn’t believe she could really taste the difference, so he and a colleague designed an experiment to test her assertion. They made eight cups of tea, half of which were milk-first and half of which were tea-first. The order of the cups was random and, most importantly, unknown to Bristol, though she was told there would be four of each. Then, Fisher had her taste each cup one by one and say whether it was milk-first or tea-first. And to Fisher’s great surprise, she went 8 for 8. She guessed correctly every time which cup was tea-first and which was milk-first!

And that got him thinking: what are the odds that she got them all right just by guessing? In other words, if she really couldn’t taste the difference, how likely would it be that she got them all right? He calculated that there are 70 possible orders for the 8 cups if there are four of each mix. Therefore, the probability that she’d guess the right one by luck alone is 1 in 70. Written mathematically, the value of p is about 0.014. That, in a nutshell, is a p-value: the probability that you’d get that result if chance is the only factor. In other words, there’s really no relationship between the two things you’re testing, in this case, how the tea is mixed versus how it tastes, but you could still wind up with data that suggest there is a relationship.

Of course, the definition of “chance” varies depending on the experiment, which is why p-values depend a lot on experimental design. Say Fisher had only made 6 cups, 3 of each tea mix. Then there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20, a p-value of 0.05. Fisher went on to describe an entire field

of statistics based on this idea, which we now call Null Hypothesis Significance Testing. The “null hypothesis” refers to the experiment’s assumption of what “by chance” looks like. Basically, researchers calculate how likely it is that they would have gotten the data they did even if the effect they’re testing for doesn’t exist. Then, if the results would be extremely unlikely to occur when the null hypothesis is true, they can infer that it isn’t. So, in statistical speak, with a low enough p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis they had as the explanation for the results.

The question becomes: how low does a p-value have to be before you can reject that null hypothesis? Well, the standard answer used in science is less than 1 in 20 odds, or a p-value below 0.05. The problem is, that’s an arbitrary choice. It also traces back to Fisher’s 1925 book, where he said 1 in 20 was, quote, “convenient”. A year later, he admitted the cutoff was somewhat subjective, but that 0.05 was generally his personal preference. Since then, the 0.05 threshold has become the gold standard in scientific research. A p of less than 0.05, and your results are, quote, “significant”. It’s often talked about as determining whether or not an effect is real. But the thing is, a result with a p-value of 0.049 isn’t more true than one with a p-value of 0.051. It’s just ever so slightly less likely to be explained by chance or sampling error. This is really key to understand: you’re not more right if you get a lower p-value, because a p-value says nothing about how correct your alternate hypothesis is.

Let’s bring it back to tea for a moment. Bristol aced Fisher’s 8-cup study by getting them all correct, which, as we noted, has a p-value of 0.014, solidly below the 0.05 threshold. But it being unlikely that she randomly guessed doesn’t prove she could taste the difference. See, it tells us nothing about other possible explanations for her correctness. Like, if the teas had different colors rather than tastes. Or she secretly saw Fisher pouring each cup! Also, it still could have been a one-in-seventy fluke. And sometimes, one might even argue often, 1 in 20 is not a good enough threshold to really rule out that a result is a fluke. Which brings us back to that seemingly undead

fish. The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed. See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume. So for the fish, they took each of these units and compared the data before and after the pictures were shown to the fish. That means even though they were just looking at one dead fish’s brain before and after, they were actually making multiple comparisons, potentially thousands of them. The same issue crops up in all sorts of big studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods, or behavioral studies where participants fill out surveys with dozens of questions. In all cases, even though a false positive on any individual comparison is unlikely, with enough comparisons, you’re bound to find some false positives.

There are statistical solutions for this problem, of course, which are simply known as multiple comparison corrections. Though they can get fancy, they usually amount to lowering the threshold for p-value significance. And to their credit, the researchers who looked at the dead salmon also ran their data with multiple comparison corrections; when they did, their data were no longer significant. But not everyone uses these corrections. And though individual studies might give various reasons for skipping them, one thing that’s hard to ignore is that researchers are under a lot of pressure to publish their work, and significant results are more likely to get published. This can lead to p-hacking: the practice of analyzing or collecting data until you get significant p-values. This doesn’t have to be intentional, because researchers make many small choices that lead to different results, like we saw with 6 versus 8 cups of tea. This has become such a big issue because, unlike when these statistics were invented, people can now run tests lots of different ways fairly quickly and cheaply, and just go with what’s most likely to get their work published. Because of all of these issues surrounding

p-values, some are arguing that we should get rid of them altogether. And one journal has totally banned them. Many who say we should ditch the p-value are pushing for an alternative statistical system called Bayesian statistics. P-values, by definition, only examine null hypotheses; the result is then used to infer whether the alternative is likely. Bayesian statistics actually look at the probability of both the null and alternative hypotheses. What you wind up with is an exact ratio of how likely one explanation is compared to another. This is called a Bayes factor. And that’s a much better answer if you want to know how likely you are to be wrong. This system was around when Fisher came up with p-values. But, depending on the dataset, calculating Bayes factors can require some serious computing power, power that wasn’t available at the time, since, y’know, it was before computers. Nowadays, you can have a huge network of computers thousands of miles from you run the calculations while you throw a tea party.

But the truth is, replacing p-values with Bayes factors probably won’t fix everything. A loftier solution is to completely separate a study’s publishability from its results. This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method, and the journal decides whether to publish before seeing your results. That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out the way researchers hoped, or whether a p-value or Bayes factor was more or less than some arbitrary threshold. This sort of idea isn’t widely used yet, but it may become more popular as statistical significance meets sharper criticism. In the end, hopefully, all this controversy surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don’t really show. And that will make things more accessible for all of us who want to read and understand science, and keep any more zombie fish from showing up.

Now, before I go make myself a cup of Earl Grey, milk first, of course, I want to give a special shout-out to today’s President of Space, SR Foxley. Thank you so much for your continued support! Patrons like you give us the freedom to dive deep into complex topics like p-values, so really, we can’t thank you enough. And if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at Patreon.com/SciShow. Cheerio! [♪ OUTRO]
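Fisher's tea-test arithmetic is easy to check. Here's a minimal Python sketch (standard library only) reproducing the 8-cup and 6-cup numbers from the transcript:

```python
from math import comb

# 8 cups, 4 milk-first and 4 tea-first: the number of distinct
# arrangements is "8 choose 4".
orders_8 = comb(8, 4)   # 70
p_8 = 1 / orders_8      # probability of guessing all 8 correctly by luck
print(orders_8, round(p_8, 3))   # 70 0.014

# The 6-cup variant, 3 of each mix.
orders_6 = comb(6, 3)   # 20
p_6 = 1 / orders_6
print(orders_6, p_6)    # 20 0.05
```

Note that `math.comb` requires Python 3.8 or later.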

There is a typo at 7:37! The P-value for 6 tea cups is 0.05, not 0.5. Thanks to everyone who pointed it out!

I remember complaining about this in psychology research when I was in school. "So you have a 1 in 20 chance of seeing a result… and you're comparing subjects on 20 metrics… and one of them is significant?" One thing I would add in defense of p values though is that lower thresholds of .01 or .001 are often used especially in the "hard sciences." Also, you can't blame a technique for when people abuse it.
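The 20-metrics scenario in the comment above can be simulated directly. This is a sketch with made-up numbers: it uses the fact that, under the null hypothesis, a p-value is uniformly distributed on [0, 1], and contrasts the uncorrected false-alarm rate with a simple Bonferroni correction.

```python
import random

random.seed(0)

N_STUDIES = 10_000
N_METRICS = 20
ALPHA = 0.05

false_alarm = 0       # studies where at least one metric looks "significant"
false_alarm_bonf = 0  # same, after Bonferroni correction

for _ in range(N_STUDIES):
    # Under the null, each test's p-value is uniform on [0, 1].
    pvals = [random.random() for _ in range(N_METRICS)]
    if min(pvals) < ALPHA:
        false_alarm += 1
    if min(pvals) < ALPHA / N_METRICS:  # Bonferroni: alpha / number of tests
        false_alarm_bonf += 1

print(false_alarm / N_STUDIES)       # ~0.64, i.e. 1 - 0.95**20
print(false_alarm_bonf / N_STUDIES)  # ~0.05
```

So with 20 comparisons and no real effects, roughly two out of three studies would report at least one "significant" result without a correction.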

The fish was making "eyes" at me the whole time during the MRI. How do you tell a dead fish I'm just not that in to you?

Oh god this gives me senior year stats flashbacks

As an English woman, I know i should be horrified at the prospect of tea made milk first. However, as a coffee drinker, I'm not that bothered.

there are 3 kinds of lies; lies, damn lies, and statistics

I wonder if she could have told the difference if the milk and the tea were poured together

After all those centuries, we are still wretched at science.

This is what some of us have been shouting from the wilderness for years. If you want to see my father foam at the mouth, just mention the phrase "P value" and sit back to watch the fun. I am VERY glad that this has finally become a widespread topic of conversation!

What a load of cobblers! Any one knows that a true tea aficionado drinks without adding milk at all.

Years ago I failed my statistics class multiple times and really never understood the null hypothesis. Thanks for clearing that up for me 😀

This was a good video! However, Fisher's choice of a 95% confidence threshold was not entirely arbitrary. He worked in a variety of different areas and so was well versed in the challenges of conducting practical field research. A question that must be frequently asked is how to get the most reliability for a given research budget.

The most obvious avenue for investment is a larger sample size, either to increase the precision of my estimates (smaller confidence interval) or to get a higher level of confidence in my existing error bars. You have diminishing marginal returns on both. If we limit ourselves to the relationship of n (number of observations, often the big driver of cost for a study) to standard error and to confidence, we find that the curves are non-linear and elbow around 1.5<Z<2.5. Surprise, surprise: that's right where the three classic Z scores are (for 90%, 95%, and 99% confidence). Going much lower won't lower your costs all that much (relative to the benefits you forgo), but increasing your Z score beyond this point quickly gets very expensive for not that much gain. Let's also not forget that, for confirmatory research, a confidence level of 100% is impossible. (Because even censuses are samples when you're doing confirmatory research.) So throwing n at a problem won't ever solve it, no matter how big a budget you have.

Plus, Z = 1.96 ≈ 2 makes the standard error calculations (or figuring out the necessary sample size to get a required level of precision for your parameter estimate) very easy to do in your head. This was valuable back before computers, sure, but it's still valuable to this day when you're in a budget meeting looking at different scenarios for potential sponsors. These questions can often come out of left field, so it's nice to be able to give a solid answer on the spot. Similarly, if you're in the field and see some important phenomenon that needs scrutiny, you may not have time to do all the calculations. And yet you still have to make sure you have enough observations for later analysis. This found art won't wait for you, but being able to estimate to within an order of magnitude how much data you will need to collect can save you millions of dollars and years of time trying for the do-over later.

But it's about more than just maximizing the value of n. Random sample error has nice neat measures, so we fixate on those. But it's often a small part of overall error. Better study design, more rigorous measurement, better control of the study conditions, all help reduce error also. But they all cost, and are all subject to diminishing marginal returns as well. The same goes for using the increased budget to do more studies and answer more research questions. So there's a sweet spot in each avenue of investment. Sure, I could increase n by ~75% to raise confidence from 95% to 99%. But it's not a good use of my budget if some hard-to-measure but potentially fatal flaw in my research isn't fixed first.

So while I agree that we're overly fixated on p-values and the 95% confidence threshold, those rules were created for some very good reasons, many of which still stand to this day.
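The cost argument in this comment can be made concrete with the usual sample-size formula n = (Z·σ/E)² for estimating a mean to within a margin of error E. A sketch, with arbitrary σ and E chosen for illustration:

```python
# Required sample size for a target margin of error E with
# population standard deviation sigma: n = (Z * sigma / E)**2
sigma, E = 10.0, 1.0
z_scores = {"90%": 1.645, "95%": 1.960, "99%": 2.576}  # two-sided normal quantiles

for conf, z in z_scores.items():
    n = (z * sigma / E) ** 2
    print(conf, round(n))   # 90% 271, 95% 384, 99% 664

# Going from 95% to 99% confidence multiplies n by (2.576/1.960)**2,
# roughly the ~75% increase mentioned in the comment above.
print(round((2.576 / 1.960) ** 2, 2))  # 1.73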

I've heard the "Milk before the Tea" thing before. In another context, and it was coffee, not tea. The saying goes, "add the acid to the base, not the base to the acid". I didn't study much chemistry in college, but have been doing this myself ever since, and would testify that it is 'better'. Especially if you've ever seen the little 'creamer' floaties in a cup of coffee that you just added creamer to. I never seemed to get the floaties if I put the creamer in first and poured the coffee in after… Food for thought….

The solution is to trash statistics altogether. It is all questionable, and with so many liars in the world, it is not to be trusted.

what was the journal thats banned p-values?

Nobody in this thread is addressing the real elephant in the room. Why is there a difference depending on how you mix your tea? (I say that jokingly, but maybe there is something to it. Some unknown tea hypothesis maybe?)

NEVER drink Earl Grey with milk!!!

Is the earl grey the only significant tea out there?

She just annoys me for some reason 😅

The sad thing is that when you ask about the statistical facts and what was within the study, as well as the actual results, you can come to your own conclusion without all the "if this is false then mine is true!" black and white. Your brain does its own statistical probability and doesn't think about a "p value". That's how we should approach statistics – comparing only the truthful facts and not concluding certain ones are true (no matter how unlikely) simply because another possible event is considered so unlikely that it's not true.

Confusing or outright false statistics annoy me simply because.. You don't prove yourself right by proving someone else wrong. You still can be wrong.

The milk with tea part was weirder to me than the dead fish MRI part.

Cold milk might sink and leave the top layer a bit warmer.

good video. p-hacking bad.

So…we're POSITIVE it wasn't an unusually intelligent undead fish…right?

I work at a popular coffee shop, if you insist that you can tell the difference between pouring the milk or coffee first, you're getting decaf.

I guess then, the Fisher has been fished.

publish or perish. p-hacking or perish.

what happened to taking a sample of size at least 30?

I had to go p in the Bay after seeing this. Now I'm relieved.

I just spent $2,000 on business statistics and regression class that now feels totally worthless…

MILK FIRST IN YOUR TEA?!! WHAT

I'll bet muscle Hank drinks eight cups a day for those shredded abs. Where is he anyway?

This is why I hate science. “Scientists” are presented with irrefutable data that contradicts their preassumptions, & insist their religious beliefs are correct. !

Half of this video had nothing to do with the dead fish lmao

Definitely a welcome video, it is a good thing you guys made it

Can you make a video about maths book called lilavati?

You look better with glasses!

if we push for that 2 step process we will never see a gender studies having significant result! can't have that!

Cooking wise, it seems completely reasonable that she could taste the difference. I don't like tea, so I won't be trying it, but if you think about making ice cream and having to add a little of the hot milk mixture to the egg mixture so that the eggs don't cook, the same would be logical for tea. Adding the tea to the milk would warm the milk slower and make it less likely to curdle ever so slightly.

As a personal preference when I did my Political Science capstone, I didn't accept any p-values greater than 0.01 as statistically significant. I think it's a much more robust indicator.

I mean I think the "gold standard" should be less the 1%

Splain. Now

It doesn't matter how many cups there are or what order. All that matters is how many options. Every cup presented can either be guessed right or wrong. There is a 50% chance every time.

Not that they are all that great in general, but p-values didn't break scientific statistics, it's the way they've been used. Unfortunately we've ended up with this bizarre chimera of Neyman-Pearson and Fisher, with many of the worst aspects retained.

Neyman-Pearson makes sense when one thinks about their general use case of quality control statistics, where there's a well-defined population (e.g., "all widgets produced in this factory in the last work day") and sampling plan ("randomly sample the boxes from the line and randomly sample the widgets from the box"). In this case, the hypothesis has a meaningful action, such as "Do we agree that the production of the last work day is acceptable or does it need to be re-examined?" Unlike Fisher, Neyman-Pearson focused on specifying a test a priori and considering the consequences of different kinds of errors, too. Fisher totally blew that off in his theory and only focused on the null hypothesis. While there's a lot to be said for Bayesian statistics more broadly, Bayes Factors are kind of bizarre too. While they take the alternative hypothesis into account, unlike the p-value, they average over possible alternatives, which can be highly dependent on the choice of priors.

Confidence intervals and credible intervals (the Bayesian version) are much more useful and are focused on the actual outcome variables rather than some highly indirect measure like the p-value.
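For a toy illustration of the Bayes factors discussed here and in the video, one can model Bristol's 8-for-8 as 8 independent trials and compare a point null (pure guessing, θ = 1/2) against an alternative that puts a uniform prior on θ. This ignores the fixed four-and-four design, so it's a sketch of the machinery rather than a reanalysis of Fisher's actual experiment:

```python
from fractions import Fraction
from math import factorial

k, n = 8, 8  # 8 correct calls out of 8 cups

# Marginal likelihood under the null: theta fixed at 1/2.
m_null = Fraction(1, 2) ** n  # 1/256

# Marginal likelihood under the alternative: theta ~ Uniform(0, 1).
# Integral of theta^k (1-theta)^(n-k) dtheta = k! (n-k)! / (n+1)!
m_alt = Fraction(factorial(k) * factorial(n - k), factorial(n + 1))  # 1/9

bayes_factor = m_alt / m_null  # evidence for "she can tell" vs. pure guessing
print(bayes_factor, float(bayes_factor))  # 256/9, about 28.4
```

A Bayes factor of about 28 would usually be read as strong, though not overwhelming, evidence against pure guessing.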

XKCD and jellybeans…

Get rid of the p value. Require experiments to be judged before they are used. Also require a certain language to be used when presenting results… A language the common man can understand.

Y'all got Ms. Choksondik here talking about dead fish, y'all savage af SciShow…

Those who do shifty experiments should be banned from involvement on any official scientific action.

Drinking an Earl Grey whilst Watching this.

Olivia has suddenly won my Respect, 3 times over.

https://www.xkcd.com/882/

I just find it hilarious that someone decided to put a dead fish in an MRI machine in the first place.

Throw a tea party while you wait for results, but what about coffee drinkers? Especially those who prefer to put the milk in first!

I can't even concentrate listening to this video when I can't take my eyes off that disgusting nose and booger infested ring hanging from her proboscis.

This is incorrect. Correction for multiple comparisons doesn't fix the issues with p values. 2 step isn't going to work either, nor Bayes. The use of stats in science is largely just incorrect. You need to do science, not data analysis; the two are not the same.

On the topic of video format, is it my imagination or is the pace of speaking used in presenting a bit slower than in previous videos? Is that a response to criticism on previous fast-paced talking videos, just trying something new, or an attempt to get more watch time out of videos?

Consider me politely curious.

There is a very simple solution. Don't do stats, do science.

One thing to keep in mind is that the "gold standard threshold" of .05 depends a lot on your field of study (social sciences use higher p-values like .05, and things like cutting-edge physics use much smaller p-values, like <.0001).

Earl Grey with milk? What a monster does that? The proper way to drink Earl grey is with lemon😂😉

SAMPLE SIZZZZZEE!!!

The more samples of data you get the more accurate.

You can never get an accurate probability, because there's always science to do.

Your sample size can be the entire population to get an estimate, but then how much of that is flawed data.

Polling 100% is still less than 100% accurate probability for any factor.

It was actually a zombie fish!

I agree that any chosen p-value is arbitrary, but I don't agree that it should be discarded. Instead, the acceptable level, and consideration given to the value obtained, should be based on the study conducted, and its purpose. Say I'm validating a titration. I know that when I titrate the solvent, I get a response of 0 ppm. When I add a known solute, I get a response that agrees with that of standard addition – I can safely assume the response is a result of the solute, and is representative of the capability of the titration method. If I repeat the test several times, calculate the bias, uncertainty, p-value, etc. I can have confidence in those values, due to the level of control given by the design of the experiment. Now, it's important that I repeat the test and collect enough data to suffice for the purpose the test will be used. In other words, if lives depend on the response being very precise, I better repeat the test many times, before I accept the results, or put the method into practice. And, I need to base my limit of acceptance on the uncertainty of those values during industrial practices. I've seen this done time and again, and it works. Of course, your p-value may never be the actual likelihood that your values are true, but it is a literal calculation of the likelihood that they are, it increases based on the number of tests conducted, and you have prior knowledge and the experiment limited by design. I would agree that p-value should be taken with a grain of salt when applied to the example of milk in tea, because the experiment sucks. As mentioned in the video, the variables are numerous, there are few controls, it was not a quantitative (it was qualitative) test, etc. I don't mean to imply that qualitative experiments can't be good experiments, or that they can't get use out of p-value, but it limits the setup and consideration that can be given to p-value.

There is a really good book about this, specifically talking about medical research. Check out "Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions"

I'm gonna regret asking this.. Is there really a difference between having your milk first and pouring in the tea, or having your tea first and pouring in your milk?

Excellent video! Thanks for making this

Excellent. A must for every high school student being introduced to probability and something to show university students after being sprayed with current social science research "gold standard" techniques.

Another issue is that journals don't like to publish negative results. Negative results are interesting. So researchers have to keep hunting until they get significant positive results.

Chewbacca is a wookie

My p value is almost 0…Despite it appearing golden

You guys could really benefit by bringing down the gain on 3000 Hz or above. S sounds are LOUD

IF you're gonna use teabags then Milk first is best!

let the tea steep for a few minutes in the milk and add almost boiling water to that to avoid blanching the leaves.

or just do tea properly and have loose leaf.

Of course, to rule out the possibility of a P-value based on luck, an experiment would be run several times using different subjects. It doesn't make a luck result impossible, but it does make it unlikely.

I'm British and I promise you there is a difference in taste between milk first/last (and it should definitely be last ;)!

Shouldn't it be a T-value? 🙂

I've seen references to P-hacking in two unrelated places in the past two days; therefore, I can conclude that there must be some recent event that has gotten everyone talking about it.

It seems to me that with the actual combinatorics proof unsolved, there is no way to constitute this idea of "chance", which is totally unscientific: it denies cause and effect. Would you not say that the p value is always, initially, between 0-100%, and that the actual p value is what it tends to reduce to, or something like that?

I bet if they let a homeopath mix the tea she wouldn't be able to tell the difference.

I can taste the difference when mixing my coffee

You didn't talk about the fact that Bayesian statistics require the dreaded prior probability. In order to perform Bayesian statistics, you must first assign probabilities to each of your hypotheses before you can collect any data. A lot of people are very opposed to that, since it feels like it deviates from the objectivity of science.
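The prior sensitivity this comment describes is easy to demonstrate on the same toy tea example: score 8-for-8 as a binomial outcome and compare the guessing null (θ = 0.5) against alternatives with different Beta(a, b) priors on θ. The priors below are arbitrary choices for illustration:

```python
from math import comb, gamma

def beta_fn(a, b):
    """Beta function via gamma; fine for the small arguments used here."""
    return gamma(a) * gamma(b) / gamma(a + b)

def bayes_factor(k, n, a, b):
    """Bayes factor: Beta(a, b) prior on the success rate vs. point null theta = 0.5."""
    m_alt = comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)
    m_null = comb(n, k) * 0.5 ** n
    return m_alt / m_null

# Priors increasingly concentrated around "pure guessing" (theta = 0.5):
for a, b in [(1, 1), (2, 2), (5, 5)]:
    print((a, b), round(bayes_factor(8, 8, a, b), 1))
# (1, 1) 28.4
# (2, 2) 14.0
# (5, 5) 5.2
```

The more the prior concentrates near θ = 0.5, the smaller the Bayes factor gets, so the strength of the "evidence" depends on a modeling choice made before seeing any data.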

I wish this came out before my thesis defense lolz. Good news is that i passed! My research was on tardigrades.

Does any of this apply to climate studies? Urge to positively publish, etc?

How about lowering the p-value to .005? https://www.nature.com/articles/s41562-017-0189-z https://link.springer.com/article/10.1007/s13164-019-00440-1

she sucks

Looking at you, pharmaceutical studies, nutritional studies and social sciences.

Was the tea stirred well? Because if not I fully understand how she would taste the difference.

But what if it really taste different?

H0:
HA:
Significance level: reject H0 if P-value <
Test statistic
Reject H0 / Do not reject H0
Conclusion
Blablabla

They also put nose rings on pig noses to prevent them from digging.

Damn, FINALLY!!!! I have waited for this paper for nearly 10 years! I love it. Salmon bless the researchers and SciShow.

The way it mixes is different since milk is denser than tea. When the milk is spread out at the bottom of the cup it lets the tea mix in better than pouring the milk in second. You can see the difference a lot better with coffee and creamer. Unless they're actually stirring it after pouring, there is a difference.

P-values are a cheap cop-out for those who can't invest enough time, thought, and effort into proper statistical analysis. They provide an easy “yes or no” comparison when determining what a study means. Too bad they are so meaningless and misleading.

I'm wondering how much of an issue this "p-value can be wrong" thing actually is. The purpose of peer review is to weed out methodological flaws despite the statistical power of their results. As the authors of this video are well aware, the introduction of a paper that uses p-values is a treasure trove of often dozens of foundational studies, usually repeated over and over, that underpin the ideas the authors are exploring in a new way. The only possible benefit I can see from a "new way" of statistical analysis is to weed out anomalous novel results that get picked up in popular culture from lower impact factor journals before repeatability has been established.

I wonder what diagnosis the fish got.

So what did humans do to get the science from measurements such as the whole blood vessels around the world thing, or how lungs are the size of half a football stadium.. If scientists always use logic then why does it feel like scientists are just screwing with us..

I had a hard time understanding why p-hacking is such a big deal, but now its all crystal clear. Thank you!

BEAUTIFUL!

I've been yelling the same thing at scholars for years.

Once SR Foxley runs out of money to be the top patron donor it will indeed be a sad day……..

Though I've had statistics at grad level and work with data, I always need to rephrase in my head the meaning of "null hypothesis" to understand the statistic I'm computing. P of 0.05 just means 1 in 20 chance. Put that way it sounds arbitrary. Thanks for the excellent explanation. Never heard of the reason for a Bayesian analysis before but it makes sense.

"P-value, the probability that you'd get that result if chance is the only factor." This is the clearest, most straightforward definition of the term I've ever come across. I tutor basic statistics and I'm definitely borrowing this definition to tell students what the P-value means and why it's not quite the same thing as the probability that your hypothesis is true. That one phrase has made it far more clear to me why this is the case, which will help me explain it. The textbook the school uses emphasises that the P-value is NOT the probability that the hypothesis is correct, but it doesn't clearly explain why.

P=Q

What?
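The definition quoted in the tutoring comment above (the p-value is the probability of the result if chance is the only factor) can also be checked by brute force: simulate tasters with no ability at all under Fisher's four-and-four design and count how often they go 8 for 8. A sketch:

```python
import random

random.seed(1)

TRIALS = 100_000
cups = ["milk"] * 4 + ["tea"] * 4  # Fisher's design: four of each, and the taster knows it

perfect = 0
for _ in range(TRIALS):
    truth = cups[:]
    random.shuffle(truth)   # the true (random) order of the cups
    guess = cups[:]
    random.shuffle(guess)   # a guesser who just picks a random four-and-four split
    if guess == truth:
        perfect += 1

print(perfect / TRIALS)  # about 0.014, i.e. roughly 1 in 70
```

Pure guessers ace the test about 1.4% of the time, which is exactly the p-value, and says nothing about whether any particular perfect scorer could actually taste the difference.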