Thursday, February 9, 2012

Friday Fun: One Researcher's P-Curve Analysis

It's Me!
Two weeks ago, when PYM was at the annual conference of the Society for Personality and Social Psychology, I went to a symposium about false-positive findings in psychology (see my summary here). The speakers discussed the prevalence of research practices that result in biased statistical testing, and one of them, Uri Simonsohn, presented a method for catching people who engage in these practices: the p-curve analysis. What follows is a p-curve analysis of one researcher/blogger: Michael W. Kraus!

Before we get into my p-curve analysis, I just want to make a few observations about research. With what I imagine are few exceptions, psychologists, like other researchers, are fundamentally concerned with revealing the truth about the world-- and psychological experience more generally. This means that psychologists are committed to finding the truth about human experience and, by implication, would be firmly against publishing research that they knew was not a representation of that experience. For me, this means that I am always concerned about whether my findings will replicate. I believe they will, and I wouldn't publish them if I didn't believe this.

And yet, despite these motivations to search for truth, there are also real pressures to publish frequently and to present data that look beautiful. After all, frequently publishing beautiful data leads to jobs, prestige, and funding. These pressures could lead a researcher to search every corner of a data set in order to reveal some pattern in line with one's hypotheses. A good researcher engages in that sort of practice.

Of course, sometimes a researcher goes too far and focuses more on pushing p-values below .05-- the conventional level for statistical significance-- and less on whether or not a finding will replicate. The p-curve analysis is designed to determine whether this is happening. The idea behind the p-curve is elegant: A real effect will have a distribution of p-values like the one below:

This p-curve shows a low percentage of p-values near the conventional level of statistical significance. A questionable effect would have an abnormally high frequency of p-values just below .05 relative to p-values below .01, suggesting that a researcher is pushing p-values past the threshold for statistical significance just for the sake of reaching p < .05. Presumably, a distribution that looks different from the theoretical one would be evidence for using biased statistical techniques in data analysis.* Now, let's look at my own p-curve for all of my first-authored empirical papers.**

In the paper on the p-curve analysis, Simonsohn and colleagues presumably have a statistical technique to test the observed distribution against the theoretical one. Since we don't yet have access to the paper, we can only eyeball the difference between my own distribution and the theoretical one. I'd say that it's looking similar to the theoretical one (phew).
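To make the theoretical shape concrete, here is a minimal simulation sketch of what significant p-values look like with and without a real effect. It is not the method from Simonsohn's talk, which isn't available yet. The two-group z-test with a known SD of 1, the effect size of d = 0.8, the 20 observations per group, and the function names are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def two_group_p(effect, n, rng):
    """Two-sided p-value for a two-group z-test with known SD = 1 (assumed)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def p_curve(effect, n=20, studies=5000, seed=1):
    """Share of significant p-values in each .01-wide bin below .05."""
    rng = random.Random(seed)
    bins = [0] * 5
    for _ in range(studies):
        p = two_group_p(effect, n, rng)
        if p < 0.05:
            # Bin 0 holds p < .01, bin 4 holds .04 <= p < .05.
            bins[min(int(p * 100), 4)] += 1
    total = sum(bins)
    return [count / total for count in bins]

real = p_curve(effect=0.8)  # hypothetical true effect of d = 0.8
null = p_curve(effect=0.0)  # no true effect at all
for label, curve in (("real effect", real), ("no effect", null)):
    print(label, [round(share, 2) for share in curve])
```

Under these assumptions, the significant p-values from the real effect should pile up below .01, while the significant p-values from the no-effect studies should spread roughly evenly across the five bins.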

Does this mean that I don't engage in questionable research practices? No, it doesn't. First, let me outline the six questionable research practices that Simmons and colleagues (2011) note in their recent paper on false-positive findings:

(1) Terminating data collection only when p < .05
(2) Collecting fewer than 20 observations per condition
(3) Failure to list all variables
(4) Failure to report all experimental conditions
(5) Failure to report analyses with and without eliminating outliers
(6) Failure to report analyses with and without covariates

In my research, I've engaged in more than one of these practices on occasion. For example, I've added 20 observations to an experiment to push my data from p = .06 to p < .05. I've also collected variables that I didn't report in a paper. These tactics aren't necessarily going to lead to false-positive findings, but I can tell you confidently that when I made the decision to add 20 people, or drop a variable, I did so at least in part because (1) I could justify doing so using common research conventions, and (2) doing so would lead to the presentation of more perfect data. I believe this is precisely what Simmons and colleagues (2011) warn of in their paper.
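To see why adding observations after peeking at the p-value matters, here is a rough simulation sketch with no true effect at all. The specifics are assumptions for illustration, not anything taken from Simmons and colleagues: a two-group z-test with known SD of 1, 20 observations per group to start, and a hypothetical rule of adding 10 more per group (20 total) only when the first look lands between .05 and .10.

```python
import random
from statistics import NormalDist

def p_value(a, b):
    """Two-sided p-value for a two-group z-test with known SD = 1 (assumed)."""
    diff = sum(b) / len(b) - sum(a) / len(a)
    z = diff / (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def compare_false_positive_rates(n=20, extra=10, studies=20000, seed=2):
    """False-positive rates for a fixed sample vs. topping up when p is close."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(studies):
        # Both groups come from the same distribution: the null is true.
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        p = p_value(a, b)
        if p < 0.05:
            fixed_hits += 1
        if 0.05 <= p < 0.10:
            # Hypothetical rule: a "marginal" result earns more data and a re-test.
            a += [rng.gauss(0.0, 1.0) for _ in range(extra)]
            b += [rng.gauss(0.0, 1.0) for _ in range(extra)]
            p = p_value(a, b)
        if p < 0.05:
            flexible_hits += 1
    return fixed_hits / studies, flexible_hits / studies

fixed_rate, flexible_rate = compare_false_positive_rates()
print("fixed sample size:", round(fixed_rate, 3))
print("add 20 when .05 <= p < .10:", round(flexible_rate, 3))
```

The fixed-sample rate should sit near the nominal .05, and the top-up rule can only push the rate higher, because every study that was significant on the first look still counts and some marginal ones cross the line on the second.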

So there you have it, my own p-curve analysis. In general, I think that the Simmons et al. (2011) article does a great job of pointing out that some of our normal data analytic strategies can actually bias our hypothesis testing practices. Knowing this, I am planning to use more conservative statistical strategies going forward, and to be more transparent in my reporting of results. I also think that this analysis points (once again) to the importance of replication. Put simply, the definitive way to know whether a finding is real or based on biased hypothesis testing is to look across time, laboratory, researcher, and study. Finally, I still think it's ludicrous (as I said here) to judge a job candidate or a single paper based on this technique. I just don't think there are enough observations to make a judgment about biased hypothesis testing based on such a small number of p-values.

What are your reactions to the p-curve analysis? Let me know in the comments!

*Full disclosure #1 - The paper on p-curve analysis is not yet available (unpublished), so I am only conducting this analysis based on what I remember from the 20-min talk at the SPSP conference.

**Full disclosure #2 - I have only conducted this analysis on data that I myself analyzed so as not to implicate my co-authors in statistical techniques that could lead to biased hypothesis testing.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. PMID: 22006061


  1. If you're convinced to "use more conservative statistical techniques" then the terrorists have won.

    There is nothing in their paper that should lead you to this conclusion. Keep in mind that every move toward reducing Type I error is a move to INCREASE Type II error. It's a value judgment, and you've been convinced to go with the conservatives. Next you'll be voting for Ronald Reagan and bemoaning the end of traditional marriage--you're already moving toward defending the scientific status quo by making it harder to find things that are ACTUALLY there by obsessing about Type I error. If science is self-correcting, then Type II error is a much bigger problem.

    1. Thanks for the comment Chris! I think you make a very important point about how an increase in type II error is an equally important problem that researchers face. I think I'd like to minimize errors (of any kind) in my research by designing solid studies, throwing out more questionable statistical techniques, and running studies with sufficient power.

      Also, I'm not planning to vote Reagan. ;)

    2. Chris Crandall, I don't think what you say makes much sense.

      It has been noted for decades that it is far easier to get positive results published than negative results. This is what is called publication bias. There is no doubt that it exists, and it is actually worse in psychology than in hard sciences, see eg:

      When negative results end up in people's file drawers, type II errors cannot distort the scientific literature in any serious way. They may unnecessarily discourage an individual investigator, but they do not have any lasting effect on the enterprise. Others will try a study from time to time, and if there are positive results to be had, and people do adequately powered experiments, these effects will turn up.

      Type 1 errors, on the other hand, are completely different. Many statisticians have estimated that due to publication bias a high proportion of significant effects are bogus:

      This is stuff that makes its way into review articles, textbooks, etc. It makes the scientific literature a pile of junk. And these statisticians weren't even assuming any p-hacking, so the real situation in fields that p-hack may be far worse!

      You say "If science is self-correcting". Oh yes, you mean like how we found out about Diederik Stapel, right?

      Oh wait, he made up crap for years and there were a grand total of zero published nonreplications--he was only caught because of whistleblowers. Is that what you mean by our self correcting process, Chris?

      By the way, mentioning Reagan and religion really just distracts from your weak analysis, Chris--maybe you'd do better if you tried to think through one topic at a time?

    3. Woah there anonymous! I think we can all write about this issue without descending into hostile comments about whether people are (or aren't) thinking through their arguments.

      I'd just like to add that (1) most researchers don't fudge their data, (2) psychology should probably focus more on non-replication, and (3) I thought Chris was being funny and sarcastic in bringing up Reagan!

  2. Hmm, why would you conduct an analysis based on an unknown method in a paper that's not yet available, relying on your memory from a talk? Did you know which p-values to use? Are they independent of one another? I suppose the p-values from the same experiment are not independent, etc. Far too many open questions to run around doing stuff like that.

    1. Thanks for the comment Dr. Schwarz! I decided to conduct this analysis out of pure curiosity, and unfortunately it leaves many unanswered questions. Certainly this blog entry doesn't merit publication in a respectable scientific outlet. However, on a blog where I've written about the movie Twilight, and about hilarious anonymous reviewer comments, I think this post fits in well.

      I should clarify that in general, I examined the p-values from the central predictions of each study. That said, there is likely to be some interdependence in p-values from the same experiment, and I don't know how Simonsohn et al would treat this interdependence.

      Lastly, I am an admirer of your research!

    2. Norbert Schwarz, why are you so surly about this work? Do you doubt that there is lots of p-hacking going on?

      And why shouldn't Michael try out the p-curve analysis? People are curious about these curves--that is reason enough to do and share this analysis, and that's why his webpage is getting lots of traffic today I imagine. Basically your point here (and in your outbursts during the SPSP talk) seems to be that we lack firm evidence about the statistical reliability of observed p-curve shapes. OK, fine, but big deal--such information will eventually emerge, I am sure. In the meanwhile, people are curious and intrigued.

      Lastly, I am not an admirer of your conference etiquette!

    3. Hey Chuckie, thanks for the comment. If you are interested in a fuller conversation about p-curve analysis and false-positive findings between Joe Simmons and Norbert Schwarz there was a great back-and-forth Email exchange on the SPSP list which can be found here:

      These email exchanges explain a lot of Dr. Schwarz's reservations about p-curves.

    4. There is so much noise in p values you wouldn't expect much interdependence at all. If you simulate the same effect a hundred times you'll get very different p values. If you assume that p values in a paper all sample exactly the same effect (which is probably not the case) you'd not expect the same p value (or indeed similar p values).

      I guess you could argue the case for very small p values where effects are very large or sample sizes huge, but for the range of p values in question I can't see how interdependence would be an issue in real data sets.
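The point about noise can be checked with a small sketch that runs the same hypothetical study a hundred times and looks at the spread of p-values. The effect size (d = 0.5), the 30 observations per group, the z-test with known SD of 1, and the function name are assumptions for illustration only.

```python
import random
from statistics import NormalDist

def replicate_p_values(effect=0.5, n=30, runs=100, seed=3):
    """Sorted p-values from repeated runs of one two-group study (assumed design)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(runs):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = (sum(b) / n - sum(a) / n) / (2.0 / n) ** 0.5
        p_values.append(2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return sorted(p_values)

ps = replicate_p_values()
print("smallest p:", round(ps[0], 5))
print("median p:  ", round(ps[len(ps) // 2], 3))
print("largest p: ", round(ps[-1], 3))
```

Even though every run samples exactly the same underlying effect, the p-values in this sketch should range from well below .01 to far above .05.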

  3. I enjoyed the post and I think you are right to have some reservations. A quick (very crude) simulation suggests that you need at least a hundred or so p values to be confident of getting the theoretically expected profile (and possibly more):

    1. Thanks for posting this Palinurus! It's such a great idea (to conduct the simulation) and helpful for interpreting p-curves!

  4. Great post. How did you manage to get the marginal/ns effects in there? I was under the impression that the online calculator leaves those out.
    It seems odd to me that the p-curve analysis does not allow for marginal/ns results to be included. I mean, I have more than one published paper in which I report p-values of .06 up to .09. Leaving those out would enhance the chances of being accused of hacking, I suppose.