The Summary
In the first talk, Leslie John of Harvard University bravely discussed the prevalence of questionable data analysis practices in our field. The short answer: people engage in many data collection and analysis strategies that bias hypothesis testing and contribute to the publishing of false-positive findings.
In the second talk, Joe Simmons from UPenn presented findings from a research paper suggesting that false-positive findings are preventable with a few key changes in the way people report results and journals review papers. In short, Simmons suggested that researchers should report justifications for their sample-size cutoffs, conduct analyses both with and without covariates, report a list of all measures used in a study, and collect samples of sufficient power (at least n = 20 per cell), among other things. In addition, Simmons urged reviewers to put more emphasis on exact (rather than conceptual) replications and to give authors a little slack for imperfect findings. Simmons further suggested that our natural tendency to justify whatever data-analytic strategy works as the correct one necessitates a mandate to adhere to clear rules for data reporting.
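To make the power recommendation concrete, here is a minimal sketch (my own, not anything presented in the talk) of how one might check what power a given per-cell sample size buys for an assumed effect size. The effect size of d = 0.5 below is an illustrative assumption, not a figure from the symposium.

# Power of a two-sided, two-sample t-test via the noncentral t distribution.
# Illustrative only: the per-cell ns and d = 0.5 are assumed values,
# not figures taken from the talk.
import numpy as np
from scipy import stats

def two_sample_power(n_per_cell, d, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test with equal cell sizes."""
    df = 2 * n_per_cell - 2
    ncp = d * np.sqrt(n_per_cell / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # Probability that |T| exceeds the critical value when the true effect is d
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for n in (20, 50, 100):
    print(f"n = {n} per cell, d = 0.5: power = {two_sample_power(n, 0.5):.2f}")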
In the final talk, Uri Simonsohn of UPenn discussed what he refers to as “p-hacking.” The idea is that if researchers are engaging in questionable analysis practices, their results should include a disproportionate number of p-values at or just below the p < .05 threshold for statistical significance, and this signature is relatively easy to detect. Simonsohn then presented data suggesting, using this technique, that Daniel Kahneman publishes findings that are real, and he proposed some other uses for the approach, including assessing whether a journal publishes many false-positive findings or whether a job candidate’s data can be trusted.
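As a rough illustration of that intuition, here is a sketch of how one might check whether a set of reported significant p-values piles up suspiciously close to the .05 cutoff. This is my own simplification for the blog, not Simonsohn's actual method; the bins and the binomial comparison are assumptions I chose for the example.

# A toy version of the "pile-up just below .05" check described above.
# This is a simplification of the intuition, not Simonsohn's actual method;
# the bins and the test are my own choices for illustration.
from scipy import stats

def just_below_threshold_check(p_values, alpha=0.05):
    """Compare significant p-values landing near the cutoff (.04-.05)
    against those that are very small (< .01). Real effects tend to
    produce many very small p-values; a pile-up near .05 is a warning sign."""
    sig = [p for p in p_values if p < alpha]
    near_cutoff = sum(1 for p in sig if p >= 0.04)
    very_small = sum(1 for p in sig if p < 0.01)
    n = near_cutoff + very_small
    if n == 0:
        return None
    # Under a flat (no-effect, no-hacking) p-value distribution the two bins
    # are equally likely, so test the observed split against 50/50.
    result = stats.binomtest(near_cutoff, n, 0.5, alternative="greater")
    return near_cutoff, very_small, result.pvalue

reported = [0.049, 0.041, 0.048, 0.003, 0.044, 0.046, 0.02]
print(just_below_threshold_check(reported))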
My Reaction
I think that, overall, there are some very interesting talking points worth considering in this symposium. For instance, the premium that reviewers sometimes place on perfect data would be nice to rein in. I also think the practice of collecting a million exploratory variables, correlating them, and seeing what relations emerge needs to be stopped. I also find the idea of p-hacking absolutely fascinating, and I’m planning to try it on my own data when I get home from the conference (a topic for a later blog entry).
I do, however, have some reservations about the critical points in the symposium. First, I don’t think the researchers are considering the real monetary and time costs of some forms of data collection. For example, if my research question involves a special sample, and my limited access to this hypothetical sample requires that I collect all of the data during a set window, I might reasonably collect a bunch of variables. Then, if the original hypothesis does not find support in the data, I cannot justify (1) not exploring the data or (2) exploring the data but not publishing those results [both of these possibilities were proposed by the speakers]. That’s just not a practical solution given the money and time costs of collecting such a special set of participants.
I also wondered where a person should stop when reporting findings. Do they report the order of the measures? The race and gender of the experimenters? The day of the week? What is enough for total transparency?
Third, when Simonsohn mentioned using p-hacking to check whether researchers are faking data, I became concerned: a job candidate has very few papers to their name, and I wonder whether the speakers even know how many research papers it would take to get a reliable estimate, using this methodology, of whether someone is falsifying data.
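To make that worry concrete, here is a back-of-the-envelope sketch; the counts are invented, and the 30% share of p-values landing just below .05 is an arbitrary assumption, but it shows how wide the uncertainty stays when only a handful of significant results are available.

# How wide is the uncertainty around the "just below .05" share when a
# candidate has only a few significant results? The counts are made up;
# the confidence interval is a standard binomial (Clopper-Pearson) interval.
from scipy import stats

for k_sig in (5, 10, 20, 50, 200):
    near_cutoff = round(0.3 * k_sig)  # assume 30% of them land in .04-.05
    ci = stats.binomtest(near_cutoff, k_sig).proportion_ci(confidence_level=0.95)
    print(f"{k_sig:3d} significant p-values: 95% CI for the .04-.05 share "
          f"= [{ci.low:.2f}, {ci.high:.2f}]")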
Finally, isn’t putting greater emphasis on exact replications a more parsimonious solution? If a person’s findings can be replicated, then, by definition, the findings are real. The other good thing about exact replications is that researchers don’t get drawn into a potential witch hunt.
What do you think about these ideas? I’d love to read your
comments!
Comments
Some system for rewarding researchers who do replications or reviews or meta-analyses might be called for. As of now, there's too much emphasis on originality.
Of course, groundbreaking findings advance science, but our enthusiasm for them needs to be kept in check.
What if there were some collectively maintained database all a field's major journals participated in keeping up to date, on which every theory was catalogued along with some sort of rating based on how many times the findings supporting it have been replicated?
Researchers could go to the database and see if the idea they're studying has been looked into before and by how many other researchers. Contradictions could be highlighted and used to encourage studies to sort them out.
Most important, the rating based on replication (and N-size) could serve as a guide to how much credence ought to be placed on the ideas.
Hi Dennis, these are some interesting and innovative ways to solve the false-positive findings problem in psychology.
Thanks for reading!
"Finally, isn’t putting greater emphasis on exact replications a more parsimonious solution? If a person’s findings can be replicated, then by definition, the findings are real."
But what if they can't be replicated? Null findings are difficult to publish. I remember a lecturer suggesting that 'psychology is a Type I error.'
Instead of wasting resources on witch hunts, maybe listen to Paul Meehl and stop relying on p-values in the first place.
I have an interest in p-hacking from the opposite perspective, i.e., when can we rely on null data to support "evidence of absence" or "affirmative evidence against harm"? I am a consultant who works with various clients who are accused of harming people with products and exposures.
I suspect lots of the science used to support my adversaries' cases is flawed, based on (inadvertent) p-hacking.
This is a very interesting topic to me.