Thursday, February 21, 2013

Have Your Cake and Eat It Too! Practical Reform in Social Psychology

The cake we can (1) have, and (2) eat!
If you have been following recent headlines in the social sciences then you are aware that the field of social psychology has been in some rough water over the past three years. In this time period, we've had our flagship journal publish a series of studies providing evidence that ESP exists (and then refuse to publish non-replications of these studies). We've suffered through at least three instances of egregious scientific fraud perpetrated by high-profile researchers. We've had an entire popular area of research come under attack because researchers have failed to replicate its effects. And several respected members of the science community have had some harsh words to say about the discipline and its methods.

Listing all of these events in succession makes me feel a bit ashamed to call myself a social psychologist. Clearly our field has been lacking both oversight and leadership if all of this could happen in such a brief period. Now, I'm not one to tuck my tail between my legs. Instead, I've decided to look ahead. I think there are relatively simple changes that social psychologists (even ones without tenure) can make in their research that can shore up our science going forward.

I. Small Effect Size, Large Sample Size
When I was an undergraduate, I remember my first exposure to psychological research. I was disappointed when I learned how little we could explain about human behavior with even the best social experiments. We would all like to think that a couple of social variables could easily explain 50-70% of our behavior. The problem, of course, is that reported effects in psychology explain less of our behavior than we expect--much less. The average size of an effect across 100 years of social psychology is r = .21 (Richard et al., 2003). That's 4% of the variance in behavior explained. This analysis reveals that our experimental manipulations typically have a smallish effect on behavior. Now, if you know something about statistics, then you know that to detect such a small effect reliably, a researcher needs both a reliable measure and many observations. In short, small effects require large samples.
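
Just how large? A quick back-of-the-envelope power calculation makes the point. Here is a minimal sketch in Python (it assumes scipy is available, and it uses the standard Fisher r-to-z approximation rather than anything fancier) of the sample size needed to detect r = .21 with 80% power:

    # Sample size needed to reliably detect the field-average effect of r = .21
    # (80% power, two-tailed alpha = .05), via the Fisher r-to-z approximation.
    import math
    from scipy.stats import norm

    r = 0.21          # average effect size in social psychology (Richard et al., 2003)
    alpha = 0.05      # two-tailed significance level
    power = 0.80      # desired statistical power

    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-tailed test
    z_beta = norm.ppf(power)            # value corresponding to the desired power
    z_r = math.atanh(r)                 # Fisher r-to-z transformation of the effect

    n = ((z_alpha + z_beta) / z_r) ** 2 + 3
    print(f"Participants needed to detect r = {r}: ~{math.ceil(n)}")
    # ~176 participants

That comes out to roughly 175 participants--well beyond what many of our lab studies collect.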

Now contrast this reality with the pages of some of our top journals. For the social journal club that I supervise at the University of Illinois, we've read four different papers chosen at random, and every one of them included at least one study whose sample was too small to detect the average effect in social psychology (r = .21). This means that these studies were not designed properly. Running consistently under-powered experiments is a sloppy research practice that has crept into our field at an alarming rate, and it needs to stop now.

The good news is that collecting larger samples is fairly straightforward for many types of research. I don't think I saw a single talk at our recent social psychology conference that didn't include at least some data collected via the internet. Internet samples are an easy way to achieve large samples. I collect fairly high-investment data--using autonomic physiology and coding of nonverbal behavior--but I'm still going to push myself and my graduate students to spend a little extra time collecting a bit more data in each study.

The biggest challenge for data collection occurs in neuroscience--where a single participant costs a sizable amount of money (an arm or a leg? I actually have no idea what this research costs). For social neuroscientists, I think the priority should be either making measurement more precise or starting a conversation about ways institutions can pool their neuroscience resources. The President isn't willing to give up on neuroscience, so we shouldn't either!

II. (Effect) Size Matters
Personality psychologists love to take us social psychologists "to the woodshed" when it comes to effect size. Whereas personality psychologists usually work in the domain of correlations, where the size of an effect is always explicitly visible, social psychologists (me included) have fallen into the lazy practice of failing to report effect sizes for their experiments. Reports of effect size should accompany all results in empirical papers. Period.

Reporting effect size is essential for two reasons. First, the size of an effect lets us judge whether or not we should care about the results of a study. If, for example, you conduct gene research and find an association between a candidate gene and a phenotype, but that association explains essentially .01% of the variance in that phenotype, you could claim that genes influence personality (as some do). Alternatively, because the effect is so small, you might conclude instead that candidate genes don't directly influence personality in any meaningful way.

The second reason reporting effect sizes is important is that it can draw attention to peculiarly large effects in our field. Recall that the average effect in social psychology explains 4% of the variance in behavior. If a study finds that X explains 20% of the variance in Y with a tragically small number of participants, then it should be met with suspicion by other researchers. What I mean is, effect size estimates can help reviewers see when reported effects are larger than would be expected in our field (i.e., too good to be true). At the very least, reviewers could ask researchers to directly replicate large-effect-size findings in studies using larger samples--where the effect can be estimated with more precision.
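
To see why a big effect from a tiny sample deserves extra scrutiny, it helps to look at how imprecise the estimate actually is. The sketch below (Python again, with scipy; the sample sizes are hypothetical choices of mine) computes 95% confidence intervals for an observed r of .45--an effect that "explains" about 20% of the variance:

    # How precise is an observed correlation of r = .45 (about 20% of the
    # variance) at different sample sizes? 95% CIs via the Fisher z method.
    import math
    from scipy.stats import norm

    def r_confidence_interval(r, n, alpha=0.05):
        """Confidence interval for a correlation via the Fisher z transformation."""
        z = math.atanh(r)                     # r -> z
        se = 1 / math.sqrt(n - 3)             # standard error of z
        crit = norm.ppf(1 - alpha / 2)
        return math.tanh(z - crit * se), math.tanh(z + crit * se)

    observed_r = 0.45   # r-squared is roughly .20
    for n in (20, 80, 320):
        lo, hi = r_confidence_interval(observed_r, n)
        print(f"N = {n:3d}: r = {observed_r}, 95% CI [{lo:.2f}, {hi:.2f}]")
    # With N = 20 the interval runs from essentially zero to about .74;
    # only at the larger Ns does the estimate become usefully precise.

With 20 participants, an observed r of .45 is consistent with nearly any true effect, including one close to zero--which is exactly why direct replication with a larger sample is the right ask.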

III. Methods in Manuscript Review
Last year I reviewed somewhere between 50 and 60 manuscripts for psychology journals. Early indications suggest that I'll be reviewing about as many this year. Basically, if you have submitted a manuscript to a scientific journal in the past year, chances are good that I've read it!

In the manuscript review process, the comments that I typically see have to do with the theory and the cleanliness of the results. Rarely do reviewers pay attention to precision of measurement and method. I think reviewers have a responsibility to hold research papers to high methodological standards. This means evaluating the measurement instruments used in the study for precision, and the design of the study for statistical power. Manuscripts shouldn't be rejected because they fail to show clean results. Manuscripts should absolutely be rejected if the study was designed without good methods. I think it's time the field started dinging researchers, in the manuscript review process, for running studies with small samples.
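
As a concrete example, here is the sort of quick sensitivity check a reviewer can run (sketched in Python with scipy; the example sample sizes are made up for illustration): given the N reported in a manuscript, what is the smallest correlation the study could have detected with 80% power?

    # Reviewer's back-of-the-envelope check: smallest correlation detectable
    # with 80% power (two-tailed alpha = .05) at a given sample size,
    # using the Fisher z approximation.
    import math
    from scipy.stats import norm

    def minimum_detectable_r(n, alpha=0.05, power=0.80):
        """Smallest r detectable with the given n, alpha, and power."""
        z_needed = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) / math.sqrt(n - 3)
        return math.tanh(z_needed)

    for n in (30, 60, 120, 250):
        print(f"N = {n:3d}: smallest detectable r ~= {minimum_detectable_r(n):.2f}")
    # N = 30 only gives 80% power for r of about .49 -- more than double the
    # field-average effect of .21.

If a manuscript's design can only detect effects far larger than the field average, that is worth a paragraph in the review, regardless of how the results came out.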

In my own reviews I've started this process. So far, I've been thanked by a couple of editors for a power analysis I conducted in my review of an article, and ignored by a couple others. I am going to keep going though, and I hope more people will join me!

IV. Stop P-Hacking
I've dealt with this issue in several posts on this blog so I won't get into too many details here. In general, if researchers were more honest about their methods, and less motivated to generate clean findings over real findings, the field would be in better shape. Most p-hacking involves using unnatural statistical means to transform a non-significant finding into a significant one. The practice brings big short-term gains (more publications = getting a job) and big long-term costs (others may fail to replicate one's p-hacked effects, and the field is flooded with biased studies).
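
For anyone who doubts how much damage this does, a small simulation makes it obvious. The sketch below (Python with numpy and scipy; the batch size, maximum N, and number of simulated experiments are arbitrary choices for illustration) mimics one common p-hacking move--peeking at the data every 10 participants per group and stopping as soon as p < .05--when there is no true effect at all:

    # Optional stopping: test after every batch of 10 participants per group
    # and stop at the first p < .05. There is NO true effect, so every
    # "significant" result is a false positive.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, batch, max_n = 2000, 10, 100
    false_positives = 0

    for _ in range(n_experiments):
        a, b = [], []
        for _ in range(max_n // batch):
            a.extend(rng.normal(size=batch))     # condition A, null effect
            b.extend(rng.normal(size=batch))     # condition B, null effect
            if stats.ttest_ind(a, b).pvalue < 0.05:
                false_positives += 1             # stop and "publish"
                break

    print(f"False positive rate with optional stopping: "
          f"{false_positives / n_experiments:.1%}")
    # Comes out in the neighborhood of 15-20%, not the advertised 5%.

The nominal 5% error rate balloons to something in the 15-20% range with just this one flexible decision; add a few more researcher degrees of freedom and it gets worse.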

The good news is that the legitimate way to increase one's chances of obtaining significant results is to collect data using larger samples. This way, researchers can spend some extra effort on data collection and skip the effort of massaging data to achieve statistical significance. We could discuss at length how the incentives in the system lead researchers to engage in questionable data practices, but I think using larger samples removes most of the temptation to p-hack. I'm determined to continue to be productive in my research while collecting larger samples.

V. What I'm Not Doing
What I have proposed in points I through IV are some very simple changes that every social psychologist, of any rank, could make to reform social psychology. These aren't the only solutions to the field's problems, just the ones that I think are the easiest to implement and likely to have the most positive impact. I am not keen on using Bayesian statistics to solve our problems--for some reason, all the people who advocate Bayesian approaches seem either unwilling or unable to explain why they would solve the method problems in our science. I'm also not willing to throw out the history of our field and start over. That's super depressing. I'm not a believer in utopias of any kind--even sciency ones. I'm also not going to stop using null hypothesis testing. I think we are often interested in finding out whether X causes Y to go up or down. As long as we start to care about effect size, null hypothesis testing can stick around. Also, dude, that's the scientific method we learned in 5th grade! Nostalgia for the win! Finally, I'm not waiting around for sweeping change to come down from above. That sort of change just doesn't seem to be happening at an acceptable speed.

Recent events in social psychology have earned me my fair share of finger wagging (both literal and figurative) from my colleagues in other fields of psychology. I'm taking action because I'm ready for that to stop! When I think about the field, I'm hopeful that the future will reveal a stronger science that uses better methods and collects better data. I'm glad to be a part of that journey. Onward!



Richard, F., Bond, C., & Stokes-Zoota, J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331-363. DOI: 10.1037/1089-2680.7.4.331

11 comments:

  1. I think you've hit on something under-appreciated with sample sizes. As a field, it seems we don't have enough fluency with effect sizes and as a result we don't recognize how large our samples need to be in order to have enough power to give us a fighting chance to find a true result.

    One consequence is that it leads to p-hacking, which I think happens with a lot less pre-meditation and deviousness than one might think. There are certainly instances of people playing with data until it yields significance, but I'm certain that it often happens as a function of simple confirmation bias and self-serving motivations leading people to find what they "know" is true through means they don't recognize are undermining the validity of their statistics.

    The flip side of this, however, is that people who are not p-hacking are failing to identify real effects that they're looking for because they don't have enough power. So it's not just leading people to do more bad things, it's hurting the good guys too. Also, it's worth noting that p-hacking can get you significant results whether your effect is real or not, but a larger sample size (done right) will do better at distinguishing real effects from spurious ones.

    1. Thanks for the comment Dave and I agree on all counts! The insidious thing about p-hacking is exactly as you describe--many people don't realize they're doing it, and it's easy to rationalize analysis choices (especially for smart people, as I like to think researchers tend to be) that actually bias hypothesis testing.

  2. Dave writes, "The flip side of this, however, is that people who are not p-hacking are failing to identify real effects that they're looking for because they don't have enough power." Before I was introduced to the p-hacking phenomenon and the raft of replication issues, I always thought the tragedy of experimental psychology was that it was failing to detect so many effects (Type II errors). If you want a positive spin on using larger samples, look no further than the fact that you can detect all of those small effects with more regularity.

    That said, the r = .21 that Richard et al. (2003) report has to be an overestimate. Our research has always been underpowered and therefore has only been able to detect medium effect sizes under the old NHST regime.

    1. A little optimism from B-Rob!!!

      It's hard to know how much of an overestimate the r = .21 is. I've started thinking about it as a useful decision point for designing my own studies these days. Can I detect an r = .21 effect with this design? Do I expect the effect to be smaller, and if so, how much? The Simmons talk from SPSP (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2205186) has some nice information about how to make sample size decisions as well.

      Thanks for the comment!

  3. Good post. Unfortunately, even the top scientific journals rarely make authors publish effect sizes or power calculations: see here.

    1. Thanks for the comment! I've seen this analysis and it largely corroborates my own experience with these journals (reading them, not publishing in them: that takes a near miracle in social psychology).

      The good news for social psychology is that we rarely publish in those journals, and so, for our own journals, where we are the gatekeepers, we can put a stop to this lack of attention to sample and effect size.

  4. Excellent post, Mike! I think you did a nice job at summarizing a lot of the salient issues. And I think your suggestions about what to do (and what not to do) are both practical and reasonable.

  5. This is an excellent and important comment.
    I cannot resist one quibble. Squaring an effect size r of .21 to conclude that it explains 4% of the variance is technically correct, but misleading if the intent is to help to interpret its size. The calculation merely changes the terms of reference into squared units, exactly like squaring a standard deviation to get the variance. (Alternatively, we might take the square root of the variance to get the sd, in order to return to the original units of measurement.) An r of .21 means that 21% of the (unsquared) variation in the DV is accounted for by the IV. As long as you know this, it makes no difference whether you use r or r-squared because the conversion neither adds nor subtracts information – both numbers mean exactly the same thing. But it's misleading when the conversion to r-squared leads one to interpret effect sizes as "small" just because 4% doesn't sound like much.
    For more on r vs. r-squared see:
    http://mres.gmu.edu/readings/Julius/Ozer_correlation_and_coefficient_of_determination.pdf
    http://dionysus.psych.wisc.edu/lit/ToFile/4curtin/dandrade_EffectSize_stats_jqa.pdf

    David Funder

    1. Thanks for the comment, David! I guess I never thought of the r-squared conversion as misleading, but I can see your point: a 4% r-squared makes people think the effect is small when it's actually a medium-sized effect. r = .21 does a much better job of conveying the size of the effect, since we deal in correlations much more often.

  6. Great post! I am trying to follow some of your advice by riding my bike to work every day and eating healthy! Thanks!
