Sample Size Matters
Among the many suggestions for building a better psychological science, perhaps the simplest and most parsimonious way to improve research methods is to increase sample sizes across all study designs: with a larger sample, researchers can detect smaller real effects and can more accurately measure large ones. There are many trade-offs in choosing appropriate research methods, but sample size, at least for a researcher like me who deals in relatively inexpensive data collection tools, is in many ways the most cost-effective way to improve one's science. In essence, I can continue to design the studies I have been designing and ask the same research questions I have been asking (i.e., business as usual), with the one exception that each study I run has a larger N than it would have if I were not thinking (more) intelligently about statistical power.
How has my lab been faring with respect to this goal of collecting large samples? See for yourself:
The data I report here come from all 67 studies I have published, in press, or under review at psychology journals. I collected sample size data for each of these studies, and in the figure below I organized mean sample sizes by publication/submission year. The correlation between year and sample size is r = .40. What you see in the figure are fairly consistent mean sample sizes that hover between 120 and 155 from 2009 through 2012, the years when I started publishing in empirical journals. My sample size averages grew to n = 223 in 2013 and n = 375 in 2014.
What happened in 2013 that led to this sizable boost in sample size in my laboratory? For my family, 2013 was a year of huge changes (e.g., new baby, moved to a new state, started a new job). In my lab, we also made the decision to "go big or go home" when it comes to sample size in lab or online studies. We made this decision for several reasons:
(1) The University of Illinois subject pool is large and allows for the possibility of collecting large laboratory samples, as long as you're willing to wait a few extra weeks.
(2) I am a professor now and have what is known in the biz as a start-up ($$$$) that allows me to pay for larger samples online with relative ease.
(3) The University of Illinois as a cultural group favors high-quality research methods--and stresses the importance of larger sample sizes. When the people who decide your promotion and tenure care about large N, you care too.
(4) Social-personality psychology has continued to focus on research methods and research integrity, in part, because our guild has been shaken by several high-profile instances of research fraud.
(5) I started writing (currently unfunded) research grants, and this forced me to think more deeply about power analyses and effect sizes.
(6) I'm on Twitter and engage in regular (usually respectful) arguments with people about research methods. These arguments have forced me to consider the issues more carefully than I otherwise would have.
(7) High-profile journals like Psychological Science and PSPB have changed their submission guidelines to improve research methods and, in particular, now ask researchers to describe their decision-making about sample sizes.
Together, these factors have provided the social context necessary to make collecting larger samples both a logical choice (large samples offer a better chance of obtaining findings that are real and replicable) and a self-benefiting one (my career outcomes are decided, in part, by improving my methods).
Another way to look at these methodological improvements is to examine the true effect size I can detect, with 80% power, given the average sample size I collected in each year. I present effect sizes in terms of correlations because they are easy to interpret. The red bar shows the average effect size across 100 years of social-personality psychology, r = .21 (Richard et al., 2003). Note that the average sample size of the studies I published in 2010 was not large enough to detect even the average effect in social psychology (that's bad planning and a complete disregard of effect size and power). Compare that with 2013 and 2014, when my studies could detect smaller correlations of .17 and .13, respectively.*
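For the curious, here is a rough sketch of this calculation in Python. I don't spell out the exact test settings above, so treat the one-tailed test at alpha = .05 and the Fisher z approximation below as one set of assumptions; they happen to reproduce the .17 and .13 figures.

```python
# Smallest correlation detectable at a given n and power, via the
# Fisher z approximation. One-tailed alpha = .05 is an assumed setting.
import math
from scipy.stats import norm

def smallest_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest population correlation detectable with n participants."""
    z_crit = norm.ppf(1 - alpha)  # one-tailed critical value
    z_pow = norm.ppf(power)       # z-score corresponding to desired power
    return math.tanh((z_crit + z_pow) / math.sqrt(n - 3))

for year, n in [(2013, 223), (2014, 375)]:
    print(year, round(smallest_detectable_r(n), 2))
# 2013 0.17
# 2014 0.13
```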
In 2014, I feel more confident about my research methods and the conclusions that can be drawn from the results reported in my studies. This confidence stems almost entirely from the methodological improvements I have made in my own research (at minimal cost to myself and my productivity) simply by collecting larger samples.
Social-personality psychologists help people understand their social lives in fundamental ways. As social-personality psychology looks to the future, I hope we can build a better, more replicable science. We're already doing it in my lab by increasing N!
* In thinking about how large an N is enough, there isn't a one-size-fits-all answer, unfortunately. First, designs being equal, a larger N is always better when the goal is estimating an effect with precision. Second, a more complicated design (e.g., 4 vs. 2 experimental conditions) requires a larger N, and degrees of non-independence change power analyses considerably.
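To make the design point concrete, here is a rough sketch comparing the N required for a 2-condition and a 4-condition design at 80% power. The effect sizes (Cohen's d = 0.5 and f = 0.25, the conventional "medium" values) are purely illustrative choices, not numbers from my studies.

```python
# How design complexity changes required N, at 80% power and alpha = .05.
# The "medium" effect sizes below are illustrative assumptions.
import math
from statsmodels.stats.power import TTestIndPower, FTestAnovaPower

# 2 conditions: independent-samples t-test (d = 0.5); returns n per group
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"2 conditions: {math.ceil(n_per_group)} per group, "
      f"{2 * math.ceil(n_per_group)} total")

# 4 conditions: one-way ANOVA (f = 0.25); solve_power returns the TOTAL n
n_total = FTestAnovaPower().solve_power(effect_size=0.25, k_groups=4,
                                        alpha=0.05, power=0.80)
print(f"4 conditions: {math.ceil(n_total)} total")
```

The effects are comparable in magnitude (d = 0.5 corresponds to f = 0.25 in the two-group case), but spreading the comparison over four conditions raises the total N required.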
It is good to see that you are thinking about statistical power and making (successful) efforts to increase sample sizes in response. (However, it is disappointing that you haven't gotten responses to this post saying things like, "I'm doing the same thing," or "Give me some advice on how I can do this, too.")
As you comment, "In thinking about how large an N is enough, there isn't a one-size-fits-all answer unfortunately." But I hope the following suggestions might help you get "more bang for your buck," at least on average, in deciding on sample size. In particular, I recommend reading Muller, K. E., and Benignus, V. A. (1992), "Increasing Scientific Power with Statistical Power," Neurotoxicology and Teratology, 14, 211–219.
One thing they point out (p. 217, "How much power is enough?") is that considering the "power curve" (power plotted versus raw effect size) can help in making wise decisions. For example, for a two-sample t-test, choosing sample size to give power .84 gives a kind of "sweet spot" for the tradeoff between power and sample size. In other words, relying on 80% power is an example of one-size-does-not-fit-all.
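To see that sweet spot concretely, here is a rough sketch tracing the power-versus-sample-size tradeoff for a two-sample t-test (the effect size d = 0.4 is a made-up illustrative value):

```python
# Sample size needed per group at various power targets, two-sample t-test.
# Cohen's d = 0.4 is an assumed, illustrative effect size.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for target in (0.50, 0.70, 0.80, 0.84, 0.90, 0.95, 0.99):
    n = analysis.solve_power(effect_size=0.4, alpha=0.05, power=target)
    print(f"power {target:.2f}: about {n:.0f} per group")
```

The per-group n grows slowly through the low .80s and then steeply toward .95 and beyond, which is what makes the low .80s a reasonable stopping point.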
Other things to take into account (which possibly you are aware of, but possibly aren't):
• When you plan more than one hypothesis test, it's important to choose sample size taking the "family-wise type I error rate" into account. This means that if you wish to have an overall type I error rate of .05, you will need to calculate power based on lower significance levels for the individual hypothesis tests.
• Using Cohen's standardized methods (standardized effect sizes, and small/medium/large effects) is crude. Most statistical software now has better methods available.
• These better methods require you to think about what "raw" effect size you wish to be able to detect. This has the advantage of making you think about practical significance as well as statistical significance.
• The better methods also require an estimate of standard deviation, which has the advantage of prompting the researcher to consider previous studies or perform a pilot study. (The sketch after this list ties these last few points together.)
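Putting the last three bullets together, a minimal sketch (every number here, the 3-point raw difference, the SD of 10 from a hypothetical pilot, and the 4 planned tests, is made up for illustration):

```python
# From a raw effect and an SD estimate to a Bonferroni-adjusted power
# analysis. All numbers below are made-up illustrative values.
from statsmodels.stats.power import TTestIndPower

raw_effect = 3.0     # smallest mean difference of practical interest
sd_estimate = 10.0   # e.g., from a pilot study or previous studies
m_tests = 4          # number of planned hypothesis tests

d = raw_effect / sd_estimate   # standardized effect size (here 0.3)
alpha_each = 0.05 / m_tests    # Bonferroni-adjusted per-test alpha

n = TTestIndPower().solve_power(effect_size=d, alpha=alpha_each, power=0.80)
print(f"n per group: {n:.0f}")  # noticeably larger than at alpha = .05
```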
I've got some discussion of some of the above, in the context of a couple of the replications in the special issue of Social Psychology, in the July 1, 3, and 6 posts at http://www.ma.utexas.edu/blogs/mks/

-Martha Smith