Bayesian methods and experimental design

At our symposium on Bayesian Methods in Cognitive Science at the meeting of the European Society for Cognitive Psychology, we talked about the advantages of using Bayes factors for inference. I talked about a logical hypothesis test about the role of attention in binding that has frequently resulted in non-significant interactions (e.g., Morey & Bieler, 2013), and Evie Vergauwe described the freedom that comes from realizing that you don’t need to let the possibility of a statistically significant effect restrict experimental designs. This struck me; I remember feeling this too. After I first learned to compute Bayes factors (BFs), and on my first attempt observed very large BFs favoring a null hypothesis, I thought exultantly of every experiment I had ever run that turned up non-significant results. Now I could return to these data and learn something from them, maybe even publish them. Trying this showed me that sometimes the design of an experiment was not sufficient to tell me much of anything at all, no matter what statistic I applied to the data.

An example: The second data set I analyzed using the BayesFactor package was part of a collaboration with Katherine Guérard and Sébastien Tremblay. We were testing hypotheses about maintaining bindings between visual and verbal features. We asked participants to remember a display of colored letters. At test, participants sometimes made judgments about a colored letter, indicating whether that particular combination had been shown or not. Other times participants made judgments about an isolated feature. In separate blocks, these kinds of tests were either isolated or blended, so that we could test whether the expectation that binding would need to be remembered affected feature memory.

We wanted to compare the possibilities that 1) encoding binding enhances performance, or 2) only retrieving binding matters. If binding helps feature memory because multiple copies of the feature are stored in different modules (e.g., Baddeley et al., 2011), then feature memory should be better when feature tests are mixed with binding tests, creating the expectation that one must try to remember binding, than in blocks with only feature tests. However, if retrieving bindings matters, then feature memory should improve when tested with a two-feature probe, rather than with a single-feature probe. If differences appear only at test, then one can’t argue that any advantage conveyed by binding occurs because of a filter on what features are encoded. We manipulated feature similarity (in this case, color similarity) so that we could potentially find an interaction between the cost of similarity and binding context, such that the expectation to remember bindings mitigated the decreases in accuracy expected in the high-similarity conditions.

In a preliminary experiment, this interaction between color similarity and binding context on color feature tests was non-significant. This was where I expected a large BF favoring the null to change our fortunes and help us decide what to believe. Instead, the BF against this interaction was underwhelming: about 4, not enough to support a confident conclusion.

Rather than collecting more subjects on the same design (which we could legitimately do, now that we had converted to Bayesian analysis), we decided we would have a better chance of strengthening our findings if we increased the amount of data per participant and tightened our experimental design. We tested two memory set sizes (3 or 5 items) rather than three (3, 4, or 5 items). We asked each participant to undertake two experimental sessions, so that we could run the binding context blocks in two different orders within each participant on different days (order did not matter much in our preliminary studies, but it was an additional source of noise). We also used a single similar-color set rather than two comparable single color sets. These changes both increased the amount of data acquired per participant and decreased potential sources of noise in our original design arising from variables that were incidental to testing the hypothesis. With the new data set, the crucial BF favoring the null was 10, which is much more convincing.
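One way to see the difference between a BF of 4 and a BF of 10 is to convert each into a posterior probability for the null. The sketch below is only an illustration in Python (it does not use the BayesFactor package, which is written for R). It relies on the standard identity that posterior odds equal the Bayes factor times the prior odds. The function name `posterior_probability` and the default prior odds of 1 (both hypotheses judged equally likely beforehand) are assumptions made for the example, not part of the original analysis.

```python
def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of the hypothesis the Bayes factor favors.

    prior_odds is an assumed value: 1.0 means both hypotheses were
    considered equally likely before seeing the data.
    """
    # Posterior odds = Bayes factor * prior odds
    posterior_odds = bayes_factor * prior_odds
    # Convert odds to a probability
    return posterior_odds / (1.0 + posterior_odds)


# The two Bayes factors reported above: preliminary and follow-up experiment
for bf in (4, 10):
    print(f"BF = {bf}: posterior probability = {posterior_probability(bf):.3f}")
# prints:
# BF = 4: posterior probability = 0.800
# BF = 10: posterior probability = 0.909
```

Under the equal-prior-odds assumption, the preliminary BF of 4 leaves the null at about 80%, while the follow-up BF of 10 raises it to about 91%. Passing a different `prior_odds` value shows how the same BF lands for a reader who started from a different position.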

Bayes factors are a really useful research tool, but of course they complement (not replace!) good experimental design. Everything you already know about designing a strong experiment still matters, perhaps more than ever, because simply surpassing some criterion is no longer your goal. If you are using Bayes factors, you want values as far from 1 as possible. You must assume that if you are making an argument that is controversial at all, the number that convinces you is likely to be smaller than the number that might convince anyone with an antagonistic outlook. This could change the time-to-publish dynamic: rather than rush out a sloppy experiment that luckily produced a low p-value, it may become strategically wiser to follow up the preliminary experiment with a better one, which should yield a more convincing BF. This way of thinking also undercuts the circular notion that if you observed p-values less than the criterion, then your experiment must have been decent. Whichever hypothesis is supported, designing a well-considered experiment with minimal sources of noise and a strong manipulation is always an advantage. As a researcher I find this comforting: something I was trained to do, and that I know how to do well, matters in the current skeptical climate.