Bayesian methods and experimental design

At our symposium on Bayesian Methods in Cognitive Science at the meeting of the European Society for Cognitive Psychology, we talked about the advantages of using Bayes factors for inference. I talked about a logical hypothesis test about the role of attention in binding that has frequently resulted in non-significant interactions (e.g., Morey & Bieler, 2013), and Evie Vergauwe described the freedom that comes from realizing that you don’t need to let the possibility of a statistically significant effect restrict experimental designs. This struck me; I remember feeling it too. After I first learned to compute Bayes factors (BFs), and on this first attempt observed very large BFs favoring a null hypothesis, I thought exultantly of every experiment I had ever done that turned up non-significant results: now I could return to these data and learn something from them, maybe even publish them. Trying this confirmed for me that sometimes the design of an experiment was not sufficient to tell me much of anything at all, no matter what statistic I applied to the data.

An example: The second data set I analyzed using the BayesFactor package was part of a collaboration with Katherine Guérard and Sébastien Tremblay. We were testing hypotheses about maintaining bindings between visual and verbal features. We asked participants to remember a display of colored letters. At test, participants sometimes made judgments about a colored letter, indicating whether that particular combination had been shown or not. Other times, participants made judgments about an isolated feature. In separate blocks, these kinds of tests were either isolated or blended, so that we could test whether the expectation of needing to remember binding affected feature memory.

We wanted to compare two possibilities: (1) that encoding binding enhances performance, or (2) that only retrieving binding matters. If binding helps feature memory because multiple copies of the feature are stored in different modules (e.g., Baddeley et al., 2011), then feature memory should be better when feature tests are mixed with binding tests, creating the expectation that one must try to remember binding, than in blocks with only feature tests. However, if retrieving bindings matters, then feature memory should improve when tested with a two-feature probe rather than with a single-feature probe. If differences appear only at test, then one cannot argue that any advantage conveyed by binding occurs because of a filter on what features are encoded. We manipulated feature similarity (in this case, color similarity) so that we could potentially find an interaction between the cost of similarity and binding context, such that the expectation of having to remember bindings would mitigate the accuracy decreases expected in the high-similarity conditions.

In a preliminary experiment, this interaction between color similarity and binding context on color feature tests was non-significant. This was where I expected a large BF favoring the null to change our fortunes and help us decide what to believe. But the BF against this interaction was underwhelming, about 4: not enough to support a confident conclusion.
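We computed our BFs with the BayesFactor package in R. As a rough illustration of what such a number quantifies, here is a minimal Python sketch using the BIC approximation to the Bayes factor (Wagenmakers, 2007), comparing a model with the interaction to one without it; the BIC values are made up for illustration and are not from our study:

```python
import math

def bf01_from_bic(bic_full: float, bic_null: float) -> float:
    """Approximate the Bayes factor favoring the null (no-interaction)
    model over the full (interaction) model from the BICs of the two
    fitted models: BF01 ~= exp((BIC_full - BIC_null) / 2)."""
    return math.exp((bic_full - bic_null) / 2)

# Hypothetical BIC values, chosen only to illustrate the scale:
print(round(bf01_from_bic(bic_full=1052.8, bic_null=1050.0), 2))  # ~4
print(round(bf01_from_bic(bic_full=1054.6, bic_null=1050.0), 2))  # ~10
```

On this approximation, BIC differences of roughly 2.8 and 4.6 points correspond to BFs of about 4 and 10, the values discussed here; equal BICs give a BF of exactly 1, favoring neither model.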

Rather than collecting more participants on the same design (which we could legitimately have done, now that we had converted to Bayesian analysis), we decided we would have a better chance of strengthening our findings if we increased the amount of data per participant and tightened our experimental design. We tested two memory set sizes (3 or 5 items) rather than three (3, 4, or 5 items). We asked each participant to undertake two experimental sessions, so that we could run the binding-context blocks in two different orders within each participant on different days (order did not matter much in our preliminary studies, but it was an additional source of noise). We also used a single similar-color set rather than two comparable single color sets. These changes both increased the amount of data acquired per participant and removed sources of noise in our original design arising from variables that were incidental to testing the hypothesis. With the new data set, the crucial BF favoring the null was about 10, much more convincing.

Bayes factors are a really useful research tool, but of course they complement (not replace!) good experimental design. Everything you already know about designing a strong experiment still matters, perhaps more than ever, because simply surpassing some criterion is no longer your goal. If you are using Bayes factors, you want values as far from 1 as possible. You must assume that if you are making an argument that is at all controversial, the number that convinces you is likely to be smaller than the number that would convince someone with an antagonistic outlook. This could change the time-to-publish dynamic: rather than rushing out a sloppy experiment that luckily produced a low p-value, it may become strategically wiser to follow up the preliminary experiment with a better one, which should yield a more convincing BF. This way of thinking also undermines the circular notion that if you observed p-values less than the criterion, then your experiment must have been decent. Whichever hypothesis is supported, designing a well-considered experiment with minimal sources of noise and a strong manipulation is always an advantage. As a researcher I find this comforting: something I was trained to do, and that I know how to do well, matters in the current skeptical climate.


Cases of auditory short-term memory loss?

One reason why so many psychologists believe in separate, specialized auditory and visual short-term memory systems is that occasionally a patient presents with deficits that seem consistent with the selective loss of one or the other of these functions. I recently looked into this in depth because I wanted to write a review of these cases. I began by imagining that I would contrast the strength of evidence for selective auditory short-term memory deficits with the far murkier instances of supposed visual short-term memory deficits. After carefully analyzing the auditory short-term memory cases, though, I abandoned this plan because I was no longer sure that these patients really had a deficit in auditory storage at all.

Why not? Others have pointed out that functions apart from “storage” could contribute to the presented deficits, and that a damaged “phonological” store alone cannot explain the patterns of deficit and preservation that the patients show (see Caplan, Waters, & Howard, 2012). I plotted some patterns from several documented patients said to have selective damage to auditory short-term memory, which highlight two clues that point away from a short-term memory deficit.

First, it struck me that one of the tasks often used as a control to check that auditory perception was intact is exactly analogous to a whole-display change recognition task in visual memory (for a short explanation of this measure, see Rouder, R. Morey, C. Morey, & Cowan, 2011). Some patients were given an aurally presented list, then immediately a second list that was either identical to the first or differed by one item. The patient had to indicate whether the second list was the same as or different from the first. Warrington and Shallice (1969) described this as an “auditory perception” task, arguing that KF’s good performance on it indicated intact aural perception. But this is arguably a recognition memory task, and arguably evidence that KF could maintain a reasonable amount of aurally presented information. I combined comparable measures from several patient reports to show that across many patients presenting with similar symptoms (taken from Tzortzis & Albert, 1974; Warrington & Shallice, 1969), there is no clear evidence for an aural memory deficit when memory is tested by recognition rather than recall.

Comparison of spoken immediate serial recall of acoustically presented lists and recognition matching of two strings of the same length. Data from CS1, CS2, and CS3 come from Tzortzis and Albert (1974); data from K.F. were reported by Warrington and Shallice (1969). Data are collapsed across stimulus type (e.g., digits, letters, words) for each patient. Error bars are standard errors of the mean.

Second, I noticed that patients were quite likely to perform better on lists of aurally presented digits than letters. This clue likewise suggests that something besides storage per se is affected. Digits come from a more restricted set (10 items) than letters (assuming all consonants were possible, 21 items), making a correct guess more likely for digits than for letters. In English, there is also more phonological similarity among uncontrolled sets of consonants than among digits, so these differences in recall may arise from uncontrolled phonological similarity effects. Either way, this suggests a problem at retrieval, not with storage exactly; furthermore, similar effects would be expected in healthy individuals (e.g., Jones & Macken, 2015). The plot below shows spoken recall with digits and letters, for auditory and visual stimulus presentation (patient data from Basso et al., 1982; Saffran & Marin, 1975; Tzortzis & Albert, 1974; Warrington & Shallice, 1969).

Repetition task with auditory and visual presentation and various list lengths, with patients described by Tzortzis and Albert (1974), Saffran and Marin (1975; note that this patient was not tested with visual presentation), Warrington and Shallice (1969; 1972), and Basso et al. (1982), respectively. For each patient, all of the available data for recall of digits and letters at the standard 1 item per second presentation rate were combined for analysis. Error bars are standard errors of the mean.
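The set-size point above can be made concrete with a quick back-of-the-envelope calculation. This sketch assumes, purely for illustration, that a forgotten item is guessed uniformly at random from the known stimulus set (real guessing is rarely uniform):

```python
digits = 10      # digits 0-9
consonants = 21  # consonants of the English alphabet

# Probability of correctly guessing one forgotten item by chance,
# under the uniform-guessing assumption above:
p_digit = 1 / digits          # 0.100
p_consonant = 1 / consonants  # ~0.048

print(f"digits: {p_digit:.3f}, consonants: {p_consonant:.3f}")
print(f"guessing advantage for digits: {p_digit / p_consonant:.1f}x")  # ~2.1x
```

Even on this crude account, a patient filling memory gaps by guessing would score noticeably better on digit lists than on letter lists, with no difference in storage at all.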

Reconsidering this evidence led me to doubt a claim that is practically canonical in cognitive psychology: that selective deficits seen in rare patient cases prove that auditory and visual short-term stores must be distinct. I now think there is likely another explanation for the patterns of performance observed in these individuals, and that they should not be taken as definitive evidence for separate aural and visual short-term memory modules.
